November 10, 2014

Businesses Are Going About Data Science Wrong–Here’s How To Get It Right

Dillon Woods

Data scientists make the bold promise of granting businesses the power to predict the future. The adoption of this relatively new job title has gained traction, as savvy companies seek to move beyond the rearview mirror of traditional business intelligence practices, and strive to implement customer segmentation, attribution models, churn analysis, and other next generation analytics.

The current data scientist shortage may lead you to believe that the most difficult step is finding personnel to help your company achieve its goals. However, bringing data scientists on board while neglecting to provide them with the tools they need is a much bigger problem, not to mention a waste of time and money. When the potential of data science goes untapped, so goes a company’s investment in what could be the key to a more successful future.

With strong math backgrounds, knowledge of complex data systems and understanding of business problems, it’s often assumed that data scientists can create business insights using any tools at hand. Many data scientists today are forced to make do with nothing more than Excel spreadsheets and simple programming software installed on their workstations. They beg or borrow data from database administrators, explore or transform it using rules provided by business analysts, and pass the results on to business units via email.

Given the ad hoc nature of analytic workflows created this way, often only one person on the team understands each model well enough to run the entire process. If an analytic model needs to be run regularly, that person becomes the weak link in the delivery chain, and results arrive late to the business units seeking to act on the insights. The results of each run can also be inconsistent, as the rules involved are ill-defined and applied manually each time. A data science team can easily become overloaded by the responsibility of generating regular results, leaving little time to develop new analytics. The results of such a process can turn out to be useless, as they are often delivered in the form of a spreadsheet, or simply pasted into the body of an email, bypassing all standard business intelligence systems.

Some teams respond to the pitfalls of this manual process by asking data scientists to leverage a data warehouse. Interacting directly with the data tables should theoretically alleviate the manual data pull process, but in practice it leads to data duplication and confusion. Warehouse data models are often too restrictive for analytics work, so they end up being changed or augmented with additional sources and tables.

For database administrators, it’s difficult to determine when these copied and augmented tables should be rolled off, and tracking which business project each table belongs to is extremely time-consuming.

When organizations are ready to graduate to the next level of data science maturity, they often make a large investment in building a cluster on Hadoop, an open-source framework for large-scale storage and processing of data sets. Hadoop is a logical choice at this stage because it addresses some of the main pain points the organization is usually feeling. By standardizing the data distribution process, and providing a central pool for all data flowing into the organization, Hadoop creates a smoother, more comprehensive and effective analytics workflow.

First and foremost, the data within the Hadoop system can be used directly by an analytics team with guaranteed consistency, as all scientists on the team have a single source for finding the data they need. Hadoop also improves the performance and reliability of analytic workflows, with jobs running in parallel across a distributed system, which can then be scaled out to meet any performance demands. Many machine-learning algorithms are so computationally expensive that running them simply takes too long without leveraging a large compute cluster like Hadoop.

The Hadoop ecosystem also gives data scientists a suite of tools to work with, allowing teams to write all analytic processes in Java using the MapReduce programming framework. Pig can even provide a higher-level abstraction on top of MapReduce. Reusability of code is greatly increased when all data scientists are using the same programming languages. HBase can be used as a simple NoSQL key/value store, and Hive can bring SQL-like querying capability to data stored in HDFS. These tools provide much-needed flexibility when integrating with other systems or interfacing with other teams who may not be familiar with running MapReduce programs or fetching their final results from HDFS.
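For readers who have not worked with MapReduce, the following is a minimal sketch of the map, shuffle and reduce steps the framework is built around. It is written in plain Python rather than against Hadoop's Java API, it runs on a single machine, and the purchase records and function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit one (key, value) pair per input record."""
    for customer_id, amount in records:
        yield customer_id, amount


def shuffle(pairs):
    """Shuffle step: group all values by key, which a MapReduce
    framework does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical purchase records: (customer, amount spent).
purchases = [("alice", 30), ("bob", 12), ("alice", 5), ("bob", 8), ("carol", 20)]

totals = reduce_phase(shuffle(map_phase(purchases)))
print(totals)
```

On a real cluster the map and reduce steps would run in parallel across many machines. A tool like Hive lets a team express the same per-customer total as a SQL-like GROUP BY query instead of hand-written map and reduce functions.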

Scalable data storage, reliable computational capability, and a built-in suite of complementary tools are all compelling reasons to bring Hadoop into an enterprise. Yet, although Hadoop is a critical part of any big data strategy, it isn’t sufficient by itself. Organizations can better enable their data science teams by building up a robust software ecosystem around their Hadoop platform.

Handling real-time data is one area that Hadoop typically has problems addressing. This is a serious issue because analyzing data in real time as it is generated is a critical use case for many companies. For this reason, a complementary real-time processing system should be considered. Here are a few important criteria to keep in mind when evaluating one for a data science use case: it should be capable of scoring data using a pre-built statistical model; it should easily integrate with the other systems in the ecosystem so data streams through the real-time system and lands in the Hadoop data lake.
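To make those two criteria concrete, here is a minimal sketch of what scoring a stream with a pre-built model might look like. Everything in it is an assumption for illustration: the churn model, its coefficients, the event fields, and the in-memory list standing in for the Hadoop data lake are all hypothetical, and no particular streaming product is implied.

```python
import math

# Hypothetical logistic-regression coefficients, assumed to have been
# fitted offline (for example, in a batch job on the cluster).
MODEL = {
    "intercept": -1.0,
    "weights": {"days_inactive": 0.1, "support_tickets": 0.5},
}


def score(event, model=MODEL):
    """Apply the pre-built model to one event and return a probability."""
    z = model["intercept"] + sum(
        weight * event.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def process_stream(events, sink):
    """Score each event as it arrives, then land it in the sink.

    `sink` is a plain list here, standing in for a write to the data lake.
    """
    for event in events:
        scored = dict(event, churn_probability=score(event))
        sink.append(scored)
        yield scored


# Hypothetical incoming events.
events = [
    {"customer": "a", "days_inactive": 0, "support_tickets": 0},
    {"customer": "b", "days_inactive": 20, "support_tickets": 2},
]

landed = []
results = list(process_stream(events, landed))
for row in results:
    print(row["customer"], round(row["churn_probability"], 3))
```

The point of the design is that the model is built once, offline, and the real-time layer only applies it. Each scored event is available immediately and also ends up in the central store for later analysis.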

Armed with a well-rounded Hadoop toolbox, complete with suitable add-ons and applications tailored to a company’s needs, data science teams will be far better able to achieve the company’s analytics goals. And instead of wasting that investment, your company will be able to realize its full data science potential.

About the author: Dillon Woods is the Field Chief Technology Officer at Alpine Data Labs, a developer of collaborative, scalable, and visual solutions for Advanced Analytics on Big Data and Hadoop.