Expanding Beyond Your Hadoop Ecosystem
Not so long ago, the predictive benefits of data science were an under-the-radar secret of smart businesses that recognized the value of interpreting and projecting data. Today, although the secret is out, the true value of data science remains largely untapped, because many companies investing in data science go only halfway. Given the vastness and complexity of big data, the methods used to take advantage of it must be proportionately vast and complex. Unfortunately, the list of data science tools companies invest in almost always begins and ends with Hadoop.
Although Apache Hadoop is a critical part of any big data strategy, it is not sufficient on its own to empower a data science group. Organizations can better enable their data science teams by building up a robust software ecosystem around their Hadoop platform.
By thinking more broadly about the tools in their big data ecosystem, enterprises can not only make their data scientists more effective but also increase the overall value of their Hadoop platforms.
Baseline Setup for Hadoop
The best way to begin exploring the other tools needed in the Hadoop ecosystem is by considering the traditional data science workflow. The first step in any analytics workflow is data ingestion. While the primary sources of data should already exist in the analytics platform, the data scientist must be empowered to augment that data with other interesting sets of information. A mechanism must be used to give data scientists the ability to ingest these sources of data in a self-service fashion. Ideally this should be a graphical tool for connecting different data sources together, including a simple push-button way to upload data from a local source.
When possible, data transfer between systems should be optimized for performance. For example, Apache Sqoop should be used for bulk transfers between Hadoop and relational databases. Without such robust integration, transferring large amounts of data between systems could take a prohibitively long time. A flexible and consistent way to ingest new sources of data into the analytics platform will lead to faster analytics development and improved results.
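As a rough illustration of the kind of optimized transfer described above, the sketch below assembles a `sqoop import` command that copies one relational table into HDFS using several parallel map tasks. The JDBC URL, user name, table, and paths are hypothetical placeholders, and the helper function `build_sqoop_import` is an illustrative name, not part of Sqoop itself.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username,
                       password_file, num_mappers=4):
    """Return the argument list for a `sqoop import` invocation.

    Only assembles the command; running it requires a cluster with
    Sqoop installed and a reachable source database.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # destination directory in HDFS
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# All values below are hypothetical placeholders.
cmd = build_sqoop_import(
    jdbc_url="jdbc:postgresql://db.example.com/sales",
    table="orders",
    target_dir="/data/raw/orders",
    username="analyst",
    password_file="/user/analyst/.db_password",
    num_mappers=8,
)
print(shlex.join(cmd))
```

A self-service ingestion tool could generate commands like this behind a graphical front end, so that data scientists never have to write them by hand.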
Once all the data needed for a project has been ingested into the system, it is important to make sure that data is properly isolated. To this end, an analytics sandbox should be created for every project underway. The mechanism for creating the sandbox will vary depending on the data platform. For a relational database the sandbox might be a separate schema, while for a Hadoop system the sandbox might simply be a protected directory in HDFS. For large or complex projects the sandbox might even be a completely separate environment. Moving data between the central data lake and such an isolated environment should be a simple process.
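For the HDFS case, a protected directory can be set up with a handful of standard `hdfs dfs` commands. The sketch below only builds those commands. The `/sandboxes` root, the project name, and the owner and group names are assumptions made for illustration, and `sandbox_commands` is a hypothetical helper.

```python
import posixpath


def sandbox_commands(project, owner, group, root="/sandboxes"):
    """Build the HDFS shell commands that create a protected project directory.

    The commands are returned rather than executed, so they can be
    reviewed or handed to a provisioning tool.
    """
    if not project or "/" in project:
        raise ValueError("project must be a single path component")
    path = posixpath.join(root, project)
    return [
        ["hdfs", "dfs", "-mkdir", "-p", path],                # create the directory
        ["hdfs", "dfs", "-chown", f"{owner}:{group}", path],  # assign it to the project team
        ["hdfs", "dfs", "-chmod", "770", path],               # no access for other users
    ]


# Hypothetical project, owner, and group names.
for command in sandbox_commands("churn-model", "alice", "datasci"):
    print(" ".join(command))
```

Mode `770` gives the owner and the project group full access and shuts everyone else out, which is one simple reading of "protected"; a real deployment might rely on HDFS ACLs or a separate authorization layer instead.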
After a project is finished, the results should be published to another area of the system where a wider audience can take advantage of them. Anyone in the organization should have a way to discover the results of past projects so they can reuse them or build on them without repeating work. Providing isolated analytic sandboxes is a simple way to encourage collaboration among teams.
Enabling Exploration for Collaboration
Data exploration isn’t just an important part of a data scientist’s workflow; it is a crucial tool for any member of a business team. Nearly everyone in the organization can benefit from the ability to easily find information about data assets.
Finding data spread throughout disparate data systems and across isolated sandboxes can be challenging. Data scientists need a way to search and explore all the data in the organization simultaneously so they can discover all the sources that might be important to their current or future projects. Systems that provide this capability are known as either central data catalogs or global data dictionaries. Either way, the system should provide a single interface for viewing metadata associated with all the data stored in the enterprise.
This system should provide basic information about the data, such as where it is stored and how much data there is, as well as information about the structure of the data and the columns and data types present. Project collaborators should be able to add their own notes or metadata to each data source, thereby creating an archive of institutional knowledge that is accumulated over time. This will speed up the exploration phase of future projects since data scientists won’t have to hunt for domain experts within the company to explain the data to them.
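To make the idea concrete, here is a minimal in-memory sketch of such a catalog. It records where each data source lives, how large it is, and which columns and types it contains. It also lets collaborators attach notes and supports a simple keyword search. The class names, fields, and sample entries are all hypothetical; a real catalog would persist its metadata and harvest it from the underlying systems.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    location: str        # where the data is stored
    size_bytes: int      # how much data there is
    columns: dict        # column name -> data type
    notes: list = field(default_factory=list)  # (author, text) pairs

    def add_note(self, author, text):
        """Attach a collaborator's note to this data source."""
        self.notes.append((author, text))


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        """Return entries whose name, column names, or notes mention the term."""
        term = term.lower()
        return [
            entry for entry in self._entries.values()
            if term in entry.name.lower()
            or any(term in column.lower() for column in entry.columns)
            or any(term in text.lower() for _, text in entry.notes)
        ]


# Hypothetical data sources.
catalog = DataCatalog()
orders = CatalogEntry("orders", "hdfs:///data/raw/orders", 5_000_000_000,
                      {"order_id": "bigint", "customer_id": "bigint", "total": "double"})
clicks = CatalogEntry("web_clicks", "hdfs:///data/raw/clicks", 80_000_000_000,
                      {"session_id": "string", "url": "string", "ts": "timestamp"})
catalog.register(orders)
catalog.register(clicks)
clicks.add_note("bob", "session_id can be joined to customer accounts after login")

print([entry.name for entry in catalog.search("customer")])
```

Searching for "customer" finds the orders table through its `customer_id` column and the click data through the note alone, which is the institutional knowledge the notes are meant to capture.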
Real-Time Concerns
Handling real-time data is one area that Hadoop typically has problems addressing. This is a serious issue because analyzing data in real time as it is generated is critical for many companies. It is often necessary to score records using a pre-built model as soon as they are created so appropriate action can be taken. Even a scoring delay of just a few minutes can be too long in many cases.
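The sketch below shows what scoring records on arrival might look like in its simplest form: a pre-built logistic model is applied to each record as it is read, and a flag is produced immediately. The coefficients, field names, and threshold are invented for illustration, and the plain Python list stands in for a live feed from a streaming system.

```python
import math

# Hypothetical coefficients from a model trained offline.
WEIGHTS = {"amount": 0.002, "foreign": 1.5}
BIAS = -3.0
THRESHOLD = 0.5


def score(record):
    """Apply the pre-built logistic model to a single record."""
    z = BIAS + sum(weight * record.get(name, 0.0)
                   for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(records):
    """Score each record as it arrives and flag those above the threshold."""
    for record in records:
        probability = score(record)
        yield record["id"], probability, probability >= THRESHOLD


# A small list stands in for the incoming stream of events.
incoming = [
    {"id": "t1", "amount": 100.0, "foreign": 0.0},
    {"id": "t2", "amount": 2000.0, "foreign": 1.0},
]
for record_id, probability, flagged in score_stream(incoming):
    print(record_id, round(probability, 3), flagged)
```

Because `score_stream` is a generator, each result is available as soon as its record is read, rather than after the whole batch has been processed.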
Hadoop is intrinsically a batch processing system and its strength is in running complex processes on large sets of data. For this reason a complementary real-time processing system should be considered. Many of these systems have been developed and are available on the market today.
It is easy to overlook or delay real-time capability when building out a big data environment, but implementing this piece of functionality early will give data scientists additional flexibility and will ensure future use cases can be developed without major modifications to the infrastructure.
Data science itself is the application of mathematics and statistical models to solve business problems, yet none of the ecosystem tools described so far support that directly. Systems for data ingestion, self-service provisioning, metadata exploration, and real-time streaming are all critical to the data science process, but none of them assist the data scientist with actual model development.
Delivering End Results
Perhaps the most important part of the data science process is delivering actionable results to the business users who can take advantage of them. Unfortunately this is another chronic and often overlooked problem for most businesses. The end product from a data science project should be more robust than a simple spreadsheet. If a project is investigative in nature, the data scientists and business users on the team should collaborate on delivering a defensible business decision. More often than not, however, the result of a data science project is a tabular set of data.
The most expressive way to represent data is almost always through a visualization of some kind. Many enterprises have a standard business intelligence tool that can be leveraged for this task. The BI application should have direct access to the analytic platform and the data scientist should be able to use it for creating visualizations. Most BI tools have ODBC and JDBC connectivity that makes them easy to connect to any relational data source or even Hadoop systems running Hive or HBase. If the final result is a file in HDFS, then specialized Hadoop visualization systems can be used instead. These have become more prominent and powerful over the last few years. When a pre-built BI tool doesn’t meet the project requirements, a framework like D3 can be used to build specialized visualizations for almost any purpose. Almost all successful data science projects end with a result that allows users to directly derive business value.
It may seem counterintuitive to spend so much time building this kind of infrastructure around Hadoop, but it is estimated that up to 80% of the average data scientist’s time is spent on activities other than modeling. Any tools that help reduce that time will make data scientists more effective and will have a direct impact on the time to value of new Hadoop environments.
Even worse than time being misspent is the fact that many data science projects today end without actionable results. Investing in a robust Hadoop ecosystem will increase the percentage of projects that end in success.
About the author: Dillon Woods is the Field Chief Technology Officer at Alpine Data Labs, a developer of collaborative, scalable, and visual solutions for Advanced Analytics on Big Data and Hadoop.