Four Ways Automation Can Rescue Your Data Lake
Ask Gartner Research and you’ll find that as of late 2017, 60% of big data projects failed to survive the pilot phase and only 17% of Hadoop deployments went on to the production phase. However, it’s often not due to a lack of desire or appreciation of big data’s value among executive leadership. It’s simply that most organizations aren’t aware of many of big data’s most common and formidable challenges.
There are two big challenges. The first is underestimating the technical complexity of developing a data lake and the expertise required to overcome it. The second is underestimating the ongoing effort required to maintain an often brittle operational environment, where every successive analytics project takes longer and costs more.
Let’s face it: big data is extremely complex, a fact that vendors and deploying organizations aren’t always willing to admit publicly. Even open source platforms aren’t enterprise ready right out of the box without significant effort by your team or a third party systems integrator.
In 2010, the First Wave of companies began using data lakes as a mechanism to store all their raw data in a cost-effective way. The problem was that while the data lake turned out to be a great way to store data cheaply, it also became a dumping ground and a terrible way to generate actual value, with data left languishing unknown, ungoverned and unusable. But that’s changing with new approaches that use the underlying compute power of the data lake itself to automate and simplify much of the development, ongoing management and governance of data engineering and DataOps processes. Data lakes are now starting to recapture their original luster, evolving to deliver the best combination of agility and governance.
But how can agility and self-service be achieved if you’re also enforcing rules to make the data lake a fully governed environment? Those objectives have historically been seen as mutually exclusive: if you wanted agility, you had to give up control, and vice versa. But it turns out that by using the compute power of the data lake to manage the data lake itself (applying statistics, heuristics and machine learning to automate development, performance tuning and ongoing management), data lakes can deliver both agility and governability. The key is to automate the data lake wherever possible and avoid hand coding. Fortunately, a new Second Wave of data lake technologies provides some of this automation.
If you liked the concept of the data lake but either built one and hit a scalability wall, or never tried because you didn’t have the in-house skills to make it work, now is a good time to revisit the concept. But before you do, it’s important to understand why building a successful data lake was so complex in the first place.
Most organizations are surprised to discover that Hadoop, Spark or their cloud provider doesn’t resolve the following four challenges for them when it comes time to deploy their data lake into an enterprise-class production environment. If you know to watch for these issues going into your data lake project, you can plan for them, secure in the knowledge that many automated solutions are available:
Challenge 1: Ingestion
Before you can even use a data lake, you need to get the data into the big data infrastructure. This is harder than most people think.
Transitioning from ingesting data for a proof of concept or development sandbox into real world production is where most data lakes fail. It’s one thing to get data into your environment once for a specific insight. It’s an entirely different matter to do this over and over again while meeting your SLAs. Large tables, for one, can take forever to ingest. In development sandboxes, data scientists don’t have to worry so much about how long it takes to get a large test data set into the lake. They just need the raw data loaded once so they can do ad hoc analytics. Very often it doesn’t need to be updated, so performance is not an issue.
One way data engineers minimize the time it takes to load large source data sets is by loading the entire data set once and then subsequently loading only the incremental changes to that source data. This is a process called change data capture (CDC). However, incremental changes need to then be merged with the base data on the data lake. Most big data stores do not support merge or update operations, so this is an exercise that is left for the user to resolve. And what about schema changes in the source? Just as data changes constantly, so do the schemas in the source systems. Most of these ingestion pipelines simply break when the source schemas change.
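To make the merge step concrete, here is a minimal Python sketch of applying incremental changes to a base snapshot. It is an illustration under stated assumptions, not how any particular product does it: the change-record layout (an `op` field holding `"upsert"` or `"delete"`, plus a `row`), the `id` key column and the function name `merge_changes` are all hypothetical. It also shows one tolerant way to treat a column that appears in the source after the initial load.

```python
from typing import Any

Row = dict[str, Any]


def merge_changes(base: list[Row], changes: list[Row], key: str = "id") -> list[Row]:
    """Apply change records, in order, to a base snapshot and return the new snapshot."""
    # Index the base data by primary key so each change is a direct lookup.
    merged: dict[Any, Row] = {row[key]: dict(row) for row in base}

    for change in changes:
        op = change["op"]    # assumed field: "upsert" or "delete"
        row = change["row"]  # assumed field: the changed row (key only, for deletes)
        if op == "delete":
            merged.pop(row[key], None)
        elif op == "upsert":
            # Update column by column, so a column newly added in the source
            # (a schema change) is kept instead of breaking the load.
            merged.setdefault(row[key], {}).update(row)
        else:
            raise ValueError(f"unknown change operation: {op!r}")

    # Give every row the same set of columns; older rows get None for new columns.
    columns = sorted({col for row in merged.values() for col in row})
    ordered = sorted(merged.values(), key=lambda row: row[key])
    return [{col: row.get(col) for col in columns} for row in ordered]


base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
changes = [
    {"op": "upsert", "row": {"id": 2, "name": "B", "region": "west"}},
    {"op": "delete", "row": {"id": 1}},
    {"op": "upsert", "row": {"id": 3, "name": "c"}},
]
snapshot = merge_changes(base, changes)
```

On a real data lake the same logic has to run over files that cannot be updated in place, which is why the merge is usually implemented as a rewrite of the affected data rather than a row-level update.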
Challenge 2: Data Prep
After you get your data loaded into the data lake, it still needs to be transformed in preparation for downstream use. Most companies create their data lakes with the philosophy that they will ingest raw data that has not been cleaned or modified in any way. Combining it (sales data, for instance) with other data (like weather) is left until after the data has landed in the data lake.
Transforming and preparing the data typically requires a developer who can not only write the code but also make it perform. Quickly designing analytics and machine learning data pipelines is only one aspect of data preparation. You also have to develop data pipelines that can ultimately be put into production at enterprise scale. Production pipelines have additional characteristics that you don’t have to worry about in a development environment. For instance, they must be built to be started, stopped or paused as part of a workflow, and they must scale with the size of the execution environment without requiring code changes.
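As a rough illustration of the "started, stopped or paused" requirement, the following Python sketch keeps a checkpoint between steps so a run can be paused and later resumed where it left off. The class name `ResumablePipeline` and the `should_pause` callback are hypothetical, and the sketch deliberately ignores scaling, which in practice comes from the execution engine rather than from the pipeline code.

```python
from typing import Callable

Step = Callable[[list], list]


class ResumablePipeline:
    """Runs steps in order and remembers which step comes next."""

    def __init__(self, steps: list[Step], data: list) -> None:
        self.steps = list(steps)
        self.data = data
        self.next_step = 0  # checkpoint: index of the next step to run

    @property
    def done(self) -> bool:
        return self.next_step >= len(self.steps)

    def run(self, should_pause: Callable[[], bool] = lambda: False) -> bool:
        """Run until finished or paused. Returns True when all steps have completed."""
        while not self.done:
            # Check for a pause request between steps, never in the middle of one.
            if should_pause():
                return False
            self.data = self.steps[self.next_step](self.data)
            self.next_step += 1
        return True


def drop_missing(rows: list) -> list:
    return [value for value in rows if value is not None]


def double(rows: list) -> list:
    return [value * 2 for value in rows]
```

A workflow scheduler would call `run()` again after a pause; because the checkpoint and the intermediate data live on the object, completed steps are not repeated.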
Challenge 3: Making Data Ready to be Queried
Once your data has been prepped, you can’t just dump it in a Hive table and hope that you can point a data visualization tool at it. It will work (kind of), but performance will be terrible. Market leading business intelligence and data visualization tools are amazing in their ability to present enterprise-wide data to non-technical business analysts.
The problem is these tools were never designed to handle the large data volumes now typically associated with big data. At the same time, data sources like Hive and NoSQL weren’t built to deliver sub-second response times for complex queries. The combination is even worse. Using open source technology is a great foundation, but it requires a lot of expertise to get it to work in a full blown production environment.
In order to overcome these issues, you have to either generate in-memory data models and OLAP cubes for reporting or move your data back into a warehouse after you have processed it in the data lake, which may defeat the value of the agility you were hoping to achieve. Creating these models and cubes requires time and expertise in order to get them to both scale for large data volumes and provide sub-second query performance for large numbers of users.
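The idea behind a cube is simply to pay the aggregation cost once, ahead of time, so that each query becomes a lookup. The Python sketch below is a toy version of that idea with hypothetical names (`build_cube`, `query`) and made-up sample rows. It precomputes a total for every combination of dimension values, which is exactly what becomes expensive at scale and why real cubes need careful design.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows: list[dict], dimensions: list[str], measure: str) -> dict:
    """Precompute the sum of `measure` for every combination of dimension values."""
    cube: dict = defaultdict(float)
    for row in rows:
        # Each row contributes to the grand total (no dimensions), to each
        # single-dimension total, and so on up to the full combination.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = frozenset((dim, row[dim]) for dim in dims)
                cube[key] += row[measure]
    return dict(cube)


def query(cube: dict, **filters: str) -> float:
    """Answer an aggregate query with a single dictionary lookup."""
    return cube.get(frozenset(filters.items()), 0.0)


# Illustrative sample data only.
sales = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "b", "sales": 5.0},
    {"region": "west", "product": "a", "sales": 7.0},
]
cube = build_cube(sales, ["region", "product"], "sales")
east_total = query(cube, region="east")
```

The number of precomputed entries grows quickly with the number of dimensions and distinct values, which is the time-and-expertise trade-off described above.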
Challenge 4: Operationalization and Cross Platform Portability
Once you’ve designed your environment, you need to be able to run your data pipelines on it—over and over again—sometimes on different platforms. Running ad hoc analytics (where data is loaded only once for a data scientist to run experiments to find the recipe they will use to gain a particular insight) is valuable and necessary. But once you have that recipe, you will want to run it on a regular basis—every day or every hour, etc.—and use the output to drive business decision making. This requires a level of operational reliability that goes well beyond simply getting it to work once. If you’re baking a cake for yourself, you can crack your own eggs manually. But baking thousands of cakes requires automation to crack all those eggs, over and over again, and error handling to deal with problems when they come up.
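In code, the difference between "works once" and "runs every hour" often starts with something as basic as retrying a failed step and surfacing the error when retries run out. The following Python sketch shows that pattern. The function name `run_with_retries` and its parameters are hypothetical, and a production scheduler would add logging, alerting and backoff policies on top.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(job: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run `job`, retrying on failure, and raise once all attempts are used up."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as error:  # a real job would catch narrower error types
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    # Keep the original error attached so operators can see the root cause.
    raise RuntimeError(f"job failed after {attempts} attempts") from last_error
```

A scheduled run would wrap each pipeline step in a call like `run_with_retries(load_orders, attempts=3, delay_seconds=30)`, where `load_orders` stands in for whatever step needs to be repeated.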
Another consideration: you may need to operationalize these workflows across multiple deployment environments. Many organizations are now running hybrid big data environments, often choosing multiple cloud providers in order to avoid vendor lock-in. This means you need to be able to run the same data pipelines on multiple platforms. Some data might be in Azure, some might be on-premises, some might be in another cloud like AWS/EMR or Google Cloud Platform. Data has to flow across all of these environments and your pipelines need to run in all these environments, many of which are based on different underlying technologies, making portability of any hand-coded effort a problem.
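One common way to limit the portability problem in hand-written code is to write pipeline logic against a small storage interface and supply a different implementation per platform. The Python sketch below illustrates the idea only. The `Storage` interface, the `InMemoryStorage` stand-in and the `dedupe_pipeline` function are hypothetical, and real cloud or on-premises backends would each need their own implementation using that platform's client library.

```python
from typing import Optional, Protocol


class Storage(Protocol):
    """The only storage operations the pipeline is allowed to rely on."""

    def read(self, path: str) -> list[str]: ...

    def write(self, path: str, lines: list[str]) -> None: ...


class InMemoryStorage:
    """Stand-in backend for illustration; a real one would wrap a platform client."""

    def __init__(self, files: Optional[dict[str, list[str]]] = None) -> None:
        self.files = dict(files or {})

    def read(self, path: str) -> list[str]:
        return list(self.files[path])

    def write(self, path: str, lines: list[str]) -> None:
        self.files[path] = list(lines)


def dedupe_pipeline(storage: Storage, source: str, target: str) -> int:
    """Remove duplicate lines, preserving order. Knows nothing about the platform."""
    unique = list(dict.fromkeys(storage.read(source)))
    storage.write(target, unique)
    return len(unique)


storage = InMemoryStorage({"raw/orders": ["a", "b", "a"]})
written = dedupe_pipeline(storage, "raw/orders", "clean/orders")
```

The pipeline function never changes when the backend does, but someone still has to write and maintain each backend, which is part of the hand-coding burden described above.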
The four challenges presented above are just some of the more glaring examples of the complexity of implementing a modern data architecture. That’s the bad news. The good news: after the First Wave of big data investments, there was a Second Wave of venture capitalists investing in companies that can fill in the complexity gap. Organizations looking to build a data lake can now expect to bring at least some level of automation to each of the areas discussed above.
About the author: Ramesh Menon is vice president of product at Infoworks, where he drives development of the company’s agile data engineering software. Ramesh has over 20 years of experience building enterprise analytics and data management products. He previously led the team at Yarcdata that built the world’s largest shared-memory appliance for real-time data discovery and one of the industry’s first Spark-optimized platforms. At Informatica, Ramesh was responsible for the go-to-market strategy for the company’s MDM and Identity Resolution products.