Filling Your Data Lake One Stream at a Time
“One Deceptively Simple Secret for Data Lake Success” was one of the best articles I have read regarding Hadoop and the whole Data Lake phenomenon in a long time, and it was here – on Datanami. I felt the article provided great business-level guidance on how to approach building a Data Lake, but that a sequel was required to answer the question, “How do you go about implementing this vision?”
Once you have followed the SAM test (Strategic, Actionable, and Material) and determined your data is worthy, how do you, in practice, go about collecting your data and adding it to your lake?
The high-level answer to this is summarized at the end of the article:
“…you end up building your data lake one business use case at a time. You’re not going to start with 50 data sources. You start with three or four. And as you add a new use case, you keep adding new data…”
However, getting to your data may not always be easy and adding raw data to the Lake may not always be the correct answer. What’s more, dealing with batches of data and moving them wholesale into the Lake is going to add latency and it won’t meet today’s requirements for instant insights.
So, what are the stages, from source to Lake, that are required to ensure rapid, high-quality insights? You need to start with what you want to achieve – what questions you want answered. From that, you can derive the sources you need and the processing required to prepare the data to answer your questions.
One Stream at a Time
The first obstacle to overcome is getting to your data. Where does it come from? How should it be collected? This obviously depends on the use case, but the more critical issue is that, with batch collection, the data is often already stale by the time it reaches the Lake.
The majority of data being pushed into Hadoop right now is in the form of log files. These can come from web servers, application servers, system health monitoring, network devices, and many more sources. However, they are most often just moved file by file, or in batches of files, into Hadoop. This adds latency, and limits your view of the enterprise.
What is really needed is to turn these log files into streams of log events that populate the Lake as they happen. This can be achieved through collectors that monitor the log files on many machines and stream out events as soon as they are written to the logs.
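The collector idea can be sketched in a few lines of Python: follow a log file and yield each new line as a discrete event the moment it is written, rather than shipping the whole file later in a batch. This is a minimal illustration, not a production agent – real collectors handle rotation, many files, and delivery to a stream; the `follow` function name and its parameters are my own.

```python
import time

def follow(path, from_start=False, poll_interval=0.2):
    """Tail a log file, yielding each new line as a discrete event.

    A real collector agent would run this against many files on many
    machines and ship each event onward as soon as it appears, instead
    of moving whole files into Hadoop in batches.
    """
    with open(path, "r") as f:
        if not from_start:
            f.seek(0, 2)  # start at end of file: stream only new events
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")   # one log line = one event
            else:
                time.sleep(poll_interval)  # wait for new writes
```

Because the collector emits events as they are written, the latency between "something happened" and "the Lake knows about it" shrinks from a batch window to fractions of a second.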
There are many more sources of data than just log files. An obvious one is enterprise databases. These are often left out of the Lake equation because either they are forgotten, or extracting data from them is too expensive (on resources) or too complex. However, your view of the Enterprise is not complete without this data.
The notion of running queries against a production database for ETL, or using other extraction mechanisms such as Apache Sqoop, is generally severely frowned upon. Expensive read-only replicas can be used for this purpose, but a better solution is Change Data Capture (CDC). This effectively turns what is happening in a database into a stream of changes that can be fed continually into the Lake. CDC works against the transaction log of the database in a non-intrusive, low-impact manner, and has been approved by DBAs of large organizations for use against production databases.
CDC has many advantages over SQL querying. Receiving the data as it changes means your view of the enterprise becomes real-time, and because data is delivered continuously, you don’t need batch windows or large expensive queries. Additionally, you don’t miss anything. If you are running periodic queries against a database, you can miss things that happen between those queries. If a record was inserted, then subsequently deleted between queries, you won’t see it unless you log some kind of history in the database as well. CDC resolves these issues.
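To make the CDC advantage concrete, here is a sketch of what a consumer of a change stream might look like. The event shape (an operation type plus a row image) is illustrative – real CDC tools emit richer records – but folding every change into state shows why nothing is missed: even an insert-then-delete that happens between two polling queries still produces two events. The `ChangeEvent` and `apply_change` names are my own.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One row-level change read from the database transaction log."""
    op: str               # "INSERT", "UPDATE", or "DELETE"
    table: str
    key: int
    after: Optional[dict]  # new row image; None for a DELETE

def apply_change(state: dict, event: ChangeEvent) -> None:
    """Fold a change event into an in-memory view of the table.

    Because every change arrives as its own event, a short-lived
    insert-then-delete is still observed -- something periodic SQL
    polling would miss entirely.
    """
    if event.op == "DELETE":
        state.pop(event.key, None)
    else:  # INSERT and UPDATE carry the full new row image
        state[event.key] = event.after
```

Feeding the same stream of events into the Lake gives you both the current state and the full history of how it got there, with no batch windows and no expensive extraction queries.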
Devices and sensors (the things in the Internet of Things) bring another host of issues. The exponential growth in worldwide data generation is directly related to the growth in connected devices producing data. You can’t (or shouldn’t) store all of this data, and the sheer volume of it threatens to overload corporate networks.
This is where the difference between data and information is important. We shouldn’t be sending the raw unprocessed data from devices into the Lake; we need to store only the information content of that data. For example, we shouldn’t store the second-by-second readings from a thermostat; we only need to know when the temperature changes. This means we need to process sensor and device data at the edge to filter, aggregate, and remove redundancy. This ensures that we dramatically reduce the volume, and that we are only dealing with useful information, rather than useless or repetitive data points.
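The thermostat example amounts to a simple edge filter, sometimes called a deadband: pass a reading through only when it differs from the last reported value by more than a threshold. A minimal sketch (the function name and threshold are my own, for illustration):

```python
def report_on_change(readings, threshold=0.5):
    """Edge filter: drop repetitive sensor readings, keeping only
    values that differ from the last reported value by at least
    `threshold` (a deadband filter).

    Turns a second-by-second stream of raw data into the much
    smaller stream of information the Lake actually needs.
    """
    last = None
    for timestamp, value in readings:
        if last is None or abs(value - last) >= threshold:
            last = value
            yield timestamp, value  # the temperature changed: report it
```

Run at the edge, a filter like this can cut a steady-state sensor stream down by orders of magnitude before it ever touches the corporate network.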
Making Sense of It All
Which brings us to processing. Once you have discovered what data to collect and how to collect it, you need to identify how to process that data before feeding the Lake. Raw data in a Lake may have value for certain use cases, such as auditing or compliance, but it is rare that you want all of the raw data, or that the raw data is useful as-is. There are a number of different operations that you can perform on raw data to make it more useful, namely filtering, cleansing, transformation, aggregation, and enrichment.
This may sound a lot like ETL. Well, it is… and it isn’t. The key difference is that ETL is usually high-latency and batch-oriented, whereas here we are enabling the continuous scalable processing of high volumes of streaming data.
Perhaps the most important of these operations is enrichment. Yes, you need to filter out useless or irrelevant data, cleanse it to make it accurate, and transform it to match some canonical format across sources. However, a lot of raw data doesn’t contain sufficient context to make it useful.
The Importance of Context
Imagine you are performing change data capture from a highly normalized relational database. Most of the data you get is in the form of IDs. If you simply land this data in your Lake, your queries will have a hard time making sense of those IDs. Similarly, if you have web logs with user IDs, or call records with a subscriber ID, you only have a text string to query, not detailed information about your customer.
If the questions you want to ask – and the queries you want to run – pertain to what your customers have been doing, and you want to slice and dice this based on attributes of your customers (location, preferences, demographics, and so on), you will need expensive joins for every query.
A better solution is to load your reference customer information in-memory, then join your raw data streams with the customer information as the data is flowing through. In this way, the data that lands in your Lake now has the relevant customer context for querying.
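This in-memory enrichment join reduces to a dictionary lookup per event. The sketch below uses an invented customer table and field names; in practice the reference data would itself be loaded (and kept current) via CDC from the customer database:

```python
# Hypothetical reference data, loaded once into memory rather than
# joined against the database at query time.
CUSTOMERS = {
    "c-001": {"name": "Ada", "region": "EMEA", "segment": "enterprise"},
}

def enrich(events, customers):
    """Join each streaming event with customer context as it flows
    through, so records land in the Lake already enriched and
    ready to slice and dice -- no per-query joins needed."""
    for event in events:
        context = customers.get(event["customer_id"], {})
        yield {**event, **context}  # raw fields plus customer attributes
```

Every record that lands in the Lake now carries its customer attributes inline, so a query that slices by region or segment is a simple scan instead of a join.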
A Data Lake can be a powerful ally if it is built in the correct way. By utilizing streaming data collection and processing, you can feed your Lake with data designed to answer your questions, and ensure it is current and relevant. After all, what is better for a lake than a supply of good, healthy streams?
About the author: Steve Wilkes, the founder and Chief Technology Officer of Striim (formerly WebAction), is a life-long technologist, architect, and hands-on development executive. Prior to founding Striim, Steve was the senior director of the Advanced Technology Group at GoldenGate Software. Here he focused on data integration, and continued this role following the acquisition by Oracle, where he also took the lead for Oracle’s cloud data integration strategy. His earlier career included Senior Enterprise Architect at The Middleware Company, principal technologist at AltoWeb and a number of product development and consulting roles including Cap Gemini’s Advanced Technology Group. Steve has handled every role in the software lifecycle and most roles in a technology company at some point during his career. He still codes in multiple languages, often at the same time. Steve holds a Master of Engineering Degree in microelectronics and software engineering from the University of Newcastle-upon-Tyne in the UK.