October 15, 2018

Is it Time to Drain the Data Lake?

Dominic Wellington

The term “Big Data” has been a major point of enterprise technology conversation for decades, and with the rise of the data lake, it’s back in the spotlight.

Early on, the big data conversation was about how to store the massive amounts of data being generated by newly built Web-scale infrastructures. Best practice used to hold that the file system partition assigned to logging data, /var/log, should be measured in single-digit hundreds of megabytes. Suddenly that was barely sufficient even for a system's own internal, technical logs, let alone anything relating to the purpose the system was designed to serve, and the volume of data was increasing by the day.

In today's IT world, if data storage is measured in anything less than terabytes, we likely won't even bother to monitor it. Data transmission can still lag behind, but network speeds measured in gigabits per second are fast enough within a single environment. With easy access to vast amounts of storage and transmission capacity, our first reaction is to grab hold of every piece of data that comes through. The assumption behind the common "data lake" metaphor is that if we pour enough data into the lake, we will eventually have enough information to find whatever we are looking for. Who knows, it might be useful later, right? That's not always the case, and here's why.

Data Is, and Will Always Be, Valuable

Data is much more valuable than it sometimes gets credit for, not least because, for so long, it was difficult to gather in the first place.


Of the enormous amount of data we see each day, most is never even queried, let alone put to use. Imagine the reservoir behind an Alpine dam: an untouched surface, but beneath it a huge mass of cold, anoxic water littered with dead tree trunks and boulders. In IT terms, that is the data lake. The information in a data lake is valuable, but like those sunken trunks and boulders, it can be hard to retrieve from among the clutter.

The Importance of Filtering Out the Noise

To extend the metaphor, pulling data back out of a data lake can be time-consuming and tedious. When fishing for something valuable, nets get snagged on meaningless rubbish like old car tires and boat parts, and the catch is missed entirely.

And it's not always worth the effort to dive deep into these lakes to fish out the most important information, especially when there is really only time to skim the surface. Because the lake is so deep and so dense with data, it takes time to filter out the noise and decide what is useful and what can be tossed back in. What is lacking right now is true observability of the massive amounts of data we hold.

How to Benefit from AI in Big Data

Lately, the big data conversation has turned to how artificial intelligence and machine learning, rather than humans, will take over the task of extracting value from the data lake. AI can spot patterns in the data and flag them automatically, before people even have to ask. Sounds like a win-win situation, right? Yet at first, people didn't change how they treated big data: simply put, they kept hoarding it without applying the capabilities AI can offer, while paying astronomical bills for storage and management.


In a data lake deep enough, almost any pattern can be found, including ones that aren't relevant. But if we can filter the data streams before they enter the lake, we can apply analytic algorithms to data that is actually worth analyzing rather than to the whole unfiltered lake. Redundancy filtering can be done at the source or in the stream, and higher-order analytics can then be carried out on the "leftover" data once it has been filtered.

As we gain a better understanding of what this means and how to apply it, we can finally rid ourselves of the age-old filtering dilemma and adopt tools that remove just enough data, but not too much, directly at the network edge, so that attention goes to what matters most. With the help of AI-enabled filtering, we can ensure that all relevant and actionable data makes it to the next stage while irrelevant data is left behind.
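As a rough illustration of the kind of in-stream filtering described above, here is a minimal sketch that drops duplicate and low-relevance events before anything is written to the lake. The event fields, the fixed set of log levels standing in for a relevance model, and the forwarding step are all assumptions made for the example, not a description of any particular product.

```python
import hashlib
import json

seen_fingerprints = set()          # events already forwarded to the lake
KEEP_LEVELS = {"ERROR", "WARN"}    # crude stand-in for a learned relevance model

def fingerprint(event: dict) -> str:
    """Stable hash of the fields that make two events 'the same' for our purposes."""
    key = json.dumps({k: event.get(k) for k in ("source", "message")}, sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()

def filter_stream(events):
    """Yield only events worth storing: not a repeat, and above the relevance bar."""
    for event in events:
        fp = fingerprint(event)
        if fp in seen_fingerprints:
            continue                      # redundancy filtering: drop repeats
        if event.get("level") not in KEEP_LEVELS:
            continue                      # relevance filtering: drop routine noise
        seen_fingerprints.add(fp)
        yield event                       # only this subset ever reaches the lake

# Three events arrive; only one copy of the error survives the filter.
sample = [
    {"source": "web-01", "level": "INFO",  "message": "heartbeat"},
    {"source": "web-01", "level": "ERROR", "message": "timeout calling payments"},
    {"source": "web-01", "level": "ERROR", "message": "timeout calling payments"},
]
print(list(filter_stream(sample)))
```

In a real pipeline, the hard-coded level check would be replaced by whatever model scores relevance, which is where AI-enabled filtering at the edge comes in.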

It’s becoming more apparent that what we see on the surface of a data lake — or that shiny object in the distance — may not always be as useful as we once thought.

The Future of Data Lakes

Data can be incredibly insightful, but on its own it is not the hero of this story. Of course we can benefit from gathering massive amounts of data, but it is AI that can spot the useful patterns and bring them to the attention of human specialists from different domains, without those patterns getting lost in the clutter of a data lake.

More specifically, we can benefit from a process that correlates a specific user request, the exact code path it exercised, and system resource utilization, identifies a new type of problem, and allows specialists to strategize and pin down, in real time, why the algorithm flagged that particular pattern.
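A minimal sketch of what that kind of correlation could look like, assuming events from traces, metrics, and logs share a common request identifier; the field names and the summary format are hypothetical, chosen only for the example.

```python
from collections import defaultdict

def correlate(events):
    """Group events from different telemetry sources by the request they belong to."""
    by_request = defaultdict(list)
    for event in events:
        by_request[event["request_id"]].append(event)
    return by_request

def summarize(request_id, events):
    """Condense one request's events into a single pattern a specialist can act on."""
    return {
        "request_id": request_id,
        "code_path": [e["span"] for e in events if e["source"] == "trace"],
        "cpu_peak": max((e["cpu"] for e in events if e["source"] == "metrics"), default=None),
        "errors": [e["message"] for e in events
                   if e["source"] == "log" and e.get("level") == "ERROR"],
    }

# One user request seen through three different lenses becomes one consolidated pattern.
events = [
    {"request_id": "r-42", "source": "trace",   "span": "checkout.submit"},
    {"request_id": "r-42", "source": "metrics", "cpu": 0.97},
    {"request_id": "r-42", "source": "log",     "level": "ERROR", "message": "db pool exhausted"},
]
for rid, evs in correlate(events).items():
    print(summarize(rid, evs))
```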

Sure, subsidiary systems can identify whether a pattern is new or a recurring issue by searching the data lake, but over time more and more incidents will be occurring for the first time, cluttering the lake even further. In turn, this depletes the value of the data lake as a whole. This isn't to say that data lakes are entirely without value, but without the ability to monitor in real time, they are becoming less and less of a necessity.
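For completeness, here is a sketch of how such a recurrence check might work, using an in-memory index as a stand-in for a query against the lake's historical patterns; the pattern fields match the hypothetical summary above and are equally illustrative.

```python
import hashlib

known_patterns = {}   # stand-in for an index over historical patterns in the lake

def pattern_key(pattern: dict) -> str:
    """Fingerprint a pattern by the parts that define 'the same kind of incident'."""
    basis = "|".join(sorted(pattern["errors"]) + pattern["code_path"])
    return hashlib.sha256(basis.encode()).hexdigest()

def classify(pattern: dict) -> str:
    """Label a pattern 'recurring' if its fingerprint has been seen before, else 'new'."""
    key = pattern_key(pattern)
    if key in known_patterns:
        known_patterns[key] += 1
        return "recurring"
    known_patterns[key] = 1
    return "new"

# The first sighting is new; an identical pattern later is recognized as recurring.
incident = {"errors": ["db pool exhausted"], "code_path": ["checkout.submit"]}
print(classify(incident))   # -> "new"
print(classify(incident))   # -> "recurring"
```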

About the author: Dominic Wellington is Global IT Evangelist at Moogsoft, where his primary focus is the emerging field of Algorithmic IT Operations (AIOps). He has been involved in IT operations for a number of years, working on SecOps, cloud computing, and data center automation. Dominic is fluent in English, Italian, French, German, and Spanish, and has studied and worked in Italy, England, and Germany.
