June 9, 2022

Is Index-Free The Answer to the Looming Data Deluge Problem?

John Smith

The legacy log management industry is in trouble. With data volumes expected to reach 181 zettabytes by 2025, many log management vendors are rapidly approaching a breaking point. Their prevalent technology is index driven, meaning their indices will soon dwarf the actual reference data users are trying to collect!

In a recent Stack Overflow thread, I noted the following response to a question about why an Elasticsearch index was 30GB when the log source was only 3GB.

“If you’re working with a source of 3GB and your indexed data is 30GB, that’s a multiple of about 10x over your source data. That’s big, but not necessarily unheard of. If you’re including the size of replicas in that measurement, then 30GB could be perfectly reasonable.”

This reads like the definition of technical debt. The yield here is only 10%: of the 30GB on disk, just 3GB is the data you set out to collect. And we’re not just dealing with an increase in storage; that 30GB of indexed data will also significantly impact the hardware required to support it. What ends up happening is that the memory, IO and cores you want to use for threat hunting, alerting and ad hoc queries are instead spent on servicing these indices.
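The arithmetic behind that 10% figure can be sketched in a few lines. This is only an illustration using the 3GB and 30GB numbers from the quoted thread; the function names are made up for this example and do not come from any product.

```python
def expansion_factor(source_gb: float, indexed_gb: float) -> float:
    """How many times larger the indexed data is than the raw source."""
    return indexed_gb / source_gb


def storage_yield(source_gb: float, indexed_gb: float) -> float:
    """Fraction of the on-disk footprint that is the original log data."""
    return source_gb / indexed_gb


# Figures taken from the quoted Stack Overflow thread.
source_gb = 3.0
indexed_gb = 30.0

print(f"expansion: {expansion_factor(source_gb, indexed_gb):.0f}x")
print(f"yield: {storage_yield(source_gb, indexed_gb):.0%}")
```

The same two ratios can be computed for any cluster by substituting its own source and on-disk sizes.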

This begins to resemble the poor debtor whose interest payments have become so enormous that they can never get out from under their obligation. Like our poor debtor, we’re left wondering how our legacy log management solutions went from racks to rows in our data center.

It’s not uncommon to have index-driven solutions that require hundreds of servers to accommodate existing logging needs, just to get to tens of terabytes of data ingestion. So what’s the solution to the data deluge? Do you add more nodes? Cores? Memory?

How Did We Get Here?

Let’s start by looking at the evolution of logs over the last 25 years. When my career started, I might purchase a server and run some flavor of Unix on it (AIX, Solaris, HP-UX, etc.). In addition to my routers and switches, I’d also send syslogs from this server to a SIEM or log management solution. Later, hardware dropped in price and the hypervisor came along, and my single piece of iron was now running a hypervisor and 10-15 guest operating systems.

Today, we have Docker/Kubernetes, where a pod OS and hundreds of containers run on the same piece of iron. What was once a single source of logging now sends logs from hundreds of systems. Couple this with applications that embed their own logging capability, and with a number of connected devices that is expected to increase by 5x over the next 10 years. Welcome to the data deluge.

The Index-Free Answer

At a minimum, the time has come to investigate a hybrid solution that combines new and old approaches. In this scenario, you can have one system designed to handle the bulk of log collection while still giving teams the flexibility to forward important logs to their incumbent solutions. This enables teams to share the burden of log management with a better performing and less expensive alternative.

You may have heard the term “index-free” technology being discussed. But what exactly is it? And how can dropping indices lead to faster searches and reduced storage requirements?

Index-free is a combination of several different technologies that changes the way data is processed when it’s ingested. Removing indexing from the ingestion process opens up new ways for teams to relate to their data by speeding up search results and reducing costs.

A traditional log management approach would require writing the data, querying it and then populating the results to a dashboard. With an index-free approach, searching data stored on disk is limited to interactive, ad hoc queries during an incident, and those queries leverage bloom filters.
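To make the bloom filter idea concrete, here is a minimal sketch of how a search might skip stored segments that cannot contain a term. It is a generic illustration, not a description of how any particular product works: the `BloomFilter` class, the segment layout, the filter size and the whitespace tokenization are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, token: str):
        # Derive several bit positions per token by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_segment(lines):
    """Store raw lines plus a small filter over their tokens (no full index)."""
    bloom = BloomFilter()
    for line in lines:
        for token in line.split():
            bloom.add(token)
    return {"lines": lines, "filter": bloom}


def search(segments, token):
    """Scan only the segments whose filter says the token might be present."""
    hits = []
    for segment in segments:
        if not segment["filter"].might_contain(token):
            continue  # definitely absent, so skip the scan entirely
        hits.extend(line for line in segment["lines"] if token in line.split())
    return hits


segments = [
    build_segment(["login ok user=alice", "login ok user=bob"]),
    build_segment(["login failed user=mallory", "disk usage 91%"]),
]
print(search(segments, "failed"))
```

The filter is a fixed-size summary per segment, so it stays small relative to the data, and a false positive only costs an unnecessary scan rather than a wrong result.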

In addition, an index-free logging approach can be a supplementary solution that augments your existing investment and removes some of the burden of high-capacity logs. With an index-free architecture operating at petabyte scale, you can finally say yes to the log sources you are currently saying no to, and use event forwarding to send important events to your index-driven solutions. This is huge for organizations that want to aggregate, manage and use log data to make real-time decisions across both the IT and business landscapes.
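The hybrid arrangement described above might be sketched as follows: every event lands in the bulk store, and only events that match a rule are also forwarded to the incumbent index-driven system. The `route_events` function, the `severity` field and the rule itself are hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Dict[str, str]


def route_events(
    events: Iterable[Event],
    is_important: Callable[[Event], bool],
) -> Tuple[List[Event], List[Event]]:
    """Keep everything in the bulk store; forward only the important events."""
    bulk_store: List[Event] = []
    forwarded: List[Event] = []
    for event in events:
        bulk_store.append(event)  # bulk system collects all logs
        if is_important(event):
            forwarded.append(event)  # incumbent system sees a small subset
    return bulk_store, forwarded


events = [
    {"severity": "info", "msg": "health check ok"},
    {"severity": "critical", "msg": "privilege escalation detected"},
    {"severity": "debug", "msg": "cache miss"},
]

# Hypothetical rule: only critical events go to the index-driven solution.
bulk, forwarded = route_events(events, lambda e: e["severity"] == "critical")
print(len(bulk), "stored;", len(forwarded), "forwarded")
```

In practice the forwarding rule would be whatever a team considers important enough to justify the indexing cost.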

The data deluge is upon us. With indices soon to outgrow the data you’re trying to collect, your ability to interact with the data in an ad hoc fashion is limited, because the indices themselves become difficult to compute. Eventually, the size of the index impacts performance so severely that the very reason for having the index in the first place (faster queries) is undermined altogether.

The time has come to investigate alternatives to heavy indices and find a way to coexist with legacy log management solutions, so your DevOps, ITOps and SecOps teams can reclaim visibility of their infrastructure and handle the data deluge at scale.

About the author: John Smith is the director of technical marketing engineering at Humio, a CrowdStrike company. John has more than 20 years of experience in a variety of roles, from big data, DevOps and SecOps to sales, marketing and integration leadership. He has worked in security for more than 13 years, including pioneering work with event correlation, behavioral analytics and remote access.
