Index-Free Log Management: Surf the Approaching Tidal Wave of Data Instead of Drowning in It
Whether an organization is diagnosing a system outage, mitigating a malicious attack or trying to get to the bottom of an application response-time issue, speed is critical. Pinpointing and resolving issues quickly and easily can mean the difference between success and crisis for any business, regardless of size or industry. Network and system administrators, security professionals and developers all depend on detailed log data to investigate issues, troubleshoot problems and optimize performance.
However, traditional approaches to log management, namely the index model, can no longer provide the speed organizations need to analyze today’s ever-growing data pools and address issues in real time. Enter the index-free approach.
As a senior sales engineer in the modern log management platform industry, I am most often asked how log management platforms can be so effective without an index.
After being asked this same question many times, I’ve developed the following analogy to help others understand how this approach works, why it matters, what it yields and why enterprises ought to take note. On the surface it may sound complicated, but the analogy rests on a quote, commonly attributed to Einstein, that I subscribe to: “If you cannot explain something simply you don’t understand it well enough.”
The Traditional Index Model
To explain this, we need to understand that the traditional approach to searching data uses indexes. Used for centuries as a fast lookup table at the back of reference books such as encyclopedias and recipe books, the index works because the content of the book doesn’t change. Generally, people use the index to locate the pages that contain the keywords they are looking for.
Indexes are most effective when the text in a book is not predictable (random) and when people look up the reference material far more often than it is updated. “Random” here means that the reader generally doesn’t know beforehand where in the reference to start looking.
The challenge for indexes arises when the information in the book is updated or appended to, as happens with real-time data, whether online or offline. For every new keyword that is added, the index needs to be modified. As the number of keywords increases, the index grows and can eventually overtake the size of the reference itself. The effort to maintain the index, and the pages to write it on, becomes more expensive than the reference material. And because data today is almost an organic entity, evolving, changing and propagating every millisecond, indexes as they have always existed simply don’t work the way we need them to now. The traditional index model for log management hinders data search rather than helps it because it is cumbersome, cost-intensive and time-consuming and, by design, is always playing catch-up.
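To make the maintenance cost concrete, here is a minimal Python sketch of a keyword index over log lines. It is a toy under stated assumptions, not how any particular product builds its index: the `InvertedIndex` class, its whitespace tokenizer and the `index_updates` counter are all hypothetical names introduced for illustration. The point it shows is that every appended line triggers one index write per distinct keyword in that line.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy keyword index (hypothetical): maps each token to the line ids containing it."""

    def __init__(self):
        self.lines = []                   # the "reference material"
        self.postings = defaultdict(set)  # the "index at the back of the book"
        self.index_updates = 0            # how many index writes ingest has cost so far

    def append(self, line: str) -> None:
        line_id = len(self.lines)
        self.lines.append(line)
        # Every distinct token in the new line forces an index modification.
        for token in set(line.lower().split()):
            self.postings[token].add(line_id)
            self.index_updates += 1

    def search(self, token: str) -> list[str]:
        # .get() avoids creating empty posting lists for unknown tokens.
        ids = self.postings.get(token.lower(), ())
        return [self.lines[i] for i in sorted(ids)]


if __name__ == "__main__":
    idx = InvertedIndex()
    idx.append("ERROR disk full on host-1")
    idx.append("INFO login ok user=alice")
    print(idx.search("error"))
    print("lines:", len(idx.lines), "index writes:", idx.index_updates)
```

Lookups are fast, but the write path does work proportional to the number of keywords in every incoming line, which is the catch-up problem described above.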
The Index Model Online
Even the index model online proves challenging, not necessarily for logging historically small volumes of data, but for our new world of so-called big data. As data proliferates and SaaS usage and sheer data volume continue to increase, indexing once again becomes problematic. Keep in mind that some systems generate upwards of 5 TB of logs daily. Imagine trying to keep up with those logs!
The infrastructure, maintenance and staff costs just to store and maintain the indexes are disproportionate to the value of the reference data. As a result, organizations are forced to archive the oldest logs, keeping only a small percentage of logs online and searchable. The aim here is to keep the indexes smaller and allow for efficient querying on a smaller data set. But doesn’t that defeat the purpose of having a rich mine of data to leverage?
The Index-Free Approach
Returning to the analogy of a single book, imagine that a reference book is actually telling the story of the history of the world. What we know to be true is that the start of the book is where the world begins, the end of the book would be the world as of today, and the sections, chapters and pages in between would be ordered historically, from the oldest to the most recent events.
With the history of the world reference book, it is easy for readers to find what happened during a period in the past or on a specific day, as they can use the table of contents to find the section (e.g., the 1800s), and then “thumb through” a greatly reduced number of pages to find what they are looking for. This is a much more efficient process than having to look up each term in an ever-growing index, find the page and then scan the page for that content.
The challenge of real-world data, like the history of the world, is that it is changing every second and the book needs to be constantly updated. Because the updates arrive sequentially, new pages can simply be added to the end of the book as the most recent events, which keeps the book in order and simple to search. This is especially true if the reader is interested in things that happened in a period relative to today, as they know that what they are looking for is going to be in the last few pages.
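The “table of contents” idea can be sketched in a few lines of Python. This is an illustrative sketch only: the `Segment` and `Repository` classes, the `segment_size` parameter and the assumption that events arrive with non-decreasing timestamps are all mine, not a description of any specific platform. New events are appended to the last page, and a time-range search uses a binary search over page end times to skip every page that ends before the range begins.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One 'page' (hypothetical): events in arrival order plus its time bounds."""
    start: int
    end: int
    events: list = field(default_factory=list)


class Repository:
    """Append-only 'book' of segments; assumes timestamps never decrease."""

    def __init__(self, segment_size: int = 3):
        self.segment_size = segment_size
        self.segments = []

    def append(self, ts: int, message: str) -> None:
        # Start a new page when there is none yet or the last one is full.
        if not self.segments or len(self.segments[-1].events) >= self.segment_size:
            self.segments.append(Segment(start=ts, end=ts))
        seg = self.segments[-1]
        seg.events.append((ts, message))
        seg.end = ts

    def search(self, t_from: int, t_to: int, needle: str):
        # "Table of contents": jump to the first page whose end time reaches t_from.
        ends = [s.end for s in self.segments]
        first = bisect.bisect_left(ends, t_from)
        hits, scanned = [], 0
        for seg in self.segments[first:]:
            if seg.start > t_to:
                break  # every later page is newer than the requested range
            scanned += 1
            hits.extend(m for ts, m in seg.events
                        if t_from <= ts <= t_to and needle in m)
        return hits, scanned


if __name__ == "__main__":
    repo = Repository(segment_size=3)
    for ts in range(9):
        repo.append(ts, f"error {ts}" if ts in (4, 7) else f"ok {ts}")
    print(repo.search(7, 8, "error"))
```

Ingest here is a plain append with no per-keyword bookkeeping; the only metadata kept per page is its start and end time.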
To make the search even quicker, a bit of forethought comes into play. With the book analogy, many of us are well versed in adding Post-it notes to each page outlining key information that can be found there. This results in a secondary “filtering” whereby a broad historical time frame can be rapidly reduced to a small subset of pages by simply glancing at the Post-it note and including or excluding that page from the search.
With an index-free mindset, an organization can apply this approach to log files by tagging log streams on ingest, as well as parsing and filtering using a highly efficient implementation of Bloom filters. The result is an exceedingly efficient way to query very large volumes of time-series data and log files, without the need to create, manage and store large indexes.
There are other attributes of modern log management platforms, such as heavy compression, live queries and object storage, that, when used with this index-free approach, make it feasible to search significant amounts of streaming event data without storing and maintaining an index.
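For readers who have not met Bloom filters, the Post-it note idea can be sketched as follows. This is a textbook-style Bloom filter written for illustration, not the implementation used by any log management product; the `BloomFilter` class, the `build_filter` and `search` helpers, and the size and hash-count defaults are all assumptions of mine. A Bloom filter can answer “definitely not on this page” or “possibly on this page,” so pages that fail the check are skipped without being read.

```python
import hashlib


class BloomFilter:
    """Small fixed-size Bloom filter (illustrative defaults, not tuned)."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, token: str):
        # Derive several bit positions per token by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_filter(lines):
    """Write the 'Post-it note' for one page of log lines."""
    bf = BloomFilter()
    for line in lines:
        for token in line.lower().split():
            bf.add(token)
    return bf


def search(segments, token):
    """segments: list of (lines, bloom_filter) pairs. Returns (hits, pages_scanned)."""
    token = token.lower()
    hits, scanned = [], 0
    for lines, bf in segments:
        if not bf.might_contain(token):
            continue  # glance at the note, skip the page
        scanned += 1
        hits.extend(line for line in lines if token in line.lower().split())
    return hits, scanned


if __name__ == "__main__":
    pages = [
        ["INFO login ok user=alice", "INFO logout user=alice"],
        ["ERROR disk full on host-1"],
    ]
    segments = [(lines, build_filter(lines)) for lines in pages]
    print(search(segments, "ERROR"))
```

The filter has a fixed size no matter how many keywords a page contains, which is the design trade-off: it never misses a page that holds the term, but it may occasionally flag a page that does not, in which case that page is scanned and simply yields no hits.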
As a person gets deeper into modern log management, they will see where the analogy matches up with some of its core components:
- Sentences on a page – Log messages
- Pages – Segment files
- Books – Repositories
- Book genres – Tags
- Post-it notes – Bloom filters
- Library or shelves – Storage for hot data retention
- Off-site archive – Cold storage for offline retention
I hope that the reference book analogy helps explain how modern log management helps organizations search time-series events such as logs and metrics, and delivers a more efficient and economical option for log storage and querying.
As we look ahead to 2022, it is clear modern log management solutions are critical for IT leaders who are preparing to improve organizational efficiencies, find new ways to drive business continuity and understand their company operations with ease. Leaders are seeking solutions that can provide a holistic picture of their distributed infrastructure in real time while allowing for speed and scalability. As leaders choose to leverage modern log management solutions that use an index-free approach, many will quickly realize their path to adoption of streaming event data and modern observability within their organization may be shorter and require much less overhead than they think.
About the author: A technologist at heart, Andrew Latham has accumulated more than 20 years of hands-on IT security experience and is currently jointly responsible for regional business growth and success as a solutions architect for Humio in APJ. A practicing Certified Information Systems Security Professional (CISSP) for over two decades, Andrew has extensive real-world experience in team leading and in developing and delivering solutions such as risk analysis, security control selection and implementation, and vulnerability assessment to customers in all industries and to organisations of various sizes worldwide. This deep experience in the IT security industry gives Andrew unique insight into how to execute, not just plan. Beginning his IT career after graduating with a Bachelor’s Degree in IT (Computer Sciences), Andrew has worked his way through the ranks of research and development, service delivery, pre-sales engineering, professional services, and enterprise solutions architecture.