In-Data-Lake BI Is the Next Frontier for Business Intelligence
Remember the days before the relational database? Neither do I, so I have no idea how painful it was to keep track of data back then. But even after relational databases became more mainstream, data was still mostly about keeping track of stuff: product inventory, customer information, customer orders, etc. Then, one day, everyone wanted to use their databases to make better decisions. Before we knew it, we had data warehouses and business intelligence (BI) tools. Soon after, big data appeared, and smart people realized the relational database and data warehouse weren’t very good for that. Search engines, Apache Hadoop, NoSQL databases, Apache Kafka, et al., got more attention.
The jump from data warehouses to “big data platforms,” especially Apache Hadoop, wasn’t nearly as smooth as we all hoped. Analytical queries were slow, sometimes acceptably (like with MapReduce), but often not. People tried to bolt data warehouse BI tools onto their Hadoop deployments, and that approach didn’t work, either. People blamed Hadoop and data lakes.
No Such Thing as Big Data BI?
Then, the conversation shifted to “big data BI.” Some pundits used to say that there was no such thing as “big data.” For them, the concept of “big data BI” really had no chance.
But people are coming around to the idea of big data BI, especially when it comes to big data platforms like Hadoop and big data architectures like data lakes. These approaches allow organizations to load data directly from the source and make it available for analysis without the need for extensive up-front modeling and transformation. The ability to glean insights from unstructured data types, which was difficult and impractical with traditional data warehouses, was a game-changer, and data experts recognized this.
Traditional BI tools (those you currently use with your data warehouse) supposedly support Hadoop, but they still require data to be extracted and transformed from Hadoop to a dedicated BI server. That doesn’t sound like “big data BI.” On the other hand, the Forrester Research Native Hadoop BI Platforms Wave report was one of the first documented assertions that big data BI was a real market. The report was written in 2016, and the market has grown since then, but at the same time, Hadoop itself has gotten a lot of criticism. Some began to feel that maybe Hadoop isn’t right for BI-style analytics, and that the “native Hadoop BI platform” category was going to be subsumed by the broader, traditional BI market.
It turns out that, more than two years later, the traditional BI platforms still can’t handle big data efficiently. Industry experts are calling this out, subtly for now, but I believe this story is going to pick up more steam in the coming months.
For example, Dresner Advisory Services recently published a research report on big data analytics, recognizing the use of BI tools specifically for a big data environment. Boris Evelson of Forrester Research discusses how new BI architectures are required for agile insights in a recent report on BI migration. In its recent 15-Step Methodology for Shortlisting BI Vendors, Forrester refers to this new architecture as “in data lake cluster BI platforms,” which it defines as a repository “where all data and applications reside inside data lake clusters, such as Amazon Web Services, Hadoop, or Microsoft Azure.” (Forrester has since updated the term to “in-data-lake BI” in a subsequent report on systems of insight.)
That means BI professionals must adapt to the more advanced environments that data lakes present. We believe “in-data-lake BI” is the next generation of BI. This generation of modern BI tools has four key characteristics:
A Scale-Out Distributed Architecture
In a scale-out architecture, organizations can add servers/processors to their existing cluster in a linear fashion. In theory, this provides nearly unlimited scale. That scale, and the way in which it’s achieved, offers flexibility and agile provisioning at low cost. The ability to scale out is in stark contrast to legacy architectures, which rely on dedicated BI servers and data warehouses that require scale-up growth or massively parallel architectures. Both techniques are far more expensive and limiting than a scale-out model.
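To make the scale-out idea concrete, here is a minimal sketch of how adding nodes shrinks the share of data each node has to handle. The hash-based placement, the `assign_node` and `rows_per_node` helpers, and the synthetic `order-…` keys are all assumptions made for illustration; they do not describe how any particular BI platform or data lake places data.

```python
from collections import Counter
import zlib


def assign_node(key: str, node_count: int) -> int:
    """Map a row key to a node using a stable hash (illustrative placement only)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def rows_per_node(keys: list[str], node_count: int) -> list[int]:
    """Count how many rows each node would hold for a given cluster size."""
    counts = Counter(assign_node(key, node_count) for key in keys)
    return [counts.get(node, 0) for node in range(node_count)]


if __name__ == "__main__":
    # Hypothetical workload: 100,000 synthetic row keys.
    keys = [f"order-{i}" for i in range(100_000)]
    for node_count in (2, 4, 8):
        load = rows_per_node(keys, node_count)
        print(f"{node_count} nodes: busiest node holds {max(load)} rows")
```

In this toy model, the busiest node’s share falls as nodes are added, which is the property that lets a cluster grow by adding commodity servers rather than by buying a bigger single machine. Real systems also have to rebalance data and coordinate queries, which this sketch ignores.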
In practice, we might consider the example of a large information services company that provides marketing analytics to major global corporations. The sheer volume of data this company provides is too much for traditional BI technologies to handle, even though the company uses a modern platform such as a Hadoop data lake. The problem here is that, while the Hadoop-based data lake offers reliable storage and processing, the interaction with traditional BI tools presents a bottleneck for delivery of analytics to end users.
This is not so much the fault of Hadoop-style data lakes as it is that of traditional BI tools and processes, which do not scale to match the organization’s growth. In a scale-up architecture, adding more data to the analytics process to serve additional customers becomes expensive and time consuming, and imposes too many performance restrictions on the system. For a company like this, an in-data-lake BI approach represents a huge win in terms of saving cost, time, and effort, and also provides the performance and user concurrency customers demand.
BI Processing Runs Natively in the Data Lake
Non-native BI tools require extract databases, which bring numerous downsides: redundancy and inconsistency with source data, data movement effort, extra systems to manage, and processing and storage overhead. Extract tables and multidimensional cubes take a long time to create, increasing the risk that the data will be stale by the time it’s ready to use. Finally, some regulated industries must restrict duplication of production data, which makes non-native BI tools even less flexible. Native processing, by comparison, takes advantage of the servers in a data lake cluster, in a model popularized by Hadoop, and does not require data movement.
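The difference in data movement can be sketched with a toy comparison. The example below is an assumption-laden illustration, not any vendor’s implementation: the `partitions` data is synthetic, and `extract_then_aggregate` and `aggregate_in_place` are hypothetical helpers that simply count how many records would have to leave the nodes where they are stored.

```python
from collections import defaultdict

# Hypothetical telemetry rows (key, value), already spread across three nodes.
partitions = [
    [(f"array-{(i + node) % 3}", 1) for i in range(1000)]
    for node in range(3)
]


def extract_then_aggregate(parts):
    """Extract-style BI: copy every row to one server, then aggregate there."""
    copied = [row for part in parts for row in part]
    totals = defaultdict(int)
    for key, value in copied:
        totals[key] += value
    return dict(totals), len(copied)  # every raw row was moved


def aggregate_in_place(parts):
    """Native-style BI: each node aggregates locally; only partial sums move."""
    totals = defaultdict(int)
    moved = 0
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        moved += len(local)  # one partial sum per key leaves the node
        for key, subtotal in local.items():
            totals[key] += subtotal
    return dict(totals), moved


if __name__ == "__main__":
    extract_totals, extract_moved = extract_then_aggregate(partitions)
    native_totals, native_moved = aggregate_in_place(partitions)
    print("same answer:", extract_totals == native_totals)
    print("records moved (extract):", extract_moved)
    print("records moved (native):", native_moved)
```

Both functions return the same totals, but the in-place version ships only a handful of partial sums instead of every raw row. Production engines are far more involved, so treat this only as a picture of why pushing computation to the data reduces movement.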
We can watch native processing play out in an example with a company that collects telemetry data from customer-deployed storage arrays. The information the company collects might help identify issues (related to usage, warning conditions, and failure) that, if addressed, would help the company better serve its customers. Because the different customer arrays generate so much data, a data lake environment offers the only scalable analytics environment. However, all that data might be for nothing if the company isn’t able to analyze it quickly and with minimal overhead.
An in-data-lake BI platform is ideal for this use case, since the company can analyze its customer data as soon as it lands, without the additional overhead of moving the data externally to a data mart or other dedicated BI platform. As a result, it can immediately identify when customers are ready for additional storage, when components are failing and need replacement, and which factors contribute to lower reliability.
Many industries, such as financial services, use data analytics in a variety of line-of-business operations: customer retention and acquisition, and fraud detection, to name a few. These firms are leveraging machine learning algorithms to analyze enormous volumes of data, quickly scanning transactional records to make cost-reducing decisions. For all of this, a data lake architecture makes sense. But for it to work, the analytics technologies must be deeply integrated into the architecture, rather than bolted on to it. In-data-lake BI provides that deep integration and can allow firms to move quickly and react immediately in a dynamic market.
Support for Multiple Data Sources
While research suggests the majority of companies collect data from fewer than five external sources, a number of organizations still leverage five or more. As the number of IoT devices continues to rise, and organizations learn to implement machine learning algorithms and other tools that enable artificial intelligence, the number and variety of external data sources should continue to grow.
Sources today include Hadoop HDFS, cloud object storage like Amazon S3 and Microsoft ADLS, and distributed streaming platforms like Apache Kafka. It is crucial that today’s in-data-lake BI platforms integrate with these sources, as well as with other modern data platforms.
Flexible Deployment Options
In-data-lake BI platforms need to work across whatever combination of platforms customers choose, providing insights to end users while also simplifying IT work. To achieve this cross-platform functionality, organizations should look to on-premises, cloud, hybrid cloud, and multi-cloud as equally viable ways to run BI analytics systems.
The BI platform should be able to run on almost any reasonably sized computer, whether physical or virtual, as scale (and performance to some degree) is achieved through the addition of more nodes to the cluster. An important aspect of the deployment options is the ability to support object storage to enable environments where data is decoupled from the compute engines. Object storage is used by organizations today regardless of where the computing layer resides, even on-premises.
Organizations are still figuring out how to get the most out of in-data-lake BI, and so the architecture will evolve with customer demands. One thing is clear, though: BI tools must evolve with customers’ data landscapes. The winning companies in today’s business environment will be the ones who carve the shortest path to decisions. To achieve the competitive edge promised by data lakes, organizations should look for the modern BI tools that align with the above four characteristics of in-data-lake BI.
About the author: Shant Hovsepian is a co-founder and CTO of Arcadia Data, where he is responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data. Shant interned at Google, where he worked on optimizing the AdWords database, and was a graduate student in computer science at UCLA. He is the co-author of publications in the areas of modular database design and high-performance storage systems.