Databricks Cranks Delta Lake Performance, Nabs Redash for SQL Viz
Today at its Spark + AI Summit, Databricks unveiled Delta Engine, a new layer in its Delta Lake cloud offering that uses several techniques to significantly accelerate the performance of SQL queries. The company also announced the acquisition of Redash, which develops a visualization tool that will be integrated with Databricks’ Lakehouse.
Delta Engine is a new layer that sits atop Delta Lake, the structured transactional data storage layer that Databricks launched three years ago to address a variety of data ingestion and quality issues that customers were facing with the emergence of data lakes running atop cloud object stores, such as Amazon S3.
While object stores provided great scalability and reliability for storing structured and unstructured data, they failed to deliver the type of performance that companies had grown to expect from data warehouses. So Databricks developed Delta Lake to sit atop cloud object stores, and provided a series of techniques and processes (including ACID transactions to order data; use of Spark to handle growing metadata; indexing; and data schema validation) to go from “bronze” level data quality in data lakes to “gold” level data tables in support of downstream data warehousing and machine learning workloads.
“But from a performance perspective, indexing alone isn’t enough,” Databricks CEO Ali Ghodsi said today during his keynote address at Spark + AI Summit. “For many analytics workloads that involve small queries that need to be executed very fast, you need more. You want a high-performance query engine that can deliver the performance required for traditional analytic workloads. I’m pleased to announce that at Databricks, we’ve been working on a solution that we call Delta Engine.”
Delta Engine is a new query engine that’s based on the recently released Apache Spark 3.0 framework, which itself features a range of enhancements for SQL processing. The software is compatible with the Spark SQL API, the Spark DataFrame APIs, and the Spark Koalas APIs.
According to Databricks Chief Architect Reynold Xin, Delta Engine boosts performance in three main ways: through an improved query optimizer; through a vectorized execution engine written from scratch in C++; and through a caching layer that sits between the execution engine layer and the cloud object store for faster I/O throughput.
Delta Engine’s new query optimizer builds on Spark’s existing cost-based optimizer and its adaptive query execution optimizer, adding more advanced statistics, Xin says. “This technique can actually deliver up to 18x performance increase for star-schema workloads,” he says.
Its new caching layer, meanwhile, automatically chooses which input data to cache for the user. “And it’s not just a dumb bytes cache that caches the raw data,” Xin says. “It transcodes data into a more secure, efficient format that’s faster to decode [for the] query processing layer. With this we can fully leverage the throughput of the local NVMe SSDs….This delivers up to 5x scan performance increase for virtually all workloads.”
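The difference between a raw bytes cache and a transcoding cache can be illustrated with a minimal Python sketch. It is not Databricks’ implementation: the `TranscodingCache` class, the `OBJECT_STORE` dictionary, and the comma-separated input format are all hypothetical, and the cache lives in memory rather than on local NVMe SSDs. The point it shows is that the parsing cost is paid once, on the first read, and later scans work directly on a typed column.

```python
from array import array

# Hypothetical stand-in for a cloud object store: key -> raw text-encoded bytes.
OBJECT_STORE = {"sales/part-0": b"10,20,30,40"}


class TranscodingCache:
    """Caches a decoded, typed column instead of the raw bytes (illustrative only)."""

    def __init__(self, store):
        self._store = store
        self._columns = {}      # key -> array of 64-bit integers
        self.remote_reads = 0   # counts trips to the simulated object store

    def get_column(self, key):
        column = self._columns.get(key)
        if column is None:
            raw = self._store[key]          # slow path: simulated remote read
            self.remote_reads += 1
            # Transcode once: parse the text into a compact typed array.
            column = array("q", (int(v) for v in raw.split(b",")))
            self._columns[key] = column
        return column


cache = TranscodingCache(OBJECT_STORE)
first = sum(cache.get_column("sales/part-0"))   # reads and transcodes
second = sum(cache.get_column("sales/part-0"))  # served from the cache
```

A production cache would also have to decide what to keep and what to evict, which is the part Xin describes as automatic; the sketch leaves that out.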
Xin spent a bit of time in his keynote address today digging into the design considerations behind the execution engine itself. The Databricks co-founder lamented that, while storage and network capabilities have increased by around 100x since 2010, the CPU layer has languished with virtually no change. x86 processors are still running at about 3GHz, he said.
“So as every other dimension has become faster and faster, the CPU becomes more of the bottleneck,” Xin says. “The first important question to ask ourselves in designing execution engine for Delta Engine is how do we achieve the next level of performance?”
In addition to looking at modern hardware trends, the Databricks team looked at how data teams were working. As the pace of data-driven business innovation increases, data teams have less time to do the kind of detailed data modeling work that enabled them to get maximum performance out of data analytics systems, Xin said.
“If your business context is changing every six months, there’s really no point to spend six months to a year modeling your data, like how we did it back in the 1990s and 2000s with data warehouses,” he said. “Unfortunately, lot of the characteristics of the new workloads — the lack of data modeling, are not benefiting the performance of the query execution because many of the query engines were designed in an era in which data was very well modeled.”
The second important question the Databricks team asked itself was whether it’s possible to get both data agility and high performance without sacrificing either. “Can we get great performance for very well-modeled data, but also still pretty good or great performance for not so well modeled data? And Photon is our answer to those two questions.”
Photon is the name of the new execution engine that powers Spark SQL workloads in Delta Engine. As Xin says, it was built from scratch in C++ to maximize control over the underlying hardware. “It really leverages two important principles. The first is called vectorization,” he said. “The idea here is we exploit data-level parallelism and instruction-level parallelism for performance.”
Benchmark tests show that, with the vectorization provided by Photon, traditional SQL workloads will see a 3.3x speedup compared to running Delta Lake without Photon and Delta Engine, Xin said.
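The vectorization principle Xin describes can be sketched by contrasting two ways of running the same filter-and-sum query. The Python below is only an illustration of the execution model, not Photon’s code, and the function names and sample data are made up. Python itself does not emit SIMD instructions; in a language like C++, the tight per-batch loops over contiguous typed columns in the second function are the kind of code that compilers and CPUs can run with data-level parallelism.

```python
from array import array


def total_row_at_a_time(rows, min_qty):
    """Row-at-a-time: each row is a tuple; filter and sum are interleaved per row."""
    total = 0.0
    for price, qty in rows:
        if qty >= min_qty:
            total += price * qty
    return total


def total_vectorized(prices, qtys, min_qty, batch_size=1024):
    """Batch-at-a-time: each column is a contiguous typed array."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = qtys[start:start + batch_size]
        # Step 1: build a selection vector of qualifying positions in the batch.
        selected = [i for i, x in enumerate(q) if x >= min_qty]
        # Step 2: multiply and accumulate only the selected positions.
        total += sum(p[i] * q[i] for i in selected)
    return total


# Hypothetical sample data: (price, quantity) pairs.
rows = [(2.5, 4), (1.0, 1), (3.0, 10), (0.5, 8)]
prices = array("d", (r[0] for r in rows))
qtys = array("q", (r[1] for r in rows))

row_result = total_row_at_a_time(rows, min_qty=4)
batch_result = total_vectorized(prices, qtys, min_qty=4, batch_size=2)
```

Both functions return the same total; what changes is the shape of the inner loops, which is where a vectorized engine gets its speedup.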
Databricks used some other tricks in Delta Engine to boost performance, including the capability to turn off UTF-8 string encoding when the system detects pure ASCII. UTF-8 is useful for working with more complex characters, but it comes at a steep performance cost relative to processing basic ASCII data, which makes up the majority of data sets.
According to Xin, the ability to turn off UTF-8 processing for ASCII data gives Delta Engine a performance benefit that is “orders of magnitude” better than leaving UTF-8 turned on. It’s an “auto-magic” system that doesn’t require any user input, he said.
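A rough idea of why an ASCII fast path helps can be shown in a few lines of Python. This is a sketch of the general technique, not of Delta Engine’s internals, and the `char_length` and `upper` helpers are hypothetical. In pure ASCII data every character is exactly one byte, so string functions can operate on raw bytes and skip UTF-8 decoding entirely.

```python
def char_length(data: bytes) -> int:
    """Number of characters in a UTF-8 encoded string."""
    if data.isascii():
        # Fast path: in pure ASCII, one byte is exactly one character.
        return len(data)
    # General path: decode multi-byte UTF-8 sequences before counting.
    return len(data.decode("utf-8"))


def upper(data: bytes) -> bytes:
    """Uppercase a UTF-8 encoded string."""
    if data.isascii():
        # Fast path: byte-wise uppercase, no decoding needed.
        return data.upper()
    # General path: decode, apply Unicode-aware uppercase, re-encode.
    return data.decode("utf-8").upper().encode("utf-8")


ascii_value = b"delta"
utf8_value = "naïve".encode("utf-8")  # "ï" takes two bytes in UTF-8
```

Here `char_length(utf8_value)` must decode the input because its byte count and character count differ, while `char_length(ascii_value)` can return the byte count directly.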
Meanwhile, Databricks opened its war chest and acquired Redash, which gives the company a new interface for its customers to use for querying databases and visualizing the results of SQL queries.
Redash is an open source tool that is used by millions of users at more than 7,000 customers. The software provides a SQL interface that customers can use to query their databases in their natural SQL syntax, and builds on that with a variety of visualizations. It supports more than 40 databases out of the box, including NoSQL databases, and supports a variety of APIs.
Databricks says Redash can be used today via a free connector. In the coming months, the software will be fully integrated into Databricks’ Unified Data Analytics Platform, as well as the Databricks workspace, its Jupyter-based notebook interface. It will also be optimized to take advantage of Delta Engine, the company says.
Databricks has several other announcements planned for the Spark + AI Summit, which is taking place online this week. Stay tuned for more news.