Getting to Optimal: An HP Vertica Query Tale
Query optimizers are, by their nature, not perfect. While there is theoretically an “optimal” way to execute any given database query, finding that perfect path through the database’s innards takes too much time and resources, so a “good enough” approach is used. A similar story unfolded at HP Vertica in the development of the query optimizer itself, which also affected how the column-oriented database would be positioned to tackle big data problems of the future.
Query optimizers are unsung bits of code that run silently, out of sight and out of mind of the end users and applications that impatiently demand answers from databases. They are products of computer engineers, the result of a “best guess” approach to anticipating what kinds of information might be requested, and the quickest way to extract that data given various database structures and levels of complexity. Due to their designs, some query optimizers do well in some circumstances but poorly in others. They are, like people, not perfect.
When the folks at Vertica (now Hewlett-Packard) set out to build a query optimizer for its big data database, there were several directions available to take. Just as a query optimizer must select from a limited number of options in how to execute a database request, the engineers at Vertica had different paths they could take in developing the optimizer itself. After exploring two options, the company finally hit pay dirt with the third.
The first attempt to build a query optimizer ran into problems with the schema and the shape of real world customer data. As the company explains in a recent white paper, the first query optimizer, called StarOpt, had Kimball-style design, which, the company says, assumes that any customer schema and query could be modeled as a star or snow flake.
But the rosy assumption behind this approach fell flat when it hit real world data. “It was clear that most customer data did not exactly conform to the star ideal,” the company writes.
Making matters worse was the fact that StarOpt ran distribution and sort algorithms after the join order was chosen, which affected the query optimizer’s capability to minimize the movement of data, the white paper’s authors write. “As we learned, an industrial optimizer must handle all the vagaries of messy schemas and data, however nonsensical they may seem to the implementers,” the company writes.
Vertica’s next attempt at a query optimizer, called StarifiedOpt, was a variation on StarOpt, and basically forced non-star queries to look like a star to the optimizer so that the company could reuse its existing StarOpt algorithms. As the company writes in the paper, this approach worked much better than the company could have hoped.
StarifiedOpt was, in the end, an interim fix, but it bought the company time to absorb the lessons learned from the first two development attempts, and develop a third optimizer, called V2Opt, which was much closer to the company’s idea of “optimal.”
V2Opt addresses the shortcomings of the previous two query optimizers. For starters, it’s completely aware of distribution, sortedness, and non-star schemas, HP Vertica says. As the default optimizer since version 4.0, it’s been implemented thousands of times and helped optimize hundreds of millions of queries.
But what really stands out about V2Opt is that it’s largely future proof and adaptable to big data challenges of the future. HP Vertica says it can change “key elements” of the optimizer without ripping out and rewriting lots of code. “V2Opts inherently-extensible design has already enabled us to incorporate knowledge gleaned from our end-user experiences, without significant engineering efforts,” the company writes.
V2Opt isn’t the perfect query optimizer. That, after all, is a theoretical thing that doesn’t exist in reality. However, thanks to a clever design, the product allows the engineers to continually evolve the query optimizer in the pursuit of perfection.
The adaptability of the product is evident in how it goes about its business. The product plans a query by first categorizing and classifying the physical properties associated with the query, such as column selectivity, projection column sort order, projection data segmentation, prejoin projection availability, and integrity constraint availability, the company says in its white paper.
The product then assigns a degree of importance for each physical property. It does this by experimenting on the customers’ real-life data sets and queries. The results of these experiments on physical properties are combined with a “cost-model pruning strategy” that assigns CPU and network transfer costs with the query plan.
This approach will be very valuable as new data sources are uncovered that affect the physical properties of the query. Instead of ripping out bits of the query optimizer and replacing them with new stuff, Vertica can simply add a new ranking module to V2Opt to adapt to the new data, eliminating the need to change the entire optimizer.