AWS Debuts PartiQL for Query Agnosticism
Amazon Web Services last week debuted PartiQL, a new query language that’s agnostic to the type of database, format, or model that the underlying data is stored in. The language, which is based on SQL++, can be used to query data in relational databases as well as less structured systems, such as NoSQL databases and data lakes. AWS has released the new software as open source under the Apache 2 license.
AWS announced the new query language on August 1 in a blog post authored by four individuals, including Dr. Yannis Papakonstantinou, who oversaw the creation of SQL++ by his computer science graduate students at the University of California San Diego, and who is currently a senior principal scientist for AWS.
“The root of the problem is that data is typically spread across a combination of relational databases, non-relational data stores, and data lakes,” the AWS authors write. “Some data may be highly structured and stored in SQL databases or data warehouses. Other data may be stored in NoSQL engines, including key-value stores, graph databases, ledger databases, or time-series databases. Data may also reside in the data lake, stored in formats that may lack schema, or may involve nesting or multiple values (e.g., Parquet, JSON).”
Each data type and storage repository suits a particular use case, and often comes with its own query language. That may serve an individual project well, but in the long run, it results in a hodge-podge of different data storage mediums and associated query languages. At that point, any changes that the user may want to make – like changing data formats, using a different database engine, or even moving the data – will often require the user to make substantive changes to the application and the queries.
Instead of making those laborious application changes, AWS presents another solution in the form of PartiQL. Once PartiQL support is embedded in a query engine, users can query data the same way whether it is stored in a relational database or data warehouse, in a semi-structured and nested format such as files in an Amazon S3 data lake, or even in a schema-less NoSQL database.
AWS built PartiQL in part to solve the data query challenges faced by the retail side of the house (you may have heard of it). “Amazon’s retail business already had vast sets of semi-structured data, most often in the Ion format,” the AWS authors wrote. Chris Suver, a distinguished engineer at Amazon, wanted a SQL-like language that could be used across a multitude of different data stores, including Ion, which is a rich, hierarchical data format based on JSON.
Other AWS properties also expressed a desire for more flexible data querying, including the Redshift data warehousing team and Amazon Quantum Ledger Database (Amazon QLDB), which presents a centralized, ledger-based system for tracking changes to data.
“We therefore set out to create a language that offers strict SQL compatibility, achieves nested and semi-structured processing with minimal extensions, treats nested data as a first-class citizen, allows optional schema, and is independent of physical formats and data stores,” the AWS authors wrote.
Seeing that Papakonstantinou’s grad students had already checked some of these boxes with SQL++, the AWS team decided to use it as the basis for PartiQL. According to the AWS team, PartiQL achieved its design tenets, including:
- SQL compatibility;
- First-class support for nested data;
- Optional support for schemas;
- Minimal use of extensions;
- Data format independence;
- Data store independence.
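To make the nested-data tenet more concrete, here is a minimal sketch of what querying a nested collection looks like. The sample data, field names, and query text are hypothetical and written in the style of PartiQL; the Python code only mimics the semantics of that query with plain lists and dictionaries and does not call any PartiQL implementation.

```python
# Illustrative sketch only: the data, field names, and query are hypothetical.
# The Python below mimics the query's semantics; it does not run PartiQL.

# A PartiQL-style query that ranges over a nested collection (e.projects)
# in the FROM clause, much as it would over a second table:
QUERY = """
SELECT e.name AS employeeName, p.name AS projectName
FROM employees AS e, e.projects AS p
WHERE p.name LIKE '%security%'
"""

# Semi-structured input: each employee carries a nested list of projects.
employees = [
    {"name": "Bob Smith",
     "projects": [{"name": "AWS Redshift security"}, {"name": "AWS Aurora"}]},
    {"name": "Susan Smith", "projects": []},
    {"name": "Jane Smith",
     "projects": [{"name": "AWS Redshift security"}]},
]


def employees_on_security_projects(rows):
    """Unnest each employee's projects and keep those mentioning 'security'."""
    return [
        {"employeeName": e["name"], "projectName": p["name"]}
        for e in rows                      # FROM employees AS e
        for p in e.get("projects", [])     # , e.projects AS p (unnesting)
        if "security" in p["name"]         # WHERE p.name LIKE '%security%'
    ]


result = employees_on_security_projects(employees)
for row in result:
    print(row)
```

The point of the sketch is that the nested `projects` list is iterated in the same clause as the top-level collection, so no separate flattening step or second query language is needed.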
According to the authors, PartiQL is already being used in the AWS cloud, including with Amazon S3 Select, Glacier Select, Redshift Spectrum, QLDB, and internal Amazon systems. It’s also being used by EMR to push data to S3 Select, the company says.
PartiQL is already supported by at least one vendor in the outside (i.e., non-AWS) world: Couchbase, developer of a NoSQL database that stores data in a variant of JSON. Couchbase delivered a new Analytics Service last fall that is based on SQL++ and gives users the capability to use SQL constructs to query JSON data stored in Couchbase Server.
PartiQL has been endorsed by Don Chamberlin, who is one of the creators of SQL and also an advisor to Couchbase.
“The SQL++ proposal of Dr. Yannis Papakonstantinou, and languages based on SQL++ such as PartiQL, have shown that the extensions to SQL needed for querying semistructured data are fairly minimal,” Chamberlin wrote in the AWS blog piece. “I hope that these small language extensions will help to facilitate a new generation of applications that process data in JSON and other flexible formats, with and without predefined schemas.”
One of the AWS customers looking forward to using PartiQL is Yelp, the customer review site. “PartiQL addresses the critical missing piece in a poly-store environment — a high-level declarative language that works across multiple domain-specific data stores,” writes Yelp Software Engineer Steven Moy in the blog post.