The SQL Service at the Heart of Amazon Kinesis Analytics
Amazon Web Services gave real-time application developers a boost last week when it unveiled Amazon Kinesis Analytics, a new service for continuously querying streaming data using good old SQL. The service hooks into the entire Kinesis suite available from AWS, but the core underlying technology was developed by a third-party vendor you might not be familiar with.
The rise of fast-moving data streams such as those flowing through the Internet of Things (IoT) is putting pressure on companies to do something useful with the data before it becomes stale. Batch-oriented systems such as Hadoop have proven inadequate when low latency and interactive-level response times are needed. That has spurred the rise of new streaming frameworks, such as Spark Streaming, Storm, Flink, Kafka Streams, Beam, and Concord.
While the real-time streaming data frameworks can be powerful, one of the challenges they pose to would-be users is the need to learn new languages, or at least new APIs. Organizations that are accustomed to querying data using SQL have few options available to them. Some of these options include the new Structured Streaming technique in Spark 2.0, which builds on the Spark SQL API, as well as NewSQL databases such as those available from MemSQL and VoltDB.
Now AWS is giving users another SQL option with Kinesis Analytics. According to the vendor, which launched the new service last Friday, continuously querying streaming data in real time is as simple as writing SQL queries.
“With the addition of Amazon Kinesis Analytics, we’ve expanded what’s already the broadest portfolio of analytics services available and made it easy to use SQL to do analytics on real-time streaming data so that customers can deliver actionable insights to their business faster than ever before,” says Roger Barga, AWS’ general manager of Amazon Kinesis.
Getting started with Kinesis Analytics is simple. Customers go to their AWS management console and select a data stream from Kinesis Streams or Kinesis Firehose. The service automatically recognizes standard data formats and suggests a schema to the user. The user then composes SQL queries in an online editor, perhaps using pre-built templates, and chooses where the results will be sent. Kinesis does the rest, including continuously querying the data and scaling the whole service to match volumes and latency requirements.
The approach sounds worthwhile, particularly for fast-moving data streams such as clickstreams, log files, and data flowing from connected devices over the IoT. What might be surprising is that Kinesis Analytics is based on technology that AWS is OEMing from San Francisco-based SQLstream.
In a 2013 interview with Datanami, SQLstream CEO Damian Black described the deceptively simple approach the company takes by using the analysis of call detail records (CDRs) from a telecommunications firm as an example.
“Say you have a million records per second coming in. So a record is generated anytime someone clicks a browser or makes a telephone call,” Black said. “In a database world, if you want to have a real-time average, you basically have to run a query that will aggregate all of the numbers. It will count the records, and divide the sum by the count. It may have to process a billion records, if it’s done in main memory.
“However, that query will be launched a million times per second,” Black continued. “So you have a million times a billion–a thousand trillion operations per second. Even with the fastest in-memory database, it’s just not viable to take that approach, at least not for any finite amount of money; whereas, we can run that kind of query on a continuous basis on a four-core commodity server. The reason we can do that without skipping a beat is that the queries we’re running are running continuously over the live data.”
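Black's arithmetic works out to 10^6 queries per second times 10^9 records per query, or 10^15 operations per second. The contrast he draws can be illustrated with a minimal Python sketch. It is not SQLstream's or AWS's implementation; the class name, the helper function, the window size, and the call durations are all hypothetical, chosen only to show how a continuously maintained average does a constant amount of work per record while a re-run query rescans the whole window each time.

```python
from collections import deque


class SlidingAverage:
    """Average of the most recent `window` values, updated per record.

    Hypothetical illustration only, not any vendor's actual engine.
    """

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.values = deque()  # records currently inside the window
        self.total = 0.0       # running sum of those records

    def update(self, value: float) -> float:
        # Constant work per record: add the newcomer, evict the oldest.
        self.values.append(value)
        self.total += value
        if len(self.values) > self.window:
            self.total -= self.values.popleft()
        return self.total / len(self.values)


def recompute_average(history, window):
    # Store-then-query style: rescan the whole window on every call.
    recent = history[-window:]
    return sum(recent) / len(recent)


# Made-up call durations in seconds, standing in for CDR fields.
durations = [30, 60, 90, 120, 150]

avg = SlidingAverage(window=3)
incremental = [avg.update(d) for d in durations]
batch = [recompute_average(durations[: i + 1], 3) for i in range(len(durations))]

print(incremental)
print(batch)
```

Both lists hold the same averages, but the second approach touches every record in the window for each new arrival, which is the cost that grows to a billion records per query in Black's example.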
As big data gets bigger and fast data gets faster, getting actionable insights out of raw data will become harder and harder. While the analytic framework approach taken by Spark, Storm, and the others will gain steam, it’s clear there’s room in the conversation for other approaches, notably those that can give new life to old technologies, like ANSI SQL.
“The store-before-query analytics and conventional ETL models are irrelevant in a world where streaming analytics can empower businesses to take the next right action, continuously and real time,” Black said last week in a statement. “The combination of Amazon Kinesis Analytics and SQLstream Blaze makes it easier than ever for businesses to securely and cost-effectively ingest, analyze, and manage streaming data on and between public cloud, private cloud, and on premises.”