Streaming SQL for Real-Time Analytics
Streaming analytics is moving into the mainstream as more enterprises capitalize on the technology to gain greater efficiency, competitiveness, and profitability. In fact, the streaming analytics market will grow to $13.7 billion by 2021, at a compound annual growth rate of 34.8%, according to Markets and Markets.
Behind the demand are a number of emerging business uses for streaming analytics that deliver real-world benefits. These are scenarios where the value of an insight is very high shortly after it appears and then quickly diminishes over time. Stock trading, fraud detection, surveillance, patient monitoring, and traffic monitoring are just a few examples of such use cases.
At the same time, streaming analytics technology is becoming increasingly accessible through open source projects, such as Apache Storm and Apache Spark, and complex event processing (CEP) engines. Further facilitating the use of streaming analytics are streaming SQL languages that let developers capitalize on their SQL query experience to rapidly incorporate streaming analytics into their apps and services. No longer is it necessary to write code as was required with the first generation of streaming engines.
While similar to traditional SQL, streaming SQL is specifically designed to handle fundamental differences in how data streams are structured and managed. This article looks at how streaming SQL works, where users will see overlapping functionality with standard SQL, and how to take advantage of special operators for querying streaming data.
Streaming SQL for Querying Streaming Data
“Streaming SQL” refers to languages that enable users to write SQL-like queries on streaming data without having to write code. The platform handles data transfers and data parsing, and it provides common operators, such as joins, windows, and patterns, directly in the language.
Today, a number of solutions provide streaming SQL languages. Among these are five open source Apache projects: Apache Storm, Apache Flink, Apache Kafka, Apache Samza, and Apache Beam. They also include vendor solutions, such as the proprietary SQLstream Blaze and the open source WSO2 Stream Processor.
Before examining how to use streaming SQL, it is important to understand how SQL is mapped to streams. Basically, a stream is a table of data on the move. Think of it as a never-ending table where new data appears as time goes on. One record or row in a stream is called an event, but it has a schema and behaves just like a database row. So essentially, streaming SQL replaces tables with streams. To understand these ideas in more depth, Tyler Akidau’s talk at Strata is a great resource.
However, there is an important difference between SQL and streaming SQL. When you write a SQL query, you write it on data that is stored in a database right now. By contrast, when you write a streaming SQL query, you write it on the data that is present now as well as the data that will arrive in the future. Hence, streaming SQL queries never end. That might sound like a problem, but it works because the outputs of those queries are streams. Once an event is matched, it is placed in the output stream and becomes available right away. Although the queries do not end, they continuously produce results as streams.
Similarly, a stream represents all the events that can come through a logical channel, and it never ends. For example, if we have a temperature sensor in a boiler, we can represent the output from the sensor as a stream. Whereas classic SQL ingests data stored in a database table, processes it, and writes the results to a database table, a streaming SQL query ingests a stream of data as it arrives and produces a stream of data as output.
Taking the scenario further, let’s assume the boiler stream emits an event once every ten minutes. When we write a projection query on the boiler stream, it will create a new result stream that likewise produces an event once every ten minutes.
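As an illustrative sketch, a projection query over the boiler stream might look like the following. This uses a Siddhi-like streaming SQL dialect, and the stream names and attributes are assumed for illustration:

```sql
-- Assumed schema: each boiler event carries a sensor ID and a temperature reading.
define stream BoilerStream (sensorId string, temperature double);

-- Projection: pass through only the temperature attribute.
-- An event appears in TempStream each time an input event arrives.
from BoilerStream
select temperature
insert into TempStream;
```

Because the query never terminates, TempStream keeps producing events for as long as BoilerStream does.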
Streaming SQL languages have two kinds of operators. The first class of operators is the same as the SQL operators used in databases or in batch processing. These shared operators include:
- Projection, which lets us control what data goes into the output
- Filters, which let us select only some of the data
- Group by, which lets us calculate aggregates for groups
- Having, which lets us filter aggregated data
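To show these shared operators in a streaming setting, here is a sketch in a Siddhi-like dialect that combines a filter, a group-by aggregation, and a having clause. The stream names, attributes, window size, and thresholds are all assumed for illustration:

```sql
define stream BoilerStream (sensorId string, temperature double);

-- Filter out implausible readings, average per sensor over a one-minute window,
-- then keep only sensors whose average exceeds a threshold.
from BoilerStream[temperature > 0]#window.time(1 min)
select sensorId, avg(temperature) as avgTemp
group by sensorId
having avgTemp > 350.0
insert into HotSensorStream;
```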
The second class includes special operators that extend the power of SQL to handle streaming-specific use cases, and they support the concept of time. We will explore those in depth in the next section.
Streaming SQL Advanced Operators
Let’s consider three types of queries that go beyond standard SQL: Windows, Joins, and Patterns.
Windows. To start, we will assume that we want to calculate the moving average of the temperature readings from a stream. This is an operation used to understand how the data behaves over time, not just its value right now.
Streaming SQL introduces a Window operator that captures the idea of a window of data on top of the stream. Once grouped under the window, events can be analyzed with aggregate functions, such as average, count, etc. in classic SQL.
For example, stock analysis uses moving averages extensively. Multiple moving averages (e.g., 1 minute, 5 minutes, and 15 minutes) act as indicators of the direction of the trend. Furthermore, windows are used to remove noise, detect anomalies, and provide forecasts.
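A moving-average query in a Siddhi-like dialect might look like the following sketch, where a sliding time window groups the last five minutes of events (the stream name and attributes are assumed):

```sql
define stream BoilerStream (sensorId string, temperature double);

-- Sliding five-minute window; avg() is recomputed as events enter and leave the window.
from BoilerStream#window.time(5 min)
select avg(temperature) as movingAvgTemp
insert into AvgTempStream;
```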
Joins. Now, let’s assume that we need to detect and continuously receive the temperature difference based on data from two sensors emitted within 30 seconds of each other. This is non-trivial, as different sensors will produce events at different times. Before making decisions, we need to align events emitted at different times and match them.
Streaming SQL does this by adding a Join operator that supports windows. The query looks like a classic aggregate query; the difference is that streams are used instead of tables, and at least one of them must have a window. The window collects the events, and an “on” condition aligns them. Then, the resulting combined events can be formatted using the SQL “select” statement.
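A join along these lines could be sketched as follows in a Siddhi-like dialect. Here both sides have a window, and the “on” condition pairs readings taken within 30 seconds of each other; the stream names and the timestamp attribute are assumed:

```sql
define stream Sensor1Stream (ts long, temperature double);
define stream Sensor2Stream (ts long, temperature double);

-- Each window holds the last 30 seconds of events from its sensor;
-- the "on" condition aligns events whose timestamps are within 30 seconds (30000 ms).
from Sensor1Stream#window.time(30 sec) as s1
  join Sensor2Stream#window.time(30 sec) as s2
  on s1.ts - s2.ts <= 30000 and s2.ts - s1.ts <= 30000
select s1.temperature - s2.temperature as tempDifference
insert into TempDiffStream;
```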
Patterns. Finally, let’s assume that we want to continuously detect whether the temperature has changed more than 20% within any five-minute period. This query is impossible to do efficiently with standard SQL.
Here, the streaming SQL query handles the scenario using a Patterns operator for the “happened after” relationship, which is represented by the “->” symbol. It asks the streaming engine to begin by looking for the first condition, which is an event in “BoilerStream.” Once that is found, it starts looking for the second condition, which is an event with a temperature that has moved more than 20% from the first event’s temperature.
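The pattern described above could be sketched as follows in a Siddhi-like dialect (stream and attribute names are assumed, and the filter expression is one illustrative way to express “moved more than 20% in either direction”):

```sql
define stream BoilerStream (temperature double);

-- e1 matches any event; e2 must arrive later (the "->" relationship) with a
-- temperature more than 20% above or below e1's, within five minutes of e1.
from every e1=BoilerStream
  -> e2=BoilerStream[temperature > e1.temperature * 1.2
                     or temperature < e1.temperature * 0.8]
  within 5 min
select e1.temperature as initialTemp, e2.temperature as finalTemp
insert into TempAlertStream;
```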
With its ability to query things over time, the Patterns operator is very powerful. It can be used to detect the rise or fall of a value, a peak or a trough, and chart patterns (e.g., for stock markets). Additionally, it can be used to identify a behavior signature (e.g., walking or running) and check for the completeness of a task (e.g., does the reply to an email come within an hour?).
The article, Stream Processing 101: From SQL to Streaming SQL, provides a detailed discussion of different types of Window, Join, and Patterns operators, as well as how they work.
Increasingly, businesses are relying on streaming analytics to support real-time or near-real-time decision-making. With streaming SQL, users have a powerful language for querying data in motion that builds on the familiar concepts of standard SQL for processing data at rest—and provides advanced operators for accomplishing so much more.
I am confident you will be pleasantly surprised by the innovative applications that streaming SQL enables you to create. To learn more, I encourage you to check out the article, 13 Stream Processing Patterns for building Streaming and Realtime Applications, which describes the patterns and use cases that are possible with streaming SQL.
About the Author: Srinath Perera is the vice president of research in the CTO office at WSO2. He is a scientist, software architect, and programmer who has authored two books about MapReduce and frequently writes technical articles. Srinath is also a member of the Apache Software Foundation and a key architect behind several widely used projects, such as Apache Axis2 and WSO2’s complex event processing technology. Additionally, he serves as a research scientist at the Lanka Software Foundation and teaches as a visiting faculty member at the University of Moratuwa.