SQLstream Analyzes Data On the Flow
In the world of analytics there’s no time like the present, which is why there’s such a big push to retrofit Hadoop as a real-time system. But there are other approaches, including the one taken by the San Francisco software house SQLstream, which uses SQL to query big data as it flows through the memory chips of cheap, commodity servers.
The approach that SQLstream takes with its analytics products is deceptively simple. As data streams in, SQL queries are run continuously against it, generating an uninterrupted flow of answers for whatever questions it’s been programmed to ask.
SQLstream founder and CEO Damian Black calls it big data stream processing. “We’re there to feed and enhance your core data systems, and provide continuous analytics, continuous cleaning and validation, continuous alerting and alarming,” he says. “We’re turning the raw data into streams of data that you may then want to store in some of your other big data systems. Or you may just want to interpret now because you may not need to store all the information.”
The data flowing into SQLstream is your typical semi-structured data, such as clickstream data, log files, point of sale (POS) transactions, telephone call logs–the same type of data that many customers put in Hadoop. But instead of assembling a giant cluster of servers and then running batch jobs on it to analyze huge sets of data (i.e., the standard Hadoop/MapReduce way of doing things), a small server will do for SQLstream, and answers are generated continuously.
Black uses an example from the telecommunications industry to demonstrate the advantage of his approach to big data. Say the CIO of a telecommunications firm wants some data about the state of operations. He wants to know how many telephone calls are active at any point in time, what the average call length is, and how many Internet connections are open. This information may not be useful by itself, but it is quite valuable for filling in the bigger picture.
|A screenshot of the new SQLstream s-Visualizer tool.|
“Say you have a million records per second coming in. So a record is generated anytime someone clicks a browser or makes a telephone call,” Black tells Datanami in a phone conversation. “In a database world, if you want to have a real-time average, you basically have to run a query that will aggregate all of the numbers. It will count the records, and divide the sum by the count. It may have to process a billion records, if it’s done in main memory.
“However, that query will be launched a million times per second,” Black continues. “So you have a million times a billion–a thousand trillion operations per second. Even with the fastest in-memory database, it’s just not viable to take that approach, at least not for any finite amount of money; whereas, we can run that kind of query on a continuous basis on a four-core commodity server. The reason we can do that without skipping a beat is that the queries we’re running are running continuously over the live data.”
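Black's arithmetic, and the alternative he describes, can be sketched in a few lines of Python. This is only an illustration of the general idea of incremental aggregation, not SQLstream's actual implementation; the class name, function name, and sample call lengths are hypothetical.

```python
# Black's figures: a million queries per second, each scanning a billion records.
naive_ops_per_second = 1_000_000 * 1_000_000_000  # 10**15, "a thousand trillion"


class RunningAverage:
    """Maintains an average incrementally, one record at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Constant work per record: earlier records are never rescanned.
        self.count += 1
        self.total += value
        return self.total / self.count


def recompute_average(values):
    # The database-style alternative: touch every stored record per query.
    return sum(values) / len(values)


call_lengths = [120.0, 30.0, 90.0]  # hypothetical call durations in seconds
avg = RunningAverage()
streamed = [avg.update(v) for v in call_lengths]
```

Each call to `update` yields a fresh answer after a fixed amount of work, whereas `recompute_average` does work proportional to everything stored so far. Both arrive at the same final value.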
This type of workload isn’t suited for SAP HANA or Oracle Exa- products, Black says. It’s not too big for Hadoop, which will eventually get you the answer you’re looking for. But by then, it will be too late to matter. SQLstream’s motto, “query the future,” is slightly cutesy, because, obviously, nobody knows what will happen in the future, but it shows you how the company is tackling the problem of how best to analyze streaming data.
“To be fair, we’re not solving the same problem as Hadoop or in-memory databases, because we’re querying the future continuously,” Black says. “If we want to store all the records of information, to do data mining or post hoc analysis, then we’d stream out the set of results into Hadoop, and then you’re crunching the data in Hadoop, maybe to fine tune your predictive algorithms.”
The notion that SQLstream can query wild data on the hoof is not incorrect. The company is not instantiating the data, as one would do when it’s placed in a standard relational data store. “Normally to make these things tractable, there will be windows of time or numbers of records involved. So it will join two streams together over a rolling five milliseconds, five minutes, five hours, five days, or five months,” Black says.
The advantage of this approach is that, after the period of time has elapsed, the data is simply discarded, making way for fresher data–better data–to be loaded into the SQLstream analysis pipeline.
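A rolling window of this kind can be sketched as follows. This is a minimal illustration that assumes records arrive in timestamp order and that the window length is positive; `SlidingWindowAverage` and the sample readings are hypothetical and do not reflect SQLstream's internals.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the records seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.records = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        self.records.append((timestamp, value))
        self.total += value
        # Discard records that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.records and self.records[0][0] <= cutoff:
            _, old_value = self.records.popleft()
            self.total -= old_value
        return self.total / len(self.records)


window = SlidingWindowAverage(window_seconds=300)  # a rolling five minutes
readings = [(0, 10.0), (100, 20.0), (400, 30.0)]   # hypothetical (seconds, value)
results = [window.add(t, v) for t, v in readings]
```

By the time the third reading arrives, the first two have fallen outside the five-minute window and are dropped, so memory use tracks the window size rather than the total volume of data seen.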
SQLstream is often used to keep a running tally of events for a certain type of data. Things get interesting when a user stacks several of these computations together, say by feeding the results of one real-time query into a second query, and so forth. The fact that SQLstream doesn’t store the data for any length of time means that data doesn’t have to fit any predefined schemas, giving it flexibility, Black says.
“We can create any new output on any new schemas on the fly, and they can co-exist with existing ones, and stream out multiple formats of information,” he says. “At the same time, because we’re not storing the data, we don’t have those problems or pain points that other technologies have. All we have to be able to do is process the information, parse it to get the data we need, and stream out a format of data that can be used by other programs or people.”
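Stacking computations in this way resembles chaining generators in Python, where each stage consumes the previous stage's results as they are produced. The sketch below is only an analogy under stated assumptions: the stage names, the threshold, and the sample events are hypothetical, and events are assumed to arrive in minute order.

```python
def calls_per_minute(events):
    """Stage 1: turn (minute, caller) events into (minute, count) results."""
    current_minute, count = None, 0
    for minute, _caller in events:
        if minute != current_minute:
            if current_minute is not None:
                yield current_minute, count
            current_minute, count = minute, 0
        count += 1
    if current_minute is not None:
        yield current_minute, count


def spikes(counts, threshold):
    """Stage 2: consume stage 1's output and keep only the busy minutes."""
    for minute, count in counts:
        if count > threshold:
            yield minute, count


events = [(0, "a"), (0, "b"), (1, "c"), (2, "d"), (2, "e"), (2, "f")]
alerts = list(spikes(calls_per_minute(events), threshold=2))
```

Neither stage holds the full event history. The second stage sees only the tallies the first one emits.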
|Mozilla built the Firefox Glow download visualization using a combination of SQLstream and HBase technology.|
HBase is particularly well-suited for providing additional processing of data that’s been through one or two stages of refinement in SQLstream, Black says. “HBase is good for enhancing the stream,” he says. “Imagine if we wanted to process telephone numbers, and we wanted to see who this phone number belongs to, or which part of the world this IP address is coming from. In Hadoop, that would require you to traverse and search through the records in a MapReduce style, unless you’re using the latest release of Cloudera, which has a separate search application. But HBase allows you to do key-value lookups, so it’s much faster.”
Mozilla actually used SQLstream in combination with HBase to display a visualization of all Firefox downloads as they occur in real time. The downloads are captured in SQLstream, and the IP addresses are handed off to HBase to generate longitude and latitude coordinates, which are then displayed in a Web browser. You can see it live at Mozilla’s website.
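The enrichment step can be pictured as a key lookup applied to each event as it passes through. In the sketch below a plain Python dictionary stands in for the key-value store; the table contents, field names, and `enrich` function are hypothetical and say nothing about how Mozilla's pipeline was actually built.

```python
# Hypothetical lookup table keyed by IP address. A real deployment would
# hold this in a key-value store such as HBase rather than a Python dict.
GEO_BY_IP = {
    "203.0.113.7": (37.77, -122.42),
    "198.51.100.4": (51.51, -0.13),
}


def enrich(downloads, geo_table):
    """Attach (latitude, longitude) to each download event by key lookup."""
    for event in downloads:
        coords = geo_table.get(event["ip"])  # one lookup, no scan of records
        yield {**event, "coords": coords}


downloads = [{"ip": "203.0.113.7"}, {"ip": "192.0.2.1"}]
enriched = list(enrich(downloads, GEO_BY_IP))
```

An address missing from the table comes back with `None` for its coordinates instead of stalling the stream, which is one plausible way to handle unknown keys.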
This week SQLstream unveiled SQLstream s-Visualizer, a tool for building live dashboards over streaming data. The software allows users to build customized dashboards in drag and drop fashion.
SQLstream is still ramping up its business. It’s approaching 30 customers, and has been granted six patents for its technology.