October 9, 2013

SQLstream Analyzes Data On the Flow

Alex Woodie

In the world of analytics there’s no time like the present, which is why there’s such a big push to retrofit Hadoop as a real-time system. But there are other approaches, including the one taken by the San Francisco software house SQLstream, which uses SQL to query big data as it flows through the memory chips of cheap, commodity servers.

The approach that SQLstream takes with its analytics products is deceptively simple. As data streams in, SQL queries are run continuously against it, generating an uninterrupted flow of answers for whatever questions it’s been programmed to ask.

SQLstream founder and CEO Damian Black calls it big data stream processing. “We’re there to feed and enhance your core data systems, and provide continuous analytics, continuous cleaning and validation, continuous alerting and alarming,” he says. “We’re turning the raw data into streams of data that you may then want to store in some of your other big data systems. Or you may just want to interpret now because you may not need to store all the information.”

The data flowing into SQLstream is typical semi-structured data, such as clickstream data, log files, point of sale (POS) transactions, and telephone call logs, the same type of data that many customers put in Hadoop. But instead of assembling a giant cluster of servers and then running batch jobs on it to analyze huge sets of data (i.e., the standard Hadoop/MapReduce way of doing things), a small server will do for SQLstream, and answers are generated continuously.

Black uses an example from the telecommunications industry to demonstrate the advantage of his approach to big data. Say the CIO of a telecommunications firm wants some data about the state of operations. He wants to know how many telephone calls are active at any point in time, what the average call length is, and how many open Internet connections there are. This information may not be useful by itself, but is quite valuable for filling in the bigger picture.


A screenshot of the new SQLstream s-Visualizer tool.

“Say you have a million records per second coming in. So a record is generated anytime someone clicks a browser or makes a telephone call,” Black tells Datanami in a phone conversation. “In a database world, if you want to have a real-time average, you basically have to run a query that will aggregate all of the numbers. It will count the records, and divide the sum by the count. It may have to process a billion records, if it’s done in main memory.

“However, that query will be launched a million times per second,” Black continues. “So you have a million times a billion, or a thousand trillion operations per second. Even with the fastest in-memory database, it’s just not viable to take that approach, at least not for any finite amount of money; whereas, we can run that kind of query on a continuous basis on a four-core commodity server. The reason we can do that without skipping a beat is that the queries we’re running are running continuously over the live data.”
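The arithmetic behind Black's comparison can be illustrated with a small sketch. It is not SQLstream's implementation; the class name `RunningAverage` and the sample call durations are hypothetical. It only shows why updating a sum and a count costs constant work per record, while re-aggregating the full set on every arrival costs work proportional to the number of records held.

```python
class RunningAverage:
    """Keeps an average current with constant work per incoming record."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # One addition, one increment, one division per record.
        self.total += value
        self.count += 1
        return self.total / self.count


def recomputed_average(values):
    """The database-style approach: re-scan every stored record per query."""
    return sum(values) / len(values)


# Hypothetical call durations in seconds, arriving one at a time.
call_seconds = [120, 45, 300, 15]

avg = RunningAverage()
incremental = [avg.update(v) for v in call_seconds]
batch = [recomputed_average(call_seconds[: i + 1]) for i in range(len(call_seconds))]

# The figures quoted in the article: a million queries per second,
# each scanning a billion records.
arrivals_per_second = 10**6
records_scanned_per_query = 10**9
operations_per_second = arrivals_per_second * records_scanned_per_query
```

Both approaches produce the same averages here; the difference is that the incremental version never revisits a record it has already seen.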

This type of workload isn’t suited for SAP HANA or Oracle Exa- products, Black says. It’s not too big for Hadoop, which will eventually get you the answer you’re looking for. But by then, it will be too late to matter. SQLstream’s motto, “query the future,” is slightly cutesy, because, obviously, nobody knows what will happen in the future, but it shows you how the company is tackling the problem of how best to analyze streaming data.

“To be fair, we’re not solving the same problem as Hadoop or in-memory databases, because we’re querying the future continuously,” Black says. “If we want to store all the records of information, to do data mining or post hoc analysis, then we’d stream out the set of results into Hadoop, and then you’re crunching the data in Hadoop, maybe to fine tune your predictive algorithms.”

The notion that SQLstream can query wild data on the hoof is accurate. The company is not instantiating the data, as one would do when it’s placed in a standard relational data store. “Normally to make these things tractable, there will be windows of time or numbers of records involved. So it will join two streams together over a rolling five milliseconds, five minutes, five hours, five days, or five months,” Black says.

The advantage of this approach is that, after the period of time has elapsed, the data is simply discarded, making way for fresher data–better data–to be loaded into the SQLstream analysis pipeline.
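A rolling window that discards expired records might look roughly like the sketch below. It is an illustration of the general technique only, not SQLstream's engine; `SlidingWindowCount`, the five-second window, and the timestamps are all hypothetical, and the sketch assumes records arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCount:
    """Counts the records seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Admit the new record, then discard everything that has aged
        # out of the window. Assumes timestamps never go backwards.
        self.timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


# Hypothetical arrival times in seconds.
window = SlidingWindowCount(window_seconds=5)
counts = [window.add(t) for t in [0, 1, 2, 6, 7, 12]]
```

Memory use is bounded by the number of records that fit in one window, regardless of how long the stream runs.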

SQLstream is often used to keep a running tally of events for a certain type of data. Things get interesting when a user stacks several of these computations together, say by feeding the results of one real-time query into a second query, and so forth. The fact that SQLstream doesn’t store the data for any length of time means that data doesn’t have to fit any predefined schemas, giving it flexibility, Black says.
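Stacking one continuous computation on top of another can be sketched with Python generators. This is a loose analogy rather than SQLstream's SQL syntax; the function names, event labels, and threshold are hypothetical.

```python
def running_tally(events):
    """First stage: emit (key, count so far) for each incoming event."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


def alerts(tallies, threshold):
    """Second stage: consume the first stage's output and flag the
    moment a key's tally first reaches the threshold."""
    for key, count in tallies:
        if count == threshold:
            yield f"{key} reached {threshold} events"


# Hypothetical event stream; in practice this would be unbounded.
events = ["dropped_call", "login", "dropped_call", "dropped_call", "login"]
messages = list(alerts(running_tally(iter(events)), threshold=3))
```

Each stage pulls one record at a time from the stage before it, so nothing has to be stored between stages beyond the running counts.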

“We can create any new output on any new schemas on the fly, and they can co-exist with existing ones, and stream out multiple formats of information,” he says. “At the same time, because we’re not storing the data, we don’t have those problems or pain points that other technologies have. All we have to be able to do is process the information, parse it to get the data we need, and stream out a format of data that can be used by other programs or people.”


Mozilla built the Firefox Glow download visualization using a combination of SQLstream and HBase technology.

HBase is particularly well-suited for providing additional processing of data that’s been through one or two stages of refinement in SQLstream, Black says. “HBase is good for enhancing the stream,” he says. “Imagine if we wanted to process telephone numbers, and we wanted to see who this phone number belongs to, or which part of the world this IP address is coming from. In Hadoop, that would require you to traverse and search through the records in a MapReduce style, unless you’re using the latest release of Cloudera, which has a separate search application. But HBase allows you to do key-value lookups, so it’s much faster.”
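The difference between scanning records and doing a key-value lookup can be shown in miniature. The sketch below uses a plain Python dict as a stand-in for a key-value store; it does not use any HBase API, and the phone numbers, names, and function names are made up for illustration.

```python
# Hypothetical subscriber records: (phone number, owner).
subscribers = [
    ("415-555-0100", "Alice"),
    ("415-555-0101", "Bob"),
    ("415-555-0102", "Carol"),
]


def scan_lookup(records, number):
    """Scan-style search: examine records one by one until a match."""
    for key, owner in records:
        if key == number:
            return owner
    return None


# Key-value style: build the index once, then each lookup is a
# single hash probe instead of a pass over the records.
subscriber_index = dict(subscribers)


def enrich(stream, index):
    """Attach the owner to each phone number flowing through the stream."""
    for number in stream:
        yield number, index.get(number, "unknown")


incoming = ["415-555-0102", "415-555-0199"]
enriched = list(enrich(incoming, subscriber_index))
```

Both lookups return the same owner for a known number; the indexed version simply avoids walking the whole record set for every item in the stream.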

Mozilla actually used SQLstream in combination with HBase to display a visualization of all Firefox downloads as they occur in real time. The downloads are captured in SQLstream, and the IP addresses are handed off to HBase to generate longitude and latitude coordinates, which are then displayed in a Web browser. You can see it live at Mozilla’s website.

This week SQLstream unveiled SQLstream s-Visualizer, a tool for building live dashboards over streaming data. The software allows users to build customized dashboards in drag and drop fashion.

SQLstream is still ramping up its business. It’s approaching 30 customers, and has been granted six patents for its technology.

