Amazon Tames Big Fast Data with Kinesis Pipe
Amazon Web Services added another engine to its big data powerhouse this week when it unveiled Kinesis for real-time streaming data. Kinesis allows users to create new apps that analyze high-throughput data streams, such as log files, financial transactions, and click-stream data, at rates of more than 100 TB per hour.
There are all types of fast-moving data streams that organizations would like to tap for actionable insight. But getting a handle on data streams–such as stock tickers, social media feeds, geospatial data, results from massive multi-player games, inventory levels, and any machine data from the Internet of Things–is easier said than done.
Enter Kinesis, which sits upon AWS’ EC2 cloud and allows users to quickly start analyzing their streaming data with just a few clicks of the mouse. Instead of sending these data streams off to some server where they may never be analyzed, Kinesis is designed to make it easy to pipe in big data streams to AWS servers, analyze the data, and then discard the digital waste byproduct, or recycle it into brand new streams.
Amazon envisions all sorts of uses for Kinesis. In an e-commerce setting, Kinesis can be used to generate product recommendations based on Web clickstreams generated by mobile users. Financial services companies can use it to ingest stock ticker information, which can be used to refactor financial models on a near-continuous basis. A manufacturer could use it to monitor inventory data, and generate alerts when inventories get too low. Kinesis can be used to mine millions of Tweets to identify patterns or trends, or used with Facebook’s social graph for purposes of consumer sentiment analysis.
Amazon’s rendition of what “big data pipes” might look like.
A streaming data analysis service such as Kinesis has many potential uses, particularly in the areas of machine-generated data and machine learning. This need for real-time processing is driving strong interest in products such as Splunk and Apache Storm, which is often deployed alongside Hadoop. What Amazon brings to the table is the capability to spin up a real-time stream processing system without the need to actually build or deploy any infrastructure. It’s a powerful concept, and a great real-time complement to Amazon’s batch-oriented Hadoop offering, Elastic MapReduce.
Users can get started with Kinesis by provisioning a new data stream from their AWS web management console. Alternatively, a data stream can be provisioned by using the Kinesis API or SDK. Amazon provides client libraries to allow developers to integrate Kinesis data processing into their Java applications.
Amazon CTO Werner Vogels introducing Kinesis during a keynote speech at the AWS re:Invent conference yesterday.
From a user standpoint, Kinesis operates on data streams in terms of shards. According to Amazon, each shard ingests data, using Kinesis’ HTTP-based PutRecord function, in blocks of 1,000 write transactions, at rates up to 1 MB per second. Conversely, each shard egresses data, using the GetNextRecords function, in blocks of 20 read transactions at rates up to 2 MB per second. Users can scale their shards, or blocks, up or down on the fly, without restarting the stream or impacting the data sources pushing data into Kinesis, Amazon says.
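To make the per-shard limits concrete, here is a minimal capacity-planning sketch in Python. It is not part of any Amazon library: the function name `shards_needed` is hypothetical, and it assumes the 1,000 write transactions quoted above are a per-second limit, which the article does not state explicitly.

```python
import math

# Per-shard limits as quoted in the article (Amazon's figures for the preview).
INGEST_MB_PER_SEC = 1.0     # write throughput per shard
INGEST_TXN_PER_SEC = 1000   # write transactions per shard (ASSUMED to be per second)
EGRESS_MB_PER_SEC = 2.0     # read throughput per shard


def shards_needed(write_mb_per_sec, writes_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits.

    Hypothetical helper for illustration; the binding constraint is
    whichever limit demands the most shards.
    """
    by_write_bytes = math.ceil(write_mb_per_sec / INGEST_MB_PER_SEC)
    by_write_count = math.ceil(writes_per_sec / INGEST_TXN_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / EGRESS_MB_PER_SEC)
    # A stream always needs at least one shard.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


if __name__ == "__main__":
    # 5 MB/s written as 3,000 records/s, with 12 MB/s read by consumers.
    print(shards_needed(5, 3000, 12))
```

In this example the read side is the bottleneck: 12 MB per second of egress at 2 MB per second per shard calls for six shards, even though the write side would fit in five.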
Kinesis limits a user to analyzing data streams from the past 24 hours. This trailing 24-hour window should give the user enough time to extract the useful bits of information. Users who want to hang onto the data beyond the 24-hour window can move it to another Kinesis stream, or to other AWS offerings, including S3, DynamoDB, or Redshift, each of which is pre-integrated with Kinesis.
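The trailing window can be illustrated with a small Python sketch that picks out records about to age out, so they could be copied elsewhere first. This is only a local model of the idea, not Kinesis code: the `(timestamp, payload)` record shape, the `expiring_soon` helper, and the one-hour safety margin are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=24)  # trailing window described in the article


def expiring_soon(records, now, margin=timedelta(hours=1)):
    """Return records that will leave the 24-hour window within `margin`.

    `records` is a list of (timestamp, payload) tuples; this shape is an
    assumption for illustration only. Records already older than the
    window are treated as gone and are not returned.
    """
    oldest = now - RETENTION             # anything older has already expired
    cutoff = now - (RETENTION - margin)  # anything older than this expires soon
    return [r for r in records if oldest <= r[0] <= cutoff]


if __name__ == "__main__":
    now = datetime(2013, 11, 15, 12, 0, tzinfo=timezone.utc)
    records = [
        (now - timedelta(hours=23, minutes=30), "archive me"),
        (now - timedelta(hours=2), "still fresh"),
        (now - timedelta(hours=25), "already expired"),
    ]
    for stamp, payload in expiring_soon(records, now):
        print(stamp.isoformat(), payload)
```

A job built along these lines would run more often than the margin, so that every record is seen at least once before it leaves the window.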
Amazon charges for Kinesis based on the number of PUTs and for each shard of throughput capacity. The company charges $0.028 for each 1 million PUT transactions, and $0.015 per hour for the sharding capacity. Since Kinesis runs inside of AWS EC2, a user must pay for their EC2 capacity as well.
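As a rough worked example of the pricing above, the following Python sketch adds up the two Kinesis charges. The function name `estimate_cost` is hypothetical, the hourly rate is assumed to apply per shard, and EC2 costs are left out because the article gives no figures for them.

```python
PUT_PRICE_PER_MILLION = 0.028  # USD per 1 million PUT transactions (from the article)
SHARD_PRICE_PER_HOUR = 0.015   # USD per hour, ASSUMED to be charged per shard


def estimate_cost(puts, shards, hours):
    """Estimate Kinesis charges in USD, excluding any EC2 capacity."""
    put_cost = puts / 1_000_000 * PUT_PRICE_PER_MILLION
    shard_cost = shards * hours * SHARD_PRICE_PER_HOUR
    return put_cost + shard_cost


if __name__ == "__main__":
    # One shard receiving 1,000 PUTs per second for a full day.
    puts_per_day = 1000 * 60 * 60 * 24  # 86.4 million PUTs
    print(round(estimate_cost(puts_per_day, shards=1, hours=24), 4))
```

Under these assumptions, a single fully loaded shard costs about $2.42 per day in PUT charges plus $0.36 in shard-hours, so the PUT volume dominates the bill.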
Kinesis is currently in limited preview. The new offering was unveiled yesterday during Amazon CTO Werner Vogels’ keynote address at the AWS re:Invent conference in Las Vegas. “This is an amazing new service where we can build tremendously innovative real time applications,” Vogels said.