Hitching a Big Data Ride on Amazon’s Cloud
Amazon Web Services generated some headlines last week at its AWS re:Invent conference, where it launched Kinesis, a big data streaming engine. But the biggest news may be the outpouring of support from big data vendors such as Syncsort and MarkLogic, which have launched new offerings that live on the Amazon cloud.
Big data, by its very nature, requires that you have a place to store your data, and another place to process it. That often requires a big investment in hardware, software, and network infrastructures, as well as paying big salaries to personnel to run it all.
Things are much simpler in the Amazon cloud world, where the actual location of your data is not a major concern, and moving the data around to be processed by the large and growing collection of available engines does not require a big investment in data replication, network connectivity, and workers’ comp benefits.
As more customers move their data to Amazon’s Simple Storage Service (S3) and load their apps on Elastic Compute Cloud (EC2), the value proposition increases for software vendors to offer a service that lives there, too. The growth of AWS has proven that the business model of providing dynamic scalability in storage and processing and simple pay-as-you-go software licensing is a winner.
Here are the notable third-party additions to AWS unveiled last week:
- Splunk normally requires that customers install servers to store and process machine-generated data in its proprietary format. But now that Splunk has a place on the AWS cloud, customers no longer have to worry about the physical infrastructure supporting their Splunk big data analysis systems. Specifically, Splunk last week announced Amazon Machine Images (AMIs) for Splunk Enterprise 6 and its Hadoop version of Splunk, called Hunk. It also released a new version of the Splunk App for AWS, which uses the new AWS CloudTrail service that logs all AWS API calls for security and compliance purposes. Splunk customers already running Splunk on AWS include Adobe and FamilySearch.
- MarkLogic also announced that the latest release of its NoSQL database is available on Amazon’s EC2. Customers can deploy Web-based applications running on MarkLogic 7, and pay $0.99 per hour for the privilege. The marriage of MarkLogic and EC2 gives customers more options when it comes to running established operational databases in the cloud, says Joe Pasqua, MarkLogic’s senior vice president of product strategy. “Many organizations are frustrated by the tradeoffs they face when trying to choose a database for use in the cloud,” he says. “They find either limited-functionality versions of enterprise products or newcomers that don’t provide essential features such as ACID transactions, security or HA/DR.”
- Syncsort also used the AWS re:Invent conference as a springboard to launch IronCluster, a new AWS version of its DMX-h high-performance sort tool for Hadoop. Running alongside Amazon’s Hadoop service, called Elastic MapReduce, IronCluster allows users to replace the “mediocre” built-in sort that comes with Apache Hadoop with Syncsort’s higher-performing sort jobs. According to Syncsort CEO Lonne Jaffe, the new IronCluster offering will make it more feasible to offload legacy data analytics or data transformation routines running on Teradata or IBM mainframes into the Amazon cloud. “It’s just one click, spin up the whole cluster, and siphon off the really expensive workload from your legacy systems into the cloud,” Jaffe told Datanami. You can read more about IronCluster in our story from last week.
- Amazon also used the show to tout its own hosted NoSQL DynamoDB database. The company says it serves 2.2 trillion database requests per month, or about 1 million transactions per second. That number caught the eye of other NoSQL vendors who are already running in the Amazon cloud–namely DataStax, which distributes a version of Apache Cassandra. DataStax claims that just one of its Cassandra-on-AWS customers is processing more than all of Amazon’s DynamoDB customers put together. Netflix, it says, is running 1.3 million writes per second on its Amazon-hosted Cassandra database. (Of course, Netflix is also a big Amazon customer, so Amazon hardly comes off badly in this light.)
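As a rough sanity check on the DynamoDB and Netflix figures quoted above, the monthly request count can be converted to an average per-second rate. The sketch below assumes a 30-day month and perfectly even traffic, neither of which comes from Amazon or DataStax; the helper name `per_second` is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(requests_per_month, days_in_month=30):
    """Average requests per second, assuming evenly spread traffic (an assumption)."""
    return requests_per_month / (days_in_month * SECONDS_PER_DAY)

# Figures quoted in the article
dynamodb_requests_per_month = 2.2e12   # Amazon's stated DynamoDB volume
netflix_writes_per_second = 1.3e6      # DataStax's stated Netflix write rate

dynamodb_avg = per_second(dynamodb_requests_per_month)
ratio = netflix_writes_per_second / dynamodb_avg

print(f"DynamoDB average: {dynamodb_avg:,.0f} requests per second")
print(f"Netflix writes vs. DynamoDB average: {ratio:.2f}x")
```

Under these assumptions the DynamoDB average works out to roughly 850,000 requests per second, a little under the "about 1 million" figure, which may reflect rounding or peak rather than average load. Note also that the two numbers are not strictly comparable, since one counts all requests and the other counts writes only.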
Relational databases are also experiencing a surge in popularity, as organizations discover that Hadoop and NoSQL are simply not good fits for certain types of workloads, especially those handling highly structured data, such as straight SQL transactions. To that end, there was a flurry of activity around relational databases running in AWS, starting with Amazon’s embrace of PostgreSQL.
- Amazon’s new RDS for PostgreSQL offering complements the existing support that Amazon has for MySQL, SQL Server, and Oracle Database. The new service will undoubtedly be utilized by startups and software-as-a-service (SaaS) firms looking for a low-cost relational database on which to base their offerings. “Some of the people that I have talked to want to move their existing applications over to RDS,” writes Amazon’s Jeff Barr. “Others want to build new applications that take advantage of the data compression, ACID compliance, spatial data (via PostGIS), or fulltext indexing that PostgreSQL has to offer.”
- GenieDB also used the AWS show to launch a new management console for its MySQL-as-a-Service offering. GenieDB runs its MySQL relational database service across several cloud infrastructures, including Amazon EC2, Rackspace, Google Compute Engine, and HP Cloud. With the new console, administrators get better tools to deploy and manage distributed MySQL database clusters that can keep running despite regional outages.
AWS, along with its various internal and third-party engines, is obviously here to stay. And thanks to the way the AWS services are structured–where compute resources actually get less expensive as more customers sign up and share the financial burden–it’s going to become an increasingly viable alternative to hosting your own big data infrastructure in the future.