Syncsort Siphons Up Legacy Workloads for Amazon EMR
Syncsort is bringing its flavor of super-charged MapReduce job generation capabilities to Amazon’s Elastic MapReduce cloud, the companies announced today. The IronCluster ETL-as-a-service offering will allow Amazon EMR customers to generate faster MapReduce jobs from a GUI, which the companies say will make it easier to migrate expensive data warehouse workloads from Teradata or the IBM mainframe into Amazon’s comparatively inexpensive cloud.
IronCluster is a new, cloud-based version of Syncsort’s existing ETL tool, called DMX-h. What makes DMX-h unique is the way it allows developers to use a GUI to create high-performance MapReduce jobs, as well as its capability to access exotic mainframe assets, such as COBOL copybooks and EBCDIC data. Syncsort actually worked with Hadoop distributors (namely Cloudera) to get modifications committed into the Apache Hadoop project that allow DMX-h to get down and dirty with MapReduce and Hadoop.
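To see why mainframe data access is non-trivial, consider what EBCDIC data looks like: fixed-width records encoded in a code page most modern tools can't read directly, with field boundaries defined externally by a COBOL copybook. The following is an illustrative sketch only (the record layout and field names are invented, not anything from DMX-h), using code page 037, a common US EBCDIC encoding that Python ships as the `cp037` codec:

```python
# Illustrative only: an 8-byte fixed-width mainframe record whose layout a
# COBOL copybook might describe as PIC X(5) NAME followed by PIC X(3) DEPT.
# The layout and field names here are invented for this sketch.
record = b"\xc8\xc5\xd3\xd3\xd6\xf0\xf0\xf1"  # raw EBCDIC bytes

# Slice by the copybook-defined offsets, then decode each field from
# EBCDIC code page 037 into ordinary text.
name = record[0:5].decode("cp037")
dept = record[5:8].decode("cp037")

print(name, dept)  # HELLO 001
```

A real tool additionally has to handle packed-decimal (COMP-3) numerics, redefines, and variable-length records, which is where much of the complexity lives.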
“Our ETL product is built on the chassis of our sort engine, so it essentially becomes a bit of a Trojan Horse. It allows us to have our ETL product run in a deeply instrumented native way on every node in the cluster,” says Syncsort CEO Lonne Jaffe.
Hadoop’s MapReduce has a “mediocre built-in sort” function, he says. But the sort engine in Syncsort’s ETL product is a “high performance engine that includes dozens of algorithms that optimize each workload on each machine down to memory and I/O and CPU levels. So when you craft a MapReduce job using our ETL [interface]… the MapReduce paradigm basically punches out to our tool on each of the nodes, does the fast joins and merges and sorts and aggregations, and you get this level of performance that exceeds even the best well-designed Pig code.”
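The per-node work Jaffe describes is, at its core, external sorting: sort what fits in memory into runs, then stream a k-way merge over the runs. The sketch below shows only that basic shape, and is not Syncsort’s implementation; a production engine of the kind described would pick algorithms and run sizes per machine based on memory, I/O, and CPU:

```python
import heapq

def external_sort(records, run_size=4):
    # Toy per-node external sort: sort fixed-size chunks ("runs") in
    # memory, then lazily k-way merge the already-sorted runs.
    # run_size stands in for "what fits in RAM" on a real node.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    return list(heapq.merge(*runs))

data = [9, 1, 7, 3, 8, 2, 6, 5, 4, 0]
print(external_sort(data))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Replacing Hadoop’s stock implementation of this step with a tuned engine is what the committed modifications to Apache Hadoop make possible.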
Syncsort’s customers pay hundreds of thousands of dollars in software license fees for the privilege of using DMX-h to move data from source systems (relational databases, applications, data marts, etc.) and then do the first level of sorting natively in big production Hadoop clusters, Jaffe says. Now customers can get access to IronCluster (DMX-h’s cloud cousin) on Amazon EMR for a fraction of that cost, which is what makes the announcement potentially significant.
“The idea is, with a single click, you’ll spin up a whole series of Amazon MapReduce nodes, with IronCluster running on it, which can do all the great things it can do in the on-prem version,” Jaffe tells Datanami. “It will allow people to do all the Teradata offloading or mainframe data access or Hadoop cluster build-out without having to build anything. It’s just one click, spin up the whole cluster, and siphon off the really expensive workload from your legacy systems into the cloud.”
This allows users to run Hadoop workloads on the platform that makes the most economic sense. “You’ll be able to get started in the cloud in a development environment, then move back on premise, or move workload back and forth, or start on premise and then move to the cloud when you need more capacity during certain parts of the day,” Jaffe says. “If you only need to do the expensive processing for a couple of hours in the evening, you don’t need to buy all these servers, and then buy all these perpetual software licenses or term-based licenses. You just scale it up, and pay Amazon by the hour.”
What’s more, Syncsort is giving away the software for small EMR environments. “It’ll be free up to 10 nodes, and from there, we’ll charge a very low, essentially hourly charge for usage of nodes,” Jaffe says. Considering the ease with which EMR users can add Hadoop nodes, you could call this Syncsort’s Trojan Horse marketing program.
Whatever it is, Syncsort is aiming squarely at the heart of Teradata’s customer base and, to a lesser extent, IBM’s System z mainframe franchise. While Teradata denies that Hadoop is having much of an impact on its business, there’s no avoiding the big yellow elephant standing in the corner of the room.
“There was a surprisingly large percentage of companies exhibiting [at Strata + Hadoop World] that had a large part of their business model offloading legacy spend from Teradata to Hadoop,” Jaffe says. The fact that storage per terabyte on Hadoop is several orders of magnitude less expensive than it is for Teradata is a big part of that interest, he says.
Today, Teradata customers may be experimenting with Hadoop, and using their small Hadoop clusters to perform some pre-processing of data before loading it into Teradata. But eventually, as the software around Hadoop matures, Jaffe predicts they will start shutting down their Teradata warehouses.
The momentum behind Hadoop is already big, and it’s just getting bigger. “The world has seen that what you can do with the data once it’s already in Hadoop is getting better and better every day,” he says. “It’s still a little bit immature. But it’s improving rapidly and there’s a lot of money moving into that space. So the next-gen BI tools and the data application companies are all furiously working on either making their system run directly against Hadoop or against something that’s relatively close by, like one of the NoSQL repositories, like HP Vertica.”