See EBCDIC Run on Hadoop and Spark
Only 20,000 or so of the big beasts still exist in the wild. They’re IBM mainframes, and despite the scorn of a legacy label, they continue to run critical processes companies simply don’t trust to commodity Intel boxes. Today Syncsort announced it’s providing a way for mainframe owners to process data in Spark and Hadoop while keeping the data in its original mainframe data format, EBCDIC.
IBM (NYSE: IBM) is one of the few server makers that use the Extended Binary Coded Decimal Interchange Code (EBCDIC) character encoding to store data, as opposed to the more widespread American Standard Code for Information Interchange (ASCII) encoding used by virtually every other operating system on the planet. IBM uses EBCDIC on two platforms: its System z mainframe and its IBM i midrange line of servers, the “baby mainframe” used by more than 100,000 organizations globally. Fujitsu also uses EBCDIC in its mainframes.
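To see how different the two encodings really are, compare the raw bytes. This illustrative snippet uses Python's standard-library `cp037` codec, one common EBCDIC code page; it is not part of any Syncsort or IBM product:

```python
# Same text, two encodings: EBCDIC (code page 037) vs. ASCII.
text = "HELLO 123"

ebcdic_bytes = text.encode("cp037")  # how a mainframe stores it
ascii_bytes = text.encode("ascii")   # how nearly everything else stores it

print(ebcdic_bytes.hex())  # c8c5d3d3d640f1f2f3 -- 'H' is 0xC8 in EBCDIC
print(ascii_bytes.hex())   # 48454c4c4f20313233 -- 'H' is 0x48 in ASCII
```

Not a single byte matches, which is why data moving between the two worlds normally has to pass through a translation step.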
But because no server is an island today, IBM and Fujitsu customers are constantly translating their EBCDIC data into ASCII to integrate and process it with data on other systems, for both transactional and data warehousing workloads. That extra step is a hassle, and it creates opportunities for bad things to happen.
But now Syncsort has found a way to enable mainframe customers to process their EBCDIC data on x86-based Hadoop and Spark clusters without first translating that data into the ASCII character set. This is important, Syncsort says, because it allows organizations to maintain a natural and untouched lineage of mainframe data for compliance purposes.
How does it work? The Woodcliff Lake, New Jersey, company says its DMX-h data integration software essentially “teaches” Hadoop how to talk EBCDIC.
“DMX-h comes with its own Hadoop InputFormat and OutputFormat implementations to deal with mainframe data in Hadoop MapReduce, so we ‘teach’ Hadoop how to speak EBCDIC,” a company spokesperson tells Datanami. “DMX-h engine running natively in the cluster can process EBCDIC data. The same InputFormat and OutputFormat implementations are used in Apache Spark.”
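Syncsort's actual InputFormat and OutputFormat implementations are proprietary Java classes, but the core job any EBCDIC-aware reader must do can be sketched. The following Python sketch splits a byte stream into fixed-length records and decodes each field with an EBCDIC codec; the record length, field layout, and sample data are all invented for illustration:

```python
import io

# Invented layout: 20-byte fixed-length records, two character fields.
RECORD_LEN = 20
FIELDS = [("name", 0, 12), ("balance", 12, 20)]  # (field, start, end)

def read_records(stream):
    """Yield one dict per fixed-length EBCDIC record in the stream."""
    while chunk := stream.read(RECORD_LEN):
        yield {
            name: chunk[start:end].decode("cp037").strip()
            for name, start, end in FIELDS
        }

# Build a tiny stand-in for a mainframe file: two EBCDIC records.
raw = b"".join(
    row.encode("cp037")
    for row in ("ALICE       00004200", "BOB         00000017")
)
for rec in read_records(io.BytesIO(raw)):
    print(rec)
```

A real Hadoop InputFormat additionally has to handle split boundaries across HDFS blocks and hand records to MapReduce or Spark tasks, but the decode-in-place idea is the same: the bytes stay EBCDIC on disk, and the reader interprets them on the fly.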
Syncsort says this new capability will benefit companies in regulated industries, such as banking, insurance, and healthcare, that have struggled to analyze their mainframe data using Hadoop and Spark because of the need to preserve data in its original EBCDIC format.
Previously, Syncsort addressed the EBCDIC data issue by converting the data into ASCII as part of its ETL offload offering using DMX-h. The software can still convert the mainframe data into ASCII if required, but the new capability, ostensibly, should eliminate the need for that extra step.
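Why does skipping the conversion matter for lineage? One illustrative reason: translating EBCDIC to ASCII produces an entirely different byte stream, so any checksum taken on the cluster copy no longer matches a checksum of the source mainframe file. A minimal sketch (the sample record is invented):

```python
import hashlib

# Bytes as they would sit on the mainframe (EBCDIC code page 037)...
original = "CUSTOMER RECORD 001".encode("cp037")
# ...versus the same text after a conventional ETL conversion to ASCII.
converted = original.decode("cp037").encode("ascii")

# The byte streams differ, so their digests differ too.
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(converted).hexdigest())
```

Keeping the data in EBCDIC end to end means the bytes in HDFS can be verified against the source byte for byte, which is the kind of audit trail regulated industries care about.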
The new EBCDIC capability solves a technical issue that lets Syncsort’s customers do things that “were previously impossible,” says Tendü Yoğurtçu, the general manager of Syncsort’s big data business. “Not only do we simplify and secure the process of accessing and integrating mainframe data with big data platforms, but we also help organizations who need to maintain data lineage when loading mainframe data into Hadoop,” she says in a statement.
There are other ways to analyze mainframe data in Spark, Hadoop, or both. IBM itself provides a version of its Hadoop distribution called IBM InfoSphere BigInsights for Linux on System z that’s designed to run on the mainframe’s Linux subsystem. But this product works by translating the EBCDIC data into ASCII, according to an IBM senior product marketing manager.
Syncsort, which does a fair amount of business enabling mainframe shops to offload their big ETL workloads from mainframes to Hadoop clusters, also introduced a new DMX Data Funnel capability that allows large collections of database tables to be imported into Hadoop en masse. Companies that regularly need to move large amounts of data into Hadoop will benefit from Data Funnel by being able to move hundreds of tables into HDFS with a single click, the company says.