Inside IBM ML: Real-Time Analytics On the Mainframe
“Bring the compute to the data” is a common refrain in the big data age. Now IBM is heeding that advice with today’s launch of IBM Machine Learning for z/OS, a new offering due this quarter that will bring Watson-based machine learning software to the mainframe atop an Apache Spark execution engine, attached to a Jupyter-based data science notebook.
There are on the order of 20,000 IBM mainframe systems in the wild, which is practically a rounding error compared to the global population of Windows and Linux servers. But mainframes are a different sort of beast, in large part because they run the most valuable transactional workloads for the biggest companies in the world.
These companies are eager to do data science on that data, but it’s difficult because of where it’s stored. The answer for many mainframe shops has been to move the data to a separate environment, such as a Teradata warehouse or Hadoop data lake, using extract, transform, and load (ETL) techniques.
However, these ETL jobs consume a non-trivial number of MIPS on the mainframe. And banks and healthcare companies, in particular, are loath to move personally identifiable information (PII) and protected health information (PHI) more than they have to, for fear of running afoul of regulators or, worse, losing it to hackers.
IBM’s answer to this conundrum is to get serious about enabling modern analytics to run directly on the mainframe. Today’s launch of IBM ML for z/OS is just the start of a series of new analytic capabilities IBM will be delivering for the mainframe over the coming months and years, says Rob Thomas, general manager of IBM’s analytics group.
“Let’s be honest: there’s a bunch of what I would describe as machine learning toys in the world now, which are lightweight cloud services that do some interesting machine learning tasks or activities,” Thomas tells Datanami. “But most clients are not moving their most sensitive data (their transactional data, their customer data) into a third-party cloud to do that analysis on that data.”
Enter IBM ML
IBM created the IBM ML for z/OS product to address that shortcoming. The product, which has been in beta with 20 mainframe customers for months and will become generally available on March 17, according to today’s IBM announcement letter, brings together several pieces. It starts with a version of Apache Spark that runs on a Linux OS under the System z Integrated Information Processor (zIIP). Spark will serve as the runtime for IBM ML workloads.
The IBM ML for z/OS product also includes the Cognitive Assist component of IBM Watson. This Watson software will provide the framework, language interface, and model management capabilities for doing machine learning on transactional data stored on the mainframe, Thomas says.
“We’ve extracted the machine learning from Watson…and we’re making that available here for private clouds, particularly on the mainframe,” Thomas says. “The idea of being able to pull Watson technology in the form of machine learning and apply that [to mainframe data]—that has got a lot of interest.”
The Watson Cognitive Assist technology will drive much of the automation around machine learning in IBM ML for z/OS. According to Thomas, the software will recommend which algorithms to use on particular pieces of data. As the models are built and the data is analyzed, the software may change which algorithms it thinks are best to use. This feedback loop is a central feature of IBM ML for z/OS.
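IBM has not published how Cognitive Assist chooses among algorithms, so the following is only a generic sketch of the idea in Python: fit several candidate models, score each on held-out data, and recommend whichever has the lowest error. The three toy models, the `recommend` function, and the every-fourth-row holdout are all illustrative assumptions, not part of the product.

```python
from statistics import mean

# Three deliberately simple candidate "algorithms" (all hypothetical stand-ins).
def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m  # always predicts the training average

def fit_linear(xs, ys):
    # One-variable least-squares line through the training points.
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    # 1-nearest-neighbor: return the label of the closest training point.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

def recommend(xs, ys):
    """Return (best_name, scores), where scores maps name -> held-out mean squared error."""
    rows = list(zip(xs, ys))
    train = [r for i, r in enumerate(rows) if i % 4 != 3]  # keep 3 of every 4 rows
    test = [r for i, r in enumerate(rows) if i % 4 == 3]   # hold out every 4th row
    train_x, train_y = zip(*train)
    scores = {}
    for name, fit in CANDIDATES.items():
        model = fit(train_x, train_y)
        scores[name] = mean((model(x) - y) ** 2 for x, y in test)
    return min(scores, key=scores.get), scores
```

Calling `recommend` again as new rows arrive is what makes this a feedback loop: on data that follows a straight line the sketch picks `linear`, while on data that jumps between two levels it switches to `nearest`.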
“It’s pretty amazing when you go in there,” Thomas says. “You attach it to the data set, and it’s suddenly telling you what algorithm you should use… It’s kind of magical actually. It’s the most automatic way of analyzing data that we’ve seen.”
Initially, the Watson Cognitive Assist environment will work with Scala, but eventually IBM plans to open it up to Python, Java, and R, the company says. The IBM ML for z/OS product will also eventually work with other machine learning packages besides Watson Cognitive Assist and Spark ML, such as TensorFlow and H2O.
Don’t Say ‘Data Lake’
Because the IBM ML for z/OS analytics will primarily be centered on analyzing highly structured data stored on the mainframe and its various data repositories, there won’t be much of a need for analyzing things like images, videos, or less structured text.
The idea is not to turn the mainframe into a Hadoop-style data lake. Although Hadoop does run under z/OS, via the so-called zDoop offering IBM unveiled in 2014, IBM doesn’t see much of a future in analyzing unstructured or semi-structured data here. While some mainframe customers will import subsets of third-party data into IBM ML for z/OS, the product is mainly geared for analyzing highly valuable mainframe data by itself, not as a big data mixer.
The IBM ML software will be intuitive enough for end users to use, without requiring technical services from IBM. “We don’t view it as a services opportunity because the product is really good and it’s pretty self-explanatory for anybody who’s worked in and around… z/OS or the mainframe, or for anybody with a data science background,” Thomas says. “It’s pretty straightforward.”
And it may even eliminate the need for mainframe shops to hire more data scientists. “Everybody’s struggling to hire data scientists,” Thomas continues. “This really relieves the burden on data scientists.”
The user interface for IBM ML for z/OS will be based on the Data Science Experience, which is a Jupyter-based data science notebook that IBM unveiled last year. “That’s the UI paradigm that we’re thinking of for this because that’s the right way to visualize ML and how you’re building models,” Thomas says. “That’s the direction we’re going.”
A Mainframe Analytic Revolution?
The announcement of IBM ML for z/OS comes less than a week after IBM’s mainframe business announced that it’s working to bring the Anaconda open data science platform from Continuum Analytics to the z/OS mainframe, where it will run natively. Anaconda packages up more than 600 different data science libraries from the Python and R communities, such as NumPy, which was created by Travis Oliphant, the chief data scientist at Continuum.
“My team got connected with Travis a few months back,” says Barry Baker, IBM’s vice president and offering manager for z Systems and LinuxONE. “It took me about two minutes to realize that, oh my God, I need to be doing more with Travis and his team based on what they’re able to do… From our perspective it was a no-brainer that this was the right team for us to work with, given how much they’re driving the open source community.”
The work IBM and its partner Rocket Software are doing to bring the Anaconda package to the mainframe is separate from the work IBM is doing around IBM ML. While there are definite parallels between the two projects, there are big differences too. For starters, the Anaconda distribution for the mainframe won’t be ready until late June, at the earliest, while IBM ML ships in about a month.
But eventually they will merge into a single, cohesive strategy, Baker says. “Right now the IBM ML launch is just around building on top of Spark,” he says. “There’s going to be more on the roadmap for what we’re going to support on the platform in terms of using Anaconda within this flow and using Anaconda to enable a broader user base to be productive on the platform. View it as laying the groundwork for more to come as we get Anaconda and that stack on the platform as well.”
Another way to view it is this: While IBM likes Anaconda because it taps into the energy of the open data science community and brings a lot of powerful and fast-evolving data science tools to the mainframe, it sees Apache Spark as the sharp end of the stick. Anaconda may well be involved in building models, but Spark will be counted on to actually execute the scoring of fresh data, to run the models in anger in the real world, and to bring real-time analytics to the mainframe.
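The split between building a model and scoring fresh data with it can be illustrated with a small, self-contained Python sketch. It uses neither Spark nor Anaconda, and it is not IBM’s implementation: the logistic regression trainer, the JSON export format, the function names, and the two features (a transaction amount in thousands and a foreign-origin flag) are all assumptions made up for illustration.

```python
import json
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# --- Model-building side (offline, e.g. in a notebook) ---
def train(rows, labels, epochs=500, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = _sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return {"weights": weights, "bias": bias}

def export_model(model):
    """Serialize the fitted parameters so another runtime can load them."""
    return json.dumps(model)

# --- Scoring side (runs against each fresh record) ---
def load_model(blob):
    return json.loads(blob)

def score(model, x):
    """Return the model's estimated probability that record x is fraudulent."""
    z = model["bias"] + sum(w * v for w, v in zip(model["weights"], x))
    return _sigmoid(z)
```

In this sketch the scoring side needs only the exported parameters and a few arithmetic operations per record, which is why scoring can sit much closer to the transaction than model building does.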
“There’s a fair amount of invention going on and people are looking at Spark as a foundational tool now,” Baker says. “There’s a number of problems that it might actually address. We had a couple of customers emerge with using batch modernization leveraging Spark for example, which is not a use case I was thinking about.”
Nine of the 10 biggest banks in the world run their core processing systems on the z/OS mainframe. They’d like to be able to do things like detect fraudulent requests in real time, instead of messing around with APIs, external systems, and the additional risk and delay they entail.
The work IBM is doing with Anaconda and Spark is part of its effort to bring real-time analytics to the mainframe. “Real time analytics, within the scope of the transaction, is something our clients are trying to do,” Baker says. “Some are successful, and others are trying to get more successful.”