SAS Sticks Stats to Metal with HDFS
Last week, statistical analysis giant SAS announced that it would add HDFS to the list of deployment options for its in-memory High Performance Analytics Server offering.
We were on-site for this announcement in Las Vegas during the company’s Premier Business Leadership Summit, which brought together a bevy of end users of both traditional SAS and their targeted line of high performance analytics solutions.
While the news itself is worth detailing, some of the conversations we had with SAS CTO Keith Collins and Oliver Schabenberger, the lead architect behind the High Performance Analytics program at SAS, are far more revealing. They show how massive stats-based software companies are taking advantage of innovations in hardware (from added cores to memory stores) and keeping pace with new frameworks for ingesting and processing massive, diverse data streams.
We’ll touch on some of the key conversation points from our chat with Schabenberger and Collins in a moment, but to back up, SAS released a new set of analytical frameworks for different market segments, all of which leverage High Performance Analytics Server as the primary engine driving the solutions. Analytics Server has four areas of capability, which SAS has recently extended: statistics and data mining, forecasting, optimization, and text analytics. While these have previously been available on Teradata and EMC Greenplum boxes, SAS has now opened an entirely new deployment environment to users with the ability to run natively on Hadoop’s distributed file system (HDFS).
The important angle to this announcement is that Analytics Server could allow traditional SAS users to run high-end analytics. These are the same procedures they’ve been running for a long time, but until now those procedures were single-threaded and didn’t run inside the database or a distributed environment. Users can now run parallelized operations in a distributed environment, through either the Teradata/EMC Greenplum approach or the new HDFS link, from the same environment in which they previously ran non-high performance code. “It’s all the same program, but the computations happen in parallel in a distributed environment. The interaction with the software is still the same but data movement is reduced, and this is a huge step for us in enabling real high performance,” said SAS CTO Collins.
During a side chat with Randy Guard, CMO at SAS, we heard that more traditional SAS users are simply asking for the ability to load data into their Hadoop clusters and run analytics on the spot, close to the metal. When asked whether he sees a shift away from the mighty appliances, mainframes, and other behemoth systems, he noted that there is a slow migration, but many of the customers he’s spoken with aren’t ready to ditch their data warehouses or Teradata and Greenplum complexes quite yet. He pointed to a recent example of a large media company SAS works with that ingests terabytes of audience data each month to analyze its web and physical properties. All of its ingested web data sits in Hadoop and, like many other users, says Guard, the company has used Hadoop simply to run basic analytics at scale. As its needs have changed (and as more capabilities for in-memory, complex analysis have emerged), it wants to run far more advanced analytics that scale directly inside Hadoop, avoiding the latency hit of moving the data elsewhere.
Oliver Schabenberger claims that SAS has been under tremendous pressure in the marketplace to come up with a good play on HDFS. The lead architect noted that with all the options available (including building their own, which wasn’t in the cards for SAS), their goal was to find the technology solution that was the best fit for the computing environment of their customers. “Our philosophy is to co-locate the data and analytics as much as possible. That was our philosophy going inside the database and running alongside the database as we do with Greenplum and Teradata. Our computing must be alongside the hardware that houses the data, which is why HDFS is such a natural environment for us.”
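To make the co-location idea concrete, here is a minimal Python sketch of the general pattern, not SAS code: each node summarizes only the block of data it already holds, and just the tiny summaries travel over the network to be merged. The names (Partial, summarize, merge, distributed_mean_variance) are hypothetical, the "blocks" are plain in-process lists standing in for per-node data, and the merge step uses the well-known pairwise update for combining means and variances.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Partial:
    """Summary of one data block: count, mean, and sum of squared deviations."""
    n: int
    mean: float
    m2: float


def summarize(block: Iterable[float]) -> Partial:
    """Runs where the block lives; one pass over local data (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in block:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Partial(n, mean, m2)


def merge(a: Partial, b: Partial) -> Partial:
    """Combines two summaries without revisiting the underlying rows."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


def distributed_mean_variance(blocks: List[List[float]]) -> Tuple[float, float]:
    """Returns (mean, sample variance) from per-block summaries only."""
    total = Partial(0, 0.0, 0.0)
    for block in blocks:
        # In a real cluster, summarize() would execute on the node holding the block.
        total = merge(total, summarize(block))
    variance = total.m2 / (total.n - 1) if total.n > 1 else float("nan")
    return total.mean, variance


if __name__ == "__main__":
    # Three "nodes", each holding part of the data set 1..9.
    print(distributed_mean_variance([[1, 2, 3], [4, 5], [6, 7, 8, 9]]))
```

However large a block is, only three numbers per block cross the network, which is the reduction in data movement the SAS executives describe.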
HDFS is a natural choice for what SAS was trying to achieve with their high performance analytics server approach, says Schabenberger. He claims that its native failover and replication mechanism was one of the primary attractive features outside of the more obvious aspects of parallelization ability, and this is something that they weren’t ready to write themselves when HDFS supplied it. “We’ve integrated with MPP databases via Teradata and EMC Greenplum,” he noted, “and while HDFS offered this it took away the SQL access, but with this approach you get that important replication and failover aspect.”
Schabenberger explains that the company has developed their own proprietary file format to manage this parallelization complexity, noting that this is a key to transparency and functionality. The binary format, he says, “contains all the things our customers expect from a SAS environment that you won’t get from a vanilla HDFS file someone puts in that’s lacking those essential SAS signatures to support custom formats, different languages, and the like.” This means that the company can take as direct a route as possible in its interaction with HDFS. “We don’t want users struggling with how to integrate with Hadoop, decide what layers are needed—the same SAS language they’re used to is the medium with which they talk to HDFS. The rest is transparent.”
For example, on that transparency/ease-of-use front, Schabenberger notes that he began working on making HDFS approachable when the company released its Visual Analytics, which ended up completely integrated with HDFS, but in something of a backdoor way. “How we approached this,” he said, “was to cut out the Hadoop-dependent layers that aren’t needed. In the end, our integration on Visual Analytics makes HDFS totally transparent: they don’t need to see it, they don’t need to understand. But for us, it’s been the backing store for our platform and the method that lets us work with data in parallel then process it.”
Just as no one is interested in running on a single core these days, users are resistant to learning new tricks just to keep pace with performance possibilities, says Collins. While last week’s news encapsulates the company’s approach to the file system showdown going down elsewhere, he says this isn’t SAS’ first rodeo when it comes to HDFS and delivering a seamless, easy experience.
SAS CTO Keith Collins says that, all hype aside, big data holds promise, especially when we start to think of memory as the first place data goes. The goal is to let data stream in without any pre-fabbing, indexing, or aggregating, and to scale it straight to parallel processing. He says the one thing to keep in mind going forward in the ecosystem as a whole is that the future is in memory, and this is an analytic process, not a database process. Data is loaded into memory, handled in parallel in memory, and can persist on the backend for later use. On that note, he says that for users who are looking to scale their capabilities, ease of use needs to be in place: it should be possible to start with SMP in-memory analytics and then move from SMP to MPP with a one-line statement, without a training manual.
In some ways this is a story that’s just as much about hardware as it is software. According to the architect and CTO, what’s really pushing the envelope for a company like SAS is a shift in hardware environments overall, whether we’re talking MPP or not. Software needs to be ready for the multi-core era, so for companies like theirs, it takes a dual attack on both the SMP and MPP fronts. Further, when it comes to parallelizing some of the most complex statistical algorithms out there, there are no hard-and-fast rules. The shift from Westmere to Sandy Bridge was a big step performance-wise, as was the leap from 1 GbE to InfiniBand. They both agree that the major jump in hardware that software must now address lies in the promise of deep memory boxes, where a single dual-socket machine with 8-core processors and hyperthreading enabled presents a whopping 32 logical cores (16 physical cores, two hardware threads each) and major RAM for some innovations on the SMP front.
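The 32 figure quoted above is simple arithmetic: two sockets, eight physical cores per socket, and two hardware threads per core with hyperthreading on. A minimal Python sketch, with a hypothetical helper name and the socket and core counts taken from the description above:

```python
def logical_cores(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical processors the operating system sees on one machine."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    # Dual-socket box, 8 cores per socket, hyperthreading on (2 threads per core).
    print(logical_cores(2, 8, 2))
    # The same box with hyperthreading off exposes only its physical cores.
    print(logical_cores(2, 8, 1))
```

The distinction matters for SMP tuning: only 16 of those 32 are physical cores, so compute-bound statistical code will not necessarily see a 32x speedup.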
On a side note, Schabenberger kept using the phrases “high performance computing” (HPC) and “high performance analytics” interchangeably. When asked where the two overlap, he said that he personally thinks of HPC as an academic exercise, whereas high performance analytics is about solving a business problem in a high-performance environment, even though the two share some commonalities. “If you look at the types of problems we solve in our high-performance environments, these are the most complicated analytical problems that exist today. We are doing what we do best, which is approach these complicated problems in mathematical, statistical terms.” With this in mind, he notes that we can think of these two separate realms as a merger between high-performance hardware capabilities and the most advanced analytics out there.
While SAS is not the only established software vendor scrambling to keep pace with analytics needs that tap the capabilities of powerful commodity hardware, they claim that their real strength lies in doing what they’ve always done—solving the most complex mathematical and statistical problems known. While the performance boosts are a boon, at the end of the day, both the CTO and HPAS architect agree that this is their true bread and butter, come what may on the hardware side.