Teradata Gets In Deep with R
Because the demand for data scientists has far outstripped supply, software vendors have stepped up to fill the gap. Teradata did its part when it announced today that it has parallelized the full body of R statistical functions from partner Revolution Analytics and made them available in the new 14.10 release of its eponymous database.
Existing Teradata customers are the big winners with the new R package, which will become available later this year as a built-in part of the Teradata data warehouse. Customers will not only be able to apply the full library of Revolution Analytics’ R statistical algorithms (specifically the HPA package) against data already stored in their database, but they can also be assured that it’s done quickly and accurately, according to Imad Birouty, a program marketing manager at Teradata.
“R, as fantastic as it is, has some shortcomings. In itself, it’s not a parallel language. It’s limited to running on a single server,” Birouty says. “We went ahead and did this extra work [to parallelize R] because we’re focused on this stuff. You’ll hear about other database vendors talk about running R in-database. They’ll talk about parallelism. They’re not doing that. They’re doing node-level. We’re doing true parallelism.”
The difference between running R algorithms at the node level and running them at the cluster level can mean the difference between being right and being wrong.
“Let’s say we have data spread across four servers, and that you want to figure out what’s the median of house prices in Arizona, for example,” he says. “If you run the median on each server separately, you’re going to get a median for data on that server. If you bring the medians back from all four servers and now you try to take the median of the medians, that’s bad math. It’s not going to work. You’re going to get a wrong result.
“Whereas with this system parallelism,” Birouty continues, “you can ask the same question, it’ll run out there, look at all the data across all the servers, and bring it back and give you one answer that’s accurate across the entire data set.”
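Birouty’s point is easy to check with a few lines of code. The sketch below is only an illustration in plain Python, not Teradata’s or Revolution Analytics’ implementation; the four “servers” and their house prices (in thousands of dollars) are made-up values chosen so the two approaches disagree.

```python
from statistics import median

# Hypothetical house prices (in thousands of dollars), split unevenly
# across four servers. These numbers are invented for illustration.
servers = [
    [100, 110, 120],
    [130, 140],
    [150, 900, 950, 1000],
    [105],
]

# Node-level approach: each server computes its own median,
# and the four local medians are then combined with another median.
local_medians = [median(prices) for prices in servers]
median_of_medians = median(local_medians)

# Whole-data-set approach: the median is taken over every row on every server.
all_prices = [price for prices in servers for price in prices]
true_median = median(all_prices)

print("median of medians:", median_of_medians)
print("true median:      ", true_median)
```

With this data the two figures differ, because each local median discards how many rows sit on its server and how they are distributed. An accurate answer needs a view of the full data set, which is the distinction Birouty draws.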
Teradata is considering bringing the R package to bear on its Aster Hadoop appliance, Birouty says. That would give Teradata a one-two combination that enables customers to explore their data with R on Hadoop, and then put any statistical deliverables into production on the Teradata Database proper. It would also give the company a capability similar to what Revolution Analytics unveiled earlier this summer, when it announced that it parallelized its Revolution R Enterprise 7.0 R package to run in Cloudera’s Hadoop distribution.
Teradata 14.10, the first major release of the analytical database since version 14.0 shipped in early 2012, brings several other major new capabilities, including the addition of 615 analytical functions from Fuzzy Logix to the Teradata Database; support for new XML data types; and support for temporal and geospatial data.
As part of the deal with Fuzzy Logix, the full breadth of its analytical and statistical functions can run, in parallel, on both the Teradata relational data store and the Aster Hadoop file system. The Fuzzy Logix routines are available as an add-on package, and are accessible as standard SQL, Birouty says.
“Imagine doing things like a moving average, a median, a mode, very advanced hypothesis testing, financial functions, and time series analysis. All of those are built-in functions that now run deep within our database,” he says. “Together with what Fuzzy brings us, and what we already had, we’re at over 1,000 database functions.”
The new XML functions in 14.10 will be particularly beneficial to Teradata customers in healthcare and financial services, which have standardized on XML for data exchange.
The database has supported the capability to “shred” XML documents for some time. But now, entire XML documents can be stored in a column in the relational data store and queried with XQuery. “We’re making it that much easier for them to store, publish, or hold” XML documents, Birouty says.
The new temporal functions will make the Teradata database more time-aware than it previously was. They give developers the capability to build applications that can format data by segments of time without writing the pages of SQL that this previously required. “It’s like the database can go through a time warp and go back in time to say ‘What did things look like on January 1, 2010?’ That’s very hard and very few databases have this capability, and fewer are doing it the right way,” Birouty says.
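The “time warp” idea can be sketched in a few lines. The example below is a plain Python illustration of an “as of” lookup over rows that carry a validity period; it does not show Teradata’s temporal SQL syntax, and the table layout, customer names, and the `as_of` helper are all hypothetical.

```python
from datetime import date

# Hypothetical history table: (customer, plan, valid_from, valid_to).
# valid_to is exclusive; date.max marks a row that is still current.
history = [
    ("alice", "basic", date(2009, 3, 1), date(2010, 6, 1)),
    ("alice", "premium", date(2010, 6, 1), date.max),
    ("bob", "basic", date(2010, 2, 15), date.max),
]

def as_of(rows, when):
    """Return the customer -> plan mapping that was valid on the given date."""
    return {
        customer: plan
        for customer, plan, valid_from, valid_to in rows
        if valid_from <= when < valid_to
    }

# "What did things look like on January 1, 2010?"
snapshot = as_of(history, date(2010, 1, 1))
print(snapshot)
```

Here the snapshot contains only alice on the basic plan, since bob’s first row begins in February 2010. A temporal database keeps this period bookkeeping and filtering inside the engine, so the application does not have to write it by hand.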
Lastly, the new geospatial indexing capabilities will allow Teradata customers to add location data as another dimension in their databases. This will be particularly useful to customers in the utility and insurance industries, Teradata says. For example, utility companies can use this function to identify and respond to outages by tracking the location of available employees, parts, and equipment. Similarly, property and casualty insurers can use it to develop better storm damage projections.