Teradata Makes Data Warehouse More Hadoop-ish
It’s no stretch to say that the folks at Teradata aren’t the world’s biggest fans of Hadoop. If nothing else, the hoopla surrounding Hadoop has caused some Teradata customers to question future investments in the company’s traditional data warehouse technology. But with today’s launch of the next generation of Teradata’s flagship platform, the company has made its software a little more Hadoop-ish, particularly when it comes to taking the compute to the data and supporting semi-structured file formats.
The new QueryGrid functionality that Teradata introduced today with the launch of Teradata 15 is designed to minimize the movement of data into the Teradata data warehouse by enabling data analytic workloads to be processed in place, including in Hadoop. Instead of moving large amounts of raw data from Hadoop into Teradata for processing, which consumes network bandwidth and increases data duplication, QueryGrid enables Teradata queries to be sent to the source system, where they utilize the remote system's resources. After processing, the results are sent back to the Teradata warehouse, where they can be analyzed along with refined data from other source systems.
It’s all about taking advantage of the best data processing engines available to the organization as part of the multi-system “logical data warehouse” concept that Gartner has been promoting, says Teradata’s program marketing manager Imad Birouty. “Users don’t care where their data is,” he says. “They just want answers to their questions, regardless of where the data is sitting or how it’s processed. That’s what QueryGrid brings to the table.”
QueryGrid is one part new technology, one part rebranding, and one part roadmap promise. The company already had connectors that moved data from source systems (such as Hadoop and Oracle’s database) to its two analytic engines, the relational Teradata data warehouse and its Aster data discovery platform, which provides SQL, graph processing, and MapReduce capabilities. However, those connectors were uni-directional; raw data moved up the stack into the Teradata EDW and Aster for processing. With QueryGrid, the connectors are bi-directional: queries are sent down from the Teradata EDW and Aster, and more highly refined (and smaller) data is sent back up. That should jibe better with Teradata customers who are starting to work with Hadoop.
|Teradata’s plan for QueryGrid connectors|
“We’re going to take advantage of the processing engines and what they’re good at, what the Teradata database is good at, what Aster is good at, and what Hadoop and Oracle and all the others are good at,” Birouty tells Datanami. “We call that push-down processing. We’re going to push it down to the other engines, minimize data duplication, analyze it and process it where it resides, and do it as transparently as possible. So if you’re a user, it’s easy access to the data and analytics. You don’t have to go through multiple systems. Within one query, you can access multiple systems…through standard SQL, using the skills and the tools you have.”
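QueryGrid itself exposes this through standard SQL, but the economics of push-down are easy to see in a toy model. The following is a minimal Python sketch of why shipping the query beats shipping the data; every function and dataset here is hypothetical illustration, not a Teradata or Hadoop API:

```python
# Sketch of "push-down processing": rather than pulling all raw rows
# from a remote engine and filtering locally, ship the predicate to the
# remote system so only the (much smaller) result crosses the network.
# All names below are invented for illustration.

def pull_up(remote_rows, predicate):
    """Naive approach: transfer the entire remote dataset, filter locally."""
    transferred = list(remote_rows)      # every raw row crosses the network
    return [r for r in transferred if predicate(r)], len(transferred)

def push_down(remote_rows, predicate):
    """Push-down: the predicate runs where the data lives."""
    result = [r for r in remote_rows if predicate(r)]  # filtered remotely
    return result, len(result)           # only the result crosses the network

# Simulated remote table: 1,000 web-event records living in "Hadoop"
events = [{"user": u, "clicks": u % 10} for u in range(1000)]
heavy_clicker = lambda r: r["clicks"] > 7

local_result, rows_moved_naive = pull_up(events, heavy_clicker)
remote_result, rows_moved_pushed = push_down(events, heavy_clicker)

assert local_result == remote_result     # same answer either way
print(rows_moved_naive, rows_moved_pushed)  # 1000 vs 200 rows transferred
```

The answer is identical in both cases; the difference is that push-down moves only the refined result set, which is the bandwidth and duplication saving QueryGrid is aiming at.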
Currently, the company has just a handful of QueryGrid connectors, including one that connects Teradata with Hadoop, one that connects Aster with Hadoop, one that connects Teradata with Oracle, and another that connects Aster to Oracle (not to mention the Teradata to Aster connections). The Oracle connectors, by the way, are bi-directional now, and enable the type of process-in-place functionality that Teradata is bringing to other platforms. Within the next 18 to 24 months, the company plans to launch bi-directional connectors for other databases, including IBM DB2, SQL Server, Postgres, and other relational databases. It also plans to develop QueryGrid connectors that tie into other analytic systems developed in languages such as SAS, R, Python, Perl, and Ruby.
If “query in place” sounds familiar to Hadoopers out there, it’s no coincidence. The idea of taking the compute to the data, as opposed to moving the data to the computing resources (i.e. to Teradata), has been a hallmark of the open source juggernaut since it started gaining traction about five years ago. It makes a lot of sense when faced with huge data volumes whose value perhaps doesn’t justify loading them into a sleek, powerful, and expensive data warehouse.
Teradata may be borrowing a page from Hadoop. Just don’t call what it’s doing an “enterprise data hub.”
“Cloudera is going down a path that we don’t necessarily agree with, and that analysts don’t agree with,” Birouty says. “They’re trying to make [Hadoop] everything to everyone. ‘It’s a database repository, a MapReduce engine, it’s a database with Impala on top of it. It can do everything you can ever imagine.’ No one, from a technical perspective, thinks that’s feasible… We think having the right technology for the right job is the proper path to go down.”
Teradata developed the new QueryGrid Hadoop connectors with assistance from its Hadoop partner, Hortonworks. “Teradata pioneered integration with Hadoop and HCatalog with Aster SQL-H to empower customers to run advanced analytics directly on vast amounts of data stored in Hadoop,” Hortonworks CTO Ari Zilka says in a statement. “Now they are taking it to the next level with pushdown processing into Hadoop, leveraging the Hive performance improvements from Hortonworks’ Stinger initiative, delivering results at unprecedented speed and scale.”
In another Hadoop-ish development, Teradata is also opening itself up to JSON, the semi-structured format that’s become the de-facto standard for exchanging data on the Internet. The company previously supported Web logs and XML, but the support for JSON in Teradata 15 will really open up the potential uses of the data warehouse, including storing and analyzing sensor data along with other data stored on the Teradata platform, says Alan Greenspan, the product marketing manager for the Teradata Database.
|JSON support features in Teradata 15|
One of the advantages of processing JSON is the capability to perform query-on-read, or late data binding, as opposed to the early binding, or query-on-write, approach used with traditional data types. “With JSON, the database can actually discover what data is there when you run your query,” Greenspan tells Datanami. “That gives you immense flexibility to use data that’s very dynamic, that’s changing over time, or that has new data elements, because you don’t have to change your database to accept it or start processing it.”
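The late-binding behavior Greenspan describes can be sketched in a few lines of Python. This is not Teradata’s JSON engine, just an illustration of query-on-read: fields are discovered at query time, so records can gain or drop attributes without any schema change:

```python
import json

# Illustration of "late binding" / query-on-read with JSON documents.
# Each record's fields are discovered when the query runs, not declared
# up front, so new or missing attributes require no schema migration.
raw_docs = [
    '{"sensor": "a1", "temp": 21.5}',
    '{"sensor": "b2", "temp": 19.0, "humidity": 40}',  # new field appears
    '{"sensor": "c3"}',                                # field missing
]

def query(docs, field):
    """Yield (sensor, value) pairs for every record that has `field`."""
    for doc in map(json.loads, docs):
        if field in doc:                  # binding happens at read time
            yield doc["sensor"], doc[field]

print(list(query(raw_docs, "temp")))      # [('a1', 21.5), ('b2', 19.0)]
print(list(query(raw_docs, "humidity")))  # [('b2', 40)]
```

With early binding, the appearance of the `humidity` field would have required altering the table before the data could even be loaded; with query-on-read, the same query simply matches whichever records carry the field.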
Teradata customers asked for JSON processing to support several specific use cases, including ecommerce vendors who want to analyze transactions, ATM manufacturers analyzing status updates, and car makers analyzing data from sensors embedded in cars. “All this is coming in via JSON, and enterprise analytics needs to be able to have the customer data that’s in the data warehouse along with the data from these devices and these transactions, and work with it all together,” Greenspan says. “So it’s really deep and broad integration within the same processing engine. That’s what’s unique, versus going and getting a document database that just specializes and only does JSON data for transactions, or getting another database where they bolted on another engine to process JSON.”
From a competitive point of view, Teradata’s JSON support is interesting because it touches on both Hadoop and NoSQL. Companies are generating petabytes’ worth of JSON documents from applications running on NoSQL data stores, such as those from MongoDB and Couchbase, while in many cases they’re using MapReduce, Hive, Pig, Impala, or other data processing engines within Hadoop to analyze that data for useful nuggets of information. (Some NoSQL database vendors also tout their capability to analyze JSON, but mostly they’re processing transactions.)
So while Teradata apparently has no qualms about accepting Hadoop into its Unified Data Architecture and giving it a supporting role for things it’s good at (turning huge amounts of relatively unstructured data into smaller buckets of more structured data that can then be loaded into the Teradata warehouse), it is also incorporating one of Hadoop’s major workloads by adding a JSON engine to its enterprise data warehouse.
Teradata may say not-so-nice things about Hadoop publicly, but if you look at what the company is doing, it’s evolving its architecture into something that, if not a replica, at least somewhat resembles Hadoop and its concepts of moving the compute to the data and picking the right analytical engine for the job. It still has its EDW at the top of the logical heap, which is no surprise. But the dexterity Teradata is exhibiting, and its willingness to accept the reality that people want to process some data in Hadoop (if not that Hadoop is better at some things than Teradata), is great news for existing customers.