Why Hadoop Won’t Replace Your Data Warehouse
A lot has been made of Hadoop becoming the singular control point for analytics, effectively usurping the enterprise data warehouse (EDW). The recent quest to integrate SQL into Hadoop is an example of that. But a better role for Hadoop is emerging that has it working hand in hand with existing EDW implementations in support of a hybrid big data analytics architecture.
The hype level around Hadoop continues to run high as the big data wave keeps getting bigger. The latest phase of excitability is coming from the Internet of Things (IoT) and the possibilities surrounding the analysis of machine data. Hadoop, with its affordable and flexible file system, is seen as a likely candidate to store and process petabytes worth of semi-structured data.
As the IoT grows, the Hadoop community continues its previously assigned task: retrofitting Apache Hadoop with SQL interfaces to make it easier to use (not to mention more EDW-like). Cloudera and Hortonworks, with their Impala and Stinger initiatives, are leading the push to speed SQL access on Hadoop and move beyond the Hadoop version 1/MapReduce paradigm.
But as the SQL work continues, it raises questions. Are we re-inventing the wheel? Are we duplicating what we already have with EDWs? Is this really where Hadoop should be headed? Is this the best use of our resources?
There are a couple of angles to the SQL-on-Hadoop story. On the one hand, SQL makes Hadoop more accessible to existing business intelligence tools, as well as the millions of data analysts who can write SQL. But there is also a movement afoot to replace EDWs with Hadoop, and SQL support is part and parcel of that drive.
On a purely dollar-per-TB metric, Hadoop beats the EDW hands down every time. Hadoop's capability to deliver massive parallelization by harnessing commodity x86 processors, SATA disks, and plain vanilla Ethernet networks is clearly a force to be reckoned with. But EDWs provide far more than just storage and a SQL interface, and replacing a mature EDW implementation with Hadoop is a proposition fraught with potentially unseen complications.
Steve Wooledge, vice president of product marketing at Hadoop distributor MapR Technologies, says Hadoop has a ways to go before it can replicate the functionality delivered by mature EDWs.
“For a sophisticated data warehouse user, there are certain types of workloads, very complex SQL, where mature database technologies [have an edge],” he says. “Hadoop’s just not there yet.”
Customers are exploring the possibility of replacing Teradata or Oracle data warehouses with Hadoop, Wooledge says. “That’s part of their data science experiments. They want to see what Hadoop’s good for. At this point in time, it’s not the right place for a data warehouse,” he says.
“Vendors that talk about replacing the data warehouse are misleading and they’re losing credibility.”
The data analytics giant SAS sees enough data going into Hadoop to make it worthwhile to offer two products on HDFS: SAS In-Memory Statistics for Hadoop and SAS Visual Statistics, which it unveiled earlier this month. But that doesn’t mean customers are ditching their Teradata, Oracle, or Greenplum EDWs in favor of Hadoop, says SAS chief data scientist Wayne Thompson.
In particular, EDWs still hold an edge over Hadoop when it comes to updating records for data analysts, Thompson says. “The reason that we have Visual Statistics on other platforms is Hadoop is not so good for updates,” he tells Datanami. “A lot of customers are still going to have their master EDW in a Teradata or Oracle system. We still see a proliferation of these advanced business analysts…who need statistics in these EDWs, and will need them for a long time to come, at least the next five years.”
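The update limitation comes down to storage semantics: HDFS files are effectively write-once, so where an EDW issues an in-place `UPDATE`, a Hadoop job typically rewrites the dataset, merging the base data with a file of changes. A minimal sketch of that merge-rewrite pattern (illustrative only, not MapR or SAS code; record layout is hypothetical):

```python
# On an append-only store such as HDFS, "updating" a record usually means
# rewriting the whole dataset so that the latest change for each key wins --
# there is no in-place UPDATE as in a relational EDW.
def merge_updates(base_rows, update_rows):
    """Merge a batch of changes into base data, keyed by 'id'."""
    merged = {row["id"]: row for row in base_rows}          # existing records
    for row in update_rows:                                 # later changes win
        merged[row["id"]] = {**merged.get(row["id"], {}), **row}
    return sorted(merged.values(), key=lambda r: r["id"])   # rewritten dataset

base = [{"id": 1, "balance": 100}, {"id": 2, "balance": 50}]
changes = [{"id": 2, "balance": 75}]
print(merge_updates(base, changes))
```

Rewriting terabytes to change a handful of rows is exactly the overhead Thompson is pointing at, and why the master copy of frequently updated data tends to stay in the EDW.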
A new data analytics architecture is emerging that blends next-gen platforms, such as Hadoop, in-memory data grids, and graph databases, with traditional relational databases and data warehouses. Under this hybrid architecture, each component does what it’s best at, enabling customers to get the benefits of new analytic technologies without suffering from the drawbacks.
At cloud analytics software firm Treasure Data, a trend is emerging that sees users augmenting their existing EDWs with its hosted offering, which blends MapReduce and a fast column-oriented data store called Plasma. The company’s 110 customers currently have more than 4 trillion rows of data occupying about 4 petabytes of storage in Treasure Data’s cloud.
“What we see is more folks talking about us as an adjunct cloud facility for big data alongside their classic data warehouses from Oracle, Teradata, and others,” says Rich Ghiossi, Treasure Data’s vice president of marketing. “We’re not confused about the fact that people may already have a data warehouse installed. They look at it and say, for us to put [a big data solution] into that environment is just prohibitive from a cost and manageability standpoint.”
As Hadoop implementations go from proof-of-concept into full production, there will be a desire to expand the Hadoop footprint and do more with it. That is a natural reaction, especially if an organization is getting actionable insights from its Hadoop cluster that would be difficult to get elsewhere.
But the enthusiasm for Hadoop needs to be tempered with the reality of the situation, which is that Hadoop is still a fairly new technology that doesn’t offer all of the enterprise-grade features that EDWs have offered for years. MapR’s Wooledge, who used to work at Teradata, doesn’t see Hadoop offering the same level of user concurrency, dynamic workload management, and data latency capabilities that Teradata offers anytime soon. “Some of the things that Teradata has created are absolutely best in class,” he says.
One workload that Hadoop has excelled at is running ETL jobs. Ten years ago, ETL was a single-threaded process that fed data into the data warehouse from a separate app server. But now those workloads are getting the benefit of massive parallelization thanks to Hadoop. “Now that Hadoop’s here, it makes natural sense to land your data into a file, do your transformation there, and then move the data that’s analyzable into the data warehouse,” he says.
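That land-transform-load pattern can be sketched in a few lines. This is an illustrative toy, not production Hadoop code: the raw feed, field names, and validation rule are all hypothetical, standing in for what a parallel transform job would do to each landed file.

```python
# ETL-offload sketch: land raw data as-is, transform it (parse, cast, drop
# rows that fail validation), and keep only the analyzable rows destined
# for the data warehouse load.
import csv
import io

RAW_LANDED = "ts,device,temp\n1,a,21.5\n2,b,bad\n3,a,19.0\n"  # hypothetical feed

def transform(raw_text):
    """Parse landed CSV; reject rows whose temperature won't cast to float."""
    clean = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            row["temp"] = float(row["temp"])   # cast, rejecting garbage values
        except ValueError:
            continue                           # unanalyzable row never reaches the EDW
        clean.append(row)
    return clean

loadable = transform(RAW_LANDED)               # these rows move to the warehouse
```

In a real deployment the same transform logic would run across the cluster, one task per file block, which is where the massive parallelization Wooledge describes comes from.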
Over time, as SQL-on-Hadoop matures, there may be other types of workloads that can move. But for now, organizations are best served by thinking about Hadoop not as a replacement for the EDW, but as another cog in the data analytics machine that must play well with others.