Deep Dive Into Oracle’s Emerging Big Data Stack
Oracle has a lot of turf to protect in the multi-billion-dollar relational database market, where it holds a dominant share. That creates a natural tension when it comes to big data technologies like Hadoop and NoSQL. While the IT giant isn’t embarking upon a wholesale re-architecting of its business around these emerging open source technologies, it is on its way toward building a comprehensive big data stack that addresses many emerging use cases.
Oracle has made considerable progress on its emerging big data stack over the past six months. In January, we saw the launch of the X5 generation of its engineered systems, including a new release of the Exadata Database Machine that boasts the new 18-core Intel “Haswell” processors. Other hardware products were also updated, including its Big Data Appliance, which you can get pre-loaded with Cloudera’s Distribution for Hadoop. Cloudera, of course, is Oracle’s preferred Hadoop provider.
Since the Sun acquisition, Oracle has morphed into a hardware company, but its roots are still firmly in software. Considering that most big data breakthroughs are occurring on the software side of things, that positions the big red machine to help solve its customers’ big data challenges, if not drive innovation in the industry. Oracle has made quite a bit of progress on the software side of big data over the past six months.
Here are the highlights of Oracle’s new big data software products:
Big Data SQL
Big Data SQL is a query engine that Oracle added to the Big Data Appliance in the second half of 2014. The software implements the well-understood concepts of running federated queries, “but with a twist,” says Neil Mendelson, vice president of big data and advanced analytics at Oracle.
“We’re actually running a bit of the Oracle database on the native targets, so we have Oracle running on the nodes of Hadoop, and on the nodes of NoSQL,” Mendelson tells Datanami. “So we’re able to federate a query, but we keep it in the family, so we can use the intelligence, the statistics, and so forth, in order to make it run, number one, fast as hell, and number two, be able to do it in a secure manner.”
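The core idea Mendelson describes is a federated query with predicate pushdown: filter each source where the data lives, then join only the surviving rows. The sketch below is purely conceptual and uses plain Python lists and dicts to stand in for an Oracle table and a Hadoop dataset; none of the names are Oracle APIs.

```python
# Conceptual sketch of a federated query with predicate pushdown.
# Each list of dicts stands in for a data source (e.g., an Oracle
# table and a Hadoop dataset); the predicates are filters pushed
# down to run at each source.

def federated_query(left_rows, right_rows, key, left_pred, right_pred):
    """Filter each source locally, then join the (much smaller) results."""
    left = [r for r in left_rows if left_pred(r)]                 # pushed to source 1
    right_ix = {r[key]: r for r in right_rows if right_pred(r)}   # pushed to source 2
    # Join on the shared key, merging columns from both sources.
    return [{**l, **right_ix[l[key]]} for l in left if l[key] in right_ix]

# Hypothetical data: orders in the relational store, clickstream in Hadoop.
orders = [{"cust_id": 1, "amount": 500}, {"cust_id": 2, "amount": 20}]
clicks = [{"cust_id": 1, "page": "/promo"}, {"cust_id": 3, "page": "/home"}]

result = federated_query(orders, clicks, "cust_id",
                         lambda r: r["amount"] > 50,   # filter runs at the orders source
                         lambda r: True)               # no filter on the clicks source
```

Only one order survives the pushed-down filter, so only one row needs to cross the wire before the join.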
Big Data Discovery
Oracle got into the big data visualization business at the Strata + Hadoop World conference in February, when it launched Big Data Discovery. Based in part on the search technology Oracle obtained with its Endeca acquisition, Big Data Discovery runs within Hadoop and is designed to allow data analysts to work with big data sets in a visual manner.
Big Data Discovery provides some rudimentary data cleaning and transformation routines, and employs machine learning algorithms to help automate the process of identifying insights that are hidden in big data. You can think of it as a lightweight combination of Trifacta and Tableau (though without all the features those two products bring).
“If you find that that data is now going to become part of your mainstream processing–which is clearly about what good data discovery does,” Mendelson says, “then after you’ve enriched this data … we can access that via Big Data SQL and allow it to be surfaced in your dashboard and reports.”
At Strata, Oracle also launched a new Hadoop-supported version of GoldenGate, its real-time data replication solution that has traditionally been used to feed data warehouses with data from transactional and operational systems. With the new release, Oracle is now supporting Hadoop as a target for data replication; support for Hadoop as a source is expected to come later.
Specifically, GoldenGate now supports HDFS, Hive, HBase, and Flume. It allows customers to enhance big data analytics initiatives by incorporating existing real-time architectures into big data solutions, while ensuring their big data reservoirs are up to date with production systems. Oracle also added a Java interface to GoldenGate that will make it easier for Java programmers to interact with it.
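Conceptually, what GoldenGate does here is change data capture: it records inserts, updates, and deletes on the source system and replays them on the Hadoop target to keep the replica in sync. The toy sketch below illustrates that replay loop; the change-record format and the dict standing in for an HBase or Hive target are illustrative assumptions, not GoldenGate's actual trail format.

```python
# Minimal sketch of replaying a change-data-capture trail onto a target.
# A plain dict stands in for a big data target (e.g., an HBase table);
# the change-record format is hypothetical.

def apply_change(target, change):
    """Replay one captured change (insert/update/delete) on the replica."""
    op, pk = change["op"], change["key"]
    if op in ("insert", "update"):
        target[pk] = change["row"]    # upsert the replicated row
    elif op == "delete":
        target.pop(pk, None)          # remove it from the replica
    return target

replica = {}
trail = [
    {"op": "insert", "key": 101, "row": {"status": "new"}},
    {"op": "update", "key": 101, "row": {"status": "shipped"}},
    {"op": "insert", "key": 102, "row": {"status": "new"}},
    {"op": "delete", "key": 102},
]
for change in trail:
    apply_change(replica, change)
```

After the replay, the replica reflects exactly the source's current state, which is the "up to date with production systems" property described above.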
While GoldenGate will feed data into Hadoop and Big Data Discovery will allow users to do some lightweight cleaning and transformation upon it, Oracle has other software up its sleeve to satisfy larger production-scale cleaning and transformation needs. One of those was unveiled today with an update to Oracle Data Integrator for Big Data.
Data Integrator for Big Data
The new Data Integrator for Big Data enables customers to run large-scale data transformations directly in Hadoop or Spark. Data analysts can create the transformations visually, and the ETL software automatically converts them into Spark, Pig, or Hive code, with Oozie providing the workflow management at runtime.
“This gives us three concrete transformation engines that we can use in a Hadoop or big data environments,” says Jeff Pollock, vice president of product management for Oracle. “You don’t have to go out and hire language specific developers who know Scala or Python. You can bring in those mainstream developers who are familiar with ETL technologies and they can instantaneously do big data development.”
Most of Oracle’s current customers view Hadoop and NoSQL data stores as complementary to their existing Oracle, Teradata, HP Vertica, or IBM Netezza data warehouses, Pollock says. They may land the data in a Hadoop data lake, but then offload the bits needed for operational reporting to the existing warehouse. ETL products such as Data Integrator for Big Data help those use cases.
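The "design visually, generate engine code" flow described above can be sketched as a tiny compiler: a declarative mapping goes in, and runnable engine code (here, HiveQL text) comes out. The mapping format and generator below are hypothetical illustrations of the idea, not Oracle's implementation.

```python
# Hypothetical sketch of compiling a declarative transformation spec
# into HiveQL, illustrating the design-then-generate ETL pattern.

def generate_hiveql(mapping):
    """Render a simple source-to-target mapping as a HiveQL statement."""
    cols = ", ".join(f"{src} AS {dst}" for dst, src in mapping["columns"].items())
    sql = (f"INSERT OVERWRITE TABLE {mapping['target']}\n"
           f"SELECT {cols}\n"
           f"FROM {mapping['source']}")
    if "filter" in mapping:            # optional row filter becomes a WHERE clause
        sql += f"\nWHERE {mapping['filter']}"
    return sql

# Example mapping an analyst might assemble in a visual designer.
mapping = {
    "source": "raw_logs",
    "target": "clean_logs",
    "columns": {"user_id": "uid", "event_time": "ts"},
    "filter": "ts IS NOT NULL",
}
sql = generate_hiveql(mapping)
```

The same mapping could just as well be rendered to Spark or Pig code, which is what lets ETL developers stay in the declarative layer rather than writing Scala or Python by hand.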
Big Data Preparation
On the data cleansing side, Oracle is gearing up for a launch of the cloud-based Big Data Preparation tool, which is available now as a preview. “What the Big Data Preparation capability will add to the mix is automating all of the data capture and data movement from unstructured sources into the BD environment as well. That will be part of the same platform,” Pollock says. “We’re using some machine learning and natural language processing to aid non-technical users in enriching this technical layer.”
The movement towards real-time big data analytics is in full swing right now, and Oracle is looking to address that in a future release of its Data Integrator tool with support for Apache Storm and Spark Streaming, Pollock says.
But don’t overlook Oracle’s 12c database or its NoSQL databases when it comes to big data. Oracle is continuously adding new data types to its flagship database, which now sports column-oriented and in-memory options to help it compete with stand-alone products. And its NoSQL database, which offers a combination of key-value and document-store capabilities, also plays a role. While its relational database supports graph data types, there are rumblings that Oracle could be gearing up to deliver (or buy) a full-fledged graph database at some point to fill a gap in its product line when it comes to graph analytics.
Oracle will never be a big believer in open source (even if it does own Java and MySQL). But that doesn’t mean it can’t learn the lessons that open source is teaching when it comes to big data analytics. And judging from recent developments, the company is moving strongly toward addressing many of the emerging needs in big data analytics.