Picking the Right Tool for Your Big Data Job
There is a lot of debate in the big data space about tools and technology, and which ones are best. Is SQL better than NoSQL? Hadoop or Spark? What about R or Python? Of course no single tool or technology is the best for all situations, and you would do well to pick the right tool or technology for the job at hand.
At a high level, big data jobs fall into two buckets: transactional and analytical. Broadly speaking, transactional jobs execute in real-time, whereas analytical jobs have more leniency in execution times. Customers will often use an analytic platform like Hadoop or an enterprise data warehouse (EDW) like Teradata to discover insights in their data, and then put those insights to use in their transactional systems. But that general pattern may be shifting.
Analytical Tech for Big Data
For many, Hadoop is still top of mind when the words “big data” are uttered, and this open source technology is certainly getting attention from organizations far and wide. Doug Cutting, the original creator of Hadoop, theorized at Strata + Hadoop World last fall that Hadoop could someday be the foundation of transactional systems. However, today it’s still firmly situated in the analytic camp. And while there are efforts to bolster Hadoop with real-time capabilities (through various SQL engines, Tez, YARN, Spark, Storm, etc.), most customers continue to build and run Hadoop applications in traditional batch mode with MapReduce.
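The batch pattern that MapReduce implements can be sketched in a few lines of plain Python. This is a toy illustration of the programming model, not the Hadoop API: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(docs):
    # Emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, as Hadoop's shuffle/sort step does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group; here, a simple per-word sum
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big data jobs"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "tools": 1, "jobs": 1}
```

In a real cluster each phase runs in parallel across many nodes, which is why the pattern suits large batch jobs but not low-latency transactional work.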
Hadoop got a little boost last week when Gartner lowered the bar of entry for what it means to be a data warehouse, opening the door to including non-relational data stores such as Hadoop in its influential reports. There’s no doubt that Hadoop has caused major disruption in the world of EDWs and analytics. However, the analyst group notes that traditional relational EDWs, such as those from Teradata, IBM, and Oracle, still dominate when it comes to spending.
Interest is surging in machine learning (ML) algorithms, particularly as the growth of data from the Internet of Things, social media, and other sources ramps up. When it comes to surfacing insights and connections hidden in hundreds of terabytes or even petabytes of data, ML algorithms running on Hadoop make a powerful combination. Hadoop is also tough to beat when it comes to archiving large amounts of structured and unstructured data for later, as-yet-undetermined uses. When the data lives in the cloud, services such as Amazon’s Elastic MapReduce excel.
While Hadoop gets the lion’s share of attention when it comes to big analytics, it would be a mistake to overlook other tools that have something to add. In particular, if you need lightning-fast access to large amounts of mostly structured data, a column-oriented database may be a good choice. Examples of column-oriented databases include Actian’s ParAccel, EMC’s Greenplum, HP Vertica, Infobright, IBM Netezza, MonetDB, and Teradata Aster, among others. Bigger organizations are benefiting from using Hadoop to ingest massive amounts of semi-structured data and give it greater structure, and then using a columnar data store to run deep analytics on it.
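The column-oriented advantage is easy to see in miniature. Storing each attribute contiguously lets an analytic query scan only the columns it needs; the hypothetical Python sketch below contrasts the two layouts (real columnar engines add compression, vectorized execution, and much more).

```python
# Row store: each record kept together (good for transactional lookups)
rows = [
    {"user": "a", "country": "US", "spend": 120},
    {"user": "b", "country": "DE", "spend": 80},
    {"user": "c", "country": "US", "spend": 50},
]

# Column store: each attribute kept together (good for analytic scans)
columns = {
    "user": ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "spend": [120, 80, 50],
}

# An aggregate over one attribute touches a single contiguous column...
total_spend = sum(columns["spend"])
# ...while the row layout forces a read of every field of every record
total_spend_rows = sum(r["spend"] for r in rows)
```

Both sums are 250, but at scale the columnar scan reads a fraction of the bytes, which is where the "lightning-fast" analytic performance comes from.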
Recently, Spark has emerged as a possible alternative to MapReduce for big analytic applications. Spark, which came out of UC Berkeley in 2010 and is being commercialized through Databricks, can run on HDFS, but it doesn’t have to. In addition to being 10 to 100 times faster than MapReduce, Spark applications can be developed in Java, Scala, and Python. The open source software comes with a library of ML and graph algorithms, and also supports real-time streaming and SQL apps, via Spark Streaming and Shark, respectively. Several big data analytic software providers who develop proprietary ML algorithms will be announcing support for Spark next week. It’s worth keeping a close eye on when it comes to analytic apps for big data.
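Part of Spark’s appeal is that a job reads as one chain of transformations rather than as separate map and reduce stages. The flavor of that style can be mimicked in plain Python; this sketch illustrates the programming model only, and the comments name the roughly equivalent Spark operations, not the Spark API itself.

```python
from functools import reduce
from itertools import chain

lines = ["spark runs on hdfs", "spark supports python"]

# flatMap: split each line into words and flatten the result
words = chain.from_iterable(line.split() for line in lines)
# map: pair each word with a count of 1
pairs = ((word, 1) for word in words)
# reduceByKey: fold the pairs into a per-word total
counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
# counts["spark"] == 2
```

In Spark itself the chain would run lazily and in parallel across a cluster, with intermediate data held in memory, which is where the large speedups over disk-bound MapReduce come from.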
Transactional Apps for Big Data
Once you’ve sorted and sifted your big data set for valuable insights, you’ll probably want to put them into action with your production big data applications. Running a big data production app introduces a whole host of other problems, and you’ll need a separate set of tools to solve them.
For transactional jobs, NoSQL databases have rapidly grown in popularity due to their capability to scale horizontally on commodity hardware, to handle structured and unstructured data types, and to automatically shard data across multiple nodes in a distributed cluster. There are multiple types of NoSQL databases, including key/value stores, wide-column stores, document stores, and graph databases, and each type offers advantages over others.
For example, a basic key/value store, like Redis or Memcached, excels at storing vast amounts of schema-less data, which is very handy for quickly caching and retrieving content. Document-oriented databases, such as MongoDB and Couchbase, support more complex data types, such as JSON documents, and offer their own query language or APIs.
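A key/value store’s core contract is tiny: set a value under a key, get it back, optionally let it expire. The sketch below imitates that contract in plain Python; Redis and Memcached implement far more (networking, eviction policies, persistence), and the TTL handling here is a deliberate simplification.

```python
import time

class TinyKVCache:
    """A minimal in-process key/value cache with optional expiry."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._store[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.monotonic() > expires_at:
            del self._store[key]  # lazily evict the expired entry
            return default
        return value

cache = TinyKVCache()
cache.set("page:/home", "<html>...</html>", ttl=60)
cache.get("page:/home")  # returns the cached page until the TTL lapses
```

Because there is no schema and every operation is a single key lookup, this model shards and scales trivially, which is exactly why it suits content caching.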
A graph database, such as Neo4j, might be in order if you want to make sense of highly connected data, such as social data; however, graph databases have trouble scaling on distributed clusters. If your transactional application needs to scale far and wide, a wide-column store, such as Cassandra or HBase, offers good performance on queries over data sets into the hundreds of terabytes.
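The scaling trouble with graphs comes from traversal: a query like “friends of friends” hops from edge to edge, and once the graph is sharded across machines, each hop can become a network round trip. A toy adjacency-list traversal in Python shows the access pattern (illustrative only; it is not Neo4j’s API, and the names are made up).

```python
from collections import deque

# Adjacency list: user -> set of friends
graph = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dina"},
    "carl": {"ann"},
    "dina": {"bob"},
}

def friends_of_friends(graph, start):
    """Collect everyone exactly two hops from start (BFS to depth 2)."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == 2:
            result.add(node)
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return result

friends_of_friends(graph, "ann")  # {"dina"}
```

Every hop reads a different node’s adjacency list; on a single machine that is a pointer chase, but on a sharded cluster each hop may land on a different server, which is why graph workloads resist horizontal scaling.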
NoSQL databases have quickly grown in popularity, but there are some trade-offs that users should be aware of. For starters, you most likely won’t be using SQL to query your data; instead you’ll be using a database-specific language that you’ll need to learn, such as SPARQL or Couchbase’s N1QL. You will likely also have to give up the ACID guarantees of data consistency and durability that SQL databases have adhered to for years.
It all depends on what you want to accomplish, says Couchbase CEO Bob Wiederhold. If supporting a high level of read and write performance in a Web or mobile application is your goal, then Couchbase can deliver that in an easy-to-use package. “But you pay a durability penalty. And for many types of data access, that’s fine,” he says. “Increasingly, the application developers, the people developing mission- and business-critical applications, they believe the trade-offs of NoSQL are better trade-offs.”
Recently, a new crop of relational, SQL-based databases dubbed NewSQL databases have been attracting more attention, not to mention venture capital funding. Companies like Clustrix, NuoDB, VoltDB, MemSQL, and EnterpriseDB offer some of the benefits of NoSQL databases, such as horizontal scalability on commodity hardware and automatic sharding of data across multiple nodes in a cluster, but do so within the well-defined bounds of the Structured Query Language and while maintaining adherence to ACID precepts.
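What ACID buys in practice can be seen with any relational engine, even SQLite from Python’s standard library: a multi-statement transfer either commits whole or rolls back whole. SQLite is a single-node stand-in here; the NewSQL vendors above aim to make the same guarantee hold across a distributed cluster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # the with-block is one atomic transaction
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
        # Consistency check: no account may go negative
        overdrawn = conn.execute(
            "SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()[0]
        if overdrawn:
            raise ValueError("insufficient funds")
except ValueError:
    pass  # the whole transfer was rolled back, not half of it

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {"alice": 100, "bob": 0} -- untouched, atomically
```

Getting that all-or-nothing behavior without a relational database means hand-writing compensation logic in the application, which is the durability and consistency penalty the NoSQL trade-off discussion above refers to.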
The similarity that NewSQL databases have to traditional relational databases like Oracle, MySQL, SQL Server, and DB2 means they’re often used as replacements. But recently, the NewSQL vendors have been moving more into analytic applications. MemSQL recently added a columnar data store to its in-memory database that allows its customers to run basic analytic computations on transactional data.
Clustrix is also seeing strong movement to in-application analytics on the part of its customers. “In order to monetize these millions of customers who are using your application, you’re going to want to understand them and the interaction patterns with the data and the community,” Clustrix CEO Robin Purohit told Datanami this week. “That means connecting the dots pretty quickly with all these pieces of data interactions, which looks more and more like analytics. And by the way, that looks a lot easier to do with something like SQL rather than writing custom code.”
The company is working on helping customers join data across a large number of data sets. That doesn’t mean in-memory NewSQL databases, which rarely exceed 100TB in size, will be doing the heavy-duty analytic lifting. “If you want to do large-scale, complex analytics, you really want to use a columnar store or Hadoop,” Purohit says.