Splunk Pumps Up Big Data with Hunk
Hadoop users who are looking for another way to explore and visualize their big data sets may want to check out Hunk, the new product built on MapReduce that Splunk shipped today. Hunk lets users apply the same type of data visualization and analytic processing that Splunk Enterprise users are accustomed to, but against any data residing within Hadoop.
If one were to follow the buzz emanating from the area of machine-generated data, it would likely trace back to Splunk. The San Francisco software company is doing a bang-up job of transitioning from the tired and boring world of IT log consolidation and management into the bright and shiny new world called the Internet of Things.
Splunk’s big data story has unfolded in parallel to the rise of Hadoop. Whereas many companies have pumped all sorts of semi-structured and unstructured data into their Hadoop repositories with the strategic idea that it will be useful at some point in the future, Splunk deployments typically follow a more tactical approach.
Many Splunk customers, such as Domino’s Pizza, start out collecting IT-related information from applications, Web servers, databases, networks, telecom equipment, and sensors, with the goal of driving efficiency into IT processes. After they become familiar with Splunk and grow to like its dashboards, report generation, and real-time alerting capabilities, they start applying Splunk to other types of data. In Domino’s case, it expanded its use of Splunk to analyze food orders coming in over the Web.
Splunk Enterprise runs on standard Windows and Linux-based servers, and stores data in “Splunk Buckets” residing on local disk or SANs. If a customer is storing data in a standard relational database, it can be brought over with connectors. No fancy Hadoop or exotic NoSQL data stores here.
The company started bringing some Hadoop-resident data into the “Splunkesphere” in 2012 with the launch of Splunk Hadoop Connect. The problem with that approach is that some data sets in Hadoop are simply too big to move into the Splunk environment. (In many cases, that’s why the data is in Hadoop in the first place.)
So Splunk made Hunk specifically to tackle this problem, and to enable users to extend their investment in Splunk Enterprise and apply it to Hadoop-resident data. It’s another take on the in-database analytic approach that has become popular recently.
Hunk runs atop any standard Hadoop distribution, and effectively delivers the Splunk Enterprise stack for Hadoop. This enables users to build and consume the same types of analytical and data visualization dashboards and reports for Hadoop-bound data as they can in Splunk Enterprise for machine-generated data stored in a standard NFS or CIFS file system.
There are a couple of interesting technical differentiators in Hunk that are worth pointing out. For starters, the company touts what it calls its Splunk Virtual Index, which it says “decouples the data storage tier from the data access and analytics tiers.” The net result of this is that it speeds up search times in Hadoop. It also allows users to search Splunk Enterprise and Hadoop data stores with a single query.
Then there’s “schema on the fly,” another technology under Hunk’s covers. Schema on the fly applies structure to data the moment a query is run, according to Splunk. This allows users to explore the data sets as they see fit, without having to think about the questions they would like to ask of the data beforehand, as they would do if using SQL or Apache Hive against their Hadoop data. The software will automatically add structure and identify things in the data that would most likely interest the user, such as keywords, patterns, and top values.
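To make the idea concrete, here is a minimal Python sketch of applying structure at query time. It is an illustration of the general schema-on-read technique, not Splunk's implementation: the sample events, the key=value format, and the function names `extract_fields` and `top_values` are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical raw events. No schema is declared before the query runs.
RAW_EVENTS = [
    "2013-01-01 12:00:01 status=200 method=GET path=/order",
    "2013-01-01 12:00:02 status=500 method=POST path=/order",
    "2013-01-01 12:00:03 status=200 method=GET path=/menu",
]

# Assumed convention: fields appear in the raw text as key=value pairs.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")


def extract_fields(line):
    """Derive a field dictionary from one raw line at read time."""
    return dict(KV_PATTERN.findall(line))


def top_values(lines, field, n=3):
    """Return the n most common values of a field, parsing lines on demand."""
    counts = Counter()
    for line in lines:
        fields = extract_fields(line)
        if field in fields:
            counts[fields[field]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_values(RAW_EVENTS, "status"))
```

Because fields are derived only when a question is asked, the same raw lines can be queried by `status` today and by `path` tomorrow without reloading or remodeling the data.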
Splunk says results start coming back immediately after a user submits a query in Hunk, while the MapReduce job continues to run in the background. However, don’t confuse this with Storm or another streaming Hadoop technology. Events cannot be streamed into Hunk for real-time analysis, as they can in Splunk Enterprise. This software is still part of a batch-oriented workflow. There is no real-time searching of Hadoop data in Hunk, although a preview of this is available. Time-series data also isn’t supported, and data models and report acceleration features are not available in Hunk.
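The difference between early results from a batch job and true streaming can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hunk is built: the `splits` list stands in for fixed chunks of data already sitting in Hadoop, and `running_matches` is a hypothetical name.

```python
def running_matches(splits, keyword):
    """Yield a running match count each time one split of stored data finishes.

    The input is a fixed, already-stored data set (batch), so the final
    answer is known only after the last split. The caller simply gets to
    see partial totals while the remaining splits are still being scanned.
    """
    total = 0
    for split in splits:
        total += sum(1 for line in split if keyword in line)
        yield total


if __name__ == "__main__":
    # Hypothetical data, pre-divided into three splits.
    splits = [
        ["error disk full", "ok"],
        ["ok"],
        ["error timeout", "error retry"],
    ]
    for partial in running_matches(splits, "error"):
        print("matches so far:", partial)
```

The partial totals arrive early, but nothing new can enter the data set while the job runs, which is the sense in which the workflow remains batch-oriented.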
Hunk supports all standard Hadoop distros, including Hadoop version 1.2 and Hadoop version 2 offerings from Amazon EMR, Cloudera, Hortonworks, IBM BigInsights, MapR Technologies, and Pivotal. Pricing starts at $2,500 per Hadoop node.