This Week’s Big Data Top Five
It’s been an incredible week in the world of data-intensive computing with a number of announcements filtering in from Germany, the site of the annual International Supercomputing Conference.
With high performance computing systems and software moving closer to the machines and code required to power massive data powerhouses, it was no surprise that “big data” was at the forefront of many conversations and sessions at ISC. From the Graph 500 (which we discuss in more detail here) to announcements from storage, network and system makers that tout their ability to power the next generation of big data problems, it became abundantly clear that even the supercomputer folks are recognizing the need for further data-intensive system development.
Let’s get started with five stories that landed on our radar this week, starting with one company that has deep roots in supercomputing and hopes to improve its position on the Graph500 list.
Convey Points to Big Data System Improvements
Texas-based Convey Computer Corporation today announced two new entries on the Graph500 List (read more about that list here); the entries increase performance three to four times over their prior results. Convey credits the speed up to their recently developed breadth-first search (BFS) implementation and to the increased processing power of the newly introduced HC-2 Series.
Convey hybrid-core systems use what the company calls personalities — customized instruction set architectures that accelerate specific portions of an application, in this case the BFS algorithm. Using the new personality, the single-node Convey HC-1 increased performance from 1.7 GTEPS to 5.9 GTEPS (billions of traversed edges per second), a 3.5x improvement on problem size 27. On the new HC-2ex, the result was even more dramatic, clocking in at 7.8 GTEPS for a 4.5x increase over the earlier HC-1.
Convey says that this kind of single-node performance reinforces its position as a performance/power leader in executing graph applications, although the Graph500 list itself is heavily weighted in favor of IBM BlueGene and Fujitsu systems.
Convey’s latest version of the BFS is a highly threaded algorithm that uses over three thousand independent threads of computation, providing massive parallelism that speeds up graph applications generally. In addition, the hybrid nature of the architecture allows portions of the benchmark to execute simultaneously across the system’s multiple compute resources.
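For readers unfamiliar with the benchmark, the Graph500 kernel is a breadth-first search, and TEPS (traversed edges per second) is derived from the number of edges the search touches divided by its runtime. A minimal single-threaded sketch in Python (the graph and edge-counting details are simplified for illustration; the official benchmark defines the edge count and timing more carefully):

```python
from collections import deque

def bfs_teps(adj, source, seconds):
    """Level-by-level breadth-first search over an adjacency list.

    Returns the BFS parent map and a rough TEPS figure:
    edges scanned divided by the kernel's runtime in seconds.
    """
    parent = {source: source}
    frontier = deque([source])
    traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            traversed += 1           # every scanned edge counts as a traversal
            if w not in parent:      # first visit: record the BFS tree parent
                parent[w] = v
                frontier.append(w)
    return parent, traversed / seconds

# Tiny undirected example graph: edges 0-1, 0-2, 1-3 (each listed both ways).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
parent, teps = bfs_teps(adj, 0, seconds=1.0)
```

At Graph500 scale the frontier is processed by thousands of threads at once rather than one queue, which is where implementations like Convey's get their GTEPS figures.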
“Our latest Graph500 numbers continue to show that Convey’s hybrid-core systems provide pound-for-pound and watt-for-watt superior processing power compared to other systems on the list,” explained Bruce Toal, CEO of Convey. “Graph algorithms are used to solve many of today’s advanced analytics challenges in areas such as genomic research, data analytics, and security. Convey is at the forefront of providing a new and better way to increase performance through our hybrid-core technology.”
Partnership Puts Government and Big Social Data Together
Connotate, Inc., which helps organizations monitor and collect data and content from the Web, and Digital Reasoning, which focuses on unstructured data analytics at scale, announced a partnership to give government agencies actionable intelligence from fact-based analysis of social media’s big data.
“The problem is not whether we can glean insights from vast quantities of data but how we get to them quickly. That’s where Connotate comes in,” said Digital Reasoning CEO Tim Estes. “The Connotate solution delivers large volumes of the right data, in a structured format from any Web source to the Synthesys analytics tool, allowing us to see more – enabling a streamlined process of ‘read, resolve and reason’ that allows organizations to quickly make smart strategic decisions.”
The companies point to the relentless onslaught of new information generated by the Internet as the primary obstacle on the road to big data success. They claim that because social media is human data, created by humans, it is non-specific in nature, complicated, and messy.
Connotate and Digital Reasoning say the partnership will help government and enterprise organizations alleviate these pain points. Connotate’s ability to monitor dynamic social media sources, automatically reformat large-scale data into simple formats, and deliver it to Digital Reasoning’s machine-learning text analytics solution will help government agencies and businesses work toward a deeper understanding of how they are perceived and connected to the world around them.
Rather than analyzing all of the world’s data, this partnership instead focuses on leveraging only relevant, timely information so that government agencies can accurately link people and organizations to a myriad of related data points, including time and location. The two companies say this capability is crucial to government agencies as well as enterprises conducting competitive intelligence or internal audits.
Karmasphere Analyzes Big Data Skills Trends
Karmasphere announced the results of a recent survey of over 350 data professionals, “Trends and Insights into Big Data Analytics,” which points to a serious shortage of skilled professionals across the big data ecosystem. That shortage, the company claims, is driving the need for collaborative, self-service access to Hadoop at companies of all sizes. Among the survey’s findings are the following:
The Need for Self-Service Big Data Analytics
According to the survey, 60% of respondents agree that data analysts in their organizations lack the technical skills to analyze data on Hadoop. Seventy percent (70%) agreed on the need for self-service access to Hadoop, defined as the ability to grab raw, unstructured, detailed data and then create ad hoc queries and find insights; nearly 37% strongly agreed. Over half (52%) of respondents either have Hadoop in production or have a Hadoop cluster running, and these professionals indicate an even more acute need for self-service access, with 81% expressing it.
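The skills gap the survey describes is easy to picture: before self-service tools, even a basic ad hoc question over raw data on Hadoop typically meant writing MapReduce code. A generic sketch of the Hadoop Streaming word-count pattern in Python (an illustration of the programming model, not tied to any product mentioned here):

```python
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) pairs from raw text lines, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each key; streaming delivers pairs sorted by key."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Running both stages locally over two raw lines of text.
counts = dict(reducer(mapper(["big data", "big clusters"])))
```

Self-service analytics tools aim to hide exactly this mapper/reducer plumbing behind a graphical or query-driven interface.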
Big Data Teams are Cross-Functional
The survey also revealed that Big Data teams are typically cross-functional, composed of data analysts, IT, BI and line-of-business analysts. Survey respondents indicated an average of 60 business users assigned to Big Data teams, along with an average of 17 data analysts, 21 system administrators and 8 BI specialists.
“This survey reinforces what many of us have long presumed, that self-service Hadoop access is the key to unlocking the power of Hadoop for everyone in the business. These data professionals want an easy, graphical and intuitive way to grab their data, explore their data and create actionable insights that can be shared in a collaborative way with their team members,” said Rich Guth, CMO, Karmasphere.
Marketing and Product Optimization are Key Drivers for Big Data Analytics
Although it’s clear from the survey that all departments benefit from Big Data analysis on Hadoop, respondents say marketing gains the most: 22% chose marketing as the number one department benefiting from Big Data, 19% chose engineering, and product management and operations tied at 14% each.
SQL is the Prevailing Skill Set for Analytics
The survey shows that nearly three quarters (74%) of respondents chose SQL as the prevailing skill set for Big Data analytics, with Java coming in second at 58%.
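The SQL finding is unsurprising: a typical analytics question is a few lines of declarative SQL regardless of which engine runs it. A small sketch using Python’s built-in sqlite3 as a stand-in for a SQL-on-Hadoop engine such as Hive (the table and figures below are invented for illustration):

```python
import sqlite3

# In-memory database standing in for a warehouse or Hadoop-backed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (dept TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("marketing", 120), ("marketing", 80), ("engineering", 50)],
)

# The kind of ad hoc aggregation survey respondents reach for SQL to write.
rows = conn.execute(
    "SELECT dept, SUM(visits) FROM clicks GROUP BY dept ORDER BY dept"
).fetchall()
```

The same GROUP BY in hand-written Java MapReduce would run to dozens of lines, which goes some way toward explaining SQL’s lead in the survey.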
Big Data is Growing
The survey also confirmed that vast amounts of data are being held in Hadoop clusters. Thirteen percent (13%) of respondents indicated they are storing 100 or more terabytes of data in their Hadoop cluster, 6% indicated 41-100 terabytes, nearly 50% said they are storing between 2 and 40 terabytes, and 32% indicated they are storing 1 terabyte.
Panopticon Brings Big Data Viz to HANA
Visual data analytics software company Panopticon announced that it has optimized the latest release of its data visualization software suite to support the SAP® In-Memory Appliance, SAP HANA.
Clients utilize HANA to manage data in conjunction with SAP ERP systems and many firms also use HANA as a replacement for traditional data stores. The visualization software company says a large number of customers have expressed interest in using Panopticon data visualization tools with HANA to support real-time operational decision-making and “big data” analytics.
Panopticon accesses and displays HANA data in real time. In addition, both HANA and Panopticon allow users to federate external data sources (including CEP engines, message queues, tick databases, traditional relational databases, and OData sources) into their analytic models, further amplifying the utility of the combined HANA/Panopticon platform.
The Panopticon platform makes full use of HANA’s in-memory capabilities and returns interactive results that are used for instant visual analysis. The combined architecture minimizes latency by eliminating the need for abstraction layers and data warehouses and allows users to directly access data stored in HANA.
Peter Simpson, SVP Research & Development for Panopticon Software, stated, “The ability to use Panopticon as the user interface for SAP HANA means that our customers can assemble a highly optimized platform for visual analysis of real-time data very quickly.”
Terascala Aims to Make Big Data Fast Data
Terascala has taken to calling itself the “Fast Data Company” lately, a reference to its stated ability to make big data workloads crunch faster. This week it announced LustreStack, which it dubs “the industry’s first software suite and open framework for speeding the development and deployment of Lustre-based storage appliances on industry-standard hardware based on Intel Xeon processors.”
The company says that LustreStack enables storage providers to deliver high-throughput, high-performance storage appliances to the High Performance Computing (HPC) and Fast Data markets that are easy to install and use, simple to support, and optimally tuned. Currently, LustreStack powers Dell’s Terascala HPC Storage Solution (DT-HSS) and the EMC VNX HPC series solution.
The initial LustreStack suite incorporates a tightly integrated toolset: break/fix management of all Lustre file system hardware, including comprehensive alerting and notification; complete Lustre file system management; rich analytics for workload and application optimization; and an open, configurable Lustre Appliance Dashboard for appliance health status and diagnostic drill-down.
“Terascala’s LustreStack appliance software enabled us to deploy a Lustre solution with minimal effort that met our performance and capacity needs just hours after the equipment arrived on site,” said Jeff McDonald, assistant director of high performance computing operations at the University of Minnesota. “With Terascala, we gain streamlined and simplified Lustre administration and monitoring along with redundant and robust hardware.”
In the coming months, Terascala and its partners will announce additional integrated tools designed to extend and ease Lustre’s adoption into mainstream datacenters and applications. LustreStack is currently available. Individual tools may be licensed separately or in bundles.