August 25, 2014

Five Steps to Demystify Big Data Analytics

Brett Sheppard

Too many big data initiatives are science projects that take months of effort, risk failure and require highly trained data scientists with scarce skills. According to a CSC survey, 55 percent of big data projects aren’t completed and many others fall short of their objectives.

In “Why Most Big Data Projects Fail,” Dell General Manager Darin Bartik notes that business and IT groups are not aligned on the business problem they need to solve. Employees don’t have access to the data they need, making it impossible to find answers that will make the project successful. Further complicating matters, many of the tools, approaches and disciplines around big data are new, so people lack the knowledge and skills necessary to work with the data and achieve a successful business result.

There is a better way. The following steps can significantly shorten the time-to-value of big data initiatives and reduce their risk.

(1) Enable All Knowledge Workers to Benefit from Big Data

Ashish Thusoo ran the data analytics team at Facebook and realized from his experience that there is high value in democratizing data. As recounted by author Dave Feinleib, “[Thusoo’s] goal was to make all capabilities related to data easy, from instrumenting applications and collecting data, to understanding and analyzing it, to creating data-driven applications.”

Organizations struggle to hire and retain data scientists who understand statistics, computer science and open-source technologies such as Hadoop or NoSQL data stores. According to the McKinsey Global Institute, the United States will experience a shortage of between 140,000 and 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from big data (McKinsey, Big Data: The Next Frontier for Innovation, Competition, and Productivity, May 2011).

One way to address this lack of skills is to adopt technologies that bridge the gap between data scientists and knowledge workers. According to CITO Research in Big Data for Everyone (sponsored by Splunk), you need a “point-and-shoot camera for data,” where product managers, web analysts, risk managers, security analysts and other knowledge workers can simply point at Hadoop or another data store, and start exploring, analyzing and visualizing. Knowledge workers don’t want to be constrained by writing complex fixed schemas or migrating data into a separate data mart for analytics.
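The “point at a data store and start exploring” idea is often called schema on read: fields are discovered when the data is queried, not defined up front. A minimal sketch in Python, using hypothetical JSON event logs (the field names and values here are invented for illustration):

```python
import json
import io
from collections import Counter

# Hypothetical raw event log. No fixed schema was defined in advance;
# note that the records do not even share the same set of fields.
raw_log = io.StringIO(
    '{"user": "a", "action": "search", "latency_ms": 120}\n'
    '{"user": "b", "action": "checkout", "latency_ms": 340, "cart_total": 59.90}\n'
    '{"user": "a", "action": "checkout", "latency_ms": 310, "cart_total": 12.50}\n'
)

# Schema on read: each line is parsed into whatever fields it contains.
events = [json.loads(line) for line in raw_log]

# Explore immediately -- count actions without first migrating the data
# into a separate data mart or writing a fixed schema.
actions = Counter(e["action"] for e in events)
print(actions)  # checkout appears twice, search once
```

A knowledge worker can start asking questions of the raw events right away; only the fields a given query touches need to exist.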

(2) Encourage a Data-driven Business Culture

While insights gleaned from big data can improve decision making, they do not rule out the vagaries of human behavior. All too often, David Sandler’s observation remains true: “People make decisions emotionally and then justify them with data.”

How can you encourage your organization to make data-driven decisions, instead of relying on the “HIPO” (highest-paid person’s opinion)? By combining data-driven metrics with storytelling and visualizations.

In Made to Stick, Chip and Dan Heath document why some ideas survive and others die. While data provides credibility, stories empower people to use an idea through a memorable narrative with unexpected, concrete details.

“When data and stories are used together, they resonate with audiences on both an intellectual and emotional level,” according to Stanford University Professor of Marketing Jennifer L. Aaker in her video Persuasion and the Power of Story. As your big data projects succeed, share within your organization how data made them successful. Sharing these stories is a great grass-roots way to encourage a data-driven business culture.

(3) Stop Sampling and Embrace Raw Data

A hidden secret of many big data projects is that assessments are based on models of a subset of data that’s meant to be a representative sample. While this works fine if you’re trying to determine whether an episode of Glee is as popular as other TV shows, what if you’re a retailer that wants to understand customer interactions across offline and online channels? Or an investment bank that wants to measure risk in a portfolio? You need to be able to search, analyze and visualize raw granular data, not just a sample.

By embracing raw data, you can analyze granular transactional, web and mobile data at massive scale and deliver a score by account, household or segment.
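As a toy illustration of scoring by account from raw records rather than a sample, the sketch below aggregates every hypothetical transaction (the accounts, channels, amounts, and the scoring formula are all invented for this example):

```python
from collections import defaultdict

# Hypothetical raw transactions (account, channel, amount). In practice
# these would stream in from web, mobile and point-of-sale systems.
transactions = [
    ("acct-1", "web",    25.00),
    ("acct-1", "store",  40.00),
    ("acct-2", "mobile", 15.00),
    ("acct-1", "mobile", 10.00),
]

# Aggregate every raw record per account -- no sampling step.
totals = defaultdict(float)    # total spend per account
channels = defaultdict(set)    # distinct channels seen per account
for account, channel, amount in transactions:
    totals[account] += amount
    channels[account].add(channel)

# An illustrative, made-up score: spend weighted by cross-channel breadth.
scores = {a: totals[a] * len(channels[a]) for a in totals}
print(scores)  # acct-1 spans three channels, acct-2 only one
```

Because every transaction is touched, cross-channel behavior (like acct-1 appearing on web, store and mobile) is visible; a sample could easily miss one of those channels for a given account.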

(4) Adopt Complementary Technologies in a Big Data Enterprise Architecture

Use the strengths of data warehouses, business intelligence software, machine data platforms, Hadoop and NoSQL stores, and enable them to coexist in your organization’s data architecture. For example, Cloudera and Teradata jointly published a useful guide outlining requirements that are best suited for either a data warehouse or Hadoop, “Hadoop and the Data Warehouse: When to Use Which.”

The adage, “If all you have is a hammer, every problem looks like a nail,” is true for big data projects, and it’s important to understand the role of every technology. For example, there are worthwhile use cases for do-it-yourself Hadoop and Apache Pig, Apache Hive or SQL on Hadoop, but understanding where to use each, and more importantly, how they complement each other can make or break a big data project. The key is to use the strengths of complementary technologies to support your projects.

(5) Apply Role-based Security for Data Lakes

To move past data silos and take full advantage of low-cost batch storage technologies like Hadoop, many organizations are looking favorably at a “data lake” or “data reservoir” model. In this model, data is stored once and shared by multiple business and IT stakeholders. This architecture requires role-based access controls to protect sensitive data and customer privacy. Someone in finance may be authorized to see customer non-public information such as home address and credit card number, while a marketing analyst sees masked data.
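In practice, role-based access control over a shared data lake often takes the form of a masking layer between the stored record and the user’s view. A minimal sketch, with hypothetical field names and roles (real deployments would tie roles to a directory service and enforce masking in the platform, not application code):

```python
# Fields treated as customer non-public information in this example.
SENSITIVE_FIELDS = {"home_address", "credit_card"}

# Hypothetical role policy: finance may view sensitive fields in the
# clear; marketing may not.
ROLE_CAN_VIEW_SENSITIVE = {"finance": True, "marketing": False}

def view_record(record: dict, role: str) -> dict:
    """Return a role-appropriate copy of a record, masking sensitive fields."""
    if ROLE_CAN_VIEW_SENSITIVE.get(role, False):
        return dict(record)
    return {k: ("****" if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}

customer = {
    "name": "Pat Doe",
    "home_address": "1 Main St",
    "credit_card": "4111111111111111",
    "segment": "premium",
}

print(view_record(customer, "finance")["credit_card"])   # full number
print(view_record(customer, "marketing")["credit_card"]) # masked
```

The key design point is that the data is stored once; what differs per role is the view, so both finance and the marketing analyst work from the same lake without duplicating or re-siloing the data.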


About Splunk

Splunk Inc. provides the leading software platform for real-time Operational Intelligence. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. To learn more, visit