Data Science Operationalization in the Spotlight at Leverage Big Data ’16
How do you create business value from data science? It may be easy to answer from a theoretical standpoint, but actually turning data into business value in the real world is another matter entirely. This week, dozens of the brightest minds from the world of big data attended Datanami’s third annual Leverage Big Data event to discuss their big data experiences and share tips on how to operationalize data science in the real world.
The three-day event was well attended by big data practitioners from a diverse range of industries, including representatives of Ford Motor Company, Credit Suisse, Comcast, Monsanto, and Qualcomm, among others. While each organization uses data in its own unique way, there’s sufficient common ground to facilitate meaningful discussions about how big data challenges can be overcome and what opportunities big data presents. This is what Leverage Big Data is all about.
One of the most compelling keynotes during the conference (an invitation-only Tabor Communications event held at the upscale Hyatt Aviara Resort in Carlsbad, California) came from Peter Bakas, director of engineering for the Cloud Platform Engineering division of Netflix.
Netflix is a true giant among big data bigs. During peak hours, nearly 40 percent of all the traffic on the Internet originates with this company’s video streaming service, which serves up 100 million hours of video content per day. But aside from all that video traffic, the company is a true pioneer in the world of big data analytics (it was an early adopter of Hadoop, Spark, and NoSQL databases), and in particular in doing big data analytics in the cloud; Netflix is, by far, Amazon Web Services’ biggest customer.
As Netflix’s video service took off, the company found its largely batch-oriented analytics systems, with daily or hourly scans, no longer cut the mustard. “I want to deal with the now,” Bakas told the Leverage Big Data audience.
In response, Bakas and his Netflix colleagues re-thought the real-time data pipeline that feeds all sorts of data into analytics, including video viewing data, data on UI activities, error logs, and performance events. At the beginning of 2015, this system was largely based on an Apache Chukwa data collection system that pushed data into Amazon S3 and, ultimately, Elastic MapReduce (EMR) instances and Hive data stores, where Netflix’s analysts and data scientists could access it.
Bakas explained that, in a quest to expose data in a more real-time manner (i.e., with sub-minute latencies), Netflix implemented a Kafka-based messaging backbone. While Kafka worked well, Netflix ran into some issues with the technology, many of which stem from the fact that Kafka was designed by LinkedIn as an on-premises technology, whereas Netflix runs everything in the cloud.
So Netflix developed another piece of technology, which it called Keystone, to help it run Kafka in the cloud, bolster availability, scale back statefulness, and eliminate ZooKeeper problems. Netflix first deployed its Keystone pipeline into production in December 2015. Not everything worked perfectly, however, and one of the first iterations of Keystone had a major failure that would have taken 80 percent of messages with it, had the company not had multiple levels of resilience built in.
In the “fail fast” culture of Silicon Valley, things are expected to break, and Netflix embraces that culture. In fact, it has programs, such as Chaos Monkey and Kafka Kong, that are specifically designed to make things break so resilience can be tested. “You will have failure,” Bakas told the Leverage Big Data audience. “It’s how you respond to failure that matters.”
More good advice came from Matt Walsh, Senior Director of Platform & Analytics at SAP, who spoke on Monday about the need to prepare oneself for the unrelenting pace of change during his keynote, titled “Big Data, Digital Transformation, and the Second Machine Age.”
While the steam engine marked the start of the industrial age over a century ago, connected devices and smart analytics will mark the beginning of a second machine age that will have huge ramifications for how we live and work. Nobody knows for sure what will happen when trillions of connected devices are slinging zettabytes of data across the Internet, but it’s a good bet that there will be a fair bit of disruption.
If you think the pace of change is too much to deal with now, brace yourself for more. “This is as slow as innovation is ever going to be,” Walsh told the crowd, which acknowledged the fact with an approving groan.
While close to 50 percent of organizations in a recent survey are adopting predictive analytics to help give them an edge, that in itself won’t solve all problems. The shortage of data scientists, in particular, jeopardizes organizations’ efforts at digital transformation.
The relentless march of technology threatens to shake up entire industries and send once-successful companies into bankruptcy. Walsh said that even good companies that didn’t do anything wrong will fail, citing Nokia, which once dominated the cell phone industry but sold its handset business to Microsoft.
It’s estimated that 40 percent of the Fortune 500 will be gone in 10 years, whether through bankruptcy, acquisition, or simply closing down. “You need to disrupt yourself before you get disrupted,” Walsh said.
Data governance was a hot topic at the show, and an item discussed during multiple sessions. Barbara Eckman, the principal data architect at cable giant Comcast, discussed the need for strict process controls in her Tuesday keynote address, titled “Data Integration and Governance for Big Data with Apache Avro.”
With tens of millions of devices sending gigabytes of viewership data to Comcast’s data lake every day, keeping schemas in sync across multiple business domains is a critical buffer against digital chaos. “Nobody wants to be governed,” she said. “But without a common schema and an understanding of what the schemas mean, it’s very hard to integrate data.”
Apache Avro, a data serialization framework for Hadoop, has been a boon to the Comcast Technology and Products (T&P) team’s quest to extract meaning from data. “Avro came to the rescue,” Eckman said.
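To make the idea of keeping schemas in sync more concrete, here is a minimal sketch in Python. It does not use the Avro library and is not Comcast's implementation; the record name, the field names, and the helper functions are all hypothetical. It illustrates one simplified rule from Avro's schema resolution: a field that the reading schema expects but the writing schema never produced must declare a default, otherwise older data cannot be read.

```python
import json

# Hypothetical schemas for illustration only; field names are assumptions.
WRITER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ViewEvent",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "channel", "type": "string"}
  ]
}
""")

READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ViewEvent",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "channel", "type": "string"},
    {"name": "region", "type": "string", "default": "unknown"}
  ]
}
""")


def missing_defaults(writer, reader):
    """Return reader fields that data written with `writer` cannot satisfy.

    Simplified: type promotion, aliases, and nested records are ignored.
    """
    written = {f["name"] for f in writer["fields"]}
    return [
        f["name"]
        for f in reader["fields"]
        if f["name"] not in written and "default" not in f
    ]


def resolve(record, writer, reader):
    """Project a record that matches `writer` onto the `reader` schema."""
    problems = missing_defaults(writer, reader)
    if problems:
        raise ValueError(f"incompatible schemas, no default for: {problems}")
    written = {f["name"] for f in writer["fields"]}
    return {
        f["name"]: record[f["name"]] if f["name"] in written else f["default"]
        for f in reader["fields"]
    }


old_event = {"device_id": "stb-1", "channel": "news"}
print(resolve(old_event, WRITER_SCHEMA, READER_SCHEMA))
```

In this sketch, an event written before the `region` field existed can still be read, because the newer schema supplies a default. Dropping that default would make `missing_defaults` report `region` and `resolve` raise an error.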
There’s no avoiding the complexity that comes with performing data governance at scale, said Asif Alam, the Global Business Director for the Technology Sector at Thomson Reuters, one of the biggest information brokers serving the financial services industry.
Just keeping an accurate archive of market data is a challenge, considering Thomson Reuters is charged with maintaining a 30-year history of market tickers across 400 global exchanges, each of which generates raw data in different formats.
As the data has become bigger and more complex over the years, Thomson Reuters has evolved from being just a data broker to serving actionable information and even knowledge to customers, all on demand via APIs.
“Data governance is becoming a big problem,” Alam said during his keynote address Sunday evening. “Customers don’t want to deal with it. We deal with it.”
Going “up the pyramid,” as Alam puts it, puts a heavy emphasis on analyzing data, but it’s also created a bottleneck. Thomson Reuters leans heavily on tools to help deal with the crush of analytics, and according to Alam, those tools have made great progress over the years. “Problems we’re solving now were impossible to solve three years ago,” he says.
No Silver Bullets
Maintaining data for different classes of users is difficult, especially when each type of user wants something unique. While there are no simple answers, tools and technology can help.
Jason Tokayer, a data architect with Capital One, gave a compelling keynote on how his team is implementing data as a service while short-cutting some of the time-consuming complexities that make ETL such a painful experience in the big data age.
Tokayer’s approach relies on using data ontologies to create linked data that can be quickly called upon when needed. By storing data as entities within HBase, Tokayer says his team was able to eliminate the impedance mismatch between how data is physically stored and how it’s conceived. “We’re trying to change the way that people think about data,” he says.
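As a rough illustration of what storing data as linked entities can look like, the Python sketch below mimics a wide-column layout of row keys and `family:qualifier` columns using plain dictionaries. It is not the HBase client API and not Capital One's design; the `EntityStore` class, the `attr` and `link` column families, and the sample keys are all assumptions made for the example.

```python
from collections import defaultdict


class EntityStore:
    """In-memory stand-in for a wide-column table: row key -> {column: value}.

    Hypothetical helper for illustration; it only imitates the row/column layout.
    """

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        # Columns are addressed as "family:qualifier", as in wide-column stores.
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.rows.get(row_key, {}))

    def linked(self, row_key, relation):
        """Follow a link column to the entity it points at, if any."""
        target = self.rows.get(row_key, {}).get(f"link:{relation}")
        return self.get(target) if target is not None else None


store = EntityStore()
store.put("customer#42", "attr", "name", "Ada")
store.put("customer#42", "link", "primary_account", "account#7")
store.put("account#7", "attr", "type", "checking")

# One lookup on the customer, one hop along the link, no join or ETL step.
print(store.linked("customer#42", "primary_account"))
```

The point of the sketch is that the stored shape (an entity with attributes and named links to other entities) matches the way people talk about the data, which is one reading of the impedance mismatch Tokayer describes.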
Your humble Datanami editor also moderated two panels during the event, including one on operationalizing data science. Radhika Subramanian, CEO of data science solutions provider Emcien, argued that most of the data we’re analyzing is of questionable value, and that much of the analytics we’re doing amounts to little more than optimization of cat videos. Data science is hard, and there are no easy fixes.
“Don’t try to Hadoop your way out of it,” Subramanian declared (perhaps setting a precedent by using “Hadoop” as a verb for the very first time).
Vivek Sakhrani, a senior consultant at management consulting firm CPCS Transcom, discussed the importance of geospatial and time-series data to clients, including energy firms and governments building large transportation projects. This data can be quite sparse in some remote regions of Africa where CPCS Transcom clients have big infrastructure projects, but you do the best you can with the data you have, he said.
Sanjeev Kapor, a senior project manager at Ford, talked about the importance of developing a good culture for data science to blossom. “It’s all about the people,” he said. And Tassos Sarbanes, a mathematician and data scientist with Credit Suisse, reminded Leverage Big Data attendees that data science isn’t about making reports run faster, but answering the big, fundamental questions. “Analytics is a serious business!” he exclaimed.
During a panel on data storage strategies, the difficulty of balancing multiple variables—including data growth, application access patterns, governance, formats, and cost—came to the forefront.
Vishnu Kannan, an application architect lead at Monsanto, says his company is analyzing software-defined storage strategies, including using the open source object storage system Ceph. Shane Corder, an HPC systems engineer at Children’s Mercy Hospital in Kansas City, talked about the struggle to maintain an archive of genetic data. By law, the data must be stored for 21 years, and Children’s Mercy is currently using tape to meet that requirement.
The cloud looms large in nearly everybody’s storage strategies, but as Netflix’s Peter Bakas explained, the cloud is not a cure-all for data storage woes. The data may not reside in your data center, but it still must be managed, he said.
The three-day event was a success for all involved. That includes the attendees, who appreciated the intimacy of the show and the give-and-take dialog, as well as the sponsors, which included IBM, Cray, DDN, Paxata, Dell, Phemi, Impetus, Aerospike, Looker, and Novetta. Planning is already beginning for the next Leverage Big Data, tentatively scheduled for San Diego in 2017. Keep your eye on the Leverage Big Data website (www.leveragebigdata.com) for more news about Leverage Big Data ’17.