Finding Big Data Treasure in the Cloud
Heading into 2014, one of the big data trends that will intensify is the transition toward end-to-end data analytics services hosted in the cloud. One of the promising big data cloud services is Treasure Data, a Silicon Valley company that offers an interesting mix of MapReduce, columnar databases, and intelligent agent technology aimed at helping clients get a quick return on their big data investments.
Treasure Data was launched two years ago by former Red Hat engineer Hiro Yoshikawa and Kaz Ohta, who helped implement one of the largest Hadoop clusters in Japan. Ohta was amazed by Hadoop’s capabilities, but not so thrilled about the level of technical complexity it entailed, says Rich Ghiossi, vice president of marketing for the company.
“He came out of that experience thinking this is really cool technology, but there are things that we can do differently,” Ghiossi says. “He said, ‘Let’s take all the difficulty away. Let’s not have to hire an army of people who know MapReduce or an army of people who know how to deploy a particular distribution on this particular hardware subset. Let’s make that all transparent to the user.'”
CTO Ohta and CEO Yoshikawa have strived to do that with Treasure Data. The offering runs on Amazon’s cloud, and combines the MapReduce component of Cloudera’s CDH with Plazma, the company’s own multi-tenant columnar database, which it uses in place of HDFS. Treasure Data also developed its own intelligent agent technology, called Treasure Agents, which pre-process and transform data before it’s loaded into the database for analysis.
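Treasure Agents are packaged around the open source Fluentd log collector, which Treasure Data maintains. A minimal collection setup might look roughly like the sketch below; the log path, tag, and API key are placeholders, and parameter names may vary between agent versions.

```
# Hypothetical td-agent (Fluentd) configuration sketch; the path, tag,
# and API key are illustrative placeholders, not values from the article.
<source>
  type tail                     # follow an application log file
  path /var/log/app/events.log
  format json
  tag td.production.events      # tag encodes database.table
</source>

<match td.*.*>
  type tdlog                    # Treasure Data output plugin
  apikey YOUR_TD_API_KEY
  auto_create_table
  flush_interval 60s            # batch and upload every minute
</match>
```

The agent buffers records locally and uploads them in batches, which is how the pre-processing and transformation happens before data lands in the columnar store.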
“Our approach is to streamline that whole data pipeline, from data acquisition to storage to analysis,” Ghiossi says. “We do it in a relatively easy process, and start delivering value within days.” Companies can typically start getting meaningful data out of their Treasure-hosted Hadoop cluster within 14 days, the company says.
As we reported in August, Treasure Data’s approach was validated with $5 million in Series A venture capital funding earlier this year. Since launching the service in 2012, the company has attracted more than 90 customers, and is now storing 2PB of data for its customers. That corresponds to about 2 trillion rows of data, an amount that doubled in the past eight weeks, Ghiossi says.
Today, Treasure Data announced a partnership with big data darling Tableau Software that will see Tableau’s popular data visualization software integrated into Treasure’s service. The companies have had joint customers in the past, but the new partnership will undoubtedly bring Tableau’s brand of hands-on visualization to more Treasure customers.
Customers are free to access their Treasure data however they want, but most use either a BI tool like Tableau’s or query it directly through SQL, HiveQL, Pig, or MapReduce. As customers’ data builds up in Treasure, it can become harder to keep track of. So last month, the company unveiled a low-end visualization tool called Treasure Viewer that makes it easier for users to get a quick glimpse of their data. It also unveiled the Treasure Query Accelerator, a version of Cloudera Impala customized to work with its columnar database. The Treasure Query Accelerator can boost query performance by anywhere from 6x to 60x, Ghiossi says.
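As a sketch of the HiveQL access path, a query like the following could roll up player activity of the kind described below; the table and column names are hypothetical illustrations, not Treasure Data’s actual schema.

```sql
-- Hypothetical HiveQL rollup over a game-events table; table and
-- column names are illustrative, not an actual Treasure Data schema.
SELECT player_id,
       COUNT(*)   AS events_last_day,
       MAX(stage) AS furthest_stage
FROM game_events
WHERE time > unix_timestamp() - 86400   -- events from the last 24 hours
GROUP BY player_id
ORDER BY events_last_day DESC
LIMIT 100;
```

Under the hood, Hive compiles a query like this into MapReduce jobs, which is the kind of complexity the service is designed to hide from users.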
Treasure Data has customers in a variety of industries, including several Fortune 500 firms. But so far it’s found its best traction in the online gaming and advertising spaces. One online gaming firm continuously feeds its Treasure Data environment with information about its players, including which games customers are playing, how long they’ve been playing, and what stage of the game they’re in. The Treasure Data service sucks all this data in and updates models about the players, which the company uses to help it sell ads.
Treasure Data keeps this company’s models updated every two minutes or so, which is as close to “real time” as the customer needs. “As long as we get that data in a couple minutes, they can keep those models pretty much real time,” Ghiossi says. “What they’re looking for is to make sure those models are good, and that the models are dynamic, and that’s what we’re feeding.”
This type of use case, keeping a large data model continuously fed with the latest sensor or machine data streaming in from the environment, will undoubtedly become more common as organizations move their Hadoop clusters from development into production. Depending on the industry, there will be different ways an organization can monetize the vast amounts of sensor, machine, and clickstream data. Building an IT infrastructure to do this sort of thing is no easy task, which is why Treasure Data senses such a promising market opportunity is about to unfold.
“There are a lot of technologies that can deal with big data. Even traditional database environments can deal with big data,” Ghiossi says. “But the part about big data that deals with sensor or log data or clickstream type data, and getting that into a service or on premise, is not an easy task. Being able to do that and provide value to the customer in a matter of days is a significant asset.”