Apache Spark Ecosystem Continues To Build
Apache Spark was everywhere at the recent Strata + Hadoop World conference. From Tableau’s new Spark interface to the new Spark as a service (SaaS) offerings and Intel’s new Spark initiative, the big data framework was very hard to miss.
Intel jumped on Spark’s bandwagon last week when it announced it was forming a new initiative around the in-memory framework. “We have engaged with Databricks, one of the pioneers of Apache Spark, to advance analytics capability for the Spark on Intel Architecture platforms and to accelerate the development of the Spark projects such as GraphX, MLlib, and Spark Streaming,” Intel’s Michael Greene wrote in a blog post.
“We also are collaborating with AMPLab to accelerate the development of data analytics technologies in real-world solutions through developments for SparkR [distributed statistic computing in R on top of Spark] and collaborations for utilizing Tachyon [an in-memory file system in BDAS] to address challenges in Internet-scale machine learning [e.g., Parameter Server, Big Model],” wrote Greene, who’s the vice president of the Software and Services Group and general manager of System Technologies and Optimization at Intel.
Greene said Intel plans to demonstrate the efficiency of running Spark-based analytics on Intel-based servers. That showcase will involve benchmarks and “technology education,” he said. You can also expect data encryption to play a big role in that effort. Greene highlighted the fact that Cloudera, its close partner in the Hadoop world, has incorporated the new AES-NI encryption instruction set into the CDH 5.3 release.
Tableau Software used the Strata show to launch a direct connector for Spark SQL, the data processing engine designed to be a faster, easier-to-use alternative to Hive, which relies on MapReduce. The direct connector that ships in Tableau version 8.3.3 will allow regular Tableau users to “leverage the power of Spark SQL” directly from their Tableau user interface, the vendor says.
“[W]e want to make sure everyone with data questions can take advantage of [Spark SQL’s] breakthroughs without needing to know programming or query languages,” says Dan Jewett, vice president of product management at Tableau. With 9,100 new customer accounts in the last quarter alone, Tableau’s reach should help Spark SQL get off the ground.
Altiscale is supporting the Spark in-memory computing environment atop its hosted Apache Hadoop environment, which has been online for about a year. Some Altiscale customers have been playing around with Spark on their private Hadoop clouds, but they have been limited in how much they could rely on those environments, says Raymie Stata, the company’s co-founder and CEO.
But now that Spark is fully supported, customers of Altiscale’s white-glove Hadoop environment can expect first-class Spark support. “We have had customers self-serving on Spark almost from the beginning, and we have been providing discretionary help for them along the way,” the former Yahoo CTO tells Datanami. “If something goes wrong we’ve been pitching in. But we’ve kind of stopped short of providing full support for Spark.”
Stata sees Spark being used primarily as a MapReduce replacement. As companies look to modernize their hand-coded Java MapReduce jobs, they’re migrating them to Spark, which can run up to 100 times faster than MapReduce. Spark’s MLlib machine learning library and Spark SQL are also getting some tire kicking, Stata says.
Altiscale’s Hadoop customers look to the services firm to help them adopt new technologies. “One of the benefits of the Altiscale product offering is it becomes an environment in which you can quickly come and use all the cool stuff,” Stata says. “Hey, you can do Spark, you can do Datameer, you can do H2O on Hadoop. We provide an environment for doing that.”
Qubole, meanwhile, says its new Spark as a Service offering, built on the Qubole Data Service (QDS), allows users to get up and running with Spark in less than 15 minutes. Users can provision a Spark cluster directly from a Web browser, leaving Qubole to handle the administration and workload management of the cluster.
While Spark is getting lots of attention among customers, deploying and maintaining Spark can be tricky, says Joydeep Sen Sarma, co-founder and CTO of Qubole. “By adding Spark to QDS, we’ve completely eliminated the barriers to taking full advantage of Spark for rapid data analytics.”
Several Qubole customers are already using and testing Spark on QDS, including Pinterest and DataLogix, the company says. Pinterest engineering manager Krishna Gade says the addition of Spark to QDS makes Qubole more valuable to the picture-sharing platform.
“Qubole empowers us to use the latest big data tools at petabyte scale without needing to invest in building out, maintaining and updating our own infrastructure,” Gade says in a press release. “As a result, we can focus on extracting value from our data using the best technologies for the job, and on driving the business forward.”
Spark has risen quickly in the Hadoop ecosystem. As the number and types of big data applications that support Spark (see graphic below) continue to grow, it will be interesting to see how this translates into the real world.