July 20, 2015

Cloudera Expands Spark Support

Data management specialist Cloudera is targeting “data at scale” with the rollout of an open source project dubbed Ibis, designed to make Hadoop more accessible to data scientists.

Along with its Ibis initiative, which leverages the Python language, Cloudera said Monday (July 20) that its big data push includes support for Apache Spark MLlib, the machine-learning library, in its upcoming release of Cloudera Enterprise 5.5. A Hadoop applications conference is also planned for October.

Cloudera, based in Palo Alto, Calif., said Ibis would allow data scientists to fully utilize the Python stack as Hadoop is used for more complex workloads. The project reflects the importance of the Python language in data science as well as the scaling of Hadoop from a batch-processing tool to a cornerstone of a big data ecosystem.

“We want to build on this momentum and make Hadoop’s infrastructure more accessible,” Wes McKinney, a Cloudera software engineer, noted in a statement. “We’re doing that by bringing Python more fully into the ecosystem, expanding our support for machine learning on Spark and focusing on the real-world, practical applications of data science.”

Python development has been confined to local data processing and smaller data sets, which restricts its utility for crunching big data. It is now being used for automating ETL and other tasks. Cloudera Labs’ Ibis data analysis framework is intended to allow Python users to process data at scale without sacrificing performance.

The initial version of Ibis includes support for Python capabilities such as built-in analytics via Impala, Hadoop’s database engine, for simplified ETL. Later versions will include additional Python packages and the ability to author Python functions, Cloudera said.

Impala also provides Python users with a native platform for Hadoop that improves performance and enables scaling needed for big data analytics.

Cloudera said Ibis is available as a preview in Cloudera Labs, its “virtual incubator” for new development projects. Ibis is an Apache-licensed project and open to contributions from the developer community, the company added.

As an early supporter of Spark, Cloudera has been integrating the data processing engine into the Hadoop ecosystem. Among its efforts are a Spark-on-YARN integration for shared resource management, integrations with Apache Kafka and Apache HBase, and new Spark features such as data loss protection.

Cloudera said it has contributed more than 370 patches and 43,000 lines of code to Spark and is driving Spark development with partner Intel.

As part of the effort, Cloudera is also adding built-in support for Spark MLlib to its Enterprise 5.5 platform, scheduled for release later this year. The integration is intended to allow data scientists to leverage scalable machine learning while harnessing Spark’s performance. The Cloudera platform already includes Spark core and Spark Streaming.

The company also announced this week it is sponsoring a conference for data scientists focusing on Hadoop applications. The “Wrangle Conference” is scheduled for Oct. 22 in San Francisco.
