July 1, 2014

Cloudera, Databricks, IBM, Intel, and MapR Collaborate

PALO ALTO, Calif., July 1 -- Open source contributors Cloudera, Databricks, IBM, Intel, and MapR announced today that they are joining forces to broaden support for Apache Spark (Spark), while simultaneously standardizing on it as the framework of choice by bringing popular tools from the MapReduce world to this new engine.

Spark has quickly become a standard in many Hadoop distributions, with rapid customer adoption across a variety of use cases, ranging from machine learning to stream processing. To further support this growth, these five vendors have come together to broaden the range of tools and technologies in the Hadoop ecosystem that leverage Spark as an underlying processing engine.

Today, besides being used independently as a programming framework, Spark is used as the basis for several projects including:

1. Spark Streaming for continuous data processing
2. MLlib for machine learning
3. GraphX for graph analytics capabilities

In recent months, other projects have also added support for Spark, as evinced by recent efforts to port Crunch, Mahout, and Concurrent’s Cascading framework to Spark.

This collaborative new effort expands upon the Spark momentum to include several key Hadoop projects — starting with the Apache Hive SQL engine (Hive). Using Spark as the underlying execution engine, this effort will improve the performance of batch SQL jobs in Hive, while seamlessly maintaining compatibility with the core Hive code base.

Simultaneously, the group is investigating ways to adapt Apache Pig, as well as other popular tools such as Sqoop and Search, to leverage Spark. By making Spark the execution layer of choice, this group is driving consolidation and standardization around Spark as the evolution of MapReduce for modern hardware.

This effort highlights the power of open source communities, with marketplace competitors coming together to help shape a common execution layer, thus creating a community standard. End users benefit from a widely supported execution layer that prevents lock-in while letting them continue to use their tools of choice. Further, having only a single engine to manage and learn reduces operational costs.

Spark is an open source data analytics framework originally developed in the AMPLab at UC Berkeley. Quickly embraced for its inherent advantages, such as improved data processing and in-memory capabilities on Hadoop, Spark offers application performance gains – up to 100 times faster than Hadoop MapReduce for certain applications. Spark has attracted the attention of the open source community and vendors alike.

Hive is a data warehouse infrastructure initially developed by Facebook Inc. and built on top of Hadoop. Hive was created to query and manage large datasets stored across a cluster of servers. Hive remains a popular choice for SQL batch processing and offers many advantages to customers. An active community, including enterprise vendors Cloudera, IBM, Intel and MapR, is committed to furthering Hive based on cutting-edge industry standards.
