
Baldeschwieler: Looking at the Future of Hadoop
Hadoop has come a long way, and with projects currently underway it has plenty of fuel to drive enterprise innovation for years to come, said Hortonworks co-founder and CTO Eric Baldeschwieler in his recent Hadoop Summit keynote in Amsterdam, Netherlands.
During his talk, Baldeschwieler discussed the past, present, and future of the project that he has been shepherding since the framework was an infant codebase within the walls of Yahoo! in 2006.
Using Hadoop deployment within Yahoo! as a backdrop to demonstrate the framework’s growth, he described how Hadoop had grown from zero installations in 2006 to 42,000 unique computers within the company – a microcosm of what has happened around the world in that time.
With such explosive growth, a lot is hinging on the innovative framework, and Baldeschwieler was eager to discuss the growth that is happening as Hadoop moves into the future.
“You can’t talk about the future of Apache Hadoop without talking about Hadoop 2.0,” Baldeschwieler mused about the refactoring of the platform that’s been in progress since 2009. “It’s now in alpha and we’re very excited because we believe that it’s going to move from a place where it’s still sort of cutting edge early work, to beta this year, and then within the year we think it’s going to move GA.”
The goal with Hadoop 2.0, says Baldeschwieler, is to expand the framework to handle 10,000 of “next year’s nodes,” noting that computers keep getting bigger every year. However, beyond that scalability, the Hortonworks CTO said that extensibility is a chief focus of the Hadoop 2.0 initiative, referring to YARN.
“In Hadoop 2.0, we’ve separated out the sort of core resource management – the ability to allocate a certain fraction of your cluster to a particular set of work – from MapReduce,” explained Baldeschwieler. “So now MapReduce just becomes one of a number of programming models that you can use with your Hadoop cluster.”
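The split Baldeschwieler describes can be pictured as a generic resource manager handing out slots of cluster capacity ("containers"), with MapReduce reduced to one pluggable application among many. A toy Python sketch of that separation (illustrative only – real YARN uses a ResourceManager/ApplicationMaster protocol, not this API):

```python
# Toy sketch of the YARN idea: a shared resource manager allocates generic
# "containers", and each programming model (MapReduce, streaming, SQL...)
# is just an application that requests them.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, n):
        """Grant up to n containers from the free pool."""
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n

def run_mapreduce_job(rm, mappers, reducers):
    # MapReduce is now *one* framework among many: it asks the shared
    # ResourceManager for containers instead of owning the whole cluster.
    got = rm.allocate(mappers + reducers)
    # ... run map and reduce tasks in the granted containers ...
    rm.release(got)
    return got

def run_streaming_app(rm, workers):
    # A different programming model sharing the same cluster.
    return rm.allocate(workers)

rm = ResourceManager(total_containers=100)
print(run_mapreduce_job(rm, mappers=40, reducers=10))  # 50
print(run_streaming_app(rm, workers=30))               # 30
print(rm.free)                                         # 70
```

The point of the sketch is simply that scheduling logic lives in one shared component, so new frameworks plug in without modifying MapReduce itself.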
Baldeschwieler says that many of these new frameworks are becoming available. “We’re seeing people develop frameworks to do streaming, to support lower latency SQL queries, and more generally to provide new kinds of services.”
Baldeschwieler talked about many different initiatives happening within the Hadoop community that he believes will have a significant impact on the future, including:
- HCatalog – “This takes the table-level abstraction of Hive and opens it up so that all of the data tools in Hadoop can work at this higher level of abstraction. Now you can take a table and you can write it with MapReduce or ETL it with Pig, store it in Hive format, use it directly – just interoperate between all of those tools.” Baldeschwieler also noted that HCatalog opens up the data to third-party SQL tools to access from outside the cluster, enabling many more use cases for Hadoop.
- Ambari – “Ambari is an Apache Incubator project, the focus of which is to bring provisioning, management, and monitoring of Apache Hadoop to everybody as an open source project. Everything that Ambari does, it does through RESTful APIs, and that means that it’s very easy to integrate it into existing management suites.” Other highlights include job diagnostics and cluster history.
- Tez – “The focus of Tez is on providing a much better programming framework in Apache Hadoop 2.0 for low latency queries. That breaks down into two pieces. One is a real focus on the inner loop – how do we more efficiently process lots and lots and lots of rows of data.” The other focus, said Baldeschwieler, is on prepping the cluster so that computation is done much more quickly.
- Stinger Initiative – “We think that there’s an opportunity for 100x improvement that can be delivered incrementally in a stable Hadoop-scale way that will not only address the interactive use case, but will also continue to be the best framework for very large queries, and very large workloads.” Already, the initiative has demonstrated a 45x performance increase for Hive.
- Falcon Project – “The Falcon Project is focused on automating the management of data in Hadoop. There are two sets of problems there; one is data lifecycle management – how do you get data into the cluster and how do you move it between clusters and make sure that you keep the data in the right place for the right amount of time. The other is how do you automate ETL flows in a much simpler, more declarative fashion.”
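The HCatalog idea above – tools sharing one named table with one schema instead of passing around raw file paths – can be sketched in a few lines of Python. This is a conceptual illustration only, not the HCatalog API; the `Catalog` class and its methods are invented for the example:

```python
# Conceptual sketch of what HCatalog provides: a shared metadata layer so
# different tools operate on the same named table rather than raw files.
# Illustrative only -- not the real HCatalog API.

class Catalog:
    """Maps table names to a schema plus rows; the shared metadata layer."""
    def __init__(self):
        self.tables = {}

    def create(self, name, schema):
        self.tables[name] = {"schema": schema, "rows": []}

    def write(self, name, rows):
        table = self.tables[name]
        for row in rows:
            # Every writer is held to the same table schema.
            assert set(row) == set(table["schema"]), "schema mismatch"
            table["rows"].append(row)

    def read(self, name):
        return list(self.tables[name]["rows"])

catalog = Catalog()
catalog.create("clicks", schema=["user", "url"])

# One "tool" (think: a MapReduce job) writes to the table by name...
catalog.write("clicks", [{"user": "a", "url": "/home"}])

# ...and another (think: a Pig ETL script or a Hive query) reads the same
# table by the same name, with the same schema -- no shared paths needed.
rows = catalog.read("clicks")
print(len(rows))  # 1
```

The interoperability Baldeschwieler describes comes from exactly this indirection: every tool goes through the catalog, so none of them needs to know how or where another tool stored the data.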
Embedding of the video was disabled by request (which seems out of character for such an open company); however, you can view the entire keynote here.
Related Items:
Putting Some Real Time Sting into Hive
Hortonworks Proposes New Hadoop Incubation Projects
How Facebook Fed Big Data Continuuity