May 12, 2016

Apache’s Wacky But Winning Recipe for Big Data Development

When Doug Cutting set out to develop an open source Web search engine in the late 1990s, he initially chose the GPL license to distribute his wares. When that failed, he decided to give the Apache Software Foundation a shot–and in the process may have changed the course of open source software development for the next 20 years.

Cutting initially developed the Lucene search engine with the idea of building a business around it, but later decided to give the technology away for free. “I just wanted people to use it. I wanted somebody to get some value from the software,” he said during a keynote address at yesterday’s Apache: Big Data 2016 conference in Vancouver, British Columbia.

The Java developer admits he wasn’t terribly knowledgeable in the area of open source licensing, so after posting the code on SourceForge, he picked the one licensing scheme he knew: GPL. This licensing scheme, after all, was the one used by Linus Torvalds to distribute Linux, which up to that point was the biggest open source success story.

That didn’t turn out so well. While people liked Lucene, they found the GPL (which uses a “copyleft” approach) restricted use of the software in their businesses, and so they started complaining to Cutting. “It frustrated me because I was trying to give something to people so they could use it freely, and they were saying they couldn’t use it. Clearly this was the wrong license,” he told the audience of about 500 in the Hyatt Regency ballroom.

“Then some folks approached me from this wacky place called Apache,” he said. “They said, ‘You can join us at Apache. We have a different license and you can join us here.’ I said, ‘What the heck, seemed like a good thing to try.’”

And the rest, as they say, is history. Lucene, of course, went on to become the most widely used search engine in the world. “I’d like to think it’s [so popular] because it’s technically awesome, but that is clearly not the reason,” Cutting said. “The real reason is much more likely because it’s benefited from Apache’s approach, from the open software approach in general, and Apache more specifically.”

Cutting’s Apache experience with Lucene would impact his next open source project, a little Web search engine he called Nutch. When Cutting and his development partner, Mike Cafarella, needed more computing resources to scale Nutch, Cutting joined Yahoo. When the duo needed more skilled programmers to help develop Nutch and to make it scale to big data heights, they looked to the Apache Software Foundation.

“Nothing is sacred,” says Doug Cutting (image courtesy YouTube)

Nutch, of course, would go on to become Apache Hadoop, the big data platform that today rivals Linux in terms of open source success. Cutting’s fateful decision to give the ASF a shot with Lucene turned out to be a catalyst that would kick-start the modern big data movement.

“This was a real blessing for me, that this Apache process was an accelerant for software,” Cutting said. “It could help software become better and succeed and become a standard in a way that other methods couldn’t.”

Cutting lauded the hands-off approach of the ASF, which he says enabled Apache Hadoop and related projects to evolve according to the needs of the users and developers themselves, not the dictates of some far-off organization or corporation that’s removed from the day-to-day challenges of actually running this stuff.

This decentralized approach reduces the friction to innovation. “We have this process where there are random mutations sprouting up all over,” Cutting said. “Some of them end up in the incubator and become top level projects at Apache. But mostly what happens is that people start using them and they decide which ones work.

“It’s a very organic process of improving and selecting the next thing,” he continued. “It’s not set by vendors. It’s not set by foundations. It’s set by users and I think that’s a wonderful change, and I think it’s leading to not only faster change, but change that’s more directed to the problems that we already care about, and where they need solutions.”

The Apache community’s unyielding commitment to change gives traditional enterprise IT professionals the cold sweats. It’s not that development is chaotic at Apache; as Cutting said, the group manages to harness the energy of its participants to address the pain points in a relentlessly directed fashion. But the ruthless dedication to continuous improvement at Apache rubs the suits the wrong way, and that dynamic is a major factor in the controversy surrounding the Open Data Platform initiative (ODPi).

Cutting is still against the ODPi, even after the ODPi apologized, clarified, and monetized the ASF to the tune of $40,000. In his view, the ODPi threatens the Apache way, and that isn’t acceptable. Over his 15 years of working within the ASF, Cutting has become a fearless soldier for the ASF’s process, to the extent that he would even be willing to see Hadoop supplanted by whatever comes next.

“Nothing is sacred,” Cutting said. “Any components can be replaced by something that is better.”

Cutting is a living embodiment of the Apache process. The Cloudera chief architect hit it out of the ballpark with game-changing software not just once, but twice, and he’s willing to relegate it to the legacy dustbin for the chance at creating something better? This level of institutional purity gives Cutting star status among the open source faithful, but it’s fair to say that it also scares the heck out of CIOs at multibillion-dollar corporations who just want IT platforms that don’t change on a monthly basis.

But in Cutting’s view, this is no time for stability if it means passing on the opportunity to have an even bigger impact. If you like what the ASF has given to you up to this point—given to you, for free, mind you—then just wait till you see what’s next.

“The pace of change in the big data ecosystem is astronomically greater than we saw for the 20 preceding years,” Cutting said. “We’re really now benefiting, and the way this change is happening is a key to that. It’s decentralized change. There’s no one organization or handful of organizations deciding what are the next components in the stack.”

Luck played a big role in Hadoop’s success, Cutting said. If he hadn’t already been developing Nutch, and if he hadn’t read Google’s GFS and MapReduce white papers after struggling to make Nutch scale, Hadoop might not have come to pass.

But also key to that is this Apache process. “Without that, we wouldn’t have been able to get these users involved to build the ecosystem that is now flourishing,” he said. “What’s coming next? What are the next hot technologies? I don’t know. If I knew I’d work on them or invest in them.

“That’s the wonderful thing–nobody knows.”
