August 31, 2016

Tracking the Ever-Shifting Big Data Bottleneck

Alex Woodie


Bottlenecks are a fact of life in IT. No matter how fast you build something, somebody will find a way to max it out. While the performance headroom has been elevated dramatically since Hadoop introduced distributed computing to the commodity masses, the bottleneck has shifted, but it hasn’t disappeared. So where did it go? Depending on who you ask, you’ll get different answers. But one thing seems abundantly clear: it’s no longer the local network.

Ever since Google (NASDAQ: GOOG) developed its MapReduce framework, which Doug Cutting would go on to pair with his distributed Nutch file system to create Hadoop, the speed of the local area network (LAN) has been less and less a factor.

Tom Phelan, the co-founder and chief architect of big data virtualization company BlueData, has insight into the history.

“What Google did originally was to co-locate compute and storage so that data processing jobs could be split up and run in parallel on many different commodity pieces of hardware and reduce the I/O load by using local disk. That is what Hadoop was designed for and that worked very well,” he says. “That was in a world where 1Gbit networking was all you could get. That was in the 2001 to 2003 timeframe. Now enterprises typically have 20Gbit networking and above, so there’s a lot more network bandwidth available.”

As the original MapReduce framework gave way to more advanced frameworks like Spark, we’ve seen data access patterns change accordingly. In fact, they’ve changed so much, Phelan argues, that the need to co-locate storage and compute has all but disappeared.

“As we’ve moved ahead 15 years, we’re now looking at Spark as the dominant application,” Phelan tells Datanami. “What we’re seeing is there have been a number of technical changes in the networking architecture that make the need to co-locate compute and storage less important, and the algorithms themselves … have less need to pull data in real-time from disk.”

Rethinking Storage

As the cost of fast LANs came down, the adoption rate went up. The result is that the bottleneck at the LAN level has all but disappeared for most organizations. For organizations with the highest demands, more advanced HPC interconnects like InfiniBand and Omni-Path from Intel (NASDAQ: INTC) have kept worker nodes saturated with data.

The result of all this is that organizations are rethinking how they store their data. For some organizations, an object-based storage system may make more sense for storing data measured in the petabytes.


The rise of 10Gbit and 40Gbit Ethernet LANs has largely eliminated the network bottleneck in local environments (yakub88/Shutterstock)

“You have companies like EMC [NYSE: EMC] and its Isilon [object storage] product, which are pushing this message as well,” Phelan says. “Their enterprise-class storage subsystem connected over a network can deliver data up into Hadoop nodes just as fast as local storage.”

The higher network speeds and the elimination of the network bottleneck will help to sound a death knell for big Hadoop clusters, he says. “These days of monolithic, one-thousand-node clusters – I just don’t see those continuing forward,” Phelan says.

Jerome Lecat, the founder and CEO of Scality, which competes with EMC Isilon in the object-based storage business, says the demise of the network bottleneck at the local level is giving customers reason to use Hadoop more as a compute cluster rather than a storage cluster.

“We can absolutely saturate the network,” Lecat says. “Our design is entirely network dependent, so we’re extremely happy that the Ethernet norm is going from 1Gbit to 10 Gbit to 40Gbit. That’s very, very good.”
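To put those link speeds in perspective, here is a rough back-of-envelope sketch of how long it would take to move one petabyte across a single link. It assumes ideal line rate with no protocol overhead, a decimal petabyte (10^15 bytes), and one link rather than many in parallel; the function name is illustrative only.

```python
def transfer_hours(data_bytes: float, link_gbit_per_s: float) -> float:
    """Hours needed to move data_bytes over one link at full line rate."""
    bits = data_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


PETABYTE = 1e15  # decimal petabyte, in bytes (an assumption of this sketch)

# The three Ethernet generations mentioned in the quote above.
for speed in (1, 10, 40):
    hours = transfer_hours(PETABYTE, speed)
    print(f"{speed:>2} Gbit/s: {hours:8.1f} hours ({hours / 24:.1f} days)")
```

Under these idealized assumptions, the same petabyte takes roughly 93 days at 1Gbit, about 9 days at 10Gbit, and a little over 2 days at 40Gbit. Real clusters spread the load across many links, so absolute times will differ, but the ratio between generations holds.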

Lecat sees organizations running big data and artificial intelligence (AI) algorithms on Hadoop, but storing the actual source data, if it’s 1PB or larger, somewhere else. As Hadoop borrows features from object stores, we’ll see APIs based on Amazon’s S3 object-storage format become the de facto data interchange standard for feeding Hadoop and Spark clusters with data from object stores, he says.
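As one illustration of what feeding a Spark cluster through an S3-style API can look like, the sketch below builds the settings that point Hadoop's S3A connector at an S3-compatible object store. The `fs.s3a.*` keys are standard Hadoop S3A options; the helper name, endpoint URL, bucket, and credentials are hypothetical placeholders, and this is not a description of any particular vendor's setup.

```python
def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Build Spark settings that route s3a:// paths to an S3-compatible store."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Many on-premises object stores expect path-style bucket addressing.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Placeholder values for illustration only.
conf = s3a_spark_conf("https://objects.example.internal", "EXAMPLE_KEY", "EXAMPLE_SECRET")
for key, value in sorted(conf.items()):
    shown = "***" if key.endswith("secret.key") else value
    print(f"{key}={shown}")

# With PySpark and the hadoop-aws module available, the settings could be applied as:
#   builder = SparkSession.builder.appName("object-store-read")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   df = builder.getOrCreate().read.parquet("s3a://example-bucket/events/")
```

In this arrangement the compute cluster holds no long-lived data of its own: jobs read from and write to `s3a://` paths over the network, which is the separation of compute and storage Lecat describes.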

“I think you’re going to see traditional storage reduced to a minimum, like the mainframe, if you will,” Lecat tells Datanami. “You’re going to see a lot of hyper-converged storage based on Flash, and you’ll see a lot of object storage, and much less traditional block and traditional NAS (network attached storage).  Any block you see will be all Flash, like PureStorage.”

Clouds a Loomin’

The cloud looms large in many organizations’ storage plans for big data, says BlueData’s Phelan.

“These customers are all thinking about going to a public cloud,” he says. “Whether they do it now or in five to 10 years — they’re all thinking about that.  They’re trying to leverage the architectures, make their internal architectures more amenable to doing that.”


Public clouds bring their own latency challenges (smartdesign91/Shutterstock)

BlueData makes software that enables Hadoop stacks to be deployed as Docker containers. Once the stack is deployed as a Docker container, a BlueData customer can move their application and data from the local data center to the cloud, and back, as desired.

While LAN bandwidth is essentially unlimited in the data center (DC) and on the campus, the wide area network (WAN) connections to cloud data centers are most definitely not. Some of the biggest enterprises and research institutions have paid big bucks to lay fiber optic lines from their on-premises data centers to their cloud provider’s data center, but the cost of such projects is beyond what most organizations can bear.

“Unless the enterprise has a private high-speed connection to the nearest Amazon DC, you’re absolutely correct: it will be a huge performance hit,” Phelan says. “I don’t think we’re technologically or infrastructure-wise ready to take the leap yet. Some of our customers already have extremely high performance network connections … but the rank and file, it’s going to be some time before the infrastructure is available to them.”

The big exception will be for data that’s born in the cloud. Organizations will naturally turn to the cloud to store data generated from sensors, such as connected cars and other devices that make up the Internet of Things (IoT). Sensitive data in regulated industries will also stay grounded for the foreseeable future.

In-Memory Computing

Benjamin Joyen-Conseil and Olivier Baillot from OCTO Talks! wrote an excellent blog post on the evolution of bottlenecks in big data technology. The pair argue that, as big data developers have worked to eliminate the network bottleneck in distributed systems, the bottleneck has been pushed further and further into the hardware.

MapReduce was the original distributed framework developed by Google as an alternative to the MPI technique favored in the HPC world. When Doug Cutting paired MapReduce with a new file system, the Hadoop stack was born. As organizations began using MapReduce and related technologies to perform machine learning tasks, disk I/O emerged as the new bottleneck.

Soon, newer in-memory big data frameworks, like Apache Spark, emerged on the scene that didn’t need to store data on disks, which all but eliminated disk I/O as the culprit. However, as Java heap sizes increased with these in-memory frameworks, organizations began running into challenges associated with Java Virtual Machine (JVM) garbage collection routines.

This is where we stand today, according to Joyen-Conseil and Baillot, who state that JVM garbage collection (GC) is currently the biggest bottleneck in the stack. But it likely won’t last for long, as the folks behind Spark have taken steps to reduce the impact of JVM GC with Project Tungsten. They also point out how other projects, like Apache Flink, have created intelligent ways to minimize the size of the heaps and to get around the JVM GC problem.

But like air and water, bottlenecks never go away; they just move from place to place. Where will the next bottleneck be? Joyen-Conseil and Baillot have some thoughts:

“We can already bet a coin on the fact that the shared cache of the processor will be the next bottleneck of big data systems,” they conclude.

Count Phelan among those who see processor caches getting taxed. “Now everyone is experimenting with mechanisms for separating compute and storage,  whether it’s caching or something else,” says the former senior engineer at VMware. “I think we’ll probably overshoot it, which means software will try to make too much demand on currently available hardware. But I believe shortly thereafter the hardware will catch up.  I believe we’ll constantly go back and forth in technology, where the software will always push the limits of what the hardware is capable of, and then we’ll make hardware improvements to meet the needs of software.”

Experiential Bottlenecks

Scality’s Lecat has an interesting take on another type of bottleneck.

“Honestly I think right now the biggest bottleneck is not technology. It’s knowledge and experience,” he tells Datanami. “The technology is there. The proof is that Google and Facebook employ these technologies very efficiently … Every time we meet a technology problem, it’s possible to deliver a solution around it.”

Big data technology has evolved a tremendous amount over the past 15 years. It’s amazing that Hadoop, at 10 years of age, is considered a legacy technology by some. No matter where big data tech takes us, there will always be a bottleneck. The key is to assess the state of the art of the technology and to find a way around the bottleneck. With so much amazing technology at our disposal, it really is just a matter of time and experience.
