Testing Cisco’s Unified Fabric
Networking big data can be a hassle, especially when that data is spread out over several data centers separated by some distance.
Yong-Hee Jeon of the Catholic University of Daegu in South Korea examined the efficiency of the Cisco Unified Fabric and its attempt to ease networking in managing big data applications. Specifically, he ran business intelligence and ETL tasks using Hadoop MapReduce over the Cisco Fabric to determine its runtime performance.
To assess the versatility of the Cisco network, it was important to test both BI and ETL functions. Both jobs would take in about a terabyte of data, but the BI job would be expected to analyze it and output only about a megabyte, while the ETL job would convert the full terabyte into workable formats for later use by BI applications.
According to the Internet Research Group, network considerations need to be taken into account before Hadoop clusters are installed. Several companies are finding themselves buying those clusters without much thought as to how to set up the network that will govern them, leading to undesirable results. “The scalability and usability of a Hadoop cluster may be damaged without understanding the role of WAN in the application of enterprise Hadoop,” Jeon said.
Cisco introduced its fabric as a means to enhance companies’ ability to manage their big data. The idea is to minimize I/O and computing bottlenecks by moving the computing itself to the data, a principle that has taken hold in the industry in the last couple of years. In this way, large files can be split up and spread across the fabric. A diagram of the fabric and how it relates to I/O is shown below.
“To efficiently process massive amounts of data, it was noted that it is important to move computing to where the data is using a distributed file system, rather than a central system for data,” Jeon said about Cisco’s purpose for the fabric. “Cisco proposes that a single large file is split into blocks, and the blocks are distributed among the nodes of the Hadoop cluster.”
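The block-splitting idea Jeon describes can be sketched in a few lines of Python. This is only an illustration, not Cisco's or Hadoop's actual placement logic: the 128 MiB block size, the node names and the round-robin placement are all assumptions made for the example, and replication is ignored.

```python
def split_into_blocks(file_size_bytes, block_size_bytes):
    """Return the sizes of the blocks a file of the given size is cut into."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block holds whatever is left over.
        sizes.append(remainder)
    return sizes


def place_round_robin(num_blocks, nodes):
    """Assign block ids to nodes in turn (a simplified, assumed policy)."""
    if not nodes:
        raise ValueError("at least one node is required")
    placement = {node: [] for node in nodes}
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return placement


# Assumed figures: a 1 TiB file, 128 MiB blocks, 16 hypothetical nodes.
FILE_SIZE = 2**40
BLOCK_SIZE = 128 * 2**20
NODES = [f"node-{i:02d}" for i in range(16)]

blocks = split_into_blocks(FILE_SIZE, BLOCK_SIZE)
placement = place_round_robin(len(blocks), NODES)
print(len(blocks), "blocks,", len(placement[NODES[0]]), "per node")
```

Under those assumed figures, each node ends up holding an equal share of the file, so a map task can read its input from local disk rather than pulling it across the network.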
When Jeon ran his BI and ETL tests on the fabric, he noted that network traffic, the source of the bottlenecks that slow networking down and delay runtime, was minimal during BI operations, where a lot of data had to be analyzed but little was output. With ETL functions, traffic spikes occurred before the Hadoop Reducers took effect, suggesting a slight I/O bottleneck.
The main issue, however, took place during an ETL-related “reduce-shuffle phase.” As Jeon puts it, “it is shown that there is a significant amount of traffic because the entire data set needs to be shuffled across the network. The spikes are made up of many short-lived flows from all the nodes in the job and can potentially create temporary burst trigger short-lived buffer and I/O congestion.”
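A rough back-of-the-envelope model shows why the shuffle produces many short-lived flows. The sketch below is not taken from Jeon's report: it assumes that every reducer fetches one partition from every mapper, that map output is spread evenly across nodes, and that the mapper, reducer and node counts are the hypothetical ones shown.

```python
def shuffle_flows(num_mappers, num_reducers):
    """Each reducer pulls one partition from each mapper: one flow per pair."""
    if num_mappers < 0 or num_reducers < 0:
        raise ValueError("task counts must be non-negative")
    return num_mappers * num_reducers


def cross_network_bytes(map_output_bytes, num_nodes):
    """Bytes that leave their node, assuming output is spread evenly.

    With an even spread, 1/num_nodes of the data is already on the node
    that reduces it; the remaining share must cross the network.
    """
    if num_nodes < 1:
        raise ValueError("at least one node is required")
    return map_output_bytes * (num_nodes - 1) // num_nodes


# Assumed figures: 8192 map tasks, 16 reducers, 16 nodes, and an ETL job
# whose map output is roughly as large as its 1 TiB input.
flows = shuffle_flows(8192, 16)
moved = cross_network_bytes(2**40, 16)
print(flows, "flows,", moved / 2**40, "of the data set crosses the network")
```

With these assumed numbers, nearly the whole data set moves between nodes in a large number of small transfers, which is consistent with the bursty pattern Jeon describes for ETL. A BI job that emits far less intermediate data would move correspondingly less.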
The good news is that, according to Jeon, these bursts do not last long. However, Jeon left concrete numbers out of his report, so the true effectiveness of the fabric is difficult to quantify.