Facebook Sees Hadoop Through Prism
A prism is used to separate a ray of light into its contingent frequencies, spreading them out into a rainbow. Maybe Facebook’s Prism will not make rainbows, but it will have the ability to split Hadoop clusters across the world.
This development, along with another Facebook project Corona which solves the single point of failure in MapReduce, was recently outlined with an emphasis on the role of the project for the larger Hadoop ecosystem.
Splitting a Hadoop cluster is difficult, explains Facebook’s VP of Engineering Jay Parikh. “The system is very tightly coupled,” said Facebook’s VP of Engineering Jay Parikh as he was explaining why this splitting of Hadoop clusters was not possible, “and if you introduce tens of milliseconds of delay between these servers, the whole thing comes crashing to a halt.”
So if a Hadoop cluster works fine when located in one data center and it is so difficult to change that, why go to the trouble? For a simple reason, according to Parikh. “It lets us move data around, wherever we want. Prineville, Oregon. Forest City, North Carolina. Sweden.”
This mobility mirrors that of Google Spanner, which moved data around its different centers to account for power usage and data center failure. Perhaps, with Prism, Facebook is preparing for data shortages/failures on what they call the world’s largest Hadoop cluster. That would be smart. Their 900 million users have, to this point, generated 100 petabytes of data that Facebook has to keep track of. According to Metz, they analyze about 105 terabytes every thirty minutes. That data is not shrinking, either.
Indeed, Parikh has an eye toward power usage and data failures. “It allows us to physically separate this massive warehouse of data but still maintain a single logical view of all of it. We can move the warehouses around, depending on cost or performance or technology…. We’re not bound by the maximum amount of power we wire up to a single data center.” Essentially, like Google Spanner, Prism will automatically copy and move data to other data centers when problems are detected or if needed elsewhere.
Facebook also improved Hadoop’s MapReduce system by introducing a workable solution that utilizes multiple job trackers. Facebook had already recently engineered a solution that takes care of the Hadoop File System which resulted in Hadoop’s HA NameNode, and Corona supplements that with a MapReduce fix. “In the past,” Parikh notes, “if we had a problem with the job tracker, everything kinda died, and you had to restart everything. The entire business was affected if this one thing fell over. Now, there’s lots of mini-job-trackers out there, and they’re all responsible for their own tasks.”
While both Corona and Prism seem like terrific additions to Hadoop, neither of these products have yet been released and Parikh gives no timetable for when that will happen. Further, according to Metz, MapR claims it has a solution similar to that of Facebook’s but nothing has hit the open source Hadoop.
Either way, there exist few companies who have as much at stake as Facebook when it comes to big data analytics. Improvements to Hadoop mean improvements to how Facebook handles their ridiculous amount of data. Hopefully, through Prism and Corona, they have found ways to split clusters among multiple data centers along with introducing backup trackers.