MIT Programmers Attack Big Data Memory Gap
Among the computing challenges presented by big data is the scattering of unstructured items across huge datasets. Pulling together that data from arbitrary locations in main memory is therefore emerging as a major performance bottleneck in CPUs.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory have proposed a solution to the memory “locality” problem with a new programming language called “Milk.” The approach is designed to allow application developers to more efficiently manage memory when crunching scattered data in ever-larger datasets.
The MIT researchers reported at a computing conference this week that common algorithms written in the new programming language ran up to four times faster than those written in existing languages. They predict larger performance gains as the new programming language is refined to orchestrate data locations and determine the relevance of data stored at particular locations.
Along with the scattering of big data in memory, the MIT researchers are also tackling a problem they refer to as “sparse” data. In other words, the scale of a big data solution does not always grow in proportion to the scale of the problem to be solved.
The MIT programming language is predicated on the fact that today’s CPUs are not optimized for “sparse” data. When a CPU core fetches an item from main memory, it is designed to grab the entire block of neighboring data as well, on the assumption that nearby data will be needed next. The university researchers concluded that going to main memory this way for a single data point at a time is woefully inefficient in the age of big data.
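The difference between the two access patterns can be made concrete with a small C sketch. It is only an illustration, not code from the MIT project: the function names sum_indirect, fill_sequential and fill_strided are hypothetical. The same loop reads the same values either in address order or by jumping around the array, and on large arrays the jumping order is the one that tends to waste most of each block the CPU fetches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds up values[idx[0]], values[idx[1]], ...
 * The order of the memory accesses is decided entirely by idx. */
uint64_t sum_indirect(const uint32_t *values, const size_t *idx, size_t n) {
    assert(values != NULL && idx != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[idx[i]];
    }
    return total;
}

/* Visit elements 0, 1, 2, ... in address order (cache friendly). */
void fill_sequential(size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++) {
        idx[i] = i;
    }
}

/* Visit elements in jumps of `stride`. When stride and n share no common
 * factor, every element is still visited exactly once, but neighboring
 * accesses land far apart in memory. */
void fill_strided(size_t *idx, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i++) {
        idx[i] = (i * stride) % n;
    }
}
```

Both orders return the same sum. Only the order in which memory is touched differs, which is the cost the researchers are targeting.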
“It’s as if, every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton and put it back in the fridge,” explained Vladimir Kiriansky, an MIT doctoral student in electrical engineering and computer science and lead researcher.
(The analogy also explains the name of the new programming language.)
Milk adds several commands to OpenMP, or Open Multi-Processing, a set of compiler directives for C and other programming languages geared to multicore processors. Milk allows programmers to add a few lines of code around loops that iterate through large datasets looking for “sparse” data. The compiler then manages memory accordingly, the researchers said.
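For readers unfamiliar with OpenMP, the kind of loop in question might look like the C sketch below. It uses only a standard OpenMP directive. Milk's own commands are not shown because the article does not spell them out, and the function name gather_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds up the table entries selected by keys[].
 * Each lookup table[keys[i]] lands at a data-dependent address, which is
 * the scattered access pattern the article describes. */
uint64_t gather_sum(const uint32_t *table, const uint32_t *keys, size_t n_keys) {
    assert(table != NULL && keys != NULL);
    uint64_t total = 0;

    /* Standard OpenMP: split the iterations across cores and combine the
     * per-core partial sums. Compilers built without OpenMP support ignore
     * the directive and run the loop serially. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n_keys; i++) {
        total += table[keys[i]];
    }
    return total;
}
```

With GCC or Clang, the directive takes effect when the program is compiled with -fopenmp.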
The compiler builds a list of the data addresses to be accessed and groups together addresses that are near each other in memory. As a result, each core requests only data that it needs and that can be retrieved efficiently, thereby boosting overall performance.
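A rough, single-threaded approximation of that idea in C is sketched below. This is not the Milk compiler's actual mechanism. It simply collects the requested indices, sorts them so that neighboring addresses are handled together, and only then touches memory. The name batched_increment is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Order two indices for qsort without risking integer overflow. */
static int cmp_index(const void *a, const void *b) {
    size_t x = *(const size_t *)a;
    size_t y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: performs counts[r]++ for every r in requests[],
 * but defers the updates. Requests are first copied into a batch, the
 * batch is sorted by index, and the updates are then applied in address
 * order. Returns 0 on success and -1 if memory cannot be allocated. */
int batched_increment(uint32_t *counts, const size_t *requests, size_t n) {
    assert(counts != NULL);
    if (n == 0) {
        return 0;
    }
    assert(requests != NULL);

    size_t *batch = malloc(n * sizeof *batch);
    if (batch == NULL) {
        return -1;
    }
    memcpy(batch, requests, n * sizeof *batch);

    /* Group nearby addresses by sorting the pending indices. */
    qsort(batch, n, sizeof *batch, cmp_index);

    /* Apply the updates in ascending address order. */
    for (size_t i = 0; i < n; i++) {
        counts[batch[i]]++;
    }

    free(batch);
    return 0;
}
```

Sorting has its own cost, so a sketch like this pays off only when the table is much larger than the cache and the requests are widely scattered.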
The next step in boosting performance will be tailoring the Milk compiler to keep track not only of the list of memory addresses but also of the data stored at those addresses. The compiler would then decide which addresses to retain for future reference and which to discard.
“Many important applications today are data-intensive, but unfortunately, the growing gap in performance between memory and CPU means they do not fully utilize current hardware,” noted Matei Zaharia, an assistant professor of computer science at Stanford University. “Milk helps to address this gap by optimizing memory access in common programming constructs.”