DDN Helps Biotech Researchers Deal with Big Data Pain
Scientific equipment such as gene sequencers and electron microscopes is essential to the work of biotechnology researchers at the University of Florida. However, the latest generations of this high-tech gear can generate terabytes of data per day. That level of data growth was becoming a burden on the scientists at the university, and led its IT team to DataDirect Networks (DDN) as it rethought its storage and computing infrastructure.
|University of Florida’s Cancer Genetics research building.|
The University of Florida’s Interdisciplinary Center for Biotechnology Research (ICBR) is one of the country’s top cross-disciplinary life sciences academic research facilities, providing university faculty, staff, graduate students, and other partners with more than 400 types of scientific services.
The ICBR’s IT infrastructure was starting to strain under the weight of petabytes worth of bioinformatics, cellomics, genomics and proteomics data. New pieces of equipment, such as cryo-electron microscopy (cryo-EM) instruments, are great at generating high-fidelity data that’s beneficial to scientific research. However, providing a suitable landing pad for all that data was becoming a problem, says Aaron Gardner, Cyberinfrastructure Section Director for the ICBR.
“A lot of the daily effort goes to juggling data itself,” Gardner tells Datanami. “The informaticists and researchers are spending a lot of time moving data around between namespaces or between silos and trying to get the most efficient type of storage to do this particular kind of task. So the more of that we could offload, the more they’ll be enabled to move the research lifecycle forward and make it faster.”
Over the 25 years ICBR has been operating, its IT and storage infrastructure has grown in an ad-hoc manner, with a collection of NAS and direct-attached storage devices. As a public university competing for grant money, the ICBR’s mantra has always been to get the most bang for the buck. That means using open source software whenever possible, and rolling its own white-box infrastructure.
In the past, Gardner and his six-person IT team too often scrambled to cope with the data being generated. He recognized there was a dire need to get in front of the problem before it started having a bigger impact on the science. “We’ve always dealt with a lot of data,” Gardner said. “Then next-gen sequencing came along and disrupted everything, and now we have a huge amount of data to deal with. We had a big data problem before it was being called that.”
The primary goal of Gardner and his team was to reduce the complexity of the storage infrastructure and implement a single storage namespace. Reducing power and space consumption was also a goal. Gardner’s search for a solution led him to DataDirect Networks (DDN), a well-known provider of high-end storage gear to high performance computing (HPC) and research sites across the world.
|ICBR helps researchers get data out of research devices, such as the iTRAQ mass-spectrometry device.|
Gardner knew that DDN was doing some cutting edge work in the area of combining storage and processing with parallel file systems. It offers storage appliances that combine high-end disk storage along with the GPFS and Lustre file systems. There’s also the hScaler, DDN’s pre-built appliance for Hadoop.
“That’s one of the key concepts and why we’re attracted to the SFA [Storage Fusion Architecture] embedded appliance model,” Gardner says. “Hadoop is great, and the whole MapReduce model of moving the computation to the data, I do very strongly feel, is the way to approach the majority of big data challenges, and the same holds true for us in life sciences.”
Hadoop itself wouldn’t make a good solution for the ICBR, however, for a variety of reasons. One of the core pieces of software used at the ICBR, the open source iRODS (integrated Rule-Oriented Data System) application, wouldn’t be a good fit for Hadoop. But Gardner was very intrigued with the idea of moving the iRODS application as close to the storage layer as possible, to minimize the movement of data.
“We said, ‘What if we did the same thing and embedded into a virtual machine first GPFS, to define a single parallel storage namespace, and all the benefits it has, and then put iRODS on top of that, to do the metadata management, and embedded all that on the appliance?'” Gardner said. “The idea is we’re all trying to minimize latency so the closer we can get these services to the actual storage hardware, the better.”
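The metadata management Gardner mentions is what iRODS is for: its rule engine fires at policy enforcement points (for example, after a file is written) so data can be tagged and found without researchers knowing which silo it landed on. A hypothetical fragment of an iRODS rule file, using real iRODS microservices but with the attribute name and value chosen purely for illustration:

```
# Hypothetical iRODS rule-file fragment (e.g. in core.re). After a put
# completes, attach searchable metadata to the new data object. The
# "instrument" attribute and "cryo-EM" value are illustrative only.
acPostProcForPut {
    msiAddKeyVal(*kv, "instrument", "cryo-EM");
    msiAssociateKeyValuePairsToObj(*kv, $objPath, "-d");
}
```

Running such rules inside a VM on the SFA controllers, as the POC did, keeps this policy logic next to the storage hardware, which is the latency argument Gardner makes above.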
DDN didn’t offer such an appliance. If it had existed, Gardner would have bought it. But Gardner was game for using the ICBR as a test bed, and for creating a proof of concept (POC) to see if embedding iRODS into the SFA appliance was feasible. DDN was willing to work on the problem, too.
|DDN’s line of SFA appliances can scale to more than 10 PB.|
Thus was born a fruitful collaboration. The POC was a success, and DDN and ICBR worked together to get iRODS running in a Linux virtual machine on the SFA controllers. The POC involved a single SFA12K device with about 180 TB of raw storage. Based on performance data from the POC, the next generation of DDN devices, the SFA12KX (due to ship in early 2014), will make a very good starting point for an actual implementation of the embedded storage appliance at ICBR, and a first step in creating a next-generation, multi-petabyte data warehouse for the organization and the researchers it supports.
Along the way, DDN became the first private company to be welcomed into the iRODS Consortium. If DDN offers a version of the SFA appliance embedded with the iRODS application in the future, the work with Gardner and ICBR will be considered one of the first deployments. “This stuff is very complicated,” Gardner said. “We had great people working with us at DDN. But if you really want to optimize and get all these things I’m talking about (not just running, but getting performance and scalability on the hardware), it’s not trivial. To me, this is part of being an academic institution.”
The SFA appliance from DDN isn’t a panacea for ICBR’s big data woes, but it does provide a good starting point for the organization to begin dealing with the data volumes it has now, and to better cope with them going forward. Gardner is intrigued by the idea of embedding more types of workloads into the SFA appliances and deploying them in an infrastructure-as-a-service (IaaS) model. That would drive more complexity out of the infrastructure, simplifying work for researchers.
Gardner is convinced that a containerized, appliance-based approach, one that marries complicated software stacks to hardware appropriate to the task, is the wave of the future. “Otherwise the spiraling complexity of large-scale big data type IT is going to make losers of most endeavoring to tame it right now,” he says. “There has to be something happening to help contain that level of complexity.”