How a Facebook-Like Graph Powers Drug Discovery
Researchers have long sought to identify the key proteins involved in the development of diseases like cancer, but the time and effort required to check each combination of proteins can be daunting. Thanks to the advent of graph analytics, however, researchers can now build models of protein networks, enabling massive parallelization of the protein problem and powering a more efficient drug discovery process.
One of the companies employing advanced graph analytics in drug discovery is e-Therapeutics, a British biotech based near Oxford. The company recently started running its proprietary algorithms on a graph database that’s housed in an in-memory cluster.
“Effectively we’ve built something like the Facebook network, except between proteins,” says Jonny Wray, head of discovery informatics at e-Therapeutics. “All normal biology is the result of a network of proteins talking to each other… In a disease state, there’s a different set of connectivity and patterns of the connections between proteins.”
Before adopting the in-memory graph database, the company ran its algorithms on a single workstation. Depending on the disease, the graph could have anywhere from 500 to 3,000 nodes, each corresponding to a protein. The company's algorithms look for the combination of proteins in a network that is most critical to a particular disease state, and therefore the most promising candidate to target with a drug.
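A protein network like the one described can be modeled as a simple undirected graph in which each node is a protein and each edge is an interaction. A minimal sketch in Python (the protein names and interactions here are hypothetical toy data, not e-Therapeutics' dataset):

```python
from collections import defaultdict

def build_graph(interactions):
    """Build an undirected graph as an adjacency map from (a, b) interaction pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph

# Hypothetical protein-protein interactions; real disease graphs run 500-3,000 nodes
interactions = [("TP53", "MDM2"), ("MDM2", "AKT1"), ("AKT1", "MTOR"), ("TP53", "ATM")]
graph = build_graph(interactions)
print(sorted(graph["TP53"]))  # ['ATM', 'MDM2'] -- TP53's neighbors in this toy graph
```

At this scale the whole adjacency map fits trivially in memory; the analytical cost comes from the algorithms run over it, not from storing it.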
“Say we have a graph of 2,000 proteins,” Wray tells Datanami. “And we know if we remove a small subset of the proteins, it will have a big effect on the network, on the graph structure. Say we want to remove 20. If you randomly remove 20, you don’t have any effect on the structure of the network. Much like the Internet, if you remove 20 random routers, you’ll have no effect. But if you remove 20 key ones, you will have a big effect on the function of the system.”
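Wray's router analogy can be made concrete by measuring how the largest connected component of a graph shrinks under node removal. The sketch below uses an exaggerated hub-and-spoke toy network (an assumption for illustration, not a real protein topology): deleting a handful of random leaf nodes barely dents connectivity, while deleting the one hub shatters it.

```python
import random
from collections import defaultdict

def largest_component(graph, removed):
    """Size of the largest connected component once 'removed' nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                      # iterative depth-first search
            node = stack.pop()
            size += 1
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best

# Toy hub-and-spoke network: one hub connected to 50 leaf proteins
graph = defaultdict(set)
for i in range(1, 51):
    graph["hub"].add(f"p{i}")
    graph[f"p{i}"].add("hub")

random_pick = random.sample([f"p{i}" for i in range(1, 51)], 5)
print(largest_component(graph, random_pick))  # 46: random removals barely matter
print(largest_component(graph, ["hub"]))      # 1: removing the key node shatters it
```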
The hard part is finding which 20 of those proteins are the key ones. You could tackle the problem head-on and check each combination. “In theory you could do that exhaustively. But the number of combinations there is astronomical. You wouldn’t be able to accomplish it in a lifetime,” Wray says. Instead, the company uses a genetic algorithm, a standard approximation method in the industry.
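Wray's "astronomical" is no exaggeration: there are roughly 3.9 x 10^47 ways to choose 20 proteins out of 2,000. A genetic algorithm sidesteps the enumeration by evolving a population of candidate subsets toward higher fitness. The sketch below is a generic illustration of the technique, not e-Therapeutics' proprietary algorithm, and its fitness function is a toy stand-in; a real score would measure how badly removing the subset disrupts the network.

```python
import math
import random

# Why exhaustive search fails: choosing 20 proteins out of 2,000
print(math.comb(2000, 20))  # ~3.9e47 combinations

def genetic_search(nodes, fitness, subset_size=20, pop_size=30, generations=50, seed=0):
    """Minimal genetic algorithm over node subsets: selection, crossover, mutation.
    An illustrative sketch only, not e-Therapeutics' method."""
    rng = random.Random(seed)
    population = [rng.sample(nodes, subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]          # selection: keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = rng.sample(list(set(a) | set(b)), subset_size)  # crossover
            if rng.random() < 0.3:                   # mutation: swap in a random node
                repl = rng.choice(nodes)
                if repl not in child:
                    child[rng.randrange(subset_size)] = repl
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness rewarding low-numbered "key" proteins
nodes = list(range(200))
best = genetic_search(nodes, fitness=lambda s: -sum(s))
print(sorted(best))
```

Each fitness evaluation is independent of the others, which is exactly what makes the approach a good fit for the parallelization described next.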
On a single workstation, the algorithms would sometimes run for days before determining which proteins were the most critical for cancer, depression, or other diseases that have a protein-network component (not all diseases are amenable to this approach). That put a limit on the size of the graph the company could use, and on the manner in which its scientists tackled problems. What it needed was a way to parallelize the execution of the algorithms against the graph across a large number of machines, which would take the shackles off its scientists and allow them to explore more freely.
About a year ago, the company started exploring an in-memory data grid from GridGain Systems. Running on a cluster of modest hardware, the GridGain In-Memory Data Fabric effectively handles much of the underlying plumbing involved in parallelizing e-Therapeutics' code, which is mostly written in Java and follows the basic map-reduce style of programming, according to Wray.
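The map-reduce pattern Wray describes can be sketched with Python's standard library: a "map" phase scores candidate protein subsets in parallel workers, and a "reduce" phase keeps the best result. This is only a local stand-in with a toy fitness function; GridGain's value is fanning the same map phase out across the machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def score(subset):
    """Map step: score one candidate subset (toy fitness: favor low protein IDs)."""
    return (subset, -sum(subset))

def best_subset(candidates, workers=4):
    """Fan candidates out to workers (map), then pick the top scorer (reduce)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(score, candidates))    # map: evaluate in parallel
    return max(scored, key=lambda pair: pair[1])[0]   # reduce: keep the best

candidates = [[1, 2, 3], [10, 20, 30], [2, 4, 6]]
print(best_subset(candidates))  # [1, 2, 3]
```

Because the map steps share no state, the same structure scales from local threads to a grid of machines, which is why an off-the-shelf data fabric could absorb the distribution work.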
“We can develop our own algorithms,” Wray says. “But that baseline infrastructure around distributed computing–we didn’t want to develop it ourselves. We wanted to get something off the shelf, and that’s what GridGain gives us.”
The GridGain cluster has had its expected impact. Compared to the single-CPU workstation it was using before, the 20-node cluster has delivered a 17x to 20x speedup in the execution of the algorithms. Some of the jobs that would have taken a few days can now be run in hours, he says. The company also runs bigger jobs that run for a few days. Those jobs would have taken three to four months to run on the old single-CPU workstation. “That would never have gotten run before,” Wray says. “It allows us to do certain analytics that we just couldn’t do before.”
“It’s pretty critical,” Wray continues. “Our algorithm for analysis for every project runs on top of this now. In fact it’s actually become a victim of its own success in that it works pretty well and people are sending more and more analytics to it, so it slows things down again. It’s a good problem to have.”
At the end of the day, the GridGain system and the graph database have given the e-Therapeutics scientists a powerful new tool for exploring potential proteins to exploit in the battle against cancer and neurodegenerative diseases, which are its main targets at the moment. “We wanted to move away from that batch environment and more toward allowing the scientists to experiment more,” Wray says. “It changes the way you work quite dramatically.”