Rice Genome Data Mined to Feed a Hungry Planet
The global population is forecast to top 7.7 billion human beings by 2020. As weather patterns change and, with them, global agricultural production, a new dataset containing the genome sequences of more than 3,000 rice varieties is being made available to researchers working to figure out how to feed the world.
The International Rice Research Institute (IRRI) along with the Chinese Academy of Agricultural Sciences and BGI Shenzhen compiled the 120-terabyte dataset, which is available now on the Amazon Web Services (AWS) cloud platform. The project sequenced the genomes of 3,024 rice varieties from 89 countries.
AWS said this week the consortium partnered with DNAnexus Inc., which operates a cloud platform for sharing genomic data and tools, to process source genomic data across 37,000 AWS compute cores. The process took two days.
Rice is a staple food for roughly half the world’s population, accounting for an estimated 20 percent of all calories on a per-capita basis. Current rice yields based on traditional breeding techniques are unlikely to meet growing demand. Researchers say they must increase yields by 25 percent over the next 15 years.
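To put the 25 percent target in annual terms, the arithmetic can be sketched as below. This assumes yields compound at a steady rate each year, which the researchers did not state; the function name is illustrative only.

```python
def annual_growth_rate(total_increase: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `total_increase` over `years`.

    Assumption: steady compound growth, i.e. (1 + r) ** years == 1 + total_increase.
    """
    return (1.0 + total_increase) ** (1.0 / years) - 1.0


# Figures from the article: a 25 percent increase over 15 years.
rate = annual_growth_rate(0.25, 15)
print(f"Required compound growth: {rate:.2%} per year")
```

Under that assumption, the target works out to roughly 1.5 percent yield growth per year.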
That explains the growing emphasis on applying big data analytics to breeding techniques so that they take underlying genetic information into account. The rice genome dataset includes more than 30 million genetic variations spanning all known and predicted rice genes, organizers said.
“Through analysis of this data, researchers can potentially identify genes associated with important agronomic traits such as crop yield, climate stress tolerance and disease resistance,” the partners noted. “Together, they represent an unprecedented resource for advancing rice science and breeding technology.”
Big data analysis of the genome dataset also could yield new inferences about how to achieve higher rice yields through tolerance to pests, crop diseases and expected weather extremes brought about by climate change.
AWS said the genomic dataset is hosted on its Simple Storage Service and is publicly available over common HTTP protocols. The 3000 Rice Genome Dataset can be accessed here.
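Because the data sits in Simple Storage Service and is reachable over HTTP, a researcher could in principle browse it with nothing more than a standard HTTP client. The following is a minimal sketch, not the consortium's own tooling. The bucket name is a hypothetical placeholder (the article does not give one), and the sketch assumes the bucket permits anonymous listing through the S3 ListObjectsV2 REST call.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

# Hypothetical placeholder; substitute the dataset's real bucket name.
BUCKET = "example-rice-genome-bucket"


def listing_url(bucket: str, prefix: str = "", max_keys: int = 100) -> str:
    """Build a ListObjectsV2 URL for a bucket that allows anonymous reads."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text: str) -> list[tuple[str, int]]:
    """Extract (object key, size in bytes) pairs from a listing response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(bucket: str, prefix: str = "") -> list[tuple[str, int]]:
    """Fetch one page of the listing over plain HTTPS and parse it."""
    with urllib.request.urlopen(listing_url(bucket, prefix)) as response:
        return parse_listing(response.read().decode("utf-8"))
```

Calling `list_objects(BUCKET)` would return one page of object names and sizes, provided the placeholder is replaced with the real bucket and anonymous listing is enabled. Individual files could then be downloaded with ordinary HTTP GET requests.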
“The dataset provides access to millions of genetic markers that can be used to design sustainable crops for the future, that is, ones that are high-yielding and more nutritious while at the same time requiring less water, fertilizer and pesticides,” Rod Wing, director of the Arizona Genomics Institute at the University of Arizona, noted in a statement.
IRRI added that the new dataset contains “millions of genomic sequences from a diverse set of rice varieties that, when combined with phenotyping observations, gene expression and other information, provides an important step in establishing gene-trait associations, building predictive models and applying these models to breeding.”
Organizers said the next challenge for researchers would be systematically mining the rice genome dataset to link genotypic variations to functional ones in order to create “new and sustainable rice varieties.” The results of those analytics efforts will then be combined with environmental studies based on satellite imagery, such as Landsat 8 data, which is also accessible on the AWS cloud.