What Can Enterprises Learn From Genome Sequencing?
The data handling requirements for genetic discovery are increasing as the tools become more sophisticated, causing data sets to become increasingly large. However, dealing with these large data sets is nothing new for practitioners in traditional science, and enterprises can learn from the strategies and processes that these disciplines have put into play.
Mario Caccamo, head of Bioinformatics at The Genome Analysis Center (TGAC) based in the UK, recently outlined the work and challenges that researchers face, explaining the explosive nature of the data and the processes that genome scientists use to wrestle with it.
“We have super exponential growth in data generation,” explained Caccamo. “To give you a concrete example, with two sequencing instruments in 2010, we generated 1.2 terabases of data. We can now generate half of that (600 gigabases) in two weeks in only one sequencing round with our current instruments.”
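To put the quoted figures on a common footing, the short calculation below converts both to gigabases per week. It is a rough sketch only: it assumes the 2010 total was spread evenly over a full 52-week year, which the article does not state, and the variable names are illustrative.

```python
# Figures quoted by Caccamo, expressed in gigabases (1 terabase = 1,000 gigabases).
OUTPUT_2010_GB = 1200.0   # total for 2010, from two instruments
OUTPUT_NOW_GB = 600.0     # one sequencing round with current instruments
WEEKS_NOW = 2.0           # duration of that round
WEEKS_IN_YEAR = 52.0      # ASSUMPTION: the 2010 total covers a full year


def weekly_rate(gigabases, weeks):
    """Average output in gigabases per week."""
    return gigabases / weeks


rate_2010 = weekly_rate(OUTPUT_2010_GB, WEEKS_IN_YEAR)
rate_now = weekly_rate(OUTPUT_NOW_GB, WEEKS_NOW)
speedup = rate_now / rate_2010

print(f"2010: {rate_2010:.1f} gigabases/week")
print(f"Now:  {rate_now:.1f} gigabases/week")
print(f"Increase: about {speedup:.0f}x")
```

Under that assumption the weekly output has grown by roughly a factor of 13, and the 2010 figure combined two instruments while the current figure is a single round.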
This, of course, is a familiar refrain for enterprise executives as they weigh their own data plans and what their data requirements will be down the line. But having an infrastructure to handle and store this data isn’t the same thing as having a strategy to turn it into predictive intelligence.
“This is very much hitting the point where you need a strategy to cope with more data today than you have ever generated before,” says Caccamo, explaining the growing challenges TGAC faces as their data volumes increase in size and complexity. The organization looks at these challenges holistically, explained Caccamo in describing how they have built a cultural infrastructure to support their data-driven science goals.
Caccamo explains that a technology focus drives their cultural underpinnings. “In one end, developing the state of the art intensive platform, both sequencing and computational – this is what I call the ‘hard infrastructure.’ The other focus of our activities is on what we can develop between intensive algorithms and databases – this is what we call the ‘soft infrastructure.’”
Caccamo explains that TGAC’s challenge lies in developing these components together through what he calls a systems approach in order to produce “predictive assumptions” that help them towards discovery and ultimately understanding biology.
Included in their base equation is the development of skills, says Caccamo, being very sensitive to the talent acquisition challenges that organizations face in the exploding data science fields. “We really take training into this,” explained Caccamo. “Technologies and training toward developing new strategies and tools is a very important part of what we do.”
With this organizational foundation in place, explained Caccamo, they are then able to focus on their true aim: data-driven science. Their process is something that other organizations can learn from, suggests Caccamo, explaining TGAC’s view of the pipeline that starts with data on the input end and ends with biological science as the output.
The process starts with huge volumes of sequencing data, which efficient algorithms distill into what can then be classified as “information.” In the case of genomic research, the information is the ~150-letter base strings that the researchers use to assemble genome maps. The research head says that once they have achieved the information step in the process, they shift priorities from efficiency to quality. At this step, says Caccamo, they’re focused on turning “information” into “knowledge,” which can then be transformed into biology.
In the case of genome sequencing, the research head used the discovery process of the wheat genome as an example. The first step is extracting and sequencing the DNA into strands of information called “bases,” which are then ready to be assembled into something more actionable.
These bases are then passed through sophisticated hardware (TGAC uses an array of SGI UV supercomputers for much of their assembly work) and assembled into enormous graphs. When mapped out in a graph, wheat has 10 billion nodes of sequencing information, says Caccamo. These graphs are now considered knowledge that biologists can use for the purposes of obtaining better crops.
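The article does not say what kind of graph TGAC builds. As an illustration only, the sketch below uses a de Bruijn graph, a structure commonly used in short-read assembly, to show how overlapping base strings can be turned into nodes and edges. The function name, the toy reads, and the choice of k are all hypothetical and are not taken from TGAC’s pipeline.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph from short reads.

    Each k-letter substring (k-mer) of a read becomes an edge from its
    first k-1 letters to its last k-1 letters. Reads that overlap share
    nodes, which is what lets an assembler stitch them together.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two made-up overlapping reads; real reads are around 150 letters long.
reads = ["ACGTAC", "GTACGT"]
graph = build_de_bruijn(reads, k=4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the two toy reads overlap, they collapse into a single four-node cycle rather than two separate paths. A real assembler adds error correction, graph simplification, and memory-efficient storage on top of this idea, which is where a graph with billions of nodes demands the kind of hardware described above.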
Caccamo explains that having the right strategy in place is essential for the long-term success of TGAC because their data growth is exponential. He notes that as the data becomes cheaper and cheaper to process, the scope of the research tends to expand. In the example of wheat genetics, he explains that the research has expanded to examine environmental factors.
“A recent addition of the toolkit of the bioinformatician is that we can look now into what is present as well in the soil,” says Caccamo. “In this case, it’s not going to be about one species, but instead about a community of species – what we call a microbial community. That’s what we call metagenomics.”
While the concept of metagenomics is contained within the esoteric domain of the genomic community, the concept of the run-away project is not. Virtually every enterprise has experienced the resource drain that happens when a project expands beyond its original scope. In the case of big data, these run-away projects can be very costly if there aren’t strategies in place to govern the direction they take.
As enterprises ramp up their big data initiatives, TGAC’s example suggests that organizations would be wise to consider the culture and processes of the traditional sciences, which have already blazed a trail in managing and processing enormous amounts of data.