What Can Enterprises Learn From Genome Sequencing?
The data handling requirements for genetic discovery are growing as the tools become more sophisticated and data sets become increasingly large. However, dealing with these large data sets is nothing new for practitioners in traditional science, and enterprises can learn from the strategies and processes that these disciplines have put into play.
Mario Caccamo, head of Bioinformatics at The Genome Analysis Center (TGAC) based in the UK, recently outlined the work and challenges that researchers face, explaining the explosive nature of the data and the processes that genome scientists use to wrestle with it.
“We have super exponential growth in data generation,” explained Caccamo. “To give you a concrete example, with two sequencing instruments in 2010, we generated 1.2 terabases of data. We can now generate half of that (600 gigabases) in two weeks in only one sequencing round with our current instruments.”
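To put those figures side by side, the short sketch below converts both into a weekly rate. It uses only the numbers Caccamo quotes, plus one assumption that is not in the article: that the 2010 total was generated evenly across 52 weeks. The resulting ratio is a rough illustration, not a TGAC figure.

```python
# Figures quoted in the article.
BASES_2010 = 1.2e12      # 1.2 terabases generated during 2010 (two instruments)
BASES_PER_RUN = 600e9    # 600 gigabases from one current sequencing round
WEEKS_PER_RUN = 2        # duration of that round

# Assumption (not stated in the article): the 2010 output was spread
# evenly over a full 52-week year.
WEEKS_IN_2010 = 52


def weekly_rate(bases: float, weeks: float) -> float:
    """Average number of bases generated per week."""
    return bases / weeks


rate_2010 = weekly_rate(BASES_2010, WEEKS_IN_2010)
rate_now = weekly_rate(BASES_PER_RUN, WEEKS_PER_RUN)
speedup = rate_now / rate_2010

print(f"2010: {rate_2010 / 1e9:.1f} gigabases/week")
print(f"now:  {rate_now / 1e9:.1f} gigabases/week")
print(f"ratio: {speedup:.1f}x")
```

Under that assumption, a single current run outpaces the 2010 weekly average by roughly an order of magnitude.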
This kind of data growth is, of course, a familiar refrain for enterprise executives weighing their own data plans and what their requirements will be down the line. But having an infrastructure to handle and store this data isn’t the same thing as having a strategy to turn it into predictive intelligence.
“This is very much hitting the point where you need a strategy to cope with more data today than you have ever generated before,” says Caccamo, explaining the growing challenges TGAC faces as their data volumes increase in size and complexity. The organization looks at these challenges holistically, explained Caccamo in describing how they have built a cultural infrastructure to support their data-driven science goals.
Caccamo explains that a technology focus drives their cultural underpinnings. “In one end, developing the state of the art intensive platform, both sequencing and computational – this is what I call the ‘hard infrastructure.’ The other focus of our activities is on what we can develop between intensive algorithms and databases – this is what we call the ‘soft infrastructure.’”
Caccamo explains that TGAC’s challenge lies in developing these components together through what he calls a systems approach, in order to produce “predictive assumptions” that move them toward discovery and, ultimately, toward understanding biology.
Skills development is part of their base equation, says Caccamo, who is keenly aware of the talent acquisition challenges that organizations face in the exploding data science fields. “We really take training into this,” explained Caccamo. “Technologies and training toward developing new strategies and tools is a very important part of what we do.”
With this organizational foundation in place, explained Caccamo, they are able to focus on their true aim: data-driven science. Their process is something that other organizations can learn from, suggests Caccamo, explaining TGAC’s view of a pipeline that starts with data on the input end and produces biological science as the output.
The process starts with huge volumes of sequencing data. Using efficient algorithms, researchers endeavor to distill that data into what can then be classified as “information.” In the case of genomic research, the information is the ~150-letter base strings that researchers use to assemble genome maps. The research head says that once they have completed the information step, they shift priorities from efficiency to quality. At this step, says Caccamo, they’re focused on turning “information” into “knowledge,” which can then be transformed into biology.
In the case of genome sequencing, the research head used the discovery process of the wheat genome as an example. The first step is extracting and sequencing the DNA into strings of “bases,” which are ready to be assembled into something more actionable.
These bases are then passed through sophisticated hardware (TGAC uses an array of SGI UV supercomputers for much of their assembly work) and assembled into enormous graphs. When mapped out in a graph, wheat has 10 billion nodes of sequencing information, says Caccamo. These graphs are now considered knowledge that biologists can use to obtain better crops.
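For readers who want a feel for what “assembled into graphs” can mean, here is a minimal sketch of a de Bruijn graph, one widely used structure in genome assembly. The article does not say which graph type or software TGAC uses, so this is an illustrative assumption; the function name `build_de_bruijn`, the toy reads, and the choice of `k` are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-letter window in a read contributes one edge, from the
    window's first k-1 letters to its last k-1 letters.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads for illustration only; the reads described in the article
# are around 150 letters long.
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)

# Nodes are every (k-1)-mer that appears as a source or a target.
nodes = set(graph) | {n for targets in graph.values() for n in targets}

for source, targets in sorted(graph.items()):
    print(source, "->", targets)
print(len(nodes), "nodes")
```

Overlapping reads share nodes, which is how separate fragments end up stitched into one structure. At the scale Caccamo describes, the same idea has to be applied across billions of nodes, which is where the large-memory hardware comes in.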
Caccamo explains that having the right strategy in place is essential for the long-term success of TGAC because their data growth is exponential. He notes that as the data becomes cheaper and cheaper to process, the scope of the research tends to expand. In the example of wheat genetics, he explains that the research has expanded to examine environmental factors.
“A recent addition of the toolkit of the bioinformatician is that we can look now into what is present as well in the soil,” says Caccamo. “In this case, it’s not going to be about one species, but instead about a community of species – what we call a microbial community. That’s what we call metagenomics.”
While the concept of metagenomics is contained within the esoteric domain of the genomic community, the concept of the runaway project is not. Virtually every enterprise has experienced the resource drain that happens when a project expands beyond its original scope. In the case of big data, these runaway projects can be very costly if there aren’t strategies in place to govern the direction they take.
As enterprises ramp up their big data initiatives, TGAC’s example suggests that they would be wise to study the culture and processes of the traditional sciences, which have already blazed a trail in managing and processing enormous amounts of data.