How to Build a Better Machine Learning Pipeline
Machine learning (ML) pipelines consist of several steps to train a model, but the term ‘pipeline’ is misleading because it implies a one-way flow of data. In practice, machine learning pipelines are cyclical and iterative: every step is repeated to continuously improve the accuracy of the model. To build better machine learning models, and to get the most value from them, accessible, scalable and durable storage is imperative, which paves the way for on-premises object storage.
Machine Learning Is Burgeoning
Welcome to the era of digital transformation, where data has become a modern-day currency. Tremendous value and intelligence are being extracted from large, captured datasets (big data), and today’s analytics turn them into actionable insights. Data analytics is uncovering trends, patterns, associations and new connections, and producing precise predictions that help businesses achieve better outcomes. It’s no longer just about storing data, but about capturing, preserving, accessing and transforming it to take advantage of its possibilities and the value it can deliver. The goal for ML is simple: make faster and more predictive decisions.
Many of today’s ML models are ‘trained’ neural networks capable of executing a specific task or of moving insights from ‘what happened’ to ‘what will likely happen’ (predictive analysis). These models are complex and are never truly finished. Instead, a mathematical or computational procedure is applied repeatedly to the previous result, improving on it each time to get a closer approximation to ‘solving the problem.’ Data scientists want more captured data to provide the fuel to train the ML models.
Global use of machine learning is burgeoning, and market revenue is expected to grow to $8.81 billion by 2022, at a 44.1 percent CAGR. Businesses are rethinking their data strategies to include machine learning capabilities, not only to increase competitiveness, but also to create infrastructures that help enable data to live forever.
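The idea of repeatedly applying a procedure to the previous result can be illustrated with a small sketch. This is not how any particular production model is trained; it assumes a toy problem (fitting a straight line) and uses plain gradient descent, with the function name, learning rate and data chosen purely for illustration.

```python
def fit_line(points, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Each pass starts from the previous result and nudges it closer.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative values only).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(data)
```

With this data the estimates settle close to a slope of 2 and an intercept of 1; with noisier or larger datasets, the same loop simply keeps refining its approximation.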
Getting Familiar with ML Pipelines
A machine learning pipeline helps automate machine learning workflows. It takes data through a sequence of transformations and correlates it in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.
There are generally two types of machine learning approaches (Figure 1). The first, supervised learning, is the most common use of machine learning: a model is built and labeled datasets are provided to solve a particular problem, typically using classification algorithms. The second approach is unsupervised learning, where a model is built to discover structures within given datasets. Because the initial data captured is not necessarily labeled, clustering algorithms are used to group the unlabeled data together.
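The difference between the two approaches can be sketched in a few lines. The example below is a deliberately small illustration on one-dimensional data, assuming a nearest-centroid classifier for the supervised case and a basic k-means loop for the unsupervised case; the function names and sample values are hypothetical.

```python
from statistics import mean


def train_nearest_centroid(samples):
    """Supervised: samples are (value, label) pairs; learn one centroid per label."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def kmeans_1d(values, k, iterations=20):
    """Unsupervised: group unlabeled values into k clusters (basic k-means)."""
    # Crude initial guess: evenly spaced picks from the sorted values.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - v))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


# Supervised: labels are provided with the training data.
labeled = [(1.0, "low"), (1.2, "low"), (9.0, "high"), (9.5, "high")]
centroids = train_nearest_centroid(labeled)
prediction = classify(centroids, 2.0)

# Unsupervised: no labels, the structure is discovered from the data.
clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
```

In the supervised half the labels drive what the model learns; in the unsupervised half the grouping emerges only from how close the values are to each other.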
Challenges Associated with ML Pipelines
In creating machine learning pipelines, there are challenges that data scientists face, but the most prevalent ones fall into three categories: Data Quality, Data Reliability and Data Accessibility.
If the data is not accurate, complete, reliable and robust, there is little point in running machine learning models because the outcomes will be wrong. This places a very high priority on data reliability because data scientists want as much quality data as possible to build and train their ML models. The more high-quality data they get, the more accurate their outcomes.
Data that will be used to run machine learning pipelines will be generated from a variety of sources. To determine the reliability of the data, those who produce and assess data outcomes must collaborate so that the data itself, its source, and the people who assessed the analysis can all be trusted. As such, a repository for the data outcomes that serves as a single source of truth is required. This enables the source data to reside in a single repository that data scientists and analysts can access quickly and use as a reference whenever they need to present results.
The single source repository also enables machine learning to be run from various locations within a data center, rather than requiring administrators to physically carry or port the ML model to wherever the analysis is being conducted. This avoids duplicate and varying versions of data, and makes sure that analytical teams from multiple organizations are always working with the most recent and reliable data.
Before any machine learning model is run, the data itself must be accessible, requiring consolidation, cleansing and curation (where more qualitative data is added such as data sources, authorized users, project name, and time-stamp references). As a result of data curation, metadata is updated with the new tags.
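As a rough sketch of what curation adds, the snippet below attaches the kinds of tags mentioned above (data source, authorized users, project name and a time stamp) to a record's metadata. The `CuratedRecord` class, the `curate` helper and all sample values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CuratedRecord:
    payload: bytes
    metadata: dict = field(default_factory=dict)


def curate(record, source, authorized_users, project):
    """Return a copy of the record with curation tags merged into its metadata."""
    tags = {
        "source": source,
        "authorized_users": sorted(authorized_users),
        "project": project,
        "curated_at": datetime.now(timezone.utc).isoformat(),
    }
    # Existing metadata is preserved; the new tags are layered on top.
    return CuratedRecord(record.payload, {**record.metadata, **tags})


raw = CuratedRecord(b"sample,values\n1,2\n", {"format": "csv"})
curated = curate(
    raw,
    source="instrument-07",             # hypothetical source name
    authorized_users={"bob", "alice"},  # hypothetical user names
    project="demo-study",               # hypothetical project name
)
```

The original record is left untouched, which keeps the uncurated data available for comparison.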
Since data can be captured from years or even decades past, it can reside on many forms of storage media, ranging from hard drives to memory sticks to hard copies in shoe boxes. In many cases, it resides on tape that deteriorates over time, can be difficult to find and may require obsolete readers to extract the data. Analyzing big data today requires that it be captured and stored on reliable media, not only for immediate access, but also so that its integrity and accuracy can be validated. As such, enterprise SSDs and HDDs are used extensively to consolidate and store data for machine learning applications.
Cleansing is equally important as it removes irrelevant and redundant data during the pre-analysis stage. Doing this will not only save compute power, and associated time and costs, but will significantly increase the accuracy and comprehensibility of the ML model itself. Feature selection is a process used to cleanse unnecessary data by selecting attributes (or features) that are the most relevant in creating a predictive model. Feature extraction (Figure 2) is an alternate process that extracts existing features (and their associated data transformations) into new formats that not only describe variances within the data, but reduce the amount of information that is required to represent the ML model.
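A minimal sketch of cleansing and feature selection might look like the following. It assumes two simple rules that are stand-ins for whatever a real pipeline would use: exact duplicate rows count as redundant, and a feature whose values never vary counts as irrelevant (a variance-threshold filter). The column names and values are made up for illustration.

```python
from statistics import pvariance


def drop_duplicate_rows(rows):
    """Remove exact duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def select_features(rows, names, min_variance=1e-9):
    """Keep only the columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    selected_names = [names[i] for i in keep]
    selected_rows = [[row[i] for i in keep] for row in rows]
    return selected_names, selected_rows


names = ["age", "site_id", "reading"]
rows = [
    [34, 1, 0.82],
    [51, 1, 0.47],
    [34, 1, 0.82],  # exact duplicate of the first row
    [29, 1, 0.91],
]
clean = drop_duplicate_rows(rows)
kept_names, kept_rows = select_features(clean, names)
```

Here the duplicate row is dropped and the constant `site_id` column is filtered out, leaving less data for the model to process without losing any information that distinguishes the rows.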
Once the data is cleansed, it can be aggregated with other cleansed data. From a data scientist’s perspective, this is heaven since massive quantities of stored data are needed to successfully run and train analytical models. Storing data in today’s data-centric world is no longer about just recovering datasets, but rather preserving them and being able to access them easily using search and index techniques. As such, data curating is part of the cleansing process but worth a separate callout as it requires reference marks as to where the data originated, as well as other forms of identification that differentiate it from other data, so that the information is reliable and trusted.
The Value Is In the Metadata
For data scientists and analysts who strive to obtain good outcomes from big data and improve their results over time, it is really about the metadata. Metadata extraction and the discovered correlations between metadata insights are the foundation of ML models. Once a model is sufficiently trained, it can be put into production to deliver faster determinations. In a traditional file-based network-attached storage (NAS) architecture, directories are used to tag data and must be traversed each time the data needs to be accessed. Having so many directories to traverse in a hierarchical scheme makes it difficult to find files and access them quickly. More importantly, the file-based approach holds little to no information about the stored data that could help in analysis, simplify management, or support ever-increasing amounts of data at scale.
When a business or operation reaches scale, the IT department needs to look at new storage solutions that are affordable, can help keep data forever (for analysis and ML training) and, most importantly, scale easily. Object storage has made tremendous inroads and is an exceptional option for storing unstructured data at petabyte scale. Unlike file-based storage that manages data in a folder hierarchy, or block-based storage that manages disk sectors collectively as blocks, object storage manages data as objects.
In an object storage platform, the totality of the data, be it a document, audio or video file, image or photo, or other unstructured data, is stored as a single object. Metadata resides with the captured data and provides descriptive information about the object and the data itself. This eliminates the need for a hierarchical structure and simplifies access by placing everything in a flat address space (or single namespace). The unique identifier assigned to each object makes it easier to index and retrieve data, or find a specific object.
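The flat address space can be pictured with a toy in-memory model. This sketch is not the API of any real object storage product; the `ObjectStore` class, its methods and the sample metadata are assumptions made for illustration only.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: one flat namespace, no directories."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data, metadata):
        object_id = str(uuid.uuid4())  # unique identifier for the object
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Direct lookup by ID; there is no path to traverse."""
        return self._objects[object_id]

    def find(self, **criteria):
        """Return the IDs of objects whose metadata matches every criterion."""
        return [
            oid
            for oid, (_, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        ]


store = ObjectStore()
scan_id = store.put(b"image-bytes", {"type": "image", "project": "demo"})
notes_id = store.put(b"text-bytes", {"type": "document", "project": "demo"})
images = store.find(type="image")
```

Because the metadata travels with each object, a search such as `store.find(project="demo")` needs no knowledge of where the data sits in any hierarchy.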
Since metadata resides with captured data, users can tag as many data points as they want, and tag and find groups of objects much faster than with file- or block-based storage options. Object storage also enables versioning, a very important feature for ML pipelines because of the repetitiveness in refining algorithms. With this feature, data scientists can version their data so that they or their collaborators can reproduce the results later. Versioning helps to shorten research time, obtain desired results faster, enable reproducible machine learning pipelines and validate data reliability. And since many users pay for storage per petabyte, grouping data as objects lets one person manage more petabytes, resulting in lower total cost of ownership (TCO), especially in terms of manpower and power consumption.
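To show why versioning supports reproducibility, here is a small sketch in which every write under a key is kept as a new version rather than overwriting the old one. Real object stores expose versioning through their own interfaces; the `VersionedStore` class, its 1-based version numbers and the sample data are hypothetical.

```python
import hashlib


class VersionedStore:
    """Toy versioned store: each put under a key appends a new version."""

    def __init__(self):
        self._versions = {}  # key -> list of (digest, data)

    def put(self, key, data):
        digest = hashlib.sha256(data).hexdigest()
        self._versions.setdefault(key, []).append((digest, data))
        return len(self._versions[key])  # 1-based version number

    def get(self, key, version=None):
        """Return the requested version, or the latest when version is None."""
        history = self._versions[key]
        _, data = history[-1] if version is None else history[version - 1]
        return data

    def digest(self, key, version):
        """Checksum recorded when the version was written."""
        return self._versions[key][version - 1][0]


store = VersionedStore()
v1 = store.put("training-set", b"a,b\n1,2\n")
v2 = store.put("training-set", b"a,b\n1,2\n3,4\n")

# A collaborator can re-run an earlier experiment on exactly the data it used.
original = store.get("training-set", version=v1)
latest = store.get("training-set")
```

Recording a checksum alongside each version gives a simple way to confirm that the data being reused is the data that produced the earlier result.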
Object Storage for ML Pipelines
Machine learning gets better over time as more data points are collected and the true value occurs when different data assets from a variety of sources are correlated together. The act of correlating these new data formats streaming into the data center is quite a challenge as it’s not just about the sheer capacity of data, but more about the disparate data formats and the set of applications that need to access them. Businesses are now focusing on consolidating their assets into a single petabyte scale-out storage architecture. On-premises object storage or cloud storage systems serve a great purpose for these environments as they are designed to scale and support custom data formats.
As data scientists and analysts play more prominent roles in mapping the statistical significance of key problems and translating it quickly for business implementation, they also strive to improve their results. They want to store everything locally because their research is local and not in a public cloud, as the time it takes to download an abundance of ML content can be extraordinary. And they want immediate access to improve their algorithm and re-run the analysis – repeating as necessary so that better comparisons can be made to the original results.
With GPUs residing next to the data on the compute side, results can be produced faster, and analytical processing is enabled rather than blocked. Every step in the ML process is cyclical and iterative as algorithms are being updated, analysis is being reprocessed, more data is being accumulated, and the end result is either improved or worsened. Once the computer learns, further tests can be run to see if the results are accurate and whether the analysis needs to be re-run.
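The cycle described here can be sketched as a loop that retrains whenever more data arrives and compares each result with the best one so far. The example assumes a deliberately simple model (a single threshold on one-dimensional data) and made-up batches; every name in it is hypothetical.

```python
from statistics import mean


def train(samples):
    """Fit a 1-D threshold halfway between the two class means."""
    lows = [x for x, label in samples if label == 0]
    highs = [x for x, label in samples if label == 1]
    return (mean(lows) + mean(highs)) / 2


def accuracy(threshold, samples):
    """Share of samples on the correct side of the threshold."""
    hits = sum((x > threshold) == bool(label) for x, label in samples)
    return hits / len(samples)


def run_cycles(batches, validation):
    """Retrain as each batch arrives; keep a model only when it scores better."""
    collected, best_threshold, best_score, history = [], None, -1.0, []
    for batch in batches:
        collected.extend(batch)       # more data is accumulated
        candidate = train(collected)  # the analysis is reprocessed
        score = accuracy(candidate, validation)
        history.append(score)         # the result may improve or worsen
        if score > best_score:
            best_threshold, best_score = candidate, score
    return best_threshold, best_score, history


# Each batch is assumed to contain at least one sample of each class.
batches = [
    [(1.0, 0), (4.0, 1)],
    [(2.0, 0), (8.0, 1)],
    [(3.0, 0), (9.0, 1)],
]
validation = [(2.5, 0), (3.5, 0), (6.0, 1), (7.0, 1)]
best_threshold, best_score, history = run_cycles(batches, validation)
```

Keeping the score history makes it possible to compare each re-run against the original results, which is the comparison the iterative process depends on.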
The amount of data businesses capture and store today is overwhelming. However, it’s not the volume of data being gathered that’s most important – but what businesses are doing with the data that really matters. Today’s businesses are starting to realize that big data is powerful, and significantly more valuable when paired with intelligent automation. Supported by massive computational power, machine learning is helping businesses manage, analyze and use their data far more effectively than ever before.
About the author: Linda Zhou is the Director of Research and Life Sciences Solutions for the Data Center Systems (DCS) business unit within Western Digital. She has in-depth knowledge of life sciences, machine learning, big data analytics, IT service management (ITSM) and compliance archiving. Prior to joining Western Digital, Ms. Zhou held business and technical positions at Silicon Graphics, Inc., EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space. She earned a Master’s degree in Business Administration from Carnegie Mellon University and a Bachelor’s degree in Computer Science and Engineering from Jinan University.