How to Build a Better Machine Learning Pipeline
Machine learning (ML) pipelines consist of several steps to train a model, but the term ‘pipeline’ is misleading as it implies a one-way flow of data. Instead, machine learning pipelines are cyclical and iterative as every step is repeated to continuously improve the accuracy of the model and achieve a successful algorithm. To build better machine learning models, and get the most value from them, accessible, scalable and durable storage solutions are imperative, paving the way for on-premises object storage.
Machine Learning Is Burgeoning
Welcome to the era of digital transformation, where data has become a modern-day currency. Tremendous value and intelligence are being extracted from large, captured datasets (big data), yielding actionable insights through today’s analytics. Data analytics is uncovering trends, patterns, associations and new connections, and producing precise predictions that are helping businesses achieve better outcomes. It’s no longer just about storing data, but about capturing, preserving, accessing and transforming it to take advantage of its possibilities and the value it can deliver. The goal for ML is simple: make faster and more predictive decisions.
Many of today’s ML models are ‘trained’ neural networks capable of executing a specific task or providing insights that move from ‘what happened’ to ‘what will likely happen’ (predictive analysis). These models are complex and never truly complete. Instead, they are refined iteratively: a mathematical or computational procedure is applied to the previous result, and each repetition yields a closer approximation to ‘solving the problem.’ Data scientists want more captured data as the fuel to train their ML models.
Global use of machine learning is burgeoning, and the market is expected to grow to $8.81 billion in revenue by 2022, at a 44.1 percent CAGR. Businesses are rethinking their data strategies to include machine learning capabilities, not only to increase competitiveness, but also to create infrastructures that help enable data to live forever.
Getting Familiar with ML Pipelines
A machine learning pipeline helps automate machine learning workflows. It takes a sequence of data through transformation and correlation steps into a model that can be tested and evaluated to achieve an outcome, whether positive or negative.
There are generally two types of machine learning approaches (Figure 1). The first, supervised learning, is the most common use of machine learning: a model is built and labeled datasets are provided to solve a particular problem, typically using classification algorithms. The second approach is unsupervised learning, where a model is built to discover structures within given datasets. The initial data captured is not necessarily labeled, so clustering algorithms are used to group the unlabeled data together.
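To make the distinction concrete, here is a minimal sketch in plain Python. It is an illustration only, not a method described in this article: the nearest-centroid classifier, the basic k-means loop, the one-dimensional data and all function names are assumptions chosen for brevity.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Supervised: learn one centroid per label from labeled 1-D samples."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(values) for label, values in groups.items()}


def classify(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def cluster_1d(samples, k=2, iterations=10):
    """Unsupervised: group unlabeled 1-D samples with a basic k-means loop."""
    ordered = sorted(samples)
    # Seed each center from the middle of one of k equal slices of the
    # sorted data (an arbitrary but deterministic choice).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(centers[i] - x))
            clusters[nearest].append(x)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters
```

The supervised functions need the labels up front, while cluster_1d receives only the raw values and has to find the groups on its own.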
Challenges Associated with ML Pipelines
In creating machine learning pipelines, there are challenges that data scientists face, but the most prevalent ones fall into three categories: Data Quality, Data Reliability and Data Accessibility.
If the data is not accurate, complete, reliable and robust, there is no point in running machine learning models because the outcomes will be wrong. This places a very high priority on data quality and reliability, because data scientists want as much quality data as possible to build and train their ML models. The more high-quality data they get, the more accurate their outcomes.
The data used to run machine learning pipelines is generated from a variety of sources. Determining its reliability requires collaboration among those who own the data outcomes, so that the data itself, its source, and those who assessed the analysis can all be trusted. As such, a repository for the data outcomes that serves as a single source of truth is required. With the source data residing in a single repository, data scientists and analysts can access it quickly and use it as a reference whenever they need to present results.
The single source repository also enables machine learning to be run from various locations within a data center, rather than requiring administrators to physically carry or port the ML model to wherever the analysis is being conducted. This avoids duplicate and varying versions of data, and ensures that analytical teams, even across multiple organizations, are always working with the most recent and reliable data.
Before any machine learning model is run, the data itself must be accessible, requiring consolidation, cleansing and curation (where more qualitative data is added such as data sources, authorized users, project name, and time-stamp references). As a result of data curation, metadata is updated with the new tags.
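As a rough illustration of curation, the sketch below adds the kinds of tags mentioned above (data source, authorized users, project name and a time stamp) to a metadata record. The function name, the field names and the dictionary layout are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


def curate(metadata, source, authorized_users, project, timestamp=None):
    """Return a copy of `metadata` updated with curation tags.

    All field names here are illustrative assumptions, not a standard schema.
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    tagged = dict(metadata)  # leave the caller's record untouched
    tagged.update({
        "source": source,
        "authorized_users": sorted(authorized_users),
        "project": project,
        "curated_at": timestamp.isoformat(),
    })
    return tagged
```

Returning a copy rather than editing the record in place keeps the pre-curation metadata available for comparison.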
Since data can be captured from years or even decades past, it can reside on many forms of storage media ranging from hard drives to memory sticks to hard copies in shoe boxes. In many cases, it resides on tape that deteriorates over time, can be difficult to find and may require obsolete readers to extract the data. Analyzing big data in the modern world requires that it be captured and stored on reliable media, not only for immediate access, but to validate that it is of the highest integrity and accuracy possible. As such, enterprise SSDs and HDDs are used extensively to consolidate and store data for machine learning applications.
Cleansing is equally important, as it removes irrelevant and redundant data during the pre-analysis stage. Doing this will not only save compute power, and the associated time and costs, but will significantly increase the accuracy and comprehensibility of the ML model itself. Feature selection is a process used to cleanse unnecessary data by selecting the attributes (or features) that are most relevant in creating a predictive model. Feature extraction (Figure 2) is an alternative process that transforms existing features into new ones that not only describe the variance within the data, but also reduce the amount of information required to represent the ML model.
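A very simple form of feature selection can be sketched as a filter that drops columns carrying no information. This is only an assumption-laden illustration: production pipelines typically rank features with statistical relevance measures, and the function name and sample columns below are hypothetical.

```python
def select_features(rows, names):
    """Drop constant columns (irrelevant) and exact-duplicate columns (redundant).

    `rows` is a list of equal-length records; `names` holds one unique name
    per column. Returns the kept column names and the reduced rows.
    """
    columns = {name: tuple(row[i] for row in rows) for i, name in enumerate(names)}
    kept, seen = [], set()
    for name in names:
        values = columns[name]
        if len(set(values)) <= 1:  # same value in every row: no signal
            continue
        if values in seen:         # identical to a column already kept
            continue
        seen.add(values)
        kept.append(name)
    indices = [names.index(name) for name in kept]
    return kept, [[row[i] for i in indices] for row in rows]
```

Fewer columns means less data to move and process on every training run, which is where the compute savings described above come from.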
Once the data is cleansed, it can be aggregated with other cleansed data. From a data scientist’s perspective, this is heaven since massive quantities of stored data are needed to successfully run and train analytical models. Storing data in today’s data-centric world is no longer about just recovering datasets, but rather preserving them and being able to access them easily using search and index techniques. As such, data curating is part of the cleansing process but worth a separate callout as it requires reference marks as to where the data originated, as well as other forms of identification that differentiate it from other data, so that the information is reliable and trusted.
The Value Is In the Metadata
For data scientists and analysts who strive to obtain good outcomes from big data and improve their results over time, it is really about the metadata. Metadata extraction and the discovered correlations between metadata insights are the foundation of ML models. Once a model is sufficiently trained, it can be put into production to deliver faster determinations. In a traditional file-based network-attached storage (NAS) architecture, directories are used to tag data and must be traversed each time the data needs to be accessed. With so many directories to traverse in a hierarchical scheme, it is difficult to find files and access them quickly. More importantly, the file-based approach holds little to no information about the stored data that could help in analysis, simplify management, or support ever-increasing amounts of data at scale.
Once a business or operation reaches scale, the IT department needs to look at new storage solutions that are affordable, can help keep data forever (for analysis and ML training) and, most importantly, scale easily. Object storage has made tremendous inroads and is an exceptional option for storing unstructured data at petabyte scale. Unlike file-based storage that manages data in a folder hierarchy, or block-based storage that manages disk sectors collectively as blocks, object storage manages data as objects.
In an object storage platform, the totality of the data, be it a document, audio or video file, image or photo, or other unstructured data, is stored as a single object. Metadata resides with the captured data and provides descriptive information about the object and the data itself. This eliminates the need for a hierarchical structure and simplifies access by placing everything in a flat address space (or single namespace). The unique identifier assigned to each object makes it easier to index and retrieve data, or find a specific object.
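The idea of a flat address space with metadata stored alongside the data can be sketched with a toy in-memory store. This is not the API of any real object storage product: the class, its method names and the use of a random UUID as the unique identifier are assumptions made for illustration.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: flat namespace, metadata kept with the data."""

    def __init__(self):
        # One flat mapping of object ID -> (data, metadata); no directories.
        self._objects = {}

    def put(self, data, **metadata):
        """Store data with its metadata and return a unique object ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Retrieve the data directly by ID, with no path traversal."""
        return self._objects[object_id][0]

    def metadata(self, object_id):
        """Return a copy of the metadata stored with the object."""
        return dict(self._objects[object_id][1])

    def find(self, **tags):
        """Return IDs of objects whose metadata matches every given tag."""
        return [
            object_id
            for object_id, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in tags.items())
        ]
```

For example, store.find(project="genomics") returns every object tagged with that project, regardless of where a file system would have placed it.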
Since metadata resides with captured data, users can tag as many data points as they want, and tag and find groups of objects much faster than with file- or block-based storage options. Object storage also enables versioning, a very important feature for ML pipelines because of the repetitiveness in refining algorithms. Leveraging this feature of object storage, data scientists can version their data so that they or their collaborators can reproduce the results later. Versioning helps to shorten research time, obtain desired results faster, enable reproducible machine learning pipelines and validate data reliability. And since many users pay for storage per petabyte, and one person can manage more petabytes when data is grouped as objects, the result is a lower total cost of ownership (TCO), especially relating to manpower and power consumption.
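Versioning can be pictured as keeping every write to a key instead of overwriting the previous one. The sketch below is a simplified, hypothetical model (class and method names are assumptions) and does not mirror the interface of any particular object storage system.

```python
class VersionedStore:
    """Toy versioned store: each put to a key appends a new version."""

    def __init__(self):
        self._versions = {}  # key -> list of payloads, oldest first

    def put(self, key, data):
        """Save data under key and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)
        return len(history)

    def get(self, key, version=None):
        """Return the latest version, or a specific earlier one if requested."""
        history = self._versions[key]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise ValueError(f"{key!r} has no version {version}")
        return history[version - 1]
```

A collaborator who wants to reproduce an earlier result asks for the version number recorded with that run, rather than whatever happens to be stored under the key today.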
Object Storage for ML Pipelines
Machine learning gets better over time as more data points are collected and the true value occurs when different data assets from a variety of sources are correlated together. The act of correlating these new data formats streaming into the data center is quite a challenge as it’s not just about the sheer capacity of data, but more about the disparate data formats and the set of applications that need to access them. Businesses are now focusing on consolidating their assets into a single petabyte scale-out storage architecture. On-premises object storage or cloud storage systems serve a great purpose for these environments as they are designed to scale and support custom data formats.
With data scientists and analysts playing more prominent roles in mapping the statistical significance of key problems and translating it quickly for business implementation, they also strive to improve their results. They want to store everything locally, not in a public cloud, because their research is local and the time it takes to download an abundance of ML content can be extraordinary. And they want immediate access to improve their algorithm and re-run the analysis, repeating as necessary so that better comparisons can be made to the original results.
With GPUs residing next to the data on the compute side, results can be produced faster, and analytical processing is enabled rather than blocked. Every step in the ML process is cyclical and iterative as algorithms are being updated, analysis is being reprocessed, more data is being accumulated, and the end result is either improved or worsened. Once the computer learns, further tests can be run to see if the results are accurate and whether the analysis needs to be re-run.
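The cycle of updating, re-running and re-evaluating can be illustrated with a deliberately tiny model. The sketch assumes a one-parameter model y = w * x refined by gradient descent on mean squared error; the model, the learning rate and the function names are illustrative choices, not something prescribed by this article.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x over (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def refine(data, w=0.0, learning_rate=0.01, rounds=200):
    """Repeatedly update w from the previous result, recording the error each round."""
    history = [mse(w, data)]
    for _ in range(rounds):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * gradient
        history.append(mse(w, data))  # re-evaluate after every update
    return w, history
```

Comparing entries in the returned history shows whether a given round improved or worsened the result, which is the check that decides if the analysis needs to be re-run.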
The amount of data businesses capture and store today is overwhelming. However, it’s not the volume of data being gathered that’s most important – but what businesses are doing with the data that really matters. Today’s businesses are starting to realize that big data is powerful, and significantly more valuable when paired with intelligent automation. Supported by massive computational power, machine learning is helping businesses manage, analyze and use their data far more effectively than ever before.
About the author: Linda Zhou is the Director of Research and Life Sciences Solutions for the Data Center Systems (DCS) business unit within Western Digital. She has in-depth knowledge of life sciences, machine learning, big data analytics, IT service management (ITSM) and compliance archiving. Prior to joining Western Digital, Ms. Zhou held business and technical positions at Silicon Graphics, Inc., EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space. She earned a Master’s degree in Business Administration from Carnegie Mellon University and a Bachelor’s degree in Computer Science and Engineering from Jinan University.