How to Build a Better Machine Learning Pipeline
Machine learning (ML) pipelines consist of several steps to train a model, but the term ‘pipeline’ is misleading as it implies a one-way flow of data. Instead, machine learning pipelines are cyclical and iterative as every step is repeated to continuously improve the accuracy of the model and achieve a successful algorithm. To build better machine learning models, and get the most value from them, accessible, scalable and durable storage solutions are imperative, paving the way for on-premises object storage.
Machine Learning Is Burgeoning
Welcome to the era of digital transformation, where data has become a modern-day currency. Tremendous value and intelligence are being extracted from large, captured datasets (big data), leading to actionable insights through today’s analytics. Data analytics is uncovering trends, patterns, associations and new connections, and producing precise predictions that are helping businesses achieve better outcomes. It’s no longer just about storing data, but about capturing, preserving, accessing and transforming it to take advantage of its possibilities and the value it can deliver. The goal for ML is simple: make faster and more predictive decisions.
Many of today’s ML models are ‘trained’ neural networks capable of executing a specific task or providing insights that move from ‘what happened’ to ‘what will likely happen’ (predictive analysis). These models are complex and are never truly finished; rather, mathematical or computational procedures are applied repeatedly to the previous result, improving on it each time to reach a closer approximation to ‘solving the problem.’ Data scientists want more captured data to provide the fuel to train the ML models.
Global use of machine learning is burgeoning, and the market is expected to grow in revenue to $8.81 billion by 2022, at a 44.1 percent CAGR. Businesses are rethinking their data strategies to include machine learning capabilities, not only to increase competitiveness, but also to create infrastructures that help enable data to live forever.
Getting Familiar with ML Pipelines
A machine learning pipeline helps automate machine learning workflows. It works by passing data through a sequence of transformations and correlating the results in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.
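As a rough illustration, a pipeline can be thought of as an ordered list of steps, each one consuming the output of the step before it. The sketch below uses only the Python standard library. The step names (`clean`, `scale`) and the `run_pipeline` helper are hypothetical and are not taken from any particular ML framework.

```python
from typing import Callable, List, Optional, Sequence

Step = Callable[[list], list]


def clean(values: Sequence[Optional[float]]) -> List[float]:
    """Drop missing readings (represented here as None)."""
    return [v for v in values if v is not None]


def scale(values: Sequence[float]) -> List[float]:
    """Rescale values to the 0..1 range so they are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def run_pipeline(data: list, steps: Sequence[Step]) -> list:
    """Feed the data through each step in order."""
    for step in steps:
        data = step(data)
    return data


result = run_pipeline([2.0, None, 4.0, 6.0], [clean, scale])
print(result)
```

Because the steps are ordinary functions in a list, one step can be swapped or re-run without touching the others, which is what makes the repeated, iterative use of a pipeline practical.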
There are generally two types of machine learning approaches (Figure 1). The first, and the most common, is supervised learning, where a model is built and labeled datasets are provided to solve a particular problem using classification algorithms. The second approach is unsupervised learning, where a model is built to discover structures within given datasets. The initial data captured is not necessarily labeled, so clustering algorithms are used to group the unlabeled data together.
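To make the distinction concrete, here is a minimal sketch on one-dimensional toy data, using only the standard library. The supervised half is a nearest-centroid classifier and the unsupervised half is k-means with two clusters. Both are simple stand-ins chosen for brevity, not the algorithms any specific product uses, and all function names are hypothetical.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Supervised: learn one centroid per label from labeled samples."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}


def classify(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(samples, rounds=10):
    """Unsupervised: split unlabeled samples into two clusters."""
    c0, c1 = min(samples), max(samples)  # initial cluster centers
    g0, g1 = list(samples), []
    for _ in range(rounds):
        g0 = [x for x in samples if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in samples if abs(x - c0) > abs(x - c1)]
        if not g1:
            break  # every sample landed in one cluster
        c0, c1 = mean(g0), mean(g1)
    return sorted(g0), sorted(g1)


data = [1.0, 2.0, 9.0, 10.0]
centroids = fit_centroids(data, ["low", "low", "high", "high"])
print(classify(centroids, 3.0))
print(two_means(data))
```

The classifier needs the labels up front, while the clustering routine recovers the same two groups from the raw values alone.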
Challenges Associated with ML Pipelines
In creating machine learning pipelines, there are challenges that data scientists face, but the most prevalent ones fall into three categories: Data Quality, Data Reliability and Data Accessibility.
If the data is not accurate, complete, reliable and robust, there is no point in running machine learning models because the outcomes will be wrong. This places a very high priority on data reliability because data scientists want as much quality data as possible to build and train their ML models. The more high-quality data they get, the more accurate their outcomes.
Data that will be used to run machine learning pipelines will be generated from a variety of sources. In order to determine the reliability of the data, collaboration amongst those who have data outcomes is required so that the data itself, its source of generation, and those who assessed the analysis are trusted and viable. As such, implementing a repository for the data outcomes that serves as a single source of truth is required. This enables the source data to reside in a single repository that data scientists and analysts can access quickly and use as reference whenever they need to present results.
The single source repository also enables machine learning to be run from various locations within a data center versus administrators having to physically carry or port the ML model to whatever location the analysis is being conducted. This avoids duplicate and varying versions of data, and makes sure that the analytical teams, from multiple organizations, are always working with the most recent and reliable data.
Before any machine learning model is run, the data itself must be accessible, requiring consolidation, cleansing and curation (where more qualitative data is added such as data sources, authorized users, project name, and time-stamp references). As a result of data curation, metadata is updated with the new tags.
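A curation step of this kind might look like the sketch below, which merges descriptive tags into a record's metadata. The record layout and the field names (`source`, `project`, `authorized_users`, `curated_at`) are assumptions made for illustration, mirroring the examples listed above.

```python
from datetime import datetime, timezone


def curate(record, source, project, authorized_users, now=None):
    """Return a copy of the record with curation tags added to its metadata."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    metadata = dict(record.get("metadata", {}))  # copy, so the input is untouched
    metadata.update(
        {
            "source": source,
            "project": project,
            "authorized_users": sorted(authorized_users),
            "curated_at": stamp,
        }
    )
    return {**record, "metadata": metadata}


raw = {"payload": [0.4, 0.7], "metadata": {"format": "csv"}}
curated = curate(raw, "sequencer-03", "trial-a", {"lzhou", "analyst1"})
print(curated["metadata"])
```

Returning a new record rather than editing the original keeps the uncurated data available for comparison.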
Since data can be captured from years or even decades past, it can reside on many forms of storage media ranging from hard drives to memory sticks to hard copies in shoe boxes. In many cases, it resides on tape that deteriorates over time, can be difficult to find and may require obsolete readers to extract the data. Analyzing big data in the modern world requires that it be captured and stored on reliable media, not only for immediate access, but to validate that it is of the highest integrity and accuracy possible. As such, enterprise SSDs and HDDs are used extensively to consolidate and store data for machine learning applications.
Cleansing is equally important as it removes irrelevant and redundant data during the pre-analysis stage. Doing this will not only save compute power, and associated time and costs, but will significantly increase the accuracy and comprehensibility of the ML model itself. Feature selection is a process used to cleanse unnecessary data by selecting attributes (or features) that are the most relevant in creating a predictive model. Feature extraction (Figure 2) is an alternate process that extracts existing features (and their associated data transformations) into new formats that not only describe variances within the data, but reduce the amount of information that is required to represent the ML model.
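The two cleansing ideas above, removing redundant records and keeping only informative features, can be sketched as follows. The selection rule used here is a simple variance threshold (a column that never changes cannot help a predictive model). That rule is one common choice assumed for illustration, and the function names are hypothetical.

```python
from statistics import pvariance


def drop_duplicate_rows(rows):
    """Remove redundant rows while preserving their original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def select_features(rows, names, min_variance=0.0):
    """Keep only the columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows


rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [1.0, 5.0, 0.0],  # exact duplicate of the first row
    [3.0, 5.0, 0.0],
]
rows = drop_duplicate_rows(rows)
names, rows = select_features(rows, ["age", "unit", "flag"])
print(names, rows)
```

Here the constant `unit` column is dropped and the duplicate record is removed, so later steps process less data without losing information.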
Once the data is cleansed, it can be aggregated with other cleansed data. From a data scientist’s perspective, this is heaven since massive quantities of stored data are needed to successfully run and train analytical models. Storing data in today’s data-centric world is no longer about just recovering datasets, but rather preserving them and being able to access them easily using search and index techniques. As such, data curating is part of the cleansing process but worth a separate callout as it requires reference marks as to where the data originated, as well as other forms of identification that differentiate it from other data, so that the information is reliable and trusted.
The Value Is In the Metadata
For data scientists and analysts who strive to obtain good outcomes from big data and improve their results over time, it is really about the metadata. Metadata extraction and the discovered correlations between metadata insights are the foundation of ML models. Once a model is sufficiently trained, it can be put into production to deliver faster determinations. In a traditional file-based network-attached storage (NAS) architecture, directories are used to tag data and must be traversed each time the data needs to be accessed. Having so many directories to traverse in a hierarchical scheme makes it difficult to find files and access them quickly. But more importantly, the file-based approach has little to no information about the stored data that can help in analysis, simplify management, or even support the ever-increasing amounts of data at scale.
When a business or operation reaches scale, the IT department needs to look at new storage solutions that are affordable, can help keep data forever (for analysis and ML training) and, most importantly, scale easily. Object storage has made tremendous inroads and is an exceptional option for storing unstructured data at petabyte scale. Unlike file-based storage that manages data in a folder hierarchy, or block-based storage that manages disk sectors collectively as blocks, object storage manages data as objects.
In an object storage platform, the totality of the data, be it a document, audio or video file, image or photo, or other unstructured data, is stored as a single object. Metadata resides with the captured data and provides descriptive information about the object and the data itself. This eliminates the need for a hierarchical structure and simplifies access by placing everything in a flat address space (or single namespace). The unique identifier assigned to each object makes it easier to index and retrieve data, or find a specific object.
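The model described above can be mimicked with a toy in-memory store: a flat dictionary keyed by unique identifier, with metadata held alongside each object. This is only a sketch of the concept. The `ObjectStore` class and its `put`, `get` and `find` methods are hypothetical and do not represent the API of any real object storage product.

```python
import uuid


class ObjectStore:
    """Toy flat-namespace store: object ID -> (data, metadata)."""

    def __init__(self):
        self._objects = {}  # no directories, just one flat mapping

    def put(self, data: bytes, **metadata) -> str:
        """Store the data with its metadata and return a unique ID."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id: str):
        """Retrieve (data, metadata) directly by ID, with no path traversal."""
        return self._objects[object_id]

    def find(self, **tags):
        """Return the IDs of all objects whose metadata matches every tag."""
        return [
            oid
            for oid, (_, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in tags.items())
        ]


store = ObjectStore()
scan_id = store.put(b"...image bytes...", project="genomics", kind="image")
note_id = store.put(b"lab notes", project="genomics", kind="text")
print(store.find(project="genomics"))
print(store.find(kind="image") == [scan_id])
```

Lookups go straight to the identifier or filter on metadata, in contrast to walking a directory tree to locate a file.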
Since metadata resides with captured data, users can tag as many data points as they want, and tag and find groups of objects much faster than with file- or block-based storage options. Object storage also enables versioning, a very important feature for ML pipelines because of the repetitiveness in refining algorithms. Leveraging this feature of object storage, data scientists can version their data so that they or their collaborators can reproduce the results later. Versioning helps to shorten research time, obtain desired results faster, enable reproducible machine learning pipelines and validate data reliability. And since many users pay for storage per petabyte, and one person can manage more petabytes when data is grouped as objects, the result is a lower total cost of ownership (TCO), especially in terms of manpower and power consumption.
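Versioning for reproducibility can be sketched as a store that never overwrites: each write under the same key appends a new version, and any earlier version can be fetched again by its ID. Deriving the version ID from a SHA-256 hash of the content is an assumption made for this illustration (real systems typically assign their own IDs), and the `VersionedStore` class is hypothetical.

```python
import hashlib


class VersionedStore:
    """Toy versioned store: key -> list of (version_id, data), oldest first."""

    def __init__(self):
        self._versions = {}

    def put(self, key: str, data: bytes) -> str:
        """Append a new version instead of overwriting and return its ID."""
        version_id = hashlib.sha256(data).hexdigest()[:12]
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key: str, version_id=None) -> bytes:
        """Return the latest version, or a specific earlier one if requested."""
        history = self._versions[key]
        if version_id is None:
            return history[-1][1]
        for vid, data in history:
            if vid == version_id:
                return data
        raise KeyError(version_id)


store = VersionedStore()
v1 = store.put("train.csv", b"a,b\n1,2\n")
v2 = store.put("train.csv", b"a,b\n1,2\n3,4\n")
print(store.get("train.csv") == b"a,b\n1,2\n3,4\n")
print(store.get("train.csv", v1) == b"a,b\n1,2\n")
```

Recording the version ID next to each experiment's results lets a collaborator re-run the analysis later against exactly the same data.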
Object Storage for ML Pipelines
Machine learning gets better over time as more data points are collected and the true value occurs when different data assets from a variety of sources are correlated together. The act of correlating these new data formats streaming into the data center is quite a challenge as it’s not just about the sheer capacity of data, but more about the disparate data formats and the set of applications that need to access them. Businesses are now focusing on consolidating their assets into a single petabyte scale-out storage architecture. On-premises object storage or cloud storage systems serve a great purpose for these environments as they are designed to scale and support custom data formats.
With data scientists and analysts playing more prominent roles in mapping the statistical significance of key problems and translating it quickly for business implementation, they also strive to improve their results. They want to store everything locally, rather than in a public cloud, because their research is local and the time it takes to download an abundance of ML content can be extraordinary. And they want immediate access to improve their algorithm and re-run the analysis, repeating as necessary so that better comparisons can be made to the original results.
With GPUs residing next to the data on the compute side, results can be produced faster and the technology won’t be blocked from analytical processing, but rather, enabled! Every step in the ML process is cyclical and iterative as algorithms are being updated, analysis is being reprocessed, more data is being accumulated, and the end result is either improved or worsened. Once the computer learns, further tests can be taken to see if the results are accurate and whether the analysis needs to be re-run.
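The cyclical nature of the process can be shown with a deliberately small example: fit a single weight `w` so that `w * x` approximates `y`, re-running the update and re-checking the error each round, and stopping once a round no longer improves the result. Gradient descent on mean squared error is used here as an assumed stand-in for whatever algorithm a real pipeline would refine, and the function names are hypothetical.

```python
def mean_squared_error(w, xs, ys):
    """Average squared gap between the predictions w * x and the targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.01, tolerance=1e-9, max_rounds=10_000):
    """Repeat update-and-evaluate rounds until the error stops improving."""
    w = 0.0
    error = mean_squared_error(w, xs, ys)
    round_number = 0
    for round_number in range(1, max_rounds + 1):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        candidate = w - learning_rate * gradient
        new_error = mean_squared_error(candidate, xs, ys)
        if error - new_error < tolerance:
            break  # result no longer meaningfully improved: stop iterating
        w, error = candidate, new_error
    return w, error, round_number


w, error, rounds = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(w, 3), rounds)
```

Each round compares the new result against the previous one, so a change that worsens the outcome is rejected rather than kept.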
The amount of data businesses capture and store today is overwhelming. However, it’s not the volume of data being gathered that’s most important – but what businesses are doing with the data that really matters. Today’s businesses are starting to realize that big data is powerful, and significantly more valuable when paired with intelligent automation. Supported by massive computational power, machine learning is helping businesses manage, analyze and use their data far more effectively than ever before.
About the author: Linda Zhou is the Director of Research and Life Sciences Solutions for the Data Center Systems (DCS) business unit within Western Digital. She has in-depth knowledge of life sciences, machine learning, big data analytics, IT service management (ITSM) and compliance archiving. Prior to joining Western Digital, Ms. Zhou held business and technical positions at Silicon Graphics, Inc., EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space. She earned a Master’s degree in Business Administration from Carnegie Mellon University and a Bachelor’s degree in Computer Science and Engineering from Jinan University.