How to Build a Better Machine Learning Pipeline
Machine learning (ML) pipelines consist of several steps to train a model, but the term ‘pipeline’ is misleading as it implies a one-way flow of data. Instead, machine learning pipelines are cyclical and iterative as every step is repeated to continuously improve the accuracy of the model and achieve a successful algorithm. To build better machine learning models, and get the most value from them, accessible, scalable and durable storage solutions are imperative, paving the way for on-premises object storage.
Machine Learning Is Burgeoning
Welcome to the era of digital transformation, where data has become a modern-day currency. Tremendous value and intelligence are being extracted from large captured datasets (big data), leading to actionable insights through today’s analytics. Data analytics is uncovering trends, patterns, associations, new connections and precise predictions that are helping businesses achieve better outcomes. It’s no longer just about storing data, but about capturing, preserving, accessing and transforming it to take advantage of its possibilities and the value it can deliver. The goal for ML is simple: make faster and more predictive decisions.
Many of today’s ML models are ‘trained’ neural networks capable of executing a specific task or providing insights that move from ‘what happened’ to ‘what will likely happen’ (predictive analysis). These models are complex and never truly complete. Instead, mathematical or computational procedures are applied repeatedly to the previous result, improving it each time to get closer approximations to ‘solving the problem.’ Data scientists want more captured data to provide the fuel to train the ML models.
Machine learning use globally is burgeoning and its respective market is expected to grow in revenue to $8.81 billion by 2022, at a 44.1 percent CAGR. Businesses are rethinking their data strategies to include machine learning capabilities, not only to increase competitiveness, but also to create infrastructures that help enable data to live forever.
Getting Familiar with ML Pipelines
A machine learning pipeline helps automate machine learning workflows. It works by running data through a sequence of steps in which the data is transformed and correlated in a model that can then be tested and evaluated to achieve an outcome, whether positive or negative.
There are generally two types of machine learning approaches (Figure 1). The first is supervised learning, the most common use of machine learning, where a model is built and labeled datasets are provided to solve a particular problem using classification algorithms. The second approach is unsupervised learning, where a model is built to discover structures within given datasets. The initial data captured is not necessarily labeled, so clustering algorithms are used to group the unlabeled data together.
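The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the function names, the one-dimensional sample data, the nearest-centroid classifier and the simple k-means loop (which assumes k is at least 2) are all assumptions chosen to keep the example short.

```python
from statistics import mean

def train_nearest_centroid(samples, labels):
    """Supervised: learn one centroid per label from labeled 1-D samples."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(values) for label, values in groups.items()}

def classify(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def kmeans_1d(samples, k=2, iterations=10):
    """Unsupervised: group unlabeled 1-D samples into k clusters (k >= 2)."""
    ordered = sorted(samples)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(centers[i] - x))
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical readings: the labels exist only in the supervised case.
readings = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = train_nearest_centroid(readings, labels)
prediction = classify(centroids, 7.0)
clusters = kmeans_1d(readings, k=2)
```

In the supervised half the labels tell the model what the groups mean, while the clustering half recovers the same two groups from the raw values alone and leaves their interpretation to the analyst.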
Challenges Associated with ML Pipelines
In creating machine learning pipelines, there are challenges that data scientists face, but the most prevalent ones fall into three categories: Data Quality, Data Reliability and Data Accessibility.
If the data is not accurate, complete, reliable or robust, there is no point in running machine learning models because the outcomes will be wrong. This places a very high priority on data reliability, because data scientists want as much quality data as possible to build and train their ML models. The more high-quality data they get, the more accurate their outcomes.
Data that will be used to run machine learning pipelines will be generated from a variety of sources. In order to determine the reliability of the data, collaboration amongst those who have data outcomes is required so that the data itself, its source of generation, and those who assessed the analysis are trusted and viable. As such, implementing a repository for the data outcomes that serves as a single source of truth is required. This enables the source data to reside in a single repository that data scientists and analysts can access quickly and use as reference whenever they need to present results.
The single source repository also enables machine learning to be run from various locations within a data center versus administrators having to physically carry or port the ML model to whatever location the analysis is being conducted. This avoids duplicate and varying versions of data, and makes sure that the analytical teams, from multiple organizations, are always working with the most recent and reliable data.
Before any machine learning model is run, the data itself must be accessible, requiring consolidation, cleansing and curation (where more qualitative data is added such as data sources, authorized users, project name, and time-stamp references). As a result of data curation, metadata is updated with the new tags.
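As a rough sketch of what curation can look like in practice, the snippet below adds the kinds of tags mentioned above to a record's metadata. The `curate` function, the record layout and the tag names are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone

def curate(record, source, authorized_users, project, timestamp=None):
    """Return a copy of record with curation tags added to its metadata."""
    stamp = timestamp or datetime.now(timezone.utc)
    tags = {
        "source": source,
        "authorized_users": sorted(authorized_users),
        "project": project,
        "curated_at": stamp.isoformat(),
    }
    # Merge the new tags over any existing metadata without mutating the input.
    return {**record, "metadata": {**record.get("metadata", {}), **tags}}

# Hypothetical record and tag values.
raw = {"path": "scan_001.tif", "metadata": {"format": "tiff"}}
curated = curate(
    raw,
    source="microscope-3",
    authorized_users={"bob", "alice"},
    project="demo",
    timestamp=datetime(2019, 7, 19, tzinfo=timezone.utc),
)
```

Returning a new record rather than editing the original keeps the raw capture untouched, which makes it easier to trace where each tag came from.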
Since data can be captured from years or even decades past, it can reside on many forms of storage media ranging from hard drives to memory sticks to hard copies in shoe boxes. In many cases, it resides on tape that deteriorates over time, can be difficult to find and may require obsolete readers to extract the data. To analyze big data in the modern world requires that it be captured and stored on reliable media, not only for immediate access, but to validate that it is of the highest integrity and accuracy possible. As such, enterprise SSDs and HDDs are used extensively to consolidate and store data for machine learning applications.
Cleansing is equally important as it removes irrelevant and redundant data during the pre-analysis stage. Doing this will not only save compute power, and associated time and costs, but will significantly increase the accuracy and comprehensibility of the ML model itself. Feature selection is a process used to cleanse unnecessary data by selecting attributes (or features) that are the most relevant in creating a predictive model. Feature extraction (Figure 2) is an alternate process that extracts existing features (and their associated data transformations) into new formats that not only describe variances within the data, but reduce the amount of information that is required to represent the ML model.
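A minimal sketch of these cleansing steps follows. It removes redundant rows and then applies a variance threshold, which is only one simple feature-selection heuristic among many; the function names, column names and sample values are assumptions made for the example.

```python
from statistics import pvariance

def drop_duplicate_rows(rows):
    """Remove redundant records while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def select_features(rows, names, min_variance=0.0):
    """Keep only the columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows

# Hypothetical dataset: one constant column and one duplicated row.
names = ["constant", "age", "score"]
rows = [
    [1.0, 5, 0.2],
    [1.0, 7, 0.9],
    [1.0, 5, 0.2],
    [1.0, 6, 0.4],
]

unique_rows = drop_duplicate_rows(rows)
kept_names, kept_rows = select_features(unique_rows, names)
```

A column that never changes carries no information a model can learn from, so dropping it shrinks the dataset without affecting the outcome.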
Once the data is cleansed, it can be aggregated with other cleansed data. From a data scientist’s perspective, this is heaven since massive quantities of stored data are needed to successfully run and train analytical models. Storing data in today’s data-centric world is no longer about just recovering datasets, but rather preserving them and being able to access them easily using search and index techniques. As such, data curating is part of the cleansing process but worth a separate callout as it requires reference marks as to where the data originated, as well as other forms of identification that differentiate it from other data, so that the information is reliable and trusted.
The Value Is In the Metadata
For data scientists and analysts who strive to obtain good outcomes from big data and improve their results over time, it is really about the metadata. Metadata extraction and the discovered correlations between metadata insights are the foundation of ML models. Once a model is sufficiently trained, it can be put into production to deliver faster determinations. In a traditional file-based network-attached storage (NAS) architecture, directories are used to tag data and must be traversed each time the data needs to be accessed. Traversing so many directories in a hierarchical scheme makes it difficult to find files and access them quickly. More importantly, the file-based approach holds little to no information about the stored data that could help in analysis, simplify management, or support ever-increasing amounts of data at scale.
Once a business or operation reaches scale, the IT department needs to look at new storage solutions that are affordable, can help keep data forever (for analysis and ML training) and, most importantly, scale easily. Object storage has made tremendous inroads here and is an exceptional option for storing unstructured data at petabyte scale. Unlike file-based storage that manages data in a folder hierarchy, or block-based storage that manages disk sectors collectively as blocks, object storage manages data as objects.
In an object storage platform, the totality of the data, be it a document, audio or video file, image or photo, or other unstructured data, is stored as a single object. Metadata resides with the captured data and provides descriptive information about the object and the data itself. This eliminates the need for a hierarchical structure and simplifies access by placing everything in a flat address space (or single namespace). The unique identifier assigned to each object makes it easier to index and retrieve data, or find a specific object.
Since metadata resides with the captured data, users can tag as many data points as they want, and can tag and find groups of objects much faster than with file- or block-based storage options. Object storage also enables versioning, a very important feature for ML pipelines because of the repetitiveness in refining algorithms. Using this feature, data scientists can version their data so that they or their collaborators can reproduce the results later. Versioning helps to shorten research time, obtain desired results faster, enable reproducible machine learning pipelines and validate data reliability. And since many users pay for storage per petabyte, grouping data as objects lets one person manage more petabytes, resulting in lower total cost of ownership (TCO), especially relating to manpower and power consumption.
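The ideas of a flat namespace, metadata stored with the object, and versioning can be illustrated with a toy in-memory store. This is a sketch only: `ToyObjectStore`, its methods and the sample keys are hypothetical and do not represent the API of any real object storage product.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ObjectVersion:
    data: bytes
    metadata: dict
    checksum: str

@dataclass
class ToyObjectStore:
    """In-memory stand-in for an object store: flat keys, metadata, versions."""
    _objects: dict = field(default_factory=dict)  # key -> list of ObjectVersion

    def put(self, key, data, **metadata):
        """Store a new version under key and return its 1-based version number."""
        version = ObjectVersion(data, dict(metadata), hashlib.sha256(data).hexdigest())
        self._objects.setdefault(key, []).append(version)
        return len(self._objects[key])

    def get(self, key, version=None):
        """Return the latest version, or a specific one if version is given."""
        versions = self._objects[key]
        return versions[-1] if version is None else versions[version - 1]

    def find(self, **tags):
        """Return keys whose latest metadata matches every given tag."""
        return sorted(
            key for key, versions in self._objects.items()
            if all(versions[-1].metadata.get(k) == v for k, v in tags.items())
        )

# Hypothetical usage: two versions of a training set plus an unrelated object.
store = ToyObjectStore()
v1 = store.put("train/samples.csv", b"a,b\n1,2\n", project="demo", source="lab-A")
v2 = store.put("train/samples.csv", b"a,b\n1,2\n3,4\n", project="demo", source="lab-A")
store.put("notes.txt", b"hi", project="other")

latest = store.get("train/samples.csv")
original = store.get("train/samples.csv", version=1)
demo_keys = store.find(project="demo")
```

Although the key `train/samples.csv` contains a slash, the store treats it as an opaque identifier, so nothing is traversed. Keeping every version lets a collaborator re-run an experiment against exactly the data that produced an earlier result.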
Object Storage for ML Pipelines
Machine learning gets better over time as more data points are collected and the true value occurs when different data assets from a variety of sources are correlated together. The act of correlating these new data formats streaming into the data center is quite a challenge as it’s not just about the sheer capacity of data, but more about the disparate data formats and the set of applications that need to access them. Businesses are now focusing on consolidating their assets into a single petabyte scale-out storage architecture. On-premises object storage or cloud storage systems serve a great purpose for these environments as they are designed to scale and support custom data formats.
As data scientists and analysts play more prominent roles in mapping the statistical significance of key problems and translating it quickly for business implementation, they also strive to improve their results. They want to store everything locally, because their research is local and not in a public cloud, and the time it takes to download an abundance of ML content can be extraordinary. And they want immediate access to improve their algorithm and re-run the analysis, repeating as necessary so that better comparisons can be made to the original results.
With GPUs residing next to the data on the compute side, results can be produced faster and the technology won’t be blocked from analytical processing, but rather, enabled! Every step in the ML process is cyclical and iterative as algorithms are being updated, analysis is being reprocessed, more data is being accumulated, and the end result is either improved or worsened. Once the computer learns, further tests can be taken to see if the results are accurate and whether the analysis needs to be re-run.
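The cyclical nature of training can be seen in even the smallest model. The sketch below fits a single weight by gradient descent and re-evaluates the error after every round, stopping once the result no longer improves. The model, learning rate, tolerance and data are assumptions picked to keep the example short, not a recommended configuration.

```python
def mean_squared_error(w, xs, ys):
    """Average squared gap between predictions w * x and the targets."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_iteratively(xs, ys, learning_rate=0.01, max_rounds=1000, tolerance=1e-9):
    """Fit y = w * x by gradient descent, re-evaluating after every round."""
    w = 0.0
    history = [mean_squared_error(w, xs, ys)]
    for _ in range(max_rounds):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
        history.append(mean_squared_error(w, xs, ys))
        if history[-2] - history[-1] < tolerance:
            break  # the result stopped improving (or got worse)
    return w, history

# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
weight, history = train_iteratively(xs, ys)
```

Each pass reuses the previous result as its starting point, and the `history` list records whether a round improved or worsened the outcome, which is the comparison the text describes.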
The amount of data businesses capture and store today is overwhelming. However, it’s not the volume of data being gathered that’s most important – but what businesses are doing with the data that really matters. Today’s businesses are starting to realize that big data is powerful, and significantly more valuable when paired with intelligent automation. Supported by massive computational power, machine learning is helping businesses manage, analyze and use their data far more effectively than ever before.
About the author: Linda Zhou is the Director of Research and Life Sciences Solutions for the Data Center Systems (DCS) business unit within Western Digital. She has in-depth knowledge of life sciences, machine learning, big data analytics, IT service management (ITSM) and compliance archiving. Prior to joining Western Digital, Ms. Zhou held business and technical positions at Silicon Graphics, Inc., EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space. She earned a Master’s degree in Business Administration from Carnegie Mellon University and a Bachelor’s degree in Computer Science and Engineering from Jinan University.