How to Build a Better Machine Learning Pipeline
Machine learning (ML) pipelines consist of several steps to train a model, but the term ‘pipeline’ is misleading as it implies a one-way flow of data. Instead, machine learning pipelines are cyclical and iterative as every step is repeated to continuously improve the accuracy of the model and achieve a successful algorithm. To build better machine learning models, and get the most value from them, accessible, scalable and durable storage solutions are imperative, paving the way for on-premises object storage.
Machine Learning Is Burgeoning
Welcome to the era of digital transformation, where data has become a modern-day currency. Tremendous value and intelligence are being extracted from large, captured datasets (big data), leading to actionable insights through today’s analytics. Data analytics is uncovering trends, patterns, associations and new connections, and producing precise predictions, all of which are helping businesses achieve better outcomes. It’s no longer just about storing data, but about capturing, preserving, accessing and transforming it to take advantage of its possibilities and the value it can deliver. The goal for ML is simple: make faster and more predictive decisions.
Many of today’s ML models are ‘trained’ neural networks capable of executing a specific task or providing insights that move from ‘what happened’ to ‘what will likely happen’ (predictive analysis). These models are complex and are never truly finished. Instead, mathematical or computational procedures are repeated, with each pass applied to the previous result and improving on it, to get closer approximations to ‘solving the problem.’ Data scientists want more captured data as the fuel to train their ML models.
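To make the idea of repeated refinement concrete, here is a minimal sketch in Python. It is not taken from any particular product or from the models discussed here. It assumes a made-up one-parameter model (y = w * x), made-up data, and plain gradient descent as the repeated procedure. Each pass starts from the previous result and moves a little closer to the answer.

```python
# Toy data, invented for illustration: y is exactly 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def mean_squared_error(w):
    """Average squared gap between the model's predictions (w * x) and y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(steps=50, learning_rate=0.01):
    """Refine w step by step; each step starts from the previous result."""
    w = 0.0
    history = []
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
        history.append(mean_squared_error(w))
    return w, history


w, history = train()
print(f"learned w = {w:.3f}, final error = {history[-1]:.6f}")
```

The error recorded in history shrinks on every pass, which is the "closer approximation" behavior described above. Real neural networks have far more parameters, but the loop has the same shape.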
Machine learning use globally is burgeoning and its respective market is expected to grow in revenue to $8.81 billion by 2022, at a 44.1 percent CAGR. Businesses are rethinking their data strategies to include machine learning capabilities, not only to increase competitiveness, but also to create infrastructures that help enable data to live forever.
Getting Familiar with ML Pipelines
A machine learning pipeline helps automate machine learning workflows. It operates by passing data through a sequence of steps in which it is transformed and correlated into a model that can be tested and evaluated to achieve an outcome, whether positive or negative.
There are generally two types of machine learning approaches (Figure 1). The first is supervised learning, where a model is built and datasets are provided to solve a particular problem using classification algorithms, and is the most common use of machine learning. The second approach is unsupervised learning, where a model is built to discover structures within given datasets. The initial data captured is not necessarily labeled so clustering algorithms are used to group the unlabeled data together.
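The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points and labels are invented, the supervised side uses a simple nearest-centroid classifier, and the unsupervised side uses a bare-bones k-means with a naive initialization (the first k points). Neither is claimed to be the method behind Figure 1.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest(point, centroids):
    """Return the key of the centroid closest to point."""
    return min(centroids, key=lambda k: math.dist(point, centroids[k]))


def fit_nearest_centroid(points, labels):
    """Supervised: labels are given, so learn one centroid per label."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    return {label: centroid(members) for label, members in groups.items()}


def k_means(points, k, iterations=10):
    """Unsupervised: no labels, so discover k groups from the data alone."""
    centroids = {i: points[i] for i in range(k)}  # naive init: first k points
    for _ in range(iterations):
        groups = {i: [] for i in range(k)}
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = {i: centroid(g) if g else centroids[i]
                     for i, g in groups.items()}
    return [nearest(p, centroids) for p in points]


points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9)]
labels = ["small", "large", "small", "large"]  # used only by the supervised model

model = fit_nearest_centroid(points, labels)
prediction = nearest((7.5, 8.1), model)   # classify a new, unseen point
clusters = k_means(points, k=2)           # group the same points without labels
print(prediction, clusters)
```

The supervised model returns one of the label names it was given, while the clustering step can only return anonymous group numbers. Attaching meaning to those groups is left to the analyst.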
Challenges Associated with ML Pipelines
In creating machine learning pipelines, there are challenges that data scientists face, but the most prevalent ones fall into three categories: Data Quality, Data Reliability and Data Accessibility.
If the data is not accurate, complete, reliable and robust, there is no point in running machine learning models because the outcomes will be wrong. This places a very high priority on data reliability because data scientists want as much quality data as possible to build and train their ML models. The more high-quality data they get, the more accurate their outcomes.
Data that will be used to run machine learning pipelines will be generated from a variety of sources. In order to determine the reliability of the data, collaboration amongst those who have data outcomes is required so that the data itself, its source of generation, and those who assessed the analysis are trusted and viable. As such, implementing a repository for the data outcomes that serves as a single source of truth is required. This enables the source data to reside in a single repository that data scientists and analysts can access quickly and use as reference whenever they need to present results.
The single source repository also enables machine learning to be run from various locations within a data center versus administrators having to physically carry or port the ML model to whatever location the analysis is being conducted. This avoids duplicate and varying versions of data, and makes sure that the analytical teams, from multiple organizations, are always working with the most recent and reliable data.
Before any machine learning model is run, the data itself must be accessible, requiring consolidation, cleansing and curation (where more qualitative data is added such as data sources, authorized users, project name, and time-stamp references). As a result of data curation, metadata is updated with the new tags.
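As a rough sketch of what curation might look like in code, the Python below merges the kinds of tags mentioned above (data source, authorized users, project name, time stamp) into a record's metadata. The record layout, the field names and the curate function are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def curate(record, source, authorized_users, project):
    """Return a copy of record with curation tags merged into its metadata."""
    tags = {
        "source": source,
        "authorized_users": sorted(authorized_users),
        "project": project,
        "curated_at": datetime.now(timezone.utc).isoformat(),
    }
    # Build a new dict so the original record is left untouched.
    return {**record, "metadata": {**record.get("metadata", {}), **tags}}


# Hypothetical raw record with some pre-existing metadata.
raw = {"payload": [0.12, 0.15, 0.11], "metadata": {"format": "csv"}}

curated = curate(
    raw,
    source="lab-sequencer-7",
    authorized_users={"jsmith", "akumar"},
    project="genome-pilot",
)
print(curated["metadata"])
```

Existing metadata (here, the format) is preserved and the new tags are added alongside it, so later searches can filter on either.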
Since data can be captured from years or even decades past, it can reside on many forms of storage media ranging from hard drives to memory sticks to hard copies in shoe boxes. In many cases, it resides on tape that deteriorates over time, can be difficult to find and may require obsolete readers to extract the data. Analyzing big data in the modern world requires that it be captured and stored on reliable media, not only for immediate access, but to validate that it is of the highest integrity and accuracy possible. As such, enterprise SSDs and HDDs are used extensively to consolidate and store data for machine learning applications.
Cleansing is equally important as it removes irrelevant and redundant data during the pre-analysis stage. Doing this will not only save compute power, and associated time and costs, but will significantly increase the accuracy and comprehensibility of the ML model itself. Feature selection is a process used to cleanse unnecessary data by selecting attributes (or features) that are the most relevant in creating a predictive model. Feature extraction (Figure 2) is an alternate process that extracts existing features (and their associated data transformations) into new formats that not only describe variances within the data, but reduce the amount of information that is required to represent the ML model.
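A small Python sketch can illustrate two of the cleansing steps described above: dropping redundant (duplicate) rows and selecting features. The feature-selection rule used here, discarding any feature whose variance is zero, is one simple filter picked for illustration and is not necessarily the method shown in Figure 2. The rows and column names are invented.

```python
from statistics import pvariance


def drop_duplicates(rows):
    """Remove rows that repeat an earlier row exactly, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def select_features(rows, min_variance=0.0):
    """Keep only the features whose variance exceeds min_variance."""
    features = list(rows[0].keys())
    keep = [f for f in features
            if pvariance([r[f] for r in rows]) > min_variance]
    return [{f: r[f] for f in keep} for r in rows]


# Invented sensor readings: "unit_id" never changes, and one row is repeated.
rows = [
    {"temp": 20.0, "load": 0.3, "unit_id": 1.0},
    {"temp": 22.0, "load": 0.5, "unit_id": 1.0},
    {"temp": 22.0, "load": 0.5, "unit_id": 1.0},
    {"temp": 25.0, "load": 0.9, "unit_id": 1.0},
]

unique_rows = drop_duplicates(rows)
cleaned = select_features(unique_rows)
print(cleaned)
```

A constant column carries no information that could separate one row from another, so removing it shrinks the dataset without costing the model anything. Feature extraction, by contrast, would compute new columns from the existing ones and is not shown here.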
Once the data is cleansed, it can be aggregated with other cleansed data. From a data scientist’s perspective, this is heaven since massive quantities of stored data are needed to successfully run and train analytical models. Storing data in today’s data-centric world is no longer about just recovering datasets, but rather preserving them and being able to access them easily using search and index techniques. As such, data curating is part of the cleansing process but worth a separate callout as it requires reference marks as to where the data originated, as well as other forms of identification that differentiate it from other data, so that the information is reliable and trusted.
The Value Is In the Metadata
For data scientists and analysts who strive to obtain good outcomes from big data and improve their results over time, it is really about the metadata. Metadata extraction and the discovered correlations between metadata insights are the foundation of ML models. Once a model is sufficiently trained, it can be put into production to deliver faster determinations. In a traditional file-based network-attached storage (NAS) architecture, directories are used to tag data and must be traversed each time the data needs to be accessed. Having so many directories to traverse in a hierarchical scheme makes it difficult to find files and access them quickly. But more importantly, the file-based approach has little to no information about the stored data that can help in analysis, simplify management, or even support the ever-increasing amounts of data at scale.
Once a business or operation reaches scale, the IT department needs to look at new storage solutions that are affordable, can help keep data forever (for analysis and ML training) and, most importantly, scale easily. Object storage has made tremendous inroads here and is an exceptional option for storing unstructured data at petabyte scale. Unlike file-based storage that manages data in a folder hierarchy, or block-based storage that manages disk sectors collectively as blocks, object storage manages data as objects.
In an object storage platform, the totality of the data, be it a document, audio or video file, image or photo, or other unstructured data, is stored as a single object. Metadata resides with the captured data and provides descriptive information about the object and the data itself. This eliminates the need for a hierarchical structure and simplifies access by placing everything in a flat address space (or single namespace). The unique identifier assigned to each object makes it easier to index and retrieve data, or find a specific object.
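The flat address space can be pictured with a toy, in-memory model in Python. This is a sketch for illustration only: the ObjectStore class and its put, get and find methods are hypothetical and do not correspond to any real object storage product or API.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: one flat namespace, metadata kept with data."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata); no directories

    def put(self, data, **metadata):
        """Store data with its metadata and return a unique object ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Fetch (data, metadata) directly by ID, with no path traversal."""
        return self._objects[object_id]

    def find(self, **criteria):
        """Return the IDs of all objects whose metadata matches every criterion."""
        return [
            oid for oid, (_, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        ]


store = ObjectStore()
scan_id = store.put(b"...image bytes...", kind="image", project="genome-pilot")
note_id = store.put(b"...text bytes...", kind="document", project="genome-pilot")

images = store.find(kind="image")
project_objects = store.find(project="genome-pilot")
data, meta = store.get(scan_id)
print(images, meta)
```

Retrieval takes a single lookup by identifier, and searches run against the metadata rather than against a directory tree.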
Since metadata resides with captured data, users can tag as many data points as they want, and tag and find groups of objects much faster than with file- or block-based storage options. Object storage also enables versioning, a very important feature for ML pipelines because algorithms are refined repeatedly. Using this feature, data scientists can version their data so that they or their collaborators can reproduce the results later. Versioning helps to shorten research time, obtain desired results faster, enable reproducible machine learning pipelines and validate data reliability. And since many users pay for storage per petabyte, and one person can manage more petabytes when data is grouped as objects, the result is a lower total cost of ownership (TCO), especially relating to manpower and power consumption.
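How versioning supports reproducibility can be sketched with another toy Python model. The VersionedStore class below is hypothetical and kept in memory. Real object storage systems expose versioning through their own interfaces, which are not reproduced here.

```python
class VersionedStore:
    """Toy versioned store: every write to a key is kept, never overwritten."""

    def __init__(self):
        self._versions = {}  # key -> list of payloads, oldest first

    def put(self, key, data):
        """Append a new version and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)
        return len(history)

    def get(self, key, version=None):
        """Return the latest version, or a specific one if requested."""
        history = self._versions[key]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{key!r} has no version {version}")
        return history[version - 1]


store = VersionedStore()
v1 = store.put("training-set", [1.0, 2.0, 3.0])
v2 = store.put("training-set", [1.0, 2.0, 3.0, 4.0])  # more data collected later

latest = store.get("training-set")
original = store.get("training-set", version=v1)
print(v1, v2, latest, original)
```

A collaborator who records the version number used for an experiment can fetch exactly the same data later, even after newer versions have been written.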
Object Storage for ML Pipelines
Machine learning gets better over time as more data points are collected and the true value occurs when different data assets from a variety of sources are correlated together. The act of correlating these new data formats streaming into the data center is quite a challenge as it’s not just about the sheer capacity of data, but more about the disparate data formats and the set of applications that need to access them. Businesses are now focusing on consolidating their assets into a single petabyte scale-out storage architecture. On-premises object storage or cloud storage systems serve a great purpose for these environments as they are designed to scale and support custom data formats.
With data scientists and analysts playing more prominent roles in mapping the statistical significance of key problems and translating it quickly for business implementation, they also strive to improve their results. They want to store everything locally, rather than in a public cloud, because their research is local and the time it takes to download an abundance of ML content can be extraordinary. And they want immediate access to improve their algorithm and re-run the analysis, repeating as necessary so that better comparisons can be made to the original results.
With GPUs residing next to the data on the compute side, results can be produced faster and the technology won’t be blocked from analytical processing, but rather, enabled! Every step in the ML process is cyclical and iterative as algorithms are being updated, analysis is being reprocessed, more data is being accumulated, and the end result is either improved or worsened. Once the computer learns, further tests can be taken to see if the results are accurate and whether the analysis needs to be re-run.
The amount of data businesses capture and store today is overwhelming. However, it’s not the volume of data being gathered that’s most important – but what businesses are doing with the data that really matters. Today’s businesses are starting to realize that big data is powerful, and significantly more valuable when paired with intelligent automation. Supported by massive computational power, machine learning is helping businesses manage, analyze and use their data far more effectively than ever before.
About the author: Linda Zhou is the Director of Research and Life Sciences Solutions for the Data Center Systems (DCS) business unit within Western Digital. She has in-depth knowledge of life sciences, machine learning, big data analytics, IT service management (ITSM) and compliance archiving. Prior to joining Western Digital, Ms. Zhou held business and technical positions at Silicon Graphics, Inc., EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space. She earned a Master’s degree in Business Administration from Carnegie Mellon University and a Bachelor’s degree in Computer Science and Engineering from Jinan University.