August 14, 2017

Training Your AI With As Little Manually Labeled Data As Possible

James Kobielus


Let’s imagine a future in which developers are creating and deploying artificial intelligence (AI) apps faster than people can consume them.

Please ignore the minor issue of whether the developers in question can make a decent living building stuff that almost nobody is using. For the purpose of this column, let’s consider another tricky issue that will arise again and again in the future’s AI-choked application environment: where will you find the people you’ll need to label the data you’ll require for supervised training of your apps’ deep learning, machine learning, and other statistical algorithms?

More often than not, you won’t be able to find the labelers you need when you need them, at any price, even if you have the budget. And it will only get worse as you develop more AI-fueled apps and hence require more training data and more frequent relabeling of the data you have. That’s why the working data scientists of the world know they will need to resort to drastic measures to sustain their development pipelines while ensuring that their AI apps don’t fail to achieve their core objectives the moment they’re deployed (as I said, let’s ignore the tangentially related issue of whether those apps will succeed in finding a monetizable use case).

When it comes to acquiring high-quality labeled training data, what drastic measures am I referring to? They go beyond the approaches I sketched out in this recent post. Everything I discussed there (repurposing labeled data, harvesting your own free sources, exploring pre-labeled public datasets, acquiring pretrained models, using crowdsourced labeling services, and so on) involves getting your hands on training assets that, at some point, were curated and annotated by living, breathing humans.

What I’m referring to is the most radical tactic of all: using AI to fabricate machine-labeled AI training data without the need for even a speck of human curation. In other words, I’m referring to the bold step of surrendering to our machine overlords the “last mile” in supervised learning: exercising the judgment needed to identify the real-world entity to which some data object—such as a video clip, audio segment, or text file—corresponds.

Like that nice, clean data set you’ve been working with? It was likely created by a person (ESB Professional/Shutterstock)

This may sound like a hall-of-mirrors trick among computer scientists with too much time on their hands. However, it’s a legitimate research focus at the Stanford DAWN program, which aims to empower everybody to build high-quality AI apps and which counts the creator of Apache Spark among its principals. And it’s not inconceivable that AI might automate more of the algorithmic-training process, considering the recent successes researchers have had in using AI to automate both the development of AI models and the declarative programming code needed to flesh out the AI-infused functionality that’s finding its way into practically every real-world app.

Under the DAWN initiative, two projects focus on accelerating the data-driven training of AI algorithms. One of them, MacroBase, enables AI-driven prioritization of human attention in the analysis, curation, and labeling of training data sourced from large-scale data sets and real-time streams. Leveraging these tools, human data labelers will be able to specify domain-specific rules that automate filtering of training data on the fly.
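To make the idea concrete, here is a minimal sketch, in plain Python, of rule-driven prioritization of labeling attention; the rules and record fields are hypothetical, and this is not MacroBase’s actual API.

    # Toy sketch of rule-driven prioritization of labeling effort.
    # The rules and record fields below are hypothetical; this is not MacroBase's API.

    KNOWN_DEVICES = {"dev-001", "dev-002"}

    def looks_anomalous(record):
        # Flag readings far outside the expected operating range.
        return abs(record["sensor_value"]) > 100.0

    def from_new_device(record):
        # Prioritize data from devices we have little labeled history for.
        return record["device_id"] not in KNOWN_DEVICES

    RULES = [looks_anomalous, from_new_device]

    def priority(record):
        # A record's priority is simply the number of rules it triggers.
        return sum(rule(record) for rule in RULES)

    def triage(stream):
        # Send only rule-flagged records to human labelers, highest priority first.
        flagged = [r for r in stream if priority(r) > 0]
        return sorted(flagged, key=priority, reverse=True)

    stream = [
        {"device_id": "dev-001", "sensor_value": 3.2},
        {"device_id": "dev-077", "sensor_value": 250.0},
        {"device_id": "dev-002", "sensor_value": 180.0},
    ]
    print(triage(stream))  # the two out-of-range readings, new-device reading first

The point is simply that a handful of declarative rules can decide which records deserve a human’s eyeballs, so labelers spend their time where it matters most.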

Another project, Snorkel, takes the radical step of using labeler-defined heuristics to automate the extraction and creation of high-quality labels directly from the data, in a process referred to as “weak supervision.” This approach, which its creators call “data programming,” involves developers writing scripts that consist of declarative rules for programmatically labeling data. These scripts, known as “labeling functions,” extract labels directly from the data itself. The resultant labels are “noisy,” in the sense that they are generally less accurate than what human labelers, reviewing the same data and using their own judgment, might have produced.
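Here is a minimal sketch of what labeling functions look like, written in plain Python rather than against the Snorkel library’s actual API; the spam-detection task and the rules are hypothetical.

    # Toy labeling functions for classifying emails as SPAM (1) or NOT_SPAM (0).
    # These hand-written rules are deliberately simple; the sketch illustrates
    # the data-programming idea rather than the Snorkel library's API.

    ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

    def lf_contains_prize(text):
        return SPAM if "you have won" in text.lower() else ABSTAIN

    def lf_known_sender(text):
        return NOT_SPAM if "@ourcompany.com" in text else ABSTAIN

    def lf_many_exclamations(text):
        return SPAM if text.count("!") >= 3 else ABSTAIN

    LABELING_FUNCTIONS = [lf_contains_prize, lf_known_sender, lf_many_exclamations]

    def label_matrix(documents):
        # Each document gets one (possibly abstaining) vote per labeling function.
        return [[lf(doc) for lf in LABELING_FUNCTIONS] for doc in documents]

    docs = [
        "You have won a prize!!! Claim now!!!",
        "Meeting notes from alice@ourcompany.com",
    ]
    print(label_matrix(docs))  # [[1, -1, 1], [-1, 0, -1]]

Each rule votes on each example or abstains, and the noisy votes become the raw material for the next step.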

As discussed in this research paper, these programmatically generated labels are processed in a specialized AI environment, known as a “generative adversarial network (GAN),” that assesses how they deviate statistically from some reference set of manually generated labels associated with the same or an equivalent solution domain. The GAN iteratively refines the rule-specified labeling functions so that, upon repeated runs, they begin to generate auto-programmed training data that is at least as accurate as what manual labeling might have produced.
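To give a crude sense of the denoising step, here is a toy accuracy-weighted vote over the kind of label matrix the labeling functions produce. The per-rule accuracy weights here are assumed rather than learned, so this is a stand-in for the far more sophisticated statistical modeling the researchers describe, not their method.

    # Crude stand-in for the denoising step: combine noisy votes from several
    # labeling functions into one probabilistic label per example. The accuracy
    # weights below are assumed rather than learned from the data.

    ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

    # One row per document, one (possibly abstaining) vote per labeling function.
    label_matrix = [
        [SPAM, ABSTAIN, SPAM],         # two rules fired, both say spam
        [ABSTAIN, NOT_SPAM, ABSTAIN],  # only the trusted-sender rule fired
    ]

    assumed_accuracy = [0.9, 0.8, 0.6]  # hypothetical reliability of each rule

    def probabilistic_label(votes):
        # Accuracy-weighted vote over the non-abstaining rules; returns P(spam).
        spam_weight = sum(a for v, a in zip(votes, assumed_accuracy) if v == SPAM)
        total_weight = sum(a for v, a in zip(votes, assumed_accuracy) if v != ABSTAIN)
        return 0.5 if total_weight == 0 else spam_weight / total_weight

    print([round(probabilistic_label(v), 2) for v in label_matrix])  # [1.0, 0.0]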

Essentially, this process prepares auto-programmed training data for the “weakly supervised” learning of AI models, such as those built in TensorFlow or another library. Another way of looking at it is that it uses GANs to simulate high-quality training data with such verisimilitude that it can be used to train models without requiring manual human judgment, apart from the initial specification of the labeling functions.
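To show what that hand-off can look like, here is a minimal sketch that trains an ordinary classifier on programmatically generated labels. Scikit-learn is my choice of library, and rounding the probabilistic labels to hard classes while using their confidence as a sample weight is one simple option among several.

    # Minimal sketch of feeding programmatically generated labels to a
    # downstream model: probabilistic labels are rounded to hard classes,
    # and their confidence becomes a sample weight so that uncertain
    # examples influence training less. All numbers are toy values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[6, 1], [5, 0], [0, 3], [1, 4]])  # toy feature vectors
    p_spam = np.array([0.95, 0.8, 0.1, 0.55])       # labels from the previous step

    y = (p_spam >= 0.5).astype(int)       # hard label: most likely class
    confidence = np.abs(p_spam - 0.5) * 2  # 0 = coin flip, 1 = certain

    model = LogisticRegression().fit(X, y, sample_weight=confidence)
    print(model.predict([[7, 0], [0, 5]]))  # expect [1, 0] on this toy data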

Monte Carlo simulations let data scientists conjure up data to work with (Wetzkaz Graphics/Shutterstock)

Simulated training data? If you’re under the illusion that data scientists always work from actual source data, prepare to be thunderstruck. In the traditional world of statistical analysis, a data-fabrication technique called “Monte Carlo simulation” is well established. It involves predicting statistical outcomes not from the actual values of input data (when those aren’t known) but from likely or “simulated” values of that data, based on their probability distributions. Statisticians repeatedly draw random samples from the simulated data in order to estimate approximate target outcomes.
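For readers who haven’t run one, here is a minimal Monte Carlo sketch in Python; the business scenario and every distribution parameter are hypothetical.

    # Minimal Monte Carlo sketch: estimate the distribution of an outcome from
    # assumed probability distributions of the inputs rather than observed values.
    # The scenario (daily revenue = visitors * conversion rate * order size) and
    # all distribution parameters are made up.
    import random

    def simulate_daily_revenue():
        visitors = random.gauss(10_000, 1_500)        # assumed traffic distribution
        conversion = random.betavariate(2, 50)        # assumed conversion-rate distribution
        order_size = random.lognormvariate(3.5, 0.4)  # assumed order-size distribution
        return visitors * conversion * order_size

    samples = sorted(simulate_daily_revenue() for _ in range(100_000))
    print("median revenue:", round(samples[len(samples) // 2]))
    print("95th percentile:", round(samples[int(len(samples) * 0.95)]))

No revenue was ever observed here; the entire outcome distribution is conjured from the assumed distributions of the inputs.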

This makes great sense. If we have the probability distributions for the labels associated with a particular training-data domain, why not simulate the data rather than go to the trouble and expense of gathering the real thing? One of the poorly kept secrets of the data science arena is how often simulation programs are used to generate training data when the proverbial “ground truth” reference data is lacking. This is especially true in robotics, autonomous vehicles, and other projects where the devices that this data will help train have yet to be fabricated. In such circumstances, the training data will need to come from valid simulations of the scenarios within which the hardware prototypes are being engineered to operate.
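As a toy illustration of that idea, the sketch below fabricates labeled training examples by drawing labels from an assumed class prior and features from assumed class-conditional distributions; every number in it is made up.

    # Toy sketch of simulating labeled training data when no real "ground truth"
    # exists yet: draw a label from an assumed class prior, then draw features
    # from assumed class-conditional distributions. All parameters are hypothetical.
    import random

    def simulate_example():
        # Assume 20% of future sensor readings will belong to the "obstacle" class.
        label = 1 if random.random() < 0.2 else 0
        if label == 1:
            distance = random.gauss(2.0, 0.5)     # obstacles tend to read as near
            reflectivity = random.gauss(0.8, 0.1)
        else:
            distance = random.gauss(15.0, 4.0)    # open road reads as far away
            reflectivity = random.gauss(0.3, 0.1)
        return [distance, reflectivity], label

    training_set = [simulate_example() for _ in range(50_000)]
    print(training_set[0])  # e.g. ([14.2, 0.31], 0)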

In many ways, end-to-end automation of the data science pipeline may come down to the configuration of ever more complex rule-driven patterns. That’s the power of abstraction layers in software development, which are the foundation of automated development and optimization of machine learning (ML) models and also of ML-driven automated code generation. All of this involves leveraging rule-driven heuristics (specified in scripts, domain-specific languages, programming templates, metamodels, flowchart models, and the like) to automate the pipeline for development, training, and deployment of data, models, code, and other essential AI assets. Already, abstraction layers have come to commercial and open-source AI toolkits to support visual, modular model development.

This trend suggests that rule-driven approaches for building AI applications may return with a vengeance. Rule-driven expert systems—exemplified by such languages as Lisp and Smalltalk—were the heart and soul of mainstream AI until machine learning, neural networks, and other statistical approaches gained the upper hand earlier in this decade. As the trend toward rule-driven AI auto-programming intensifies, data scientists’ productivity will skyrocket while the need for these developers to tinker with the underlying algorithmic models, training data, and declarative code will diminish.

But that doesn’t mean we’ll ever be able to entirely eliminate manual curation of training data, or manual inspection of auto-programmed models and code. The GANs that automate this new-era development pipeline will grind to a halt if they lack human-built reference examples associated with each of these AI artifacts.

About the author: James Kobielus is SiliconANGLE Wikibon’s lead analyst for Data Science, Deep Learning, and Application Development.

Related Items:

Automating Development and Optimization of Machine Learning Models

How Spark Illuminates Deep Learning

Scrutinizing the Inscrutability of Deep Learning
