May 14, 2021

LLNL Researcher’s Conference Papers Highlight Importance of Data Security to Machine Learning

May 14, 2021 — The 2021 Conference on Computer Vision and Pattern Recognition, a premier conference of its kind, will feature two papers co-authored by a Lawrence Livermore National Laboratory (LLNL) researcher targeted at improving the understanding of robust machine learning models.

Both papers include contributions from LLNL computer scientist Bhavya Kailkhura and examine the importance of data in building models, part of a Lab effort to develop foolproof artificial intelligence and machine learning systems. The first paper focuses on “poisoning” attacks on data that malicious hackers or adversaries might use to trick artificial intelligence into making mistakes, such as manipulating facial recognition systems to commit fraud or causing autonomous drones to crash.

The team, which includes co-authors from Tulane University and IBM Research, highlighted a previously unknown vulnerability of robust machine learning models to imperceptible data poisoning that severely weakens their certified robustness guarantees. The researchers concluded that proper data curation is a crucial factor in creating highly robust models.

“Data should be considered a first-class citizen in a machine learning workflow,” Kailkhura said. “In the past, the sole focus of the robustness community has been on coming up with better training algorithms and models. Here we are suggesting that we have missed the most important piece in the robust machine learning puzzle, and that is the training data quality. In this paper, for the first time we are showing that an adversary can actually devise an extremely hard-to-detect attack on your training data that can fool even state-of-the-art robust models. If an adversary can do this, any model-related advancements we have made in the last several years that guarantee better robustness would be useless.”

Unlike other types of poisoning attacks that reduce the accuracy of the models on a small set of target points, the team was able to make undetectable but devastating distortions in training data and reduce the average certified accuracy (ACA) of an entire dataset target class to zero in several cases. They showed the approach is effective even when the victim trains the models using state-of-the-art methods shown to improve robustness, such as randomized smoothing.
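To make "certified" robustness concrete, the following is a minimal sketch of randomized smoothing in its commonly described form: classify many Gaussian-noised copies of an input, take the majority vote, and turn the vote share into an L2 radius within which the prediction should not change. The toy `base_classifier`, the function name `smoothed_predict` and all parameter values are assumptions made for illustration, not the setup used in the paper. The sketch also uses the raw vote share rather than a statistical lower confidence bound, so its radius is an estimate and not a formal certificate.

```python
import numpy as np
from statistics import NormalDist


def base_classifier(x):
    """Hypothetical toy classifier: class 1 if the first coordinate is positive."""
    return int(x[0] > 0)


def smoothed_predict(x, sigma=0.5, n_samples=10_000, seed=0):
    """Majority vote over Gaussian-noised copies of x, plus an estimated L2 radius."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    votes = np.array([base_classifier(x + eps) for eps in noise])
    counts = np.bincount(votes, minlength=2)
    top = int(np.argmax(counts))
    p_top = counts[top] / n_samples  # empirical share of the winning class
    if p_top <= 0.5:
        return None, 0.0  # no clear majority: abstain
    p_top = min(p_top, 1.0 - 1e-9)  # keep the inverse CDF finite
    # Simplified radius: sigma times the standard normal quantile of the vote share.
    radius = sigma * NormalDist().inv_cdf(p_top)
    return top, radius


label, radius = smoothed_predict(np.array([1.0, 0.0]))
print(label, round(float(radius), 2))
```

For this linear toy classifier the estimated radius should land close to 1.0, the true distance from the input to the decision boundary. Poisoned training data that shifts the learned boundary closer to a class's inputs would shrink radii like this one, which is the kind of guarantee the attack described above undermines.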


Funded under Kailkhura’s Laboratory Directed Research and Development (LDRD) project to design foolproof machine learning systems, the work impacts national security and mission-critical applications where researchers need to be certain that models will make correct predictions, Kailkhura said. By pointing out a previously unrecognized threat, Kailkhura said he hopes the paper exposes flaws in supposedly robust models and will leverage the power of the machine learning community to build models that can detect and withstand such training-time attacks.

“It is going to be extremely difficult to solve the problem by a single team and come up with a model that is robust to these attacks we’ve identified,” Kailkhura said. “This is where the power of collaboration and open science comes into the picture. We are asking for help from the community by publishing these results, and we are seeking ideas and thinking about models that cannot be fooled. What we want to come up with is not yet another robustness heuristic but a foolproof system, so even if an adversary knows about the system, there is nothing they can do.”

Kailkhura said that because the changes are so minute that a human user cannot notice them, it may take an algorithm to tell the difference between clean and poisoned data. That kind of algorithm is one long-term objective of the research described in the second paper, in which Kailkhura and co-authors from Virginia Tech, the University of Illinois Urbana-Champaign, ETH Zurich and other institutions examined approaches to evaluating data importance.

“Understanding the value of a subset of training examples is a fundamental problem in machine learning that could have profound impact on a range of applications including data valuation, interpretability, data acquisition, etc.,” Kailkhura said. “For example, data labeling is very expensive, so the question is can we make the process better by only showing a scientist the samples that the machine learning model thinks are important? If you want to apply this machine learning model on experiments, how do you choose the samples you believe are most representative of the true phenomena captured by these experiments? If one can do data importance evaluation correctly, all of these problems can be solved.”

Since training massive numbers of models isn’t feasible, computer scientists have turned to mathematical techniques that can tell researchers which data points are most important to collect or analyze, Kailkhura explained. A principled approach is based on “Shapley values,” a concept from game theory that calculates the average marginal contribution from a feature to the model’s prediction, considering all possible combinations.
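As a rough illustration of "average marginal contribution over all possible combinations," the sketch below computes exact Shapley values for three training points. It assumes the data-valuation setting of the paper, where the "players" are training examples rather than features, and it scores each subset by the validation accuracy of a 1-nearest-neighbor rule. The data, the `utility` function and the name `exact_shapley` are all hypothetical and invented for this example.

```python
from itertools import combinations
from math import comb

# Hypothetical toy data: (feature, label) pairs, invented for illustration.
TRAIN = [(0.0, 0), (1.0, 1), (0.9, 0)]
VALID = [(0.1, 0), (1.1, 1)]


def utility(subset):
    """Validation accuracy of a 1-nearest-neighbor rule trained on `subset`
    (a tuple of indices into TRAIN). The empty subset scores 0."""
    if not subset:
        return 0.0
    correct = 0
    for x_val, y_val in VALID:
        nearest = min(subset, key=lambda i: abs(TRAIN[i][0] - x_val))
        correct += TRAIN[nearest][1] == y_val
    return correct / len(VALID)


def exact_shapley(n, utility):
    """Weighted average of each point's marginal gain over every subset of the others."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for a coalition of this size: 1 / (n * C(n-1, size)).
            weight = 1.0 / (n * comb(n - 1, size))
            for coalition in combinations(others, size):
                gain = utility(coalition + (i,)) - utility(coalition)
                values[i] += weight * gain
    return values


values = exact_shapley(len(TRAIN), utility)
print(values)
```

The values add up to the utility of the full training set, a standard property of the Shapley value. The loop visits every subset of the other points, so the cost doubles with each added training example, which is why exact computation does not scale.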

The downside of the Shapley value approach is that it is extremely computationally expensive and cannot be applied to large models or datasets, such as those used at LLNL, Kailkhura said. The solution is approximation — instead of quantifying the value of data in the original model, researchers use a simplified model so the Shapley value can be calculated more efficiently.

In the paper, the team compared common approximation methods (or heuristics), finding that the approximated Shapley value approach could, in fact, provide correct data quality evaluation at scale. They applied the method to five machine learning tasks common in science: noisy label detection, watermark removal, data summarization, active data acquisition and domain adaptation.

The researchers found the approximation approach could effectively rank data points according to importance, accurately identify which samples were noisy, choose which data samples were most important to acquire or quantify, and determine which samples could train models that will do well on the same domain.

The team concluded that Shapley-based methods could outperform “leave-one-out” cross validation methods in both run time and experimental performance. Most notably, they determined a Shapley approach incorporating a surrogate K-nearest neighbor classifier was the most efficient solution and best overall performer. Kailkhura’s LDRD project funded the work.
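The efficiency of a K-nearest neighbor surrogate can be sketched as follows. For an unweighted KNN classifier and a single test point, the Shapley value of every training point can be obtained from one sort by distance followed by a backward recursion, with no subset enumeration. The code below follows that published recursion as we understand it. The function name `knn_shapley`, the toy data and the choice of K are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def knn_shapley(X_train, y_train, x_test, y_test, K):
    """Shapley value of each training point for one test point under an
    unweighted KNN utility (share of the K nearest neighbors with the right label)."""
    N = len(y_train)
    dist = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dist, kind="stable")  # nearest first
    match = (y_train[order] == y_test).astype(float)

    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N  # farthest point starts the recursion
    for i in range(N - 2, -1, -1):
        rank = i + 1  # 1-based rank by distance
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, rank) / rank

    values = np.zeros(N)
    values[order] = s  # map back to the original training order
    return values


# Hypothetical toy data: the point at 1.0 carries a label that disagrees with the test label.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1, 0, 1, 1])
values = knn_shapley(X_train, y_train, np.array([0.2]), 1, K=2)
print(values)
```

In this toy run the point whose label disagrees with the nearby test point receives a negative value, which is the signal a noisy-label detector would look for. Averaging these per-test-point values over a test set gives a value for each training point at roughly the cost of one sort per test point.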

Co-authors on the data poisoning paper included Akshay Mehra and Jihun Hamm of Tulane University and Pin-Yu Chen of IBM Research. Co-authors on the data evaluation paper included Ruoxi Jia of Virginia Tech, Fan Wu and Bo Li of the University of Illinois Urbana-Champaign, Xuehui Sun of Shanghai Jiao Tong University, Jiacen Xu of the University of California, Irvine, David Dao and Ce Zhang of ETH Zurich and Dawn Song of the University of California, Berkeley.


Source: LLNL
