April 30, 2021

Appen Combats Biased AI with Diverse Training Data Sets for NLP Initiatives

SAN FRANCISCO, April 30, 2021 — Appen Limited (ASX:APX), a leading provider of high-quality training data for organizations that build effective AI systems at scale, is enabling organizations to launch, update and operate unbiased AI models through a range of projects and partnerships. With support from the company’s global crowd of more than a million data annotation specialists, Appen has developed diverse training data sets for AI models, particularly for natural language processing (NLP) initiatives, to ensure end users receive the same experience, no matter their language variety, dialect, ethnolect, accent, race or gender.

AI projects based on biased or incomplete data don’t work for everyone. According to a report published in March 2020 in the Proceedings of the National Academy of Sciences (PNAS), popular automated speech recognition (ASR) systems, which are used for virtual assistants, closed captioning, hands-free computing and much more, exhibit significant racial disparities in performance. The report concludes that more diverse training datasets are needed to reduce these performance differences and ensure speech recognition technology is inclusive. Language interpretation and NLP systems suffer from the same challenge and require the same solution.

“The quality and diversity of training data directly impacts the performance and bias present in AI models,” said Appen CEO Mark Brayan. “As a data partner, we can supply complete training data for many use cases to ensure AI models work for everyone. It’s critical that we engage a diverse group of individuals to produce, label, and validate the data to ensure the model being trained is not only equitable, but also built responsibly.”

Range of Appen Language Projects

Appen demonstrates its commitment to creating AI for everyone through a variety of projects and partnerships focused on the diversity of languages and dialects.

Translators without Borders (TWB) partnership – Appen, in partnership with TWB, Amazon, Carnegie Mellon University, Facebook, Google, Johns Hopkins University, Microsoft, and Translated, joined the Translation Initiative for COVID-19 (TICO-19), which supported the development of language technology to make COVID-19 information available in as many languages as possible, including languages spoken in developing countries, such as Congolese Swahili, Tigrinya, and Nigerian Fulfulde.
The Inuktitut translation project – In collaboration with the Government of Nunavut, Microsoft used Appen services to add Inuktitut, an Indigenous North American language spoken in the Canadian Arctic, to Microsoft Translator.
The Canadian French translation project – Appen coordinated with native language consultants to help Microsoft add “Canadian French” as a language option in Microsoft Translator.
African American Vernacular English (AAVE) off-the-shelf (OTS) datasets – Most existing training datasets used in ASR, search engines, voice assistants and sentiment analysis are not representative of AAVE. To make high-quality AAVE data available, Appen is working with AAVE speakers among its crowd of annotators to collect data for an OTS dataset based on conversations about a broad range of topics.

“Biased AI data leads to projects that can fail to deliver the expected business results and harm individuals they are supposed to benefit,” said Dr. Judith Bishop, Senior Director of AI Specialists at Appen. “The scale and complexity of AI projects makes it impossible for most companies to acquire sufficient unbiased high-quality data without partnering with an AI data expert. Appen’s commitment to developing the most diverse and expert crowd of data annotators provides the industry with a clearly differentiated resource for building fair and ethical AI projects.”

Appen’s Leading Approach to Diversity

Appen relies on training data annotators from over 170 countries. Language representation includes 235 unique languages and 395 dialects. Over the years, the Appen crowd of annotators has included over 30,000 fluent trilingual speakers – a true testament to diversity and expertise.

Appen also offers off-the-shelf (OTS) datasets designed to make it easier and faster for businesses to acquire the high-quality training data they need to accelerate their AI and machine learning projects. OTS datasets are available for 80 languages and multiple dialects, including hard-to-acquire languages such as multiple varieties of the Arabic language, Croatian, Greek, Hungarian, Thai and more.

According to the United Nations Department of Economic and Social Affairs, “about 97 percent of the world’s population speaks just 4 percent of its [7000] languages.” That 4 percent is only 280 languages, yet the number of languages well served by core AI technologies is a fraction of that number. Appen aims to help increase that number through these and future projects.

About Appen Limited

Appen collects and labels images, text, speech, audio, and video used to build and continuously improve the world’s most innovative artificial intelligence systems. With expertise in more than 235 languages, a global crowd of over 1 million skilled contractors, and one of the industry’s most advanced AI-assisted data annotation platforms, Appen solutions provide the quality, security, and speed required by leaders in technology, automotive, financial services, retail, manufacturing, and governments worldwide. Founded in 1996, Appen has customers and offices around the world.

Source: Appen Limited
