Appen Combats Biased AI with Diverse Training Data Sets for NLP Initiatives
SAN FRANCISCO, April 30, 2021 – Appen Limited (ASX:APX), a leading provider of high-quality training data for organizations that build effective AI systems at scale, is enabling organizations to launch, update and operate unbiased AI models through a range of projects and partnerships. Supported by its global crowd of more than one million data annotation specialists, Appen has developed diverse training data sets for AI models, particularly for natural language processing (NLP) initiatives, to ensure end users receive the same experience, no matter their language variety, dialect, ethnolect, accent, race or gender.
AI projects based on biased or incomplete data don’t work for everyone. According to a report published in March 2020 in PNAS (Proceedings of the National Academy of Sciences), popular automated speech recognition (ASR) systems, which are used for virtual assistants, closed captioning, hands-free computing and much more, exhibit significant racial disparities in performance. The report concludes that more diverse training datasets are needed to reduce these performance differences and ensure speech recognition technology is inclusive. Language interpretation and NLP systems suffer from the same challenge and require the same solution.
“The quality and diversity of training data directly impact the performance and bias present in AI models,” said Appen CEO Mark Brayan. “As a data partner, we can supply complete training data for many use cases to ensure AI models work for everyone. It’s critical that we engage a diverse group of individuals to produce, label, and validate the data to ensure the model being trained is not only equitable, but also built responsibly.”
Range of Appen Language Projects
Appen demonstrates its commitment to creating AI for everyone through a variety of projects and partnerships focused on the diversity of languages and dialects.
- Translators without Borders (TWB) partnership – Appen, in partnership with TWB, Amazon, Carnegie Mellon University, Facebook, Google, Johns Hopkins University, Microsoft, and Translated, joined the Translation Initiative for COVID-19 (TICO-19), which supported the development of language technology to make COVID-19 information available in as many languages as possible, including languages spoken in developing countries, such as Congolese Swahili, Tigrinya, and Nigerian Fulfulde.
- The Inuktitut translation project – In collaboration with the Government of Nunavut and using Appen services, Microsoft added Inuktitut, an Indigenous language spoken in the Canadian Arctic, to Microsoft Translator.
- The Canadian French translation project – Appen coordinated with native language consultants to help Microsoft add “Canadian French” as a language option in Microsoft Translator.
- African American Vernacular English (AAVE) off-the-shelf datasets – Most existing training datasets used in ASR, search engines, voice assistants and sentiment analysis are not representative of AAVE. To make high-quality AAVE data available, Appen is working with AAVE speakers among its crowd of annotators to collect data for an OTS dataset based on conversations about a broad range of topics.
“Biased AI data leads to projects that can fail to deliver the expected business results and harm the individuals they are supposed to benefit,” said Dr. Judith Bishop, Senior Director of AI Specialists at Appen. “The scale and complexity of AI projects make it impossible for most companies to acquire sufficient unbiased high-quality data without partnering with an AI data expert. Appen’s commitment to developing the most diverse and expert crowd of data annotators provides the industry with a clearly differentiated resource for building fair and ethical AI projects.”
Appen’s Leading Approach to Diversity
Appen relies on training data annotators from over 170 countries. Language representation includes 235 unique languages and 395 dialects. Over the years, the Appen crowd of annotators has included over 30,000 fluent trilingual speakers – a true testament to diversity and expertise.
Appen also offers off-the-shelf (OTS) datasets designed to make it easier and faster for businesses to acquire the high-quality training data they need to accelerate their AI and machine learning projects. OTS datasets are available for 80 languages and multiple dialects, including hard-to-acquire languages such as Croatian, Greek, Hungarian, Thai and several varieties of Arabic.
According to the United Nations Department of Economic and Social Affairs, “about 97 percent of the world’s population speaks just 4 percent of its languages.” That 4 percent is only 280 languages, yet the number of languages well served by core AI technologies is a fraction of that number. Appen aims to help increase that number through these and future projects.
About Appen Limited
Appen collects and labels images, text, speech, audio, and video used to build and continuously improve the world’s most innovative artificial intelligence systems. With expertise in more than 235 languages, a global crowd of over 1 million skilled contractors, and one of the industry’s most advanced AI-assisted data annotation platforms, Appen solutions provide the quality, security, and speed required by leaders in technology, automotive, financial services, retail, manufacturing, and governments worldwide. Founded in 1996, Appen has customers and offices around the world.
Source: Appen Limited