Meta Releases AI Model That Translates Over 200 Languages
If you’ve ever witnessed the bizarre word stew that Facebook sometimes concocts when translating content between languages, you have seen how translation technology doesn’t always hit its mark. That could be changing soon, especially for less common languages.
Meta has released an open-source AI model capable of translating between 202 languages. The model, NLLB-200, is named after the company’s No Language Left Behind initiative. Meta says it will improve the quality of translations across its technologies by an average of 44%, as measured by BLEU benchmark scores, with that number jumping to 70% for some African and Indian languages.
The No Language Left Behind effort stems from the lack of high-quality translation tools for what natural language researchers call low-resource languages: those with little to no data available to train language models. Without proper means of translation, speakers of these languages, many of them in Africa and Asia, may be unable to fully engage with online communication or content in their preferred or native languages. Meta’s initiative seeks to change that.
“Language is the key to inclusion. If you don’t understand what people are saying or writing, you can be left behind,” said Jean Maillard, research engineer at Meta AI, in a video.
The model supports 55 African languages with high-quality results, according to Meta, while other popular translation tools support fewer than 25. To improve the NLLB-200 model and confirm that its translations are high quality, Meta built an evaluation dataset called FLORES-200 that allows assessment of the model’s performance in 40,000 different language directions.
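The 40,000 figure is close to what one gets by counting every ordered source-to-target pair among roughly 200 languages. The short sketch below illustrates that arithmetic; it assumes a "direction" means an ordered pair of distinct languages, and the helper name and the four sample language codes are illustrative only, not taken from Meta's materials.

```python
from itertools import permutations


def count_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target."""
    return num_languages * (num_languages - 1)


# Cross-check the formula on a small, hypothetical set of language codes.
sample_langs = ["eng", "fra", "lin", "ful"]
sample_directions = list(permutations(sample_langs, 2))
assert len(sample_directions) == count_directions(len(sample_langs))

print(count_directions(200))  # 39800
print(count_directions(202))  # 40602
```

Either count rounds to about 40,000, which matches the number quoted for FLORES-200.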
The company is now sharing NLLB-200 and FLORES-200 along with the model training code and the code for reproducing the training dataset. Meta is also offering grants up to $200,000 to nonprofit organizations and researchers for what it calls impactful uses of NLLB-200, or projects related to sustainability, food security, gender-based violence, or education. The company is specifically encouraging nonprofits focused on translating two or more African languages to apply for the grants, as well as researchers in linguistics, machine translation and language technology.
Meta has lofty goals for its own use of the language model. NLLB-200 will support the more than 25 billion translations served daily on Facebook, Instagram, and other platforms maintained by the company. The company asserts that more accurate translations in more languages may aid in finding harmful content or misinformation, protecting election integrity, and stopping online sexual exploitation and human trafficking.
Additionally, Meta has begun a partnership with the Wikimedia Foundation to improve translations on Wikipedia by using NLLB-200 as its back-end content translation tool. For languages spoken mainly outside of Europe and North America, there are far fewer articles available than the more than 6 million English entries or the 2.5 million available in Swedish. To illustrate, for the 45 million speakers of Lingala, a language spoken in several African countries including the Democratic Republic of the Congo, there are only 3,260 Wikipedia articles in their native language.
“This is going to change the way that people live their lives … the way they do business, the way that they are educated. No language left behind really keeps that mission at the heart of what we do, as people,” said Al Youngblood, user researcher at Meta AI, in a video.
Like most AI projects, NLLB-200 has come with challenges. AI models are trained with large amounts of data, and “for text translation systems, this typically consists of millions of sentences carefully matched between languages. But there simply aren’t large volumes of parallel sentences across, say, English and Fula,” the company noted.
Researchers could not take the usual route of mining parallel data from the web, as the required data may not exist at all for some languages, and what does exist can be inaccurate. Instead, Meta upgraded an existing NLP toolkit, LASER, into a new version. The LASER3 multilingual embedding method “uses a Transformer model that is trained in a self-supervised manner with a masked language modeling objective. We further boosted performance by using a teacher-student training procedure and creating language-group specific encoders, which enabled us to scale LASER3’s language coverage and produce massive quantities of sentence pairs, even for low-resource languages.” LASER3 and its billions of parallel sentences in different language pairs are also now being offered as open-source tools.
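To give a rough sense of how sentence embeddings can be used to produce sentence pairs, here is a minimal sketch that pairs each source sentence with its most similar target sentence by cosine similarity. It is not LASER3: the three-dimensional vectors are hand-written stand-ins for encoder output, the function names and the 0.9 threshold are hypothetical, and plain nearest-neighbor matching is a simplification of whatever scoring Meta actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its closest target sentence.

    src and tgt map sentence text to an embedding vector. A pair is kept
    only if the best similarity reaches the (assumed) threshold.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_score >= threshold:
            pairs.append((s_text, best_text))
    return pairs


# Toy embeddings: made-up numbers standing in for a multilingual encoder.
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "It is raining.": [0.1, 0.9, 0.1],
    "Stock prices fell.": [0.5, 0.5, 0.0],
}
french = {
    "Le chat dort.": [0.88, 0.12, 0.02],
    "Il pleut.": [0.12, 0.85, 0.15],
    "Bonjour à tous.": [0.0, 0.1, 0.95],
}

mined = mine_pairs(english, french)
print(mined)
```

In this toy data, the two sentences with a genuine counterpart are paired, while "Stock prices fell." finds no target above the threshold and is left unpaired.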
Meta says that optimizing a single model to work effectively and accurately across hundreds of languages was also a significant challenge requiring ingenuity. Translation models can generate hard-to-trace errors such as misstatements, unsafe content, and “hallucinations,” or glitches that can change the meaning of the input completely.
“We completely overhauled our data cleaning pipeline to scale to 200 languages, adding major filtering steps that included first using our LID-200 models to filter data and remove noise from internet-scale corpora with high confidence. We developed toxicity lists for the full set of 200 languages, and then used those lists to assess and filter potential hallucinated toxicity,” the company said. “These steps ensured that we have cleaner and less toxic datasets with correctly identified languages. This is important for improving translation quality and reducing the risk of what is known as hallucinated toxicity, where the system mistakenly introduces toxic content during the translation process.”
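The two filtering steps Meta describes can be pictured with a small sketch. Everything here is an assumption made for illustration: the language-identification labels and confidence scores are stubbed in by hand rather than produced by LID-200, the word lists contain placeholder tokens rather than real toxicity lists, and the rule of dropping a pair when only the target side contains a listed word is one plausible reading of "hallucinated toxicity," not Meta's documented logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    src: str
    tgt: str
    tgt_lang_pred: str    # label a language-ID model might output (stubbed)
    tgt_lang_conf: float  # confidence for that label (stubbed)


# Placeholder word lists; real toxicity lists would be far larger.
TOXIC_WORDS = {"eng": {"badword_en"}, "fra": {"badword_fr"}}


def has_toxic(text: str, lang: str) -> bool:
    """Return True if any listed word for this language appears in text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & TOXIC_WORDS.get(lang, set()))


def keep(c: Candidate, src_lang: str, tgt_lang: str, min_conf: float = 0.9) -> bool:
    """Apply the language-ID filter, then the toxicity-imbalance filter."""
    if c.tgt_lang_pred != tgt_lang or c.tgt_lang_conf < min_conf:
        return False  # wrong language, or identified with low confidence
    if has_toxic(c.tgt, tgt_lang) and not has_toxic(c.src, src_lang):
        return False  # toxicity appears only on the target side
    return True


candidates = [
    Candidate("It is raining.", "Il pleut.", "fra", 0.99),
    Candidate("It is raining.", "It is raining.", "eng", 0.98),
    Candidate("The cat sleeps.", "Le chat dort.", "fra", 0.55),
    Candidate("The cat sleeps.", "Le chat badword_fr dort.", "fra", 0.97),
    Candidate("That badword_en cat sleeps.", "Ce chat badword_fr dort.", "fra", 0.96),
]

kept = [c for c in candidates if keep(c, "eng", "fra")]
for c in kept:
    print(c.src, "->", c.tgt)
```

Only the first and last candidates survive: the others fail on language mismatch, low confidence, or toxicity that appears in the target but not the source.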
For comprehensive technical specifications, read the Meta researchers’ full scientific paper. To see NLLB-200’s translation capabilities in action through stories translated with the technology, visit the Meta AI Demo Lab.