July 3, 2019

What’s My Line? GPUs Help Researcher Decipher Ancient Sanskrit

With 10 verb tenses, eight noun cases, three grammatical genders and a strong predilection for compound words, Sanskrit is not an easy language to teach a human, let alone an AI model.

But Indologist Oliver Hellwig is undertaking the challenge, training deep learning models that can analyze Sanskrit texts up to 4,000 years old. A digital repository of Sanskrit works parsed word by word would enable researchers to more easily search for information and better identify passages with parallel context.

AI is being used to interpret historical texts in German and Italian, as well as classical Japanese literature. But most existing NLP models are geared towards Western languages that follow similar rules of grammar, punctuation and formatting.

That presents a challenge for researchers developing software to transcribe and analyze scripts that are read right to left, are pictographic rather than phonetic, or, as with Sanskrit, often leave no spaces between words.

Unlike English, Sanskrit is a highly inflected language, which means words change their form depending on their function in a sentence. Some Sanskrit verbs have more than 200 forms depending on the context. The language also has an extensive vocabulary, with more than 50 words for terms like “sun” or “moon” — making it essential that an AI model be trained on a large, diverse dataset of text.

Hellwig, a postdoctoral researcher at the University of Zurich, Switzerland, knew 15 years ago that computational tools could open new possibilities for his linguistics research, but found that just a fraction of Sanskrit manuscripts had been digitized into machine-readable text.

For a half hour almost every day since, he’s been changing that bit by bit, painstakingly parsing Sanskrit works and adding them to a database that now consists of 4.5 million manually labeled words.

Hellwig began building Sanskrit-parsing tools from scratch — starting with statistical models before advancing to more complex optical character recognition and NLP models. Using an NVIDIA Quadro GPU, he’s now training deep learning models that can identify characters and find word endings in Sanskrit texts.

AI tools that transcribe Sanskrit could help digitize a vast corpus of historical manuscripts, spanning epic poetry, religious texts and Ayurvedic medicine.

Segmenting Sanskrit

When training an AI model for texts based on the Latin alphabet, researchers can teach the neural network to detect white spaces to determine where one word ends and another begins.

That’s not the case for Sanskrit manuscripts, where one line of text can be made up of multiple words merged together into just one or two compound strings. The word sandhi, meaning “connection,” is used to describe the phonetic process of joining these words together.
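To make the contrast concrete, here is a minimal Python sketch. The sample strings are illustrative assumptions, not lines from Hellwig’s database. Splitting on white space recovers every word of the English sentence, but returns only two strings for a Sanskrit phrase that actually contains four words (tatra, api, na, asti).

```python
def whitespace_tokens(line: str) -> list[str]:
    """Split a line into tokens at runs of white space."""
    return line.split()


# Illustrative inputs (assumed for this sketch).
english = "the sun rises in the east"
sandhied = "tatrāpi nāsti"  # four words written as two merged strings

print(whitespace_tokens(english))   # six tokens, one per word
print(whitespace_tokens(sandhied))  # two tokens, although four words are present
```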

An effective NLP model for Sanskrit texts must be able to split a sandhied line into individual words, posing a significant challenge for researchers.

“Any algorithm has to a certain degree understand the semantics of a line of text to generate a valid split form of it,” said Hellwig. “What’s quite trivial for English is actually the most problematic step in Sanskrit.”
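A toy sketch can show what generating a split form involves. This is not Hellwig’s deep learning model. It is a plain dictionary lookup that assumes a seven-word lexicon and three hand-picked vowel sandhi rules (a + a → ā, a + i → e, a + e → ai), and the names LEXICON, VOWEL_SANDHI and candidate_splits are hypothetical, chosen only for this illustration.

```python
# Toy lexicon, assumed for illustration; a real system needs a far larger one.
LEXICON = {"tatra", "api", "na", "asti", "ca", "eva", "iha"}

# Each merged sound maps to the (end of left word, start of right word)
# pairs that could have produced it. Only a small subset of the real rules.
VOWEL_SANDHI = {
    "ā": [("a", "a")],
    "e": [("a", "i")],
    "ai": [("a", "e")],
}


def candidate_splits(text: str) -> list[list[str]]:
    """Return every segmentation of `text` into lexicon words,
    undoing the toy sandhi rules at word boundaries."""
    if text == "":
        return [[]]
    results = []
    # Case 1: a lexicon word is a plain prefix (no sound change at the boundary).
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in LEXICON:
            for rest in candidate_splits(text[i:]):
                results.append([head] + rest)
    # Case 2: a merged sound starting at position i hides a word boundary.
    for i in range(1, len(text)):
        for merged, pairs in VOWEL_SANDHI.items():
            if not text.startswith(merged, i):
                continue
            for left_end, right_start in pairs:
                head = text[:i] + left_end
                remainder = right_start + text[i + len(merged):]
                if head in LEXICON:
                    for rest in candidate_splits(remainder):
                        results.append([head] + rest)
    return results


print(candidate_splits("tatrāpi"))  # [['tatra', 'api']]
print(candidate_splits("nāsti"))    # [['na', 'asti']]
print(candidate_splits("caiva"))    # [['ca', 'eva']]
```

With a realistic vocabulary and the full set of sandhi rules, a search like this would be expected to return many candidate splits for a single line. Choosing among them is the step that calls for some grasp of meaning, and a lookup table cannot supply that.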

The deep learning tool Hellwig developed to split lines of Sanskrit into individual words is 10 to 15 percent more accurate than previous methods.

“I was surprised that it worked so well,” he said, “because it’s a complicated task, even for human readers using the original forms of these texts.”

Using an NVIDIA GPU sped up the training of Hellwig’s AI models tenfold. Faster training lets him evaluate errors sooner and develop more accurate models more efficiently. His sandhi-splitting tool is now being used on a large Sanskrit corpus called GRETIL.

Many historians debate the age of key Sanskrit texts — particularly religious works like the Bhagavad Gita. To contribute to this academic conversation, Hellwig wants to use neural networks and NVIDIA GPUs to analyze the grammatical structure and language patterns in ancient Sanskrit texts.

By connecting this linguistic evidence with a model of how Sanskrit changed over time, he hopes to help determine when some of these major texts were composed.

Source: Isha Salian, NVIDIA
