Updated September 23rd, 2020 at 12:56 IST

IIT-Madras, AI4Bharat develop AI models to process texts in 11 Indian languages

IIT-Madras has developed artificial intelligence (AI) models and datasets in association with AI4Bharat that can process texts in 11 major Indian languages.

Reported by: Vishal Tiwari

The Indian Institute of Technology Madras (IIT-M), in association with AI4Bharat, has developed artificial intelligence (AI) models and datasets that can process text in 11 major Indian languages from the Indo-Aryan and Dravidian families. The languages are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The models and datasets also cover Indian English, bringing the total to 12 languages.


The tools will help computers process text in Indian languages, which in turn will help learners, industry, and start-ups work and innovate more efficiently, said one of the professors involved in the project. Such tools already exist for English but are lacking for Indian languages, and this project will help fill the gap. To address the problem, IIT-M and AI4Bharat released IndicNLPSuite, a collection of resources and models for Indian languages that includes IndicCorp, IndicFT, IndicBERT, and IndicGLUE.


How does it work?

The monolingual corpora contain a total of 8.9 billion tokens across the 11 Indian languages and Indian English, sourced primarily from news crawls. The word embeddings are based on FastText, which makes them well suited to the morphological complexity of Indian languages. The pre-trained language models are based on ALBERT, which was chosen because it is compact and therefore easier to use in downstream tasks.
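One reason FastText-style embeddings cope with rich morphology is that they represent a word through its character n-grams, so inflected forms of the same stem share many subword units. The snippet below is only a minimal sketch of that idea, not the IndicFT implementation: the function names `char_ngrams` and `shared_ngrams` are hypothetical, the 3-to-6 n-gram range mirrors FastText's usual defaults, and the romanized Hindi words are illustrative inputs that do not appear in the study.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from those in the middle of a word.
    """
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def shared_ngrams(word_a, word_b):
    """Return the n-grams that two words have in common."""
    return char_ngrams(word_a) & char_ngrams(word_b)


if __name__ == "__main__":
    # Illustrative romanized forms of one Hindi noun (assumed examples).
    common = shared_ngrams("ladka", "ladke")
    print(sorted(common))
```

In FastText itself, each n-gram is hashed to a vector and a word's embedding is built from the vectors of its n-grams plus the whole word, so an unseen inflected form can still receive a meaningful embedding from the subwords it shares with known words.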


"Lastly, the IndicGLUE benchmark for Indian language NLU contains datasets for the following tasks: Article Genre Classification, Headline Prediction, Named Entity Recognition, Cross-lingual Sentence Retrieval, Wikipedia Section-Title Prediction, and Cloze-style Multiple-choice QA," the researchers said in their study, published on the AI4Bharat website.


(Image Credit: AI4Bharat/Website)


Published September 23rd, 2020 at 12:57 IST