Tech Mahindra Uses AI to Close the Language Gap in India
Project Indus LLM
Tech Mahindra has built an open-source large language AI model to meet the needs of roughly 25% of the world's population, with the goal of empowering the Indic languages that trace their roots to the Indus Civilisation.
Through cooperation among the three pillars of society (government, business, and academia), Project Indus aims to create an India-based foundational model for Indic languages. Rural finance will be the model's main focus, with small, medium, and large model sizes under consideration. The model would be trained first on Hindi and then on other Indian languages, which requires a large amount of Hindi-language data; the team encourages everyone to support this special national endeavour by contributing linguistic titbits in their own distinctive Hindi.
Project Indus LLM Description
Beginning with Hindi and its dialects, Project Indus LLM seeks to develop a strong language model for Indian languages. The model is open source and hosted on Hugging Face, so researchers and developers interested in India's linguistic diversity can easily integrate and extend it.
The Indus LLM model is pretrained and instruct-tuned in Hindi and its dialects.
- Created by: Vinay Sharma, Satish Mishra, Nilesh Brahme, and Nikhil Malhotra (Makers Lab, Tech Mahindra)
- Model type: Foundational language model
- Language(s) (NLP): hin, bho, mai, doi
- Licence: Other
- Parent model: None; the model is based entirely on the GPT-2 architecture, from the tokeniser to the decoder.
Applications
Uses include question answering and dialogue in Hindi and its dialects. The reward-tuned methodology is intended to apply across a variety of sectors, such as healthcare, automotive, telecom, and call centres.
Straightforward Use
Without further training, Project Indus LLM can be used directly to generate text, simulate conversations, and perform other text generation tasks.
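As a minimal sketch of direct use, assuming the model follows the standard Hugging Face causal-LM interface (the repository name below is a placeholder, not the confirmed model ID):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository name; substitute the actual Project Indus
# model ID from Hugging Face.
model_id = "makers-lab/project-indus-llm"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hindi prompt: "What is the capital of India?"
prompt = "भारत की राजधानी क्या है?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation; the sampling settings are illustrative.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```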
Use Outside of Scope
At this time, Project Indus LLM is not suitable for fill-in-the-blank exercises, repeated Q&A, or similar applications, nor is it intended for high-stakes decision-making such as medical diagnosis or legal advice.
Limitations, Risks, and Bias
Bias and fairness concerns with language models have been extensively studied. The model may produce outputs that reference identity traits; sensitive, social, and occupational groupings; and unsettling or harmful stereotypes across protected classes. The team has attempted to remove a variety of biases from the training data, but because the model is generative, it may still hallucinate. Any unsettling or harmful stereotype the model produces is entirely accidental and inadvertent.
Suggestions
Users are advised to guard against biases and negative connotations in the model's output, and any emergent bias or misuse scenarios should be addressed through frequent updates and community input.
Training Specifics
A carefully selected dataset containing Hindi text from a variety of sources, such as books, news stories, and websites, was used to train the model.
Infrastructure
- Training Infrastructure: Used CDAC's high-performance computing resources, including NVIDIA A100 GPUs.
- Operating Infrastructure: Tested in CPU (Intel Xeon Platinum 8580) and GPU (NVIDIA GeForce RTX 3070 or higher) settings.
Data for Training
Project Indus LLM was trained on a large and varied dataset of text in Hindi and its dialects drawn from multiple sources. The data collection and curation process was designed to capture the linguistic diversity and complexity of Indian languages, with a special emphasis on Hindi and its 37 dialects.
Sources and Collection of Data
Three primary buckets were used to collect the data:
Open-Source Hindi Data: Publicly accessible online resources across a variety of areas, including news and non-news. Text was scraped and extracted from web pages using automated tools (a scraping sketch follows this list). Some of the sources are as follows:
- News: News portal articles.
- Non-News: A variety of sites, such as commoncrawl.org, Wikipedia, and other culturally relevant material like AIR's “Mann Ki Baat.”
Translated Data: Three distinct translation models were employed to convert a portion of the Pile dataset (a sizeable English dataset used to train AI models) into Hindi. Based on its accuracy and efficiency, IndicTrans2 (AI4Bharat) was chosen as the best model for this purpose (a translation sketch also follows the list).
Dialects: The scarcity of dialect material on the internet made this data particularly difficult to gather. Important dialects such as Maithili, Bhojpuri, Magahi, and Braj Bhasha were covered through a variety of sources, including fieldwork in which representatives collected antique books and other texts that were then digitised and converted into text data.
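A minimal sketch of the scraping step for the open-source bucket, assuming simple static pages (the URL is a placeholder; a real pipeline would also need politeness controls such as robots.txt checks and rate limiting):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real collection targeted news portals, Wikipedia,
# commoncrawl.org, and similar sources.
url = "https://example.com/hindi-article"

response = requests.get(url, timeout=10)
response.raise_for_status()

# Strip markup and keep only the visible text.
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
print(text[:500])
```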
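The translation bucket can be sketched with the standard transformers translation pipeline. The model below is a publicly available English-to-Hindi stand-in, not the IndicTrans2 setup the team actually used; IndicTrans2 has its own loading and preprocessing steps documented by AI4Bharat:

```python
from transformers import pipeline

# Stand-in English-to-Hindi model; Project Indus used IndicTrans2
# (AI4Bharat), which requires its own preprocessing toolkit.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

english_sentences = [
    "Language models learn from large amounts of text.",
    "India has a rich diversity of languages and dialects.",
]
for sentence in english_sentences:
    result = translator(sentence, max_length=128)
    print(result[0]["translation_text"])
```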
Method of Training
Supervised learning was conducted on a high-performance computing system after the text had undergone considerable preprocessing to sanitise and standardise it.
- Pre-training: A dataset of 22 billion tokens, prepared with sophisticated tokenisation techniques, was used (a configuration sketch follows this list).
- Fine-Tuning: Supervised fine-tuning was carried out on datasets designed especially for cultural, political, and social contexts, with an emphasis on Indian languages.
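Since the model is based entirely on the GPT-2 architecture (see the model description above), initialising it for pre-training might look like the sketch below. The hyperparameters are illustrative assumptions; the source does not state the actual model dimensions:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative configuration only: vocabulary size, depth, and width are
# assumptions, not Project Indus's published hyperparameters.
config = GPT2Config(
    vocab_size=50_000,   # assumed to match the custom Hindi BPE tokeniser
    n_positions=1024,    # maximum context length
    n_embd=768,
    n_layer=12,
    n_head=12,
)

model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```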
The datasets used for pre-training and fine-tuning the model are summarised in the table below:
| Phase | Data Source | Tokens | Notes |
|---|---|---|---|
| Pre-training | Cleaned dataset of Hindi and dialects | 22 billion | Advanced tokenisation used |
| Fine-tuning | Custom datasets tailored for Indian languages | Varied | Focus on cultural, political, and social contexts |
Preparation
To guarantee excellent quality and training utility, the gathered data went through multiple cleaning and preparation stages:
Cleaning: Unwanted language, stray characters, and private data such as mobile numbers were removed. Unwanted tags from scraped web pages were stripped, and transliteration was applied where needed (a sketch follows).
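A minimal sketch of this kind of cleaning, assuming simple regex rules; the actual pipeline and its patterns are not published, so these expressions are illustrative:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass: strip HTML tags and mask phone numbers."""
    # Remove leftover HTML tags from scraped pages.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Mask Indian-style 10-digit mobile numbers (optional +91 prefix).
    text = re.sub(r"(\+91[\-\s]?)?\b[6-9]\d{9}\b", "[PHONE]", text)
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return text

sample = "<p>संपर्क करें: +91 9876543210</p>"
print(clean_text(sample))  # -> "संपर्क करें: [PHONE]"
```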
Bias Removal: A Bias Removal Toolkit was created to identify and eliminate biased language from the training data. This toolkit helped ensure the model was trained on ethically sound, accurate, and socially conscious text.
Tokenisation: The data was tokenised with a custom tokeniser created especially for Hindi and its dialects. The tokeniser was built on Byte Pair Encoding (BPE) with additions such as byte fallback to handle the quirks of Hindi script effectively (a training sketch follows).
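Training such a tokeniser could look like the sketch below, using the Hugging Face tokenizers library's BPE model with byte fallback enabled. The vocabulary size and file paths are assumptions, and the team's actual tokeniser may differ in its details:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with byte fallback, so characters outside the learned
# vocabulary can degrade to bytes instead of unknown tokens.
# Note: complete byte fallback also requires the 256 byte tokens
# (<0x00>..<0xFF>) to be present in the vocabulary.
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # assumed size, not the published figure
    special_tokens=["<unk>", "<s>", "</s>"],
)

# Placeholder corpus path; the real corpus was ~200 GB of cleaned text.
tokenizer.train(files=["hindi_corpus.txt"], trainer=trainer)
tokenizer.save("indus_tokenizer.json")
```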
In brief
The final training dataset included the following:
- Raw data size: more than 500 GB of unprocessed data gathered.
- Cleaned and curated data: roughly 200 GB of clean Hindi and dialect text.
- Tokenisation: 22 billion tokens generated from the cleaned data for pre-training.
With this broad and varied training base, Project Indus LLM developed strong comprehension and generation capabilities for Hindi text, making it an effective solution for applications that require Indian-language processing.