By Kesava Reddy, Chief Revenue Officer, E2E Networks
With over 1.4 billion citizens conversing in 22 constitutionally recognised languages and hundreds more regional dialects, India has an unprecedented linguistic landscape. This is coupled with a rapidly growing digital footprint – internet users have already crossed 700 million, while 450 million Indians now access the online world daily through their smartphones. As connectivity becomes even more affordable and reliable across the country, these adoption rates are poised for further growth in the coming years.
Despite this, the majority of online content and services are in English, which creates a barrier for many Indians seeking to access and benefit from the digital world. Large Language Models (LLMs) can change this narrative. LLMs are the new frontier of artificial intelligence, able to process and generate natural language at an unprecedented scale and quality. They have shown remarkable capabilities across domains such as conversational AI, text summarisation, and natural language understanding.
If LLMs that support Indian languages were developed, people could communicate and interact with online platforms and applications in their native tongues, without human translators or intermediaries. This would significantly expand the opportunities available to those who do not speak English. Besides improving the accessibility of services in sectors including finance, governance, healthcare, and agriculture, it would also open up new avenues for revenue generation.
Open Source LLMs of 2023
One of the most remarkable developments in the field of LLMs in 2023 was the emergence of open-source models. These models, released by various research groups and organisations, have equalled or surpassed the capabilities of proprietary LLMs in many cases, while offering innovators the added advantage of fine-tuning or adapting them to their needs.
Due to their adaptability, these LLMs offer a great starting point for building India-specific LLMs, and many of them have already seen some training on Indian languages. For instance, BLOOM can generate natural language text in dozens of languages, including Hindi, Bengali, Tamil, Telugu, and Urdu. Falcon 180B, a 180-billion-parameter model by TII, is another LLM with multilingual text generation capabilities. Similarly, the Mistral and Llama2 LLMs can also handle multiple languages, including Indian languages such as Hindi, Bengali, Tamil, and Telugu.
However, the challenge with these LLMs is that they perform far better in English than in Indian languages: their training corpora are dominated by English, and their tokenisers often fragment Indic scripts into many more tokens than English text of comparable length. Moreover, Indian languages are layered with complex sentence structures and contextual subtleties, requiring LLM architectures that are not only technically robust but also culturally aware. To truly leverage what LLMs can offer us, we need models specifically designed for Indian languages.
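One easy-to-verify aspect of this gap is at the level of text encoding itself. Byte-level tokenisers (used by many open LLMs) start from UTF-8 bytes, and Devanagari costs three bytes per character versus one for ASCII, so Indic text tends to be split into more pieces unless the tokeniser was trained on enough Indic data. A minimal Python sketch of this byte-cost disparity (the example strings are illustrative, not from any benchmark):

```python
# Compare UTF-8 byte costs of English vs Hindi text of similar meaning.
# Byte-level BPE tokenisers operate on these bytes, so a script that costs
# more bytes per character also tends to cost more tokens unless the
# tokeniser has seen plenty of text in that script during training.

english = "How are you?"
hindi = "आप कैसे हैं?"  # rough Hindi equivalent (illustrative)

def byte_stats(text: str) -> tuple[int, int, float]:
    """Return (characters, UTF-8 bytes, bytes per character)."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars

for label, text in [("English", english), ("Hindi", hindi)]:
    chars, nbytes, ratio = byte_stats(text)
    print(f"{label:8s} chars={chars:3d} bytes={nbytes:3d} bytes/char={ratio:.2f}")
```

The English string encodes at one byte per character, while the Hindi string encodes at well over two, which is one of the reasons Indic text ends up over-fragmented by tokenisers trained mostly on English.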
Building Large Language Models for India
Several efforts are currently underway to develop LLMs tailored to Indian languages. Building these models requires addressing some key challenges, such as creating Indian language datasets and training models so that they handle the nuances of Indian languages. The availability of advanced cloud GPUs is another essential factor. To solve the challenge of limited datasets, Bhashini by MeitY has created Bhasha Daan, a crowdsourcing platform that aims to build an open repository of data to digitally enrich Indian languages.
This platform supports multiple crowdsourcing initiatives – Bolo India, Suno India, Likho India, and Dekho India – which invite users to contribute sentences in different languages, validate text or audio transcribed by others, and enrich their language by transcribing the audio they hear. Alongside these efforts to address the lack of datasets, startups are adapting the architecture and training process of AI models so that they can handle the intricacies of Indian languages.
The OpenHathi model released by the startup Sarvam AI, for instance, is a Hindi language model built on Meta AI’s Llama2-7B architecture. It was trained on Hindi, English, and Hinglish, using a subsample of 100K documents from the Sangraha corpus. Another such initiative is the Krutrim LLM developed by Krutrim Si Designs, a model designed to understand and respond in 20 Indian languages.
Apart from dataset availability and innovation in AI architectures and training processes, a key factor is access to advanced GPUs. GPUs are essential for training LLMs, as they accelerate training by several orders of magnitude compared to CPUs. In mid-2023, there was a major supply shortage of advanced GPUs like the H100 and A100. With persistent efforts from cloud service providers, even cutting-edge GPU clusters like the HGX H100 are now available on demand.
With these efforts by startups and innovators towards building India-specific LLMs, we can expect to see more models emerge in 2024, addressing India-centric problems. These LLMs will enable better access to information and services for speakers of Indian languages, reduce the digital divide, enhance inclusivity, and benefit regional economies. They can help startups build applications with deep vernacular language support in domains such as education, health, finance, and governance. Ultimately, our country, with its linguistic and cultural diversity, has a lot to gain from and contribute to LLMs. As we embrace and leverage LLMs for our languages and use cases, we can potentially achieve our vision of a digital and inclusive society, powered by AI.