Express Computer

Engineering voice-first commerce for Bharat: Inside Meesho’s Vaani AI architecture


As generative AI reshapes digital experiences globally, Meesho is betting that the future of Indian commerce will not be search-first or even app-first — it will be conversation-first. With the launch of Vaani, Meesho is building what it describes as a multi-agent AI commerce system designed specifically for Bharat’s next wave of internet users: consumers who are mobile-first, vernacular-first, and increasingly voice-first.

In an exclusive interaction with Express Computer, Ravindra Yadav, Head of Data Science at Meesho, details the architectural thinking behind Vaani, the engineering trade-offs involved in building conversational commerce at population scale, and why Meesho believes the future of AI commerce lies in making interfaces effectively disappear.

Beyond voice assistants: Why Meesho built a multi-agent AI commerce system

According to Yadav, one of the foundational decisions behind Vaani was recognising that commerce is fundamentally different from traditional assistant-driven interactions.

“Voice assistants are good at answering questions or executing commands,” he explains. “But shopping journeys are inherently conversational, iterative, and non-linear.”

A consumer may begin with a vague need, compare multiple products, ask follow-up questions, check reviews for reassurance, evaluate affordability, and only then move towards purchase. A single AI model, Yadav argues, cannot reliably manage that entire journey with sufficient accuracy and contextual understanding.

Instead, Vaani operates as a coordinated multi-agent architecture where multiple specialised AI systems collaborate within a single interaction.

One layer focuses on speech understanding and language interpretation. Another handles intent understanding — determining whether the user is exploring, comparing, seeking recommendations, or ready to transact. Dedicated commerce agents then retrieve product information across millions of listings, surface reviews, evaluate relevance, and guide users through checkout and order confirmation.

Sitting above these systems is an orchestration layer that continuously determines which agent should lead the interaction at a given moment while passing contextual memory across the stack.
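The routing idea described above can be illustrated with a minimal sketch. It is not Meesho's implementation: the keyword classifier stands in for a learned intent model, and every name here (`Context`, `classify_intent`, `AGENTS`, `orchestrate`) is hypothetical. It only shows an orchestrator picking one agent per turn while a shared context object carries memory between them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Context:
    """Shared conversational memory passed between agents (hypothetical shape)."""
    turns: List[str] = field(default_factory=list)
    slots: Dict[str, str] = field(default_factory=dict)


def classify_intent(utterance: str) -> str:
    """Toy keyword classifier standing in for a learned intent model."""
    words = set(utterance.lower().split())
    if words & {"buy", "order", "checkout"}:
        return "transact"
    if words & {"compare", "versus"}:
        return "compare"
    if words & {"review", "reviews"}:
        return "reviews"
    return "explore"


def explore_agent(utterance: str, ctx: Context) -> str:
    # Remember what the shopper is looking for so later agents can reuse it.
    ctx.slots["last_query"] = utterance
    return f"Here are some options for: {utterance}"


def compare_agent(utterance: str, ctx: Context) -> str:
    # Reads memory written by an earlier agent instead of asking again.
    return f"Comparing options for: {ctx.slots.get('last_query', 'your search')}"


def reviews_agent(utterance: str, ctx: Context) -> str:
    return f"Reviews for: {ctx.slots.get('last_query', 'your search')}"


def transact_agent(utterance: str, ctx: Context) -> str:
    return "Starting checkout."


AGENTS: Dict[str, Callable[[str, Context], str]] = {
    "explore": explore_agent,
    "compare": compare_agent,
    "reviews": reviews_agent,
    "transact": transact_agent,
}


def orchestrate(utterance: str, ctx: Context) -> Tuple[str, str]:
    """Pick the agent that should lead this turn and hand it the shared context."""
    intent = classify_intent(utterance)
    ctx.turns.append(utterance)
    reply = AGENTS[intent](utterance, ctx)
    return intent, reply
```

Calling `orchestrate` repeatedly with the same `Context` lets a comparison turn reuse the query captured during an earlier exploration turn, which is the kind of context hand-off the orchestration layer is described as providing.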

The result is a conversational system designed not merely to answer queries but to dynamically reason across evolving user intent.

“Shopping conversations are inherently non-linear,” says Yadav. “Users change preferences, switch topics, ask follow-up questions, or refine requirements mid-conversation. The system therefore needs to continuously reason over evolving intent rather than simply respond to individual queries.”

Vaani also uses a hybrid multimodal architecture that combines speech and vision capabilities. Lightweight speech processing workloads execute at the device edge to reduce latency and maintain responsiveness under weak network conditions, while more computationally intensive reasoning and marketplace intelligence operate in the cloud.

The ambition, Yadav points out, is not to create another voice interface for e-commerce, but to digitally recreate the experience of interacting with a trusted neighbourhood shopkeeper — someone who understands preferences, asks the right questions, and guides customers towards confident purchasing decisions.

Building AI for low-end devices and unstable networks

Designing conversational AI for India introduces challenges fundamentally different from those faced in developed digital markets.

“A foundational insight behind Vaani was recognising that the next several hundred million internet users in India will not experience technology the same way as digitally native urban consumers,” Yadav notes.

Many users are more comfortable sending voice notes or making calls than typing. They often use entry-level smartphones on inconsistent mobile networks, share devices with others, and move fluidly between languages. Building for this demographic required Meesho to rethink standard assumptions around AI deployment.

One of the biggest engineering trade-offs involved balancing intelligence with accessibility.

While larger AI models often produce richer reasoning capabilities, they also increase latency, infrastructure costs, and computational demands. In Bharat’s operating environment, Yadav says, that approach becomes impractical.

“The experience has to remain fast, responsive, and reliable even in challenging conditions,” he adds.

This led Meesho to carefully partition workloads between on-device processing and cloud-based inference. Lightweight speech understanding tasks are processed locally to minimise network dependency and preserve responsiveness, especially under low-bandwidth conditions. More intensive workloads — such as multi-step reasoning, ranking, recommendation generation, and real-time marketplace intelligence — are handled in the cloud.
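A placement rule of this kind might look like the sketch below. The task names, the 150 ms budget, and the fallback to the cloud when a task is too heavy for the device are all assumptions made for illustration; the article does not describe how Meesho actually decides where a workload runs.

```python
from dataclasses import dataclass

# Hypothetical set of lightweight speech tasks that are allowed to run on-device.
EDGE_TASKS = {"voice_activity_detection", "speech_to_text_lite"}


@dataclass(frozen=True)
class Task:
    name: str
    est_compute_ms: int  # assumed cost estimate on an entry-level phone


def place(task: Task, edge_budget_ms: int = 150) -> str:
    """Return "edge" for cheap, edge-eligible tasks and "cloud" for everything else."""
    if task.name in EDGE_TASKS and task.est_compute_ms <= edge_budget_ms:
        return "edge"
    # Reasoning, ranking, recommendations and live marketplace lookups land here,
    # as does any speech task that exceeds the on-device budget.
    return "cloud"
```

For example, `place(Task("speech_to_text_lite", 80))` returns `"edge"`, while `place(Task("ranking", 40))` returns `"cloud"` because ranking is not in the edge-eligible set regardless of its cost.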

Another major challenge was maintaining conversational flexibility without compromising transactional precision.

Commerce systems cannot afford ambiguity around inventory, pricing, payment confirmation, delivery timelines, or seller reliability. Vaani therefore combines natural language interaction with real-time marketplace intelligence so that recommendations remain grounded in live commerce conditions.

The linguistic complexity of India presented an equally significant challenge.

India is not merely multilingual; it is conversationally fluid. Users frequently switch between languages, dialects, colloquial expressions, and regional terminology within the same interaction. Rather than forcing users to conform to structured language patterns, Meesho designed Vaani to adapt to natural speech behaviour.

“This required continuously fine-tuning models for regional nuances, speech patterns, accents, and contextual intent across diverse linguistic environments,” Yadav explains.

Edge AI, cloud AI, and the economics of scale

For Meesho, the balance between edge computing and cloud infrastructure is ultimately governed by three variables: latency, accuracy, and economics.

Voice interactions are highly sensitive to delays. Even small pauses can break conversational flow and make AI interactions feel unnatural. To address this, Meesho pushes lightweight speech-processing workloads closer to the user’s device.

“Running parts of speech understanding at the edge helps reduce network dependency and improves responsiveness,” Yadav says.

Cloud systems, meanwhile, handle the heavier computational tasks that require broader marketplace context. These include deep language understanding, recommendation generation, ranking decisions, and real-time access to dynamic marketplace signals such as inventory, pricing, seller quality, demand patterns, and fulfilment performance.

At Meesho’s scale, infrastructure efficiency becomes inseparable from product experience.

“Every millisecond saved and every inference optimised matters when operating at our scale,” Yadav asserts. 

Today, Meesho’s AI ecosystem processes trillions of inferences and learns from billions of interactions across hundreds of millions of users. Supporting that level of operational scale economically required major investments in infrastructure optimisation through BharatMLStack — Meesho’s in-house machine learning stack built on open-source foundations.

According to Yadav, BharatMLStack enables high-throughput, low-latency model serving at significantly lower inference costs than traditional architectures, making sophisticated generative AI workloads economically viable for Indian commerce.

The company introduced several structural changes across its infrastructure stack to support conversational AI workloads efficiently.

On the data layer, Meesho adopted a workload-specific approach that combines licensed platforms and open-source compute systems depending on efficiency requirements. It also built an internal caching layer capable of serving the majority of analytical queries without repeatedly scanning raw datasets. Event analytics pipelines were brought fully in-house, while high-volume jobs were shifted onto next-generation single-node processing engines that outperform traditional distributed systems for Meesho’s workload patterns.

On the training side, Meesho diversified GPU sourcing beyond a single hyperscaler, improving cost efficiency while maintaining performance parity. Automated configuration recommendations were also introduced to ensure compute resources are right-sized dynamically without manual intervention.

Inference optimisation received particular focus because of the real-time demands of conversational AI. Meesho eliminated network hops in the serving path, rewrote model-serving proxies in more performant languages, introduced custom tiered caching systems, multiplexed low-throughput models across shared GPUs, and selectively pushed certain inference workloads directly onto user devices.
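Of these techniques, tiered caching is the easiest to show in a few lines. The sketch below is a generic two-tier cache, not Meesho's system: a small least-recently-used tier sits in front of a larger tier, and the model is called only when both miss. The class name `TieredCache`, the tier sizes, and the `compute` callback are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Hashable


class TieredCache:
    """Small LRU tier (L1) in front of a larger tier (L2), then the model itself."""

    def __init__(self, l1_capacity: int, compute: Callable[[Hashable], Any]):
        self.l1: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.l2: Dict[Hashable, Any] = {}  # stands in for a shared or remote cache
        self.l1_capacity = l1_capacity
        self.compute = compute  # stands in for an expensive model inference
        self.stats = {"l1": 0, "l2": 0, "miss": 0}

    def get(self, key: Hashable) -> Any:
        if key in self.l1:
            self.l1.move_to_end(key)  # mark as most recently used
            self.stats["l1"] += 1
            return self.l1[key]
        if key in self.l2:
            self.stats["l2"] += 1
            value = self.l2[key]
        else:
            self.stats["miss"] += 1
            value = self.compute(key)
            self.l2[key] = value
        self._promote(key, value)
        return value

    def _promote(self, key: Hashable, value: Any) -> None:
        """Copy a value into L1, evicting the least recently used entry if full."""
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)
```

The `stats` counters make the trade-off visible: every L1 or L2 hit is an inference that did not have to run.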

Episodic memory: Moving from transactions to persistent commerce relationships

One of the more interesting capabilities within Vaani is episodic memory — the system’s ability to retain contextual understanding across shopping sessions.

Yadav argues that commerce interactions are rarely isolated events. Consumers often return after several days to continue exploration, refine preferences, or revisit previous comparisons.

“A trusted shopkeeper remembers what you bought, understands your preferences, knows your budget, and uses that context to make future interactions more relevant,” he says. “We believe AI agents should eventually be capable of the same.”

Rather than storing raw conversations, Vaani structures interactions into meaningful “episodes” containing commerce signals such as affordability thresholds, delivery requirements, category interests, comparison behaviour, and evolving preferences.

Over time, these episodes are linked together to create a richer contextual understanding of each user’s shopping behaviour.
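One way to picture this is as a small structured record per session rather than a transcript. The sketch below is illustrative only: the field names (`max_budget_inr`, `delivery_days`, `categories`) and the summary logic are assumptions, not Vaani's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Episode:
    """Commerce signals distilled from one session (hypothetical fields)."""
    session_id: str
    categories: List[str]
    max_budget_inr: Optional[int] = None
    delivery_days: Optional[int] = None


@dataclass
class ShopperMemory:
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def summary(self) -> Dict[str, object]:
        """Link episodes into a compact view: latest stated budget, ranked interests."""
        budget = next(
            (e.max_budget_inr for e in reversed(self.episodes)
             if e.max_budget_inr is not None),
            None,
        )
        counts: Dict[str, int] = {}
        for episode in self.episodes:
            for category in episode.categories:
                counts[category] = counts.get(category, 0) + 1
        # Most frequent categories first; ties broken alphabetically.
        ranked = sorted(counts, key=lambda c: (-counts[c], c))
        return {"latest_budget_inr": budget, "top_categories": ranked}
```

Because only distilled signals are stored, the raw conversation never needs to be kept to answer a question such as "what budget did this shopper last mention?".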

The challenge, however, lies in balancing continuity with relevance.

Memory that is too shallow results in repetitive experiences. Memory that is too broad risks becoming intrusive or noisy. Meesho’s objective, according to Yadav, is to retain only the context that meaningfully improves future shopping interactions and reduces user friction.

Privacy therefore becomes central to how episodic memory evolves.

“Memory must be purposeful, transparent, and aligned with delivering a better shopping experience,” he emphasises.

Yadav believes this shift represents a larger evolution in commerce itself — from isolated transactions toward persistent, context-aware consumer relationships.

Reimagining discovery for conversational commerce

Voice-first commerce fundamentally alters how consumer intent is expressed.

Traditional e-commerce systems assume users know what they want and can describe it using explicit keywords. But conversational shopping rarely works that way, particularly in India’s emerging internet markets.

“Users browse, explore, compare, seek validation, and often discover what they want during the journey itself,” Yadav observes.

Instead of relying purely on search keywords, Vaani focuses on understanding contextual and behavioural intent — including affordability preferences, use cases, urgency, lifestyle cues, hesitation patterns, and conversational follow-up signals.

Importantly, Meesho did not build a separate recommendation system exclusively for Vaani. The platform leverages PRISM (Personalised Ranking and Intent Signal Module), the same intelligence ecosystem already powering product discovery across the marketplace.

Today, PRISM powers more than 75% of orders on Meesho. The system orchestrates billions of data points across more than 100 ranking models, some containing up to 300 million parameters. These models are trained on over 400 trillion input signals while executing more than 6 trillion inferences daily.

What changes in conversational commerce is the richness of the incoming signal set.

Rather than relying primarily on clicks and search history, Vaani integrates conversational indicators such as hesitation, occasion, regional preferences, follow-up questioning behaviour, and contextual shopping cues.

The system also incorporates marketplace intelligence such as regional demand trends, seller quality, fulfilment reliability, inventory availability, and pricing dynamics.
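As a rough illustration of how conversational and marketplace signals could be blended, the sketch below re-ranks candidates with a weighted sum and filters out items that are not in stock. The signal names, the weights, and the linear formula are assumptions for the example; the article does not describe PRISM's actual models or scoring.

```python
from typing import Dict, List, Set

# Hypothetical weights; a production system would learn these rather than fix them.
WEIGHTS: Dict[str, float] = {
    "base_relevance": 0.5,
    "occasion_match": 0.2,
    "regional_demand": 0.15,
    "seller_quality": 0.15,
}


def score(signals: Dict[str, float]) -> float:
    """Weighted sum over known signals; missing signals count as zero."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rerank(candidates: Dict[str, Dict[str, float]], in_stock: Set[str]) -> List[str]:
    """Drop unavailable products, then order the rest by descending score."""
    available = {pid: s for pid, s in candidates.items() if pid in in_stock}
    return sorted(available, key=lambda pid: (-score(available[pid]), pid))
```

In this toy setup a product with moderate base relevance can outrank a more relevant one if it matches the occasion the shopper mentioned and comes from a higher-quality seller, and an out-of-stock product never appears at all.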

Capabilities like TrendPulse within PRISM help identify emerging cultural and regional demand patterns, allowing Vaani not only to respond to consumer intent but increasingly anticipate it.

The outcome, Yadav believes, is a discovery experience that feels less like navigating a digital catalogue and more like interacting with a knowledgeable shopping assistant.

Solving India’s language complexity problem

India’s multilingual and code-mixed conversational patterns represent one of the world’s most difficult AI language challenges.

A single shopping interaction may include Hindi, English, regional dialects, local product names, colloquial expressions, and culturally specific references — often within the same sentence.

“Building for India requires much more than multilingual translation,” Yadav says.

The primary challenge is semantic understanding rather than pure transcription accuracy. In commerce, the same product may be described differently across geographies and communities, often using local terminology instead of catalogue-standard names.

Vaani therefore prioritises understanding commercial intent rather than merely recognising speech accurately.

Code-mixing presents another layer of complexity. Users frequently switch languages dynamically based on context — discussing products in Hindi, budgets in English, and category references in regional dialects.

To address this, Meesho combines speech understanding systems, language models, and commerce-specific intent models trained on real-world shopping interactions. The system continuously applies contextual reasoning to infer meaning from surrounding conversational context, behavioural patterns, category signals, and interaction history.
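The simplest piece of this problem, mapping local product words to catalogue terms inside a code-mixed sentence, can be sketched as a lookup. The three-entry lexicon below is a hand-written illustration, not Meesho data, and a dictionary lookup is a deliberate simplification of the learned intent models the article describes.

```python
import re
from typing import Dict, List

# Illustrative romanised Hindi terms mapped to hypothetical catalogue categories.
ALIASES: Dict[str, str] = {
    "chappal": "slippers",
    "jhumka": "earrings",
    "joote": "shoes",
}


def normalise(utterance: str) -> List[str]:
    """Lower-case, tokenise, and swap known local terms for catalogue terms."""
    tokens = re.findall(r"\w+", utterance.lower())
    return [ALIASES.get(token, token) for token in tokens]


def catalogue_terms(utterance: str) -> List[str]:
    """Return only the catalogue terms recognised in the utterance, in order."""
    tokens = re.findall(r"\w+", utterance.lower())
    return [ALIASES[token] for token in tokens if token in ALIASES]
```

Given a code-mixed request such as "mujhe red chappal chahiye under 300", `catalogue_terms` picks out `slippers` while leaving the colour and budget words for other components to interpret.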

The importance of localisation becomes even more pronounced considering that 88% of Meesho’s annual transacting users now come from outside India’s top eight cities.

“Our philosophy has always been that technology should adapt to how people naturally communicate, not the other way around,” Yadav says.

The future: When interfaces disappear

Looking ahead, Yadav believes the next major transformation in commerce will not come from a single interface innovation, but from the gradual disappearance of interfaces altogether.

“For years, digital commerce has required consumers to adapt to technology,” he says. “The next wave of AI will reverse that equation.”

Instead of forcing users to search, navigate categories, apply filters, or manually compare products, future AI-native commerce systems will increasingly understand intent across voice, visual inputs, behavioural patterns, affordability preferences, historical interactions, and contextual signals simultaneously.

The boundaries between voice, text, and visual commerce, he argues, will gradually dissolve into unified AI systems capable of adapting dynamically to individual user preferences. “Commerce will become less about navigating a marketplace and more about expressing a need,” he adds.

This shift carries particular significance for India because the next several hundred million internet users are unlikely to be search-first consumers.

“They will be voice-first, vernacular-first, and mobile-first,” he notes.

For Meesho, the larger opportunity is not merely improving e-commerce for digitally sophisticated users, but making digital commerce accessible to entirely new consumer cohorts previously excluded by language barriers, literacy constraints, or interface complexity.

“In that sense,” Yadav concludes, “the future of AI in commerce is not automation for its own sake. It is accessibility at population scale.”
