Express Computer

Inside Meesho’s Kafka rethink: How an internal ‘mQ’ layer is powering AI-scale commerce


What does it take to build data infrastructure for Bharat—not just at scale, but under constraint?

At a recent Confluent event in Mumbai, Meesho’s Senior Architect Shubham Sharma pulled back the curtain on a reality rarely discussed in glossy architecture diagrams: scale is only half the problem. The other half is unpredictability—of networks, devices, user behavior, and economics.

For Meesho, where a majority of users come from tier 2 and tier 3 India, engineering is shaped as much by constraints as by ambition. Low-end devices, patchy connectivity, and price-sensitive users mean that performance isn’t a luxury metric—it’s a business imperative.

“Even an extra byte or a few milliseconds can impact conversion,” Shubham said.

Building for Bharat, not just for scale

Meesho’s platform today operates at a staggering magnitude: processing petabytes of data and serving trillions of inferences daily, while handling nearly 100 million Kafka messages per second. Behind every search, recommendation, and transaction lies a constantly moving stream of data.

But the platform’s uniqueness lies not just in its scale—it lies in who it serves.

Voice-based searches account for a significant portion of discovery. Image-based browsing is common. The app is optimised to run on devices with limited storage, and even its size is aggressively minimised. Content is tuned to load faster over unstable networks, and the experience is available in multiple regional languages.

This is infrastructure designed not for ideal conditions, but for real-world India.

In such an environment, streaming data systems are not just backend plumbing—they are the nervous system of the business.

The Kafka trilemma—and the cost of fragmentation

At the heart of Meesho’s architecture sits Apache Kafka. But as Shubham explained, the real challenge wasn’t running Kafka clusters—it was managing how hundreds of services used them.

Every team had different priorities. Some optimised for throughput, pushing larger batches of data. Others needed ultra-low latency, sacrificing efficiency for speed. Still others demanded high durability, increasing replication and system load.

These choices weren’t independent—they were deeply interconnected. Improving one often came at the cost of another.

Compression decisions added another layer of complexity. Algorithms like gzip, snappy, lz4, and zstd each came with trade-offs across CPU usage, latency, and cost. What looked like a technical configuration decision quickly became a financial one.

“This is FinOps for Kafka,” Shubham noted.
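The trade-offs Shubham describes map directly onto standard Apache Kafka producer settings. As a minimal sketch, the three tuning postures might look like the profiles below; the config keys are real Kafka producer options, but the specific values are illustrative and not Meesho's actual tuning.

```python
# Three hypothetical producer profiles illustrating the Kafka
# throughput / latency / durability trade-off. Config keys are
# standard Apache Kafka producer settings; values are illustrative.

THROUGHPUT = {
    "batch.size": 262144,        # large batches amortise per-request cost
    "linger.ms": 50,             # wait to fill batches, adding latency
    "compression.type": "zstd",  # best ratio, but more CPU per batch
    "acks": "1",                 # leader-only ack keeps the pipeline full
}

LATENCY = {
    "batch.size": 16384,
    "linger.ms": 0,              # send as soon as a record arrives
    "compression.type": "none",  # skip compression CPU on the hot path
    "acks": "1",
}

DURABILITY = {
    "batch.size": 16384,
    "linger.ms": 5,
    "compression.type": "lz4",    # cheap CPU, decent ratio
    "acks": "all",                # wait for all in-sync replicas
    "enable.idempotence": "true", # avoid duplicates on retry
}

def describe(profile: dict) -> str:
    """One-line summary of what a profile optimises for."""
    if profile["acks"] == "all":
        return "durability: replicated before ack"
    if profile["linger.ms"] == 0:
        return "latency: no batching delay"
    return "throughput: large, compressed batches"
```

Note how each profile improves one axis by giving ground on another: raising `linger.ms` and `batch.size` buys throughput at the cost of latency, while `acks=all` buys durability at the cost of both.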

Over time, this led to fragmentation. Teams tuned systems independently, often reactively. Optimisation happened after failures, not before them. Over-provisioning became the default safety net.

The breaking point came during a flash-sale event. As traffic surged, failures cascaded across systems—feeds, product pages, carts—each struggling under its own configuration constraints.

Different teams were solving the same problem, in isolation.

From complexity to abstraction: The rise of mQ

The solution was not more tuning—it was less.

Meesho built an internal abstraction layer called mQ (Meesho Queue), fundamentally rethinking how developers interact with Kafka. Instead of exposing engineers to the complexity of topics, partitions, clusters, and configurations, mQ reduces everything to a simple interface.

Developers interact with an “mQ ID.” The system takes care of the rest.

Behind the scenes, a centralised control plane manages infrastructure decisions dynamically—choosing the right configurations, optimising performance, and adapting to workload changes in real time.
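Meesho has not published mQ's interface, but the shape of the abstraction described above can be sketched as follows. All names here (`MqControlPlane`, `MqClient`, `publish`) are hypothetical stand-ins: the point is only that the caller supplies an mQ ID and a payload, and routing plus configuration happen behind the facade.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an mQ-style facade. Class and method names
# are invented for illustration; Meesho's actual interface is not public.

@dataclass
class RouteInfo:
    cluster: str   # which Kafka cluster serves this stream
    topic: str     # concrete topic name, hidden from the caller
    config: dict   # centrally managed producer configuration

@dataclass
class MqControlPlane:
    """Central registry mapping an mQ ID to cluster, topic, and config."""
    _routes: dict = field(default_factory=dict)

    def register(self, mq_id: str, route: RouteInfo) -> None:
        self._routes[mq_id] = route

    def resolve(self, mq_id: str) -> RouteInfo:
        # In a real system this lookup could change over time,
        # enabling migrations without any caller-side code change.
        return self._routes[mq_id]

class MqClient:
    """What a product team sees: just an mQ ID and a payload."""
    def __init__(self, control_plane: MqControlPlane):
        self._cp = control_plane
        self.sent = []  # stand-in for a real Kafka producer

    def publish(self, mq_id: str, payload: bytes) -> str:
        route = self._cp.resolve(mq_id)  # control plane picks everything
        self.sent.append((route.cluster, route.topic, payload))
        return route.topic

# Usage: the caller never mentions a topic, partition, or cluster.
cp = MqControlPlane()
cp.register("orders", RouteInfo("cluster-a", "orders.v1", {"acks": "all"}))
client = MqClient(cp)
topic_used = client.publish("orders", b"{}")
```

Because the ID-to-route mapping lives in the control plane, repointing "orders" at a new cluster or re-tuned topic is a registry update, not a code change in every calling service, which is what makes zero-code migrations plausible.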

This shift does two critical things. First, it removes the cognitive burden from developers, allowing them to focus on product innovation rather than infrastructure tuning. Second, it standardises optimisation, eliminating the inconsistencies that arise from fragmented decision-making.

In effect, Meesho moved from a model of manual tuning to one of continuous, automated optimisation.

Engineering meets economics

What sets mQ apart is its tight integration with cost awareness.

Infrastructure decisions are no longer made in isolation from business outcomes. Instead, configuration choices—batch sizes, compression, replication—are directly linked to cloud costs.

The system continuously evaluates performance-per-dollar, ensuring that efficiency is not just technical, but economic.
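A continuous performance-per-dollar evaluation could be as simple as the toy loop below. The candidate numbers and the metric (messages per second divided by hourly cost) are invented for illustration; the article does not describe mQ's actual cost model.

```python
# Toy "performance per dollar" evaluator. Candidate names, throughput
# figures, and costs are all hypothetical.

candidates = {
    "zstd-big-batch":   {"msgs_per_sec": 90_000, "cost_per_hour": 4.0},
    "lz4-mid-batch":    {"msgs_per_sec": 70_000, "cost_per_hour": 2.5},
    "none-low-latency": {"msgs_per_sec": 50_000, "cost_per_hour": 3.0},
}

def perf_per_dollar(stats: dict) -> float:
    """Messages per second delivered per dollar of hourly spend."""
    return stats["msgs_per_sec"] / stats["cost_per_hour"]

# Pick the configuration with the best economic efficiency,
# not necessarily the one with the highest raw throughput.
best = max(candidates, key=lambda name: perf_per_dollar(candidates[name]))
```

In this toy data the highest-throughput option is not the winner, which is the essence of the FinOps framing: the system optimises the ratio, not either axis alone.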

This is where mQ evolves from an engineering abstraction into a FinOps engine.

It also enables capabilities that are difficult to achieve in traditional setups: zero-code migrations, seamless failovers, and automated benchmarking. Systems can adapt without requiring manual intervention, reducing both downtime and operational overhead.

Towards autonomous data infrastructure

Beyond immediate gains in efficiency and reliability, mQ signals a deeper shift in how infrastructure is conceived.

As AI accelerates developer productivity—potentially by orders of magnitude—the traditional model of manual configuration becomes unsustainable. Systems need to become proactive, self-optimising, and intelligent.

At Meesho, that future is already taking shape.

The goal is a unified platform where best practices are embedded by default, optimisation is continuous, and systems evolve automatically with changing workloads.

“The next frontier is autonomous systems,” Shubham said.

A blueprint for AI-scale commerce

What makes Meesho’s journey compelling is not just the technology, but the context.

This is infrastructure built for a market where users search in multiple languages, rely on voice and images, operate on low-end devices, and transact over unreliable networks. And yet, it performs at global scale.

In solving for these constraints, Meesho has done more than optimise Kafka—it has reimagined how data platforms should function in high-growth, AI-driven environments.

Because at this scale, infrastructure isn’t just about keeping systems running.

It’s about making them think.
