As companies speed up their use of AI and move toward more complex digital environments, reliable systems have become essential for business. In this interview, Rob Newell, Senior Vice President and General Manager for Asia Pacific and Japan at New Relic, explains why outages are happening more often despite better technology. He discusses how incidents like the Cloudflare outage show the widespread impact of downtime and why observability supported by AI is shifting from reactive monitoring to autonomous action. He also highlights how Indian organisations can gear up for a time when visibility, speed, and trust are key competitive advantages.
Why are outages increasing despite stronger tech stacks?
Businesses today are quickly moving into a new phase of AI adoption by deploying Agentic AI systems to boost operational efficiency. Unlike traditional LLM-based systems, Agentic AI works across highly interconnected workflows, which makes these systems harder to observe and fix. This increased interconnectedness often turns system behavior into a black box, making it very hard to find the source of an issue. The problem grows when monitoring is inconsistent.
More than 30 percent of Indian businesses still depend on several disconnected monitoring tools, which slows down how quickly issues can be found and fixed. Instead of improving visibility, this disconnection causes blind spots that let minor problems become big outages. Indian businesses are starting to see these issues in their monitoring systems. About 73 percent have already implemented AI-driven monitoring to gain clearer insights into complex, agentic environments and speed up issue resolution.
AI-supported observability tools help teams not only understand system performance but also uncover the reasons behind issues. By linking signals across interconnected parts, these tools provide actionable insights and usually resolve problems automatically, reducing Mean Time to Resolution (MTTR) and cutting the risk of outages.
How does the Cloudflare incident show the cascading impact of downtime?
When a provider like Cloudflare experiences an outage, the effects are immediate and widespread. Millions of customers, their systems, and end users feel the disruption at the same time. In this case, a routine configuration change in Cloudflare’s database caused a crucial internal file to exceed expected limits. This issue damaged parts of its bot management system and affected core traffic routing, resulting in a global outage. Cloudflare took nearly six hours to resolve the problem and return to normal operations.
During that time, global businesses that rely on its infrastructure faced service disruptions, which impacted their end users directly. The incident highlighted how failures in shared infrastructure can ripple through ecosystems, even when companies downstream do not have any failures of their own. This is where intelligent, AI-supported observability becomes vital. Traditional monitoring typically focuses on internal systems and sends alerts only after failures happen.
In contrast, AI-driven observability can trace service dependencies from start to finish, connect signals across third-party platforms, and spot early signs of unusual behavior. By examining traffic patterns, error rates, and configuration changes in real time, observability helps teams identify emerging issues sooner, understand the potential impact quickly, and respond before full disruptions occur. While observability cannot prevent every third-party outage, it can greatly reduce uncertainty and response time, allowing solutions to be introduced sooner and helping rebuild customer trust.
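As a rough illustration of what spotting early signs of unusual behavior can mean in practice, the sketch below flags error-rate samples that jump well above a trailing baseline. It is a minimal example under stated assumptions, not how any particular observability product works: the function name `find_anomalies`, the window size, the threshold, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(error_rates, window=5, threshold=3.0):
    """Return indices of samples that sit more than `threshold` standard
    deviations above the mean of the preceding `window` samples.

    Hypothetical helper for illustration only.
    """
    anomalies = []
    for i in range(window, len(error_rates)):
        baseline = error_rates[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any increase at all counts as unusual.
            if error_rates[i] > mu:
                anomalies.append(i)
            continue
        if (error_rates[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up error rates (fraction of failed requests per minute).
rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.050, 0.012]
print(find_anomalies(rates))
```

A production system would combine many such signals (traffic, errors, configuration changes) and use more robust statistics; the point here is only that a deviation from a recent baseline can be detected before a hard failure triggers a conventional alert.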
What are the rising financial and reputational risks of downtime?
As more Indian companies adopt AI-supported observability, they are starting to achieve system reliability that traditional monitoring setups can’t match. This is creating a growing gap between organisations that can spot and resolve issues early and those that respond reactively. Customers still expect consistent reliability from every service provider, no matter what technology is behind it. In this setting, reliability becomes a key differentiator.
Organisations that do not keep up risk losing customer trust and falling behind their peers. As expectations rise, the costs of downtime accumulate over time through decreased loyalty and missed opportunities. Today, disruptions can cost organisations between $1 million and $3 million per hour, making downtime an urgent business risk. At the same time, businesses investing in AI-supported observability are already seeing tangible returns, with many reporting 2x to 5x ROI. Organisations that focus on system reliability and visibility are better positioned to earn long-term customer trust, while those that do not risk lagging in both operations and competitiveness.
How is observability shifting from monitoring to autonomous action?
Originally, observability aimed to help teams find the root causes of issues and resolve them quickly. Now, observability not only directs IT teams to the source of a problem but can also take autonomous action to address it, even when issues arise in complex Agentic AI workflows.
AI-supported observability platforms use features like distributed tracing to follow interactions between AI agents and the tools they use. This makes it easier to identify communication gaps, unexpected behavior, or performance issues across interconnected components. With AI-driven analysis and predefined guidelines, observability systems can initiate self-guided actions to manage or fix problems, while keeping human oversight when necessary.
This change reduces both Mean Time to Detection (MTTD) and MTTR, limits disruptions, and lessens the amount of routine firefighting in daily operations, allowing developers to concentrate on innovation and growth.
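For readers less familiar with the two metrics, the following sketch shows one common way to compute them from incident timestamps. It assumes both are measured from the moment the incident started (some teams measure MTTR from detection instead); the `Incident` class and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_duration(durations):
    """Average a sequence of timedelta values."""
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    """Mean Time to Detection: average of (detected - started)."""
    return mean_duration(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean Time to Resolution: average of (resolved - started).

    Assumption: measured from incident start, not from detection.
    """
    return mean_duration(i.resolved - i.started for i in incidents)


# Made-up incidents for illustration.
incidents = [
    Incident(datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 10), datetime(2025, 1, 1, 11, 0)),
    Incident(datetime(2025, 1, 2, 9, 0), datetime(2025, 1, 2, 9, 20), datetime(2025, 1, 2, 9, 40)),
]
print(mttd(incidents), mttr(incidents))
```

Under this definition, faster detection lowers both numbers at once, since detection time is part of the time to resolution.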
Where do you see the role of Agentic AI in incident detection and resolution?
Agentic AI eliminates the need for teams to manually sift through countless logs to find faulty prompts or responses. AI monitoring can now provide deep, actionable visibility across complex systems, going beyond simple automation. In environments that depend on multi-agent collaboration, Agentic AI helps teams detect and resolve issues faster by offering clear insights into every agent and tool involved in an incident.
Teams can see which tools agents call upon, how agents and tools work together, the sequence of interactions, and how each component performs. Distributed tracing combines agent-to-tool communication into a single view, letting teams track issues across workflows without switching between different tools. This simplifies incident investigations and speeds up resolutions, helping prevent minor issues from turning into larger disruptions.
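To make the idea of a single trace view concrete, here is a minimal sketch of how agent and tool calls could be recorded as spans and rendered as one indented timeline. The `Span` fields, the agent and tool names, and the timings are hypothetical and do not represent any vendor's data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int
    error: bool = False


def render_trace(spans):
    """Return one indented line per span, ordered by start time under each parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            status = "ERROR" if span.error else "ok"
            lines.append(
                f"{'  ' * depth}{span.name} +{span.start_ms}ms {span.duration_ms}ms {status}"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def failed_spans(spans):
    """Names of the spans that recorded an error."""
    return [s.name for s in spans if s.error]


# Made-up trace: a planner agent calls a search tool, then a writer agent
# whose database tool call fails.
spans = [
    Span("1", None, "agent:planner", 0, 900),
    Span("2", "1", "tool:search", 50, 300),
    Span("3", "1", "agent:writer", 400, 450),
    Span("4", "3", "tool:database", 420, 400, error=True),
]
print("\n".join(render_trace(spans)))
print(failed_spans(spans))
```

Because every span carries its parent's identifier, the failing tool call can be read in context: which agent invoked it, when, and how much of the overall request time it consumed.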
How will AI agents reduce firefighting and free engineers for strategic work?
When AI-driven applications fail, teams often lack clear visibility into what went wrong, putting significant AI investments at risk. Slow or incorrect responses turn troubleshooting into guesswork, as teams struggle to understand agent interactions, find delays, or identify the responsible agent or tool. This lack of clarity slows down root-cause analysis, extends downtime, diverts engineering efforts from innovation, and can ultimately lead to lost revenue and customer trust.
Observability addresses this challenge by providing complete visibility into AI application behavior. It links AI performance with the larger technology stack and automatically maps every agent and tool interaction. Engineers can pinpoint bottlenecks, failed calls, and performance issues across workflows without manually piecing together data. Layered views let teams track each step in an AI workflow, including agent calls, latency, and errors.
Waterfall-style visualisations show patterns of communication between agents, while a central inventory provides a clear view of all AI agents and services in use. This simplifies governance and helps engineers concentrate only on the systems they manage.
What must organisations do to prepare for autonomous observability?
Fragmented visibility is still one of the biggest obstacles to system stability for Indian businesses. Most use around four different monitoring tools, leading to isolated insights that are hard to interpret. Companies need integrated, AI-supported observability platforms that provide complete visibility, fit into existing workflows, and present insights directly where work happens.
The complexity of technology stacks adds to the challenge. About 44 percent of organisations list system complexity as their top concern, with applications operating across many layers and dependencies. In such environments, one failure can quickly impact an entire agentic workflow. The financial impact emphasises the need for change.
Around 45 percent of organisations report losses of $1 million to $3 million per hour during outages. As the costs of outages rise, companies need tools that offer both clarity and ROI. In today’s fast-paced technology landscape, intelligent, AI-supported observability that keeps up with new systems is becoming the best way to manage complexity, reduce disruptions, and protect business value.