Resilience is a leadership decision, not a cloud feature

By Vinay Chhabra, Co-Founder and MD, Acecloud

India’s digital infrastructure now runs almost entirely on cloud. E-commerce platforms process millions of transactions, startups train AI models on hyperscale compute, and enterprises depend on external identity and control layers.

This model has delivered speed and scale. It has also created structural interdependence. Reliability cannot be outsourced to scale. When critical workloads sit on shared network and control layers, exposure extends beyond a single enterprise.

Systems may appear independent, but many rely on the same backbone. When that backbone fails, independence proves illusory. Decades of infrastructure evolution have shown that concentration amplifies consequences. We have normalised architectural concentration in the name of efficiency. That trade-off now needs to be re-examined.

A June 2025 Cisco ThousandEyes report recorded 1,843 network outage incidents in one month, reflecting instability in the routing and connectivity layers that underpin the internet. These are not isolated IT disruptions. They are indicators of systemic exposure and therefore a board-level concern.

In hyperscaler environments, the dominant failure mode is internal complexity: flawed updates, dependency conflicts, and control plane disruptions. When consumer platforms, enterprise systems, and public services slow together, the issue is no longer uptime. It is architectural concentration risk. Without isolating failure domains, scale amplifies disruption rather than containing it.

The rising tide of outages
Cloud outages are no longer operational inconveniences. They carry direct financial consequences. Industry research shows that incidents costing more than one million dollars account for a growing share of major disruptions, even before regulatory penalties, customer churn, or reputational damage are considered. What was once downtime is now a balance sheet event.

Regulatory and governance exposure further compounds the impact. Outages do not create new vulnerabilities. They expose existing ones. Research published in 2025 indicates that 54 per cent of organisations have secrets embedded directly in workloads, and 9 per cent of cloud storage contains sensitive data due to misconfiguration. During disruptions, restoration takes precedence over discipline. Controls are bypassed. Credentials are reused. Oversight weakens. Availability incidents increasingly become entry points to data exposure and compliance failures.

The root causes are architectural, not accidental. Identity, networking, storage, orchestration, analytics, and AI services are tightly coupled through shared control layers. When a foundational service degrades, the blast radius expands quickly. Automated deployments, essential at scale, can propagate configuration errors across regions before detection. In several documented incidents, recovery was delayed because rollback mechanisms depended on the same degraded control systems.
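
To make the idea of blast radius concrete, here is a minimal sketch in Python. It assumes a hand-written dependency map; the service names and the `blast_radius` function are hypothetical illustrations, not taken from any provider's tooling. Given one failed service, it lists everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
# The names are illustrative assumptions, not a real provider's architecture.
DEPENDS_ON = {
    "identity": [],
    "networking": [],
    "storage": ["identity", "networking"],
    "orchestration": ["identity", "networking"],
    "analytics": ["storage", "orchestration"],
    "ai_services": ["storage", "orchestration"],
    "deploy_rollback": ["orchestration"],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("identity", DEPENDS_ON)))
```

In this illustrative map, a degraded identity layer reaches the rollback path as well, which is the pattern behind the delayed recoveries described above.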

Centralisation creates predictable failure patterns. Experience shows that resilience is never accidental. It must be engineered deliberately.

Diversification as a leadership decision
Diversification is a governance choice, not a technical preference. As critical workloads consolidate on fewer platforms, resilience becomes a board-level decision about concentration risk. Leaders must weigh cost against control, simplicity against survivability, and speed against substitutability.

Consolidation delivers efficiency and streamlines governance. Diversification introduces complexity and incremental cost. The question is not which model is superior in theory. The question is which risk profile aligns with the organisation’s tolerance, regulatory obligations, and long-term strategy.

In architectural terms, diversification means reducing single points of failure across independent domains. It does not automatically require multiple clouds. It requires containment. Can a failure in one region, control plane, or identity layer cascade across the enterprise? Can workloads fail over without relying on the same degraded systems?
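
The second question can be turned into a simple check during dependency mapping. The sketch below is a hedged illustration in Python: the workload names, the `FAILOVER_DEPS` map, and both helper functions are assumptions made up for this example. It flags every component that a primary workload and its standby both rely on.

```python
# Hypothetical dependency map for a primary workload and its standby.
# All names are illustrative assumptions.
FAILOVER_DEPS = {
    "payments_primary": ["region_a", "identity_layer"],
    "payments_standby": ["region_b", "identity_layer"],
    "region_a": ["control_plane"],
    "region_b": ["control_plane"],
    "identity_layer": [],
    "control_plane": [],
}


def transitive_deps(service, depends_on):
    """Collect everything `service` relies on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def shared_failure_domains(primary, standby, depends_on):
    """Return components whose failure would hit both primary and standby."""
    return transitive_deps(primary, depends_on) & transitive_deps(standby, depends_on)


shared = shared_failure_domains("payments_primary", "payments_standby", FAILOVER_DEPS)
print(sorted(shared))
```

An empty result would suggest the standby is contained. Here the two regions are separate, yet both paths still converge on the same identity layer and control plane, so the failover is less independent than it appears.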

Resilient design demands deliberate choices: distributing workloads across availability zones, validating region-level recovery, maintaining independent recovery environments, and enabling degraded modes or alternate platforms where downtime is unacceptable.

For some organisations, strengthening resilience within a primary cloud through rigorous recovery testing and dependency mapping may be more effective than adding providers. For others, substitutability is essential. The leadership mandate is to decide where efficiency is acceptable and where independence is non-negotiable.

Diversification reinforces operational discipline. It reduces lock-in, strengthens negotiating leverage, and aligns infrastructure strategy with regulatory and data sovereignty expectations. It forces organisations to map dependencies, maintain verified runbooks, test failovers, and confirm that backups can be restored. Diversification is not about provider count. It is about ensuring that one failure does not become every failure.

The real barrier is mindset
The greatest obstacle to resilience is not technology. It is comfort with concentration. Years of optimisation within a single ecosystem make dependence feel efficient until it fails. Organisations begin to assume there are no viable alternatives. That assumption discourages experimentation and delays corrective action.

The challenge is also structural. Incentives reward cost compression and rapid deployment, not survivability under stress. Teams are measured on optimisation metrics, not on how systems behave during disruption. Changing this requires leadership that looks beyond quarterly optics, treats resilience as strategic capital, and accepts measured complexity in exchange for long-term control.

Designing for failure is not pessimism. It is disciplined governance in a digital economy where disruption is inevitable.

Public cloud will remain central to India’s digital growth. Scale is essential. But scale alone does not guarantee continuity. As systems become more interconnected, outages will occur. The question is no longer whether failure will happen. It is whether leadership has already defined what may fail, what must never fail, and how containment will be enforced.

Resilience is not a cloud feature. It is a leadership mandate.
