The rise of the intelligent agent: Why human-in-the-loop is the future of AIOps

Autonomous IT operations has become a guard-railed process that combines human oversight with the speed and intelligence of Agentic AI, rather than functioning as an uncontrolled AI system.

In this interview, Srinivasa Raghavan S, Director of Product Management at ManageEngine, explores the transformative power of agentic AI and self-driving observability. He details how a ‘human-in-the-loop’ architecture, supported by orchestration, auditability, and strict policy guardrails, allows organisations to automate repetitive tasks without sacrificing human judgment or control.

How does agentic AI work alongside IT teams rather than independently?
Agentic AI, in our context, is not about replacing the IT team’s judgment; it is about amplifying it. The goal is to reduce the repetitive operational work that IT teams deal with every day by introducing an intelligent agent into the workflow. The agent continuously monitors telemetry across the full stack, correlates anomalies, maps service dependencies, and surfaces a clear, context-rich diagnosis the moment an incident begins. This helps IT operations teams cut through alert noise and focus on insights that are objective and actionable.

The agent can also recommend remediation actions, but always within guardrails defined by the IT operations teams. Every proposed action is visible, every step is logged, and engineers retain the authority to approve, modify, or halt an automated response at any point. This is what we call a human-in-the-loop architecture.

As automation becomes more capable, enterprises are also becoming more deliberate about how these actions are governed. Autonomous IT operations therefore become a guard-railed process that combines human oversight with the speed and intelligence of Agentic AI, rather than functioning as an uncontrolled AI system.

What is the importance of orchestration, auditability, and policy guardrails?
Orchestration, auditability, and policy guardrails form the backbone of responsible automation. Enterprises that overlook any one of these pillars often end up in a worse position than before adopting AI. As agentic systems begin taking operational actions, the challenge quickly shifts from capability to control, especially in high-security environments such as BFSI and government.

Orchestration ensures that multiple agents operating in the same environment do not contradict each other or create new issues while trying to solve existing ones. It ensures automation is not a one-off script acting in isolation, but part of a coordinated sequence of actions that understands dependencies. For instance, when our enterprise observability platform Site24x7 detects a fault, it does not simply trigger a command. It coordinates a downstream workflow through Qntrl, our enterprise orchestration platform. This ensures the right teams are notified and the appropriate runbooks are invoked in the correct order, with rollback mechanisms pre-wired in case a remediation step creates an unexpected side effect.
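
The ordering and rollback behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Site24x7 or Qntrl work internally: the `Step` class, the `run_workflow` function, and the step names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure, for illustration only)."""
    name: str
    run: Callable[[], None]       # performs the step
    rollback: Callable[[], None]  # undoes the step if a later one fails


def run_workflow(steps: List[Step]) -> bool:
    """Run steps in order; on failure, roll back completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.run()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.rollback()
            return False
    return True


log: List[str] = []


def failing_health_check() -> None:
    # Simulates a remediation step that produces an unexpected side effect.
    raise RuntimeError("health check failed")


steps = [
    Step("notify-oncall", lambda: log.append("notify"), lambda: log.append("undo notify")),
    Step("restart-service", lambda: log.append("restart"), lambda: log.append("undo restart")),
    Step("verify-health", failing_health_check, lambda: log.append("undo verify")),
]
ok = run_workflow(steps)
```

Because the third step fails, the two completed steps are undone in reverse order and `run_workflow` returns `False`. The step that failed is not rolled back, since it never completed.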

Auditability addresses a question that inevitably arises after any incident – “why did the system take a particular action?” If that cannot be answered clearly, trust in the system erodes regardless of how well it performed previously.

Policy guardrails give leadership the confidence to adopt automation in the first place. Teams can define approved remediation paths, set boundaries around what actions an agent can attempt, and ensure that sensitive production environments are never modified without explicit human approval. Without guardrails, even well-intentioned automation can introduce risk. With them, enterprises gain speed without sacrificing control.
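
A guardrail check of this kind, combined with an audit trail, might look roughly like the following Python sketch. Everything here is an assumption made for illustration (the `Policy` fields, the `evaluate` function, the action and environment names); it does not represent any ManageEngine API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional, Tuple


class Decision(Enum):
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"
    ALLOW = "allow"


@dataclass(frozen=True)
class Policy:
    approved_actions: FrozenSet[str]        # remediation paths the team has approved
    protected_environments: FrozenSet[str]  # environments that need human sign-off


# Every decision is recorded so "why did the system act?" can be answered later.
audit_log: List[Tuple[str, str, Optional[str], Decision]] = []


def evaluate(policy: Policy, action: str, environment: str,
             approved_by: Optional[str] = None) -> Decision:
    """Decide whether an agent may run an action, and record the decision."""
    if action not in policy.approved_actions:
        decision = Decision.DENY
    elif environment in policy.protected_environments and approved_by is None:
        decision = Decision.NEEDS_APPROVAL
    else:
        decision = Decision.ALLOW
    audit_log.append((action, environment, approved_by, decision))
    return decision


policy = Policy(
    approved_actions=frozenset({"restart_service", "scale_out"}),
    protected_environments=frozenset({"production"}),
)
d1 = evaluate(policy, "drop_database", "staging")
d2 = evaluate(policy, "restart_service", "staging")
d3 = evaluate(policy, "restart_service", "production")
d4 = evaluate(policy, "restart_service", "production", approved_by="alice")
```

In this sketch an unapproved action is always denied, an approved action in a protected environment waits for a named approver, and every evaluation is appended to `audit_log` regardless of the outcome.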

What does “self-driving observability” mean in practice for CIOs and IT leaders?
For CIOs and IT leaders, self-driving observability is not about having more telemetry or dashboards. It is about the platform intelligently optimising itself across the entire operational lifecycle from deployment to remediation.

The real challenge in modern IT environments is not a lack of data, but the overwhelming volume of operational noise. In traditional setups, incident response often begins with an alert, followed by a bridge call where engineers spend significant time simply determining what has actually gone wrong. This triage phase can be extremely expensive in terms of both human time and business impact.

Self-driving observability compresses that process significantly, often reducing the mean time to detect (MTTD) and the time to resolve incidents by 60 to 70%. As the system learns normal behavior across the stack, it becomes far more effective at separating real signals from routine fluctuations. For example, it may detect that a payment gateway is beginning to slow down well before checkout abandonment rates start rising.

Indian enterprises have traditionally relied on reactive monitoring tools. How is Intelligent IT Operations fundamentally shifting this mindset toward predictive and autonomous operations?
Indian enterprises have historically relied on reactive monitoring models, where alerts are triggered only after a failure has already occurred. Operations teams would then sift through logs and dashboards to identify the root cause. That approach worked when environments were relatively stable, but it becomes far less effective in modern distributed architectures where early signals of failure appear long before a user-visible outage.

Across Indian enterprises, the conversation has now shifted from monitoring to operations intelligence. CIOs are no longer asking whether they have visibility; they are asking whether that visibility leads to faster decisions and fewer human escalations. Intelligent IT Operations platforms such as Site24x7 address this by embedding predictive anomaly detection into observability itself. The system learns what normal behavior looks like within an environment and flags deviations early, long before they cascade into outages.

The impact can be significant. Synechron, an early adopter, reduced alert noise by 90% using ManageEngine AIOps capabilities. This is not just an incremental improvement; it represents a transformation in how IT teams operate. When noise is reduced at that scale, engineers can focus on meaningful work instead of triaging alerts. It also helps combat alert fatigue, one of the silent challenges affecting IT team effectiveness. Over time, teams begin to trust the platform’s insights, define automation runbooks for common scenarios, and gradually move from a firefighting approach to one of continuous, intelligent assurance.

What role are AI, machine learning, and automation playing in helping enterprises detect anomalies, reduce noise, and accelerate root cause analysis in complex IT environments?
AI, machine learning, and automation each play a distinct yet interconnected role in what we call the AIOps intelligence stack. Machine learning forms the foundation by establishing dynamic, context-aware baselines. Instead of relying on static thresholds that generate false positives during routine traffic spikes, these adaptive models understand patterns such as seasonality, deployment cycles, and service-specific behavior. This allows platforms like Site24x7 to surface genuine anomalies while suppressing operational noise. In practice, organisations such as Synechron have reduced alert noise by as much as 90% using these capabilities.
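
To make the contrast with static thresholds concrete, here is a minimal Python sketch of a seasonal baseline. It assumes a simple per-hour mean and standard deviation with a z-score cut-off. That technique is swapped in for illustration and is not a description of the models Site24x7 actually uses; the sample values, function names, and the threshold of 3.0 are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def build_baselines(history: Iterable[Tuple[int, float]]) -> Dict[int, Tuple[float, float]]:
    """Learn a (mean, stdev) baseline per hour of day from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples, so hours with fewer are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomaly(baselines: Dict[int, Tuple[float, float]], hour: int,
               value: float, z_threshold: float = 3.0) -> bool:
    """Flag a value that deviates strongly from what is normal for that hour."""
    mu, sigma = baselines[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Requests per minute: quiet at 03:00, busy at 12:00 (made-up sample data).
history = [(3, v) for v in (100, 110, 90, 105, 95)] + \
          [(12, v) for v in (900, 1000, 950, 1050, 1100)]
baselines = build_baselines(history)

STATIC_THRESHOLD = 500
noon_static = 950 > STATIC_THRESHOLD            # static rule fires on routine peak traffic
noon_dynamic = is_anomaly(baselines, 12, 950)   # baseline treats it as normal
night_static = 400 > STATIC_THRESHOLD           # static rule misses the night-time surge
night_dynamic = is_anomaly(baselines, 3, 400)   # baseline flags it
```

The static rule raises a false positive on ordinary midday traffic and stays silent on an unusual 03:00 surge, while the per-hour baseline gets both cases right in this toy data set.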

AI, particularly through our causal intelligence layer, takes those anomaly signals and correlates them across distributed systems in real time. A single incident in a complex environment might trigger hundreds of alerts across different layers. A spike in database latency, elevated application errors, and API gateway timeouts may all stem from the same underlying issue, such as a misconfigured network policy following a deployment. Domain-aware causal AI understands the service dependency graph of the environment and automatically connects these signals, presenting teams with the most probable root cause instead of a fragmented list of symptoms.
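
One very simplified way to picture dependency-aware correlation is shown in the Python sketch below: among all alerting components, keep only those that do not depend, directly or indirectly, on another alerting component. This is a heuristic chosen for illustration and should not be read as the causal intelligence layer itself. The service names and the `depends_on` map are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Set


def reachable(depends_on: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every component that `start` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(depends_on.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def probable_root_causes(depends_on: Dict[str, List[str]], alerting: Set[str]) -> List[str]:
    """Alerting components with no alerting dependency underneath them."""
    return sorted(c for c in alerting if not (reachable(depends_on, c) & alerting))


# Hypothetical service dependency graph: each key depends on the listed values.
depends_on = {
    "checkout-ui": ["api-gateway"],
    "api-gateway": ["orders-app"],
    "orders-app": ["orders-db"],
    "orders-db": ["network-policy"],
    "search-app": ["search-index"],
}
# Symptoms across several layers that share one underlying cause.
alerting = {"api-gateway", "orders-app", "orders-db", "network-policy"}
roots = probable_root_causes(depends_on, alerting)
```

Here the gateway timeouts, application errors, and database latency all collapse to a single candidate, `network-policy`, because every other alerting component sits above it in the dependency chain.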

Automation then converts this intelligence into action. Once the root cause is identified, often with more than 80% accuracy, the platform can initiate a pre-approved remediation workflow such as restarting a service, scaling infrastructure resources, or triggering a rollback through our orchestration layer with Qntrl. In many cases, the entire chain from detection to resolution can occur within minutes, sometimes even before an incident ticket is created.

How does Intelligent IT Operations help IT teams align more closely with business outcomes such as uptime SLAs, digital customer experience, and cost optimisation?
Traditionally, IT operations teams and business leaders viewed performance through very different lenses. Operations focused on technical metrics such as alerts, mean time to repair, and uptime, while business leaders cared about customer experience, revenue impact, and service commitments. Intelligent IT Operations helps bridge this gap by linking technical telemetry directly with business outcomes.

In practice, incidents are no longer treated as isolated infrastructure issues. By correlating signals across applications, infrastructure, and user-experience layers, teams can quickly understand not just what failed, but how that failure affects customers and service delivery. For example, instead of investigating a generic database alert, teams can immediately see that a slowdown is affecting payment transactions or checkout flows for users in a specific region.

Predictive anomaly detection and dependency-aware diagnostics also strengthen SLA management. Teams can identify degradation patterns early and resolve them before they result in SLA breaches. For sectors such as BFSI, telecom, and digital commerce, this ability to prevent incidents rather than simply respond to them significantly improves service reliability.

Automation further enhances operational efficiency by handling repetitive Tier-1 and Tier-2 workflows. This allows engineering teams to focus on improving system resilience and performance. Over time, IT operations evolve from being viewed purely as a support function to becoming a key contributor to customer experience and operational efficiency.

With Indian enterprises rapidly adopting hybrid and multi-cloud environments, what are the biggest operational challenges they face, and how can intelligent operations platforms simplify visibility and control?
Hybrid and multi-cloud adoption has brought significant agility to Indian enterprises, but it has also introduced operational challenges that traditional monitoring tools were not designed to address.

The most immediate challenge is fragmented visibility. When workloads span on-premise infrastructure, multiple cloud providers, and container platforms, telemetry becomes scattered across different monitoring tools, each producing its own metrics, alerts, and dashboards. During incidents, engineers often spend valuable time simply piecing together what is happening across the stack.

Another challenge is the sheer scale of telemetry generated by modern digital services. Massive volumes of metrics, logs, traces, and user experience data can quickly overwhelm operations teams if they are not intelligently filtered and correlated.

Cloud-native environments also introduce constant change. Microservices scale dynamically, containers are ephemeral, and service dependencies evolve with every deployment, making static topology views quickly outdated.

Platforms like Site24x7 address these challenges by unifying observability across infrastructure, applications, and user experience into a single operational layer. With unified telemetry ingestion, automated service discovery, and continuously updated dependency mapping, teams gain a real-time view of service relationships across environments.

Our causal intelligence engine then correlates signals across these layers. For example, it can link a Kubernetes pod failure in the cloud with a database latency spike on-premise and associate the resulting impact on user transactions.

Do you see Indian enterprises ready to embrace self-healing systems and autonomous remediation? What cultural or technological shifts are required to enable that transition?
Enterprises are increasingly open to adopting self-healing systems, although readiness varies across industries. In our experience working with large organisations, hesitation is rarely about the technology itself. It is more about trust, governance, and operational risk. When automation begins making decisions in production environments, organisations naturally question how much control they are willing to delegate to the system.

The first shift required is cultural. Many IT teams have traditionally operated in reactive modes, where engineers manually validate every action before execution. Moving toward autonomous remediation requires confidence that the system understands context, dependencies, and potential impact. This confidence typically grows as teams observe consistent outcomes from AI-assisted recommendations and automated workflows.

The second shift involves operational governance. Enterprises need clear guardrails around what automation is allowed to do and where human oversight remains essential. In practice, automation often begins with lower-risk tasks such as restarting services, scaling infrastructure, or resolving resource bottlenecks.

The third shift is architectural maturity. Self-healing systems depend on accurate telemetry, dependency awareness, and reliable automation frameworks. Without a strong observability foundation, autonomous remediation becomes difficult because the system cannot confidently determine cause and effect.

Across many Indian enterprises, we are seeing a phased adoption approach. Teams typically start with advisory systems that recommend actions, then move to assisted automation where engineers approve those actions, and eventually adopt guarded autonomous remediation for well-understood scenarios. As systems demonstrate reliability and operational benefits, particularly in reducing downtime and operational workload, organisations become more comfortable expanding the scope of automation.
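
The three phases can be thought of as a per-scenario setting that decides what happens to a recommended action. The Python sketch below is a hypothetical illustration only; the `Mode` names, scenario names, and `handle` function are assumptions and are not taken from any product.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    ADVISORY = "advisory"      # the system only recommends
    ASSISTED = "assisted"      # an engineer approves before execution
    AUTONOMOUS = "autonomous"  # guarded automatic execution


# Well-understood scenarios are promoted individually; the rest stay advisory.
scenario_modes: Dict[str, Mode] = {
    "disk-full": Mode.AUTONOMOUS,
    "service-hung": Mode.ASSISTED,
}


def mode_for(scenario: str) -> Mode:
    """Unknown scenarios default to the most conservative phase."""
    return scenario_modes.get(scenario, Mode.ADVISORY)


def handle(scenario: str, action: str, approved: bool = False) -> str:
    """Return what the platform would do with a recommended action."""
    mode = mode_for(scenario)
    if mode is Mode.ADVISORY:
        return f"recommend: {action}"
    if mode is Mode.ASSISTED and not approved:
        return f"awaiting approval: {action}"
    return f"execute: {action}"


r1 = handle("db-failover", "promote replica")
r2 = handle("service-hung", "restart service")
r3 = handle("service-hung", "restart service", approved=True)
r4 = handle("disk-full", "purge temp files")
```

Defaulting unknown scenarios to advisory mode mirrors the gradual expansion described above: automation scope widens one scenario at a time instead of being switched on globally.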
