Mr. Arun Pandiyan Perumal

AIOps for SRE: How AI is Redefining Operations and Reliability Practices

Abstract:

Artificial Intelligence for IT Operations (AIOps) is transforming Site Reliability Engineering (SRE) by improving issue detection, incident management, and system resilience. This presentation explores how AI-driven technologies are redefining traditional SRE practices through proactive incident detection, intelligent root cause analysis, and automated remediation. Attendees will gain insights into the core principles of AIOps; practical use cases for anomaly detection, predictive analytics, and self-healing systems; and strategies for integrating AI into existing reliability frameworks. The session will also cover the impact of AIOps on operational metrics such as mean time to recovery (MTTR) and on service level objectives (SLOs). It also examines the evolving intersection of AI and SRE, with the aim of empowering teams to drive operational excellence and maintain high availability in complex, distributed systems.