Rahul Jain
Driving Faster Incident Resolution: Data-Driven Anomaly Detection & Root Cause Analysis in Real-Time Streaming Systems.
Abstract:
Real-time streaming systems generate massive volumes of time-sensitive data, yet many operational incidents remain undetected until customer-impacting degradation occurs. This talk presents a data-driven framework for high-precision anomaly detection and automated root cause analysis, specifically designed for large-scale media and content delivery pipelines that process millions of events per minute.
We begin by demonstrating how segmentation techniques, including sliding-window partitioning, behavior-based clustering, and statistical distribution cuts, can significantly increase anomaly detection accuracy compared to traditional global-threshold models. By segmenting traffic by device type, region, and codec parameters, the system isolates micro-patterns that correlate strongly with failure signatures, substantially reducing false positives in high-variance traffic scenarios.
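To make the contrast with global thresholds concrete, here is a minimal sketch of per-segment sliding-window detection. It is an illustration only, not the framework presented in the talk: the segment keys, the rebuffer-ratio metric, the sample values, the window size, and the z-score cutoff are all assumptions chosen for the example.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


def global_threshold_anomalies(events, threshold):
    """Baseline: flag every event whose value exceeds one global threshold."""
    return [i for i, (_, value) in enumerate(events) if value > threshold]


def segmented_anomalies(events, window=5, z_threshold=3.0, min_sigma=1e-6):
    """Flag events that deviate from their own segment's recent history.

    events: iterable of (segment_key, value) pairs in arrival order.
    Each segment keeps its own sliding window of the last `window` values.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    flagged = []
    for index, (segment, value) in enumerate(events):
        past = history[segment]
        if len(past) == window:  # only score once the window is full
            mu = mean(past)
            # Floor sigma so a perfectly flat window cannot divide by zero.
            sigma = max(pstdev(past), min_sigma)
            if abs(value - mu) / sigma > z_threshold:
                flagged.append(index)
        past.append(value)
    return flagged


# Hypothetical rebuffer ratios for two (device, region) segments.
tv_us = [0.010, 0.012, 0.011, 0.009, 0.010, 0.050]   # last value is a spike
mobile_in = [0.10, 0.11, 0.09, 0.10, 0.11, 0.10]     # noisy but healthy

# Interleave the two streams to mimic mixed traffic.
events = []
for tv_value, mobile_value in zip(tv_us, mobile_in):
    events.append((("tv", "us"), tv_value))
    events.append((("mobile", "in"), mobile_value))

global_flags = global_threshold_anomalies(events, threshold=0.08)
segment_flags = segmented_anomalies(events)
print("global threshold flags:", global_flags)
print("per-segment flags:", segment_flags)
```

On this toy data the global threshold flags every healthy mobile event and misses the TV spike, while the per-segment windows flag only the spike. A production pipeline would likely maintain these windows inside a stream processor rather than in a single Python process.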
Next, we highlight the application of decision trees and random forest classifiers to perform rapid, explainable root cause analysis of media quality incidents. Using a robust dataset of labeled failure cases, the random forest approach achieves high classification accuracy, identifying key contributors such as encoder drift, CDN saturation, and edge node latency spikes. Decision path visualization further reduces the mean time to diagnosis for on-call engineering teams, accelerating response and resolution.
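The explainability point can be illustrated with a small sketch. It trains a single decision tree in plain Python, not a random forest; a forest would train many such trees on bootstrap samples and take a majority vote. The feature names, the six labeled incidents, and the new incident are hypothetical, and only the three cause labels come from the abstract.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces weighted
    Gini impurity, or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:  # require a strict improvement
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively build a tree; leaves are the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }


def diagnose(tree, row, feature_names):
    """Return (predicted_cause, decision_path) for one incident."""
    path, node = [], tree
    while isinstance(node, dict):
        name = feature_names[node["feature"]]
        if row[node["feature"]] <= node["threshold"]:
            path.append(f"{name} <= {node['threshold']}")
            node = node["left"]
        else:
            path.append(f"{name} > {node['threshold']}")
            node = node["right"]
    return node, path


# Hypothetical features and labeled incidents, for illustration only.
feature_names = ["encoder_bitrate_error_pct", "cdn_utilization_pct", "edge_latency_ms"]
rows = [
    (12.0, 40.0, 30.0), (15.0, 45.0, 25.0),    # encoder drift
    (1.0, 95.0, 35.0), (2.0, 97.0, 40.0),      # CDN saturation
    (1.5, 50.0, 250.0), (0.5, 42.0, 300.0),    # edge latency spike
]
labels = ["encoder_drift", "encoder_drift",
          "cdn_saturation", "cdn_saturation",
          "edge_latency_spike", "edge_latency_spike"]

tree = build_tree(rows, labels)
cause, path = diagnose(tree, (1.2, 96.0, 38.0), feature_names)
print(cause)
for step in path:
    print("  ", step)
```

The returned path is the list of threshold comparisons that led to the diagnosis, which is the kind of trace an on-call engineer could read directly. With only six training rows, the learned thresholds are arbitrary and serve just to show the mechanism.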
Attendees will walk away with a blueprint for deploying scalable, machine-learning-based anomaly detection pipelines, complete with segmentation strategies, feature-engineering examples, and real-world performance insights. This session is ideal for engineering leaders, data practitioners, and platform architects aiming to elevate service reliability and streamline operational workflows.
Profile:
Rahul Jain is a seasoned Principal Data & AI Engineering Leader with more than 22 years of experience architecting large-scale data platforms and advanced AI-driven analytics systems. His expertise spans OLTP, OLAP, time-series, document, and key-value databases, as well as real-time and batch processing across petabyte-scale environments.
At Cisco Systems, Rahul leads architecture for the Webex Analytics Platform, designing high-throughput data pipelines that process billions of events daily using Kafka, Spark, Flink, and Pinot. He built RADAR, a real-time anomaly detection and root-cause analysis framework, cutting detection time by 50%, and developed CHAI (Controlhub AI Assistance), a natural-language-to-SQL interface that democratizes enterprise data access. He also championed metadata governance through DataHub and modernized ingestion and processing layers to improve reliability and reduce operational costs.
Previously at iPass, Rahul designed the Firefly big data platform and deployed real-time reporting, geo-analytics, and network recommendation systems that significantly improved network intelligence and performance.
Rahul is skilled across cloud platforms (AWS, GCP, Azure), distributed systems, AI/ML, and modern data engineering ecosystems. He holds a Bachelor of Engineering in Computer Science and is fluent in English.