Saravanakumar Baskaran
(Microsoft Corporation)
Professional Summary:
With 17 years of software engineering expertise, I specialize in architecting, designing, and delivering enterprise-scale applications using Microsoft and Open-Source technologies. My technical skills span services, web, Windows, data, and big data applications, with a strong focus on software design, automation, and cross-functional collaboration. I have a proven track record in strategy development, distributed systems architecture, and leading teams toward impactful results.
Core Competencies:
• Technical Leadership: I have successfully led technical teams to deliver high-quality projects. My expertise includes developing roadmaps, implementing operational strategies, and fostering a collaborative environment with clear guidance and mentorship.
• Cross-Functional Collaboration: I work closely with teams across departments, such as product management, design, and QA, to ensure seamless development and delivery of software solutions. By actively engaging with stakeholders, I bridge the gap between technical and non-technical teams, fostering effective communication and alignment on project goals.
Significant Performance Demonstrated at Microsoft from 2017 to 2024
Site Reliability Engineering:
• As part of the Azure SRE team, implemented advanced Site Reliability Engineering (SRE) practices, combining automation, error budgets, and observability tools to enhance cloud service availability and reliability.
1. Cloud Observability and Monitoring
• Designed and implemented AI-driven predictive maintenance and monitoring strategies for cloud services, using tools such as Prometheus, Grafana, and the ELK Stack alongside machine learning algorithms to enhance infrastructure reliability and performance.
• Spearheaded the use of anomaly detection and predictive models to address issues proactively, reducing downtime and operational costs and improving resource efficiency. Integrated these systems into Incident Automation workflows.
2. Chaos Engineering
• Utilized Chaos Engineering and AI-driven predictive maintenance to proactively identify and address potential failures, minimizing downtime and ensuring fault tolerance.
• Integrated real-time monitoring, predictive analytics, and continuous improvement workflows to optimize system resilience, scalability, and performance.
3. Azure Services Incident Automation Platform
• Designed and implemented an Incident Automation Platform that enables Azure product teams to streamline and secure their live site incident responses.
• Developed a Secure Connection Model utilizing Managed Identity, enhancing system security and reducing vulnerabilities.
• Leveraged Azure Logic App Designer to enable users to create and deploy automation workflows with simplicity and efficiency.
• Achieved broad adoption: after five years of operation, all Azure Quality Critical Services (QCS) use the platform for live site incident handling.
• Impact Highlights:
o Handles over 3 million incidents per month.
o Enriched diagnostic data for 80% of Azure incidents.
o Carries out over 2 million production touches per day.
o Saves 750K DRI/on-call hours each month.
o Facilitated 85% of incident engagements with the appropriate teams through automated workflows.
o Improved the SLI of QCS components by 5%.