Amol Lele
Real-Time ML Training Profiling: Cut Costs Through Smarter Optimization
Abstract:
Machine learning training workloads are among the most resource-intensive processes in modern computing, yet a significant portion of GPU capacity remains underutilized in production environments. Training systems often experience inefficiencies such as idle GPUs waiting for data, delays in synchronization, and uneven workload distribution across workers. These issues lead to wasted compute, increased energy consumption, and higher operational costs. Traditional profiling methods rely on post-run analysis, which means bottlenecks are identified only after substantial time and resources have already been consumed.
This session presents a real-time, multi-layer profiling approach that captures signals across application, system, and infrastructure layers and aligns them into a unified timeline. By correlating events with fine-grained temporal resolution, teams can identify inefficiencies such as data pipeline delays, distributed communication overhead, and straggler workers as they occur. This approach provides continuous visibility into training performance and enables faster root cause identification compared to isolated metrics from a single layer.
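The core of this approach, merging per-layer event streams into one chronologically ordered timeline, can be sketched in a few lines. The event names and layer labels below (`batch_load_*`, `gpu_idle_*`) are illustrative assumptions, not part of any specific profiler's schema:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    ts: float                          # timestamp (seconds, shared monotonic clock)
    layer: str = field(compare=False)  # e.g. "app", "system", "infra"
    name: str = field(compare=False)   # event label within that layer

def unified_timeline(*streams):
    """Merge per-layer event streams (each already sorted by timestamp)
    into a single chronologically ordered timeline."""
    return list(heapq.merge(*streams))

# Hypothetical events from two layers of the same training step.
app = [Event(0.10, "app", "batch_load_start"), Event(0.45, "app", "batch_load_end")]
system = [Event(0.12, "system", "gpu_idle_begin"), Event(0.44, "system", "gpu_idle_end")]

timeline = unified_timeline(app, system)
# On the merged timeline, the gpu_idle interval nests inside the
# batch_load interval: the GPU sits idle while the data loader runs,
# which is exactly the data-pipeline stall signature described above.
```

In a real system the streams would arrive continuously and the merge would run incrementally, but the correlation step, interleaving layers on a common clock so adjacent events expose causality, is the same.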
The discussion also covers adaptive profiling strategies that balance observability with overhead through selective sampling and dynamic adjustment of monitoring intensity. In addition, automated recommendation systems analyze performance data to suggest configuration changes and architectural improvements, reducing the need for deep performance engineering expertise. A case study demonstrates how identifying delays in data preprocessing and workload imbalance led to improved efficiency, reduced training time, and more consistent resource utilization.
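One simple form of the dynamic adjustment described above is a feedback rule that scales the sampling rate so that measured profiler overhead tracks a fixed budget. This is a minimal sketch of that idea; the function name, budget value, and rate bounds are assumptions for illustration:

```python
def adjust_sampling(current_rate, overhead_pct, budget_pct=2.0,
                    min_rate=0.01, max_rate=1.0):
    """Scale the fraction of training steps that are profiled so that
    measured overhead converges toward a target budget.

    overhead_pct: observed profiler cost as a percentage of step time.
    Returns the new sampling rate, clamped to [min_rate, max_rate].
    """
    if overhead_pct <= 0:
        # No measurable overhead: profile as much as allowed.
        return max_rate
    # Proportionally back off (or ramp up) toward the overhead budget.
    new_rate = current_rate * (budget_pct / overhead_pct)
    return max(min_rate, min(max_rate, new_rate))

# Overhead measured at 5% against a 2% budget: profile fewer steps.
rate = adjust_sampling(current_rate=0.5, overhead_pct=5.0)
```

Running this rule once per evaluation window keeps observability high when profiling is cheap and automatically throttles it when instrumentation starts to slow training, which is the observability-versus-overhead balance the session discusses.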
Attendees will gain a practical framework for moving from reactive debugging to proactive optimization in machine learning training workflows. The session highlights how real time visibility can reduce compute waste, improve efficiency, and support faster experimentation and innovation in large scale AI systems.
Profile:
Amol Ashok Lele is a Senior Technical Architect based in San Jose, California, with over 27 years of experience designing and delivering cloud-native distributed systems and enterprise-grade solutions. He holds a Bachelor of Science in Electronics from the University of Pune, India, where he graduated with a Gold Medal for academic excellence. He has since supplemented his technical foundation with certifications, including AWS Certified Solutions Architect, AWS Certified Cloud Practitioner, a Deep Learning Specialization from Coursera, and an Accelerated Product Management certificate from Stanford University.
Amol currently serves as a Senior Master Engineer at Hewlett Packard Enterprise, where he leads high-impact initiatives including the development of the HPE Alletra MP Storage Plugin for Morpheus, the architecture of Service Blueprinting for HPE's Converged Private Cloud, and the delivery of a Generative AI-based Slack bot that reduced internal support response times by 60%. He has also driven a CI/CD transformation, standardizing AWS-based pipelines and improving deployment reliability by 30%.
Before HPE, Amol spent over four years as a Software Development Engineer at Amazon Web Services, where he designed and built AWS SageMaker Debugger and Profiler, cutting ML model debugging time by 50%. His work earned him a granted patent in ML training debugging and profiling. He also contributed to the Apache MXNet open-source framework and mentored junior engineers across the organization.
Earlier in his career, Amol served as Principal Engineer at Nimble Storage, where he led the architecture of the SMI-S standard implementation and directed the development of the Performance Monitoring product for InfoSight. He also held software engineering roles at NetApp and VMware, where he redesigned a core vSphere module that delivered a 150% performance improvement.
Throughout his career, Amol has been recognized for his ability to bridge deep technical expertise with strategic leadership, mentoring teams, driving cross-functional collaboration, and aligning complex engineering efforts with broader business objectives.