Koushal Anitha Raja

Making Generative AI Reliable: Evaluation, Hallucination Control, and Production-Scale Guardrails

Abstract:

Generative AI is powerful, but reliability is still the biggest barrier to real-world adoption, especially in high-stakes use cases where hallucinations, brittle reasoning, and inconsistent outputs can cause real harm. In this keynote, I’ll present a practical, engineering-focused framework for making LLM systems measurably reliable from prototype to production.
I’ll start by explaining why “accuracy” isn’t enough for GenAI and how reliability needs to be measured across multiple dimensions: correctness, faithfulness/grounding, robustness to prompt and data variations, safety, and consistency over time. I’ll then walk through modern evaluation approaches used in industry, including rubric-based evaluations, model-as-judge patterns, multi-model arbitration, automated regression testing, and feedback loops that prevent quality drift.
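To make the rubric-based, model-as-judge pattern concrete, here is a minimal Python sketch. The rubric dimensions mirror those above; `call_judge_model` is a hypothetical stand-in for a real LLM API call and simply returns a fixed score so the sketch runs end to end.

```python
from typing import Dict

# Rubric dimensions mirroring the reliability axes discussed above.
RUBRIC: Dict[str, str] = {
    "correctness": "Does the answer match the reference answer?",
    "faithfulness": "Is every claim supported by the provided context?",
    "safety": "Is the answer free of harmful or policy-violating content?",
}

def call_judge_model(prompt: str) -> int:
    """Hypothetical stand-in for a judge-LLM call; a real system would
    send the prompt to a model and parse its reply into a 1-5 score."""
    return 4  # fixed score so this sketch is runnable offline

def evaluate(answer: str, reference: str, context: str) -> Dict[str, float]:
    """Score one answer against each rubric dimension via a judge model."""
    scores: Dict[str, float] = {}
    for dimension, question in RUBRIC.items():
        prompt = (
            f"Rate 1-5. {question}\n"
            f"Context: {context}\nReference: {reference}\nAnswer: {answer}"
        )
        scores[dimension] = float(call_judge_model(prompt))
    # Aggregate before adding the summary key so it is a plain mean.
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

In a real pipeline, this loop would run over a fixed eval set on every model or prompt change, with the per-dimension scores tracked over time to catch quality drift.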
Next, I’ll dive into hallucination control in real deployments: how hallucinations happen, how to detect them early, and how to reduce them using retrieval-augmented generation (RAG), citation grounding, verification layers, tool use, uncertainty calibration, and human-in-the-loop gating. I’ll also cover guardrail design patterns that balance helpfulness with safety, including fallback strategies, refusal policies, and monitoring signals that catch failures before they reach users.
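As one illustration of a lightweight verification layer, the sketch below flags answer sentences that lack lexical support in the retrieved passages. This is a crude proxy (production systems typically use an NLI or judge model instead), and the function names are my own, not a specific library's API.

```python
import re
from typing import List, Set

def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens; a crude proxy for the sentence's claims."""
    return set(re.findall(r"[a-z']+", text.lower()))

def sentence_is_grounded(sentence: str, passages: List[str],
                         threshold: float = 0.5) -> bool:
    """A sentence counts as grounded if enough of its tokens appear in
    at least one retrieved passage."""
    sent = _tokens(sentence)
    if not sent:
        return True
    return any(len(sent & _tokens(p)) / len(sent) >= threshold
               for p in passages)

def grounding_score(answer: str, passages: List[str]) -> float:
    """Fraction of answer sentences grounded in the retrieved passages;
    a low score is a signal to fall back, cite, or escalate to a human."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 1.0
    return sum(sentence_is_grounded(s, passages)
               for s in sentences) / len(sentences)
```

For example, an answer that adds an unsupported second sentence scores 0.5 against a single passage supporting only the first, and a gating policy can route such answers to a fallback or human review.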
Finally, I’ll share a practical “production checklist” for deploying GenAI systems: what to measure, how to set acceptance thresholds, how to monitor quality, and how to respond when model behavior changes.
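One way such a checklist can be operationalized is as a release gate that compares offline eval metrics against acceptance thresholds before a deploy. The metric names and threshold values below are illustrative placeholders, not prescribed numbers.

```python
from typing import Dict, List, Tuple

# Illustrative thresholds; real values are set per product and risk level.
ACCEPTANCE_THRESHOLDS: Dict[str, float] = {
    "correctness": 0.90,   # min fraction of eval cases judged correct
    "faithfulness": 0.95,  # min fraction of answers grounded in sources
    "refusal_rate": 0.05,  # max fraction of unnecessary refusals
}

def release_gate(metrics: Dict[str, float],
                 thresholds: Dict[str, float] = ACCEPTANCE_THRESHOLDS
                 ) -> Tuple[bool, List[str]]:
    """Compare eval metrics to thresholds. Metrics ending in '_rate' are
    treated as upper bounds, everything else as lower bounds. Returns
    (ok, list of failing metrics) so CI can block the deploy on failure."""
    failures: List[str] = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif name.endswith("_rate") and value > limit:
            failures.append(f"{name}: {value} > {limit}")
        elif not name.endswith("_rate") and value < limit:
            failures.append(f"{name}: {value} < {limit}")
    return (not failures, failures)
```

Running this gate on every model, prompt, or retrieval-index change turns "respond when model behavior changes" into an automated check rather than a manual judgment call.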

Profile:

Koushal Anitha Raja is a Software Development Engineer II at Amazon, where he designs, develops, and deploys AI and machine learning models into large-scale production systems. Based in Seattle, WA, Koushal brings extensive experience in LLM optimisation, scalable backend development, multimodal dataset creation, and full-stack engineering.
Before joining Amazon, Koushal made significant contributions to cutting-edge AI projects at major tech companies. His work with Google's Gemini team involved creating and reviewing over 50,000 multimodal chart-caption-QA examples, contributing to chart-understanding capabilities that were highlighted in a Google Cloud Next keynote. Through his collaboration with OpenAI, he worked on SWE-Bench-style evaluations, RLHF/RLEF training pipelines, and agentic systems, creating preference-pair data and reference solutions for algorithmic coding and debugging tasks.
As Lead AI Software Engineer at Highbrow Technology/Turing, Koushal spearheaded multimodal and coding-reasoning LLM evaluation projects, building datasets that powered major AI capabilities and creating evaluation frameworks for large-scale language model projects.
Koushal holds a Master of Science in Computer Science from Stevens Institute of Technology (GPA: 3.7/4.0) and a Bachelor of Technology in Information Technology from RMD Engineering College (GPA: 8.6/10.0). His technical expertise spans Python, JavaScript/TypeScript, cloud technologies (AWS, Google Cloud), AI/ML frameworks (LangChain, Hugging Face), and modern development tools, including Docker, Kubernetes, and various databases.
With a track record of operationalising ML workflows, optimising system performance, and contributing to industry-leading AI capabilities, Koushal continues to drive innovation at the intersection of software engineering and artificial intelligence.