I'm Sadie Kim — an AI engineer working at the intersection of language models, intelligent systems, and human-centered product design.
My work spans the full stack of modern AI engineering — from fine-tuning and evaluating large language models to designing the pipelines and interfaces that bring them into production. I care deeply about reliability, interpretability, and building systems that are genuinely useful.
I've previously worked on retrieval-augmented generation (RAG) systems, AI agents, and applied NLP. I'm particularly interested in the challenge of building AI systems that behave predictably and align with user intent.
A selection of AI engineering projects spanning research, applied systems, and tooling.
Enterprise Knowledge Assistant: a production-grade retrieval-augmented generation system for a 10,000-document internal knowledge base. Reduced average query resolution time by 60% and improved answer accuracy through hybrid dense/sparse retrieval.
Fine-tuned a 7B-parameter model on a curated legal corpus using LoRA adapters. Developed a custom evaluation harness measuring clause extraction accuracy, citation recall, and hallucination rate.
Built a multi-step AI agent capable of decomposing research questions, querying external APIs, synthesizing findings, and producing structured reports with citations. Deployed as an internal tool.
Designed and built a lightweight platform for tracking LLM output quality over time. Includes automated regression testing, latency dashboards, and human feedback collection loops.
Notes on AI engineering, system design, and the craft of building with language models.
Dense embeddings are powerful, but combining them with BM25 sparse retrieval consistently outperforms either approach alone. Here's what I've learned from six months of running RAG systems in production.
Agents that work 80% of the time are not production-ready. A look at failure modes, guardrails, and the architectural patterns that actually make multi-step agents dependable.
Most LLM evaluation benchmarks measure the wrong things. This is my framework for building evaluations that actually correlate with real-world product quality.
I'm open to full-time AI engineering roles, research collaborations, and consulting engagements. If you're working on something interesting in the AI space, I'd love to hear about it.
hello@sadie.kim