Thoughts on AI & Research
Practical insights on LLMs, NLP, agentic systems, and the journey from research to production engineering.
Why Low-Resource Languages Break LLMs
Most LLM benchmarks only tell half the story. After evaluating 41 models across 19 Persian datasets, here's what we found — and why it matters for billions of people.
Benchmarking LLMs in Low-Resource Languages — Lessons from Persian
After evaluating 41 models across 19 datasets in MELAC, the results were clear — and uncomfortable. Fine-tuned Persian models lost to generalist frontier models. Here's why.
Building Agentic AI Systems: From Theory to Production
After shipping several agentic systems at Rudys.AI, I've distilled the patterns that actually work in production — and the failure modes nobody writes about.
RAG in Production: What the Tutorials Skip
Retrieval-Augmented Generation looks simple in demos. In production, chunking strategy, reranking, and context window management decide whether it works or fails silently.
Fine-Tuning vs. Prompting: A Decision Framework
Both approaches work. The right choice depends on data availability, latency budget, and how stable your task definition is. Here's the framework I use to decide.
Behind the Benchmark: Persian in a Court (ACL 2025)
How we built a vision-language benchmark grounded in Persian legal documents, and why domain grounding exposes LLM weaknesses that general benchmarks miss.
Async Python Patterns for High-Throughput LLM Apps
When you're making hundreds of concurrent LLM calls, synchronous code becomes your bottleneck. Here are the asyncio and structured concurrency patterns I rely on daily.
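As a taste of what bounded concurrency can look like, here is a minimal sketch, not the code from the post itself. It assumes a hypothetical `fake_llm_call` coroutine standing in for a real client, and a hypothetical `run_bounded` helper that caps in-flight requests with an `asyncio.Semaphore`. The cap of 10 is an arbitrary illustration, not a recommendation.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"echo: {prompt}"


async def run_bounded(prompts: list[str], max_concurrency: int = 10) -> list[str]:
    """Run all prompts concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt: str) -> str:
        # Each worker waits for a free slot before calling the model.
        async with semaphore:
            return await fake_llm_call(prompt)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(run_bounded([f"question {i}" for i in range(100)]))
    print(len(results))
```

On Python 3.11 and later, `asyncio.TaskGroup` could replace `gather` here if you want sibling tasks cancelled as soon as one fails. The semaphore pattern stays the same either way.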