Benchmarking LLMs in Low-Resource Languages — Lessons from Persian

A model that aces MMLU is impressive. But “aces MMLU” means aces it in English. Ask that same model to explain a Persian idiom, reason about Iranian civil law, or analyze a passage from Hafez — and the numbers collapse fast. That gap is the reason I built MELAC.

MELAC (Massive Evaluation of LLMs in Persian) started from a simple question: how well do modern LLMs actually perform in Persian, across the full range of tasks a Persian speaker would care about? Not translated English benchmarks. Not grammar quizzes. Real tasks — comprehension, legal reasoning, cultural knowledge, idiomatic understanding. After 41 models, 19 datasets, and six evaluation categories, the answer was more humbling than I expected. Here’s what I learned.

What “Low-Resource” Actually Means

The term “low-resource” is often misread as “not enough training data.” That’s part of it — but for Persian, a language with over 100 million speakers and a literary tradition stretching back over a thousand years, the resource gap is more nuanced than raw data volume. Persian is well-represented online. The problem runs deeper.

The real challenge is threefold. First, training data bias: most LLMs are overwhelmingly trained on English content, with other languages as an afterthought. Persian content in pretraining corpora is typically a small fraction — and often of lower quality, sourced from scraped web text with encoding inconsistencies, mixed-script artifacts, and transliteration noise that would never appear in professionally edited text.

Second, tokenizer inequity: Persian is written in a right-to-left, Arabic-derived script that is underrepresented in the data most tokenizers are trained on, so their vocabularies contain relatively few Persian subwords. Persian words are routinely fragmented into far more subword tokens than their English equivalents. In practice, a Persian prompt consumes 2–4x the tokens of an equivalent English prompt, which shrinks the effective context window, raises inference cost, and degrades performance on tasks that depend on attending across long spans of text.
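One way to check the fragmentation claim against your own tokenizer is to measure tokens per word (often called fertility) on comparable sentences. The sketch below is a minimal illustration, not the measurement behind the 2–4x figure: byte_tokenize is a stand-in that treats every UTF-8 byte as one token, the two sentences are illustrative, and effective_context is a hypothetical helper that only does the division. Swap in your model's real tokenizer to get meaningful numbers.

```python
from typing import Callable, List


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fertility(text: str, tokenize: Callable[[str], List[int]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def effective_context(window: int, ratio: float) -> int:
    """English-equivalent capacity of a context window when the
    target language needs `ratio` times as many tokens."""
    return int(window / ratio)


# Illustrative sentence pair; the Persian reads roughly "the weather is good today".
english = "The weather is nice today"
persian = "امروز هوا خوب است"

en_f = fertility(english, byte_tokenize)
fa_f = fertility(persian, byte_tokenize)
print(f"English: {en_f:.2f} tokens/word, Persian: {fa_f:.2f} tokens/word")

# With a 2x to 4x token ratio, an 8,192-token window holds the
# equivalent of 4,096 down to 2,048 English-prompt tokens.
print(effective_context(8192, 2.0), effective_context(8192, 4.0))
```

Even this crude stand-in shows the direction of the effect, because each Persian letter occupies two bytes in UTF-8 while each ASCII letter occupies one. Real subword tokenizers will give different ratios.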

Third, and most importantly for this work: evaluation blind spots. The benchmarks we use to measure model quality were mostly designed for English. Translated versions introduce systematic artifacts: grammar that sounds unnatural to native speakers, cultural references that don’t map cleanly, concepts that have no clean equivalent. A model that scores 75% on a translated Persian MMLU might still fail completely on tasks any native speaker would consider basic. The benchmark is measuring the wrong thing.

The Core Problem

Persian sits in an uncomfortable middle ground: not as underserved as Swahili or Tigrinya, but far less supported than the European languages that receive most multilingual research attention. It’s a useful test case precisely because it exposes gaps that simple “high-resource vs. low-resource” framings miss.

How We Built MELAC

The core design decision for MELAC was to build native datasets, not translate existing ones. Every benchmark was created for Persian speakers, grounded in Iranian cultural and linguistic reality. This matters more than it sounds — it means the evaluation is actually measuring Persian competence, not translation fidelity.

We organized 19 datasets across six evaluation categories:

Category	What We Tested
Persian Linguistic	Grammar rules, idiomatic expressions, verb morphology, literary analysis
Persian Legals	Iranian civil law, religious jurisprudence (fiqh), procedural law
Reading Comprehension QA	Understanding and reasoning over Persian-language texts
General Knowledge	Trivia grounded in Iranian culture, history, and geography
Domain Specific Knowledge	Expert-level questions in medicine, engineering, and sciences
Common Sense Reasoning	Logical inference and everyday reasoning in Persian contexts

The linguistic and legal categories were the hardest to build — and, as it turned out, the most revealing. Persian idioms simply don’t translate. Iranian law is a mixture of civil code and Islamic jurisprudence that has no direct equivalent in Western legal systems; a model trained primarily on English legal texts has no scaffold to hang it on.

Forty-one models were evaluated in total: flagship closed-source models (GPT-4o, Gemini 2.0 Flash, Claude families), a range of open-source models of varying sizes, and — critically — several models that had been fine-tuned specifically on Persian data. That last group turned out to be the most interesting part of the study.
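To make the setup concrete, here is a minimal sketch of how per-category accuracy could be aggregated for a single model. The record layout (category, gold answer, prediction), the exact-match rule, and the unweighted average over categories are all assumptions for illustration; the paper describes the actual scoring protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (category, gold answer, model prediction): a hypothetical record layout.
Record = Tuple[str, str, str]


def accuracy_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of exact-match answers within each category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, gold, pred in records:
        total[category] += 1
        if pred.strip() == gold.strip():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}


def macro_average(per_category: Dict[str, float]) -> float:
    """Unweighted mean over categories, so small categories count equally."""
    return sum(per_category.values()) / len(per_category)


# Toy records for demonstration only; these are not MELAC items.
toy = [
    ("Persian Linguistic", "B", "B"),
    ("Persian Linguistic", "A", "C"),
    ("Persian Legals", "D", "D"),
    ("Persian Legals", "A", "A"),
]
scores = accuracy_by_category(toy)
print(scores)
print(macro_average(scores))
```

Averaging per category rather than per question keeps a large dataset from drowning out a small one, which matters when the categories differ as much in size and difficulty as these do.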

What the Numbers Revealed

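The headline finding: on culturally specific datasets, only one of the 41 models we evaluated exceeded 50% accuracy. That model was GPT-4o. The table below reports average accuracy, a broader measure than the culturally specific subset, which is why several models clear 50% there.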

Model	Type	Avg. Accuracy
GPT-4o	Closed-source	72.61%
Gemini 2.0 Flash	Closed-source	70.68%
Other frontier models	Closed-source	55–68%
PersianMind	Persian fine-tune	35.08%
Maral-7B	Persian fine-tune	34.71%

That the Persian-specific fine-tuned models sit at the bottom is not a formatting artifact or a quirk of dataset selection. PersianMind and Maral-7B were built specifically to perform in Persian: trained on Persian data, optimized for Persian tasks. Yet they were beaten by every major generalist model, often by 30 or more percentage points.
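The size of that gap can be read straight off the table. The snippet below only subtracts the reported averages; it uses the four named rows, since the "Other frontier models" row is a range rather than a single score.

```python
# Average accuracies (percent) copied from the table above.
generalists = {"GPT-4o": 72.61, "Gemini 2.0 Flash": 70.68}
persian_tuned = {"PersianMind": 35.08, "Maral-7B": 34.71}

# Gap in percentage points for every generalist / Persian fine-tune pairing.
gaps = {
    (g, p): round(g_acc - p_acc, 2)
    for g, g_acc in generalists.items()
    for p, p_acc in persian_tuned.items()
}

for (g, p), gap in gaps.items():
    print(f"{g} vs {p}: {gap} percentage points")

smallest_gap = min(gaps.values())
largest_gap = max(gaps.values())
```

Every named pairing lands above 35 points. Even the low end of the "Other frontier models" range (55%) would leave a gap of roughly 20 points over either Persian fine-tune.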

One more counterintuitive finding: there was no strong correlation between a model’s English MMLU performance and its Persian MELAC performance. Models that were close competitors in English showed wildly different trajectories in Persian. This suggests that the quality and coverage of multilingual pretraining, not raw model capability, is what determines performance in low-resource settings.
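A rank correlation is one reasonable way to test a claim like this on your own results. The sketch below computes Spearman's rho from scratch. The two score lists are made-up placeholders, not MMLU or MELAC numbers, and choosing Spearman over Pearson is an assumption of this sketch rather than a statement of the method used in the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores for five hypothetical models (not real results).
english_scores = [88.0, 86.5, 85.0, 82.0, 79.0]
persian_scores = [61.0, 72.0, 48.0, 66.0, 52.0]

rho = spearman(english_scores, persian_scores)
print(f"Spearman rho: {rho:.2f}")
```

A rho near 1 would mean the English ranking carries over to Persian almost unchanged; values closer to 0 mean the English leaderboard tells you little about the Persian one.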

What Surprised Me Most

I went into MELAC with several hypotheses. Most were confirmed. Two things genuinely surprised me.

The fine-tuned Persian models surprised me most. I expected them to underperform on general knowledge tasks — that’s the known trade-off when you fine-tune a smaller model on narrow domain data. What I didn’t expect was for them to lose on linguistic tasks too: grammar, morphology, idiomatic expressions — the exact categories where a model trained on Persian text should have a structural advantage. This points to a more fundamental problem than data volume. The Persian-specific training data was either too limited in linguistic diversity, too narrow in register and style, or both — and the smaller base model couldn’t compensate.

The cultural knowledge ceiling was steeper than I expected. For questions requiring knowledge of Iranian culture, history, or social context — not language, just knowledge — even the best models dropped significantly below their general performance. GPT-4o performs impressively overall, but Iranian cultural knowledge is genuinely underrepresented in its training data. What’s revealing is how it fails: the model isn’t confused about what’s being asked. It’s confident. It just has the wrong answer — which is the signature of missing pretraining knowledge, not a prompting problem.

A Practical Implication

You cannot prompt-engineer your way to strong Persian cultural performance. No amount of system prompt tuning compensates for missing facts about Iranian history, law, or idiom. The knowledge has to be in the model from pretraining. Where it is not, retrieval-augmented generation, which supplies the missing facts alongside the question at query time, is the only practical workaround for production deployments today.
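For readers who have not wired up retrieval before, the idea fits in a few lines. Everything in this sketch is a toy stand-in: the word-overlap scorer replaces a real retriever, the three passages are illustrative notes, and build_prompt is a hypothetical helper rather than any library's API.

```python
from typing import List


def overlap_score(query: str, passage: str) -> int:
    """Count distinct query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str], k: int = 2) -> str:
    """Assemble a prompt that places retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Illustrative knowledge base; a real deployment would index Persian sources.
knowledge_base = [
    "Nowruz marks the Persian new year and begins at the spring equinox.",
    "Yalda night is celebrated on the longest night of the year.",
    "Hafez was a fourteenth-century poet from Shiraz.",
]

prompt = build_prompt("When does Nowruz begin?", knowledge_base, k=1)
print(prompt)
```

The model never has to know the fact; it only has to read it. That is why retrieval helps where prompt tuning cannot, and also why the quality of the indexed Persian sources becomes the new bottleneck.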

Why This Matters Beyond Persian

Persian is one data point in a pattern that repeats across dozens of languages. The structural problems we encountered (tokenizer inequity, translation artifacts in benchmarks, cultural knowledge gaps, and fine-tuning that fails to compensate for pretraining gaps) are not unique to Persian. They’re the default condition for most of the world’s languages.

There’s a deeper issue the field needs to confront: the evaluation problem precedes the training problem. Before we can claim that a model “works” in a language, we need evaluation frameworks that test the right things — natively, not via translation. MELAC-style efforts need to exist for every language that LLMs are being deployed in. Right now, for the vast majority of those languages, they don’t.

The current default in multilingual NLP is to translate English benchmarks and call it adequate evaluation. This is insufficient. A model can score 80% on a translated Persian MMLU and still be completely unable to reason about a common Persian idiom or interpret an Iranian legal document. We don’t actually know how well our best models perform in most of the world’s languages, because we haven’t built the evaluation infrastructure to measure it properly.

The evaluation gap is upstream of the training gap. You can’t close the training gap if you can’t measure it accurately. That’s the argument for MELAC, and for similar work in other languages — not just as academic benchmarks, but as prerequisites for responsible deployment.

The full MELAC paper, including methodology, dataset details, and complete results across all 41 models, is available on arXiv. If you work on multilingual NLP, LLM evaluation, or deploying models for non-English audiences, I’d genuinely like to hear your perspective. Find me on LinkedIn or GitHub.

What makes Persian a “low-resource” language if it has over 100 million speakers?

Speaker count and digital resource availability don’t always correlate. Persian has a large speaker base, but its representation in LLM pretraining corpora is a small fraction of English. Compounding this, most evaluation benchmarks were built for English and later translated — introducing artifacts that make Persian-language tests less reliable as measures of actual language competence. “Low-resource” in NLP refers to the ecosystem of data, tools, and evaluation infrastructure, not just raw text volume.

Can the MELAC datasets be used to fine-tune or evaluate other models?

That’s one of the primary goals. The datasets were designed to be reusable by the research community — both for evaluating new models as they’re released and as a basis for fine-tuning experiments. Details on access and licensing are included in the paper. If you’re working on Persian NLP and want to discuss usage, reach out directly.

How does MELAC differ from other Persian NLP benchmarks like ParsBench?

The key distinction is scope and cultural grounding. Existing benchmarks tend to cover narrow task ranges and frequently include translated English content. MELAC covers six categories — including Persian legal reasoning and cultural knowledge — with datasets that were created natively rather than translated. The 41-model comparison also makes it one of the largest evaluations of LLMs specifically on Persian to date.

Will this work extend to vision-language models in Persian?

Yes — and that work is already underway. My ACL 2025 paper, “Persian in a Court: Benchmarking Large Vision-Language Models,” extends this evaluation paradigm to the multimodal setting. The findings mirror MELAC in important ways: cultural visual knowledge is an even steeper gap than textual cultural knowledge, and fine-tuned smaller models again fail to match the performance of large generalist frontier models.