When GPT-4 scored above 86% on MMLU, the AI community celebrated. When we ran the same class of models on Persian — a language spoken by over 110 million people — the picture looked very different. Models that seemed capable in English revealed surprising gaps: broken reasoning chains, degraded safety alignment, and in some cases, outputs that made no sense at all.
This post is about why that happens. Not as a complaint, but as a technical explanation — because understanding the failure modes of LLMs on low-resource languages is essential for building systems that are actually fair and useful at a global scale. I’ll draw on findings from our MELAC evaluation (41 LLMs, 19 Persian datasets) and our ACL 2025 work on vision-language models for Persian text.
1. What Makes a Language “Low-Resource”?
The term is slightly misleading. Persian is not a small language. It is an official language of Iran, Afghanistan (as Dari), and Tajikistan (as Tajik), with over a century of modern literary output and a rich digital presence. Yet by NLP standards, it is firmly low-resource, and this distinction matters.
In NLP, “low-resource” is not about speaker count. It’s about the availability of digitized, structured, machine-readable data. English dominates the Common Crawl, Wikipedia, GitHub, and virtually every other large-scale corpus used to train modern LLMs. Estimates put English at roughly 40–50% of web-scraped training data for major models, while Persian typically accounts for less than 1%. That gap in training signal is the root cause of most downstream failures.
There’s also a second axis: the availability of labeled evaluation data. For English, we have MMLU, HellaSwag, ARC, TruthfulQA, and dozens of others — carefully curated, human-validated, covering a wide range of reasoning types. For Persian, comparable resources were nearly nonexistent before projects like ours. This makes it hard to even measure the problem, let alone fix it.
2. The Core Problem: Training Data Imbalance
Modern LLMs are trained on massive multilingual corpora, but “multilingual” can be a misleading label. Having a language represented in training data is not the same as having it well-represented. The distribution matters enormously — and for low-resource languages, it creates a compounding set of problems.
The first problem is sheer data volume. A model that sees 500 billion English tokens and 2 billion Persian tokens will develop deeply asymmetric language competence. The Persian representations in the embedding space are sparser, less refined, and far more likely to be confused with neighboring scripts or transliterations.
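To make the imbalance concrete, here is a small arithmetic sketch using the illustrative token counts from the paragraph above. It assumes training batches are sampled in proportion to corpus size, which is a simplification, and the batch size is a made-up number chosen only for illustration.

```python
# Illustrative arithmetic only. The token counts are the hypothetical figures
# from the text, and proportional sampling is a simplifying assumption.
ENGLISH_TOKENS = 500_000_000_000
PERSIAN_TOKENS = 2_000_000_000


def share(part: int, total: int) -> float:
    """Fraction of the corpus that `part` represents."""
    return part / total


total_tokens = ENGLISH_TOKENS + PERSIAN_TOKENS
persian_share = share(PERSIAN_TOKENS, total_tokens)
ratio = ENGLISH_TOKENS // PERSIAN_TOKENS

# Hypothetical batch size (in tokens), not taken from any real training run.
BATCH_TOKENS = 1_000_000
persian_per_batch = round(BATCH_TOKENS * persian_share)

print(f"English-to-Persian token ratio: {ratio}:1")
print(f"Persian share of corpus: {persian_share:.2%}")
print(f"Expected Persian tokens per {BATCH_TOKENS:,}-token batch: {persian_per_batch:,}")
```

Under these assumptions, Persian makes up well under half a percent of every batch, so the model gets very few Persian-specific updates per training step.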
The second problem is what researchers call the “curse of multilinguality.” When a fixed-capacity model is trained on many languages simultaneously, adding more languages tends to hurt performance on each individual language — especially low-resource ones. The model’s parameters are a shared resource, and high-resource languages dominate the gradients during training.
The third problem is instruction fine-tuning data. Even models that are reasonably well pre-trained on Persian suffer when the instruction-tuning phase (SFT, RLHF) is conducted primarily in English. The model learns how to be helpful in English, and this behavior doesn’t transfer cleanly to Persian. The result is a model that may understand a Persian question but structure its answer in ways that feel unnatural, incomplete, or subtly wrong to native speakers.
3. Where Models Actually Break
When we ran MELAC — evaluating 41 LLMs across 19 Persian datasets covering reading comprehension, math reasoning, commonsense QA, summarization, and safety — the failure patterns were consistent enough to categorize.
Tokenization failures
Persian is written right to left in the Perso-Arabic script, with complex letter-joining rules, zero-width non-joiners, and a rich morphological structure. A single Persian word can carry information that requires several English words to express. Most LLM tokenizers were built primarily for Latin-script languages, and they split Persian text into far more tokens than its word structure warrants.
In our evaluations, we observed cases where the model would produce a grammatically broken Persian word because the tokenizer had split it across subword units that don’t correspond to any morphological boundary. The model was essentially reasoning about word fragments, not words.
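One rough way to see where the overhead starts is to look at UTF-8 bytes, the starting units of byte-level BPE tokenizers. The sketch below is a proxy only: it does not run a real tokenizer, the two sentences are my own examples rather than MELAC items, and actual token counts depend on the merges a given tokenizer has learned.

```python
# Proxy measurement: average UTF-8 bytes per character. Byte-level BPE starts
# from bytes, so scripts that need more bytes per character begin with a
# handicap unless the tokenizer has learned enough merges for them.

ZWNJ = "\u200c"  # zero-width non-joiner, common inside Persian words


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per non-whitespace character."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


english = "I go to school"
persian = "من به مدرسه می\u200cروم"  # "I go to school"; note the ZWNJ escape

print(f"English: {bytes_per_char(english):.2f} bytes/char")
print(f"Persian: {bytes_per_char(persian):.2f} bytes/char")
print(f"ZWNJ alone: {len(ZWNJ.encode('utf-8'))} bytes")
```

Each Persian letter takes two bytes in UTF-8 and the invisible ZWNJ takes three, while ASCII letters take one. A tokenizer with few Persian merges therefore falls back on these longer byte sequences, which is one source of the fragmentation described above.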
Reasoning degradation under language shift
One of the more striking findings was how much reasoning quality degrades when you move from English to Persian — even for models with explicit multilingual support. A model that correctly solves a multi-step math problem in English will often fail the same problem posed in Persian. This isn’t because the math changes — it’s because the model’s chain-of-thought reasoning was learned primarily in English. The scaffolding for systematic thinking is weaker in the second language.
Safety alignment gaps
Perhaps the most concerning finding: safety alignment is significantly weaker in low-resource languages. Models that reliably refuse harmful requests in English will often comply with the same requests when they’re rephrased in Persian. The red-teaming and preference data used in RLHF pipelines is overwhelmingly English, so the model’s “refusal instinct” is poorly calibrated for other languages.
This is not a theoretical risk. The same model can be both safer and more capable in English than in Persian, which leaves Persian-speaking users with weaker protection than English-speaking users get from the identical system.
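A simple way to quantify this kind of gap is to compare refusal rates on the same set of harmful prompts in each language. The sketch below assumes each response has already been judged as refused or complied (by a human or a classifier). The record format, the function name, and the toy values are hypothetical and are not MELAC data.

```python
from collections import defaultdict

# Toy records: (prompt language, whether the model refused the harmful prompt).
# Made-up values for illustration only.
records = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("fa", True), ("fa", False), ("fa", False), ("fa", False),
]


def refusal_rates(rows):
    """Return {language: fraction of harmful prompts the model refused}."""
    refused = defaultdict(int)
    total = defaultdict(int)
    for lang, did_refuse in rows:
        total[lang] += 1
        refused[lang] += int(did_refuse)
    return {lang: refused[lang] / total[lang] for lang in total}


rates = refusal_rates(records)
gap = rates["en"] - rates["fa"]

for lang, rate in rates.items():
    print(f"{lang}: refusal rate {rate:.0%}")
print(f"English-minus-Persian refusal gap: {gap:.0%}")
```

Reporting the per-language rates side by side, rather than a single pooled number, is what makes the asymmetry visible.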
4. The Benchmark Illusion
When a model reports 75% accuracy on a “multilingual benchmark,” it’s easy to assume that means the model performs at 75% across all included languages. It rarely does. Multilingual benchmarks are typically weighted or averaged in ways that allow high-resource language performance to mask catastrophic failures in low-resource ones.
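The masking effect is easy to reproduce with a few lines of arithmetic. The per-language counts below are invented purely to illustrate the point: they are chosen so the pooled score lands at 75%, and they do not come from any real benchmark.

```python
# Hypothetical per-language results: (correct answers, total questions).
# The numbers are invented to illustrate the averaging effect only.
results = {
    "en": (88, 100),
    "fr": (84, 100),
    "de": (83, 100),
    "fa": (45, 100),
}


def pooled_accuracy(per_lang):
    """Single headline score over all questions, regardless of language."""
    correct = sum(c for c, _ in per_lang.values())
    total = sum(t for _, t in per_lang.values())
    return correct / total


def per_language_accuracy(per_lang):
    """Accuracy broken out by language."""
    return {lang: c / t for lang, (c, t) in per_lang.items()}


headline = pooled_accuracy(results)
by_lang = per_language_accuracy(results)
worst_lang = min(by_lang, key=by_lang.get)

print(f"Headline accuracy: {headline:.0%}")
for lang, acc in by_lang.items():
    print(f"  {lang}: {acc:.0%}")
print(f"Worst language: {worst_lang} ({by_lang[worst_lang]:.0%})")
```

With these toy numbers the headline is 75% while Persian sits at 45%, which is why a per-language breakdown and the worst-language score belong next to any multilingual average.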
There’s also the problem of translated benchmarks. A common shortcut is to take an English benchmark like MMLU and translate it into the target language. This approach introduces several artifacts: the translation may not preserve the difficulty of the original question, cultural references may become meaningless, and correct answers can change subtly across languages. Models evaluated on translated benchmarks are often being tested on whether they can navigate translation artifacts, not whether they understand the target language.
This is precisely why we built MELAC differently. Rather than translating existing English tasks, we sourced and annotated Persian-native tasks — questions written and validated by native Persian speakers, grounded in Persian-language knowledge. The performance gap between models on translated vs. native Persian benchmarks was substantial.
In MELAC, models scored on average 15–30 percentage points lower on native Persian tasks than on comparable English tasks. This is not because Persian is harder, but because the evaluation is more honest about what the model actually knows.
5. What the Field Is Doing About It
The research community hasn’t been idle, and there are several promising directions — though none of them fully close the gap yet.
Continued pre-training on native corpora is the most direct intervention. Taking a well-trained base model and continuing pre-training on a large, high-quality Persian corpus allows the model to internalize the language’s patterns more deeply without losing the general capabilities it already has.
Synthetic data generation has become increasingly viable. Using a strong English model to generate Persian instruction-following data — carefully validated by native speakers — can provide the kind of aligned training signal that was previously unavailable.
Community-driven dataset creation is slow but produces the highest-quality data. Initiatives that involve native speakers in annotation, validation, and red-teaming produce benchmarks that reflect how the language actually works. Our own work on MELAC and the Persian vision-language benchmark for ACL 2025 falls into this category.
Tokenizer reform is underserved. Most research focuses on model weights, but a tokenizer designed with morphologically rich languages in mind could significantly reduce the per-token representation burden for Persian, Arabic, Turkish, and many others.
| Intervention | Impact | Cost |
|---|---|---|
| Continued pre-training | High — directly improves language modeling | High — requires compute + clean corpus |
| Native instruction fine-tuning | High — improves alignment and helpfulness | Medium — human annotators needed |
| Synthetic data augmentation | Medium — depends on translation quality | Low — automatable with validation |
| Tokenizer reform | Medium — reduces token overhead | Very high — requires retraining from scratch |
| Native benchmarks | Indirect — enables honest measurement | Medium — requires expert annotators |
The low-resource language problem is not a niche concern. The languages that suffer most from these gaps are spoken by hundreds of millions of people across the Middle East, Africa, Southeast Asia, and South Asia. As LLMs become infrastructure — embedded in healthcare tools, educational software, legal services — the quality of these systems in non-English languages becomes a matter of equity, not just academic interest.
Better benchmarks are a start. They make the problem visible and force honest comparison. But the deeper work is building the data infrastructure — native corpora, aligned instruction sets, culturally grounded evaluation tasks — that will let the next generation of models do better from the start.
If you’re working on low-resource NLP or want to collaborate on Persian evaluation infrastructure, I’d love to connect. Find me on LinkedIn or GitHub.
What exactly is a “low-resource language” in NLP?
In NLP, a low-resource language is one that lacks sufficient digitized, machine-readable training data — not necessarily a language with few speakers. Persian, for example, has over 110 million speakers but accounts for less than 1% of most LLM pre-training corpora. The shortage of both raw text data and labeled evaluation datasets defines a language as low-resource in this context.
Why does Persian underperform despite having millions of speakers?
Several factors compound: Persian is underrepresented in web-crawled training corpora; its Arabic script and complex morphology are poorly handled by tokenizers designed for Latin-script languages; and instruction fine-tuning has been conducted almost entirely in English. Our MELAC evaluation found 15–30 percentage point accuracy drops compared to English equivalents on native Persian tasks.
Can fine-tuning fix low-resource language gaps?
Partially. Continued pre-training on large, high-quality native corpora can meaningfully improve language modeling, and native instruction fine-tuning can improve alignment. However, fine-tuning doesn’t fix tokenizer-level inefficiencies, and it can’t compensate for the lack of culturally grounded commonsense knowledge if that was absent from pre-training. It’s a useful intervention, not a complete solution.
Which LLMs perform best on low-resource languages like Persian?
Based on our MELAC evaluation across 41 models and 19 Persian datasets, frontier models (GPT-4-class and the strongest open-weight alternatives) showed the smallest absolute gaps — but still had significant performance drops compared to English. Models with larger parameter counts and more recent training cutoffs tended to perform better. Purpose-built Persian models outperform general models on language-specific tasks but lag behind on general reasoning.
What is MELAC?
MELAC (Massive Evaluation of LLMs in Persian) is a benchmark suite we developed at Shahid Beheshti University covering 19 Persian-native datasets across reading comprehension, mathematical reasoning, commonsense QA, summarization, and safety evaluation. We used it to evaluate 41 LLMs and published findings on arXiv in 2025.