Building an Arabic Reverse Dictionary: From TF-IDF to LLMs
A step-by-step NLP journey through classical methods, static embeddings, Transformer fine-tuning, and Retrieval-Augmented Generation — all in Arabic.
Contents
- Introduction & Motivation
- The Dataset
- Experiment 1 — TF-IDF
- Experiment 2 — FastText + FAISS
- Experiment 3 — Transformers (Zero-Shot)
- Experiment 4 — Contrastive Fine-Tuning
- Experiment 5 — LLMs (Zero-Shot & RAG)
- Lessons Learned & What’s Next
1. Introduction & Motivation
A reverse dictionary answers a simple question: given a description or meaning, what is the word? It is the tool you reach for when a word is on the tip of your tongue. You know what it means, you just cannot recall the word itself.
Building one is a surprisingly rich NLP problem. It requires a model that understands the meaning of a definition well enough to map it back to a single lexical item from a vocabulary of tens of thousands of candidates.
Arabic was chosen deliberately for two reasons. First, Arabic morphology is exceptionally complex — a single root can spawn hundreds of derived word forms through patterns of prefixes, suffixes, and internal vowel changes. This makes tokenization, normalization, and semantic matching genuinely hard. Second, Arabic remains comparatively underserved in NLP research. Most benchmarks and pre-trained models are English-first, which means the design decisions here carry real practical weight.
Note: This project was built as a deliberate learning curriculum — moving from classical methods to modern LLMs — rather than as a production system. The goal was to understand what each approach gains and loses on a genuinely hard Arabic NLP task. The full code is available on GitHub.
2. The Dataset
Sources
Data was aggregated from two primary sources:
KSAA-CAD (Contemporary Arabic Reverse Dictionary):[1] Derived from three major Arabic dictionaries — the Contemporary Arabic Language Dictionary, Mu’jam Arriyadh, and Al Wassit LMF. Each entry contains a word lemma, a gloss (definition), and a part-of-speech tag. The dataset is linguistically grounded in lemmas rather than roots, and the original source also ships pre-computed contextual embeddings from AraELECTRA, AraBERTv2, and CamelBERT-MSA. A sample entry looks like this:
{
"id": "ar.45",
"word": "عين",
"gloss": "عضو الإبصار في ...",
"pos": "n"
}
Split sizes: Train 31,372 / Validation 3,921 / Test 3,922.
riotu-lab/arabic_reverse_dictionary (Hugging Face):[2] A secondary dataset in CSV format with two columns, word and definition. Train size: 58,607.
Preprocessing Pipeline
All preprocessing was implemented using CAMeL Tools[3], Arabic's leading NLP toolkit:
- Schema alignment: Metadata columns from KSAA (id, pos, embeddings) were removed to match the structure of the second source.
- De-diacritization: All tashkeel (short vowel markings) were stripped.
- Tatweel removal: The elongation character “ـ” was removed.
- Orthographic normalization: Alef variants (أ، إ، آ → ا), Alef Maksura (ى → ي), and Teh Marbuta (ة → ه) were unified. This is critical in Arabic as the same word can appear with different spellings depending on the typist or source, and without normalization, identical words would be treated as different items.
- Deduplication and merging: Duplicates within each source were removed before merging into a final dataset of 97,822 entries, then shuffled and split 80/10/10.
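Below is a minimal sketch of this pipeline. The normalization and de-diacritization helpers are real CAMeL Tools functions; the file names, the tatweel replacement, and the pandas merging/dedup steps are illustrative assumptions rather than the project's exact code.

```python
# Sketch of the preprocessing pipeline using CAMeL Tools.
import pandas as pd
from camel_tools.utils.dediac import dediac_ar
from camel_tools.utils.normalize import (
    normalize_alef_ar,          # أ، إ، آ -> ا
    normalize_alef_maksura_ar,  # ى -> ي
    normalize_teh_marbuta_ar,   # ة -> ه
)

def clean(text: str) -> str:
    text = dediac_ar(text)                 # strip tashkeel (short vowels)
    text = text.replace("\u0640", "")      # remove tatweel (ـ)
    text = normalize_alef_ar(text)
    text = normalize_alef_maksura_ar(text)
    text = normalize_teh_marbuta_ar(text)
    return text.strip()

# Merge both sources into a single (word, definition) frame, then deduplicate.
ksaa = pd.read_json("ksaa_cad_train.json")[["word", "gloss"]].rename(columns={"gloss": "definition"})
riotu = pd.read_csv("riotu_reverse_dictionary.csv")[["word", "definition"]]
df = pd.concat([ksaa, riotu], ignore_index=True)
df["word"] = df["word"].map(clean)
df["definition"] = df["definition"].map(clean)
df = df.drop_duplicates(subset=["word", "definition"]).sample(frac=1.0, random_state=42)
# ...followed by the 80/10/10 train/validation/test split.
```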
Data Integrity & Leakage Analysis
Before running a single experiment, the data splits were audited carefully.
| Metric | Train | Val | Test |
|---|---|---|---|
| Total Samples | 76,265 | 9,533 | 9,534 |
| Unique Words | 35,310 | 7,201 | 7,205 |
| Unique Pairs (Word + Gloss) | 76,265 | 9,533 | 9,534 |
On word overlap: 4,060 words are shared between Train and Validation, and 4,072 between Train and Test. The task requires identifying the same word from different glosses, so overlap at the word level is expected and desirable.
On pair overlap (the critical number): Both Train ∩ Val and Train ∩ Test return 0. No identical word-gloss pair exists across splits. Critical leakage is 0.00%.
Multiple glosses per word: The dataset contains many words mapped to multiple definitions, which is linguistically realistic. Arabic words are often polysemous. This helps models learn diverse semantic contexts for the same target.
| Split | Multi-Gloss Cases |
|---|---|
| Train | 40,955 |
| Validation | 2,332 |
| Test | 2,329 |
Target word length: The vast majority of targets are single tokens, confirming this is primarily a single-word retrieval task.
| Word Length (tokens) | Frequency in Train |
|---|---|
| 1 | 56,142 |
| 2 | 15,476 |
| 3 | 3,638 |
| 4 | 807 |
| 5+ | 202 |
3. Experiment 1 — TF-IDF
Why Start Here?
TF-IDF is the obvious baseline: fast, interpretable, requires no GPU, and surprisingly strong for keyword-heavy tasks. For a reverse dictionary with short, precise glosses, it is a reasonable first bet.
Methodology
A TF-IDF matrix is built over the training glosses. At inference time, the test gloss is transformed using the same vocabulary and compared against all training glosses via Cosine Similarity. The training gloss with the highest similarity maps back to its word, and that word becomes the prediction.
The two core formula components:
Term Frequency (TF): How often a term appears in a specific gloss, normalized by gloss length.
Inverse Document Frequency (IDF): How rare a term is across all glosses. Rare terms get higher weight.
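As a concrete reference, and assuming scikit-learn's default smoothed IDF together with the sublinear_tf=True setting described below, the weight of a term t in a gloss d is:

$$\operatorname{tf\text{-}idf}(t,d) = \bigl(1+\log \operatorname{tf}(t,d)\bigr)\cdot\left(\log\frac{1+n}{1+\operatorname{df}(t)}+1\right)$$

where n is the number of training glosses and df(t) is the number of glosses containing t. The resulting vectors are L2-normalized, so cosine similarity reduces to a dot product.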
Vectorizer Configuration
The scikit-learn TfidfVectorizer was configured as follows, with rationale for each choice:
| Parameter | Value | Why |
|---|---|---|
analyzer | "word" | Full-word operations, not character n-grams |
ngram_range | (1, 1) | Unigrams only |
min_df | 2 | Drops rare typos and noise |
max_df | 0.95 | Filters near-universal stopwords |
max_features | 30,000 | Caps vocabulary to prevent memory explosion |
sublinear_tf | True | Replaces raw TF with 1 + log(TF) to compress high-frequency term dominance
dtype | np.float32 | 32-bit floats halve memory vs. 64-bit |
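The retrieval step is straightforward. The sketch below mirrors the parameter values from the table; the variable names (train_glosses, train_words) and the top-k logic are illustrative assumptions.

```python
# TF-IDF baseline: fit on training glosses, retrieve by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(
    analyzer="word", ngram_range=(1, 1),
    min_df=2, max_df=0.95, max_features=30_000,
    sublinear_tf=True, dtype=np.float32,
)
train_matrix = vectorizer.fit_transform(train_glosses)   # sparse (n_glosses, vocab)

def predict(test_gloss: str, k: int = 5) -> list[str]:
    """Return the words of the k most similar training glosses."""
    query = vectorizer.transform([test_gloss])
    scores = cosine_similarity(query, train_matrix).ravel()
    top_k = np.argsort(-scores)[:k]
    return [train_words[i] for i in top_k]
```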
Results
| Metric | Result |
|---|---|
| Top-1 Accuracy | 18.18% |
| Top-5 Accuracy | 28.40% |
| MRR | 0.2205 |
18% Top-1 on a 35,000+ word vocabulary is a meaningful baseline. The result also sets an important benchmark: any approach that claims to add semantic understanding should beat this number. As we will see, that turns out to be harder than expected.
4. Experiment 2 — FastText + FAISS
Moving Beyond Keywords
The limitation of TF-IDF is that it matches on exact token overlap. If the test gloss uses “مبنى” (building) and the training gloss used “هيكل” (structure), TF-IDF sees two completely different signals even though they are semantically close.
Static embeddings convert words into dense vectors trained on co-occurrence patterns across large corpora, capturing this kind of semantic proximity. The hope was that a “semantic search” over embedded definitions would outperform keyword matching.
Model Selection: FastText
FastText (cc.ar.300.bin)[4], pre-trained by Facebook on the Arabic Common Crawl corpus (2 million words/n-grams), was selected for one key reason: it handles Arabic morphology through subword n-grams. Instead of treating “مكتبة” as an atomic unit, FastText breaks it into character n-grams and can construct meaningful vectors for unseen word forms based on shared substrings. For a morphologically rich language where a seen root might appear in an unseen form at test time, this matters.
Methodology
Each gloss was converted to a 300-dimensional vector using mean pooling — averaging the FastText vectors of all words in the definition. The resulting sentence vectors were indexed using FAISS[5] (IndexFlatIP) for efficient nearest-neighbor search. Vectors were L2-normalized before indexing so that inner product search is mathematically equivalent to cosine similarity.
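A minimal sketch of this pipeline follows: mean-pooled gloss vectors, L2-normalized, indexed with FAISS inner-product search (equivalent to cosine similarity after normalization). The variable names and the zero-vector fallback for empty glosses are assumptions, not the project's exact code.

```python
# FastText mean pooling + FAISS nearest-neighbor retrieval.
import faiss
import fasttext
import numpy as np

ft = fasttext.load_model("cc.ar.300.bin")   # 300-dimensional Arabic vectors

def embed_gloss(gloss: str) -> np.ndarray:
    vecs = [ft.get_word_vector(w) for w in gloss.split()]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300, dtype=np.float32)

train_vecs = np.stack([embed_gloss(g) for g in train_glosses]).astype("float32")
faiss.normalize_L2(train_vecs)              # unit norm -> inner product == cosine
index = faiss.IndexFlatIP(300)
index.add(train_vecs)

query = embed_gloss(test_gloss).astype("float32").reshape(1, -1)
faiss.normalize_L2(query)
scores, idx = index.search(query, 5)        # top-5 nearest training glosses
predictions = [train_words[i] for i in idx[0]]
```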
Results
| Metric | Result |
|---|---|
| Top-1 Accuracy | 15.04% |
| Top-5 Accuracy | 23.12% |
| MRR | 0.1817 |
The semantic approach underperformed TF-IDF on every metric: roughly 3 points lower on Top-1 and more than 5 points lower on Top-5.
Why Did Semantics Hurt?
This result is counterintuitive but makes complete sense once you look at what mean pooling does to short Arabic glosses.
TF-IDF automatically down-weights high-frequency terms via the IDF score. Words like “هو”, “الذي”, “في” (he, who, in) are common across all glosses and get near-zero weight. Rare, distinctive terms get high weight. The result is that the TF-IDF vector is dominated by the words that actually distinguish this definition from all others.
Mean pooling does the opposite. Every word, including common function words, contributes equally to the sentence vector. High-frequency words pull the vector toward a generic, “average Arabic sentence” representation, eroding the distinctive signal that identifies the target word.
For short, precise dictionary glosses — often just 5–15 words — this dilution effect is severe. The conclusion: for this kind of task, keyword importance is more valuable than broad semantic meaning, at least with simple averaging.
5. Experiment 3 — Transformers (Zero-Shot)
Why Transformers?
Both TF-IDF and FastText treat words in isolation. TF-IDF counts tokens; FastText looks up pre-trained word vectors. Neither method understands context. The word “بيت” (house/verse) means something completely different in “بيت الشعر” (a verse of poetry) than in “بيت السكن” (a dwelling). Only a model that reads the entire sentence simultaneously and weights tokens relative to each other can capture this.
Transformers do exactly this via self-attention: every token attends to every other token, producing representations that are context-dependent by design.
Model Selection
Six Arabic-specific transformer models were selected to compare out-of-the-box performance:
| Model | Source | HuggingFace ID |
|---|---|---|
| Arabic-BERT[6] | Ali Safaya | asafaya/bert-base-arabic |
| AraElectra[7] | AubMindLab | aubmindlab/araelectra-base-discriminator |
| AraBERT v2[8] | AubMindLab | aubmindlab/bert-base-arabertv2 |
| CamelBERT[11] | CAMeL-Lab | CAMeL-Lab/bert-base-arabic-camelbert-msa |
| MARBERT[10] | UBC-NLP | UBC-NLP/MARBERT |
| MARBERTv2[9] | UBC-NLP | UBC-NLP/MARBERTv2 |
The Pipeline
Tokenization: Arabic strings are broken into subword units. The word “وبالوالدين” becomes ['و', 'ب', 'ال', 'والدين']. Max length was set to 128 tokens — long enough to capture the 95th percentile of gloss lengths, short enough to avoid diluting the representation with excessive padding.
Encoding: Only the Encoder block is used (not generation). The self-attention mechanism computes token-to-token interaction scores across all 12+ layers, producing a contextual vector for every token.
Pooling: Three pooling strategies were tested to collapse token vectors into a single sentence vector:
- Mean Pooling: Average all token vectors (padding excluded).
- CLS Token: Use the dedicated [CLS] token vector, designed to represent the full sequence.
- Max Pooling: Take the maximum value across tokens per dimension, highlighting the most salient features.
Retrieval: The test gloss is encoded and compared to all 76k training gloss vectors via cosine similarity. The closest match maps to its word.
In-Vocab vs. Overall: Because the training set forms a closed vocabulary, both metrics are reported. In-Vocab measures accuracy when the correct word exists in the training set. Overall measures accuracy across all test samples, including words never seen during training.
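For reference, here is a minimal sketch of zero-shot gloss encoding with mean pooling (CamelBERT shown; the other models are swapped in by changing the HuggingFace ID). The helper name and batching are illustrative; CLS and max pooling follow the same pattern, and retrieval reuses the cosine-similarity search from Experiment 2.

```python
# Zero-shot encoding: mean pooling over a Hugging Face encoder, padding masked out.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "CAMeL-Lab/bert-base-arabic-camelbert-msa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

@torch.no_grad()
def encode(glosses: list[str]) -> torch.Tensor:
    batch = tokenizer(glosses, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    hidden = model(**batch).last_hidden_state           # (B, T, hidden_dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average real tokens only
    return torch.nn.functional.normalize(pooled, dim=-1)   # unit norm for cosine search
```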
Zero-Shot Results
Top-1 Accuracy
| Model | In-Vocab | Overall |
|---|---|---|
| CamelBERT | 22.32% | 14.84% |
| MARBERTv2 | 16.53% | 10.99% |
| MARBERT | 16.17% | 10.75% |
| AraBERT | 16.06% | 10.68% |
| Arabic-BERT | 16.03% | 10.66% |
| AraElectra | 10.09% | 6.71% |
Top-5 Accuracy
| Model | In-Vocab | Overall |
|---|---|---|
| CamelBERT | 37.56% | 24.97% |
| MARBERTv2 | 27.52% | 18.30% |
| MARBERT | 26.18% | 17.41% |
| AraBERT | 25.95% | 17.25% |
| Arabic-BERT | 25.93% | 17.24% |
| AraElectra | 15.95% | 10.60% |
MRR
| Model | In-Vocab | Overall |
|---|---|---|
| CamelBERT | 0.30 | 0.20 |
| MARBERTv2 | 0.22 | 0.15 |
| MARBERT | 0.21 | 0.14 |
| AraBERT | 0.21 | 0.14 |
| Arabic-BERT | 0.21 | 0.14 |
| AraElectra | 0.13 | 0.09 |
Key Observations
CamelBERT’s lead: CamelBERT outperforms its closest competitor (MARBERTv2) by nearly 6 percentage points in Top-1 accuracy. CamelBERT was pre-trained specifically on Modern Standard Arabic (MSA) text, which is the exact register used in formal dictionaries. Its pre-training distribution matches the task distribution better than models trained on more colloquial or mixed-dialect data.
AraElectra’s poor performance: AraElectra is an ELECTRA-style model trained as a discriminator — it learns to distinguish real tokens from replaced tokens, not to produce semantically rich sentence representations. This architecture is powerful for classification but is not naturally suited for embedding-based retrieval without fine-tuning. Its poor zero-shot score is expected.
Overall vs. TF-IDF: In zero-shot mode, transformers do not beat TF-IDF on overall accuracy. Zero-shot contextual embeddings are not yet calibrated for dictionary retrieval. The next experiment addresses this directly.
6. Experiment 4 — Contrastive Fine-Tuning
The Core Problem
Pre-trained transformer embeddings encode general Arabic semantics. But the task here is highly specific: move a gloss vector close to its target word vector and far away from all other word vectors. Without task-specific training, the embedding space has no reason to organize itself around this structure.
Contrastive learning directly optimizes for this. Rather than treating word prediction as a 35,000-class classification problem, we frame it as a dense retrieval alignment task using the NT-Xent loss.
Architecture
Each model is extended with a two-stage architecture:
- Pre-trained Transformer Encoder — provides contextualized token representations.
- Task-Specific Projection Head — a single linear layer mapping the model’s hidden dimension to a compact 256-dimensional space. All embeddings are L2-normalized after projection, placing them on a unit hypersphere where cosine similarity equals dot product. This stabilizes gradients and ensures metric compatibility.
The NT-Xent Loss (InfoNCE)
For each training sample, a gloss g is pulled toward its true target word w⁺ and pushed away from N=5 randomly sampled distractor words w₁⁻, ..., w₅⁻.
Step 1 — Compute similarities. With all projections L2-normalized, cosine similarity reduces to a dot product:

$$s^{+} = g \cdot w^{+}, \qquad s_{i}^{-} = g \cdot w_{i}^{-}, \quad i = 1, \dots, 5$$

Step 2 — Scale by temperature τ = 0.07:

$$z = \left[\tfrac{s^{+}}{\tau},\ \tfrac{s_{1}^{-}}{\tau},\ \dots,\ \tfrac{s_{5}^{-}}{\tau}\right]$$

Step 3 — Cross-entropy with target index 0:

$$\mathcal{L} = -\log \frac{\exp\!\left(s^{+}/\tau\right)}{\exp\!\left(s^{+}/\tau\right) + \sum_{i=1}^{5} \exp\!\left(s_{i}^{-}/\tau\right)}$$
The intuition is direct: if the model correctly identifies the positive pair as most similar, the numerator dominates the denominator and the loss approaches zero. If a distractor outscores the true target, the loss grows large. The temperature τ = 0.07 sharpens the softmax distribution, forcing the model to be confident rather than hedging across candidates.
Negative Sampling
Five negatives per gloss were used, sampled randomly from the training vocabulary excluding the true target. All B × N negatives are flattened into a single batch, tokenized, and passed through the transformer in one forward pass before being reshaped to (B, N, 256) for loss computation. This is significantly more efficient than encoding negatives sequentially.
Five was chosen empirically: fewer negatives produce weak gradients (the model is not challenged enough), while more negatives increase compute without proportional gains in retrieval accuracy.
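A minimal sketch of this loss, assuming pre-normalized projections of shape (B, 256) for the gloss and positive word and (B, N, 256) for the negatives; the encoder and projection-head wiring are omitted and the function is illustrative rather than the project's exact code.

```python
# NT-Xent step: one positive per gloss plus N random negatives,
# cosine similarities scaled by tau, cross-entropy with the positive at index 0.
import torch
import torch.nn.functional as F

def nt_xent_loss(gloss_emb, pos_emb, neg_emb, tau=0.07):
    """
    gloss_emb: (B, 256)     L2-normalized gloss projections
    pos_emb:   (B, 256)     L2-normalized target-word projections
    neg_emb:   (B, N, 256)  L2-normalized distractor-word projections
    """
    pos_sim = (gloss_emb * pos_emb).sum(dim=-1, keepdim=True)          # (B, 1)
    neg_sim = torch.bmm(neg_emb, gloss_emb.unsqueeze(-1)).squeeze(-1)  # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau                # (B, 1 + N)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                             # positive at index 0
```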
Training Configuration
| Parameter | Value | Rationale |
|---|---|---|
batch_size | 128 | Stable gradient estimates; maximizes GPU utilization |
learning_rate | 2e-5 | Standard fine-tuning rate; prevents catastrophic forgetting |
num_epochs | 3 | Sufficient for convergence without overfitting |
warmup_steps | 500 | Stabilizes early updates when embeddings are noisy |
max_length | 128 | Covers ~95th percentile of Arabic gloss lengths |
temperature | 0.07 | Empirically optimal for sentence-level contrastive learning |
negative_sample_size | 5 | Balances discriminative pressure and memory |
Additional stabilization: AdamW with weight decay applied only to the base transformer; gradient clipping at max_norm=1.0; mixed precision (FP16) via torch.autocast + GradScaler for ~2× training speedup.
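A sketch of the stabilized training step under these settings: AdamW, linear warmup, FP16 autocast with gradient scaling, and clipping at max_norm=1.0. The model/dataloader wiring is assumed, and the per-parameter-group weight-decay split mentioned above is omitted for brevity.

```python
# One epoch of the contrastive fine-tuning loop (sketch).
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=len(train_loader) * 3)
scaler = torch.cuda.amp.GradScaler()

for batch in train_loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        gloss_emb, pos_emb, neg_emb = model(batch)       # projected, L2-normalized
        loss = nt_xent_loss(gloss_emb, pos_emb, neg_emb)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                            # clip on unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```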
Training Loss Per Epoch
| Model | Epoch 1 | Epoch 2 | Epoch 3 | Avg. Epoch Duration |
|---|---|---|---|---|
| CamelBERT | 0.3533 | 0.1509 | 0.1039 | 20:31 |
| MARBERTv2 | 1.5419 | 0.1987 | 0.1335 | 19:11 |
| MARBERT | 0.5900 | 0.1966 | 0.1146 | 19:11 |
| AraBERT | 0.7003 | 0.3246 | 0.2607 | 21:36 |
| Arabic-BERT | 0.4569 | 0.1965 | 0.1187 | 20:59 |
| AraElectra | 0.7004 | 0.2425 | 0.1658 | 21:17 |
Note: MARBERTv2’s high Epoch 1 loss (1.54 vs. ~0.5 for others) reflects its larger initial misalignment with the task — but it converges aggressively by Epoch 2.
Fine-Tuning Results
Top-1 Accuracy
| Model | In-Vocab | Overall |
|---|---|---|
| CamelBERT | 41.48% | 27.59% |
| MARBERTv2 | 40.60% | 27.00% |
| MARBERT | 38.85% | 25.83% |
| Arabic-BERT | 39.54% | 26.30% |
| AraBERT | 38.20% | 25.40% |
| AraElectra | 29.83% | 19.83% |
Top-5 Accuracy
| Model | In-Vocab | Overall |
|---|---|---|
| CamelBERT | 59.78% | 39.75% |
| MARBERTv2 | 58.41% | 38.84% |
| MARBERT | 56.34% | 37.47% |
| Arabic-BERT | 55.66% | 37.01% |
| AraBERT | 53.85% | 35.81% |
| AraElectra | 46.23% | 30.74% |
MRR
| Model | In-Vocab | Overall |
|---|---|---|
| MARBERTv2 | 0.49 | 0.32 |
| MARBERT | 0.47 | 0.31 |
| Arabic-BERT | 0.47 | 0.31 |
| AraBERT | 0.46 | 0.30 |
| AraElectra | 0.38 | 0.25 |
Key Observations
The equalization effect: The ~6-point gap between CamelBERT and the other models that existed in zero-shot collapses to ~1–2 points after fine-tuning. Contrastive alignment can overcome initial architectural and pre-training differences. Given enough task-specific signal, most models converge to similar performance levels.
AraElectra’s recovery: AraElectra’s Top-1 accuracy nearly tripled (6.71% → 19.83%). This confirms the earlier hypothesis: discriminator models need task-specific training to produce useful semantic embeddings. After fine-tuning, they are competitive.
Semantic neighborhoods: The high Top-5 accuracy (~39–40% overall) suggests the models are successfully clustering synonyms and semantically related words near each other. Even when the top prediction is wrong, the correct answer is usually nearby in embedding space.
Persistent OOV ceiling: Around 33% of test words do not appear in the training vocabulary at all. For retrieval-based methods, these cases are structurally unsolvable: the correct answer cannot be returned if it was never indexed. This hard ceiling motivates the shift to generative models.
7. Experiment 5 — LLMs (Zero-Shot & RAG)
The Fundamental Shift
All previous experiments are retrieval tasks operating over a closed vocabulary. They find the best match in an index. This means they literally cannot return a word that was not in their training set.
LLMs are generative. They can produce any word in the Arabic lexicon, including words never seen during training. This is not just an incremental improvement; it is a different problem formulation.
Hardware & Model Selection
All experiments were run locally on a 16GB Apple Silicon machine. Models were selected for Arabic proficiency and parameter efficiency:
| Model | Source | Format | Runtime |
|---|---|---|---|
| Gemma-4-E4B[12] | Google | GGUF (Q4_K_M) | LM Studio API |
| Qwen3.5-4B[13] | Alibaba | MLX (8-bit) | MLX-LM |
Prompt Design
A single unified prompt was designed to constrain output to a parseable structured format:
أنت قاموس عكسي للغة العربية. بناءً على التعريف المعطى، اذكر أفضل 5 كلمات عربية تناسب هذا التعريف.
القواعد:
- أعد فقط قائمة مرقمة من 1 إلى 5
- كل إجابة يجب أن تكون كلمة أو عبارة عربية واحدة فقط
- رتب من الأكثر احتمالاً إلى الأقل
- لا تكتب أي شرح أو نص إضافي
- لا تكرر نفس الكلمة
التعريف: {definition}
In English: "You are an Arabic reverse dictionary. Based on the given definition, list the 5 best Arabic words that fit it. Rules: return only a numbered list from 1 to 5; each answer must be a single Arabic word or phrase; order from most to least likely; do not write any explanation or extra text; do not repeat the same word. Definition: {definition}"
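For the MLX path, running the prompt locally looks roughly like the sketch below. The mlx_lm load/generate interface is real, but the model path is a placeholder, the token budget is a guess, and the regex used to parse the numbered list is an illustrative assumption.

```python
# Run the unified prompt through a local MLX model and parse the numbered list.
import re
from mlx_lm import load, generate

model, tokenizer = load("path/to/qwen-4b-mlx-8bit")    # placeholder local model path

def predict(definition: str) -> list[str]:
    prompt = PROMPT.format(definition=definition)       # the unified Arabic prompt above
    output = generate(model, tokenizer, prompt=prompt, max_tokens=128)
    # Keep only lines of the form "1. كلمة", "2) كلمة", etc.
    matches = re.finditer(r"^\s*\d+[\.\)]\s*(.+)$", output, re.M)
    return [m.group(1).strip() for m in matches][:5]
```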
Evaluation: Raw vs. Morphological Matching
A critical insight emerged during LLM evaluation: exact string matching severely underestimates LLM performance for Arabic.
A model might return “المكتبةُ” while the ground truth is “مكتبه”. Both refer to the same word. Punishing this as incorrect is misleading.
Morphological matching was implemented using CAMeL Tools, applying: orthographic normalization (unifying Alef, Teh Marbuta, Alef Maksura), diacritic stripping, tatweel removal, definite article (ال) stripping, lemmatization (plural/dual → singular), and root extraction.
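The string-level layers of that matcher look roughly like the sketch below (normalization, diacritic and tatweel stripping, definite-article stripping). Lemmatization and root extraction require CAMeL Tools' morphological analyzer and are omitted here; the match helper itself is illustrative, not the project's exact code.

```python
# Lenient word matching for Arabic evaluation (surface-level layers only).
from camel_tools.utils.dediac import dediac_ar
from camel_tools.utils.normalize import (
    normalize_alef_ar, normalize_alef_maksura_ar, normalize_teh_marbuta_ar,
)

def norm(word: str) -> str:
    word = dediac_ar(word).replace("\u0640", "")      # strip diacritics + tatweel
    word = normalize_alef_ar(word)
    word = normalize_alef_maksura_ar(word)
    word = normalize_teh_marbuta_ar(word)
    if word.startswith("ال") and len(word) > 3:       # strip the definite article
        word = word[2:]
    return word.strip()

def is_match(prediction: str, gold: str) -> bool:
    # e.g. norm("المكتبةُ") == norm("مكتبه") -> True
    return norm(prediction) == norm(gold)
```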
Additional quality metrics were tracked beyond standard retrieval scores:
- Coverage Rate: How often the model produced any output at all.
- Prediction Count Distribution: Whether the model consistently returned exactly 5 predictions.
- Repetition Rate: How often the same word appeared multiple times in one response.
- Average Prediction Length: Measures word vs. phrase vs. sentence outputs — shorter is better for a reverse dictionary.
- Language Consistency: What percentage of predictions were actually Arabic.
RAG Implementation
RAG was implemented using ChromaDB[14] as the vector database. intfloat/multilingual-e5-base[15] was used for embedding generation — it supports 100+ languages including Arabic and is trained on large multilingual corpora. All 76,265 training entries (definitions + words) were embedded with normalize_embeddings=True and persisted to disk.
At inference time, the test definition is embedded and the top 3 most similar training entries are retrieved. These are injected into the prompt as in-context examples following a consistent pattern:
التعريف: حلو المنطق، مليح اللفظ.
الإجابة: معسول الكلام
(In English: Definition: "sweet in speech, pleasant in wording." Answer: "معسول الكلام", i.e. honey-tongued.)
This allows the model to observe the expected input-output format from real examples before generating its prediction, reducing hallucination and grounding responses in retrieved context.
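A minimal sketch of the retrieval side follows: training entries embedded with multilingual-e5 and stored in a persistent ChromaDB collection, with the top-3 hits formatted as in-context examples. The collection name, storage path, "query:"/"passage:" prefixes (recommended by the E5 model card), and prompt assembly are assumptions.

```python
# RAG retrieval: index training entries once, then fetch top-3 examples per query.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-base")
client = chromadb.PersistentClient(path="./chroma_rd")
collection = client.get_or_create_collection("arabic_reverse_dictionary")

# One-time indexing of the training entries (batching omitted for brevity).
collection.add(
    ids=[str(i) for i in range(len(train_glosses))],
    documents=list(train_glosses),
    embeddings=embedder.encode(["passage: " + g for g in train_glosses],
                               normalize_embeddings=True).tolist(),
    metadatas=[{"word": w, "gloss": g} for w, g in zip(train_words, train_glosses)],
)

def build_context(definition: str, k: int = 3) -> str:
    q = embedder.encode(["query: " + definition], normalize_embeddings=True).tolist()
    hits = collection.query(query_embeddings=q, n_results=k)["metadatas"][0]
    return "\n".join(f"التعريف: {h['gloss']}\nالإجابة: {h['word']}" for h in hits)
```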
Results: Zero-Shot
Raw Matching:
| Model | Top-1 | Top-5 | MRR |
|---|---|---|---|
| Qwen3.5 | 10.50% | 16.60% | 0.1310 |
Morphological Matching:
| Model | Top-1 | Top-5 | MRR |
|---|---|---|---|
| Qwen3.5 | 26.25% | 50.40% | 0.4163 |
Quality Metrics:
| Model | Coverage | Repetition Rate | Avg. Word Length | Language Consistency |
|---|---|---|---|---|
| Qwen3.5 | 100% | 26.30% | 1.36 | 99.62% |
Results: RAG
Raw Matching:
| Model | Top-1 | Top-5 | MRR |
|---|---|---|---|
| Qwen3.5 | 25.90% | 33.10% | 0.2890 |
Morphological Matching:
| Model | Top-1 | Top-5 | MRR |
|---|---|---|---|
| Qwen3.5 | 39.82% | 62.60% | 0.5462 |
Quality Metrics:
| Model | Coverage | Repetition Rate | Avg. Word Length | Language Consistency |
|---|---|---|---|---|
| Qwen3.5 | 100% | 25.50% | 1.35 | 99.94% |
Note: Due to hardware constraints, LLM results are based on a 1,000-sample subset of the test set.
Key Observations
The 2.5× morphological gap: Raw Top-1 for Qwen zero-shot is 10.50%. After morphological normalization it jumps to 26.25% — a 2.5× increase. This is not noise; it reflects the genuine richness of Arabic morphology. The model is producing correct answers that exact string matching refuses to credit. Evaluation methodology matters as much as model selection.
RAG’s impact: RAG improved morphological Top-1 from 26.25% to 39.82% and Top-5 from 50.40% to 62.60%. Providing three contextually relevant examples from the training dataset helped the model understand the expected output format and reduced hallucination.
OOV as a structural advantage: Unlike all retrieval-based methods, LLMs are not constrained to a closed vocabulary. Any word in Arabic can be a valid prediction. This fundamentally changes the nature of the task and explains a large part of the performance gain.
The reasoning loop problem: Qwen3.5 has a tendency to generate internal chain-of-thought reasoning even when instructed to produce only a list. This consumed tokens and slowed inference significantly across the 1,000-sample test. Models designed primarily for instruction-following rather than reasoning-first generation would be more efficient here.
Gemma’s verbosity: Gemma 4 consistently generated verbose preambles before the numbered list, exhausting the token budget before producing parseable output. Multiple prompt engineering strategies were attempted: restructuring the prompt, English-only instructions, separating system/user/assistant turns, disabling reasoning modes, increasing max tokens, and adjusting temperature. None produced consistently parseable output. The issue likely reflects a mismatch between Gemma’s generation priorities (verbose, explanatory) and this task’s requirements (compact, structured). Gemma’s exclusion from the results is not a verdict on the model itself; it may simply be poorly suited to this specific output format constraint.
Zero-shot by design: All LLM results here use zero-shot inference with no fine-tuning. The performance gains over fine-tuned transformers are achieved without any task-specific optimization. Fine-tuning these models would likely push results significantly higher.
Dataset ambiguity: Some glosses in the dataset are genuinely ambiguous; multiple words legitimately satisfy the same definition. A model producing a semantically valid but non-ground-truth word appears incorrect in evaluation even when it is right. This is an inherent limitation of single-label evaluation on a task where multiple correct answers exist.
8. Lessons Learned & What’s Next
What I Learned from the Experiments (Personal Opinion)
Keywords beat semantics on short glosses. TF-IDF outperformed FastText because IDF automatically identifies the distinguishing terms in a definition. Mean pooling erases this signal. If you must use static embeddings, a weighted pooling strategy (TF-IDF weights applied to word vectors) would likely outperform simple averaging.
Contrastive learning is a genuine multiplier. Fine-tuning roughly doubled zero-shot transformer performance across every model and every metric. The NT-Xent loss forces the embedding space to organize itself around the specific retrieval task, which zero-shot pre-training cannot achieve.
Retrieval has a hard OOV ceiling. Approximately one third of test words do not appear in the training vocabulary. No retrieval-based method — regardless of how sophisticated the embeddings — can solve these cases. Generative models break this ceiling entirely.
Evaluation methodology is not neutral. The 2.5× gap between raw and morphological matching scores for Arabic LLMs shows that exact-string evaluation significantly underestimates model performance on morphologically rich languages. This is not an Arabic-specific problem; it applies to any highly inflected language.
Model architecture matters most at zero-shot; matters less after fine-tuning. CamelBERT’s 6-point lead at zero-shot shrinks to 1–2 points after fine-tuning. Task-specific training is a strong equalizer.
What Could Come Next
- Weighted pooling for static embeddings: Apply TF-IDF weights to FastText word vectors before averaging. This should recover much of the IDF signal that mean pooling loses.
- Larger LLMs: The experiments here used 4B parameter models on constrained hardware. Even a 7B or 13B Arabic-specialized model (e.g., Jais, ALLaM) would likely show significant gains.
- LLM fine-tuning: All LLM results are zero-shot. Supervised fine-tuning or instruction tuning on the training data would likely push morphological Top-1 substantially higher.
- Harder negative mining: Random negatives in contrastive training are easy to distinguish. Using semantically similar but incorrect words as negatives (hard negatives) would force the model to learn finer-grained distinctions.
- Multi-label evaluation: Rather than scoring against a single ground truth, evaluating against all semantically valid answers for a given definition would produce more honest metrics and reward models that produce legitimate synonyms.
- Gemma revisited: Gemma’s verbosity may be solvable with grammar-constrained decoding or structured output forcing. This is worth investigating.
References
1. KSAA-CAD — Contemporary Arabic Reverse Dictionary Dataset
2. riotu-lab/arabic_reverse_dictionary — Hugging Face
3. CAMeL Tools — Arabic NLP Toolkit
4. FastText Arabic Vectors — cc.ar.300.bin
5. FAISS — Facebook AI Similarity Search
6. Arabic-BERT — Hugging Face
7. AraElectra — Hugging Face
8. AraBERT v2 — AubMindLab
9. MARBERTv2 — UBC NLP
10. MARBERT — UBC NLP
11. CamelBERT — CAMeL Lab
12. Gemma 4 E4B — Google
13. Qwen3.5 4B — Alibaba
14. ChromaDB — Vector Database
15. intfloat/multilingual-e5-base — Multilingual Embeddings
16. Codebase — Full Codebase on GitHub