Corpus statistics
A self-audit of coverage, quality, and provenance. All numbers are baked in from build-time extracts of the local SQLite corpus; update them by re-running scripts/export_root_families.py.
Top-level
- Total lemmas: 111,471
- Rooted lemmas: 110,025 (99%)
- Gold (editor-tagged): 23,849 (22%)
- Semitic varieties: 17
- Curated root families: 60
- Fuzzy junction rows: 398,690
- Editor cognate claims: 174
- LLM backfill spend: $0.18
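The percentages in this list can be rechecked from the raw counts. The sketch below is only an arithmetic check on the figures shown here, not the export script's logic; the helper name `share` is hypothetical. It suggests the gold share is computed against rooted lemmas rather than total lemmas, though that denominator is an inference from the rounding, not something the page states.

```python
# Counts copied from the top-level list.
TOTAL_LEMMAS = 111_471
ROOTED_LEMMAS = 110_025
GOLD_LEMMAS = 23_849


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


rooted_share = share(ROOTED_LEMMAS, TOTAL_LEMMAS)

# Assumption: two candidate denominators for the gold share.
gold_of_rooted = share(GOLD_LEMMAS, ROOTED_LEMMAS)
gold_of_total = share(GOLD_LEMMAS, TOTAL_LEMMAS)

print(rooted_share, gold_of_rooted, gold_of_total)
```

Only the rooted-lemma denominator reproduces the published 22%.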
Per-language coverage
| Language | Total | Rooted | % | Gold |
|---|---|---|---|---|
| Arabic (ar) | 75,429 | 75,219 | 99.7% | 12,495 |
| Hebrew (he) | 16,908 | 16,605 | 98.2% | 6,472 |
| Assyrian Neo-Aramaic (aii) | 6,886 | 6,874 | 99.8% | 3,528 |
| Classical Syriac (syc) | 3,886 | 3,836 | 98.7% | 1,323 |
| Imperial Aramaic (arc) | 2,170 | 2,129 | 98.1% | 2 |
| Amharic (am) | 1,785 | 1,638 | 91.8% | 8 |
| Akkadian (akk) | 1,321 | 754 | 57.1% | 1 |
| Tigrinya (ti) | 903 | 867 | 96.0% | 1 |
| Ugaritic (ug) | 883 | 842 | 95.4% | 0 |
| Ge'ez (gez) | 522 | 484 | 92.7% | 19 |
| Turoyo (tru) | 207 | 207 | 100.0% | 0 |
| Classical Mandaic (mid) | 160 | 160 | 100.0% | 0 |
| Phoenician (phn) | 145 | 145 | 100.0% | 0 |
| Punic (pun) | 103 | 102 | 99.0% | 0 |
| Old South Arabian (osa) | 99 | 99 | 100.0% | 0 |
| Western Neo-Aramaic (amw) | 32 | 32 | 100.0% | 0 |
| Sabaean (sab) | 32 | 32 | 100.0% | 0 |
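As a consistency check, the per-language rows can be summed and compared with the top-level totals. This is a minimal sketch using the figures from the table; the `ROWS` layout and the `rooted_pct` helper are illustrative assumptions, not part of the project's scripts.

```python
# (language code, total lemmas, rooted lemmas), copied from the table.
ROWS = [
    ("ar", 75429, 75219), ("he", 16908, 16605), ("aii", 6886, 6874),
    ("syc", 3886, 3836), ("arc", 2170, 2129), ("am", 1785, 1638),
    ("akk", 1321, 754), ("ti", 903, 867), ("ug", 883, 842),
    ("gez", 522, 484), ("tru", 207, 207), ("mid", 160, 160),
    ("phn", 145, 145), ("pun", 103, 102), ("osa", 99, 99),
    ("amw", 32, 32), ("sab", 32, 32),
]


def rooted_pct(total: int, rooted: int) -> float:
    """Rooted share as a percentage, rounded to one decimal place."""
    return round(100 * rooted / total, 1)


total_lemmas = sum(total for _, total, _ in ROWS)
rooted_lemmas = sum(rooted for _, _, rooted in ROWS)

# Languages whose rooted coverage falls below 95%.
low_coverage = [code for code, total, rooted in ROWS
                if rooted_pct(total, rooted) < 95.0]

print(total_lemmas, rooted_lemmas, low_coverage)
```

The sums match the top-level figures, and the filter singles out Amharic, Akkadian, and Ge'ez as the varieties with the weakest root coverage.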
Evaluation baseline
From scripts/eval_regression.py: a regression suite run against the 168 usable Semitic-to-Semitic editor claims (out of 174 editor cognate claims in total).
- Strict recall: 71/168 (42%)
- Fuzzy recall: 108/168 (64%)
- Precision probes passing: 13/13 (100%)
- Reconstruction cases: 4/4 (100%)
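The baseline percentages follow directly from the hit counts. The sketch below assumes each figure is hits divided by cases, rounded to the nearest integer; the `BASELINE` dictionary and `pct` helper are hypothetical names for illustration and do not reflect how scripts/eval_regression.py is actually structured.

```python
# (hits, cases) per metric, copied from the evaluation baseline list.
BASELINE = {
    "strict_recall": (71, 168),
    "fuzzy_recall": (108, 168),
    "precision_probes": (13, 13),
    "reconstruction_cases": (4, 4),
}


def pct(hits: int, cases: int) -> int:
    """Return hits/cases as a percentage rounded to the nearest integer."""
    return round(100 * hits / cases)


percentages = {name: pct(hits, cases)
               for name, (hits, cases) in BASELINE.items()}

# Claims that fuzzy matching recovers beyond strict matching.
fuzzy_gain = BASELINE["fuzzy_recall"][0] - BASELINE["strict_recall"][0]

print(percentages, fuzzy_gain)
```

On these figures, fuzzy matching recovers 37 claims that strict matching misses.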
Scripture attestations
Cross-referenced lemmas where we have an earliest textual citation.
- 📜 Tanakh: 6,914 Hebrew lemmas
- ☪︎ Qur'an: 45,222 Arabic lemmas
Root-class distribution (top 60 families)
sound: 44 · hamzated: 8 · hollow: 3 · geminate: 3 · biliteral: 1 · initial-weak: 1