Corpus statistics

A self-audit of coverage, quality, and provenance. All numbers are baked in from build-time extracts of the local SQLite corpus; to update them, re-run scripts/export_root_families.py.

Top-level

Total lemmas: 111,471
Rooted lemmas: 110,025 (99%)
Gold (editor-tagged): 23,849 (22%)
Semitic varieties: 17
Curated root families: 60
Fuzzy junction rows: 398,690
Editor cognate claims: 174
LLM backfill spend: $0.18
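
The two percentages above can be reproduced from the raw counts. The sketch below is a quick check, assuming ordinary rounding to a whole percent. The rooted share is taken against all lemmas. The 22% gold figure only comes out if gold lemmas are measured against rooted lemmas, since against all lemmas it rounds to 21%. That denominator is inferred from the numbers, not stated anywhere in the source, and the `pct` helper is purely illustrative.

```python
# Top-level counts copied from the list above.
total, rooted, gold = 111_471, 110_025, 23_849


def pct(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


print(pct(rooted, total))  # 99
print(pct(gold, rooted))   # 22 (matches the published figure)
print(pct(gold, total))    # 21 (does not match, so this denominator is unlikely)
```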

Per-language coverage

Language	Code	Total	Rooted	%	Gold
Arabic	ar	75,429	75,219	99.7%	12,495
Hebrew	he	16,908	16,605	98.2%	6,472
Assyrian Neo-Aramaic	aii	6,886	6,874	99.8%	3,528
Classical Syriac	syc	3,886	3,836	98.7%	1,323
Imperial Aramaic	arc	2,170	2,129	98.1%	2
Amharic	am	1,785	1,638	91.8%	8
Akkadian	akk	1,321	754	57.1%	1
Tigrinya	ti	903	867	96.0%	1
Ugaritic	ug	883	842	95.4%	0
Ge'ez	gez	522	484	92.7%	19
Turoyo	tru	207	207	100.0%	0
Classical Mandaic	mid	160	160	100.0%	0
Phoenician	phn	145	145	100.0%	0
Punic	pun	103	102	99.0%	0
Old South Arabian	osa	99	99	100.0%	0
Western Neo-Aramaic	amw	32	32	100.0%	0
Sabaean	sab	32	32	100.0%	0
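
Each percentage in the table is rooted lemmas divided by total lemmas, and the per-language rows should add up to the top-level totals. Below is a minimal consistency check using the rows as published. The tuple layout and the `coverage` helper are illustrative and are not taken from the project's scripts.

```python
# (language code, total, rooted, gold) rows copied from the table above.
ROWS = [
    ("ar", 75429, 75219, 12495),
    ("he", 16908, 16605, 6472),
    ("aii", 6886, 6874, 3528),
    ("syc", 3886, 3836, 1323),
    ("arc", 2170, 2129, 2),
    ("am", 1785, 1638, 8),
    ("akk", 1321, 754, 1),
    ("ti", 903, 867, 1),
    ("ug", 883, 842, 0),
    ("gez", 522, 484, 19),
    ("tru", 207, 207, 0),
    ("mid", 160, 160, 0),
    ("phn", 145, 145, 0),
    ("pun", 103, 102, 0),
    ("osa", 99, 99, 0),
    ("amw", 32, 32, 0),
    ("sab", 32, 32, 0),
]


def coverage(total: int, rooted: int) -> float:
    """Rooted share of a language's lemmas, as a percentage to one decimal."""
    return round(100 * rooted / total, 1)


# Column sums for total, rooted, and gold (the language code column is skipped).
totals = tuple(sum(col) for col in list(zip(*ROWS))[1:])

print(totals)                # (111471, 110025, 23849)
print(coverage(1321, 754))   # 57.1 (Akkadian, the lowest coverage in the table)
```

The column sums agree with the top-level figures for total, rooted, and gold lemmas.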

Evaluation baseline

From scripts/eval_regression.py: a regression suite run against the 168 usable Semitic-to-Semitic editor claims.

Strict recall: 71/168 (42%)
Fuzzy recall: 108/168 (64%)
Precision probes passing: 13/13 (100%)
Reconstruction cases: 4/4 (100%)
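
Both recall figures are recovered claims over the 168 usable claims, rounded to a whole percent. Here is a small sketch of that arithmetic. The function name is hypothetical, and scripts/eval_regression.py may not compute it this way.

```python
def recall_pct(hits: int, total: int) -> int:
    """Recall as a whole-number percentage: recovered claims over all usable claims."""
    return round(100 * hits / total)


print(recall_pct(71, 168))   # 42 (strict)
print(recall_pct(108, 168))  # 64 (fuzzy)
```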

Scripture attestations

Cross-referenced lemmas where we have an earliest textual citation.

📜 Tanakh: 6,914 Hebrew lemmas
☪︎ Qur'an: 45,222 Arabic lemmas

Root-class distribution (top 60 families)

sound: 44
hamzated: 8
hollow: 3
geminate: 3
biliteral: 1
initial-weak: 1