Corpus statistics

A self-audit of coverage, quality, and provenance. All numbers are baked in at build time from extracts of the local SQLite corpus; to update them, re-run scripts/export_root_families.py.
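As a rough illustration of the kind of build-time extract described above, the sketch below counts total and rooted lemmas per language from a SQLite database. The table name `lemmas` and the columns `lang` and `root` are hypothetical; the actual schema read by scripts/export_root_families.py is not shown on this page.

```python
import sqlite3


def coverage_by_language(conn):
    """Return (lang, total, rooted) rows, largest language first."""
    # Hypothetical schema: lemmas(lang TEXT, root TEXT), with root NULL
    # when a lemma has no root assigned. COUNT(root) skips NULL values.
    query = """
        SELECT lang,
               COUNT(*)    AS total,
               COUNT(root) AS rooted
        FROM lemmas
        GROUP BY lang
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


# Tiny in-memory demo with made-up rows (not real corpus data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemmas (lang TEXT, root TEXT)")
conn.executemany(
    "INSERT INTO lemmas VALUES (?, ?)",
    [("ar", "ktb"), ("ar", None), ("he", "ktb")],
)
rows = coverage_by_language(conn)
for lang, total, rooted in rows:
    print(lang, total, rooted)
```

Counting with `COUNT(root)` rather than a separate filtered query keeps the total and the rooted count in a single pass over the table.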

Top-level

Total lemmas: 111,471
Rooted lemmas: 110,025 (99%)
Gold (editor-tagged): 23,849 (22%)
Semitic varieties: 17
Curated root families: 60
Fuzzy junction rows: 398,690
Editor cognate claims: 174
LLM backfill spend: $0.18

Per-language coverage

Language               Code  Total   Rooted  Rooted %  Gold
Arabic                 ar    75,429  75,219   99.7%    12,495
Hebrew                 he    16,908  16,605   98.2%     6,472
Assyrian Neo-Aramaic   aii    6,886   6,874   99.8%     3,528
Classical Syriac       syc    3,886   3,836   98.7%     1,323
Imperial Aramaic       arc    2,170   2,129   98.1%         2
Amharic                am     1,785   1,638   91.8%         8
Akkadian               akk    1,321     754   57.1%         1
Tigrinya               ti       903     867   96.0%         1
Ugaritic               ug       883     842   95.4%         0
Ge'ez                  gez      522     484   92.7%        19
Turoyo                 tru      207     207  100.0%         0
Classical Mandaic      mid      160     160  100.0%         0
Phoenician             phn      145     145  100.0%         0
Punic                  pun      103     102   99.0%         0
Old South Arabian      osa       99      99  100.0%         0
Western Neo-Aramaic    amw       32      32  100.0%         0
Sabaean                sab       32      32  100.0%         0
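
One way to sanity-check the per-language table is to confirm that its rows add up to the top-level figures. The sketch below hard-codes the rows from this page (it does not read the corpus) and recomputes the sums and the rooted percentages; the names `ROWS`, `total`, `rooted`, and `gold` are illustrative.

```python
# (code, total, rooted, gold) copied from the per-language table on this page.
ROWS = [
    ("ar", 75429, 75219, 12495),
    ("he", 16908, 16605, 6472),
    ("aii", 6886, 6874, 3528),
    ("syc", 3886, 3836, 1323),
    ("arc", 2170, 2129, 2),
    ("am", 1785, 1638, 8),
    ("akk", 1321, 754, 1),
    ("ti", 903, 867, 1),
    ("ug", 883, 842, 0),
    ("gez", 522, 484, 19),
    ("tru", 207, 207, 0),
    ("mid", 160, 160, 0),
    ("phn", 145, 145, 0),
    ("pun", 103, 102, 0),
    ("osa", 99, 99, 0),
    ("amw", 32, 32, 0),
    ("sab", 32, 32, 0),
]

total = sum(row[1] for row in ROWS)
rooted = sum(row[2] for row in ROWS)
gold = sum(row[3] for row in ROWS)

# The column sums should reproduce the top-level counts.
assert (total, rooted, gold) == (111471, 110025, 23849)

# Recompute the rooted percentage for each language.
for code, lang_total, lang_rooted, _ in ROWS:
    print(f"{code:4} {lang_rooted / lang_total:6.1%}")
```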

Evaluation baseline

From scripts/eval_regression.py: a regression suite run against the 168 usable Semitic-to-Semitic editor claims.

Strict recall: 71/168 (42%)
Fuzzy recall: 108/168 (64%)
Precision probes passing: 13/13 (100%)
Reconstruction cases: 4/4 (100%)
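
The recall figures are plain ratios of recovered claims to usable claims. A minimal sketch of that arithmetic, using the counts reported above; the function name `recall` is illustrative and is not taken from scripts/eval_regression.py.

```python
def recall(hits: int, total: int) -> float:
    """Fraction of editor claims that the search recovered."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


USABLE_CLAIMS = 168  # usable Semitic-to-Semitic editor claims

strict = recall(71, USABLE_CLAIMS)   # exact matches only
fuzzy = recall(108, USABLE_CLAIMS)   # fuzzy matches also count as hits

print(f"strict {strict:.0%}, fuzzy {fuzzy:.0%}")  # strict 42%, fuzzy 64%
```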

Scripture attestations

Cross-referenced lemmas for which we have an earliest textual citation.

📜 Tanakh: 6,914 Hebrew lemmas
☪︎ Qur'an: 45,222 Arabic lemmas

Root-class distribution (top 60 families)

sound: 44
hamzated: 8
hollow: 3
geminate: 3
biliteral: 1
initial-weak: 1