How it works
A single tool for exploring Semitic roots across 17 languages and two-and-a-half millennia of attestations. This page walks through the data, the algorithms, and where to look when something feels off.
1. Data sources
All primary data is open-license. We deliberately keep each source isolated so provenance stays clean.
- Wiktionary (via Kaikki.org) — CC-BY-SA. Lemmas, glosses, etymology text, derived/related terms, and editor-tagged roots for 10 Kaikki-available Semitic varieties (~111k lemmas).
- Scraped Wiktionary — for Sabaean, Phoenician, Punic, Old South Arabian, Mandaic, Turoyo, Western Neo-Aramaic: direct HTML scrapes when Kaikki coverage was sparse.
- Open Scriptures Hebrew Bible (OSHB) — CC-BY. Morphologically-tagged Tanakh (306k words, 39 books) with Strong's lemma IDs.
- Quranic Arabic Corpus (Dukes 2011) — GPL. 128k Qur'anic segments with ROOT, POS, and morphology features.
- Sefaria — public API (CC-BY). Mishnah (63 tractates), Targum Onkelos (Torah), Targum Jonathan (Prophets), Targum Neofiti & Targum Jerusalem (Torah). Together they span ~500k words of rabbinic Hebrew and Jewish Aramaic.
2. Ingestion pipeline
Each source lands in data/raw/, gets parsed by a source-specific Python script, and is written to a local SQLite DB (data/processed/semitic.sqlite3). From there, a sync job streams the data to Turso over the Hrana protocol and resumes from where it stopped if the stream drops.
- ingest.py — Kaikki JSONL → entries table
- backfill_camel_arabic.py — CAMeL Tools MSA analyzer (51k Arabic roots)
- resolve_hollow_roots.py — corrects CAMeL's # placeholder in hollow verbs using the surface word (18k rows)
- ethiopic_root.py — rule-based Ge'ez syllabary decomposition (79% Ge'ez / 63% Amharic / 100% Tigrinya gold accuracy)
- nwsemitic_root.py — Hebrew / Syriac / Aramaic abjad extractor
- llm_backfill_*.py — Gemini Flash Lite on hard cases, tightly scoped (~$0.18 total spend across the whole corpus)
- ingest_oshb.py / ingest_quran.py / ingest_sefaria.py — strip niqqud/te'amim, match consonantal forms against entries, write attestations table
- ingest_derivations.py — re-parses Kaikki for derived/related arrays (12k edges)
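As a rough illustration of the first step (Kaikki JSONL into an entries table), here is a minimal sketch. The column layout, the ingest_jsonl helper, and the two sample records are assumptions made for this example, not the project's actual schema or code; only the word, lang_code, pos, and senses/glosses fields follow the usual Kaikki record shape.

```python
import io
import json
import sqlite3

# Two made-up records in the Kaikki JSONL layout (one JSON object per line).
SAMPLE_JSONL = (
    '{"word": "كتب", "lang_code": "ar", "pos": "verb", "senses": [{"glosses": ["to write"]}]}\n'
    '{"word": "כתב", "lang_code": "he", "pos": "verb", "senses": [{"glosses": ["to write"]}]}\n'
)

def ingest_jsonl(lines, conn):
    """Insert one row per JSONL record; returns the number of rows written."""
    # Hypothetical, simplified schema for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id INTEGER PRIMARY KEY, word TEXT, lang TEXT, pos TEXT, gloss TEXT)"
    )
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        rec = json.loads(line)
        senses = rec.get("senses") or []
        glosses = senses[0].get("glosses") if senses else None
        gloss = glosses[0] if glosses else None  # keep only the first gloss
        conn.execute(
            "INSERT INTO entries (word, lang, pos, gloss) VALUES (?, ?, ?, ?)",
            (rec["word"], rec.get("lang_code"), rec.get("pos"), gloss),
        )
        count += 1
    conn.commit()
    return count

conn = sqlite3.connect(":memory:")
written = ingest_jsonl(io.StringIO(SAMPLE_JSONL), conn)
print(written, conn.execute("SELECT lang, word, gloss FROM entries").fetchall())
```

An in-memory database keeps the sketch self-contained; the real pipeline writes to the SQLite file named above.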
End state: 111,471 lemmas, 110,025 rooted (98.7%), 59,765 textual attestations, and 12,125 derivation edges, covering 17 Semitic varieties from Akkadian (c. 2400 BCE) through modern Neo-Aramaic.
3. Canonical root index
The core move: map every script-specific root (Arabic ك ت ب, Hebrew כ ת ב, Syriac ܟ ܬ ܒ, Ge'ez ከ ተ በ, Akkadian katab-) to a shared space-separated Semitistic transliteration like k t b. One indexed column on entries.root_canonical turns cross-script cognate lookup into a single-digit-millisecond query.
Mapping tables live in src/semitic_search/canonical_root.py. Where scripts distinguish phonemes that others merge (Hebrew ש covers PS *š/*ś/*ṯ but Arabic ش/س/ث distinguish them), we pick the most common proto-segment and let collisions happen — the index is a hash function for finding cognate families, not a reconstruction.
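The idea can be sketched in a few lines. The tables below are tiny, hypothetical stand-ins for the real mapping tables, and the canonical_root and cognates helpers are names invented for this example. The choice of š for Hebrew ש is also an assumption, since the text does not say which proto-segment is picked.

```python
import sqlite3

# Hypothetical, heavily truncated mapping tables (illustration only).
TABLES = {
    "ar": {"ك": "k", "ت": "t", "ب": "b"},
    "he": {"כ": "k", "ת": "t", "ב": "b", "ש": "š"},  # ש collapsed to one label (assumed)
    "syc": {"ܟ": "k", "ܬ": "t", "ܒ": "b"},
}

def canonical_root(root, lang):
    """Map a space-separated native-script root to its canonical transliteration."""
    table = TABLES[lang]
    return " ".join(table[radical] for radical in root.split())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (word TEXT, lang TEXT, root_canonical TEXT)")
conn.execute("CREATE INDEX idx_root_canonical ON entries (root_canonical)")

samples = [("كتاب", "ar", "ك ت ب"), ("כתב", "he", "כ ת ב"), ("ܟܬܒܐ", "syc", "ܟ ܬ ܒ")]
for word, lang, root in samples:
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)",
        (word, lang, canonical_root(root, lang)),
    )

def cognates(conn, canonical):
    """Cross-script lookup: one equality test on the indexed column."""
    rows = conn.execute(
        "SELECT lang FROM entries WHERE root_canonical = ? ORDER BY lang",
        (canonical,),
    )
    return [lang for (lang,) in rows]

print(cognates(conn, "k t b"))
```

Because every script lands on the same string, the lookup needs no per-language logic at query time.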
4. Proto-Semitic reflex-aware fuzzy matching
Strict canonical identity finds 42% of editor-curated cognate claims. The remaining 58% involve Proto-Semitic sound correspondences where daughter languages chose different reflexes of the same ancestor — the classic examples being Ar ض ↔ Heb צ ↔ Syc ܥ (all from *ḍ), and Ar ث ↔ Heb ש ↔ Syc ܬ (all from *ṯ).
The fuzzy matcher (src/semitic_search/fuzzy_canonical.py) maps each surface phoneme to the SET of PS sources it could reflect, then declares two roots potentially-cognate iff every aligned position has at least one shared PS source. Expanding the cartesian product gives ~3.6 surface variants per root; 398,684 junction rows cover all 109,952 rooted entries. The reflex table is conservative — it only collapses phonemes whose shared ancestry is attested in scholarly references (Lipiński 2001, Huehnergard 2000, Moscati 1964).
Effect: strict 42% → fuzzy 64% recall on editor claims; precision stays at 100% on hand-curated negative probes (see /linguistics for the empirical reflex weights).
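A minimal sketch of the matching rule, assuming a reflex table keyed by (language, surface phoneme). The handful of rows below covers only the correspondences named above plus their standard textbook neighbours; the keying scheme and the function names are assumptions for this example, not the contents of fuzzy_canonical.py.

```python
from itertools import product

# Illustrative subset: (language, surface phoneme) -> possible PS sources.
REFLEXES = {
    ("ar", "ḍ"): {"ḍ"},
    ("he", "ṣ"): {"ṣ", "ḍ", "ẓ"},
    ("syc", "ʿ"): {"ʿ", "ġ", "ḍ"},
    ("ar", "ṯ"): {"ṯ"},
    ("he", "š"): {"š", "ṯ"},
    ("syc", "t"): {"t", "ṯ"},
}

def ps_sources(lang, phoneme):
    """Phonemes missing from the table are assumed to reflect only themselves."""
    return REFLEXES.get((lang, phoneme), {phoneme})

def potentially_cognate(lang_a, root_a, lang_b, root_b):
    """True iff every aligned position shares at least one PS source."""
    a, b = root_a.split(), root_b.split()
    if len(a) != len(b):
        return False
    return all(ps_sources(lang_a, x) & ps_sources(lang_b, y) for x, y in zip(a, b))

def expand(lang, root):
    """Cartesian product of per-slot PS sources: the variants a junction table could store."""
    slots = [sorted(ps_sources(lang, p)) for p in root.split()]
    return {" ".join(combo) for combo in product(*slots)}

print(potentially_cognate("ar", "ʾ r ḍ", "he", "ʾ r ṣ"))
print(sorted(expand("he", "š l š")))
```

The two views are equivalent: two roots pass potentially_cognate exactly when their expand sets intersect, which is why precomputing the variants lets the check run as an indexed join.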
5. Proto-Semitic reconstruction
Given a cognate set, the reconstruction engine (src/semitic_search/reconstruct.py) walks each slot:
- Map each surface phoneme to its set of possible PS sources.
- For each candidate PS label, count supporters (languages whose surface phoneme could reflect it).
- Best label = highest supporter count. Ties broken by specificity: a PS source with fewer sibling surface reflexes is more diagnostic.
- Per-slot confidence = supporters / total cognates. Overall confidence = geometric mean across slots.
Spot-check (all verified): gold {ar ḏ-h-b, he z-h-b, syc d-h-b} → *ḏ-h-b (100%) · three {ar ṯ-l-ṯ, he š-l-š, syc t-l-t, akk š-l-š} → *ṯ-l-ṯ (83%) · earth {ar ʾ-r-ḍ, he ʾ-r-ṣ, syc ʾ-r-ʿ} → *ʾ-r-ḍ (100%).
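The per-slot voting described in this section can be sketched as follows. The reflex rows, the SIBLINGS tie-break count, and the reconstruct function are assumptions made for this example. The engine's real table and tie-breaking details may differ, so this sketch is not expected to reproduce every confidence figure quoted above.

```python
import math
from collections import Counter

# Illustrative subset: (language, surface phoneme) -> possible PS sources.
REFLEXES = {
    ("ar", "ḏ"): {"ḏ"},
    ("he", "z"): {"z", "ḏ"},
    ("syc", "d"): {"d", "ḏ"},
    ("ar", "ḍ"): {"ḍ"},
    ("he", "ṣ"): {"ṣ", "ḍ", "ẓ"},
    ("syc", "ʿ"): {"ʿ", "ġ", "ḍ"},
}

def ps_sources(lang, phoneme):
    """Phonemes missing from the table are assumed to reflect only themselves."""
    return REFLEXES.get((lang, phoneme), {phoneme})

# Tie-break input: how many table rows can reflect each PS label.
# Fewer rows means the label is more diagnostic (an assumed reading of "specificity").
SIBLINGS = Counter(label for sources in REFLEXES.values() for label in sources)

def reconstruct(cognates):
    """cognates: list of (lang, 'x y z') pairs with the same number of radicals."""
    langs = [lang for lang, _ in cognates]
    labels, confidences = [], []
    for slot in zip(*(root.split() for _, root in cognates)):
        votes = Counter()
        for lang, phoneme in zip(langs, slot):
            votes.update(ps_sources(lang, phoneme))  # one vote per possible source
        # Highest supporter count first, then fewest siblings, then a stable order.
        best = min(votes, key=lambda lab: (-votes[lab], SIBLINGS[lab], lab))
        labels.append(best)
        confidences.append(votes[best] / len(cognates))
    overall = math.prod(confidences) ** (1 / len(confidences))  # geometric mean
    return "*" + "-".join(labels), overall

print(reconstruct([("ar", "ḏ h b"), ("he", "z h b"), ("syc", "d h b")]))
print(reconstruct([("ar", "ʾ r ḍ"), ("he", "ʾ r ṣ"), ("syc", "ʾ r ʿ")]))
```

The geometric mean lets a single weak slot pull the overall confidence down more than an arithmetic mean would.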
6. Attestation matching
For each textual source, we strip niqqud / te'amim / cantillation and match consonantal-form tokens against entries.word and entries.vocalized_form (same stripping on both sides). Each hit writes a row to attestations(entry_id, source, citation, book_order), with UNIQUE on (entry_id, source, citation) so re-ingests are idempotent.
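A minimal sketch of the two mechanisms just described: stripping combining marks on both sides, and relying on the UNIQUE constraint for idempotent writes. The strip_marks and record_hit helpers, the sample entry, and the use of INSERT OR IGNORE are assumptions for this example; the project may strip a narrower character range or handle conflicts differently.

```python
import sqlite3
import unicodedata

def strip_marks(text):
    """Drop combining marks (Unicode category Mn), which covers niqqud and te'amim."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attestations ("
    " entry_id INTEGER, source TEXT, citation TEXT, book_order INTEGER,"
    " UNIQUE (entry_id, source, citation))"
)

def record_hit(conn, entry_id, source, citation, book_order):
    # OR IGNORE skips rows that would violate the UNIQUE constraint,
    # so running the same ingest twice leaves the table unchanged.
    conn.execute(
        "INSERT OR IGNORE INTO attestations VALUES (?, ?, ?, ?)",
        (entry_id, source, citation, book_order),
    )

# Hypothetical entry index: stripped form -> entry_id (same stripping on both sides).
entry_index = {strip_marks("בְּרֵאשִׁית"): 1}

for _ in range(2):  # simulate a re-ingest of the same verse
    for token in ["בְּרֵאשִׁ֖ית"]:  # pointed and cantillated, as in the source text
        entry_id = entry_index.get(strip_marks(token))
        if entry_id is not None:
            record_hit(conn, entry_id, "tanakh", "Genesis 1:1", 1)

print(conn.execute("SELECT COUNT(*) FROM attestations").fetchone()[0])
```

Filtering on category Mn rather than normalizing first leaves precomposed letters untouched, which matters for scripts where a decomposition would also remove a consonantal sign.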
"Earliest attestation" is computed over all matches with a source priority Tanakh → Targumim → Mishnah → Qur'an, then by book_order within a source. Displayed as badges (📜 Tanakh, ✡︎ Mishnah, 𐡀 Targum, ☪︎ Qur'an) across the UI.
7. Regression harness
scripts/eval_regression.py runs three test suites and exits non-zero if any suite falls below its floor:
- Recall on 168 editor claims (strict ≥ 40%, fuzzy ≥ 60%)
- Precision on 13 hand-curated negative pairs (≥ 95%)
- Reconstruction spot tests (100% — every case must match)
Current readings: strict 42%, fuzzy 64%, precision 100%, reconstruction 4/4. Run before any data change to catch regressions.
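The gating logic amounts to comparing each reading against its floor. The sketch below uses the floors and current readings quoted above; the metric names and the failing_suites and exit_code helpers are hypothetical, and the real script's structure is not shown in this page.

```python
# Floors as stated above, expressed as fractions.
FLOORS = {
    "strict_recall": 0.40,
    "fuzzy_recall": 0.60,
    "precision": 0.95,
    "reconstruction": 1.00,  # every spot test must match
}

def failing_suites(readings, floors=FLOORS):
    """Return the names of metrics that fall below their floor."""
    return [name for name, floor in floors.items() if readings[name] < floor]

def exit_code(readings):
    """0 when every metric clears its floor, 1 otherwise."""
    return 1 if failing_suites(readings) else 0

current = {
    "strict_recall": 0.42,
    "fuzzy_recall": 0.64,
    "precision": 1.00,
    "reconstruction": 4 / 4,
}
regressed = dict(current, fuzzy_recall=0.58)  # hypothetical regression

print(exit_code(current), failing_suites(regressed))
```

Keeping the floors slightly below the current readings lets small fluctuations pass while still catching a real drop.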
8. Known limitations
- The reflex table is conservative — it misses some attested cross-family cognates (e.g. certain Syriac-Arabic pairs involving non-standard sound changes).
- Polysemy and semantic drift aren't tracked. The root k-t-b covers "write" in Hebrew/Aramaic/Arabic but "troops/squadron" in some Arabic senses — we surface both without distinguishing.
- Loanwords aren't flagged. An Arabic word borrowed from Greek via Syriac may show up under its Semitic-looking surface consonants.
- Neofiti is patchy — historical manuscript gaps leave ~15 chapters unavailable. Fragment Targum is genuinely partial.
- The canonical hash collides emphatic/non-emphatic pairs under the fuzzy index; use the strict badge to tell which matches are identity vs reflex.
9. Licenses & attribution
Wiktionary data under CC-BY-SA. OSHB under CC-BY. Quranic Arabic Corpus under GPL. Sefaria content under CC-BY. Ktav Yad CLM font under GPL (Culmus Project, Maxim Iorsh). All Noto and Amiri fonts under SIL OFL. Reflex table drawn from Lipiński (2001), Huehnergard (2000), Moscati (1964).
When re-publishing data, please cite the upstream source: this project only aggregates and indexes, and the primary linguistic sources deserve the attribution.
10. Everything in one place
Want the raw data or the API?
- /docs — endpoint reference with curl examples
- /stats — live coverage and quality metrics
- /linguistics — reflex matrix and empirical weights
- /isogloss — interactive map of phoneme preservation
- /data/root_families.json — top-60 polyglot families as JSON
- /data/reflex_weights.json — empirical correspondence weights