
How it works

A single tool for exploring Semitic roots across 17 languages and two-and-a-half millennia of attestations. This page walks through the data, the algorithms, and where to look when something feels off.

1. Data sources

All primary data is open-license. We deliberately keep each source isolated so provenance stays clean.

  • Wiktionary (via Kaikki.org) — CC-BY-SA. Lemmas, glosses, etymology text, derived/related terms, and editor-tagged roots for 10 Kaikki-available Semitic varieties (~111k lemmas).
  • Scraped Wiktionary — for Sabaean, Phoenician, Punic, Old South Arabian, Mandaic, Turoyo, Western Neo-Aramaic: direct HTML scrapes when Kaikki coverage was sparse.
  • Open Scriptures Hebrew Bible (OSHB) — CC-BY. Morphologically-tagged Tanakh (306k words, 39 books) with Strong's lemma IDs.
  • Quranic Arabic Corpus (Dukes 2011) — GPL. 128k Qur'anic segments with ROOT, POS, and morphology features.
  • Sefaria — public API (CC-BY). Mishnah (63 tractates), Targum Onkelos (Torah), Targum Jonathan (Prophets), Targum Neofiti & Targum Jerusalem (Torah). Together they span ~500k words of rabbinic Hebrew and Jewish Aramaic.

2. Ingestion pipeline

Each source lands in data/raw/, gets parsed by a source-specific Python script, and is written to a local SQLite DB (data/processed/semitic.sqlite3). From there, a sync job streams the database to Turso over the Hrana protocol and resumes from where it stopped if the stream drops.

  • ingest.py — Kaikki JSONL → entries table
  • backfill_camel_arabic.py — CAMeL Tools MSA analyzer (51k Arabic roots)
  • resolve_hollow_roots.py — corrects CAMeL's # placeholder in hollow verbs using the surface word (18k rows)
  • ethiopic_root.py — rule-based Ge'ez syllabary decomposition (79% Ge'ez / 63% Amharic / 100% Tigrinya gold accuracy)
  • nwsemitic_root.py — Hebrew / Syriac / Aramaic abjad extractor
  • llm_backfill_*.py — Gemini Flash Lite on hard cases, tightly scoped (~$0.18 total spend across the whole corpus)
  • ingest_oshb.py / ingest_quran.py / ingest_sefaria.py — strip niqqud/te'amim, match consonantal forms against entries, write attestations table
  • ingest_derivations.py — re-parses Kaikki for derived/related arrays (12k edges)

End state: 111,471 lemmas, 110,025 rooted (98.7%), 59,765 textual attestations, and 12,125 derivation edges, covering 17 Semitic varieties from Akkadian (c. 2400 BCE) through modern Neo-Aramaic.
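
As a rough illustration of the first pipeline step (Kaikki JSONL → entries table), here is a minimal sketch. It assumes the usual Kaikki record fields (word, lang_code, pos, senses[].glosses); the column layout and the ingest_jsonl helper are hypothetical and much simpler than the real ingest.py.

```python
import json
import sqlite3


def ingest_jsonl(lines, conn):
    """Insert one row per Kaikki record; return the number of rows written."""
    # Hypothetical, simplified schema: the real entries table has more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id INTEGER PRIMARY KEY, lang TEXT, word TEXT, pos TEXT, gloss TEXT)"
    )
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        rec = json.loads(line)
        # Keep only the first gloss of the first sense, if there is one.
        senses = rec.get("senses") or [{}]
        glosses = senses[0].get("glosses") or [""]
        conn.execute(
            "INSERT INTO entries (lang, word, pos, gloss) VALUES (?, ?, ?, ?)",
            (rec.get("lang_code"), rec.get("word"), rec.get("pos"), glosses[0]),
        )
        written += 1
    conn.commit()
    return written


# Two illustrative records, not taken from the project data.
sample = [
    '{"word": "כתב", "lang_code": "he", "pos": "verb", "senses": [{"glosses": ["to write"]}]}',
    '{"word": "كتب", "lang_code": "ar", "pos": "verb", "senses": [{"glosses": ["to write"]}]}',
]
conn = sqlite3.connect(":memory:")
count = ingest_jsonl(sample, conn)
```

In practice the lines would come from an open file under data/raw/ rather than an in-memory list.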

3. Canonical root index

The core move: map every script-specific root (Arabic ك ت ب, Hebrew כ ת ב, Syriac ܟ ܬ ܒ, Ge'ez ከ ተ በ, Akkadian katab-) to a shared, space-separated Semitistic transliteration like k t b. An index on the entries.root_canonical column turns cross-script cognate lookup into a single-digit-millisecond query.

Mapping tables live in src/semitic_search/canonical_root.py. Where scripts distinguish phonemes that others merge (Hebrew ש covers PS *š/*ś/*ṯ but Arabic ش/س/ث distinguish them), we pick the most common proto-segment and let collisions happen — the index is a hash function for finding cognate families, not a reconstruction.
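
A minimal sketch of the idea, with toy tables that are an assumption for illustration (the real ones in canonical_root.py cover every script and far more letters). The canonical_root function name and its signature are hypothetical.

```python
# Toy per-script tables; each maps one root letter to its canonical symbol.
ARABIC = {"ك": "k", "ت": "t", "ب": "b", "ل": "l", "ث": "ṯ", "ش": "š", "س": "s"}
HEBREW = {"כ": "k", "ת": "t", "ב": "b", "ל": "l", "ש": "š"}
SYRIAC = {"ܟ": "k", "ܬ": "t", "ܒ": "b"}
TABLES = {"ar": ARABIC, "he": HEBREW, "syc": SYRIAC}


def canonical_root(lang, root):
    """Map a space-separated script-specific root to its canonical form.

    Raises KeyError for a letter the toy table does not know.
    """
    table = TABLES[lang]
    return " ".join(table[letter] for letter in root.split())


arabic_ktb = canonical_root("ar", "ك ت ب")
hebrew_ktb = canonical_root("he", "כ ת ב")
syriac_ktb = canonical_root("syc", "ܟ ܬ ܒ")
```

All three calls produce the same key, which is what makes the indexed lookup work. The same toy tables also show the limit of strict identity: Arabic ث ل ث maps to ṯ l ṯ while Hebrew ש ל ש maps to š l š, so that pair is only found by the reflex-aware matcher.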

4. Proto-Semitic reflex-aware fuzzy matching

Strict canonical identity finds 42% of editor-curated cognate claims. The remaining 58% involve Proto-Semitic sound correspondences where daughter languages chose different reflexes of the same ancestor — the classic examples being Ar ض ↔ Heb צ ↔ Syc ܥ (all from *ḍ), and Ar ث ↔ Heb ש ↔ Syc ܬ (all from *ṯ).

The fuzzy matcher (src/semitic_search/fuzzy_canonical.py) maps each surface phoneme to the set of PS sources it could reflect, then declares two roots potentially cognate iff every aligned position has at least one shared PS source. Expanding the Cartesian product gives ~3.6 surface variants per root; 398,684 junction rows cover all 109,952 rooted entries. The reflex table is conservative: it only collapses phonemes whose shared ancestry is attested in scholarly references (Lipiński 2001, Huehnergard 2000, Moscati 1964).

Effect: strict 42% → fuzzy 64% recall on editor claims; precision stays at 100% on hand-curated negative probes (see /linguistics for the empirical reflex weights).
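
The matching rule can be sketched as follows. The REFLEXES table here is a tiny hand-picked assumption covering only the *ḍ and *ṯ examples above, and sources, potentially_cognate, and ps_keys are hypothetical names, not the API of fuzzy_canonical.py.

```python
from itertools import product

# (language, surface phoneme) -> Proto-Semitic sources it could reflect.
REFLEXES = {
    ("ar", "ḍ"): {"*ḍ"},
    ("he", "ṣ"): {"*ṣ", "*ḍ", "*ẓ"},
    ("syc", "ʿ"): {"*ʿ", "*ḍ"},
    ("ar", "ṯ"): {"*ṯ"},
    ("he", "š"): {"*š", "*ś", "*ṯ"},
    ("syc", "t"): {"*t", "*ṯ"},
}


def sources(lang, phoneme):
    # Phonemes missing from the toy table are assumed to reflect only themselves.
    return REFLEXES.get((lang, phoneme), {"*" + phoneme})


def potentially_cognate(lang_a, root_a, lang_b, root_b):
    """True iff every aligned position shares at least one PS source."""
    if len(root_a) != len(root_b):
        return False
    return all(
        sources(lang_a, x) & sources(lang_b, y) for x, y in zip(root_a, root_b)
    )


def ps_keys(lang, root):
    """Expand a root into every PS-level key it could stand for (Cartesian product)."""
    per_slot = [sorted(sources(lang, p)) for p in root]
    return {" ".join(combo) for combo in product(*per_slot)}


earth = potentially_cognate("ar", "ʾ r ḍ".split(), "he", "ʾ r ṣ".split())
three = potentially_cognate("ar", "ṯ l ṯ".split(), "syc", "t l t".split())
unrelated = potentially_cognate("ar", "k t b".split(), "he", "k l b".split())
```

Storing one row per expanded key is one plausible way to turn the pairwise test into an indexed join: two roots are candidates whenever their key sets intersect.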

5. Proto-Semitic reconstruction

Given a cognate set, the reconstruction engine (src/semitic_search/reconstruct.py) walks each slot:

  1. Map each surface phoneme to its set of possible PS sources.
  2. For each candidate PS label, count supporters (languages whose surface phoneme could reflect it).
  3. Best label = highest supporter count. Ties broken by specificity: a PS source with fewer sibling surface reflexes is more diagnostic.
  4. Per-slot confidence = supporters / total cognates. Overall confidence = geometric mean across slots.

Spot-check (all verified): gold {ar ḏ-h-b, he z-h-b, syc d-h-b} → *ḏ-h-b (100%) · three {ar ṯ-l-ṯ, he š-l-š, syc t-l-t, akk š-l-š} → *ṯ-l-ṯ (83%) · earth {ar ʾ-r-ḍ, he ʾ-r-ṣ, syc ʾ-r-ʿ} → *ʾ-r-ḍ (100%).
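
The four steps can be sketched like this. The reflex table is a toy assumption, the function names are hypothetical, and the tie-break reads "fewer sibling surface reflexes" as "fewer table entries pointing at that label", which is only one possible interpretation of the rule in reconstruct.py.

```python
from collections import Counter
from math import prod

# Toy table: (language, surface phoneme) -> possible PS sources.
REFLEXES = {
    ("ar", "ḏ"): {"*ḏ"},
    ("he", "z"): {"*z", "*ḏ"},
    ("syc", "d"): {"*d", "*ḏ"},
    ("ar", "ḍ"): {"*ḍ"},
    ("he", "ṣ"): {"*ṣ", "*ḍ", "*ẓ"},
    ("syc", "ʿ"): {"*ʿ", "*ḍ"},
}


def sources(lang, phoneme):
    # Step 1: phonemes missing from the toy table reflect only themselves.
    return REFLEXES.get((lang, phoneme), {"*" + phoneme})


def reflex_count(label):
    # Specificity proxy: how many table entries can reflect this label.
    return sum(1 for srcs in REFLEXES.values() if label in srcs)


def reconstruct_slot(reflexes):
    """reflexes: list of (lang, phoneme) pairs for one root position."""
    votes = Counter()
    for lang, phoneme in reflexes:
        votes.update(sources(lang, phoneme))  # step 2: count supporters
    # Step 3: most supporters wins; ties go to the label with fewer reflexes,
    # then to alphabetical order so the result is deterministic.
    best = min(votes, key=lambda lab: (-votes[lab], reflex_count(lab), lab))
    return best, votes[best] / len(reflexes)


def overall_confidence(slot_confidences):
    # Step 4: geometric mean of the per-slot confidences.
    return prod(slot_confidences) ** (1 / len(slot_confidences))


def reconstruct(cognates):
    """cognates: dict of lang -> list of phonemes, all of equal length."""
    langs = list(cognates)
    width = len(cognates[langs[0]])
    slots = [
        reconstruct_slot([(lang, cognates[lang][i]) for lang in langs])
        for i in range(width)
    ]
    form = "*" + "-".join(label.lstrip("*") for label, _ in slots)
    return form, overall_confidence([conf for _, conf in slots])


gold = reconstruct({"ar": "ḏ h b".split(), "he": "z h b".split(), "syc": "d h b".split()})
earth = reconstruct({"ar": "ʾ r ḍ".split(), "he": "ʾ r ṣ".split(), "syc": "ʾ r ʿ".split()})
```

With this toy table the gold and earth sets come out as *ḏ-h-b and *ʾ-r-ḍ, each at full confidence. An 83% score is what the geometric mean yields when three of four cognates support the winning label in two of three slots: (0.75 × 1 × 0.75)^(1/3) ≈ 0.83.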

6. Attestation matching

For each textual source, we strip niqqud and te'amim (cantillation marks) and match consonantal-form tokens against entries.word and entries.vocalized_form (same stripping on both sides). Each hit writes a row to attestations(entry_id, source, citation, book_order), with UNIQUE on (entry_id, source, citation) so re-ingests are idempotent.
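
A minimal sketch of both halves for Hebrew text, assuming a regex over the Unicode Hebrew point and accent ranges and SQLite's INSERT OR IGNORE as the way the UNIQUE constraint is used. The consonantal and record_hit helpers are hypothetical names, and the single row is illustrative.

```python
import re
import sqlite3

# Hebrew cantillation marks (U+0591 to U+05AF) and vowel points (niqqud).
# Maqaf (U+05BE) and sof pasuq (U+05C3) are punctuation and are left alone.
HEBREW_MARKS = re.compile("[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]")


def consonantal(token):
    """Drop niqqud and te'amim, keeping only the consonantal skeleton."""
    return HEBREW_MARKS.sub("", token)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attestations ("
    " entry_id INTEGER, source TEXT, citation TEXT, book_order INTEGER,"
    " UNIQUE (entry_id, source, citation))"
)


def record_hit(entry_id, source, citation, book_order):
    # OR IGNORE turns a UNIQUE violation into a no-op, so re-runs add nothing.
    conn.execute(
        "INSERT OR IGNORE INTO attestations VALUES (?, ?, ?, ?)",
        (entry_id, source, citation, book_order),
    )


# "bereshit" with vowel points, written as escapes so the marks are visible.
pointed = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4\u05d9\u05ea"
plain = consonantal(pointed)

record_hit(1, "tanakh", "Genesis 1:1", 1)
record_hit(1, "tanakh", "Genesis 1:1", 1)  # simulated re-ingest, ignored
row_count = conn.execute("SELECT COUNT(*) FROM attestations").fetchone()[0]
```

Arabic and Syriac diacritics would need their own ranges; an explicit per-script range is safer than stripping every combining mark, which would also remove marks that carry consonantal information after Unicode decomposition.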

"Earliest attestation" is computed over all matches with a source priority Tanakh → Targumim → Mishnah → Qur'an, then by book_order within a source. Displayed as badges (📜 Tanakh, ✡︎ Mishnah, 𐡀 Targum, ☪︎ Qur'an) across the UI.

7. Regression harness

scripts/eval_regression.py runs three test suites and exits non-zero if any suite falls below its floor:

  • Recall on 168 editor claims (strict ≥ 40%, fuzzy ≥ 60%)
  • Precision on 13 hand-curated negative pairs (≥ 95%)
  • Reconstruction spot tests (100% — every case must match)

Current readings: strict 42%, fuzzy 64%, precision 100%, reconstruction 4/4. Run it whenever the data changes to catch regressions.
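
The floor check itself is simple enough to sketch. The metric names, the failures and exit_code helpers, and the output format are assumptions; only the floors and current readings come from this page.

```python
# Floors from the list above, expressed as fractions.
FLOORS = {
    "strict_recall": 0.40,
    "fuzzy_recall": 0.60,
    "precision": 0.95,
    "reconstruction": 1.00,
}


def failures(readings, floors=FLOORS):
    """Return the names of all metrics that fall below their floor."""
    return [name for name, floor in floors.items() if readings[name] < floor]


def exit_code(readings):
    """Print one line per failing metric and return a process exit status."""
    failed = failures(readings)
    for name in failed:
        print(f"FAIL {name}: {readings[name]:.1%} is below floor {FLOORS[name]:.0%}")
    return 1 if failed else 0


current = {
    "strict_recall": 0.42,
    "fuzzy_recall": 0.64,
    "precision": 1.00,
    "reconstruction": 4 / 4,
}
status = exit_code(current)
```

A script would finish with sys.exit(exit_code(readings)). With only 13 negative pairs, a single false positive drops precision to 12/13 ≈ 92.3%, so the 95% floor in effect requires a perfect score.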

8. Known limitations

  • The reflex table is conservative — it misses some attested cross-family cognates (e.g. certain Syriac-Arabic pairs involving non-standard sound changes).
  • Polysemy and semantic drift aren't tracked. The root k-t-b covers "write" in Hebrew/Aramaic/Arabic but "troops/squadron" in some Arabic senses — we surface both without distinguishing.
  • Loanwords aren't flagged. An Arabic word borrowed from Greek via Syriac may show up under its Semitic-looking surface consonants.
  • Neofiti is patchy — historical manuscript gaps leave ~15 chapters unavailable. Fragment Targum is genuinely partial.
  • The canonical hash collides emphatic/non-emphatic pairs under the fuzzy index; use the strict badge to tell which matches are identity vs reflex.

9. Licenses & attribution

Wiktionary data under CC-BY-SA. OSHB under CC-BY. Quranic Arabic Corpus under GPL. Sefaria content under CC-BY. Ktav Yad CLM font under GPL (Culmus Project, Maxim Iorsh). All Noto and Amiri fonts under SIL OFL. Reflex table drawn from Lipiński (2001), Huehnergard (2000), Moscati (1964).

When re-publishing data, please cite the upstream source. This project only aggregates and indexes; the primary linguistic sources deserve the attribution.

10. Everything in one place

Want the raw data or the API?

  • /docs — endpoint reference with curl examples
  • /stats — live coverage and quality metrics
  • /linguistics — reflex matrix and empirical weights
  • /isogloss — interactive map of phoneme preservation
  • /data/root_families.json — top-60 polyglot families as JSON
  • /data/reflex_weights.json — empirical correspondence weights