designing one sign language for the world, by measurement rather than taste. Work in progress; everything below is the live working state of the project, regenerated on every publish.
The approach in one sentence: measure what the world's sign languages already agree on (they agree far more than spoken languages do), keep the convergent parts, design the rest against explicit constraints, and test everything with deaf reviewers before it counts.
# STATUS: read this first

One page for anyone (human or agent) stepping into the project cold.

## What this is

pansign: designing a universal auxiliary sign language by measuring convergence across existing sign languages instead of inventing from taste. PLAN.md is the full design document and the authority on why. Dated snapshots of it live in backups/.

## Where we are

**Phase 1-2 of PLAN.md section 12: the annotations-only v0 convergence study.** Goal: first real convergence and head-start numbers computed from already-annotated lexical databases, zero video processing. Video and pose estimation come after this proves the method.

Progress and reasoning are in LOG.md (chronological, append-only). Data source inventory and access status: data/SOURCES.md.

## How to continue

1. Read LOG.md from the last entry backward until oriented.
2. Check DECISIONS.md for anything marked `open` that blocks your task; proceed on the recorded default unless it is marked `blocking`.
3. Do the next thing listed under "Next" in the latest LOG.md entry.
4. Append your own LOG.md entry: what you did, why, what surprised you.
5. `just save "<plain words>"` after every green step.

## File map

- PLAN.md: the design document. Read sections 5, 9, 11 before touching data.
- LOG.md: work log, newest entry last. The "why" lives here.
- DECISIONS.md: open/settled decision register. Arbitrary calls and convention choices are parked here so the operator or deaf reviewers can revisit them without rework.
- QUESTIONS.md: emerging concerns that are nobody's decision yet (legal, methodological, resource).
- EDGECASES.md: corner cases the language and tooling must eventually handle.
- data/: datasets and their inventory (SOURCES.md). Raw downloads stay out of git if large (.gitignore'd); derived tables get committed.
- backups/: dated PLAN.md snapshots.

## Who does what

- Wayne (agent): everything computable. Data survey, pipeline, analysis, drafts, this repo.
- Operator: spending, accounts, legal calls (scraping terms, licenses), outreach, anything public. Registers pansign.org. Reviews DECISIONS.md.
- Deaf community: reviews and vetoes, from phase 3 on. Nothing publishes without deaf review; see PLAN.md sections 7 and 12.

## Current human (operator) items

Not blocking yet; the concept list and pose-pipeline proof proceed without them. In priority order:

1. **Global Signbank account** (unblocks the first cross-language comparison): register at https://signbank.cls.ru.nl (top-right signup), request access to the NGT dataset (it is CC BY 4.0; academic-adjacent justification "cross-linguistic lexical comparison research" is accurate), then Datasets → NGT → export gloss CSV, and place the file in data/raw/ named ngt-signbank-export.csv. Wayne handles the rest.
2. **Register pansign.org** (already agreed, any time).
3. Later, at phase 2 breadth: the Spread the Sign bulk-use decision (DECISIONS D004) and a real legal read on QUESTIONS Q1.
# Work log

Append-only, chronological. Each entry: what, why, surprises, next. Detail level: enough that a stranger could pick up mid-stride.

## 2026-07-01: project start

Repo created from the legendary-taste kit. PLAN.md written: the full design argument (auxiliary not replacement, convergence-measured lexicon, deaf-first, script-free). Private backup at wayniacal/pansign.

## 2026-07-05: naming, scorecard, media pipeline, infant track

Operator settled the name: pansign (pansign.org to be registered by operator; .com squatted). "koine" reserved for the learning app. Added to PLAN.md: media pipeline (section 9, pose estimation in, encoding-first out, per-stage checks), scorecard deliverable (section 11, head-start and neutrality numbers), infant guide as a parallel track with baby-sign market and trademark notes.

## 2026-07-05: v0 study kickoff

What: dated PLAN.md snapshot to backups/, scaffolding docs (STATUS, LOG, DECISIONS, QUESTIONS, EDGECASES), starting the annotations-only v0 convergence study per PLAN.md section 12 phase 2.

Why annotations first: several lexical databases already carry human phonological coding. Any convergence result computable from those is cheaper and more trustworthy than CV output, and calibrates the CV pipeline later.

Machine constraints noted: 2 cores, 8 GB RAM, ~21 GB free disk. Fine for tables and keypoints; never store bulk video here (stream, extract, drop).

Next: data source survey with live fetch attempts, logged to data/SOURCES.md. Then ASL-LEX ingest and first sanity analyses.

## 2026-07-05: ASL-LEX ingested, first real numbers; source survey done

What: downloaded ASL-LEX 2.0 (2,723 signs, full phonological coding) via the OSF API; wrote analysis/asl_lex_summary.py (stdlib only, per D003); probed Global Signbank, Dicta-Sign, BSL/Auslan signbanks, STS, Spread the Sign. Inventory with access details: data/SOURCES.md.

Findings that matter (full tables in data/derived/asl-lex-summary.md):

- Handshapes: ASL uses 58, but the top 20 cover 79% of the lexicon. A ~20-handshape designed inventory forfeits little. Fed into D006.
- One-handed degradation is cheaper than feared: 39% of ASL signs are already one-handed and another 35% are symmetrical (one hand can carry them); only the asymmetric quarter needs designed variants. Natural ASL also carries ~3% dominance/symmetry-violating signs; a designed language carries zero.
- Location: neutral space + head = 63%; below-waist locations essentially absent in nature, so the webcam-framing requirement costs nothing.
- Movement inventory is small: straight/curved/none = 84% of signs.
- Iconicity: mean 3.1 of 7; ~28-30% of signs rate >= 4. Deaf and hearing raters correlate at r = 0.817, so iconicity judgments are stable across hearing status (within US culture; Q2 caveat unchanged). The iconicity discount is real but covers maybe a third of a lexicon, matching PLAN.md section 3's "smaller than optimists assume".
- Fingerspelled loan signs are only 1.4% of ASL-LEX: dropping fingerspelling from the core costs almost no lexicon. Its real function (names, out-of-vocabulary) still needs the sign-name convention (EDGECASES).

Source survey headlines: Global Signbank's NGT dataset is CC BY 4.0 with ~7,469 glosses but bulk export needs a free researcher account (operator step written in data/SOURCES.md). Dicta-Sign is the sleeper: 2,092 concepts with parallel videos in BSL/DGS/LSF/GSL at predictable URLs, concept-aligned already, ideal first corpus for the pose pipeline. No machine-readable coding found yet for any non-Western sign language; the neutrality gap (Q3) is real and needs active hunting.

Surprise: ASL-LEX CSV is latin-1 encoded, not utf-8.

Next, in order:

1. Draft the 500-concept list v0 (D001 hybrid recipe): structure the file with sense IDs, tags (swadesh/contact/frequency source), and bias annotations per Q3.
2. Operator unblocks Global Signbank export (see SOURCES.md); then the same summary analysis runs on NGT for the first cross-language handshape/location comparison.
3. Pose pipeline proof on a handful of Dicta-Sign videos (phase 1, PLAN.md section 9): mediapipe needs a pinned python lane first (D003 revisit: this is the point where stdlib stops sufficing).
4. Hunt for CSL/JSL/IPSL machine-readable sources.

Addendum, same day: drafted the survival-core-50 candidate list (data/concepts-survival50-draft.tsv, register entry D008) with sense IDs and domain/audience tags, so guessability testing and the infant guide have a concrete object to argue about. The full 500-concept list stays next; its recipe is D001.

## 2026-07-09: pose pipeline proven; python lane wired; methods draft

What: wired the pinned python lane (mise python 3.12 + uv, mediapipe 0.10.35 in uv.lock) per D003's "when stdlib stops sufficing". Wrote pipeline/extract_pose.py (MediaPipe HolisticLandmarker, Tasks API; the legacy mp.solutions API is gone in 0.10.35). Smoke-tested on four parallel Dicta-Sign videos of concept 100 (BSL/DGS/LSF/GSL).

Results: pose detected 99-100% of frames, hands 45-100% (the 45% is a one-handed sign; correct rejection). Near real-time on this 2-core VPS: 401 frames in 21s. Full Dicta-Sign corpus is a ~15h background job (Q5 updated). Keypoints land in data/derived/poses/*.json.gz.

Also: started notes/methods.md, the accreting citable-prose companion to this log, after the operator asked whether documentation would support a conference presentation. Rule added there: every completed work item gets its methods paragraph in the same or next save. Q8 added to QUESTIONS.md recording the honest expected-value and legitimacy assessment.

Surprise: mediapipe 0.10.35 removed mp.solutions entirely; HolisticLandmarker Tasks API + explicit model download replaced it. Model bundle (13 MB) is gitignored at data/models/, re-fetch URL in pipeline/extract_pose.py.

Next:

1. Full 500-concept list v0 (D001).
2. Normalization + DTW form distance over keypoints; validate on Dicta-Sign same-concept pairs before any cross-language claims.
3. Background job: extract poses for all 2,092 Dicta-Sign concepts x 4 languages (fetch, extract, delete video).
4. NGT export lands from operator -> replicate asl_lex_summary on NGT.

## 2026-07-09: Signbank ECVs ingested; NGT phonology scrape running

Operator downloaded all 31 Signbank exports; they turned out to be ECV files (ELAN controlled vocabularies): gloss id + gloss text + translations + stable per-gloss URL, no phonology. The Request Access flow for CSV exports 500s server-side (reported to their admin; fallback is emailing the Radboud CLS group if it stays broken).

What we got anyway:

- analysis/parse_ecv.py -> data/derived/signbank-glosses.tsv: 44,377 glosses across 26 datasets (test sets skipped), the concept-alignment backbone. Inventory: data/derived/signbank-inventory.md. Includes the full IS series (1959-2021), CSL Shanghai (2,241), ISL, JSL/KSL/TSL, Kata Kolok, VGT (16,928).
- Public NGT gloss pages DO render phonology (handedness, strong/weak handshape, location, contact, orientation, movement when set). NGT is CC BY 4.0, so polite scraping of public pages is license-clean. pipeline/scrape_ngt.py: 1 req/s, identifying UA, resumable JSONL out. Smoke test clean; full run (~7.5k pages, ~2h) launched in background -> data/derived/ngt-phonology.jsonl.

Why NGT phonology matters: it is the second fully-coded lexicon after ASL-LEX, from a different coding tradition, and unlocks the first cross-language handshape/location comparison plus the D007 mapping table.

Next (after the running items above): parse ngt-phonology.jsonl into the NGT equivalent of the ASL-LEX summary; gloss-match the IS_WFD1975 list against NGT/ASL for a first Gestuno-vs-nature look.

## 2026-07-09: pansign.org live path prepared

Operator registered pansign.org, A record to this server.
Built the deliberately-cheap progress page (site/build.py): one static HTML, the repo documents verbatim in collapsible monospace blocks plus a live numbers strip computed from data/derived/. Ship = regenerate + cp to ~/sites/pansign + curl sentinel (justfile wired; run = local preview on :8099). Page regenerates from the docs, so publishing stays current with zero extra writing. Files staged to ~/sites/pansign; Caddy vhost needs operator sudo (two commands handed over). Site is expendable by design; the eventual dictionary site replaces it wholesale.

NGT scrape note: their server is slow (~9s/gloss), so the full pass takes about a day, not 2h. Resumable, running.

## 2026-07-09 (night): dicta extraction launched; alignment floors; concepts v0

Operator confirmed pansign.org live (Caddy vhost added), sent the Radboud access email, went to sleep. Standing instruction: keep working, ship the site when progress lands.

Three things done:

1. **Dicta-Sign full extraction launched** (pipeline/dicta_run.py, background): all 2,092 concepts x 4 languages, fetch-extract-delete, resumable, misses recorded. Output data/derived/dicta-poses/.
2. **Gloss-level concept alignment** (analysis/gloss_match.py -> data/derived/gloss-match.md). Floors, not truths (exact normalized match). Usable: Gestuno-1975 shares 649 glosses with ASL-LEX, 500 with NTS, 465 with ISL, 348 with CSL Shanghai; the 1959->1971->1975->1979 IS series has strong internal continuity (255/247/125 with 1975). Flaw found: VGT and LSFB gloss in Dutch/French, so English exact-match reads ~zero; v2 must match on the translation fields split on commas, not just the primary gloss. Survival-50 coverage confirms the list is ordinary vocabulary: 46/50 in ASL-LEX.
3. **Concept list v0** (analysis/build_concepts.py -> data/concepts-v0.tsv): 520 concepts = survival-50 + Swadesh-100 + ASL-LEX top deaf-signer-frequency lemmas (fingerspelled loans excluded), provenance-tagged per entry.
Bias recorded (Q3): the only frequency signal is ASL. Curation pass (senses, domains for the frequency tranche, cultural blind spots) is open for operator and later deaf review; mechanical regeneration is safe, the file is derived.

Next: NGT scrape completes -> NGT phonology summary + D007 mapping table; dicta poses accumulate -> normalization + DTW metric development can start on partial data (a few hundred concepts suffice for method development).

## 2026-07-12 (night 2): baby-sign research distilled; PISL sourcing

Operator asks honored this session:

- **notes/baby-sign.md**: the requested distillation. Practitioner consensus (7-point method), honest evidence state (signs precede speech: solid; developmental claims: do not survive systematic review; real benefit is the 8-18mo communication gap), product landscape with prices (Baby Signs franchise, Signing Time $12.99/mo, Tiny Signs course, free-chart SEO plays), testimonial themes (fewer tantrums dominates; the PAIN-sign medical story recurs; week-6 quit cliff), and the Deaf-community appropriation critique with its sharp implication: our infant core is only defensible as the first 50 words of a real co-owned language, never as another fragment product. Feeds the infant guide and D008 (infant-first subset, PAIN gets design priority).
- **PISL**: Tomkins 1926 and Mallery 1881 identified on archive.org (identifiers in data/SOURCES.md); fetches throttled from this server, retry later. Extraction of Tomkins' ~800 sign descriptions is a good future task: out-of-family historical lexicon, and it carries the operator's framing (a sovereign signed auxiliary that actually worked across nations).

Background jobs mid-flight: NGT phonology 519/7,464 (their server is the bottleneck); dicta-poses 207 files so far, no misses logged.
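
On the DTW form distance flagged in the Next items: a minimal stdlib sketch of what such a metric could start from. It assumes each frame is a list of already-normalized `(x, y)` keypoints and that both sequences use the same keypoint layout; the function names, the mean-Euclidean frame cost, and the `n + m` length normalization are illustrative choices, not the project's settled metric.

```python
import math

def frame_distance(a, b):
    """Mean Euclidean distance between corresponding keypoints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost, divided by n + m so that
    sequences of different lengths stay roughly comparable (assumed choice)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

# Toy sequences with one keypoint per frame.
fast = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
slow = [[(0.0, 0.0)], [(0.0, 0.0)], [(1.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
shifted = [[(0.0, 1.0)], [(1.0, 1.0)], [(2.0, 1.0)]]

same_path_different_speed = dtw_distance(fast, slow)
same_speed_offset_path = dtw_distance(fast, shifted)
```

The toy pair `fast`/`slow` traces the same path at different speeds and should cost nothing, which is the property that makes DTW a candidate for comparing signers with different tempo.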

Docs review pass: PLAN/DECISIONS/QUESTIONS consistent; one addition worth making later, a PLAN sentence connecting PISL's sovereignty precedent to the section 1 argument (deferred; PLAN edits are operator-visible and it is 1am). Site shipped with updated numbers.
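
The gloss-match flaw recorded on 2026-07-09 (night) asks for a v2 that matches on comma-split translation fields as well as the primary gloss. A minimal sketch of that idea, assuming entries arrive as `(gloss, translations)` pairs; the function names and the normalization rule are hypothetical and not taken from analysis/gloss_match.py.

```python
import re

def normalize(text: str) -> str:
    """Casefold, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]+", " ", text.casefold())
    return " ".join(cleaned.split())

def match_keys(gloss: str, translations: str = "") -> set:
    """Primary gloss plus every comma-separated translation, normalized."""
    keys = {normalize(gloss)}
    keys.update(normalize(part) for part in translations.split(","))
    keys.discard("")  # empty translation fields contribute nothing
    return keys

def shared_keys(entries_a, entries_b) -> set:
    """Normalized keys present in both datasets."""
    keys_a = set().union(*(match_keys(g, t) for g, t in entries_a))
    keys_b = set().union(*(match_keys(g, t) for g, t in entries_b))
    return keys_a & keys_b

# Toy data: a Dutch-glossed dataset with English translations vs an English-glossed one.
dutch_glossed = [("HUIS", "house, home"), ("BOOM", "tree")]
english_glossed = [("HOUSE", ""), ("TREE", ""), ("CAT", "")]

primary_only = shared_keys([(g, "") for g, _ in dutch_glossed], english_glossed)
with_translations = shared_keys(dutch_glossed, english_glossed)
```

Matching on translations can over-match polysemous words, so counts from this variant are no longer strict floors in the way exact primary-gloss matches are.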
# Decision register

Every arbitrary or convention-level choice gets an entry the moment it is made, so the operator or deaf reviewers can revisit any of them later without archaeology.

Statuses:

- `open`: proceeding on the recorded default; cheap to change.
- `blocking`: work stops here until a human decides. Rework cost of guessing wrong is too high.
- `settled`: decided by operator or deaf review; date and rationale noted.
- `deferred`: not yet live; parked until its phase.

Rule of thumb: computational choices default to `open` (results can be recomputed), language-design choices that thousands of people would have to relearn lean `blocking` once we reach them.

---

## D001 `open` Concept list composition for the 500-word core

Options:

a. Classic Swadesh-style universal concepts. Comparable to prior literature, but built for historical linguistics, not daily use.
b. Corpus frequency from the sign corpora (BSL/Auslan/NGT). Reflects real signing, but only three Western languages have public frequency data.
c. Contact-situation vocabulary (travel, health, food, numbers, family) from use cases 3-5 in PLAN.md section 2.

Default: hybrid. Frequency-ranked union of (b) capped to concepts expressible cross-culturally, plus (c) coverage checklist, with (a) as a comparability subset tagged inside the list.

Rework cost: low; the list is an input file, analyses rerun.

## D002 `open` What counts as "the same sign" across languages (cognate threshold)

The head-start and convergence numbers depend on this line.

Options:

a. Strict: all four manual parameters match (handshape, location, movement, orientation) up to inventory binning.
b. Loose: handshape + location match, movement similar.
c. Graded: report similarity as a score, draw the "known/guessable/new" lines late, publish the thresholds with the scorecard.

Default: (c). Never bake a binary into the pipeline; thresholds are a reporting decision.

Rework cost: none by design.
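
D002's default (c) can be illustrated with a small sketch: the pipeline stores a graded score, and the "known/guessable/new" labels are applied only at report time with thresholds passed in as arguments. The equal parameter weights, the class and function names, and the threshold values 0.75 and 0.5 are placeholders for illustration, not project decisions.

```python
from dataclasses import dataclass

PARAMS = ("handshape", "location", "movement", "orientation")

@dataclass(frozen=True)
class SignCoding:
    """One sign's four manual parameters, already mapped to coarse bins."""
    handshape: str
    location: str
    movement: str
    orientation: str

def similarity(a: SignCoding, b: SignCoding) -> float:
    """Fraction of the four manual parameters that match (equal weights assumed)."""
    matches = sum(getattr(a, p) == getattr(b, p) for p in PARAMS)
    return matches / len(PARAMS)

def label(score: float, known: float = 0.75, guessable: float = 0.5) -> str:
    """Reporting-time labels; thresholds are arguments, never pipeline constants."""
    if score >= known:
        return "known"
    if score >= guessable:
        return "guessable"
    return "new"

# Toy codings that differ only in movement.
sign_a = SignCoding("flat", "chin", "straight", "palm-in")
sign_b = SignCoding("flat", "chin", "curved", "palm-in")
score = similarity(sign_a, sign_b)
```

Because only `score` is persisted, moving a threshold later changes the report and nothing upstream, which is the "rework cost: none by design" property.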
## D003 `open` Analysis toolchain

Options: plain python3 stdlib (zero deps, this VPS has 3.11) vs pinned uv + pandas/scipy lane.

Default: stdlib csv/json until something actually hurts, then wire the python lane properly in .mise.toml (pinned, per kit rules).

Rework: low.

## D004 `blocking-later` Bulk-fetching Spread the Sign video

Their videos are copyrighted; there is no public API. Analysis-only scraping is a terms-of-service and ethics call the operator must make (QUESTIONS Q1). Not needed for the v0 study, so not blocking today. Becomes blocking at phase 2 full breadth.

## D005 `deferred` Notation system: extend HamNoSys / extend SignWriting / new

Live at phase 3 (design spec). Research task first: what breaks in each when encoding must be unique, machine-parseable, and human-writable.

## D006 `open` Handshape inventory size

Battison's unmarked set is ~7; usable cores in natural SLs run 20-40.

First numbers (ASL-LEX 2.0, 2026-07-05, data/derived/asl-lex-summary.md): ASL uses 58 distinct handshapes; the top 10 cover 56% of the lexicon, top 20 cover 79%, top 25 cover 87%. ASL-LEX's own marked-handshape flag splits the lexicon almost exactly in half (1,373 marked / 1,350 unmarked), so a strictly unmarked core is a real departure from natural practice, paid for in a smaller sign space (fewer cheap contrasts).

Working default: target inventory around 20 handshapes, final call after the same curve is computed for NGT and at least one non-Western language.

Rework: medium.

## D008 `open` Survival-core-50 composition

Draft candidate list: data/concepts-survival50-draft.tsv (2026-07-05), tagged by domain and audience (infant/contact/swadesh). Chosen for: infant firsts (baby-sign literature's most-used set), contact situations (travel, health, commerce), and cross-cultural expressibility. Deliberately absent: anything script-, religion-, or cuisine-specific.
Known tension: "please" and politeness marking are not universal across cultures; may merge into one politeness sign or drop. Operator and later deaf reviewers should scan the list for cultural blind spots.

Rework cost: low now, high after guessability testing starts.

## D007 `open` Handshape equivalence across databases

ASL-LEX, Signbank datasets, and HamNoSys each use different handshape labels/granularity. Cross-language comparison needs one mapping table.

Options: map everything onto HamNoSys's fine-grained set and coarsen, or define our own coarse bins from the start.

Default: map to a coarse bin set we define (documented in data/handshape-bins.md when built), keeping source labels alongside so the mapping is revisable.

Rework: medium (mapping table edits force recompute, but recompute is cheap).
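
A sketch of how the D007 default and the D006 coverage curve could fit together: source labels are mapped to coarse bins through one lookup table, each record keeps its original label, and the top-k coverage is computed over bins. The table contents, source names, and labels below are invented for illustration; the real mapping is still to be built in data/handshape-bins.md.

```python
from collections import Counter

# Hypothetical (source, label) -> coarse bin table; every entry is illustrative.
BIN_TABLE = {
    ("asl-lex", "open_b"): "flat",
    ("ngt", "B"): "flat",
    ("asl-lex", "5"): "spread",
    ("ngt", "5"): "spread",
    ("asl-lex", "1"): "index",
    ("ngt", "1"): "index",
}

def annotate(source, label):
    """Return (source, label, bin), keeping the source label next to its bin."""
    return (source, label, BIN_TABLE.get((source, label), "UNMAPPED"))

def coverage_curve(bins):
    """Cumulative share of signs covered by the top-1, top-2, ... bins."""
    counts = Counter(bins)
    total = sum(counts.values())
    curve, running = [], 0
    for _, n in counts.most_common():
        running += n
        curve.append(running / total)
    return curve

# Toy input: eight coded signs from two databases.
rows = [
    ("asl-lex", "open_b"), ("ngt", "B"), ("ngt", "B"), ("asl-lex", "open_b"),
    ("asl-lex", "5"), ("ngt", "5"), ("asl-lex", "1"), ("ngt", "1"),
]
annotated = [annotate(source, label) for source, label in rows]
curve = coverage_curve(b for _, _, b in annotated)
```

Unknown labels fall into an explicit `UNMAPPED` bin instead of being dropped, so gaps in the table show up in the curve and can be fixed by editing the table and recomputing.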
# Open questions and concerns

Not decisions (those go in DECISIONS.md); worries, unknowns, and things that need a lawyer, a linguist, or a deaf reviewer eventually.

## Q1 Legal: are extracted keypoints derivative works?

The input pipeline turns copyrighted dictionary video into keypoint series and parameter codes. We treat those as internal analysis data and never republish media, but whether pose data legally counts as a derivative work is untested ground. Needs a real opinion before anything built on scraped video publishes. (Related: D004.)

## Q2 ASL-LEX iconicity ratings are culture-bound

ASL-LEX's iconicity scores were collected from hearing American raters. Using them as "universal guessability" would smuggle in exactly the iconicity chauvinism PLAN.md section 7 warns about. Use them as one signal, never ground truth; real guessability testing (check 5) stays mandatory.

## Q3 Frequency data exists for very few sign languages

BSL, Auslan, NGT have corpus frequency lists; most languages have none. The 500-concept list will be frequency-weighted toward Western languages no matter what we do at v0. Mitigation: tag the bias in the list itself, revisit when more corpora surface.

## Q4 Spread the Sign coverage is uneven

Some languages have full 15k+ vocabularies on STS, others a few hundred entries of varying production quality. Head-start scores for thin languages will have wide error bars; publish per-language sample sizes with the scorecard.

## Q5 VPS resources

2 cores / 8 GB / ~21 GB free. Annotation tables and keypoints fit easily. Measured 2026-07-09: HolisticLandmarker runs near real-time on this CPU (401 frames in 21s including model startup). The full Dicta-Sign corpus (2,092 concepts x 4 languages, ~12h of video) is a ~15h background job, completely feasible here. GPU rental only becomes relevant at Spread the Sign breadth (hundreds of hours of video). Never store bulk video on this disk: fetch, extract, delete.
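
The Q5 estimate can be reproduced from the measured throughput with a few lines of arithmetic. The source frame rate of 25 fps is an assumption (the docs do not state it); the other numbers come from the entry above.

```python
MEASURED_FRAMES = 401      # from the smoke test
MEASURED_SECONDS = 21      # includes model startup, so throughput is understated
VIDEO_HOURS = 12           # approximate Dicta-Sign corpus length
ASSUMED_SOURCE_FPS = 25    # assumption: not stated anywhere in the docs

throughput_fps = MEASURED_FRAMES / MEASURED_SECONDS
total_frames = VIDEO_HOURS * 3600 * ASSUMED_SOURCE_FPS
job_hours = total_frames / throughput_fps / 3600
```

Under that assumption the job comes out between 15 and 16 hours, in line with the "~15h" figure; a 30 fps source would push it closer to 19 hours.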
## Q6 Deafblind users

Protactile and tactile signing exist as their own modality. A language optimized for visual discriminability is not automatically tactile-friendly. At minimum: do not design core signs that are tactile-hostile when a tactile-neutral variant exists. Needs expert input at phase 3. (Also listed in EDGECASES.)

## Q8 Expected value and legitimacy (the "is this even a good idea" question)

Asked by the operator 2026-07-09, answered honestly: the language itself catching on is a low-probability outcome (order 5-15% even for the IS niche); the research byproducts (convergence study, similarity metric, open corpus tooling, notation) are near-certainly useful to sign linguistics and accessibility tech regardless. The project is structured so its expected value does not depend on adoption.

Legitimacy is the harder constraint: a hearing-side project has roughly zero standing to ship a language; phase 3 is a genuine go/no-go gate where the honest outcomes include "hand the whole thing to a deaf-led organization" and "stop, publish the data, done." Written down so nobody later pretends the plan promised more.

## Q7 Left-handed signers

Natural SLs let signers mirror; dominance is signer-relative, not absolute. Our encoding must express handedness relatively (dominant / non-dominant, never left/right) or half the videos will "fail" the recording lint. Cheap to get right now, expensive later.
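
A minimal sketch of the Q7 rule applied to pose data: per-frame hands are relabeled as dominant / non-dominant, and left-dominant signers are mirrored so both kinds of signer land in the same coordinate convention. It assumes x coordinates normalized to [0, 1] and that the signer's dominant side is known per recording; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y), x assumed normalized to [0, 1]

@dataclass
class Frame:
    left_hand: Optional[List[Point]]
    right_hand: Optional[List[Point]]

@dataclass
class SignerRelativeFrame:
    dominant: Optional[List[Point]]
    non_dominant: Optional[List[Point]]

def mirror(points: Optional[List[Point]]) -> Optional[List[Point]]:
    """Flip x around the vertical centre line; a missing hand stays missing."""
    if points is None:
        return None
    return [(1.0 - x, y) for x, y in points]

def to_signer_relative(frame: Frame, dominant_side: str) -> SignerRelativeFrame:
    """Relabel hands by dominance; mirror left-dominant signers."""
    if dominant_side == "right":
        return SignerRelativeFrame(frame.right_hand, frame.left_hand)
    if dominant_side == "left":
        return SignerRelativeFrame(mirror(frame.left_hand), mirror(frame.right_hand))
    raise ValueError(f"dominant_side must be 'left' or 'right', got {dominant_side!r}")

# A right-handed and a mirrored left-handed production of the same one-handed sign.
right_handed = to_signer_relative(Frame(None, [(0.75, 0.5)]), "right")
left_handed = to_signer_relative(Frame([(0.25, 0.5)], None), "left")
```

After normalization the two toy productions are identical, so a lint or distance metric downstream never sees left versus right at all.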
# Edge and corner cases

Things the language and its tooling must eventually handle. Each gets a short entry when discovered; promote to PLAN.md or DECISIONS.md when it starts driving design. Sources: known sign-linguistics issues plus whatever the data turns up.

## Signers

- Left-handed / mirrored signing: handedness is signer-relative. Encoding uses dominant/non-dominant, never left/right. (QUESTIONS Q7)
- One-handed contexts and one-handed people: every core sign carries a defined one-handed variant (PLAN.md requirement). The variant is part of the sign's entry, not folklore.
- Deafblind / tactile signing: visual-optimal is not tactile-optimal. Flag tactile-hostile candidates during design. (QUESTIONS Q6)
- Children's motor limits: infant guide (50-sign core) must avoid handshapes infants cannot form; baby-sign literature documents typical first-year approximations. The core should survive being mangled by eight-month-old hands and still be readable.
- Seated signers, wheelchair users: location parameters that assume standing-height sight lines or torso-length movement paths are bugs.
- Limited facial mobility (stroke, Moebius syndrome, botox, veils): grammar that lives exclusively on the face needs manual fallbacks. Also relevant: face-covering norms in parts of the world.

## Environments

- Video calls: framing cuts the signing space at the sternum and the camera mirrors. Core signs should read inside a webcam crop. One more reason location contrasts below the waist are banned.
- Distance and low light: legibility at 20m outdoors is a design requirement; test it, don't assume it.
- Encumbered signing: holding a phone, a rail, a baby. Overlaps with the one-handed variant requirement.

## Content

- Cultural gesture collisions: any designed sign must be screened against offensive emblem gestures worldwide (thumbs-up, OK-ring, horns, palm-out moutza, left-hand taboos, pointing taboos).
  Build a screening checklist from the gesture-studies literature before phase 3 sign design; this is a check, like typos but for obscenity.
- Sign names for people and places: no fingerspelling in the core, so the convention for naming needs design (descriptive sign names are the natural-SL norm; collision handling is the open part).
- Numbers, dates, time: numeral systems vary wildly across SLs and are a known IS pain point. Treat as its own mini-study within the 500.
- Taboo and medical vocabulary: contact register needs body/health terms that are neither euphemistic to uselessness nor crude.
- Loanwords from national SLs: when a local sign is ubiquitous (e.g. a city's name sign), pansign should borrow, not compete. Borrowing rules are a phase 3 design item.

## Data and tooling

- Same gloss, different senses: dictionary glosses are spoken-language words; "right" (direction) and "right" (correct) must not merge. Concept list is defined by sense IDs, not English strings.
- One concept, several regional variants inside one language (ASL has multiple signs for BIRTHDAY): a language contributes its most frequent variant to convergence scoring, others recorded as alternates.
- Compound signs and multi-sign phrases in dictionaries: normalize before comparison or convergence scores inflate.
- Non-manual-only signs and mouthing-dependent signs: our core bans meaning that lives only in mouthing (it imports spoken language), but the data contains such signs; tag them on ingest.
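
The first two "Data and tooling" cases can be made concrete with a small data-structure sketch: concepts are keyed by sense ID so homographs never merge, and each language contributes its most frequent variant to scoring while the rest are kept as alternates. The sense-ID format, form IDs, and frequency counts are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Variant:
    form_id: str     # hypothetical identifier of one recorded sign form
    frequency: int   # hypothetical corpus count

@dataclass
class Concept:
    sense_id: str        # the key; never the English string
    english_hint: str    # human-readable only, may collide across senses
    variants: Dict[str, List[Variant]] = field(default_factory=dict)

    def scoring_variant(self, language: str) -> Optional[Variant]:
        """Most frequent variant for a language, or None if none is recorded."""
        return max(self.variants.get(language, []),
                   key=lambda v: v.frequency, default=None)

    def alternates(self, language: str) -> List[Variant]:
        """Every recorded variant except the one used for scoring."""
        chosen = self.scoring_variant(language)
        return [v for v in self.variants.get(language, []) if v is not chosen]

concepts = {c.sense_id: c for c in [
    Concept("right.direction", "right"),
    Concept("right.correct", "right"),
    Concept("birthday", "birthday",
            {"asl": [Variant("BIRTHDAY-a", 12), Variant("BIRTHDAY-b", 30)]}),
]}
birthday = concepts["birthday"]
```

Keying the dictionary on `sense_id` means the two "right" entries coexist even though their English hints are identical.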
# pansign: designing one sign language for the world
**Naming, settled 2026-07-05.** pansign is the project's name; pansign.org
to be registered (pansign.com is squatted). The language's true name will
be a sign its users choose once there are users, the way Nicaraguan Sign
Language was named after it existed; the written word is transliteration.
**koine** is reserved as a product name, most likely the learning app: the
Greek common dialect, which is the concept exactly. Infant-facing product
names must clear a crowded trademark field first (section 10).
The goal: a signed language that any two people on earth could
share. Deaf-first, but learnable as a first non-spoken language by hearing
people everywhere. This document is the 1000-mile view: what the language is
for, what to copy, what to invent, how to bootstrap it, and where projects
like this have died before.
## 1. Why this is more plausible than Esperanto
Esperanto failed its ambition because it competed with English for a slot
everyone already fills. A universal sign language competes with nothing for
most of humanity: hearing people have no signed language at all, so it would
be their first, not their third. And the deaf world already behaves as if
this language wants to exist:
- **Sign languages are far more mutually intelligible than spoken ones.**
Deaf strangers with no shared language converge on working communication
in hours, not years. The cross-signing studies out of MPI Nijmegen
document this happening in real time.
- **International Sign (IS) already exists** as a contact register used at
WFD congresses and the Deaflympics. It is a pidgin, not a full language,
but it proves the demand and supplies a partial base.
- **Plains Indian Sign Language** served for centuries as a lingua franca
across dozens of unrelated spoken languages. A signed interlanguage
across language families is not hypothetical; it has already happened.
- **Iconicity is a real discount.** A large fraction of sign vocabulary is
motivated (the sign for "drink" looks like drinking). Iconic signs are
learned faster by both deaf and hearing learners. Spoken conlangs get no
such discount; every word is arbitrary.
The honest framing: this is less "invent a language from nothing" and more
"standardize and complete a convergence that keeps happening spontaneously."
## 2. What a sign language is actually used for
Design for the real use cases, ranked:
1. **Daily life of deaf people**: full expressive range. Argument, humor,
poetry, technical talk, child-rearing. This is the bar for a real
language; everything below is a subset.
2. **Deaf-to-deaf across borders**: today served badly by IS. The first
market, because the users are skilled signers who only need vocabulary
and conventions, not the modality.
3. **Deaf-hearing contact**: shops, hospitals, airports, family members who
never get past 200 signs. A small core must carry this alone.
4. **Hearing-to-hearing where speech fails**: noise, distance, underwater,
quiet required, across a window, on a video call with broken audio.
Divers, crane operators, and traders all invented ad hoc sign systems;
the demand exists.
5. **Parent-infant**: baby sign is already a mass hearing market. Babies
sign before they can speak. This is the strangest and maybe strongest
adoption vector: the first words of millions of hearing children could
be in this language.
6. **Human-machine**: gesture interfaces and sign-recognition ML want a
phonology that cameras can discriminate. No natural sign language was
designed with this constraint; a new one can be.
Requirement that falls out of the ranking: the language must degrade
gracefully. A 50-sign user, a 500-sign user, and a fluent signer must all be
speaking the same language, not three systems.
## 3. Assumptions
Stated so they can be attacked:
- ~70M deaf people, ~150-300 documented sign languages, most tiny and many
endangered. The large ones cluster into families (LSF lineage including
ASL and Libras, BANZSL, German, Japanese, Chinese, Indo-Pakistani).
- Sign language grammar is convergent to a degree spoken grammar is not:
spatial verb agreement, classifier constructions, topic-comment order,
and grammatical facial expressions recur across unrelated families. Not
universal, but common enough to be the default choices.
- Iconicity is partly cultural. "Eat" and "sleep" travel; "marriage,"
"time," and anything metaphorical do not. The shared core is real but
smaller than optimists assume; the 500-word study (section 5) measures it
instead of guessing.
- A committee cannot finish a language. It can ship a seed; only a
community of users, especially children, turns a seed into a language.
Nicaraguan Sign Language emerged in one generation of schoolchildren from
fragmentary input. Children will regularize whatever we ship, which means
the seed needs good bones more than complete coverage.
- Chinese is not a special obstacle. Chinese Sign Language is unrelated to
the spoken language's difficulty; deaf signers in China face the same
modality and the same iconic affordances as everyone else. The actual
China problem is fingerspelling (CSL uses pinyin-based handshapes), which
argues for keeping fingerspelling marginal (section 6).
## 4. Requirements
- **Deaf-first.** Fluent deaf signers must find it expressive and fast, or
it is a gadget. Hearing learnability is a constraint, never the driver.
- **Auxiliary, not replacement.** It sits beside national sign languages
the way English sits beside Dutch. The "replace any given sign language"
goal from the brief is dropped deliberately; see pitfalls.
- **Motorically cheap.** Restrict the core lexicon to unmarked handshapes
(the fist, flat hand, index point, spread hand, C and O shapes that
appear in every studied sign language and that children acquire first).
- **Visually discriminable.** No minimal pairs that differ only in features
a webcam at 15fps or a viewer at 20 meters loses. This single constraint
serves distance signing, video calls, and machine recognition at once.
- **One-handed degradation.** Every core sign needs a defined one-handed
variant: people hold phones, coffee, babies, and steering wheels, and
some people have one hand.
- **Script-free.** No fingerspelled alphabet in the core. Fingerspelling
imports a writing system (Latin for ASL, pinyin for CSL, kana tracing for
JSL) and instantly de-internationalizes the language. Names get sign
names; borrowings get signs.
- **Notation from day one.** A written/encodable form (successor to
SignWriting or HamNoSys, machine-parseable) so the dictionary, corpus,
and tooling exist before the community does.
- **Open.** Public domain lexicon, open corpus, open governance. A language
with an owner is dead on arrival.
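The visual-discriminability requirement can be checked mechanically once signs are coded. The following is a minimal sketch, not project code: it assumes each sign is a dict over the five standard parameters, and the `FRAGILE` set (which parameters a low-frame-rate camera or a distant viewer loses) is a hypothetical placeholder to be replaced by measured data.

```python
from itertools import combinations

PARAMETERS = ("handshape", "location", "movement", "orientation", "nonmanual")

# Hypothetical: parameters assumed to be lost at 15fps or at 20 meters.
FRAGILE = frozenset({"orientation", "nonmanual"})


def differing_parameters(a, b):
    """Set of parameters on which two coded signs disagree."""
    return {p for p in PARAMETERS if a[p] != b[p]}


def fragile_pairs(lexicon, fragile=FRAGILE):
    """List (gloss_a, gloss_b, differing) for every pair of signs that is
    told apart only by fragile parameters. Pairs with no difference at all
    (accidental homophones) are flagged too, with an empty list."""
    flagged = []
    for (gloss_a, a), (gloss_b, b) in combinations(sorted(lexicon.items()), 2):
        diff = differing_parameters(a, b)
        if diff <= fragile:
            flagged.append((gloss_a, gloss_b, sorted(diff)))
    return flagged
```

Run over the whole core on every dictionary change, an empty result means no minimal pair depends solely on a feature the worst-case viewer cannot see.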
## 5. Method: the 500-word convergence study
The empirical heart, and cheap to run now:
1. Build a signed Swadesh-style list, ~500 concepts, weighted for actual
conversational frequency (sign corpora exist for BSL, Auslan, NGT) plus
the contact-situation vocabulary of use cases 3-5.
2. Pull the same 500 concepts from **Spread the Sign** (video dictionary,
40+ sign languages) and the Global Signbank datasets. This corpus
already exists; nobody has mined it for exactly this.
3. Code each entry by the standard phonological parameters: handshape,
location, movement, orientation, non-manual. Cluster. For each concept,
measure: is there a cross-family majority form? A shared iconic motif
with surface variation? Or true divergence?
4. Triage the lexicon accordingly:
- **Convergent** (expect maybe 20-30%): adopt the majority form,
regularized to the constrained phonology. Free vocabulary.
- **Shared motif**: design one sign that keeps the common iconic base.
Cheap vocabulary; most learners get a mnemonic for free.
- **Divergent**: greenfield. Design for iconicity where a culture-neutral
image exists, otherwise optimize for articulation and discriminability.
5. Validate every design decision with deaf consultants from at least five
of the major families before it enters the dictionary, and test candidate
signs in staged cross-signing sessions: put strangers in a room with the
seed lexicon and record what survives contact. What people actually
reproduce is the spec; the dictionary follows usage, not vice versa.
Same procedure later for grammar, run on typological literature instead of
video dictionaries: where sign languages agree (spatial agreement,
classifiers, brow-marked questions, topic-comment), take the common
solution. Where they diverge (negation strategies, word order details),
pick the simplest system consistent with the modality and let usage sand it
down.
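The triage in step 4 can be sketched as a vote counted by family rather than by language, so that a well-documented family does not outvote the rest. This is an illustration under stated assumptions: the `threshold` of one half, the tuple coding of forms, and the motif labels are all hypothetical, and real clustering would compare forms by distance rather than exact equality.

```python
from collections import defaultdict


def family_share(values_by_language, family_of):
    """Return (value, share): the value attested in the most families and
    the fraction of represented families that attest it."""
    families_by_value = defaultdict(set)
    all_families = set()
    for language, value in values_by_language.items():
        family = family_of[language]
        all_families.add(family)
        families_by_value[value].add(family)
    # Sort first so ties are broken deterministically.
    best = max(sorted(families_by_value), key=lambda v: len(families_by_value[v]))
    return best, len(families_by_value[best]) / len(all_families)


def triage(forms, motifs, family_of, threshold=0.5):
    """Classify one concept as convergent, motif, or divergent."""
    form, form_share = family_share(forms, family_of)
    if form_share > threshold:
        return "convergent", form
    motif, motif_share = family_share(motifs, family_of)
    if motif_share > threshold:
        return "motif", motif
    return "divergent", None
```

Counting families instead of languages is one possible answer to the familiarity-versus-neutrality tradeoff; weighting toward under-documented languages would replace the plain count with a weighted one.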
## 6. Copy versus greenfield
Copying is the shortcut; greenfield is the once-only chance to apply some
sixty years of sign linguistics on purpose. The split:
**Copy (convention is the value):**
- Spatial grammar: setting up referents in signing space and directing
verbs between them. Every family does this; it is the modality's killer
feature and costs hearing learners the most, so it must match what deaf
signers already know.
- Classifier constructions: near-universal, productive, and the reason
signers can improvise vocabulary that strangers understand.
- Non-manual grammar: brow raise for yes/no questions, furrow for
wh-questions, headshake negation. Common enough to be de facto standard.
- The convergent lexicon from the 500-word study, and IS conventions with
real currency (numbers, pointing pronouns, time-line metaphors).
**Greenfield (nature never optimized this):**
- Phoneme inventory: chosen for motor ease, visual distance between
phonemes, camera framing (signing space biased high and central for
selfie cameras), and symmetric one-handed fallback. Natural sign
languages carry marked handshapes and sub-visible contrasts as
historical baggage; drop it.
- Morphological regularity: fully regular aspect, plural, and agreement
paradigms. Irregularity is history's scar tissue; a seed language starts
without scars (children would strip them anyway; see Nicaragua).
- The notation system and a canonical machine encoding, co-designed with
the phonology so every legal sign has exactly one encoding. This makes
the dictionary diffable, the corpus searchable, and sign-recognition
training data self-labeling.
- Register plumbing for the degradation requirement: the 50-sign survival
core, the 500-sign contact register, and full grammar designed as
concentric circles, each a strict subset of the next.
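The concentric-register requirement is easy to lint. A minimal sketch, assuming each register is available as a collection of sign identifiers (the function name and message wording are illustrative only):

```python
def register_problems(survival, contact, full):
    """Return a list of human-readable problems; an empty list means the
    three registers are concentric, each a strict subset of the next."""
    problems = []
    layers = (
        ("survival", set(survival), "contact", set(contact)),
        ("contact", set(contact), "full", set(full)),
    )
    for inner_name, inner, outer_name, outer in layers:
        leaked = sorted(inner - outer)
        if leaked:
            problems.append(
                f"{inner_name} signs missing from {outer_name}: {leaked}"
            )
        elif inner == outer:
            problems.append(
                f"{inner_name} is not a strict subset of {outer_name}"
            )
    return problems
```

Running this on every dictionary change keeps a sign from being added to the survival core without also entering the contact register.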
## 7. Pitfalls
- **Gestuno (1975).** The WFD's committee published ~1500 signs as a book:
vocabulary with no grammar, no community, no media, chosen by committee
aesthetics rather than convergence data. Interpreters were handed it at
congresses and audiences understood nothing. It quietly became today's
IS only after deaf users discarded most of it and rebuilt by contact.
Lesson: ship a usable register into a live community fast; a dictionary
alone is a tombstone.
- **The replacement framing is poison.** Sign languages are the core of
Deaf identity, and the community carries the memory of Milan 1880, when
hearing educators banned sign from deaf schools for a century. A hearing-
led project to "replace national sign languages" would be received,
correctly, as that again. Auxiliary framing, deaf leadership, and WFD
partnership are not diplomacy; they are the difference between a
language and an insult.
- **Iconicity chauvinism.** Designers reach for their own culture's image
of a concept and call it universal. Every "obvious" sign needs testing
against consultants from unrelated cultures.
- **The prestige trap.** If the auxiliary language gets institutional money
and national sign languages stay poor, it becomes a threat to already
endangered languages, and it should be opposed. Budget and advocacy must
visibly flow to national sign languages alongside it.
- **Nativization drift.** If babies do acquire it natively in scattered
communities, they will change it, and it may dialectize like everything
else. Plan for a living standard with versioned reference corpora, not a
frozen academy; the goal is mutual intelligibility, not purity.
- **Committee capture.** Every conlang community fractures over reform
proposals (Esperanto vs Ido). Governance design (who can change the core,
how slowly) matters as much as phonology.
## 8. Tradeoffs
- **Iconicity vs compactness.** Iconic signs are learnable but tend to be
long and pantomimic; fluent languages compress. Resolution: iconic citation
forms with documented reduced forms, mirroring what natural sign
languages do diachronically anyway.
- **Deaf optimality vs hearing learnability.** Full spatial grammar is hard
for hearing adults. Resolution: the concentric registers. The contact
register is honest linear signing that fluent grammar contains as a
degenerate case; hearing learners are signing correctly, just plainly.
- **Familiarity vs neutrality.** Leaning on convergence data means leaning
on the big documented families, which are mostly European-descended, and
ASL-heavy IS already gets resented for this. Resolution: weight the study
corpus toward under-documented languages deliberately, and count the
cost of neutrality as real (a maximally neutral sign is new to everyone).
- **Ship early vs ship right.** Gestuno shipped wrong; academic projects
ship never. Resolution: the 500-sign contact register ships early and
loose (it will be reshaped by use); the full grammar ships conservative
and slow.
## 9. Media pipeline: video in, visuals out
The study consumes thousands of dictionary videos and the project must emit
reference visuals of its own. Neither direction needs generative models.
**Input: video to data.**
- Pose estimation is the workhorse. MediaPipe Holistic or MMPose turns each
video into a keypoint time series (21 landmarks per hand, plus body and
face). Mature, free, runs locally on this class of hardware.
- Auto-coding on top of keypoints: classify handshape against the
inventory, bucket location into zones relative to body landmarks,
classify the movement trajectory, detect handedness and symmetry. Output
is the parameter vector per sign that section 5's clustering needs. The
machine does the bulk coding; humans verify clusters, which is two
orders of magnitude less labor than humans coding raw video.
- The data is less "entirely visual" than it looks. Decades of annotation
already exist in machine-readable form and get ingested before any CV
runs: ASL-LEX codes ~2700 ASL signs for phonological features, frequency,
and crucially iconicity ratings; the Signbank family (Global/NGT, ASL,
BSL, Auslan) runs the same software with per-entry handshape and location
fields, so cross-language joins are cheap; the national corpora carry
ELAN gloss annotations. CV fills the gaps and the overlap with human
coding calibrates the auto-coder for free.
- Comparison runs at two levels. Form distance: normalize keypoint
trajectories for signer body size and framing, then dynamic time warping
gives sign-to-sign visual similarity directly from video. Phonological
distance: compare the coded parameter vectors, which is more robust to
signer idiosyncrasy and is what the triage clusters on. Disagreement
between the two levels is itself informative (same idea, different
surface, or vice versa).
- The similarity metric has a built-in oracle: it must rediscover the known
sign language families blind. ASL close to LSF and Libras, far from BSL;
BSL clustering with Auslan and NZSL. Woodward's hand-done
lexicostatistics from the 1970s-90s is the answer key. A metric that
cannot reproduce known genealogy has no business triaging the lexicon.
- Licensing is an input check, not an afterthought. Spread the Sign videos
are copyrighted: extracted keypoints and derived codes are internal
analysis data; no third-party media is ever republished. Everything the
project records itself ships under an open license with signed releases.
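The form-distance level described above can be sketched with textbook dynamic time warping in pure Python. This is an illustration, not the pipeline's metric: it assumes each video is already a list of frames, each frame a flat list of normalized keypoint coordinates, and the division by combined length is one common normalization choice among several.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW cost between two keypoint trajectories.

    Each sequence is a list of frames; each frame is a flat list of floats
    of the same length. Frames are compared by Euclidean distance.
    """
    if not seq_a or not seq_b:
        raise ValueError("both sequences need at least one frame")
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i and first j frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # frame of seq_a repeated
                cost[i][j - 1],      # frame of seq_b repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m] / (n + m)
```

Because warping absorbs differences in signing speed, the same trajectory signed at half speed scores zero. The family-rediscovery oracle would then cluster languages on the matrix of mean per-concept distances and compare the result with Woodward's tables.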
**Output: encoding to visuals, a ladder from cheap to gold.**
- The canonical form of every sign is its machine encoding, never a video.
All visuals are renderings of the encoding and can be regenerated when
the encoding changes.
- Avatar renders for drafts: compile the encoding to keypoint choreography
and drive a rigged 3D model (prior art: JASigning animates HamNoSys via
SiGML; same idea, saner encoding). Deaf audiences dislike robotic
avatars, rightly, so avatars are internal proofs and previews only.
- Still illustrations: pull key frames from the reference video or the
avatar and reduce to the classic dictionary style, line drawing plus
movement arrows. Edge-detection gets most of the way mechanically; one
paid illustrator defines the house style over the 500 core.
- Human video is the published standard: paid deaf signers on camera.
Putting deaf faces on the reference material is simultaneously quality
control and legitimacy; hearing-produced reference media gets rejected, and
should.
- Crowdsourcing sits on top of a lint. Anyone may submit a recording for
any sign; the input pipeline extracts its keypoints and scores the match
against the canonical encoding; only passing takes reach the paid deaf
review queue; accepted takes become credited alternate reference videos.
Two loops make the pipeline self-checking:
- **Recording lint**: the same pose pipeline that codes existing
dictionaries validates every video we produce. One oracle, both
directions.
- **Round trip**: encode, render on the avatar, and have signers read the
sign back cold. A sign that fails readback has an ambiguous encoding or
an unlearnable form, and it fails before money is spent filming it.
**Checks per stage** (each stage has an oracle before the next starts):
1. Ingest: coverage matrix, concepts by languages, missing cells counted.
2. Pose extraction: landmark confidence thresholds; manual audit of a
random sample of videos.
3. Auto-coding: agreement with a 200-sign hand-coded gold sample (deaf
annotator). Under ~90% on handshape or location, fix the coder before
trusting any cluster downstream.
4. Clustering: stability under resampling; consultant review of cluster
assignments.
5. Seed design: guessability testing, the standard iconicity methodology.
Show the sign cold, ask for the meaning, open response then forced
choice. Divergent-tier signs need a floor score to enter the dictionary.
6. Recordings: lint score, then deaf reviewer approval.
7. Published dictionary: per-sign feedback widget and variant submission
on the website; submission volume per sign is itself a signal of which
forms are contested.
8. Field: cross-signing trial performance against the IS baseline, and
week-later recall from learners (spaced-repetition telemetry doubles as
the learnability metric).
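Check 3 reduces to a per-field agreement rate against the gold sample. A minimal sketch, assuming both codings are dicts keyed by gloss with one value per field; it reports raw percent agreement, and a chance-corrected statistic such as Cohen's kappa may be the better bar once handshape frequencies are known to be skewed.

```python
def agreement(auto, gold, fields=("handshape", "location")):
    """Per-field share of gold-sample signs where the auto-coder matches."""
    shared = sorted(set(auto) & set(gold))
    if not shared:
        raise ValueError("no signs in common between auto and gold coding")
    return {
        field: sum(auto[g][field] == gold[g][field] for g in shared) / len(shared)
        for field in fields
    }


def passes_bar(scores, bar=0.90):
    """True only if every field meets the agreement bar."""
    return all(score >= bar for score in scores.values())
```

A failing field names exactly which part of the coder to fix before any cluster downstream is trusted.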
## 10. Bootstrap: who learns it first and why
No language spreads on merit. Each wave needs a selfish reason:
1. **International deaf events.** IS users are the beachhead: already
multilingual signers, already need this, already meet annually. Get the
seed adopted as the working register of one recurring event and iterate
there. This is where Gestuno half-worked despite itself.
2. **Interpreters and deaf professionals.** IS interpretation is a paid
accreditation today (WASLI/WFD). A better-documented standard with a
real curriculum makes their work easier; they become the teachers.
3. **Hearing baby-sign families.** Replace the ad hoc ASL-fragment baby
sign market with the real contact register: same effort, and the child's
50 signs are a live language shared with deaf people worldwide. This
quietly seeds millions of hearing homes with the core lexicon.
The market is real and crowded: Baby Signs (registered trademark,
Acredolo/Goodwyn, franchised in 40+ countries), Signing Time, and a long
tail of courses and flashcards, nearly all ASL fragments sold to hearing
parents. Our differentiators are that the signs are a language rather
than a party trick, and that the materials are deaf-vetted. Two hard
rules for the infant guide (a parallel deliverable, see phases): honest
science only, since the evidence says babies sign months before they can
speak but not that signing accelerates development, and this market's
signature vice is overpromising exactly that; and no trademark-adjacent
naming. "Babytalk, the first language of new human beings" works as a
pitch sentence, not as a product name (Babytalk was a parenting
magazine for decades, and Baby Signs will litigate its corner).
4. **Special-use niches.** Divers, film sets, factories, air-traffic ground
crews, noisy kitchens. Publish the survival core as the off-the-shelf
answer to "we need hand signals"; every niche adoption is free marketing
and free stress-testing.
5. **Deaf schools, last and only by invitation.** The babies-as-natives
endgame from the brief is real (Nicaragua proves a child community can
nativize a seed in a generation) but it can only happen inside deaf
communities that choose it, most plausibly as a second language beside
the national one. Pushing here first would trigger the Milan reflex and
deserve to.
The realistic sequence is decades long, and that is fine. English needed
three hundred years; a signed auxiliary that owns niches 1-4 within twenty
would already be the most successful constructed language in history after
Esperanto, with a clearer path past it.
## 11. The scorecard: numbers the language ships with
Every claim in the pitch gets a measured number, published with its method.
The scorecard is a first-class deliverable of the corpus study and trials,
not an afterthought, because "you already know a third of it" is the entire
sales pitch to an existing signer.
- **Head-start score, per language.** For each source language: the share
of the pansign core an existing signer already knows (identical or
cognate form) plus the share they can guess (shared iconic motif).
Headline format: "An ASL signer starts at 34% known, 27% guessable."
Falls straight out of the convergence study's distance data.
- **Neutrality score.** The spread of head-start scores across languages.
If ASL signers start at 45% and CSL signers at 12%, the prestige-skew
pitfall (section 7) has a number, and design iterations must narrow it.
Published per release so drift is visible.
- **Guessability.** Share of core signs whose meaning naive viewers guess
cold (open response, then forced choice), measured separately for deaf
and hearing viewers across at least three unrelated cultures. The
iconicity literature supplies natural-language baselines to beat.
- **Learnability.** Median hours to 90% recall of the 50-sign survival
core; signs retained per study-hour after one week (spaced-repetition
telemetry from the website's learning ladder).
- **Contact performance.** Task success rate and completion time for
stranger pairs using pansign versus the IS baseline (Whynot 2016 gives
published IS comprehension figures to compare against).
- **Robustness.** Share of the core that survives one-handed degradation;
human legibility at distance; machine recognition accuracy from a 480p
webcam. No natural sign language was designed with that last number in mind.
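The head-start and neutrality scores are simple shares once every core sign has a tier per source language. A minimal sketch under assumptions: the tier labels `known`, `guessable`, and `new` are hypothetical names, and the spread is taken as the range of known-plus-guessable totals, which is only one reasonable reading of "spread".

```python
from collections import Counter


def head_start(tiers):
    """tiers: core sign -> 'known' | 'guessable' | 'new' for one language."""
    counts = Counter(tiers.values())
    total = len(tiers)
    return {
        "known": counts["known"] / total,
        "guessable": counts["guessable"] / total,
    }


def neutrality_spread(scores_by_language):
    """Range of (known + guessable) across languages; smaller is more neutral."""
    totals = [s["known"] + s["guessable"] for s in scores_by_language.values()]
    return max(totals) - min(totals)
```

Publishing both functions' inputs alongside each release would let anyone recompute the headline numbers from the convergence data.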
## 12. Phases for this project
Each phase ends at its oracle (section 9 checks); nothing advances on
vibes.
1. **Pipeline calibration** (this repo, next): ingest Signbank annotations
and a video sample, stand up pose extraction and the auto-coder, and
hit the gold-sample agreement bar. Exit: checks 1-3 green.
2. **Corpus study**: the 500-word convergence analysis. A v0 can run on
annotations alone (ASL-LEX plus the Signbank exports cover several
languages with zero video processing); video-derived coding extends it
to the full Spread the Sign breadth. Output: data, clustering, the
convergent/motif/divergent triage, and the first head-start and
neutrality numbers for the scorecard (section 11). Publishable on its
own even if everything else stalls. Exit: check 4.
3. **Design spec v0**: phoneme inventory, notation/encoding, grammar
sketch, survival core (50) and contact register (500) as encodings
with avatar drafts. Exit: round-trip readback plus guessability floors
(check 5).
4. **Consultation and filming**: deaf consultants across families revise
the spec; paid deaf signers record the reference videos through the
lint. Nothing ships that deaf reviewers call hearing-brained. Exit:
check 6 on the full core.
5. **The website** (the eventual product of this repo): dictionary with
video, illustrations, and notation; the learning ladder (50 → 500 →
grammar); corpus browser; crowdsource submission queue; all
open-licensed. Exit: check 7 running live.
6. **Field use**: one event, one niche, one baby-sign curriculum. Measure
against the IS baseline; fold what survives contact back into v1.
Exit: check 8, repeated.
Parallel track, can start once the survival core exists in draft: the
**infant guide**. A short, honest manual for teaching the 50-sign core to
hearing babies: when to start (signs emerge around 8-10 months, months
ahead of speech), how to sign consistently during routines, what to expect,
what not to expect (no cognitive miracles; the payoff is earlier
communication and fewer frustration tantrums). Written to be translated,
deaf-reviewed before publication, and eventually the seed content for the
koine learning app.
Still open, deliberately: the governance charter (who can change the core
and how slowly), the funding model for paid deaf review and filming,
consent and compensation norms written down before the first recording
session, and where the video corpus physically lives once it outgrows a
git repo.
## 13. Sources
**Plains Indian Sign Language:**
- The 1930 Indian Sign Language Grand Council film: shot September 4-6,
1930, Browning, Montana; organized by Gen. Hugh L. Scott, who died before
finishing the planned 1300-sign film dictionary. Held by the US National
Archives (Office of Indian Affairs); viewable online at
https://vimeo.com/681506002 ("Office of Indian Affairs: The Indian Sign
Language, 1930"). The largest intertribal council ever filmed: eighteen
participants from a dozen nations.
- Jeffrey Davis, *Hand Talk: Sign Language among American Indian Nations*,
Cambridge University Press, 2010. The modern treatment.
- Davis's NSF/NEH digital archive and corpus: http://pislresearch.com
(University of Tennessee). Includes the digitized 1930 films and fieldwork
with 25+ living signers (Northern Cheyenne, Crow, Blackfeet, Assiniboine),
with Melanie McKay-Cody.
- Garrick Mallery, *Sign Language Among North American Indians*, First
Annual Report of the Bureau of Ethnology, 1881. Public domain; on
archive.org and Project Gutenberg. Hundreds of pages of described signs.
- William Tomkins, *Universal Indian Sign Language*, 1926. Popular primer,
Dover reprint still in print.
**International Sign:**
- World Federation of the Deaf, *Gestuno: International Sign Language of the
Deaf*, British Deaf Association, 1975. The cautionary tale, worth owning.
- Rachel Rosenstock and Jemina Napier (eds.), *International Sign:
Linguistic, Usage, and Status Issues*, Gallaudet University Press, 2016.
- Lori Whynot, *Understanding International Sign: A Sociolinguistic Study*,
Gallaudet University Press, 2016. The comprehension and density data.
- WFD-WASLI International Sign interpreter accreditation (wfdeaf.org).
**Cross-signing and language emergence:**
- Kang-Suk Byun, Connie de Vos, Anastasia Bradford, Ulrike Zeshan, and
Stephen Levinson, "First Encounters: Repair Sequences in Cross-Signing,"
*Topics in Cognitive Science*, 2018. Plus the wider MULTISIGN project at
iSLanDS (UCLan) and MPI Nijmegen.
- Ann Senghas and Marie Coppola, "Children Creating Language: How Nicaraguan
Sign Language Acquired a Spatial Grammar," *Psychological Science*, 2001.
**Data for the convergence study:**
- Spread the Sign: https://spreadthesign.com (European Sign Language
Centre, Örebro). Video dictionary, 40+ sign languages.
- Global Signbank (Radboud University) and the national corpora: BSL Corpus
(UCL), Auslan Corpus, Corpus NGT. Frequency lists come from these.
- ASL-LEX: https://asl-lex.org. ~2700 ASL signs with phonological coding,
frequency, and iconicity ratings.
- The sign-language-processing open-source ecosystem (pose-format and
friends): https://sign-language-processing.github.io. Pose estimation
and tooling built specifically for sign video.
- James Woodward's lexicostatistical comparisons of sign languages
(1970s-90s). The hand-computed answer key for validating any automated
similarity metric.
- Robbin Battison, *Lexical Borrowing in American Sign Language*, 1978. The
unmarked-handshape inventory.
- Notation prior art: Sutton SignWriting (signwriting.org) and HamNoSys
(University of Hamburg).
## 14. What success looks like
Not "everyone signs it." Success in order of ambition: the convergence
study is cited; the contact register demonstrably outperforms IS for
strangers in trials; one international deaf event adopts it; a hearing
family and a deaf traveler manage a real conversation with it; somewhere,
a child signs their first word in it and a deaf adult on another continent
would have understood.
# Methods (accreting draft)
LOG.md is the diary; this file is the citable prose. Every completed work
item gets rewritten here in past-tense methods style with citations, so a
conference talk or paper is a rendering job, not archaeology. Update in the
same save as the work it describes, or the next one.
## Data
ASL-LEX 2.0 (Sehyr, Caselli, Cohen-Goldberg & Emmorey 2021, J. Deaf Studies
and Deaf Education 26(2)) was retrieved from OSF (project zpha4) on
2026-07-05: 2,723 ASL signs with phonological coding (handshape, selected
fingers, flexion, sign type, movement, major/minor location, coded per
morpheme for compounds), subjective frequency from deaf signers, and
iconicity ratings from both hearing and deaf raters.

The Dicta-Sign lexicon portal (Matthes et al. 2012, LREC; hosted by the
University of Hamburg) provides 2,092 concepts with parallel citation-form
videos in BSL, DGS, LSF, and GSL at systematically constructed URLs. It
serves as the concept-aligned video corpus for pose-based comparison,
sidestepping cross-dictionary gloss matching.

Global Signbank (Radboud; Crasborn et al.) hosts the NGT dataset (~7,469
glosses, CC BY 4.0) with per-gloss phonological fields; export pending
account access.
## Lexical statistics, ASL-LEX (2026-07-05)
Script: analysis/asl_lex_summary.py (Python stdlib; deterministic).

Results (data/derived/asl-lex-summary.md):
- Handshape: 58 distinct dominant-hand handshapes; the 10/20/25 most
frequent cover 56/79/87% of entries.
- Sign types: 39.2% one-handed, 34.5% symmetrical or alternating, 23.4%
asymmetrical, 2.9% violating Battison's dominance/symmetry conditions.
- Major location: 36.3% neutral space, 26.7% head, 25.0% hand, below-waist
locations absent.
- Movement: straight 45.7%, curved 23.2%, none 15.3%.
- Iconicity (n=990 signs rated by both groups, 7-point scale): hearing mean
3.16, deaf mean 3.01, Pearson r = .817 between groups; 29.7% (hearing) and
27.1% (deaf) of signs rated >= 4.
- Fingerspelled loan signs: 1.4% of entries.

Interpretation carried into design: a constrained (~20) handshape inventory
forfeits little contrast space; three quarters of a natural lexicon is
already one-hand-compatible; visual-frame constraints (webcam-high signing
space) match natural practice; the iconicity discount is real but bounded
(~30% of lexicon).
## Pose extraction pipeline (2026-07-05)
pipeline/extract_pose.py: MediaPipe HolisticLandmarker (Tasks API, model
bundle float16, VIDEO running mode) over OpenCV-decoded frames, emitting
per-frame left/right hand (21 landmarks each) and body pose (33 landmarks)
normalized coordinates, gzip-JSON per video.

Smoke test on four parallel Dicta-Sign videos of one concept
(BSL/DGS/LSF/GSL): pose detected in 99-100% of frames, hands in 45-100% (the
45% left-hand figure is a one-handed sign, i.e. correct rejection, not
failure). Throughput approximately real-time on a 2-core VPS: 401 frames in
21 s including model load. Handedness is stored as raw left/right;
dominant/non-dominant mapping is deferred to analysis (mirroring
normalization).
## Planned next
Concept list v0 (D001 recipe), NGT lexical statistics replicating the
ASL-LEX summary, keypoint normalization (body-size and framing invariant),
form-distance metric (DTW over normalized trajectories) validated by
rediscovery of known language families (Woodward's lexicostatistics as
answer key).
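The planned keypoint normalization could look roughly like the sketch below: translate each frame so the shoulder midpoint is the origin and scale by shoulder width. Indices 11 and 12 are the left and right shoulders in MediaPipe's 33-landmark pose model; everything else (the function name, the `(x, y)` point layout, and the choice of shoulder width as the unit) is an assumption, not the implemented method.

```python
import math

# MediaPipe Pose landmark indices for the two shoulders.
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12


def normalize_frame(pose, points):
    """Express points in a body-centred frame.

    pose: sequence of at least 13 landmarks, each (x, y, ...).
    points: iterable of (x, y, ...) landmarks to normalize, e.g. one hand.
    Returns a list of (x, y) tuples with the shoulder midpoint at the origin
    and one unit equal to the shoulder width.
    """
    lx, ly = pose[LEFT_SHOULDER][0], pose[LEFT_SHOULDER][1]
    rx, ry = pose[RIGHT_SHOULDER][0], pose[RIGHT_SHOULDER][1]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    width = math.hypot(lx - rx, ly - ry)
    if width == 0:
        raise ValueError("degenerate pose: shoulders coincide")
    return [((p[0] - cx) / width, (p[1] - cy) / width) for p in points]
```

One caveat worth recording: MediaPipe's normalized coordinates are scaled by image width and height separately, so for non-square video the coordinates should be converted to a common pixel scale before this step. Mirroring for left-dominant signers is a separate transform and is not covered here.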
# Data source inventory
Access status as actually tested from this server (dates noted). Raw
downloads live in data/raw/ (gitignored); derived tables in data/derived/
(committed).
## Ingested
### ASL-LEX 2.0 (ASL) — ingested 2026-07-05
- 2,723 signs, 191 columns: full phonological coding (handshape, selected
fingers, flexion, sign type, movement, major/minor location, per-morpheme
for compounds), signer frequency ratings, iconicity from both hearing and
deaf raters, fingerspelled-loan flags.
- Fetch: OSF project zpha4 via API, files signdata.csv (osf.io/download/9nygd)
and signdataKEY.csv (osf.io/download/ygq4v). CSV is latin-1, not utf-8.
- License: academic dataset, citation required (Sehyr et al. 2021, J. Deaf
Studies & Deaf Education). Aggregate statistics fine; do not redistribute
the CSV itself. Local copy: data/raw/asl-lex-2-signdata.csv.
- First summary: data/derived/asl-lex-summary.md (analysis/asl_lex_summary.py).
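For orientation, the kind of figure in that first summary (how much of the lexicon the top-k handshapes cover) can be reproduced with the standard library alone. This sketch is not analysis/asl_lex_summary.py; the helper names are illustrative, and the column to pass in is an assumption that has to be checked against signdataKEY.csv.

```python
import csv
from collections import Counter


def read_column(path, column, encoding="latin-1"):
    """Read one column from a CSV; latin-1 is the default because the
    ASL-LEX export is not utf-8."""
    with open(path, encoding=encoding, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def top_k_coverage(values, ks=(10, 20, 25)):
    """Share of non-empty entries covered by the k most frequent values."""
    counts = Counter(v for v in values if v)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-empty values")
    ranked = [count for _, count in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in ks}
```

Usage would be along the lines of `top_k_coverage(read_column("data/raw/asl-lex-2-signdata.csv", column))`, with `column` set to whatever the key file names the dominant-hand handshape field.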
## Confirmed accessible, ingest pending
### Global Signbank (Radboud) — probed 2026-07-05, more datasets 2026-07-09
- https://signbank.cls.ru.nl. Publicly listed datasets: **NGT** (~7,469
glosses, CC BY 4.0, public subset browsable), **Kata Kolok** (village
sign language, valuable out-of-family data point), EmblemsNL (Dutch
emblem gestures, useful for the offensive-gesture screen in EDGECASES).
- Operator's logged-in view (2026-07-09) shows the full catalog, far
richer than the anonymous listing. Sizes in parens. Priority order:
1. **NGT** (7,469; CC BY 4.0) — primary cross-language statistics.
2. **CSL_Shanghai** (2,241) — Chinese SL, closes the worst of the
neutrality gap (Q3).
3. **ASL** (3,105) — calibrates Signbank coding against ASL-LEX (D007).
4. **IS series**: IS_WFD1959 (357), IS_WFD1971 (373), IS_WFD1975 =
Gestuno (1,691), IS_WFD1979 (228), IS_WFD2007 (216), IS_Whynot
(207, empirical IS from the 2016 comprehension study), IS_2021
(155). The complete documented history of designed international
sign vocabularies: enables a longitudinal Gestuno autopsy.
5. **ISL** Israeli (2,268), **Kata Kolok** (1,312) + BaliHS homesign
(80), **JSL/KSL/TSL** (203/119/180, the Japanese family, ideal for
the family-rediscovery oracle), **ZEI** Farsi (221), BISINDO (62),
CARICOM (531), Konchri Sain (531), LSC (124).
6. European bulk: **VGT** (16,928), **LSFB** (3,512), **NTS** (1,946).
- Skip: Oefen, tstMH, DRS, LaSiMa (test/personal datasets), NGT-ns
(non-signs; harmless extra).
- Most datasets show 0 public signs, so per-dataset Request Access is
likely needed even though Download CSV links render. Operator is on
it; exports land in data/raw/ as downloaded.
- Phonological fields exist per gloss (handshape, location etc.).
- Bulk export (CSV/ECV) appears to require a free researcher account; the
anonymous ECV endpoint redirects to home. OPERATOR ACTION when v0 needs
it: register at signbank.cls.ru.nl, request NGT dataset access (CC BY
makes approval likely), download the gloss CSV export, drop it in
data/raw/. Alternative: per-gloss public pages are scrapable politely,
but the export is cleaner.
### Dicta-Sign lexicon portal (BSL, DGS, LSF, GSL) — probed 2026-07-05
- https://www.sign-lang.uni-hamburg.de/dicta-sign/portal/
- 2,092 concepts, each page linking parallel videos in all four languages
at predictable URLs (concepts/cs/cs_N.html → bsl|dgs|lsf|gsl/N.mp4).
- No phonological coding on the pages; this is video input for the pose
pipeline (PLAN.md section 9), already concept-aligned across four
languages, which removes the gloss-matching problem entirely.
- EU research project output hosted by University of Hamburg. Research use
is clearly intended; confirm terms before anything derived publishes
(QUESTIONS Q1 applies). Videos: fetch, extract keypoints, discard media.
## Probed, status noted
- **BSL SignBank** (bslsignbank.ucl.ac.uk): up; export/licensing not yet
investigated in depth.
- **Auslan Signbank** (auslan.org.au): up; same.
- **Svenskt teckenspråkslexikon** (teckensprakslexikon.su.se): up; the
Örebro/STS ecosystem behind Spread the Sign; open-data status to check.
- **Spread the Sign** (spreadthesign.com): up; no public API; bulk use is
an operator ToS/ethics call (DECISIONS D004). Not needed for v0.
## Plains Indian Sign Language (public domain, extraction task)
- Tomkins 1926, *Universal Indian Sign Language*: archive.org items
`indiansignlangua00tomk` (djvu.txt exists, 310 KB) and `bp_985073`.
Direct fetches from this server hit archive.org throttling on 2026-07-12;
retry later or fetch manually. Once the text lands, extracting the
~800 sign descriptions into structured form gives an out-of-family
historical lexicon and serves the operator's sovereignty framing: PISL
is the proof that a signed auxiliary across nations worked.
- Mallery 1881 (Bureau of Ethnology annual report): archive.org /
Project Gutenberg; same extraction approach, much bigger.
- Davis's pislresearch.com corpus: video, for the pose pipeline later.
## Wanted, not yet located
- Frequency lists: BSL Corpus and Auslan Corpus frequency studies publish
top-N lists in papers; extract tables when the concept list is built.
- Cross-language similarity ground truth: Woodward's lexicostatistical
cognate percentages (for the family-rediscovery oracle); digitize the
tables from the papers.
- CSL, JSL, Indo-Pakistani SL sources with any machine-readable coding:
nothing found yet; this gap is exactly the neutrality risk (QUESTIONS Q3).