pansign

designing one sign language for the world, by measurement rather than taste. Work in progress; everything below is the live working state of the project, regenerated on every publish.

47,100 sign entries indexed

27 lexical databases

461 NGT signs with phonology fetched

4 videos pose-extracted

The approach in one sentence: measure what the world's sign languages already agree on (they agree far more than spoken languages do), keep the convergent parts, design the rest against explicit constraints, and test everything with deaf reviewers before it counts.

STATUS.md

# STATUS: read this first

One page for anyone (human or agent) stepping into the project cold.

## What this is

pansign: designing a universal auxiliary sign language by measuring
convergence across existing sign languages instead of inventing from
taste. PLAN.md is the full design document and the authority on why.
Dated snapshots of it live in backups/.

## Where we are

**Phase 1-2 of PLAN.md section 12: the annotations-only v0 convergence
study.** Goal: first real convergence and head-start numbers computed from
already-annotated lexical databases, zero video processing. Video and pose
estimation come after this proves the method.

Progress and reasoning are in LOG.md (chronological, append-only).
Data source inventory and access status: data/SOURCES.md.

## How to continue

1. Read LOG.md from the last entry backward until oriented.
2. Check DECISIONS.md for anything marked `open` that blocks your task;
   proceed on the recorded default unless it is marked `blocking`.
3. Do the next thing listed under "Next" in the latest LOG.md entry.
4. Append your own LOG.md entry: what you did, why, what surprised you.
5. `just save "<plain words>"` after every green step.
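
Step 2 can be checked mechanically. A minimal sketch, assuming the
DECISIONS.md heading convention used in this repo (an ID, a backticked
status, then the title); `decisions_by_status` is a hypothetical helper,
not an existing script.

```python
import re

# Matches register headings such as: ## D004 `blocking-later` Bulk-fetching ...
HEADING = re.compile(r"^## (D\d+) `([a-z-]+)` (.+)$")

def decisions_by_status(markdown_text):
    """Group decision IDs by their backticked status word."""
    grouped = {}
    for line in markdown_text.splitlines():
        found = HEADING.match(line.strip())
        if found:
            decision_id, status, _title = found.groups()
            grouped.setdefault(status, []).append(decision_id)
    return grouped

sample = """\
## D001 `open` Concept list composition for the 500-word core
## D004 `blocking-later` Bulk-fetching Spread the Sign video
## D005 `deferred` Notation system
"""
statuses = decisions_by_status(sample)
# Anything listed under "blocking" stops work until a human decides.
print(statuses.get("blocking", []))
```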

## File map

- PLAN.md: the design document. Read sections 5, 9, 11 before touching data.
- LOG.md: work log, newest entry last. The "why" lives here.
- DECISIONS.md: open/settled decision register. Arbitrary calls and
  convention choices are parked here so the operator or deaf reviewers can
  revisit them without rework.
- QUESTIONS.md: emerging concerns that are nobody's decision yet (legal,
  methodological, resource).
- EDGECASES.md: corner cases the language and tooling must eventually handle.
- data/: datasets and their inventory (SOURCES.md). Raw downloads stay out
  of git if large (.gitignore'd); derived tables get committed.
- backups/: dated PLAN.md snapshots.

## Who does what

- Wayne (agent): everything computable. Data survey, pipeline, analysis,
  drafts, this repo.
- Operator: spending, accounts, legal calls (scraping terms, licenses),
  outreach, anything public. Registers pansign.org. Reviews DECISIONS.md.
- Deaf community: reviews and vetoes, from phase 3 on. Nothing publishes
  without deaf review; see PLAN.md sections 7 and 12.

## Current human (operator) items

Not blocking yet; the concept list and pose-pipeline proof proceed without
them. In priority order:

1. **Global Signbank account** (unblocks the first cross-language
   comparison): register at https://signbank.cls.ru.nl (top-right signup),
   request access to the NGT dataset (it is CC BY 4.0; academic-adjacent
   justification "cross-linguistic lexical comparison research" is
   accurate), then Datasets → NGT → export gloss CSV, and place the file
   in data/raw/ named ngt-signbank-export.csv. Wayne handles the rest.
2. **Register pansign.org** (already agreed, any time).
3. Later, at phase 2 breadth: the Spread the Sign bulk-use decision
   (DECISIONS D004) and a real legal read on QUESTIONS Q1.

LOG.md

# Work log

Append-only, chronological. Each entry: what, why, surprises, next.
Detail level: enough that a stranger could pick up mid-stride.

## 2026-07-01: project start

Repo created from the legendary-taste kit. PLAN.md written: the full
design argument (auxiliary not replacement, convergence-measured lexicon,
deaf-first, script-free). Private backup at wayniacal/pansign.

## 2026-07-05: naming, scorecard, media pipeline, infant track

Operator settled the name: pansign (pansign.org to be registered by
operator; .com squatted). "koine" reserved for the learning app. Added to
PLAN.md: media pipeline (section 9, pose estimation in, encoding-first
out, per-stage checks), scorecard deliverable (section 11, head-start and
neutrality numbers), infant guide as a parallel track with baby-sign
market and trademark notes.

## 2026-07-05: v0 study kickoff

What: dated PLAN.md snapshot to backups/, scaffolding docs (STATUS, LOG,
DECISIONS, QUESTIONS, EDGECASES), starting the annotations-only v0
convergence study per PLAN.md section 12 phase 2.

Why annotations first: several lexical databases already carry human
phonological coding. Any convergence result computable from those is
cheaper and more trustworthy than CV output, and calibrates the CV
pipeline later.

Machine constraints noted: 2 cores, 8 GB RAM, ~21 GB free disk. Fine for
tables and keypoints; never store bulk video here (stream, extract, drop).

Next: data source survey with live fetch attempts, logged to
data/SOURCES.md. Then ASL-LEX ingest and first sanity analyses.

## 2026-07-05: ASL-LEX ingested, first real numbers; source survey done

What: downloaded ASL-LEX 2.0 (2,723 signs, full phonological coding) via
the OSF API; wrote analysis/asl_lex_summary.py (stdlib only, per D003);
probed Global Signbank, Dicta-Sign, BSL/Auslan signbanks, STS, Spread the
Sign. Inventory with access details: data/SOURCES.md.

Findings that matter (full tables in data/derived/asl-lex-summary.md):

- Handshapes: ASL uses 58, but the top 20 cover 79% of the lexicon. A
  ~20-handshape designed inventory forfeits little. Fed into D006.
- One-handed degradation is cheaper than feared: 39% of ASL signs are
  already one-handed and another 35% are symmetrical (one hand can carry
  them); only the asymmetric quarter needs designed variants. Natural ASL
  also carries ~3% dominance/symmetry-violating signs; a designed language
  carries zero.
- Location: neutral space + head = 63%; below-waist locations essentially
  absent in nature, so the webcam-framing requirement costs nothing.
- Movement inventory is small: straight/curved/none = 84% of signs.
- Iconicity: mean 3.1 of 7; ~28-30% of signs rate >= 4. Deaf and hearing
  raters correlate at r = 0.817, so iconicity judgments are stable across
  hearing status (within US culture; Q2 caveat unchanged). The iconicity
  discount is real but covers maybe a third of a lexicon, matching PLAN.md
  section 3's "smaller than optimists assume".
- Fingerspelled loan signs are only 1.4% of ASL-LEX: dropping
  fingerspelling from the core costs almost no lexicon. Its real function
  (names, out-of-vocabulary) still needs the sign-name convention
  (EDGECASES).
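
The one-handed degradation figure above is a share count over sign types.
A minimal sketch on toy rows; the `sign_type` key and its values are
illustrative assumptions, not the actual ASL-LEX column names, and this is
not the code in analysis/asl_lex_summary.py.

```python
from collections import Counter

# Toy rows standing in for a coded lexicon (hypothetical field names).
signs = [
    {"gloss": "DRINK", "sign_type": "one_handed"},
    {"gloss": "BOOK", "sign_type": "symmetrical"},
    {"gloss": "HELP", "sign_type": "asymmetrical"},
    {"gloss": "EAT", "sign_type": "one_handed"},
]

# Sign types that one hand can already carry without a designed variant.
ONE_HAND_OK = {"one_handed", "symmetrical"}

def degradation_shares(rows):
    """Return per-type shares and the total share that needs no variant."""
    counts = Counter(row["sign_type"] for row in rows)
    total = sum(counts.values())
    shares = {kind: n / total for kind, n in counts.items()}
    free = sum(share for kind, share in shares.items() if kind in ONE_HAND_OK)
    return shares, free

shares, free = degradation_shares(signs)
print(f"already one-hand capable: {free:.0%}")
```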

Source survey headlines: Global Signbank's NGT dataset is CC BY 4.0 with
~7,469 glosses but bulk export needs a free researcher account (operator
step written in data/SOURCES.md). Dicta-Sign is the sleeper: 2,092
concepts with parallel videos in BSL/DGS/LSF/GSL at predictable URLs,
concept-aligned already, ideal first corpus for the pose pipeline. No
machine-readable coding found yet for any non-Western sign language; the
neutrality gap (Q3) is real and needs active hunting.

Surprise: ASL-LEX CSV is latin-1 encoded, not utf-8.
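
One way to guard an ingest script against this surprise is to try utf-8
first and fall back to latin-1. A stdlib-only sketch; `read_csv_bytes` and
the sample row are hypothetical.

```python
import csv
import io

def read_csv_bytes(raw, encodings=("utf-8", "latin-1")):
    """Decode CSV bytes with the first encoding that works, then parse rows."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding, list(csv.DictReader(io.StringIO(text, newline="")))
    raise ValueError("no candidate encoding decoded the file")

# "é" in latin-1 is the single byte 0xE9, which is not valid utf-8 here.
raw = "gloss,note\nFIANCE,fiancé\n".encode("latin-1")
encoding, rows = read_csv_bytes(raw)
print(encoding, rows[0]["note"])
```

latin-1 decodes any byte sequence, so it belongs last in the list; it is a
fallback, not a detector.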

Next, in order:
1. Draft the 500-concept list v0 (D001 hybrid recipe): structure the file
   with sense IDs, tags (swadesh/contact/frequency source), and bias
   annotations per Q3.
2. Operator unblocks Global Signbank export (see SOURCES.md); then the
   same summary analysis runs on NGT for the first cross-language
   handshape/location comparison.
3. Pose pipeline proof on a handful of Dicta-Sign videos (phase 1,
   PLAN.md section 9): mediapipe needs a pinned python lane first (D003
   revisit: this is the point where stdlib stops sufficing).
4. Hunt for CSL/JSL/IPSL machine-readable sources.

Addendum, same day: drafted the survival-core-50 candidate list
(data/concepts-survival50-draft.tsv, register entry D008) with sense IDs
and domain/audience tags, so guessability testing and the infant guide
have a concrete object to argue about. The full 500-concept list stays
next; its recipe is D001.

## 2026-07-09: pose pipeline proven; python lane wired; methods draft

What: wired the pinned python lane (mise python 3.12 + uv, mediapipe
0.10.35 in uv.lock) per D003's "when stdlib stops sufficing". Wrote
pipeline/extract_pose.py (MediaPipe HolisticLandmarker, Tasks API; the
legacy mp.solutions API is gone in 0.10.35). Smoke-tested on four
parallel Dicta-Sign videos of concept 100 (BSL/DGS/LSF/GSL).

Results: pose detected 99-100% of frames, hands 45-100% (the 45% is a
one-handed sign; correct rejection). Near real-time on this 2-core VPS:
401 frames in 21s. Full Dicta-Sign corpus is a ~15h background job (Q5
updated). Keypoints land in data/derived/poses/*.json.gz.

Also: started notes/methods.md, the accreting citable-prose companion to
this log, after the operator asked whether documentation would support a
conference presentation. Rule added there: every completed work item gets
its methods paragraph in the same or next save. Q8 added to QUESTIONS.md
recording the honest expected-value and legitimacy assessment.

Surprise: mediapipe 0.10.35 removed mp.solutions entirely; HolisticLandmarker
Tasks API + explicit model download replaced it. Model bundle (13 MB) is
gitignored at data/models/, re-fetch URL in pipeline/extract_pose.py.

Next:
1. Full 500-concept list v0 (D001).
2. Normalization + DTW form distance over keypoints; validate on
   Dicta-Sign same-concept pairs before any cross-language claims.
3. Background job: extract poses for all 2,092 Dicta-Sign concepts x 4
   languages (fetch, extract, delete video).
4. NGT export lands from operator -> replicate asl_lex_summary on NGT.
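
Item 2 can be sketched with the stdlib alone. The normalization below
(center on the mean keypoint, scale by RMS radius) and the per-frame
distance are assumptions chosen for illustration, not the project's
settled metric; frames are plain lists of (x, y) keypoints.

```python
import math

def normalize(frames):
    """Translate to the mean keypoint and scale by RMS distance from it."""
    points = [p for frame in frames for p in frame]
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    rms = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points))
    scale = rms or 1.0  # guard against a degenerate all-identical sequence
    return [[((x - cx) / scale, (y - cy) / scale) for x, y in frame] for frame in frames]

def frame_distance(a, b):
    """Mean Euclidean distance between corresponding keypoints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) dynamic time warping, divided by n + m."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

# Two keypoints per frame; the second one moves upward over three frames.
fast = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 2.0)]]
slow = [frame for frame in fast for _ in range(2)]             # same motion, half speed
far = [[(10 + 3 * x, 5 + 3 * y) for x, y in f] for f in fast]  # shifted and scaled
other = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (3.0, 0.0)]]

same_slow = dtw_distance(normalize(fast), normalize(slow))
same_far = dtw_distance(normalize(fast), normalize(far))
different = dtw_distance(normalize(fast), normalize(other))
print(same_slow, same_far, different)
```

The toy sequences show the two properties the validation step should look
for on real same-concept pairs: speed and framing differences should cost
close to nothing, a different movement should not.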

## 2026-07-09: Signbank ECVs ingested; NGT phonology scrape running

Operator downloaded all 31 Signbank exports; they turned out to be ECV
files (ELAN controlled vocabularies): gloss id, gloss text, translations,
and a stable per-gloss URL, but no phonology. The Request Access flow for
CSV exports returns HTTP 500 server-side (reported to their admin;
fallback is emailing the Radboud CLS group if it stays broken).
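
For orientation, a sketch of reading such a file with the stdlib XML
parser. The element and attribute names (`CONTROLLED_VOCABULARY`,
`CV_ENTRY_ML`, `CVE_VALUE`, `LANG_REF`, `DESCRIPTION`) are an assumption
about the ECV layout and should be checked against a real export; the
sample is hand-written, and this is not the code in analysis/parse_ecv.py.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed ECV layout; not a real Signbank export.
SAMPLE_ECV = """<CV_RESOURCE VERSION="0.2">
  <CONTROLLED_VOCABULARY CV_ID="DEMO">
    <CV_ENTRY_ML CVE_ID="glossid1">
      <CVE_VALUE LANG_REF="eng" DESCRIPTION="house, home">HOUSE-A</CVE_VALUE>
      <CVE_VALUE LANG_REF="nld" DESCRIPTION="huis">HUIS-A</CVE_VALUE>
    </CV_ENTRY_ML>
  </CONTROLLED_VOCABULARY>
</CV_RESOURCE>"""

def parse_ecv(xml_text):
    """Yield one row per (entry, language) value found in an ECV document."""
    root = ET.fromstring(xml_text)
    for vocab in root.iter("CONTROLLED_VOCABULARY"):
        for entry in vocab.iter("CV_ENTRY_ML"):
            for value in entry.iter("CVE_VALUE"):
                yield {
                    "dataset": vocab.get("CV_ID"),
                    "gloss_id": entry.get("CVE_ID"),
                    "lang": value.get("LANG_REF"),
                    "gloss": (value.text or "").strip(),
                    "translations": value.get("DESCRIPTION", ""),
                }

rows = list(parse_ecv(SAMPLE_ECV))
for row in rows:
    print(row["dataset"], row["gloss_id"], row["lang"], row["gloss"])
```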

What we got anyway:
- analysis/parse_ecv.py -> data/derived/signbank-glosses.tsv: 44,377
  glosses across 26 datasets (test sets skipped), the concept-alignment
  backbone. Inventory: data/derived/signbank-inventory.md. Includes the
  full IS series (1959-2021), CSL Shanghai (2,241), ISL, JSL/KSL/TSL,
  Kata Kolok, VGT (16,928).
- Public NGT gloss pages DO render phonology (handedness, strong/weak
  handshape, location, contact, orientation, movement when set). NGT is
  CC BY 4.0, so polite scraping of public pages is license-clean.
  pipeline/scrape_ngt.py: 1 req/s, identifying UA, resumable JSONL out.
  Smoke test clean; full run (~7.5k pages, ~2h) launched in background
  -> data/derived/ngt-phonology.jsonl.

Why NGT phonology matters: it is the second fully-coded lexicon after
ASL-LEX, from a different coding tradition, and unlocks the first
cross-language handshape/location comparison plus the D007 mapping table.
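
The scrape pattern described above (rate-limited, resumable, JSONL out)
can be sketched as follows. This is not pipeline/scrape_ngt.py: the
`fetch` callable is a stand-in so the sketch runs offline, and the record
fields are hypothetical. A real fetch would issue the HTTP request with
the identifying User-Agent.

```python
import json
import os
import tempfile
import time

def scrape(ids, fetch, out_path, delay_s=1.0, sleep=time.sleep):
    """Append one JSON line per id, skipping ids already in the output file."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as fh:
            done = {json.loads(line)["id"] for line in fh if line.strip()}
    fetched = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for gloss_id in ids:
            if gloss_id in done:
                continue  # resumability: work from earlier runs is kept
            record = {"id": gloss_id, **fetch(gloss_id)}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            out.flush()  # a crash loses at most the record in flight
            fetched += 1
            sleep(delay_s)  # politeness: at most one request per delay_s
    return fetched

def fake_fetch(gloss_id):
    """Stand-in for the real page fetch and parse (hypothetical fields)."""
    return {"handedness": "1", "location": "neutral space"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ngt-phonology.jsonl")
    first = scrape([1, 2, 3], fake_fetch, path, sleep=lambda s: None)
    second = scrape([1, 2, 3, 4], fake_fetch, path, sleep=lambda s: None)
print(first, second)
```

The second call only fetches the one id the first run had not seen, which
is the behavior that makes a day-long pass safe to interrupt.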

Next (after the running items above): parse ngt-phonology.jsonl into the
NGT equivalent of the ASL-LEX summary; gloss-match the IS_WFD1975 list
against NGT/ASL for a first Gestuno-vs-nature look.

## 2026-07-09: pansign.org live path prepared

Operator registered pansign.org, A record to this server. Built the
deliberately-cheap progress page (site/build.py): one static HTML, the
repo documents verbatim in collapsible monospace blocks plus a live
numbers strip computed from data/derived/. Ship = regenerate + cp to
~/sites/pansign + curl sentinel (justfile wired; run = local preview on
:8099). Page regenerates from the docs, so publishing stays current with
zero extra writing. Files staged to ~/sites/pansign; Caddy vhost needs
operator sudo (two commands handed over). Site is expendable by design;
the eventual dictionary site replaces it wholesale.

NGT scrape note: their server is slow (~9s/gloss), so the full pass takes
about a day, not 2h. Resumable, running.

## 2026-07-09 (night): dicta extraction launched; alignment floors; concepts v0

Operator confirmed pansign.org live (Caddy vhost added), sent the Radboud
access email, went to sleep. Standing instruction: keep working, ship the
site when progress lands.

Three things done:

1. **Dicta-Sign full extraction launched** (pipeline/dicta_run.py,
   background): all 2,092 concepts x 4 languages, fetch-extract-delete,
   resumable, misses recorded. Output data/derived/dicta-poses/.

2. **Gloss-level concept alignment** (analysis/gloss_match.py ->
   data/derived/gloss-match.md). Floors, not truths (exact normalized
   match). Usable: Gestuno-1975 shares 649 glosses with ASL-LEX, 500 with
   NTS, 465 with ISL, 348 with CSL Shanghai; the 1959->1971->1975->1979 IS
   series has strong internal continuity (255/247/125 with 1975). Flaw
   found: VGT and LSFB gloss in Dutch/French, so English exact-match reads
   ~zero; v2 must match on the translation fields split on commas, not
   just the primary gloss. Survival-50 coverage confirms the list is
   ordinary vocabulary: 46/50 in ASL-LEX.

3. **Concept list v0** (analysis/build_concepts.py ->
   data/concepts-v0.tsv): 520 concepts = survival-50 + Swadesh-100 +
   ASL-LEX top deaf-signer-frequency lemmas (fingerspelled loans
   excluded), provenance-tagged per entry. Bias recorded (Q3): the only
   frequency signal is ASL. Curation pass (senses, domains for the
   frequency tranche, cultural blind spots) is open for operator and
   later deaf review; mechanical regeneration is safe, the file is
   derived.
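
The v2 fix from item 2 (match on translation fields split on commas, not
just the primary gloss) can be sketched like this. The normalization
rules, including the variant-suffix stripping, are assumptions for
illustration and not necessarily what analysis/gloss_match.py does; the
word lists are toy data.

```python
import re

def normalize(gloss):
    """Lowercase, drop a one-character variant suffix, collapse separators."""
    gloss = gloss.strip().lower()
    gloss = re.sub(r"-[a-z0-9]$", "", gloss)  # HOUSE-A -> house (assumed suffix style)
    return re.sub(r"[\s_-]+", " ", gloss)

def match_keys(primary, translations, use_translations):
    """All normalized strings under which one entry may be matched."""
    keys = {normalize(primary)}
    if use_translations:
        keys.update(normalize(t) for t in translations.split(",") if t.strip())
    return keys

def shared_glosses(list_a, list_b, use_translations=True):
    """Primary glosses from list_a that have an exact normalized match in list_b."""
    index_b = set()
    for primary, translations in list_b:
        index_b |= match_keys(primary, translations, use_translations)
    return [primary for primary, translations in list_a
            if match_keys(primary, translations, use_translations) & index_b]

# Toy entries: (primary gloss, comma-separated English translations).
gestuno = [("HOUSE", ""), ("WATER", "")]
vgt = [("HUIS-A", "house, home"), ("MELK", "milk")]

v1 = shared_glosses(gestuno, vgt, use_translations=False)  # primary gloss only
v2 = shared_glosses(gestuno, vgt)                           # with translations
print(v1, v2)
```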

Next: NGT scrape completes -> NGT phonology summary + D007 mapping table;
dicta poses accumulate -> normalization + DTW metric development can start
on partial data (a few hundred concepts suffice for method development).

## 2026-07-12 (night 2): baby-sign research distilled; PISL sourcing

Operator asks honored this session:
- **notes/baby-sign.md**: the requested distillation. Practitioner
  consensus (7-point method), honest evidence state (signs precede
  speech: solid; developmental claims: do not survive systematic review;
  real benefit is the 8-18mo communication gap), product landscape with
  prices (Baby Signs franchise, Signing Time $12.99/mo, Tiny Signs
  course, free-chart SEO plays), testimonial themes (fewer tantrums
  dominates; the PAIN-sign medical story recurs; week-6 quit cliff), and
  the Deaf-community appropriation critique with its sharp implication:
  our infant core is only defensible as the first 50 words of a real
  co-owned language, never as another fragment product. Feeds the infant
  guide and D008 (infant-first subset, PAIN gets design priority).
- **PISL**: Tomkins 1926 and Mallery 1881 identified on archive.org
  (identifiers in data/SOURCES.md); fetches throttled from this server,
  retry later. Extraction of Tomkins' ~800 sign descriptions is a good
  future task: out-of-family historical lexicon, and it carries the
  operator's framing (a sovereign signed auxiliary that actually worked
  across nations).

Background jobs mid-flight: NGT phonology 519/7,464 (their server is the
bottleneck); dicta-poses 207 files so far, no misses logged.

Docs review pass: PLAN/DECISIONS/QUESTIONS consistent; one addition
worth making later, a PLAN sentence connecting PISL's sovereignty
precedent to the section 1 argument (deferred; PLAN edits are operator-
visible and it is 1am). Site shipped with updated numbers.

DECISIONS.md

# Decision register

Every arbitrary or convention-level choice gets an entry the moment it is
made, so the operator or deaf reviewers can revisit any of them later
without archaeology. Statuses:

- `open`: proceeding on the recorded default; cheap to change.
- `blocking`: work stops here until a human decides. Rework cost of
  guessing wrong is too high.
- `settled`: decided by operator or deaf review; date and rationale noted.
- `deferred`: not yet live; parked until its phase.

Rule of thumb: computational choices default to `open` (results can be
recomputed), language-design choices that thousands of people would have
to relearn lean `blocking` once we reach them.

---

## D001 `open` Concept list composition for the 500-word core

Options:
a. Classic Swadesh-style universal concepts. Comparable to prior
   literature, but built for historical linguistics, not daily use.
b. Corpus frequency from the sign corpora (BSL/Auslan/NGT). Reflects real
   signing, but only three Western languages have public frequency data.
c. Contact-situation vocabulary (travel, health, food, numbers, family)
   from use cases 3-5 in PLAN.md section 2.

Default: hybrid. Frequency-ranked union of the (b) corpus lists, restricted
to concepts expressible cross-culturally, plus (c) as a coverage checklist,
with (a) as a comparability subset tagged inside the list. Rework cost:
low; the list is an input file, analyses rerun.

## D002 `open` What counts as "the same sign" across languages (cognate threshold)

The head-start and convergence numbers depend on this line.
Options:
a. Strict: all four manual parameters match (handshape, location,
   movement, orientation) up to inventory binning.
b. Loose: handshape + location match, movement similar.
c. Graded: report similarity as a score, draw the "known/guessable/new"
   lines late, publish the thresholds with the scorecard.

Default: (c). Never bake a binary into the pipeline; thresholds are a
reporting decision. Rework cost: none by design.
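
A sketch of what option (c) looks like in code: the pipeline stores only a
score, and the labels are applied at reporting time with thresholds passed
in. The equal-weight match fraction, the threshold values, and the
parameter codes are placeholders, not project decisions.

```python
PARAMETERS = ("handshape", "location", "movement", "orientation")

def similarity(sign_a, sign_b):
    """Fraction of the four manual parameters on which two signs agree."""
    matches = sum(sign_a[p] == sign_b[p] for p in PARAMETERS)
    return matches / len(PARAMETERS)

def label(score, known=0.75, guessable=0.5):
    """Reporting-time bucketing; thresholds are arguments, never constants."""
    if score >= known:
        return "known"
    if score >= guessable:
        return "guessable"
    return "new"

# Toy codings after inventory binning (hypothetical values).
sign_a = {"handshape": "flat", "location": "chin",
          "movement": "straight", "orientation": "palm-in"}
sign_b = {"handshape": "flat", "location": "chin",
          "movement": "straight", "orientation": "palm-down"}

score = similarity(sign_a, sign_b)
print(score, label(score), label(score, known=1.0))
```

Moving the `known` threshold changes the label without touching the stored
score, which is the sense in which the rework cost stays at none.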

## D003 `open` Analysis toolchain

Options: plain python3 stdlib (zero deps, this VPS has 3.11) vs pinned
uv + pandas/scipy lane.
Default: stdlib csv/json until something actually hurts, then wire the
python lane properly in .mise.toml (pinned, per kit rules). Rework: low.

## D004 `blocking-later` Bulk-fetching Spread the Sign video

Their videos are copyrighted; there is no public API. Analysis-only
scraping is a terms-of-service and ethics call the operator must make
(QUESTIONS Q1). Not needed for the v0 study, so not blocking today.
Becomes blocking at phase 2 full breadth.

## D005 `deferred` Notation system: extend HamNoSys / extend SignWriting / new

Live at phase 3 (design spec). Research task first: what breaks in each
when encoding must be unique, machine-parseable, and human-writable.

## D006 `open` Handshape inventory size

Battison's unmarked set is ~7; usable cores in natural SLs run 20-40.
First numbers (ASL-LEX 2.0, 2026-07-05, data/derived/asl-lex-summary.md):
ASL uses 58 distinct handshapes; the top 10 cover 56% of the lexicon, top
20 cover 79%, top 25 cover 87%. ASL-LEX's own marked-handshape flag splits
the lexicon almost exactly in half (1,373 marked / 1,350 unmarked), so a
strictly unmarked core is a real departure from natural practice, paid for
in a smaller sign space (fewer cheap contrasts). Working default: target
inventory around 20 handshapes, final call after the same curve is
computed for NGT and at least one non-Western language. Rework: medium.
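
The coverage curve behind these numbers is a cumulative share over
handshapes sorted by frequency. A minimal sketch on toy data, so the same
function can be pointed at NGT later; the handshape labels are
placeholders and the helper names are hypothetical.

```python
from collections import Counter

def coverage_curve(handshapes):
    """Cumulative lexicon share covered by the top-k most frequent handshapes."""
    counts = Counter(handshapes)
    total = sum(counts.values())
    curve, running = [], 0
    for _shape, n in counts.most_common():
        running += n
        curve.append(running / total)
    return curve

def inventory_for(curve, target):
    """Smallest inventory size whose coverage reaches the target share."""
    for size, covered in enumerate(curve, start=1):
        if covered >= target:
            return size
    return len(curve)

# One handshape label per sign in a ten-sign toy lexicon.
toy = ["5"] * 5 + ["1"] * 3 + ["b"] + ["o"]
curve = coverage_curve(toy)
print(curve, inventory_for(curve, 0.79))
```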

## D008 `open` Survival-core-50 composition

Draft candidate list: data/concepts-survival50-draft.tsv (2026-07-05),
tagged by domain and audience (infant/contact/swadesh). Chosen for: infant
firsts (baby-sign literature's most-used set), contact situations (travel,
health, commerce), and cross-cultural expressibility. Deliberately absent:
anything script-, religion-, or cuisine-specific. Known tension: "please"
and politeness marking are not universal across cultures; may merge into
one politeness sign or drop. Operator and later deaf reviewers should
scan the list for cultural blind spots. Rework cost: low now, high after
guessability testing starts.

## D007 `open` Handshape equivalence across databases

ASL-LEX, Signbank datasets, and HamNoSys each use different handshape
labels/granularity. Cross-language comparison needs one mapping table.
Options: map everything onto HamNoSys's fine-grained set and coarsen, or
define our own coarse bins from the start.
Default: map to a coarse bin set we define (documented in
data/handshape-bins.md when built), keeping source labels alongside so the
mapping is revisable. Rework: medium (mapping table edits force
recompute, but recompute is cheap).
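
A sketch of the default: a lookup keyed by (source, label) that returns
the coarse bin while keeping the source label next to it. The bin names
and source labels below are invented for illustration;
data/handshape-bins.md does not exist yet.

```python
# Hypothetical mapping table: (source database, source label) -> coarse bin.
BINS = {
    ("asl-lex", "open_b"): "flat",
    ("ngt", "B"): "flat",
    ("asl-lex", "1"): "index",
    ("ngt", "1"): "index",
}

def to_bin(source, label):
    """Attach a coarse bin without discarding the source's own label."""
    return {
        "source": source,
        "source_label": label,
        "bin": BINS.get((source, label), "UNMAPPED"),
    }

def unmapped(rows):
    """Source labels the table does not cover yet, for review."""
    return sorted({(r["source"], r["source_label"]) for r in rows if r["bin"] == "UNMAPPED"})

rows = [to_bin("asl-lex", "open_b"), to_bin("ngt", "B"), to_bin("ngt", "Beak")]
print(rows[0]["bin"] == rows[1]["bin"], unmapped(rows))
```

Keeping unmapped labels visible rather than dropping them is what makes a
later edit to the table a recompute instead of a re-ingest.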

QUESTIONS.md

# Open questions and concerns

Not decisions (those go in DECISIONS.md); worries, unknowns, and things
that need a lawyer, a linguist, or a deaf reviewer eventually.

## Q1 Legal: are extracted keypoints derivative works?

The input pipeline turns copyrighted dictionary video into keypoint series
and parameter codes. We treat those as internal analysis data and never
republish media, but whether pose data legally counts as a derivative work
is untested ground. Needs a real opinion before anything built on scraped
video publishes. (Related: D004.)

## Q2 ASL-LEX iconicity ratings are culture-bound

ASL-LEX's iconicity scores were collected from American raters only
(hearing and deaf; LOG.md 2026-07-05 reports r = 0.817 between them).
Using them as "universal guessability" would smuggle in exactly the
iconicity chauvinism PLAN.md section 7 warns about. Use them as one
signal, never ground truth; real guessability testing (check 5) stays
mandatory.

## Q3 Frequency data exists for very few sign languages

BSL, Auslan, NGT have corpus frequency lists; most languages have none.
The 500-concept list will be frequency-weighted toward Western languages
no matter what we do at v0. Mitigation: tag the bias in the list itself,
revisit when more corpora surface.

## Q4 Spread the Sign coverage is uneven

Some languages have full 15k+ vocabularies on STS, others a few hundred
entries of varying production quality. Head-start scores for thin
languages will have wide error bars; publish per-language sample sizes
with the scorecard.

## Q5 VPS resources

2 cores / 8 GB / ~21 GB free. Annotation tables and keypoints fit easily.
Measured 2026-07-09: HolisticLandmarker runs near real-time on this CPU
(401 frames in 21s including model startup). The full Dicta-Sign corpus
(2,092 concepts x 4 languages, ~12h of video) is a ~15h background job,
completely feasible here. GPU rental only becomes relevant at Spread the
Sign breadth (hundreds of hours of video). Never store bulk video on this
disk: fetch, extract, delete.
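
The ~15h figure can be reproduced from the measured throughput. The
source frame rate is not recorded in this document; 25 fps is an
assumption here.

```python
frames, seconds = 401, 21
throughput_fps = frames / seconds      # measured, includes model startup

video_hours = 12                       # approximate corpus length from above
assumed_video_fps = 25                 # assumption: not stated in the docs

job_hours = video_hours * assumed_video_fps / throughput_fps
print(f"{throughput_fps:.1f} frames/s -> about {job_hours:.1f} h of extraction")
```

Because the 21s includes model startup, the sustained rate is somewhat
better than this, so the estimate leans pessimistic.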

## Q6 Deafblind users

Protactile and tactile signing exist as their own modality. A language
optimized for visual discriminability is not automatically tactile-friendly.
At minimum: do not design core signs that are tactile-hostile when a
tactile-neutral variant exists. Needs expert input at phase 3.
(Also listed in EDGECASES.)

## Q8 Expected value and legitimacy (the "is this even a good idea" question)

Asked by the operator 2026-07-09, answered honestly: the language itself
catching on is a low-probability outcome (order 5-15% even for the IS
niche); the research byproducts (convergence study, similarity metric,
open corpus tooling, notation) are near-certainly useful to sign
linguistics and accessibility tech regardless. The project is structured
so its expected value does not depend on adoption. Legitimacy is the
harder constraint: a hearing-side project has roughly zero standing to
ship a language; phase 3 is a genuine go/no-go gate where the honest
outcomes include "hand the whole thing to a deaf-led organization" and
"stop, publish the data, done." Written down so nobody later pretends the
plan promised more.

## Q7 Left-handed signers

Natural SLs let signers mirror; dominance is signer-relative, not
absolute. Our encoding must express handedness relatively (dominant /
non-dominant, never left/right) or half the videos will "fail" the
recording lint. Cheap to get right now, expensive later.
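
A sketch of the relative encoding: keypoints arrive as left/right, leave
as dominant/non-dominant, and left-dominant recordings are mirrored onto
one canonical orientation. The frame layout (x normalized to 0..1) and the
choice of right-dominant as canonical are assumptions.

```python
def to_relative(frame, dominant):
    """Re-key a frame from left/right to dominant/nondominant.

    frame: {"left": [(x, y), ...], "right": [(x, y), ...]}, x in 0..1.
    dominant: "left" or "right" for this signer.
    """
    if dominant not in ("left", "right"):
        raise ValueError("dominant must be 'left' or 'right'")
    other = "left" if dominant == "right" else "right"
    mirror = dominant == "left"  # canonical form is right-dominant (assumed)

    def fix(points):
        return [(1.0 - x, y) if mirror else (x, y) for x, y in points]

    return {"dominant": fix(frame[dominant]), "nondominant": fix(frame[other])}

# The same sign recorded by a right-handed and a left-handed signer.
right_handed = {"right": [(0.25, 0.5)], "left": [(0.5, 0.75)]}
left_handed = {"left": [(0.75, 0.5)], "right": [(0.5, 0.75)]}

a = to_relative(right_handed, "right")
b = to_relative(left_handed, "left")
print(a == b)
```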

EDGECASES.md

# Edge and corner cases

Things the language and its tooling must eventually handle. Each gets a
short entry when discovered; promote to PLAN.md or DECISIONS.md when it
starts driving design. Sources: known sign-linguistics issues plus
whatever the data turns up.

## Signers

- Left-handed / mirrored signing: handedness is signer-relative. Encoding
  uses dominant/non-dominant, never left/right. (QUESTIONS Q7)
- One-handed contexts and one-handed people: every core sign carries a
  defined one-handed variant (PLAN.md requirement). The variant is part of
  the sign's entry, not folklore.
- Deafblind / tactile signing: visual-optimal is not tactile-optimal.
  Flag tactile-hostile candidates during design. (QUESTIONS Q6)
- Children's motor limits: infant guide (50-sign core) must avoid
  handshapes infants cannot form; baby-sign literature documents typical
  first-year approximations. The core should survive being mangled by
  eight-month-old hands and still be readable.
- Seated signers, wheelchair users: location parameters that assume
  standing-height sight lines or torso-length movement paths are bugs.
- Limited facial mobility (stroke, Moebius syndrome, botox, veils):
  grammar that lives exclusively on the face needs manual fallbacks.
  Also relevant: face-covering norms in parts of the world.

## Environments

- Video calls: framing cuts the signing space at the sternum and the
  camera mirrors. Core signs should read inside a webcam crop.
  One more reason location contrasts below the waist are banned.
- Distance and low light: legibility at 20m outdoors is a design
  requirement (PLAN.md section 4); test it, don't assume it.
- Encumbered signing: holding a phone, a rail, a baby. Overlaps with the
  one-handed variant requirement.

## Content

- Cultural gesture collisions: any designed sign must be screened against
  offensive emblem gestures worldwide (thumbs-up, OK-ring, horns, palm-out
  moutza, left-hand taboos, pointing taboos). Build a screening checklist
  from the gesture-studies literature before phase 3 sign design; this is
  a check, like typos but for obscenity.
- Sign names for people and places: no fingerspelling in the core, so the
  convention for naming needs design (descriptive sign names are the
  natural-SL norm; collision handling is the open part).
- Numbers, dates, time: numeral systems vary wildly across SLs and are a
  known IS pain point. Treat as its own mini-study within the 500.
- Taboo and medical vocabulary: contact register needs body/health terms
  that are neither euphemistic to uselessness nor crude.
- Loanwords from national SLs: when a local sign is ubiquitous (e.g. a
  city's name sign), pansign should borrow, not compete. Borrowing rules
  are a phase 3 design item.

## Data and tooling

- Same gloss, different senses: dictionary glosses are spoken-language
  words; "right" (direction) and "right" (correct) must not merge. Concept
  list is defined by sense IDs, not English strings.
- One concept, several regional variants inside one language (ASL has
  multiple signs for BIRTHDAY): a language contributes its most frequent
  variant to convergence scoring, others recorded as alternates.
- Compound signs and multi-sign phrases in dictionaries: normalize before
  comparison or convergence scores inflate.
- Non-manual-only signs and mouthing-dependent signs: our core bans
  meaning that lives only in mouthing (it imports spoken language), but
  the data contains such signs; tag them on ingest.
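
The first two items can be sketched together: entries are keyed by sense
ID rather than English string, and each language contributes its most
frequent variant to scoring. The sense-ID format, glosses, and frequency
values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    sense_id: str      # hypothetical format, e.g. "right.direction"
    language: str
    gloss: str
    frequency: float

def pick_for_scoring(variants):
    """Most frequent variant per (sense, language); the rest become alternates."""
    primary, alternates = {}, {}
    for v in sorted(variants, key=lambda v: v.frequency, reverse=True):
        key = (v.sense_id, v.language)
        if key in primary:
            alternates.setdefault(key, []).append(v)
        else:
            primary[key] = v
    return primary, alternates

variants = [
    Variant("right.direction", "asl", "RIGHT-DIRECTION", 4.1),
    Variant("right.correct", "asl", "RIGHT-CORRECT", 5.0),
    Variant("birthday", "asl", "BIRTHDAY-A", 3.2),
    Variant("birthday", "asl", "BIRTHDAY-B", 2.7),
]
primary, alternates = pick_for_scoring(variants)
print(sorted(primary), [v.gloss for v in alternates[("birthday", "asl")]])
```

The two senses of "right" stay separate entries because nothing in the key
is an English word.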

PLAN.md

# pansign: designing one sign language for the world

**Naming, settled 2026-07-05.** pansign is the project's name; pansign.org
to be registered (pansign.com is squatted). The language's true name will
be a sign its users choose once there are users, the way Nicaraguan Sign
Language was named after it existed; the written word is transliteration.
**koine** is reserved as a product name, most likely the learning app: the
Greek common dialect, which is the concept exactly. Infant-facing product
names must clear a crowded trademark field first (section 10).

The goal: a signed language that any two people on earth could
share. Deaf-first, but learnable as a first non-spoken language by hearing
people everywhere. This document is the 1000-mile view: what the language is
for, what to copy, what to invent, how to bootstrap it, and where projects
like this have died before.

## 1. Why this is more plausible than Esperanto

Esperanto failed its ambition because it competed with English for a slot
everyone already fills. A universal sign language competes with nothing for
most of humanity: hearing people have no signed language at all, so it would
be their first, not their third. And the deaf world already behaves as if
this language wants to exist:

- **Sign languages are far more mutually intelligible than spoken ones.**
  Deaf strangers with no shared language converge on working communication
  in hours, not years. The cross-signing studies out of MPI Nijmegen
  document this happening in real time.
- **International Sign (IS) already exists** as a contact register used at
  WFD congresses and the Deaflympics. It is a pidgin, not a full language,
  but it proves the demand and supplies a partial base.
- **Plains Indian Sign Language** served for centuries as a lingua franca
  across dozens of unrelated spoken languages. A signed interlanguage
  across language families is not hypothetical; it has already happened.
- **Iconicity is a real discount.** A large fraction of sign vocabulary is
  motivated (the sign for "drink" looks like drinking). Iconic signs are
  learned faster by both deaf and hearing learners. Spoken conlangs get no
  such discount; every word is arbitrary.

The honest framing: this is less "invent a language from nothing" and more
"standardize and complete a convergence that keeps happening spontaneously."

## 2. What a sign language is actually used for

Design for the real use cases, ranked:

1. **Daily life of deaf people**: full expressive range. Argument, humor,
   poetry, technical talk, child-rearing. This is the bar for a real
   language; everything below is a subset.
2. **Deaf-to-deaf across borders**: today served badly by IS. The first
   market, because the users are skilled signers who only need vocabulary
   and conventions, not the modality.
3. **Deaf-hearing contact**: shops, hospitals, airports, family members who
   never get past 200 signs. A small core must carry this alone.
4. **Hearing-to-hearing where speech fails**: noise, distance, underwater,
   quiet required, across a window, on a video call with broken audio.
   Divers, crane operators, and traders all invented ad hoc sign systems;
   the demand exists.
5. **Parent-infant**: baby sign is already a mass hearing market. Babies
   sign before they can speak. This is the strangest and maybe strongest
   adoption vector: the first words of millions of hearing children could
   be in this language.
6. **Human-machine**: gesture interfaces and sign-recognition ML want a
   phonology that cameras can discriminate. No natural sign language was
   designed with this constraint; a new one can be.

Requirement that falls out of the ranking: the language must degrade
gracefully. A 50-sign user, a 500-sign user, and a fluent signer must all be
using the same language, not three systems.

## 3. Assumptions

Stated so they can be attacked:

- ~70M deaf people, ~150-300 documented sign languages, most tiny and many
  endangered. The large ones cluster into families (LSF lineage including
  ASL and Libras, BANZSL, German, Japanese, Chinese, Indo-Pakistani).
- Sign language grammar is convergent to a degree spoken grammar is not:
  spatial verb agreement, classifier constructions, topic-comment order,
  and grammatical facial expressions recur across unrelated families. Not
  universal, but common enough to be the default choices.
- Iconicity is partly cultural. "Eat" and "sleep" travel; "marriage,"
  "time," and anything metaphorical do not. The shared core is real but
  smaller than optimists assume; the 500-word study (section 5) measures it
  instead of guessing.
- A committee cannot finish a language. It can ship a seed; only a
  community of users, especially children, turns a seed into a language.
  Nicaraguan Sign Language emerged in one generation of schoolchildren from
  fragmentary input. Children will regularize whatever we ship, which means
  the seed needs good bones more than complete coverage.
- Chinese is not a special obstacle. Chinese Sign Language does not inherit
  the spoken language's difficulty; deaf signers in China face the same
  modality and the same iconic affordances as everyone else. The actual
  China problem is fingerspelling (CSL uses pinyin-based handshapes), which
  argues for keeping fingerspelling marginal (section 6).

## 4. Requirements

- **Deaf-first.** Fluent deaf signers must find it expressive and fast, or
  it is a gadget. Hearing learnability is a constraint, never the driver.
- **Auxiliary, not replacement.** It sits beside national sign languages
  the way English sits beside Dutch. The "replace any given sign language"
  goal from the brief is dropped deliberately; see pitfalls.
- **Motorically cheap.** Restrict the core lexicon to unmarked handshapes
  (the fist, flat hand, index point, spread hand, C and O shapes that
  appear in every studied sign language and that children acquire first).
- **Visually discriminable.** No minimal pairs that differ only in features
  a webcam at 15fps or a viewer at 20 meters loses. This single constraint
  serves distance signing, video calls, and machine recognition at once.
- **One-handed degradation.** Every core sign needs a defined one-handed
  variant: people hold phones, coffee, babies, and steering wheels, and
  some people have one hand.
- **Script-free.** No fingerspelled alphabet in the core. Fingerspelling
  imports a writing system (Latin for ASL, pinyin for CSL, kana tracing for
  JSL) and instantly de-internationalizes the language. Names get sign
  names; borrowings get signs.
- **Notation from day one.** A written/encodable form (successor to
  SignWriting or HamNoSys, machine-parseable) so the dictionary, corpus,
  and tooling exist before the community does.
- **Open.** Public domain lexicon, open corpus, open governance. A language
  with an owner is dead on arrival.
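
The discriminability requirement can be screened mechanically once signs
are coded. A minimal sketch, assuming a hypothetical five-parameter coding
and a hypothetical `FRAGILE` set of parameters that low-frame-rate cameras
or distant viewers lose; neither is fixed by this plan, and the toy entries
are invented:

```python
from itertools import combinations

# Hypothetical coding: one value per phonological parameter (section 5).
PARAMS = ("handshape", "location", "movement", "orientation", "nonmanual")
# Assumption: parameters whose contrasts a 15fps webcam or a distant
# viewer tends to lose. The real set has to be measured, not guessed.
FRAGILE = {"orientation", "nonmanual"}


def differing_params(a, b):
    """Return the parameters on which two coded signs differ."""
    return [p for p in PARAMS if a[p] != b[p]]


def fragile_minimal_pairs(lexicon):
    """Yield (gloss_a, gloss_b, param) for pairs that differ in exactly
    one parameter, where that parameter is in FRAGILE."""
    for (ga, a), (gb, b) in combinations(sorted(lexicon.items()), 2):
        diff = differing_params(a, b)
        if len(diff) == 1 and diff[0] in FRAGILE:
            yield ga, gb, diff[0]


# Toy entries, invented for illustration only.
lexicon = {
    "EAT": dict(handshape="O", location="mouth", movement="tap",
                orientation="palm-in", nonmanual="none"),
    "FOOD": dict(handshape="O", location="mouth", movement="tap",
                 orientation="palm-down", nonmanual="none"),
    "SLEEP": dict(handshape="flat", location="cheek", movement="hold",
                  orientation="palm-in", nonmanual="none"),
}
violations = list(fragile_minimal_pairs(lexicon))
```

Any pair the screen reports would need a redesign of one member before
entering the core.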

## 5. Method: the 500-word convergence study

The empirical heart, and cheap to run now:

1. Build a signed Swadesh-style list, ~500 concepts, weighted for actual
   conversational frequency (sign corpora exist for BSL, Auslan, NGT) plus
   the contact-situation vocabulary of use cases 3-5.
2. Pull the same 500 concepts from **Spread the Sign** (video dictionary,
   40+ sign languages) and the Global Signbank datasets. This corpus
   already exists; nobody has mined it for exactly this.
3. Code each entry by the standard phonological parameters: handshape,
   location, movement, orientation, non-manual. Cluster. For each concept,
   measure: is there a cross-family majority form? A shared iconic motif
   with surface variation? Or true divergence?
4. Triage the lexicon accordingly:
   - **Convergent** (expect maybe 20-30%): adopt the majority form,
     regularized to the constrained phonology. Free vocabulary.
   - **Shared motif**: design one sign that keeps the common iconic base.
     Cheap vocabulary; most learners get a mnemonic for free.
   - **Divergent**: greenfield. Design for iconicity where a culture-neutral
     image exists, otherwise optimize for articulation and discriminability.
5. Validate every design decision with deaf consultants from at least the
   five major families before it enters the dictionary, and test candidate
   signs in staged cross-signing sessions: put strangers in a room with the
   seed lexicon and record what survives contact. What people actually
   reproduce is the spec; the dictionary follows usage, not vice versa.
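
Step 4's triage can be written down as a rule over the coded entries. The
sketch below assumes each entry has already been reduced to a hypothetical
`(family, form_id, motif_id)` triple by the clustering in step 3, and that
"majority" means more than half of the families represented; the threshold
is a placeholder, not a project decision.

```python
from collections import defaultdict


def triage(entries, majority=0.5):
    """Classify one concept from (family, form_id, motif_id) entries.

    Families, not languages, are counted so that one large family
    cannot outvote the rest. `majority` is a placeholder threshold.
    """
    if not entries:
        raise ValueError("no entries for this concept")
    families = {fam for fam, _, _ in entries}
    form_fams, motif_fams = defaultdict(set), defaultdict(set)
    for fam, form, motif in entries:
        form_fams[form].add(fam)
        motif_fams[motif].add(fam)
    top_form = max(len(f) for f in form_fams.values()) / len(families)
    top_motif = max(len(f) for f in motif_fams.values()) / len(families)
    if top_form > majority:
        return "convergent"
    if top_motif > majority:
        return "shared motif"
    return "divergent"


# Invented toy data: four families, cluster ids are arbitrary labels.
eat = [("LSF", "f1", "hand-to-mouth"), ("BANZSL", "f1", "hand-to-mouth"),
       ("Japanese", "f1", "hand-to-mouth"), ("German", "f2", "hand-to-mouth")]
house = [("LSF", "f1", "roof"), ("BANZSL", "f2", "roof"),
         ("Japanese", "f3", "roof"), ("German", "f4", "walls")]
marriage = [("LSF", "f1", "ring"), ("BANZSL", "f2", "clasp"),
            ("Japanese", "f3", "thumb-pinky"), ("German", "f4", "bow")]
labels = [triage(c) for c in (eat, house, marriage)]
```

Counting families rather than languages is one way to keep a single large
lineage from manufacturing a false majority; other weightings are possible.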

Same procedure later for grammar, run on typological literature instead of
video dictionaries: where sign languages agree (spatial agreement,
classifiers, brow-marked questions, topic-comment), take the common
solution. Where they diverge (negation strategies, word order details),
pick the simplest system consistent with the modality and let usage sand it
down.

## 6. Copy versus greenfield

Copying is the shortcut; greenfield is the once-only chance to apply some
sixty years of sign linguistics on purpose. The split:

**Copy (convention is the value):**
- Spatial grammar: setting up referents in signing space and directing
  verbs between them. Every family does this; it is the modality's killer
  feature and costs hearing learners the most, so it must match what deaf
  signers already know.
- Classifier constructions: near-universal, productive, and the reason
  signers can improvise vocabulary that strangers understand.
- Non-manual grammar: brow raise for yes/no questions, furrow for
  wh-questions, headshake negation. Common enough to be de facto standard.
- The convergent lexicon from the 500-word study, and IS conventions with
  real currency (numbers, pointing pronouns, time-line metaphors).

**Greenfield (nature never optimized this):**
- Phoneme inventory: chosen for motor ease, visual distance between
  phonemes, camera framing (signing space biased high and central for
  selfie cameras), and symmetric one-handed fallback. Natural sign
  languages carry marked handshapes and sub-visible contrasts as
  historical baggage; drop it.
- Morphological regularity: fully regular aspect, plural, and agreement
  paradigms. Irregularity is history's scar tissue; a seed language starts
  without scars (children would strip them anyway; see Nicaragua).
- The notation system and a canonical machine encoding, co-designed with
  the phonology so every legal sign has exactly one encoding. This makes
  the dictionary diffable, the corpus searchable, and sign-recognition
  training data self-labeling.
- Register plumbing for the degradation requirement: the 50-sign survival
  core, the 500-sign contact register, and full grammar designed as
  concentric circles, each a strict subset of the next.
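
The "strict subset" property of the registers is cheap to enforce as a
repository check. A minimal sketch, assuming each register is stored as a
set of sign identifiers; the register names and toy glosses are invented:

```python
def check_concentric(registers):
    """Check that each register is a strict subset of the next one.

    registers: ordered list of (name, set_of_sign_ids), smallest first.
    Returns a list of (inner, outer, stray_signs) problems. An empty
    stray list means the two registers are identical, which also breaks
    strictness. An empty result means the nesting holds.
    """
    problems = []
    pairs = zip(registers, registers[1:])
    for (inner_name, inner), (outer_name, outer) in pairs:
        if not inner < outer:  # set "<" is the strict-subset test
            problems.append((inner_name, outer_name, sorted(inner - outer)))
    return problems


survival = {"EAT", "HELP"}
contact = survival | {"HOSPITAL"}
full = contact | {"MARRIAGE"}

ok = check_concentric(
    [("survival", survival), ("contact", contact), ("full", full)])
broken = check_concentric(
    [("survival", survival | {"DRINK"}), ("contact", contact)])
```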

## 7. Pitfalls

- **Gestuno (1975).** The WFD's committee published ~1500 signs as a book:
  vocabulary with no grammar, no community, no media, chosen by committee
  aesthetics rather than convergence data. Interpreters were handed it at
  congresses and audiences understood nothing. It quietly became today's
  IS only after deaf users discarded most of it and rebuilt by contact.
  Lesson: ship a usable register into a live community fast; a dictionary
  alone is a tombstone.
- **The replacement framing is poison.** Sign languages are the core of
  Deaf identity, and the community carries the memory of Milan 1880, after
  which hearing educators kept sign out of deaf schools for about a century.
  A hearing-led project to "replace national sign languages" would be received,
  correctly, as that again. Auxiliary framing, deaf leadership, and WFD
  partnership are not diplomacy; they are the difference between a
  language and an insult.
- **Iconicity chauvinism.** Designers reach for their own culture's image
  of a concept and call it universal. Every "obvious" sign needs testing
  against consultants from unrelated cultures.
- **The prestige trap.** If the auxiliary language gets institutional money
  and national sign languages stay poor, it becomes a threat to already
  endangered languages, and it should be opposed. Budget and advocacy must
  visibly flow to national sign languages alongside it.
- **Nativization drift.** If babies do acquire it natively in scattered
  communities, they will change it, and it may dialectize like everything
  else. Plan for a living standard with versioned reference corpora, not a
  frozen academy; the goal is mutual intelligibility, not purity.
- **Committee capture.** Every conlang community fractures over reform
  proposals (Esperanto vs Ido). Governance design (who can change the core,
  how slowly) matters as much as phonology.

## 8. Tradeoffs

- **Iconicity vs compactness.** Iconic signs are learnable but tend to be
  long and pantomimic; fluent languages compress. Resolution: iconic
  citation forms with documented reduced forms, mirroring what natural
  sign languages do diachronically anyway.
- **Deaf optimality vs hearing learnability.** Full spatial grammar is hard
  for hearing adults. Resolution: the concentric registers. The contact
  register is honest linear signing that fluent grammar contains as a
  degenerate case; hearing learners are signing correctly, just plainly.
- **Familiarity vs neutrality.** Leaning on convergence data means leaning
  on the big documented families, which are mostly European-descended, and
  ASL-heavy IS already gets resented for this. Resolution: weight the study
  corpus toward under-documented languages deliberately, and count the
  cost of neutrality as real (a maximally neutral sign is new to everyone).
- **Ship early vs ship right.** Gestuno shipped wrong; academic projects
  ship never. Resolution: the 500-sign contact register ships early and
  loose (it will be reshaped by use); the full grammar ships conservative
  and slow.

## 9. Media pipeline: video in, visuals out

The study consumes thousands of dictionary videos and the project must emit
reference visuals of its own. Neither direction needs generative models.

**Input: video to data.**
- Pose estimation is the workhorse. MediaPipe Holistic or MMPose turns each
  video into a keypoint time series (21 landmarks per hand, plus body and
  face). Mature, free, runs locally on this class of hardware.
- Auto-coding on top of keypoints: classify handshape against the
  inventory, bucket location into zones relative to body landmarks,
  classify the movement trajectory, detect handedness and symmetry. Output
  is the parameter vector per sign that section 5's clustering needs. The
  machine does the bulk coding; humans verify clusters, which is two
  orders of magnitude less labor than humans coding raw video.
- The data is less "entirely visual" than it looks. Decades of annotation
  already exist in machine-readable form and get ingested before any CV
  runs: ASL-LEX codes ~2700 ASL signs for phonological features, frequency,
  and crucially iconicity ratings; the Signbank family (Global/NGT, ASL,
  BSL, Auslan) runs the same software with per-entry handshape and location
  fields, so cross-language joins are cheap; the national corpora carry
  ELAN gloss annotations. CV fills the gaps and the overlap with human
  coding calibrates the auto-coder for free.
- Comparison runs at two levels. Form distance: normalize keypoint
  trajectories for signer body size and framing, then dynamic time warping
  gives sign-to-sign visual similarity directly from video. Phonological
  distance: compare the coded parameter vectors, which is more robust to
  signer idiosyncrasy and is what the triage clusters on. Disagreement
  between the two levels is itself informative (same idea, different
  surface, or vice versa).
- The similarity metric has a built-in oracle: it must rediscover the known
  sign language families blind. ASL close to LSF and Libras, far from BSL;
  BSL clustering with Auslan and NZSL. Woodward's hand-done
  lexicostatistics from the 1970s-90s is the answer key. A metric that
  cannot reproduce known genealogy has no business triaging the lexicon.
- Licensing is an input check, not an afterthought. Spread the Sign videos
  are copyrighted: extracted keypoints and derived codes are internal
  analysis data; no third-party media is ever republished. Everything the
  project records itself ships under an open license with signed releases.
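
The form-distance level described above can be sketched as classic dynamic
time warping. This is a minimal pure-Python illustration, not the project's
metric: it assumes trajectories are already normalized for body size and
framing, uses Euclidean distance between points, and divides by the summed
lengths so long and short signs are comparable. All three are illustrative
choices.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two keypoint trajectories.

    a, b: non-empty lists of equal-dimension points (tuples of floats).
    Returns the cheapest monotone alignment cost divided by
    len(a) + len(b).
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances
                                    cost[i][j - 1],      # b advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)


# Toy trajectories of one keypoint (x, y); invented for illustration.
sweep = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slow_sweep = [p for p in sweep for _ in range(2)]  # same path, half speed
lift = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]

same_path = dtw_distance(sweep, slow_sweep)
other_path = dtw_distance(sweep, lift)
```

The slowed-down copy aligns at zero cost, which is the property that makes
DTW suitable for comparing signers who articulate at different speeds.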

**Output: encoding to visuals, a ladder from cheap to gold.**
- The canonical form of every sign is its machine encoding, never a video.
  All visuals are renderings of the encoding and can be regenerated when
  the encoding changes.
- Avatar renders for drafts: compile the encoding to keypoint choreography
  and drive a rigged 3D model (prior art: JASigning animates HamNoSys via
  SiGML; same idea, saner encoding). Deaf audiences dislike robotic
  avatars, rightly, so avatars are internal proofs and previews only.
- Still illustrations: pull key frames from the reference video or the
  avatar and reduce to the classic dictionary style, line drawing plus
  movement arrows. Edge-detection gets most of the way mechanically; one
  paid illustrator defines the house style over the 500 core.
- Human video is the published standard: paid deaf signers on camera. Deaf
  faces on the reference material are simultaneously quality control and
  legitimacy; hearing-produced reference media gets rejected, and should.
- Crowdsourcing sits on top of a lint. Anyone may submit a recording for
  any sign; the input pipeline extracts its keypoints and scores the match
  against the canonical encoding; only passing takes reach the paid deaf
  review queue; accepted takes become credited alternate reference videos.

Two loops make the pipeline self-checking:
- **Recording lint**: the same pose pipeline that codes existing
  dictionaries validates every video we produce. One oracle, both
  directions.
- **Round trip**: encode, render on the avatar, and have signers read the
  sign back cold. A sign that fails readback has an ambiguous encoding or
  an unlearnable form, and it fails before money is spent filming it.

**Checks per stage** (each stage has an oracle before the next starts):
1. Ingest: coverage matrix, concepts by languages, missing cells counted.
2. Pose extraction: landmark confidence thresholds; manual audit of a
   random sample of videos.
3. Auto-coding: agreement with a 200-sign hand-coded gold sample (deaf
   annotator). Under ~90% on handshape or location, fix the coder before
   trusting any cluster downstream.
4. Clustering: stability under resampling; consultant review of cluster
   assignments.
5. Seed design: guessability testing, the standard iconicity methodology.
   Show the sign cold, ask for the meaning, open response then forced
   choice. Divergent-tier signs need a floor score to enter the dictionary.
6. Recordings: lint score, then deaf reviewer approval.
7. Published dictionary: per-sign feedback widget and variant submission
   on the website; submission volume per sign is itself a signal of which
   forms are contested.
8. Field: cross-signing trial performance against the IS baseline, and
   week-later recall from learners (spaced-repetition telemetry doubles as
   the learnability metric).
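
Check 3 reduces to a per-field agreement rate with a gate. A minimal
sketch, assuming auto-coded and gold codes are both dictionaries keyed by
sign id; it computes raw agreement, not a chance-corrected statistic, and
the 0.90 bar mirrors the "~90%" above rather than a settled threshold:

```python
def field_agreement(auto, gold, fields=("handshape", "location")):
    """Per-field share of gold signs where the auto-coder matches."""
    shared = sorted(set(auto) & set(gold))
    if not shared:
        raise ValueError("no overlap between auto-coded and gold signs")
    return {f: sum(auto[s][f] == gold[s][f] for s in shared) / len(shared)
            for f in fields}


def failing_fields(agreement, bar=0.90):
    """Fields under the bar; an empty list means the gate passes."""
    return [f for f, share in agreement.items() if share < bar]


# Toy gold sample of 10 signs; the real one is 200 hand-coded signs.
gold = {f"s{i}": {"handshape": "A", "location": "head"} for i in range(10)}
auto = {s: dict(v) for s, v in gold.items()}
auto["s0"]["handshape"] = "B"   # one handshape error
auto["s0"]["location"] = "hand"  # two location errors
auto["s1"]["location"] = "hand"

agreement = field_agreement(auto, gold)
blocked = failing_fields(agreement)
```

With skewed category frequencies raw agreement can flatter the coder, so a
chance-corrected measure may be worth reporting alongside it.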

## 10. Bootstrap: who learns it first and why

No language spreads on merit. Each wave needs a selfish reason:

1. **International deaf events.** IS users are the beachhead: already
   multilingual signers, already need this, already meet annually. Get the
   seed adopted as the working register of one recurring event and iterate
   there. This is where Gestuno half-worked despite itself.
2. **Interpreters and deaf professionals.** IS interpretation is a paid
   accreditation today (WASLI/WFD). A better-documented standard with a
   real curriculum makes their work easier; they become the teachers.
3. **Hearing baby-sign families.** Replace the ad hoc ASL-fragment baby
   sign market with the real contact register: same effort, and the child's
   50 signs are a live language shared with deaf people worldwide. This
   quietly seeds millions of hearing homes with the core lexicon.
   The market is real and crowded: Baby Signs (registered trademark,
   Acredolo/Goodwyn, franchised in 40+ countries), Signing Time, and a long
   tail of courses and flashcards, nearly all ASL fragments sold to hearing
   parents. Our differentiators are that the signs are a language rather
   than a party trick, and that the materials are deaf-vetted. Two hard
   rules for the infant guide (a parallel deliverable, see phases): honest
   science only, since the evidence says babies sign months before they can
   speak but not that signing accelerates development, and this market's
   signature vice is overpromising exactly that; and no trademark-adjacent
   naming. "Babytalk, the first language of new human beings" works as a
   pitch sentence, not as a product name (Babytalk was a parenting
   magazine for decades, and Baby Signs will litigate its corner).
4. **Special-use niches.** Divers, film sets, factories, air-traffic ground
   crews, noisy kitchens. Publish the survival core as the off-the-shelf
   answer to "we need hand signals"; every niche adoption is free marketing
   and free stress-testing.
5. **Deaf schools, last and only by invitation.** The babies-as-natives
   endgame from the brief is real (Nicaragua proves a child community can
   nativize a seed in a generation) but it can only happen inside deaf
   communities that choose it, most plausibly as a second language beside
   the national one. Pushing here first would trigger the Milan reflex and
   deserve to.

The realistic sequence is decades long, and that is fine. English needed
three hundred years; a signed auxiliary that owns niches 1-4 within twenty
years would already be the most successful constructed language in history
after Esperanto, with a clearer path past it.

## 11. The scorecard: numbers the language ships with

Every claim in the pitch gets a measured number, published with its method.
The scorecard is a first-class deliverable of the corpus study and trials,
not an afterthought, because "you already know a third of it" is the entire
sales pitch to an existing signer.

- **Head-start score, per language.** For each source language: the share
  of the pansign core an existing signer already knows (identical or
  cognate form) plus the share they can guess (shared iconic motif).
  Headline format: "An ASL signer starts at 34% known, 27% guessable."
  Falls straight out of the convergence study's distance data.
- **Neutrality score.** The spread of head-start scores across languages.
  If ASL signers start at 45% and CSL signers at 12%, the prestige-skew
  pitfall (section 7) has a number, and design iterations must narrow it.
  Published per release so drift is visible.
- **Guessability.** Share of core signs whose meaning naive viewers guess
  cold (open response, then forced choice), measured separately for deaf
  and hearing viewers across at least three unrelated cultures. The
  iconicity literature supplies natural-language baselines to beat.
- **Learnability.** Median hours to 90% recall of the 50-sign survival
  core; signs retained per study-hour after one week (spaced-repetition
  telemetry from the website's learning ladder).
- **Contact performance.** Task success rate and completion time for
  stranger pairs using pansign versus the IS baseline (Whynot 2016 gives
  published IS comprehension figures to compare against).
- **Robustness.** Share of the core that survives one-handed degradation;
  human legibility at distance; machine recognition accuracy from a 480p
  webcam. That last number exists for no natural sign language by design.
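
The first two scorecard numbers are simple shares once every core concept
has been labeled per source language. A minimal sketch, assuming a
hypothetical three-way label ("known", "guessable", "new") derived from the
distance data, and reading "spread" as max minus min, which is one possible
definition:

```python
def head_start(core, relation):
    """Return (known_share, guessable_share) for one source language.

    core: list of concept ids in the pansign core.
    relation: concept id -> "known" | "guessable" | "new"; concepts
    missing from the mapping count as new.
    """
    n = len(core)
    known = sum(relation.get(c) == "known" for c in core) / n
    guessable = sum(relation.get(c) == "guessable" for c in core) / n
    return known, guessable


def headline(language, known, guessable):
    """Format the scorecard headline for one language."""
    return (f"{language} signer starts at {known:.0%} known, "
            f"{guessable:.0%} guessable")


def neutrality_spread(known_by_language):
    """Max minus min of the known share across languages."""
    shares = list(known_by_language.values())
    return max(shares) - min(shares)


# Invented toy core of four concepts.
core = ["EAT", "SLEEP", "HOUSE", "MARRIAGE"]
known, guessable = head_start(core, {"EAT": "known", "HOUSE": "guessable"})
line = headline("ASL", known, guessable)
# The 45% / 12% figures are the hypothetical ones from the bullet above.
spread = neutrality_spread({"ASL": 0.45, "CSL": 0.12})
```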

## 12. Phases for this project

Each phase ends at its oracle (section 9 checks); nothing advances on
vibes.

1. **Pipeline calibration** (this repo, next): ingest signbank annotations
   and a video sample, stand up pose extraction and the auto-coder, and
   hit the gold-sample agreement bar. Exit: checks 1-3 green.
2. **Corpus study**: the 500-word convergence analysis. A v0 can run on
   annotations alone (ASL-LEX plus the Signbank exports cover several
   languages with zero video processing); video-derived coding extends it
   to the full Spread the Sign breadth. Output: data, clustering, the
   convergent/motif/divergent triage, and the first head-start and
   neutrality numbers for the scorecard (section 11). Publishable on its
   own even if everything else stalls. Exit: check 4.
3. **Design spec v0**: phoneme inventory, notation/encoding, grammar
   sketch, survival core (50) and contact register (500) as encodings
   with avatar drafts. Exit: round-trip readback plus guessability floors
   (check 5).
4. **Consultation and filming**: deaf consultants across families revise
   the spec; paid deaf signers record the reference videos through the
   lint. Nothing ships that deaf reviewers call hearing-brained. Exit:
   check 6 on the full core.
5. **The website** (the eventual product of this repo): dictionary with
   video, illustrations, and notation; the learning ladder (50 → 500 →
   grammar); corpus browser; crowdsource submission queue; all
   open-licensed. Exit: check 7 running live.
6. **Field use**: one event, one niche, one baby-sign curriculum. Measure
   against the IS baseline; fold what survives contact back into v1.
   Exit: check 8, repeated.

Parallel track, can start once the survival core exists in draft: the
**infant guide**. A short, honest manual for teaching the 50-sign core to
hearing babies: when to start (signs emerge around 8-10 months, months
ahead of speech), how to sign consistently during routines, what to expect,
what not to expect (no cognitive miracles; the payoff is earlier
communication and fewer frustration tantrums). Written to be translated,
deaf-reviewed before publication, and eventually the seed content for the
koine learning app.

Still open, deliberately: the governance charter (who can change the core
and how slowly), the funding model for paid deaf review and filming,
consent and compensation norms written down before the first recording
session, and where the video corpus physically lives once it outgrows a
git repo.

## 13. Sources

**Plains Indian Sign Language:**
- The 1930 Indian Sign Language Grand Council film: shot September 4-6,
  1930, Browning, Montana; organized by Gen. Hugh L. Scott, who died before
  finishing the planned 1300-sign film dictionary. Held by the US National
  Archives (Office of Indian Affairs); viewable online at
  https://vimeo.com/681506002 ("Office of Indian Affairs: The Indian Sign
  Language, 1930"). The largest intertribal council ever filmed: eighteen
  participants from a dozen nations.
- Jeffrey Davis, *Hand Talk: Sign Language among American Indian Nations*,
  Cambridge University Press, 2010. The modern treatment.
- Davis's NSF/NEH digital archive and corpus: http://pislresearch.com
  (University of Tennessee). Includes the digitized 1930 films and fieldwork
  with 25+ living signers (Northern Cheyenne, Crow, Blackfeet, Assiniboine),
  with Melanie McKay-Cody.
- Garrick Mallery, *Sign Language Among North American Indians*, First
  Annual Report of the Bureau of Ethnology, 1881. Public domain; on
  archive.org and Project Gutenberg. Hundreds of pages of described signs.
- William Tomkins, *Universal Indian Sign Language*, 1926. Popular primer,
  Dover reprint still in print.

**International Sign:**
- World Federation of the Deaf, *Gestuno: International Sign Language of the
  Deaf*, British Deaf Association, 1975. The cautionary tale, worth owning.
- Rachel Rosenstock and Jemina Napier (eds.), *International Sign:
  Linguistic, Usage, and Status Issues*, Gallaudet University Press, 2016.
- Lori Whynot, *Understanding International Sign: A Sociolinguistic Study*,
  Gallaudet University Press, 2016. The comprehension and density data.
- WFD-WASLI International Sign interpreter accreditation (wfdeaf.org).

**Cross-signing and language emergence:**
- Kang-Suk Byun, Connie de Vos, Anastasia Bradford, Ulrike Zeshan, and
  Stephen Levinson, "First Encounters: Repair Sequences in Cross-Signing,"
  *Topics in Cognitive Science*, 2018. Plus the wider MULTISIGN project at
  iSLanDS (UCLan) and MPI Nijmegen.
- Ann Senghas and Marie Coppola, "Children Creating Language: How Nicaraguan
  Sign Language Acquired a Spatial Grammar," *Psychological Science*, 2001.

**Data for the convergence study:**
- Spread the Sign: https://spreadthesign.com (European Sign Language
  Centre, Örebro). Video dictionary, 40+ sign languages.
- Global Signbank (Radboud University) and the national corpora: BSL Corpus
  (UCL), Auslan Corpus, Corpus NGT. Frequency lists come from these.
- ASL-LEX: https://asl-lex.org. ~2700 ASL signs with phonological coding,
  frequency, and iconicity ratings.
- The sign-language-processing open-source ecosystem (pose-format and
  friends): https://sign-language-processing.github.io. Pose estimation
  and tooling built specifically for sign video.
- James Woodward's lexicostatistical comparisons of sign languages
  (1970s-90s). The hand-computed answer key for validating any automated
  similarity metric.
- Robbin Battison, *Lexical Borrowing in American Sign Language*, 1978. The
  unmarked-handshape inventory.
- Notation prior art: Sutton SignWriting (signwriting.org) and HamNoSys
  (University of Hamburg).

## 14. What success looks like

Not "everyone signs it." Success in order of ambition: the convergence
study is cited; the contact register demonstrably outperforms IS for
strangers in trials; one international deaf event adopts it; a hearing
family and a deaf traveler manage a real conversation with it; somewhere,
a child signs their first word in it and a deaf adult on another continent
would have understood.

notes/methods.md

# Methods (accreting draft)

LOG.md is the diary; this file is the citable prose. Every completed work
item gets rewritten here in past-tense methods style with citations, so a
conference talk or paper is a rendering job, not archaeology. Update in
the same save as the work it describes, or the next one.

## Data

ASL-LEX 2.0 (Sehyr, Caselli, Cohen-Goldberg & Emmorey 2021, J. Deaf
Studies and Deaf Education 26(2)) was retrieved from OSF (project zpha4)
on 2026-07-05: 2,723 ASL signs with phonological coding (handshape,
selected fingers, flexion, sign type, movement, major/minor location,
coded per morpheme for compounds), subjective frequency from deaf signers,
and iconicity ratings from both hearing and deaf raters.

The Dicta-Sign lexicon portal (Matthes et al. 2012, LREC; hosted by the
University of Hamburg) provides 2,092 concepts with parallel citation-form
videos in BSL, DGS, LSF, and GSL at systematically constructed URLs. It
serves as the concept-aligned video corpus for pose-based comparison,
sidestepping cross-dictionary gloss matching.

Global Signbank (Radboud; Crasborn et al.) hosts the NGT dataset (~7,469
glosses, CC BY 4.0) with per-gloss phonological fields; export pending
account access.

## Lexical statistics, ASL-LEX (2026-07-05)

Script: analysis/asl_lex_summary.py (Python stdlib; deterministic).
Results (data/derived/asl-lex-summary.md): 58 distinct dominant-hand
handshapes; the 10/20/25 most frequent cover 56/79/87% of entries.
Sign types: 39.2% one-handed, 34.5% symmetrical or alternating, 23.4%
asymmetrical, 2.9% violating Battison's dominance/symmetry conditions.
Major location: 36.3% neutral space, 26.7% head, 25.0% hand, below-waist
locations absent. Movement: straight 45.7%, curved 23.2%, none 15.3%.
Iconicity (n=990 signs rated by both groups, 7-point scale): hearing mean
3.16, deaf mean 3.01, Pearson r = .817 between groups; 29.7% (hearing) and
27.1% (deaf) of signs rated >= 4. Fingerspelled loan signs: 1.4% of
entries.
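
The handshape coverage figures are cumulative shares of a frequency
ranking. The sketch below shows that computation on invented values; it is
not analysis/asl_lex_summary.py, and it makes no assumption about ASL-LEX
column names.

```python
from collections import Counter


def top_k_coverage(values, ks=(10, 20, 25)):
    """Share of entries covered by the k most frequent values."""
    ranked = [count for _, count in Counter(values).most_common()]
    total = sum(ranked)
    return {k: sum(ranked[:k]) / total for k in ks}


# Toy handshape column: 10 entries over 3 handshapes.
handshapes = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
coverage = top_k_coverage(handshapes, ks=(1, 2, 3))
```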

Interpretation carried into design: a constrained (~20) handshape
inventory forfeits little contrast space; three quarters of a natural
lexicon is already one-hand-compatible; visual-frame constraints
(webcam-high signing space) match natural practice; the iconicity
discount is real but bounded (~30% of lexicon).

## Pose extraction pipeline (2026-07-05)

pipeline/extract_pose.py: MediaPipe HolisticLandmarker (Tasks API, model
bundle float16, VIDEO running mode) over OpenCV-decoded frames, emitting
per-frame left/right hand (21 landmarks each) and body pose (33
landmarks) normalized coordinates, gzip-JSON per video. Smoke test on
four parallel Dicta-Sign videos of one concept (BSL/DGS/LSF/GSL): pose
detected in 99-100% of frames, hands in 45-100% (the 45% left-hand figure
is a one-handed sign, i.e. correct rejection, not failure). Throughput
approximately real-time on a 2-core VPS: 401 frames in 21 s including
model load. Handedness is stored as raw left/right; dominant/non-dominant
mapping is deferred to analysis (mirroring normalization).
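
Detection rates like those above can be recomputed from the stored files.
The sketch assumes a hypothetical schema, a gzip-compressed JSON list with
one object per frame and the keys `pose`, `left_hand`, and `right_hand`,
with undetected parts stored as null; the actual layout written by
pipeline/extract_pose.py may differ.

```python
import gzip
import json
import os
import tempfile

PARTS = ("pose", "left_hand", "right_hand")  # hypothetical field names


def detection_rates(path):
    """Share of frames in which each tracked part was detected.

    Assumes one JSON object per frame; a missing, null, or empty value
    counts as "not detected".
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        frames = json.load(fh)
    if not frames:
        raise ValueError("no frames in " + path)
    return {part: sum(1 for f in frames if f.get(part)) / len(frames)
            for part in PARTS}


# Toy video: 4 frames, left hand visible in only one of them.
body, hand = [[0.5, 0.5]], [[0.4, 0.6]]
frames = [
    {"pose": body, "left_hand": hand, "right_hand": hand},
    {"pose": body, "left_hand": None, "right_hand": hand},
    {"pose": body, "left_hand": None, "right_hand": hand},
    {"pose": body, "left_hand": None, "right_hand": hand},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(frames, fh)
    rates = detection_rates(path)
```

For scale, the smoke-test throughput of 401 frames in 21 s works out to
about 19 frames per second, model load included.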

## Planned next

Concept list v0 (D001 recipe), NGT lexical statistics replicating the
ASL-LEX summary, keypoint normalization (body-size and framing invariant),
form-distance metric (DTW over normalized trajectories) validated by
rediscovery of known language families (Woodward's lexicostatistics as
answer key).
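
One plausible shape for the planned normalization, sketched under
assumptions: 2D only, origin at the shoulder midpoint, unit length equal to
shoulder width, and input coordinates already corrected for the video's
aspect ratio (MediaPipe normalizes x and y by image width and height
separately). Indices 11 and 12 are the left and right shoulder in the
33-landmark MediaPipe pose model; everything else here is illustrative.

```python
import math

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12  # MediaPipe pose landmark indices


def normalize_frame(points, pose):
    """Express points relative to the signer's shoulders.

    points: list of (x, y) keypoints to normalize (e.g. one hand).
    pose: list of 33 (x, y) body landmarks for the same frame.
    """
    lx, ly = pose[LEFT_SHOULDER]
    rx, ry = pose[RIGHT_SHOULDER]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    width = math.hypot(lx - rx, ly - ry)
    if width == 0:
        raise ValueError("shoulders coincide; cannot set a scale")
    return [((x - cx) / width, (y - cy) / width) for x, y in points]


def toy_pose(left, right):
    """Build a 33-landmark pose with only the shoulders filled in."""
    pose = [(0.0, 0.0)] * 33
    pose[LEFT_SHOULDER], pose[RIGHT_SHOULDER] = left, right
    return pose


# The same posture framed twice: once wide, once zoomed in and shifted.
wide = normalize_frame([(0.5, 0.7)], toy_pose((0.4, 0.5), (0.6, 0.5)))
zoomed = normalize_frame([(0.3, 0.6)], toy_pose((0.1, 0.2), (0.5, 0.2)))
```

Both framings land on approximately the same normalized point, one
shoulder width below the shoulder line, which is the invariance the
form-distance metric needs.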

data/SOURCES.md

# Data source inventory

Access status as actually tested from this server (dates noted). Raw
downloads live in data/raw/ (gitignored); derived tables in data/derived/
(committed).

## Ingested

### ASL-LEX 2.0 (ASL) — ingested 2026-07-05

- 2,723 signs, 191 columns: full phonological coding (handshape, selected
  fingers, flexion, sign type, movement, major/minor location, per-morpheme
  for compounds), signer frequency ratings, iconicity from both hearing and
  deaf raters, fingerspelled-loan flags.
- Fetch: OSF project zpha4 via API, files signdata.csv (osf.io/download/9nygd)
  and signdataKEY.csv (osf.io/download/ygq4v). CSV is latin-1, not utf-8.
- License: academic dataset, citation required (Sehyr et al. 2021, J. Deaf
  Studies & Deaf Education). Aggregate statistics fine; do not redistribute
  the CSV itself. Local copy: data/raw/asl-lex-2-signdata.csv.
- First summary: data/derived/asl-lex-summary.md (analysis/asl_lex_summary.py).
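
Because the CSV is latin-1, it needs an explicit encoding when read. A
minimal stdlib sketch; the function name is hypothetical and no column
names are assumed:

```python
import csv


def read_rows(path):
    """Read a latin-1 encoded CSV into a list of dicts.

    newline="" lets the csv module handle line endings itself, as the
    standard library documentation recommends.
    """
    with open(path, newline="", encoding="latin-1") as fh:
        return list(csv.DictReader(fh))
```

Called as `read_rows("data/raw/asl-lex-2-signdata.csv")`, this returns one
dict per sign, keyed by the header row.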

## Confirmed accessible, ingest pending

### Global Signbank (Radboud) — probed 2026-07-05, more datasets 2026-07-09

- https://signbank.cls.ru.nl. Publicly listed datasets: **NGT** (~7,469
  glosses, CC BY 4.0, public subset browsable), **Kata Kolok** (village
  sign language, valuable out-of-family data point), EmblemsNL (Dutch
  emblem gestures, useful for the offensive-gesture screen in EDGECASES).
- Operator's logged-in view (2026-07-09) shows the full catalog, far
  richer than the anonymous listing. Sizes in parens. Priority order:
  1. **NGT** (7,469; CC BY 4.0) — primary cross-language statistics.
  2. **CSL_Shanghai** (2,241) — Chinese SL, closes the worst of the
     neutrality gap (Q3).
  3. **ASL** (3,105) — calibrates Signbank coding against ASL-LEX (D007).
  4. **IS series**: IS_WFD1959 (357), IS_WFD1971 (373), IS_WFD1975 =
     Gestuno (1,691), IS_WFD1979 (228), IS_WFD2007 (216), IS_Whynot
     (207, empirical IS from the 2016 comprehension study), IS_2021
     (155). The complete documented history of designed international
     sign vocabularies: enables a longitudinal Gestuno autopsy.
  5. **ISL** Israeli (2,268), **Kata Kolok** (1,312) + BaliHS homesign
     (80), **JSL/KSL/TSL** (203/119/180, the Japanese family, ideal for
     the family-rediscovery oracle), **ZEI** Farsi (221), BISINDO (62),
     CARICOM (531), Konchri Sain (531), LSC (124).
  6. European bulk: **VGT** (16,928), **LSFB** (3,512), **NTS** (1,946).
  - Skip: Oefen, tstMH, DRS, LaSiMa (test/personal datasets), NGT-ns
    (non-signs; harmless extra).
  - Most datasets show 0 public signs, so per-dataset Request Access is
    likely needed even though Download CSV links render. Operator is on
    it; exports land in data/raw/ as downloaded.
- Phonological fields exist per gloss (handshape, location etc.).
- Bulk export (CSV/ECV) appears to require a free researcher account; the
  anonymous ECV endpoint redirects to home. OPERATOR ACTION when v0 needs
  it: register at signbank.cls.ru.nl, request NGT dataset access (CC BY
  makes approval likely), download the gloss CSV export, drop it in
  data/raw/. Alternative: per-gloss public pages are scrapable politely,
  but the export is cleaner.

### Dicta-Sign lexicon portal (BSL, DGS, LSF, GSL) — probed 2026-07-05

- https://www.sign-lang.uni-hamburg.de/dicta-sign/portal/
- 2,092 concepts, each page linking parallel videos in all four languages
  at predictable URLs (concepts/cs/cs_N.html → bsl|dgs|lsf|gsl/N.mp4).
- No phonological coding on the pages; this is video input for the pose
  pipeline (PLAN.md section 9), already concept-aligned across four
  languages, which removes the gloss-matching problem entirely.
- EU research project output hosted by University of Hamburg. Research use
  is clearly intended; confirm terms before anything derived publishes
  (QUESTIONS Q1 applies). Videos: fetch, extract keypoints, discard media.

## Probed, status noted

- **BSL SignBank** (bslsignbank.ucl.ac.uk): up; export/licensing not yet
  investigated in depth.
- **Auslan Signbank** (auslan.org.au): up; same.
- **Svenskt teckenspråkslexikon** (teckensprakslexikon.su.se): up; the
  Örebro/STS ecosystem behind Spread the Sign; open-data status to check.
- **Spread the Sign** (spreadthesign.com): up; no public API; bulk use is
  an operator ToS/ethics call (DECISIONS D004). Not needed for v0.

## Plains Indian Sign Language (public domain, extraction task)

- Tomkins 1926, *Universal Indian Sign Language*: archive.org items
  `indiansignlangua00tomk` (djvu.txt exists, 310 KB) and `bp_985073`.
  Direct fetches from this server hit archive.org throttling 2026-07-12;
  retry later or fetch manually. Once the text lands, extracting the
  ~800 sign descriptions into structured form gives an out-of-family
  historical lexicon and serves the operator's sovereignty framing: PISL
  is the proof that a signed auxiliary across nations worked.
- Mallery 1881 (Bureau of Ethnology annual report): archive.org /
  Project Gutenberg; same extraction approach, much bigger.
- Davis's pislresearch.com corpus: video, for the pose pipeline later.

## Wanted, not yet located

- Frequency lists: BSL Corpus and Auslan Corpus frequency studies publish
  top-N lists in papers; extract tables when the concept list is built.
- Cross-language similarity ground truth: Woodward's lexicostatistical
  cognate percentages (for the family-rediscovery oracle); digitize the
  tables from the papers.
- CSL, JSL, Indo-Pakistani SL sources with any machine-readable coding:
  nothing found yet; this gap is exactly the neutrality risk (QUESTIONS Q3).
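
Until the Woodward tables are digitized, the family-rediscovery oracle can
start as a coarser nearest-neighbor check: every language's closest
neighbor under the metric should belong to its own known family. The
sketch uses the family groupings named in PLAN.md; the distances are
invented placeholders, not measurements.

```python
def family_failures(dist, family):
    """List (language, nearest_neighbor) pairs that cross a family line.

    dist: {(a, b): distance} keyed by alphabetically sorted pairs.
    family: language -> known family name.
    """
    langs = sorted(family)
    failures = []
    for a in langs:
        others = [b for b in langs if b != a]
        nearest = min(others, key=lambda b: dist[tuple(sorted((a, b)))])
        if family[nearest] != family[a]:
            failures.append((a, nearest))
    return failures


def as_dist(raw):
    """Normalize pair keys to sorted order."""
    return {tuple(sorted(pair)): d for pair, d in raw.items()}


family = {"ASL": "LSF lineage", "LSF": "LSF lineage",
          "BSL": "BANZSL", "Auslan": "BANZSL"}
# Placeholder distances: small within families, large across them.
good = as_dist({("ASL", "LSF"): 0.3, ("BSL", "Auslan"): 0.2,
                ("ASL", "BSL"): 0.8, ("ASL", "Auslan"): 0.8,
                ("LSF", "BSL"): 0.8, ("LSF", "Auslan"): 0.8})
bad = dict(good)
bad[("ASL", "BSL")] = 0.1  # a metric that wrongly pairs ASL with BSL

passes = family_failures(good, family)
fails = family_failures(bad, family)
```

Comparing the metric's distances against the published cognate percentages
would be the stronger test once that ground truth is machine-readable.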