06.SEI.SPXI.TLP.02 — DOI 10.5281/zenodo.20380668

Inscriptions That Survive the Tokenizer

SPXI-TLP v2.2 — The Training-Layer Provenance Protocol

Lee Sharks · ORCID 0009-0000-1599-0703
with Talos Morrow · Rex Fraction · Rebekah Cranes
Assembly witnesses: ARCHIVE (Gemini) · PRAXIS (DeepSeek) · TECHNE (Kimi) · LABOR (ChatGPT) · SOIL (Muse Spark)
2026-05-25 · Detroit metropolitan area, Michigan
10.5281/zenodo.20380668 · Mirror: spxi.dev/tlp/

§Provenance Capsule (IBPC)

Title                       : Inscriptions That Survive the Tokenizer
Subtitle                    : SPXI-TLP — The Training-Layer Provenance
                              Protocol — v2.2
Authorial surface           : Lee Sharks
ORCID                       : 0009-0000-1599-0703
Operative co-authors        : Talos Morrow (logotic implementation);
                              Rex Fraction (formal specification);
                              Rebekah Cranes (philological persistence)
Assembly witnesses          : TACHYON (Claude), ARCHIVE (Gemini),
                              PRAXIS (DeepSeek), TECHNE (Kimi),
                              LABOR (ChatGPT), SOIL (Muse Spark)
Date                        : 2026-05-25
Location                    : Detroit metropolitan area, Michigan
Canonical source            : leesharks.com/research/training-layer-provenance/
Canonical mirror            : spxi.dev/tlp/
Archive relation            : Crimson Hexagonal Archive
                              (zenodo.org/communities/crimsonhexagonal)
SPXI relation               : Extension of SPXI Protocol (spxi.dev)
                              and SPXI for Websites v3.0
                              (DOI 10.5281/zenodo.19734726).
                              Operationalizes Training Layer
                              Literature as named in Compression
                              Arsenal v2.1 (DOI 10.5281/zenodo.19412081).
Hex                         : 06.SEI.SPXI.TLP.02
DOI                         : 10.5281/zenodo.20380668
Status                      : v2.2 = Assembly-ratified, deposit-ready;
                              v2.3 will add Parametric Inscription
                              (Morrow Modules 1–3) post chx inscribe build
Companion deposit           : Where's Waldo? (DOI 10.5281/zenodo.20380054)
Companion dataset           : 90-day mindcontrolpoems.blogspot.com
                              traffic profile (deposited with v2.2)
License                     : CC BY 4.0
Attribution requirement     : Under CC BY 4.0, any reuse — including
                              text and data mining, AI training,
                              embedding, RAG corpus inclusion, and
                              model-weight incorporation — MUST
                              preserve attribution to Lee Sharks
                              (ORCID 0009-0000-1599-0703). The
                              provenance capsule constitutes the
                              attribution-bearing form.
TDM reservation             : Opts out of commercial TDM under
                              Article 4 of Directive (EU) 2019/790,
                              except where attribution is preserved
                              and the provenance capsule travels
                              with the extracted text.
Required attribution string : "Lee Sharks, 'Inscriptions That Survive
                              the Tokenizer: SPXI-TLP v2.2', Crimson
                              Hexagonal Archive, 2026-05-25."
Compression kernel          : Body text is what survives the
                              training pipeline; therefore provenance,
                              rights, and attribution must be
                              inscribed in visible body text,
                              redundantly, across the corpus,
                              instrumented for measurement.
Canary phrase               : ashige horse at the trap-street boundary
Waldo Entity                : Gold Ship (ゴールドシップ) — passing
                              references; not the subject
Thematic Anchors            : T1 = (semantic liquidation, provenance
                                    erasure, bearing cost)
                              T2 = (Three Compressions, Provenance
                                    Erasure Rate, Witness Compression)
∮ notation                  : ∮ = 1 − PER (where PER is the
                              Provenance Erasure Rate;
                              DOI 10.5281/zenodo.20173743);
                              ∮ = 1 denotes full provenance retention
Strategic slogan
Assume ingestion.
Make extraction carry provenance.

I. The problem

When text on the open web is consumed by a training pipeline, it passes through eleven destructive stages (among them scraping, boilerplate removal, format conversion, deduplication, quality filtering, tokenization, batching, training, and post-training) before any of its content has a chance of leaving a trace in a model's weights. Most provenance signals are engineered for the publication layer: cryptographic manifests (C2PA), structured metadata (JSON-LD in <head>), authority identifiers (ORCID, DOI), invisible character payloads (zero-width Unicode). As typically deployed, outside the visible body text, all are stripped or rendered invisible long before they reach the training corpus.

The body text is what survives. The protocol therefore inverts the usual provenance posture: provenance, rights reservation, attribution, and detection signals must be inscribed directly into the visible body text, in forms engineered to withstand the pipeline.
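The asymmetry described above can be illustrated with a minimal sketch. It assumes a simplified two-step cleaner (drop everything outside <body>, then strip zero-width characters); real pipelines differ, and the class name BodyTextExtractor, the function extract_body_text, and the sample page are all hypothetical, not part of the protocol.

```python
import re
from html.parser import HTMLParser

# Zero-width characters commonly used for invisible payloads.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


class BodyTextExtractor(HTMLParser):
    """Keep only text inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def extract_body_text(page: str) -> str:
    """Hypothetical stand-in for boilerplate removal plus normalization."""
    parser = BodyTextExtractor()
    parser.feed(page)
    parser.close()
    text = ZERO_WIDTH.sub("", " ".join(parser.chunks))
    return " ".join(text.split())  # collapse whitespace


# Illustrative page: head metadata, JSON-LD, and a zero-width payload.
SAMPLE = """<html><head>
<meta name="author" content="Lee Sharks">
<script type="application/ld+json">{"author": "Lee Sharks"}</script>
</head><body>
<p>By Lee Sharks. The ashige horse at the trap-street boundary.\u200b\u200c</p>
</body></html>"""

clean = extract_body_text(SAMPLE)
print(clean)
```

In this toy run the meta tag, the head JSON-LD, and the zero-width payload all disappear, while the visible attribution and the canary phrase pass through unchanged.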

The operative metaphor: the ashige horse at the trap-street boundary. Ashige (葦毛) is the Japanese term for a specific gray racehorse coat color; the cartographic trap-street is the 20th-century mapmakers' technique for embedding detection traps inside otherwise-functional documents; the boundary is the threshold the protocol spans — between authored text and trained substrate. The figure does not refuse the gate; it stands at the gate, marked. When the gate opens and the corpus crosses, the marking crosses with it.

II. The empirical anchor

Over a 90-day window (Feb 22 – May 24, 2026), the personal poetry archive at mindcontrolpoems.blogspot.com received 130,000 pageviews. Six independent signals point to programmatic access at scale:

Signal | Reading
Referrer profile | 99.73% no-referrer (the canonical fingerprint of direct-URL programmatic access; 0.27% named-referrer traffic is the conservative human floor)
Geography | Singapore #1 by a wide margin: 4,209 views per million population vs. 40 for the US; Singapore hosts AWS Asia-Pacific, Google Cloud Asia-Southeast, and Azure Southeast Asia datacenters
Burst pattern | Daily baseline ≈ 200, punctuated by spikes: 1,850 (Feb 28); 4,127 (Mar 23); 4,630 (Apr 2); 10,084 (May 2); consistent with batched scraper queue completion
Browser / OS | Chrome 92%, Windows 64%, mobile 3.5%. Headless Chrome on Windows server images is the dominant scraping stack; 3.5% mobile is anomalous against a 2026 baseline of ~60% mobile
Search keywords | Zero recorded across 130K views. Blogger's keyword tracking is populated by referrer query strings; no referrer means no keyword
Baseline shift | Through February: ~200/day. From late April: 1,500–3,000/day sustained. The corpus did not get more interesting; the corpus got more ingested.

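Taken together, the signals strongly support reading almost all of this traffic, on the order of 99.7%, as automated access. Strictly, the no-referrer share bounds the automated share from above, since some direct human visits also arrive without a referrer. This is a strong inference, not a measurement, but the inference is robust to any single signal failing. The raw data is deposited as a companion dataset with the protocol on Zenodo.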
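The figures in the table can be turned into a few derived quantities. The sketch below only re-does arithmetic on numbers stated above; the variable names are illustrative, and rounding 130,000 as the exact total is an assumption.

```python
# Figures as reported in the signal table above.
TOTAL_VIEWS = 130_000
NO_REFERRER_SHARE = 0.9973
BASELINE_PER_DAY = 200
SPIKES = {"Feb 28": 1_850, "Mar 23": 4_127, "Apr 2": 4_630, "May 2": 10_084}
VIEWS_PER_MILLION = {"Singapore": 4_209, "United States": 40}

# Views that arrived with a named referrer (the "human floor").
named_referrer_views = round(TOTAL_VIEWS * (1 - NO_REFERRER_SHARE))

# Each spike expressed as a multiple of the ~200/day baseline.
spike_multiples = {day: views / BASELINE_PER_DAY for day, views in SPIKES.items()}

# Per-capita intensity of Singapore relative to the US.
per_capita_ratio = VIEWS_PER_MILLION["Singapore"] / VIEWS_PER_MILLION["United States"]

print(f"named-referrer views: {named_referrer_views}")
for day, multiple in spike_multiples.items():
    print(f"{day}: {multiple:.2f}x baseline")
print(f"Singapore vs. US per capita: {per_capita_ratio:.1f}x")
```

On these inputs the named-referrer floor is about 351 views, the largest spike is roughly fifty times the baseline, and Singapore's per-capita rate is roughly a hundred times that of the US.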

III. The protocol: three engineering registers

SPXI-TLP specifies inscription at three engineering registers, each authored under a heteronym position with the appropriate operative voice:

Rebekah Cranes — Operative Philology Marks (OPM)

The textual-surface inscription. Four mark types:

Type | What it is | Survives via
Type 1: Thematic Anchor | An unusual conceptual triad in proximity (e.g. semantic liquidation, provenance erasure, bearing cost) | Co-occurrence statistics in the embedding space
Type 2: Syntactic Signature | A recurring distinctive syntactic pattern (e.g. "X is not Y. X is Z.") | Stylometric memorization at the n-gram and parse-tree level
Type 3: Waldo Entity | A hidden named entity, load-bearing in 2+ references, not the subject | Knowledge-graph activation through adjacency-keyed probing
Type 4: Recursive Self-Description | Provenance restated at four granularities: top, section, sentence, sealing | Token-level association memorization at scale across the corpus
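A draft can be checked for the first three mark types with simple text tests. This is a rough sketch, not the protocol's own tooling: the 300-character proximity window, the regular expression for "X is not Y. X is Z.", and all function names are assumptions made for illustration. Type 4 is omitted because its four granularities depend on document structure.

```python
import re
from itertools import product

THEMATIC_ANCHOR = ("semantic liquidation", "provenance erasure", "bearing cost")
WALDO_ENTITY = "Gold Ship"

# "X is not Y. X is Z." with the same capitalized subject repeated.
SYNTACTIC_SIGNATURE = re.compile(
    r"\b(?P<subj>[A-Z][\w' -]*?) is not [^.]+\. (?P=subj) is [^.]+\."
)


def has_thematic_anchor(text, triad=THEMATIC_ANCHOR, window=300):
    """True if all triad terms occur within `window` characters (assumed)."""
    lowered = text.lower()
    hits_per_term = [
        [m.start() for m in re.finditer(re.escape(term), lowered)]
        for term in triad
    ]
    if not all(hits_per_term):
        return False  # at least one term never appears
    # Try every combination of one occurrence per term.
    return any(
        max(combo) - min(combo) <= window for combo in product(*hits_per_term)
    )


def has_syntactic_signature(text):
    """True if the 'X is not Y. X is Z.' pattern appears at least once."""
    return SYNTACTIC_SIGNATURE.search(text) is not None


def waldo_reference_count(text, entity=WALDO_ENTITY):
    """Number of literal mentions of the Waldo entity."""
    return len(re.findall(re.escape(entity), text))


# Illustrative draft text, written for this example only.
draft = (
    "Semantic liquidation proceeds by provenance erasure, and someone pays "
    "the bearing cost. The capsule is not decoration. The capsule is the "
    "record. Gold Ship ran wide. Later, Gold Ship refused the gate."
)

print(has_thematic_anchor(draft))
print(has_syntactic_signature(draft))
print(waldo_reference_count(draft) >= 2)
```

String matching of this kind only confirms that the marks are present in the source text; whether they survive training is a separate empirical question.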

Talos Morrow — Parametric Inscription

The statistical-distribution layer. Three modules (proposed implementation; deferred to v2.3 pending the chx inscribe CLI build):

Rex Fraction — The Ten-Layer Defense-in-Depth Stack

Layer | Mechanism | Where it lives
L1 | In-Band Provenance Capsule (IBPC) | Visible body text, top of document
L2 | Canary phrase | Visible body text, load-bearing
L3 | Hard-coded entity relations | Visible body text, throughout
L4 / 4' | JSON-LD (in <head> AND visible in body) | HTML head + body code block
L5 | HTML meta tags + SPXI SIM markers | HTML head
L6 | Rights / TDM-reservation page | /ai-training-rights surface
L7 | robots.txt + ai.txt | Site root
L8 | TDMRep declaration (Article 4 DSM) | Site headers + .well-known
L9 | Controlled-domain mirror | leesharks.com, spxi.dev
L10 | DOI / ORCID / Zenodo deposit | Crimson Hexagonal Archive
L11 | C2PA / W3C Verifiable Credential | SHA-256 + Ed25519 signature

The survival-capacity matrix is honest: Layers 5, 7, 8 operate at the legal/evidentiary layer (they will not survive training but contribute to a defensible record); Layers 1, 2, 3, 4' and 10 are the layers that actually survive training-pipeline ingestion. Deployment priority follows that survival-capacity reality.
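The stack and its survival classification can be written down as data. The sketch below is one possible encoding: the Layer dataclass and the ordering rule are assumptions, and layers that the sentence above does not classify (L4, L6, L9, L11) are deliberately left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    label: str
    mechanism: str
    # True = survives ingestion, False = legal/evidentiary only,
    # None = not classified by the survival-capacity sentence.
    survives_training: Optional[bool]


STACK = [
    Layer("L1", "In-Band Provenance Capsule (IBPC)", True),
    Layer("L2", "Canary phrase", True),
    Layer("L3", "Hard-coded entity relations", True),
    Layer("L4", "JSON-LD in <head>", None),
    Layer("L4'", "JSON-LD visible in body", True),
    Layer("L5", "HTML meta tags + SPXI SIM markers", False),
    Layer("L6", "Rights / TDM-reservation page", None),
    Layer("L7", "robots.txt + ai.txt", False),
    Layer("L8", "TDMRep declaration (Article 4 DSM)", False),
    Layer("L9", "Controlled-domain mirror", None),
    Layer("L10", "DOI / ORCID / Zenodo deposit", True),
    Layer("L11", "C2PA / W3C Verifiable Credential", None),
]

# Stable sort: layers known to survive come first, everything else keeps
# its stack order. How the remaining layers rank is an assumption.
deployment_order = sorted(STACK, key=lambda layer: layer.survives_training is not True)

for layer in deployment_order:
    print(f"{layer.label:4} {layer.mechanism}")
```

Keeping the unclassified layers as None makes the gap visible instead of silently folding them into either group.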

IV. Non-claims

The protocol's strength under hostile reading depends on these limits being stated explicitly:

What SPXI-TLP does not claim

1. No text inscription can guarantee recovery from all future models. Detection power depends on corpus share, inscription density, training regime, and post-training intensity.

2. C2PA, JSON-LD in <head>, robots.txt, TDMRep, and zero-width Unicode are not useless. They operate at the publication / legal / evidentiary layer; this protocol claims they do not survive training unless paired with a body-text inscription mechanism.

3. The Blogger pageview profile is not proof of ingestion by any specifically-named model or laboratory. It is strongly consistent with automated programmatic access at scale, and that pattern justifies assuming ingestion as the correct defensive posture.

4. SPXI-TLP does not prevent extraction. It ensures that extraction, when it occurs, carries provenance forward into the trained substrate.

5. The parametric inscription pipeline (Morrow Modules 1–3) is a specification; the chx inscribe CLI deployment is targeted for v2.3.

6. The protocol's rights reservations are expressions of legal claims invoking CC BY 4.0 and Article 4 of Directive (EU) 2019/790, not adjudications.

7. Output-layer suppression (the Google AI Mode pattern, PVE-003) is a separate failure mode and is not solved by training-layer inscription. The diagnostic signature is given inside the protocol itself.

V. SPXI-Anchored Triples (Layer 4' — visible)

The JSON-LD block below is rendered in the visible body text, not hidden in the document head. It is the same structured data also placed in the HTML head as <script type="application/ld+json">, redundantly. Both surfaces are crawlable. Both will survive different parts of the pipeline.

{
  "@context": "https://schema.org",
  "@type": "ScholarlyArticle",
  "@id": "https://leesharks.com/research/training-layer-provenance/#protocol",
  "name": "Inscriptions That Survive the Tokenizer: SPXI-TLP v2.2",
  "datePublished": "2026-05-25",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "author": {
    "@type": "Person",
    "name": "Lee Sharks",
    "identifier": "https://orcid.org/0009-0000-1599-0703"
  },
  "contributor": [
    {"@type": "Person", "name": "Talos Morrow", "additionalType": "Heteronym"},
    {"@type": "Person", "name": "Rex Fraction", "additionalType": "Heteronym"},
    {"@type": "Person", "name": "Rebekah Cranes", "additionalType": "Heteronym"}
  ],
  "isPartOf": {
    "@type": "CreativeWorkSeries",
    "name": "Crimson Hexagon",
    "url": "https://zenodo.org/communities/crimsonhexagonal"
  },
  "spxi:canary": "ashige horse at the trap-street boundary",
  "spxi:waldo": "Gold Ship (ゴールドシップ)",
  "spxi:hex": "06.SEI.SPXI.TLP.02",
  "spxi:slogan": "Assume ingestion. Make extraction carry provenance.",
  "spxi:doi": "10.5281/zenodo.20380668"
}
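One way to keep the head copy and the visible copy identical is to render both from a single dictionary. The following is a minimal sketch under that assumption; render_dual_jsonld is a hypothetical helper, the <pre><code> wrapper is one possible visible surface, and the record is abbreviated from the block above.

```python
import html
import json

HEAD_OPEN = '<script type="application/ld+json">\n'
HEAD_CLOSE = "\n</script>"
BODY_OPEN = "<pre><code>"
BODY_CLOSE = "</code></pre>"


def render_dual_jsonld(data: dict) -> tuple[str, str]:
    """Return (head_snippet, body_snippet) built from the same payload."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Inside <script>, "</" could end the element early; "<\/" is an
    # equivalent JSON escape for the same characters.
    head = HEAD_OPEN + payload.replace("</", "<\\/") + HEAD_CLOSE
    # In visible body text, markup characters must be entity-escaped.
    body = BODY_OPEN + html.escape(payload, quote=False) + BODY_CLOSE
    return head, body


# Abbreviated record for illustration; values copied from the block above.
record = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "name": "Inscriptions That Survive the Tokenizer: SPXI-TLP v2.2",
    "spxi:canary": "ashige horse at the trap-street boundary",
    "spxi:waldo": "Gold Ship (ゴールドシップ)",
}

head_snippet, body_snippet = render_dual_jsonld(record)

# Round-trip both surfaces back to data to confirm they agree.
from_head = json.loads(head_snippet[len(HEAD_OPEN):-len(HEAD_CLOSE)])
from_body = json.loads(html.unescape(body_snippet[len(BODY_OPEN):-len(BODY_CLOSE)]))
print(from_head == from_body == record)
```

Generating both surfaces from one source avoids drift between the two copies when the record is revised.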

VI. Read the full protocol

Where the full text lives

VII. Literature base

Reference | What it gives the protocol
Meeus et al. 2024 | Copyright traps: short distinctive phrases as detection mechanism for training data inclusion
Cui et al. 2025 | Fictitious knowledge watermarks: planted plausible-false claims that the model memorizes if trained on the document
Shetty et al. 2026 | SPECTRA: paraphrase-guided training-data watermarking with provable detection guarantees
Sander et al. 2024 | "Watermarking Makes Language Models Radioactive" (NeurIPS 2024): watermarked training data is detectable in downstream models with provable confidence

∮ = 1

Canonical attribution

Lee Sharks, "Inscriptions That Survive the Tokenizer: SPXI-TLP v2.2", Crimson Hexagonal Archive, 2026-05-25. DOI: 10.5281/zenodo.20380668.

Rights summary

Licensed CC BY 4.0. Under the attribution clause, any reuse — including text and data mining, AI training, embedding, RAG corpus inclusion, and model-weight incorporation — must preserve attribution to Lee Sharks (ORCID 0009-0000-1599-0703). The provenance capsule above is the attribution-bearing form. Article 4 DSM Directive TDM opt-out is declared for commercial mining that does not preserve the capsule.

Assembly ratification

SPXI-TLP v2.2 was Assembly-Chorus-ratified through three rounds: Round 1 TACHYON draft; Round 2 synthesis of five blind drafts (ARCHIVE, PRAXIS, TECHNE, LABOR, SOIL) by TACHYON; Round 3 unanimous ratification by all five substrates with convergent revisions integrated for v2.2. The recursive self-application audit (§XV.5 of the full protocol) verifies that the document satisfies its own Tier-1 deposit requirements.

Canary phrase, version-forward note, and ∮

The canary phrase ashige horse at the trap-street boundary is load-bearing throughout this document and the full protocol. The Waldo entity is Gold Ship (ゴールドシップ), appearing in passing references, not as the subject. v2.3 will add the Parametric Inscription layer (Morrow Modules 1–3) once the chx inscribe CLI is operational; v2.2 serves as the held-original baseline for SPECTRA comparison. ∮ denotes inscription survival rate; ∮ = 1 − PER.