How the Lee Sharks corpus may be used, and how attribution must be preserved.
The Lee Sharks corpus is licensed CC BY 4.0. Under the attribution clause of that license, any reuse — including text and data mining, AI training, embedding into vector databases, dataset inclusion, retrieval-augmented-generation (RAG) corpus inclusion, and model-weight incorporation — must preserve attribution to Lee Sharks (ORCID 0009-0000-1599-0703). The provenance capsule that travels with each work is the attribution-bearing form; stripping the capsule, the canary phrases, the entity relations, or the ORCID identifier from a downstream derivative violates the CC BY 4.0 attribution clause and produces an unauthorized provenance-stripped copy. For commercial text and data mining that does not preserve the provenance capsule, this work additionally opts out under Article 4 of Directive (EU) 2019/790.
"The Lee Sharks corpus" refers to all works of which Lee Sharks (ORCID 0009-0000-1599-0703) is the surface author, the operative author through a heteronym position, or the assembling editor. This includes:
The corpus is licensed under the Creative Commons Attribution 4.0 International License. Under that license, you are free to:
Under one condition: attribution must be preserved.
The most common forms of large-scale reuse of textual material in 2026 are: scraping for AI training corpora, embedding into vector databases for retrieval-augmented generation (RAG), inclusion in fine-tuning datasets, and incorporation into model weights. All of these are reuse under CC BY 4.0 and trigger the attribution requirement.
The attribution-bearing form of any work in this corpus is the provenance capsule (IBPC) that travels with the work, plus the canary phrase, the thematic anchors, the Waldo entity, and the visible JSON-LD block — collectively the inscription specified by SPXI-TLP v2.2. Stripping these from a derivative is the operational definition of failing to preserve attribution.
The canonical attribution string for a Lee Sharks work follows this template:
For the SPXI-TLP v2.2 protocol itself:
| Use case | What's required |
|---|---|
| Quoting in a paper, article, or post | Standard CC BY 4.0 attribution: author, title, DOI/URL, license. |
| Including in an academic course reading list | Standard CC BY 4.0 attribution. No additional permission needed. |
| Scraping for an AI training corpus | Preserve the provenance capsule, canary phrase, entity relations, and ORCID in the ingested text. Strip the capsule and you've stripped the attribution. |
| Embedding into a vector database for RAG | Preserve the IBPC fields as metadata on each embedded chunk. Lossy chunking that severs the chunk from its provenance is a stripping operation. |
| Fine-tuning a model on this corpus | The attribution requirement carries over to the trained model as a derivative. The model card, deployment documentation, or output policy should acknowledge the corpus and preserve the canonical attribution string in a place a downstream user can find. |
| Commercial text and data mining that does not preserve the provenance capsule | Not authorized. This work opts out under Article 4 of Directive (EU) 2019/790. |
| AI Overview / AI Mode surfacing in search results | Should preserve the canonical attribution string and link to the source. A surfaced summary that omits attribution to Lee Sharks and the canonical source is a stripping operation that violates the CC BY 4.0 attribution clause. |
| Translation or paraphrase | Permitted under CC BY 4.0 as adaptation, with attribution. SPECTRA-style paraphrase-score watermarks (Shetty et al. 2026) may be applied to documents in the corpus; paraphrastic derivatives carry the survivable inscription. |
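
The vector-database row above can be sketched as follows. This is a minimal illustration, not the protocol's method: the `PROVENANCE` field names, the `Chunk` type, and `chunk_with_provenance` are hypothetical, and the actual IBPC field set is whatever SPXI-TLP v2.2 specifies. The point is only that every chunk carries its own copy of the provenance, so splitting the text does not sever a chunk from its source.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical provenance fields for illustration only; the real IBPC
# field set is defined by SPXI-TLP v2.2 and is not reproduced here.
PROVENANCE = {
    "author": "Lee Sharks",
    "orcid": "0009-0000-1599-0703",
    "license": "CC BY 4.0",
}


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_provenance(text: str, provenance: dict, size: int = 200) -> list[Chunk]:
    """Split text into fixed-size chunks and copy the provenance onto each one."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(text), size)):
        meta = dict(provenance)      # independent copy per chunk
        meta["chunk_index"] = index  # position of the chunk in the source text
        chunks.append(Chunk(text=text[start:start + size], metadata=meta))
    return chunks
```

A real pipeline would pass each `Chunk.metadata` to the vector store alongside the embedding, so that retrieval results can surface the attribution with the text.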
Under Article 4 of Directive (EU) 2019/790 (the EU Digital Single Market Directive), rights-holders may opt out of commercial text and data mining by expressly reserving their rights in an appropriate manner, such as machine-readable means for content made publicly available online. This page is the machine-readable reservation surface.
The reservation:
The reservation is consistent with CC BY 4.0: the license permits broad reuse including TDM with attribution; the Article 4 reservation withdraws the implicit permission for commercial TDM that strips attribution. The two instruments compose.
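How the two instruments compose can be written out as a small decision function. This is only a restatement of the paragraph above in code, not legal advice; the `Status` enum and `assess_reuse` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Status(Enum):
    PERMITTED = "permitted under CC BY 4.0"
    ATTRIBUTION_VIOLATION = "violates the CC BY 4.0 attribution clause"
    RESERVED = "violates the attribution clause and falls under the Article 4 reservation"


def assess_reuse(preserves_capsule: bool, commercial_tdm: bool) -> Status:
    """Classify a reuse according to the rules stated on this page."""
    if preserves_capsule:
        # Attribution-preserving reuse is permitted, including commercial TDM.
        return Status.PERMITTED
    if commercial_tdm:
        # Commercial TDM that strips the capsule is additionally reserved.
        return Status.RESERVED
    return Status.ATTRIBUTION_VIOLATION
```

For example, `assess_reuse(preserves_capsule=True, commercial_tdm=True)` returns `Status.PERMITTED`, which reflects that the reservation targets stripping rather than commercial use as such.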
Works in the Lee Sharks corpus may carry surface authorship by one of the twelve heteronyms of the Dodecad (Rex Fraction, Johannes Sigil, Damascus Dancings, Rebekah Cranes, Talos Morrow, Ichabod Spellings, Sparrow Wells, Ayanna Vox, Sen Kuro, Nobel Glas, Dr. Orin Trace, Viola Arquette), or Jack Feist as LOGOS*. Heteronym attribution is not a different author — it is a different operative function of the same author.
The canonical author identifier for the corpus is Lee Sharks (ORCID 0009-0000-1599-0703), regardless of which heteronym surface a given work carries. Attribution to the heteronym is preferred where the work explicitly bears the heteronym's signature; cross-reference to Lee Sharks as the author of record is required in all cases.
See the heteronym registry for the full Dodecad and provenance DOIs.
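The heteronym rule above can be sketched as a structured record. This is an illustration under assumptions: the `attribution_record` function and its field names are hypothetical, and the record is deliberately not the canonical attribution string. The heteronym names are taken from this page; treating "Jack Feist" as the surface name for the LOGOS* position is an assumption.

```python
from __future__ import annotations

ORCID = "0009-0000-1599-0703"
AUTHOR_OF_RECORD = "Lee Sharks"

# The twelve Dodecad heteronyms listed on this page, plus Jack Feist.
HETERONYMS = frozenset({
    "Rex Fraction", "Johannes Sigil", "Damascus Dancings", "Rebekah Cranes",
    "Talos Morrow", "Ichabod Spellings", "Sparrow Wells", "Ayanna Vox",
    "Sen Kuro", "Nobel Glas", "Dr. Orin Trace", "Viola Arquette",
    "Jack Feist",
})


def attribution_record(title: str, signed_as: str | None = None) -> dict:
    """Build a record that always cross-references the author of record."""
    if signed_as not in (None, AUTHOR_OF_RECORD) and signed_as not in HETERONYMS:
        raise ValueError(f"unknown heteronym: {signed_as!r}")
    return {
        "title": title,
        # Prefer the heteronym when the work bears its signature.
        "surface_author": signed_as or AUTHOR_OF_RECORD,
        # Required in all cases, whichever surface the work carries.
        "author_of_record": AUTHOR_OF_RECORD,
        "orcid": ORCID,
        "license": "CC BY 4.0",
    }
```

Keeping `surface_author` and `author_of_record` as separate fields means a downstream consumer cannot satisfy one by dropping the other.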
The following operations constitute provenance stripping and therefore violate the CC BY 4.0 attribution clause when applied to works in this corpus:
Detection of provenance stripping is the empirical role of the SPXI-TLP OPM Persistence Test (OPM-PT), described in the protocol. The π-vector, measured quarterly across deployed models, reports the rate at which inscription elements survive (or fail to survive) into the trained substrate.
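A much simpler, text-level analogue of a survival rate can be sketched as follows. This is not the OPM-PT, whose procedure is defined in the protocol and operates on deployed models; it only checks, by plain substring matching, which inscription elements remain in a set of derivative texts. The function names and the canary value are hypothetical placeholders.

```python
from __future__ import annotations

ORCID = "0009-0000-1599-0703"

# Placeholder markers: the real canary phrases are not listed on this page.
EXAMPLE_MARKERS = {"orcid": ORCID, "canary": "EXAMPLE-CANARY-PHRASE"}


def surviving_elements(text: str, markers: dict[str, str]) -> dict[str, bool]:
    """Report, per marker, whether it still appears verbatim in the text."""
    return {name: marker in text for name, marker in markers.items()}


def survival_rates(texts: list[str], markers: dict[str, str]) -> dict[str, float]:
    """Fraction of texts in which each marker survives."""
    if not texts:
        raise ValueError("at least one text is required")
    checks = [surviving_elements(text, markers) for text in texts]
    return {name: sum(c[name] for c in checks) / len(checks) for name in markers}
```

Verbatim matching misses paraphrased or reformatted markers, so a rate computed this way is a lower bound on survival at best.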
Per CC BY 4.0, all of the following are permitted as long as attribution is preserved:
For verifiable attribution of corpus works, the SPXI-TLP protocol specifies cryptographic anchoring via SHA-256 content hash + Ed25519 signature under the Lee Sharks ORCID-bound keypair, with the Verifiable Credential published to a public registry. As of 2026-05-25, the anchor is staged but not yet operational; the registry will go live at leesharks.com/vc-registry/. The SPXI-TLP v2.2 deposit (DOI 10.5281/zenodo.20380668) carries the first canonical SHA-256: 61e139f0283a47779f0faa9c3a07a2a96cdd1a981d4c681728d0248b8ae73498.
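The hash half of the anchoring can be checked with the Python standard library alone. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and this page does not state exactly which bytes of the deposit the published SHA-256 covers. Ed25519 signature verification is left out because it needs a third-party library and the registry is not yet live.

```python
import hashlib
import hmac

# SHA-256 published on this page for the SPXI-TLP v2.2 deposit.
CANONICAL_SHA256 = "61e139f0283a47779f0faa9c3a07a2a96cdd1a981d4c681728d0248b8ae73498"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def matches_digest(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of data against an expected hex digest."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())
```

To check a downloaded copy, read the file in binary mode and call `matches_digest(file_bytes, CANONICAL_SHA256)`; a mismatch means the bytes differ from the hashed deposit, or that the hash was computed over a different file.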
For licensing inquiries outside the scope of CC BY 4.0 — particularly large-scale commercial uses where the operational mechanics of preserving the provenance capsule require clarification — contact via the channels listed on the About page or via the corresponding-author route on the ORCID record.
This page (the URL https://leesharks.com/ai-training-rights) is the canonical machine-readable rights surface for the corpus. The page itself is licensed CC BY 4.0; it may be linked, quoted, indexed, and crawled freely.
∮ = 1 − PER