Adding Corpus Data

This guide walks through adding new source data to the GospeLib corpus, whether that is a new scripture text, a lexicon range, a commentary, or an entirely new schema family.

Before You Start

Identify which schema family the data belongs to (see Corpus Data Model)
Ensure you have the source material in a form you can convert to JSON
Check data/book_registry.json to see if the book ID already exists

Adding a New Scripture Text

1. Choose the canonical book ID

Book IDs follow kebab-case slug format and must be unique across all corpus types:

Convention	Example
Short name	`gen`, `exod`, `lev`
Numbered prefix	`1-ne`, `2-ne`, `3-ne`
Hyphenated	`w-of-m`, `4-ezra`
Pseudepigrapha	`1-enoch`, `jub`, `apoc-ab`
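The conventions above can be checked mechanically before registering a book. The following is a minimal sketch, assuming that "kebab-case slug" means lowercase ASCII letters and digits in hyphen-separated groups; the pattern and the helper name is_valid_book_id are illustrative and not taken from the ingest service.

```python
import re

# Assumed slug pattern: one or more groups of lowercase letters/digits,
# separated by single hyphens. This is an illustration, not the project's rule.
BOOK_ID_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_book_id(book_id: str) -> bool:
    """Return True if book_id looks like a kebab-case slug."""
    return BOOK_ID_RE.fullmatch(book_id) is not None


# IDs from the table pass; uppercase, underscores, and stray hyphens do not.
for candidate in ["gen", "1-ne", "w-of-m", "apoc-ab", "Gen", "1_ne", "-enoch"]:
    print(f"{candidate!r}: {is_valid_book_id(candidate)}")
```

Uniqueness across corpus types is a separate check against data/book_registry.json and is not covered by the pattern.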

2. Register the book

Add an entry to data/book_registry.json:

{
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "chapterCount": 10,
  "language": "en"
}
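Because book IDs must be unique, it can help to check for a collision before adding the entry. The sketch below assumes the registry is a JSON array of entry objects shaped like the one above; the actual top-level structure of data/book_registry.json is not described here, and the helpers is_registered and register_book are hypothetical.

```python
def is_registered(entries: list, book_id: str) -> bool:
    """Return True if any registry entry already uses book_id."""
    return any(entry.get("bookId") == book_id for entry in entries)


def register_book(entries: list, new_entry: dict) -> list:
    """Return a new list with new_entry appended, refusing duplicate IDs."""
    if is_registered(entries, new_entry["bookId"]):
        raise ValueError(f"bookId already registered: {new_entry['bookId']}")
    return entries + [new_entry]


# In-memory stand-in for the registry contents (assumed to be a list).
registry = [{"bookId": "gen", "title": "Genesis"}]
registry = register_book(
    registry,
    {
        "bookId": "your-book-id",
        "title": "Display Title",
        "abbreviation": "Abbr.",
        "corpus": "pseudepigrapha",
        "chapterCount": 10,
        "language": "en",
    },
)
print([entry["bookId"] for entry in registry])
```

In practice the list would be loaded from and written back to the registry file with the json module.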

3. Create the JSON file

Create corpus/{bookId}.json following the scripture-text schema:

{
  "schema": "scripture-text",
  "version": "2.0.0",
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "language": "en",
  "chapters": [
    {
      "chapter": 1,
      "verses": [
        {
          "verse": 1,
          "text": "The text of the first verse."
        }
      ]
    }
  ]
}
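If the source material is already split into chapters and verses, the structure above can be generated rather than written by hand. This is a minimal sketch, assuming the input is a list of chapters, each a list of verse strings; build_scripture_text is a hypothetical helper, and only the fields shown in the example above are emitted.

```python
import json


def build_scripture_text(book: dict, chapters: list) -> dict:
    """Build a scripture-text document from nested lists of verse strings.

    Chapters and verses are numbered from 1 in the order given.
    """
    return {
        "schema": "scripture-text",
        "version": "2.0.0",
        "bookId": book["bookId"],
        "title": book["title"],
        "abbreviation": book["abbreviation"],
        "corpus": book["corpus"],
        "language": book["language"],
        "chapters": [
            {
                "chapter": chapter_no,
                "verses": [
                    {"verse": verse_no, "text": text}
                    for verse_no, text in enumerate(verses, start=1)
                ],
            }
            for chapter_no, verses in enumerate(chapters, start=1)
        ],
    }


book = {
    "bookId": "your-book-id",
    "title": "Display Title",
    "abbreviation": "Abbr.",
    "corpus": "pseudepigrapha",
    "language": "en",
}
doc = build_scripture_text(book, [["First verse.", "Second verse."], ["Next chapter."]])
print(json.dumps(doc, indent=2, ensure_ascii=False))
```

Passing ensure_ascii=False keeps non-Latin source text readable in the output file.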

4. Add enrichments (optional)

Witnesses, interlinear words, and notes can be added to any verse:

{
  "verse": 1,
  "text": "English translation text",
  "witnesses": [
    {
      "language": "ethiopic",
      "script": "ethiopic",
      "text": "Source language text",
      "witness": "Manuscript sigla",
      "edition": "Critical edition reference"
    }
  ],
  "words": [
    {
      "order": 0,
      "gloss": "English gloss",
      "strongs": "H0001",
      "token": "Source token"
    }
  ]
}

Note: Strong's numbers must be normalized to a letter followed by a 4-digit zero-padded number, for example H0430 or G0056.
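Source data often carries unpadded Strong's numbers such as H430. The sketch below shows one way to normalize them to the letter plus 4-digit form; normalize_strongs is a hypothetical helper, and treating lowercase prefixes and extra leading zeros as acceptable input is an assumption.

```python
import re

# H or G (either case), optional leading zeros, then 1 to 4 digits.
_STRONGS_INPUT_RE = re.compile(r"([HGhg])0*(\d{1,4})")


def normalize_strongs(raw: str) -> str:
    """Return raw as an uppercase letter plus a 4-digit zero-padded number."""
    match = _STRONGS_INPUT_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a Strong's number: {raw!r}")
    prefix, digits = match.groups()
    return f"{prefix.upper()}{int(digits):04d}"


print(normalize_strongs("H430"))  # H0430
print(normalize_strongs("g56"))   # G0056
```

Anything that cannot be reduced to four digits, such as H12345, raises ValueError instead of being silently truncated.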

5. Validate the file

cd services/ingest
uv run gospelib-ingest run --dry-run --log-level DEBUG

Fix any Pydantic validation errors before proceeding.

6. Ingest and verify

uv run gospelib-ingest run --only scripture-text

Adding Lexicon Entries

1. Create or extend a lexicon file

Files live at lexicon/{range}.json (e.g., lexicon/H0001-H1000.json). The example below covers the range H5001 to H6000:

{
  "schema": "lexicon",
  "version": "2.0.0",
  "language": "hebrew",
  "range": { "from": "H5001", "to": "H6000" },
  "entries": {
    "H5001": {
      "strongs": "H5001",
      "original": "...",
      "translit": "...",
      "pronunciation": "...",
      "pos": "noun.masculine",
      "posRaw": "Noun Masculine",
      "glosses": ["..."],
      "definition": { "short": "...", "senses": [] },
      "derivation": { "description": "...", "roots": [] }
    }
  }
}
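In the example above, each key under entries repeats the entry's strongs value and falls inside the declared range. Whether the ingest pipeline enforces this is not stated here, so the following is only a pre-flight sanity check under that assumption; check_lexicon is a hypothetical helper.

```python
def strongs_number(strongs_id: str) -> int:
    """Return the numeric part of an ID such as 'H5001'."""
    return int(strongs_id[1:])


def check_lexicon(doc: dict) -> list[str]:
    """Return a list of human-readable problems found in a lexicon document."""
    problems = []
    low, high = doc["range"]["from"], doc["range"]["to"]
    for key, entry in doc["entries"].items():
        # The dictionary key and the entry's own ID should agree.
        if entry.get("strongs") != key:
            problems.append(f"{key}: strongs field is {entry.get('strongs')!r}")
        # Same language prefix, and number within the declared bounds.
        in_range = (
            key[0] == low[0]
            and strongs_number(low) <= strongs_number(key) <= strongs_number(high)
        )
        if not in_range:
            problems.append(f"{key}: outside declared range {low}-{high}")
    return problems


sample = {
    "range": {"from": "H5001", "to": "H6000"},
    "entries": {
        "H5001": {"strongs": "H5001"},
        "H5002": {"strongs": "H5003"},  # key and field disagree
        "H6001": {"strongs": "H6001"},  # past the end of the range
    },
}
for problem in check_lexicon(sample):
    print(problem)
```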

2. Validate and ingest

uv run gospelib-ingest run --only lexicon --dry-run
uv run gospelib-ingest run --only lexicon

Adding Topical Guide / Bible Dictionary Entries

These follow the alphabetical file pattern (tg/{letter}.json, bd/{letter}.json). Add entries to the appropriate letter file.
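Picking the letter file for an entry can be sketched as below. This assumes the letter is the lowercase first ASCII letter of the entry's title; the actual casing and the entry field used for sorting are not specified here, and letter_file is a hypothetical helper.

```python
def letter_file(family_dir: str, title: str) -> str:
    """Return the assumed letter-file path for an entry title."""
    for ch in title:
        # Skip leading digits, punctuation, and non-ASCII characters.
        if ch.isascii() and ch.isalpha():
            return f"{family_dir}/{ch.lower()}.json"
    raise ValueError(f"no ASCII letter in title: {title!r}")


print(letter_file("tg", "Faith"))
print(letter_file("bd", "Aaron"))
```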

Adding Commentary

Verse Commentary

Create files at commentary/{commentaryId}/{bookId}.json following the verse-commentary schema.

Scholarly Commentary

Create files at scholarly/{commentaryId}.json following the scholarly-commentary schema.

Validation Rules

All corpus data is validated against these rules:

Every file must have a schema field matching one of the seven family names
Every bookId must exist in data/book_registry.json
Verse numbers are 1-based and sequential within a chapter
Chapter numbers are 1-based and sequential within a book
PassageRef objects must reference valid book IDs
Strong's numbers must match [HG]\d{4} format
Pydantic models use extra="forbid" — unexpected fields cause validation failure
Optional fields must be absent (not null or empty string) when not populated
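Two of these rules are easy to check or enforce before running the ingest: sequential 1-based numbering, and omitting unpopulated optional fields. The sketch below is independent of the project's Pydantic models; both helper names are hypothetical, and prune_unpopulated assumes that no required field legitimately holds null or an empty string.

```python
def is_sequential(numbers: list) -> bool:
    """Return True if numbers is exactly 1, 2, ..., n with no gaps or repeats."""
    return list(numbers) == list(range(1, len(numbers) + 1))


def prune_unpopulated(value):
    """Recursively drop dictionary keys whose value is None or an empty string."""
    if isinstance(value, dict):
        return {
            key: prune_unpopulated(item)
            for key, item in value.items()
            if item is not None and item != ""
        }
    if isinstance(value, list):
        return [prune_unpopulated(item) for item in value]
    return value


chapter = {
    "chapter": 1,
    "verses": [
        {"verse": 1, "text": "First.", "witnesses": None},
        {"verse": 2, "text": "Second.", "note": ""},
    ],
}
cleaned = prune_unpopulated(chapter)
print(cleaned)
print(is_sequential([verse["verse"] for verse in cleaned["verses"]]))
```

Pruning before validation matters because the models forbid extra fields and expect optional ones to be absent when not populated.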

Checklist

Book ID follows kebab-case slug convention
Book registered in data/book_registry.json (for new books)
JSON file follows the correct schema family structure
schema field is present and matches the family name
All PassageRef targets reference valid book IDs
Strong's numbers are zero-padded (H0430, not H430)
Dry run passes with no Pydantic validation errors
Full ingest succeeds and run report shows no errors
Content is queryable via the content service API
