Adding Corpus Data
This guide walks through adding new source data to the GospeLib corpus, whether it's a new scripture text, lexicon range, commentary, or an entirely new schema family.
Before You Start
- Identify which schema family the data belongs to (see Corpus Data Model)
- Ensure you have the source material in a form you can convert to JSON
- Check data/book_registry.json to see if the book ID already exists
Adding a New Scripture Text
1. Choose the canonical book ID
Book IDs follow kebab-case slug format and must be unique across all corpus types:
| Convention | Example |
|---|---|
| Short name | gen, exod, lev |
| Numbered prefix | 1-ne, 2-ne, 3-ne |
| Hyphenated | w-of-m, 4-ezra |
| Pseudepigrapha | 1-enoch, jub, apoc-ab |
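The conventions in the table can be checked before registering a book. The sketch below is only an illustration: the regular expression is an assumption inferred from the examples above (lowercase letters and digits in hyphen-separated segments), and `is_valid_book_id` is a hypothetical helper, not part of the ingest service.

```python
import re

# Assumed pattern, inferred from the examples in the table:
# lowercase alphanumeric segments joined by single hyphens.
BOOK_ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_book_id(book_id: str) -> bool:
    """Return True if book_id looks like a kebab-case slug."""
    return BOOK_ID_PATTERN.fullmatch(book_id) is not None


if __name__ == "__main__":
    for candidate in ["gen", "1-ne", "w-of-m", "apoc-ab", "Gen", "1_ne"]:
        print(candidate, is_valid_book_id(candidate))
```

This only checks the shape of the ID. Uniqueness across corpus types still has to be checked against the registry.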
2. Register the book
Add an entry to data/book_registry.json:
{
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "chapterCount": 10,
  "language": "en"
}
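To confirm that a book ID is not already taken, a quick lookup along these lines may help. It assumes the registry file is a JSON array of entry objects shaped like the one above; the actual top-level layout of data/book_registry.json is not shown here, so adjust the loader if it differs. `load_registry` and `find_book` are hypothetical names.

```python
import json
from pathlib import Path


def load_registry(path: str) -> list:
    """Load the registry. ASSUMPTION: the file is a JSON array of entries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_book(entries: list, book_id: str):
    """Return the registry entry whose bookId matches, or None."""
    for entry in entries:
        if entry.get("bookId") == book_id:
            return entry
    return None


if __name__ == "__main__":
    registry = load_registry("data/book_registry.json")
    existing = find_book(registry, "your-book-id")
    print("already registered" if existing else "free to use")
```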
3. Create the JSON file
Create corpus/{bookId}.json following the scripture-text schema:
{
  "schema": "scripture-text",
  "version": "2.0.0",
  "bookId": "your-book-id",
  "title": "Display Title",
  "abbreviation": "Abbr.",
  "corpus": "pseudepigrapha",
  "language": "en",
  "chapters": [
    {
      "chapter": 1,
      "verses": [
        {
          "verse": 1,
          "text": "The text of the first verse."
        }
      ]
    }
  ]
}
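If the source material is available as flat (chapter, verse, text) rows, the structure above could be assembled with a small script. This is a minimal sketch: `build_scripture_text` is a hypothetical helper, and it assumes the rows arrive in reading order.

```python
import json


def build_scripture_text(book_id, title, abbreviation, corpus, language, rows):
    """Group (chapter, verse, text) rows into the scripture-text layout.

    ASSUMPTION: rows are already sorted in reading order.
    """
    chapters = []
    for chapter_no, verse_no, text in rows:
        # Start a new chapter object whenever the chapter number changes.
        if not chapters or chapters[-1]["chapter"] != chapter_no:
            chapters.append({"chapter": chapter_no, "verses": []})
        chapters[-1]["verses"].append({"verse": verse_no, "text": text})
    return {
        "schema": "scripture-text",
        "version": "2.0.0",
        "bookId": book_id,
        "title": title,
        "abbreviation": abbreviation,
        "corpus": corpus,
        "language": language,
        "chapters": chapters,
    }


if __name__ == "__main__":
    rows = [(1, 1, "The text of the first verse.")]
    doc = build_scripture_text(
        "your-book-id", "Display Title", "Abbr.", "pseudepigrapha", "en", rows
    )
    # ensure_ascii=False keeps non-Latin source text readable in the file.
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```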
4. Add enrichments (optional)
Witnesses, interlinear words, and notes can be added to any verse:
{
  "verse": 1,
  "text": "English translation text",
  "witnesses": [
    {
      "language": "ethiopic",
      "script": "ethiopic",
      "text": "Source language text",
      "witness": "Manuscript sigla",
      "edition": "Critical edition reference"
    }
  ],
  "words": [
    {
      "order": 0,
      "gloss": "English gloss",
      "strongs": "H0001",
      "token": "Source token"
    }
  ]
}
Strong's numbers must be normalized to letter + 4-digit zero-padded format: H0430, G0056.
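Source data often carries unpadded numbers such as H430. One way to normalize them before writing the JSON is sketched below; `normalize_strongs` is a hypothetical helper, and accepting lowercase prefixes and extra leading zeros is an assumption about what the raw input may look like.

```python
import re

# Prefix letter, any leading zeros, then one to four significant digits.
_STRONGS_RE = re.compile(r"([HGhg])0*(\d{1,4})")


def normalize_strongs(raw: str) -> str:
    """Return the letter + 4-digit zero-padded form, e.g. 'H430' -> 'H0430'."""
    match = _STRONGS_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a recognizable Strong's number: {raw!r}")
    prefix, digits = match.groups()
    return f"{prefix.upper()}{int(digits):04d}"


if __name__ == "__main__":
    print(normalize_strongs("H430"))
    print(normalize_strongs("g56"))
```

Inputs with more than four significant digits raise ValueError rather than being truncated.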
5. Validate the file
cd services/ingest
uv run gospelib-ingest run --dry-run --log-level DEBUG
Fix any Pydantic validation errors before proceeding.
6. Ingest and verify
uv run gospelib-ingest run --only scripture-text
Adding Lexicon Entries
1. Create or extend a lexicon file
Files live at lexicon/{range}.json (e.g., lexicon/H0001-H1000.json):
{
  "schema": "lexicon",
  "version": "2.0.0",
  "language": "hebrew",
  "range": { "from": "H5001", "to": "H6000" },
  "entries": {
    "H5001": {
      "strongs": "H5001",
      "original": "...",
      "translit": "...",
      "pronunciation": "...",
      "pos": "noun.masculine",
      "posRaw": "Noun Masculine",
      "glosses": ["..."],
      "definition": { "short": "...", "senses": [] },
      "derivation": { "description": "...", "roots": [] }
    }
  }
}
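Before running the dry run, a lexicon file can be sanity-checked locally. The sketch below assumes two things that the example suggests but this guide does not state as rules: each key under entries equals that entry's strongs field, and every key falls inside the declared range (inclusive). `check_lexicon_entries` is a hypothetical helper and expects keys that are already in normalized form.

```python
def check_lexicon_entries(lexicon: dict) -> list:
    """Return a list of human-readable problems found in a lexicon dict.

    ASSUMPTIONS: keys mirror the 'strongs' field, and the range is inclusive.
    """
    low = lexicon["range"]["from"]
    high = lexicon["range"]["to"]
    problems = []
    for key, entry in lexicon["entries"].items():
        if entry.get("strongs") != key:
            problems.append(f"{key}: 'strongs' field is {entry.get('strongs')!r}")
        # Compare the language prefix, then the numeric part.
        same_prefix = key[:1] == low[:1] == high[:1]
        if not (same_prefix and int(low[1:]) <= int(key[1:]) <= int(high[1:])):
            problems.append(f"{key}: outside range {low}-{high}")
    return problems


if __name__ == "__main__":
    sample = {
        "range": {"from": "H5001", "to": "H6000"},
        "entries": {"H5001": {"strongs": "H5001"}},
    }
    print(check_lexicon_entries(sample))
```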
2. Validate and ingest
uv run gospelib-ingest run --only lexicon --dry-run
uv run gospelib-ingest run --only lexicon
Adding Topical Guide / Bible Dictionary Entries
These follow the alphabetical file pattern (tg/{letter}.json, bd/{letter}.json). Add entries to the appropriate letter file.
Adding Commentary
Verse Commentary
Create files at commentary/{commentaryId}/{bookId}.json following the verse-commentary schema.
Scholarly Commentary
Create files at scholarly/{commentaryId}.json following the scholarly-commentary schema.
Validation Rules
All corpus data is validated against these rules:
- Every file must have a schema field matching one of the seven family names
- Every bookId must exist in data/book_registry.json
- Verse numbers are 1-based and sequential within a chapter
- Chapter numbers are 1-based and sequential within a book
- PassageRef objects must reference valid book IDs
- Strong's numbers must match [HG]\d{4} format
- Pydantic models use extra="forbid", so unexpected fields cause validation failure
- Optional fields must be absent (not null or empty string) when not populated
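Some of these rules are easy to pre-check locally before invoking the ingest tool. The following is a rough sketch, not the service's own validator: `precheck_book` is a hypothetical helper that covers only the numbering rules and the null/empty-string rule for verse fields, and it flags any null or empty value without distinguishing optional fields from required ones.

```python
def check_sequential(numbers, label):
    """Report the first number that breaks the 1, 2, 3, ... sequence."""
    for expected, actual in enumerate(numbers, start=1):
        if actual != expected:
            return [f"{label}: expected {expected}, found {actual}"]
    return []


def precheck_book(book: dict) -> list:
    """Return a list of rule violations for a scripture-text dict."""
    errors = check_sequential([c["chapter"] for c in book["chapters"]], "chapter")
    for chapter in book["chapters"]:
        label = f"chapter {chapter['chapter']} verse"
        errors += check_sequential([v["verse"] for v in chapter["verses"]], label)
        for verse in chapter["verses"]:
            for field, value in verse.items():
                # Unpopulated fields should be omitted, not null or "".
                if value is None or value == "":
                    errors.append(f"{label} {verse['verse']}: '{field}' is empty")
    return errors


if __name__ == "__main__":
    book = {"chapters": [{"chapter": 1, "verses": [{"verse": 1, "text": "..."}]}]}
    print(precheck_book(book))
```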
Checklist
- Book ID follows kebab-case slug convention
- Book registered in data/book_registry.json (for new books)
- JSON file follows the correct schema family structure
- schema field is present and matches the family name
- All PassageRef targets reference valid book IDs
- Strong's numbers are zero-padded (H0430, not H430)
- Dry run passes with no Pydantic validation errors
- Full ingest succeeds and run report shows no errors
- Content is queryable via the content service API