M14: Corpus Harmonization — schemas/v0.1.0
- Version tag: schemas/v0.1.0
- Phase: P0: Foundation
- Target: Weeks 9--14
- Sprints: S3, S4, S5
Phase Context
Goal: Eliminate schema drift between the corpus downloader, ingest pipeline, and GOSPELIB-SCHEMAS.md by extracting shared Pydantic models into a single package, implementing the cross-reference architecture, and closing all schema/pipeline gaps.
Key constraint: Phase 1 (shared models package) is the linchpin -- every subsequent phase depends on it. The existing ingest pipeline must continue working throughout.
ZenHub Configuration
| Field | Value |
|---|---|
| Milestone | M14: Corpus Harmonization |
| Due Date | 2026-06-14 |
| Default Pipeline | Product Backlog |
| Primary Epic(s) | Schema Foundation & Critical Fixes, Ingest Core Pipeline, Corpus Validation |
Prerequisites
- M01: Data Pipeline -- ingest pipeline operational, all 7 schema family models exist, FalkorDB client working
Epic: Schema Foundation & Critical Fixes
Extract shared Pydantic models into packages/schemas/, align downloader models, add missing schema families.
Issues
| Issue | Title | Status | Notes |
|---|---|---|---|
| M14-001 | Create gospelib-schemas Package Scaffold | ✅ Done | packages/schemas with pyproject.toml, project.json, py.typed |
| M14-002 | Migrate Shared Types to gospelib-schemas | ✅ Done | 22+ exported types in gospelib_schemas/shared.py |
| M14-003 | Migrate Scripture Text Models to gospelib-schemas | ✅ Done | AnnotationBlock, ReferenceBlock, VerseNote, WordAlignment, Verse, Chapter |
| M14-004 | Migrate Lexicon Models to gospelib-schemas | ✅ Done | LexiconEntry, LexiconFile, LexiconRange |
| M14-005 | Migrate Remaining Schema Models to gospelib-schemas | ✅ Done | 8 domain model files implemented |
| M14-006 | Update Ingest and Downloader to Import from gospelib-schemas | ✅ Done | 12 ingest model files re-export from gospelib_schemas |
| M14-007 | Define Cross-References Pydantic Models | ✅ Done | CrossRefKind, CrossRefSource, CrossRefAnchor, CrossReference |
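
As a rough illustration of how the four cross-reference types from M14-007 might fit together, here is a minimal sketch. Only the class names come from this milestone; every field name, the enum members, and the `key()` and `edge_properties()` helpers are assumptions made for illustration. Plain dataclasses stand in for the Pydantic models so the example runs without dependencies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CrossRefKind(str, Enum):
    # Hypothetical members; the real enum values are not listed in this document.
    PARALLEL = "parallel"
    QUOTATION = "quotation"
    ALLUSION = "allusion"


@dataclass(frozen=True)
class CrossRefSource:
    # Assumed shape: an identifier plus a display name for the asserting dataset.
    source_id: str
    name: str


@dataclass(frozen=True)
class CrossRefAnchor:
    # Assumed shape: one end of a cross-reference, addressed at verse level.
    book: str
    chapter: int
    verse: int

    def key(self) -> str:
        """Return a dotted key such as 'john.1.1' (format is an assumption)."""
        return f"{self.book}.{self.chapter}.{self.verse}"


@dataclass(frozen=True)
class CrossReference:
    origin: CrossRefAnchor
    target: CrossRefAnchor
    kind: CrossRefKind
    source: CrossRefSource
    relevance: Optional[float] = None  # optional score; None when unscored

    def edge_properties(self) -> dict:
        """Build edge properties using the sourceId, kind, relevance names."""
        return {
            "sourceId": self.source.source_id,
            "kind": self.kind.value,
            "relevance": self.relevance,
        }


ref = CrossReference(
    origin=CrossRefAnchor("john", 1, 1),
    target=CrossRefAnchor("gen", 1, 1),
    kind=CrossRefKind.ALLUSION,
    source=CrossRefSource("example-source", "Example Source"),
    relevance=0.9,
)
props = ref.edge_properties()
```

Keeping the source as its own type means attribution can travel with each reference, which is the property M14-014 relies on when it adds sourceId, kind, and relevance to CROSS_REF edges.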
Epic: Ingest Core Pipeline
Align downloader models, implement cross-reference pipeline, add remaining pipelines.
Issues
| Issue | Title | Status | Notes |
|---|---|---|---|
| M14-008 | Add Cross-References Schema to GOSPELIB-SCHEMAS.md | ✅ Done | GOSPELIB-SCHEMAS.md section 14 fully documented |
| M14-009 | Align Downloader VerseNote Structure | ✅ Done | PR #960 |
| M14-010 | Align Downloader LexiconEntry Structure | ✅ Done | PR #960 |
| M14-011 | Align Downloader PassageRef and WordAlignment | ✅ Done | PR #960 |
| M14-012 | Align Downloader Book Metadata and SeeAlsoLink | ✅ Done | PR #960 |
| M14-013 | Implement Cross-Reference Ingest Pipeline (Stage 3.5) | ✅ Done | pipelines/cross_references.py, registered as stage 3.5 |
| M14-014 | Add Source Attribution to Existing CROSS_REF Edges | ✅ Done | sourceId, kind, relevance fields on edges |
| M14-015 | Register Cross-Reference Pipeline in Orchestrator | ✅ Done | runner.py STAGES map updated |
| M14-016 | Extract LDS Footnote Cross-References to Standalone Files | ✅ Done | tools/scripts/extract-cross-refs.py with Click CLI. PR #970 |
| M14-017 | Implement Proper Names and Versification Models | ✅ Done | gospelib_schemas/proper_names.py, versification.py |
| M14-018 | Implement Morphology Codes and Theographic Models | ✅ Done | gospelib_schemas/morphology_codes.py, theographic.py |
| M14-019 | Add Missing Schema Families to GOSPELIB-SCHEMAS.md | ✅ Done | Sections 15--18 cover 4 new schema families |
| M14-020 | Implement Remaining Ingest Pipelines | ✅ Done | proper_names, versification, theographic pipeline stages |
| M14-021 | Church Content Adapter Decomposition | ✅ Done | 5 providers, 60 unit tests. PR #970 |
| M14-022 | Update Documentation for Harmonization Changes | ✅ Done | GOSPELIB-SCHEMAS.md + apps/docs schema landscape |
| M14-023 | Witness-Source Ingest Pipeline | ✅ Done | Pydantic models, Dagster assets, 6 sources. PR #1326 |
| M14-024 | Commentary-Source Ingest Pipeline | ✅ Done | CommentarySource model, 56 sources, count-parity tests. PR #1343 |
Progress: 24 Done · 0 Partial · 0 To Do (100%)
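
One way a fractional stage such as 3.5 can be slotted between existing stages is a registry keyed by stage number and executed in sorted order. The sketch below is an assumption about the general pattern, not the contents of runner.py: the `STAGES` name is taken from the M14-015 note, while the stage functions, their return values, and `run_all` are hypothetical.

```python
from typing import Callable, Dict, List

StageFn = Callable[[], str]


def run_stage_three() -> str:
    # Placeholder for whatever stage 3 does in the real pipeline.
    return "stage_3"


def run_cross_references() -> str:
    # Placeholder for the cross-reference pipeline registered as stage 3.5.
    return "cross_references"


def run_stage_four() -> str:
    # Placeholder for whatever stage 4 does in the real pipeline.
    return "stage_4"


# Registration order is deliberately shuffled: execution order comes from
# the numeric keys, not from the order entries appear in the dict.
STAGES: Dict[float, StageFn] = {
    4.0: run_stage_four,
    3.0: run_stage_three,
    3.5: run_cross_references,
}


def run_all(stages: Dict[float, StageFn]) -> List[str]:
    """Run every stage in ascending stage-number order and collect results."""
    return [stages[number]() for number in sorted(stages)]


order = run_all(STAGES)
```

Numeric keys let a new stage be inserted without renumbering the existing ones, at the cost of stage numbers that no longer read as a simple count.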
Release Info
| Release | Tag | Contains |
|---|---|---|
| schemas/v0.1.0 | schemas/v0.1.0 | Shared models package, cross-reference schema, aligned downloader, updated docs |
Relevant Risks
| Risk | Impact | Mitigation |
|---|---|---|
| uv path resolution fails across workspace | Blocks shared package usage | Test with uv pip install -e before automating |
| Downloader drivers break on required fields | Blocks corpus download | Make fields optional during the transition, then tighten them to required later |
| New pipelines depend on nonexistent corpus data | Pipeline stages fail | Gate behind file existence checks |
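
The last mitigation in the table, gating a pipeline stage behind a file existence check, could look roughly like the following. This is a sketch under stated assumptions: `run_if_present`, `count_lines`, and the two file names are hypothetical and are not taken from the ingest code.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_if_present(corpus_file: Path, stage: Callable[[Path], int]) -> Optional[int]:
    """Run the stage only when its input file exists; return None when skipped."""
    if not corpus_file.is_file():
        return None
    return stage(corpus_file)


def count_lines(path: Path) -> int:
    # Stand-in for a real pipeline stage: counts the records in a file.
    return len(path.read_text(encoding="utf-8").splitlines())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical corpus files: one exists, one has not been downloaded yet.
    present = Path(tmp) / "proper_names.jsonl"
    present.write_text("{}\n{}\n", encoding="utf-8")
    missing = Path(tmp) / "theographic.jsonl"

    ran = run_if_present(present, count_lines)
    skipped = run_if_present(missing, count_lines)
```

Returning None for a skipped stage keeps the skip distinguishable from a stage that ran and processed zero records, which matters if the orchestrator reports per-stage counts.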