M01: Data Pipeline — v0.1.0-alpha
Version tag: ingest/v0.1.0-alpha
Phase: P0: Foundation
Target: Weeks 3–8
Sprints: S1, S2, S3
Phase Context
Goal: Corpus data flows from JSON files through the ingest pipeline into FalkorDB, and the content service serves passage data over HTTP.
Key constraint: Everything downstream depends on this. No reader, no interlinear, no graph view without a working data layer.
ZenHub Configuration
| Field | Value |
|---|---|
| Milestone | M01: Data Pipeline |
| Due Date | 2026-05-03 |
| Default Pipeline | Product Backlog |
| Primary Epic(s) | Ingest Core Pipeline, Ingest Test Suite, Corpus Validation |
Prerequisites
- M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration verified, structured logging configured
Epic: Ingest Core Pipeline
Implement the 7-stage ingest pipeline per GOSPELIB-INGEST-SPEC.md.
| Story Area | Scope | Spec Reference |
|---|---|---|
| Pydantic models | All 7 schema families + shared types | GOSPELIB-SCHEMAS.md § Schema Families |
| Book registry | Load + validate data/book_registry.json | GOSPELIB-INGEST-SPEC.md § Stage 1 |
| ID generators | All 10 ids.py functions | GOSPELIB-INGEST-SPEC.md § ID Derivation |
| File loader | Schema-dispatched Pydantic validation | GOSPELIB-INGEST-SPEC.md § loader.py |
| FalkorDB client | Connection pool, batch write helpers, retry | GOSPELIB-INGEST-SPEC.md § Batch Strategy |
| Cypher constants | All MERGE queries for 12 node types + 16 edge types | GOSPELIB-INGEST-SPEC.md § Cypher Constants |
| Index creation | Stage 0 — primary + secondary indices | GOSPELIB-INGEST-SPEC.md § Index Schema |
| Lexicon pipeline | Stage 2 — :Word nodes, DERIVES_FROM, RELATED_TO | GOSPELIB-INGEST-SPEC.md § Stage 2 |
| Scripture text pipeline | Stage 3 — :Passage, :Witness, :WordAlignment + all edges | GOSPELIB-INGEST-SPEC.md § Stage 3 |
| TG/BD/Index pipelines | Stage 4 — concurrent file I/O, sequential writes | GOSPELIB-INGEST-SPEC.md § Stage 4 |
| Commentary pipelines | Stage 5 — verse + scholarly commentary | GOSPELIB-INGEST-SPEC.md § Stage 5 |
| Pending resolution | Stage 6 — promote :PendingPassage stubs | GOSPELIB-INGEST-SPEC.md § Stage 6 |
| Pipeline runner | IngestRunner orchestrating stages 0–6 + report | GOSPELIB-INGEST-SPEC.md § Runner |
| CLI interface | Click commands: run, --dry-run, --only, --reset | GOSPELIB-INGEST-SPEC.md § CLI |
| Run report | JSON report with timing, counts, errors | GOSPELIB-INGEST-SPEC.md § Report |
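The file loader row calls for schema-dispatched validation. A minimal sketch of that dispatch pattern follows, using only the standard library so it runs without dependencies; the spec calls for Pydantic models as the validators. The `schema` key inside each file, the two schema names, and the validator functions are assumptions made for illustration. Only `_SCHEMA_MAP` and `FileLoadError` are names taken from the issue notes.

```python
import json
from typing import Any, Callable

Document = dict[str, Any]


class FileLoadError(Exception):
    """Raised when a corpus file cannot be parsed or validated."""


def _validate_lexicon(doc: Document) -> Document:
    # Hypothetical rule: a lexicon file must carry a list of entries.
    if not isinstance(doc.get("entries"), list):
        raise ValueError("lexicon file needs an 'entries' list")
    return doc


def _validate_corpus(doc: Document) -> Document:
    # Hypothetical rule: a corpus file must carry a list of passages.
    if not isinstance(doc.get("passages"), list):
        raise ValueError("corpus file needs a 'passages' list")
    return doc


# Dispatch table: schema name -> validator. With Pydantic, each value
# would be a model's validation entry point instead of a plain function.
_SCHEMA_MAP: dict[str, Callable[[Document], Document]] = {
    "lexicon": _validate_lexicon,
    "corpus": _validate_corpus,
}


def load_document(raw: str, source: str = "<memory>") -> Document:
    """Parse JSON text and validate it with the validator for its schema."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise FileLoadError(f"{source}: invalid JSON ({exc})") from exc
    if not isinstance(doc, dict):
        raise FileLoadError(f"{source}: top-level value must be an object")
    schema = doc.get("schema")
    validator = _SCHEMA_MAP.get(schema) if isinstance(schema, str) else None
    if validator is None:
        raise FileLoadError(f"{source}: unknown schema {schema!r}")
    try:
        return validator(doc)
    except ValueError as exc:
        raise FileLoadError(f"{source}: {exc}") from exc
```

Wrapping every failure in one error type lets a runner record the failing file in the run report and continue, rather than handling JSON and validation errors separately.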
Issues
| ID | Title | Status | Notes |
|---|---|---|---|
| M01-001 | Implement Pydantic v2 Models for All 7 Schema Families | ✅ Done | Re-exporting from gospelib-schemas is the intended design; the models/ package covers all 7 schema families |
| M01-002 | Implement Book Registry Loader | ✅ Done | BookRegistry class with load/resolve/is_known methods |
| M01-003 | Implement ID Derivation Functions | ✅ Done | All 12 ID functions in ids.py |
| M01-004 | Implement Schema-Dispatched File Loader | ✅ Done | loader.py (150+ lines) with _SCHEMA_MAP dispatch dict, discover_files(), FileLoadError |
| M01-005 | Implement FalkorDB Client, Batch Writer, and Retry Logic | ✅ Done | db/client.py + db/batch.py with async connection pooling, cache-aside, and retry logic |
| M01-006 | Implement Cypher MERGE Constants for All Node and Edge Types | ✅ Done | db/cypher.py (517 lines) with MERGE templates for all 12 node types and 16+ edge types |
| M01-007 | Implement FalkorDB Index and Schema Creation (Stage 0) | ✅ Done | db/schema.py (200+ lines) with primary and secondary index creation |
| M01-008 | Implement Lexicon Pipeline (Stage 2) | ✅ Done | pipelines/lexicon.py with Word nodes, DERIVES_FROM/RELATED_TO edges |
| M01-009 | Implement Scripture Text Pipeline (Stage 3) | ✅ Done | pipelines/scripture_text.py with Passage/Witness/WordAlignment processing |
| M01-010 | Implement TG, BD, and Scripture Index Pipelines (Stage 4) | ✅ Done | pipelines/topical_guide.py, bible_dictionary.py, scripture_index.py |
| M01-011 | Implement Commentary Pipelines (Stage 5) | ✅ Done | pipelines/verse_commentary.py and scholarly.py |
| M01-012 | Implement Pending Reference Resolution (Stage 6) | ✅ Done | pipelines/pending_resolution.py + pending.py |
| M01-013 | Implement Pipeline Runner (Stage Orchestrator) | ✅ Done | runner.py (660 lines) with STAGES map and run_pipeline() orchestration |
| M01-014 | Implement CLI Interface with Click | ✅ Done | main.py with Click run command, --dry-run / --only / --reset flags |
| M01-015 | Implement Ingest Run Report | ✅ Done | report.py with IngestReport, timing, counts, errors, write() |
| M01-016 | Implement Unit Tests for Registry, IDs, Models, and Batch Helpers | ✅ Done | 12+ unit test files: test_registry.py, test_ids.py, test_models.py, test_loader.py, test_cypher.py, test_batch.py, test_client.py, etc. |
| M01-017 | Implement Integration Tests for All Pipelines | ✅ Done | 14 integration test files covering every pipeline stage plus idempotency and graph writes |
| M01-018 | Create Minimal Test Fixtures for All 7 Schema Families | ✅ Done | data/fixtures/ with bd/, commentary/, corpus/, cross-references/, index/, lexicon/, scholarly/, tg/ |
| M01-019 | Implement Smoke Test for Full Pipeline | ✅ Done | tests/smoke_test.py |
| M01-020 | Corpus Schema Compliance Audit | ✅ Done | Audit CLI command and schema compliance validation implemented |
| M01-021 | Data Quality Checks (Cross-Refs, Strong's, Encoding) | ✅ Done | Data quality checks for orphaned Strong's numbers, missing cross-refs, and encoding issues implemented |
| M01-022 | Generate Test Fixtures from Real Corpus Files | ✅ Done | Script to auto-extract minimal fixtures from live corpus files implemented |
Progress: 22 Done · 0 Partial · 0 To Do (100%)
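M01-005 pairs batch writes with retry logic. The sketch below shows one common shape for that combination: split rows into fixed-size batches, then retry a failed write with exponential backoff. It is an illustration only. The function names, the `ConnectionError` trigger, the attempt count, and the delays are all assumptions; the actual batch strategy is defined in GOSPELIB-INGEST-SPEC.md § Batch Strategy.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_with_retry(
    write_batch: Callable[[Sequence[T]], None],
    batch: Sequence[T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call write_batch, retrying on ConnectionError with doubling delays.

    Returns the number of attempts used. Re-raises after the final attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            write_batch(batch)
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise
            # Delay doubles each time: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps unit tests fast: a test can record the requested delays instead of waiting for them.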
Epic: Ingest Test Suite
| Story Area | Scope | Spec Reference |
|---|---|---|
| Unit tests | Registry, ID derivation, Pydantic models (valid + invalid), batch helpers | GOSPELIB-INGEST-SPEC.md § Testing |
| Integration tests | Each pipeline against real FalkorDB (testcontainers) | GOSPELIB-INGEST-SPEC.md § Testing |
| Test fixtures | Minimal corpus files in data/fixtures/ | GOSPELIB-SCHEMAS.md |
| Smoke test | Full pipeline on all fixtures | GOSPELIB-INGEST-SPEC.md § Testing |
Issues
(Issues for this epic are numbered as part of the main M01 issue list above)
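The integration tests cover idempotency: running ingest twice must leave the graph unchanged. The sketch below shows the shape of such a test against an in-memory stand-in for the graph, so it runs without a database. The project's integration tests run against FalkorDB via testcontainers instead. `merge_node`, `ingest`, and the `Graph` type are hypothetical names, and the dictionary only imitates the match-or-create behaviour of a Cypher `MERGE`.

```python
import copy
from typing import Any

# Stand-in graph: nodes keyed by (label, id). Not a FalkorDB API.
Graph = dict[tuple[str, str], dict[str, Any]]


def merge_node(graph: Graph, label: str, node_id: str, props: dict[str, Any]) -> None:
    """Match-or-create: reuse the node keyed by (label, id), else create it."""
    node = graph.setdefault((label, node_id), {})
    node.update(props)


def ingest(graph: Graph, words: list[dict[str, Any]]) -> None:
    """Write one Word node per lexicon entry."""
    for word in words:
        merge_node(graph, "Word", word["id"], {"lemma": word["lemma"]})


def test_ingest_is_idempotent() -> None:
    words = [
        {"id": "G26", "lemma": "agape"},
        {"id": "G25", "lemma": "agapao"},
    ]
    graph: Graph = {}
    ingest(graph, words)
    snapshot = copy.deepcopy(graph)

    # A second run must neither add nodes nor change properties.
    ingest(graph, words)
    assert graph == snapshot
    assert len(graph) == 2
```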
Epic: Corpus Validation
| Story Area | Scope | Spec Reference |
|---|---|---|
| Schema compliance audit | Validate every corpus JSON file against Pydantic models | GOSPELIB-SCHEMAS.md |
| Data quality checks | Missing cross-refs, orphaned Strong's numbers, encoding issues | GOSPELIB-SCHEMAS.md § Cross-References |
| Fixture generation | Create minimal test fixtures from real corpus files | GOSPELIB-SCHEMAS.md |
Issues
(Issues for this epic are numbered as part of the main M01 issue list above)
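The three data quality checks in the table above can be sketched as small pure functions. The data shapes here are assumptions: alignments as dicts with a `strongs` key, cross-references as `(source, target)` pairs, and passage IDs in a made-up dotted format. "Orphaned" is read as a Strong's number referenced by an alignment but absent from the lexicon, which may differ from the definition the audit uses.

```python
from typing import Iterable, Mapping

# Markers that often indicate a decoding problem: the Unicode replacement
# character, and the two-character prefix produced when UTF-8 punctuation
# such as a curly apostrophe is misread as Windows-1252.
_SUSPICIOUS = ("\ufffd", "\u00e2\u20ac")


def find_orphaned_strongs(
    alignments: Iterable[Mapping[str, str]], lexicon_ids: Iterable[str]
) -> list[str]:
    """Strong's numbers used by alignments but missing from the lexicon."""
    referenced = {alignment["strongs"] for alignment in alignments}
    return sorted(referenced - set(lexicon_ids))


def find_missing_cross_refs(
    cross_refs: Iterable[tuple[str, str]], passage_ids: Iterable[str]
) -> list[tuple[str, str]]:
    """Cross-references whose target passage is not in the corpus."""
    known = set(passage_ids)
    return [(src, dst) for src, dst in cross_refs if dst not in known]


def find_encoding_issues(texts: Mapping[str, str]) -> list[str]:
    """IDs of passages whose text contains a suspicious marker."""
    return sorted(
        passage_id
        for passage_id, text in texts.items()
        if any(marker in text for marker in _SUSPICIOUS)
    )
```

Each function returns a sorted or input-ordered list, so repeated audit runs over the same corpus produce identical reports that are easy to diff.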
Document References
| Doc | Contains | Use When Writing Stories For |
|---|---|---|
| MVP.md | Feature scope, tier breakdown, success criteria, budget | Acceptance criteria, scope boundaries |
| TECH-SPEC.md | Architecture, service boundaries, data stores, API catalog | Technical implementation details |
| GOSPELIB-SCHEMAS.md | All 7 schema families, node/edge types, validation rules | Data models, Pydantic models, graph schema |
| GOSPELIB-INGEST-SPEC.md | 7-stage pipeline, Cypher templates, batch strategy, CLI | Ingest pipeline stories |
| REPO-MAP.md | Directory structure, naming conventions, dependency rules | All stories (coding standards) |
| Business | LEGAL.md, POLICY-TERMS.md, executive summary, market research, GTM | Launch readiness, legal/compliance stories |
Sprint Mapping
| Sprint | Weeks | Primary Focus |
|---|---|---|
| S1 | 3–4 | Pydantic models, book registry, ID generators, file loader |
| S2 | 5–6 | FalkorDB client, Cypher constants, Stages 0–2 (indices + lexicon) |
| S3 | 7–8 | Stages 3–6 (scripture text → pending resolution), CLI, report |
Sprint Load Warnings
No explicit load warnings for S1–S3. S3 is nonetheless the densest sprint in M01: it covers Stages 3–6 plus the CLI and run report in two weeks, completing all remaining pipeline stages.
Release Info
| Release | Tag | Contains |
|---|---|---|
| v0.1.0-alpha | ingest/v0.1.0-alpha | Full ingest pipeline operational — complete corpus ingested into local FalkorDB |
Relevant Risks
| Risk | Impact | Mitigation |
|---|---|---|
| Ingest pipeline data quality issues | Blocks all downstream features | Corpus validation epic in P0; dry-run + schema enforcement |
| FalkorDB performance at corpus scale | Slow content API, bad UX | Benchmark after M01; index tuning; caching layer in M02 |
| Missing specification documents | GOSPELIB-MIGRATION-SPEC.md not written | Document corpus migration process before M01 |
Cross-Cutting Concerns
Testing
| Layer | Framework | When | Spec Reference |
|---|---|---|---|
| Python unit/integration | pytest + testcontainers | Every PR | GOSPELIB-INGEST-SPEC.md § Testing |
Documentation
| Doc | Update Trigger |
|---|---|
| Getting Started | M01 complete — document local setup |
| Running Ingest | M01 complete — document pipeline operation |
| ADRs | Each major technical decision |
CI/CD
| Addition | Detail |
|---|---|
| Python test containers in CI | FalkorDB + PostgreSQL service containers for ingest integration tests |
Issue Dependency Graph
Foundation (S1, no blockers):
M01-001 (Models) M01-002 (Registry) M01-003 (IDs)
M01-005 (DB Client) M01-006 (Cypher) M01-015 (Report)
Infrastructure (S2):
M01-005 ──► M01-007 (Indices)
M01-001 ──► M01-004 (Loader)
Pipelines (S2–S3):
M01-001 ─┐
M01-005 ─┼──► M01-008 (Lexicon) ──► M01-009 (Scripture) ──► M01-011 (Commentary) ──┐
M01-006 ─┘                                                                         │
M01-001 ─┐                                                                         │
M01-003 ─┼──► M01-010 (TG/BD/Index) ───────────────────────────────────────────────┤
M01-005 ─┘                                                                         │
                                                                                   ▼
M01-002 ─────────────────────────────────────────────────────────────────► M01-012 (Pending)
Orchestration (S3):
M01-004 ─┐
M01-007 ─┤
M01-008 ─┤
M01-009 ─┼──► M01-013 (Runner) ──► M01-014 (CLI)
M01-010 ─┤ ▲
M01-011 ─┤ │
M01-012 ─┘ M01-015 (Report)
Testing (S1–S3):
M01-001 ──► M01-018 (Fixtures) ──► M01-017 (Integration) ──► M01-019 (Smoke)
M01-022 (Fixture Gen) ──► M01-018 │ ▲
└─────────────────► M01-019
M01-016 (Unit Tests) — independent of pipeline issues
M01-013 (Runner) ──► M01-019
Validation (S2):
M01-001 ──► M01-020 (Schema Audit) ──► M01-021 (Quality Checks)
M01-004 ──► M01-020
Legend: A ──► B means A blocks B (B is blocked by A)
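A dependency graph like this can be turned into a valid work order with a topological sort. The sketch below transcribes the edges from the Infrastructure, Pipelines, and Orchestration groups and orders them with the standard library's `graphlib`. M01-015 (Report) and the Testing and Validation groups are left out for brevity. `BLOCKED_BY` and `build_order` are illustrative names, not part of the project.

```python
from graphlib import TopologicalSorter

# issue -> set of issues that block it (A ──► B means B is blocked by A)
BLOCKED_BY: dict[str, set[str]] = {
    "M01-004": {"M01-001"},
    "M01-007": {"M01-005"},
    "M01-008": {"M01-001", "M01-005", "M01-006"},
    "M01-009": {"M01-008"},
    "M01-010": {"M01-001", "M01-003", "M01-005"},
    "M01-011": {"M01-009"},
    "M01-012": {"M01-002", "M01-010", "M01-011"},
    "M01-013": {
        "M01-004", "M01-007", "M01-008", "M01-009",
        "M01-010", "M01-011", "M01-012",
    },
    "M01-014": {"M01-013"},
}


def build_order(blocked_by: dict[str, set[str]]) -> list[str]:
    """Return issues ordered so every blocker precedes what it blocks.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(blocked_by).static_order())


if __name__ == "__main__":
    for issue in build_order(BLOCKED_BY):
        print(issue)
```

The order among unrelated issues is not unique, but any order returned respects every edge. In this subset M01-014 (CLI) always comes last, because every other issue blocks it directly or transitively.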
Dependencies
Upstream (what M01 needs)
- M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration, structured logging
Downstream (what depends on M01)
- M02: Content API — depends on the M01 FalkorDB client and the scripture text data ingested in Stage 3
- M05: Search & Staging — Typesense sync extends the ingest pipeline built here
- M06: Interlinear & Lexicon — depends on word alignment + lexicon data ingested here
- M08: Knowledge Graph — depends on cross-ref + topical guide edges ingested here
Summary
| Metric | Count |
|---|---|
| Total Issues | 22 |
| Sub-Issues | 3 |
| Total Estimate (pts) | 102 |
| Sprints | S1–S3 |
| Dependencies (blocking) | 61 |
| Dependencies (blocked by) | 56 |