M01: Data Pipeline — `v0.1.0-alpha`

Version tag: ingest/v0.1.0-alpha
Phase: P0: Foundation
Target: Weeks 3–8
Sprints: S1, S2, S3

Phase Context

Goal: Corpus data flows from JSON files through the ingest pipeline into FalkorDB, and the content service serves passage data over HTTP.

Key constraint: Everything downstream depends on this. No reader, no interlinear, no graph view without a working data layer.

ZenHub Configuration

Field	Value
Milestone	M01: Data Pipeline
Due Date	2026-05-03
Default Pipeline	Product Backlog
Primary Epic(s)	Ingest Core Pipeline, Ingest Test Suite, Corpus Validation

Prerequisites

M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration verified, structured logging configured

Epic: Ingest Core Pipeline

Implement the 7-stage ingest pipeline per GOSPELIB-INGEST-SPEC.md.

Story Area	Scope	Spec Reference
Pydantic models	All 7 schema families + shared types	GOSPELIB-SCHEMAS.md § Schema Families
Book registry	Load + validate `data/book_registry.json`	`GOSPELIB-INGEST-SPEC.md` § Stage 1
ID generators	All 10 `ids.py` functions	`GOSPELIB-INGEST-SPEC.md` § ID Derivation
File loader	Schema-dispatched Pydantic validation	`GOSPELIB-INGEST-SPEC.md` § loader.py
FalkorDB client	Connection pool, batch write helpers, retry	`GOSPELIB-INGEST-SPEC.md` § Batch Strategy
Cypher constants	All MERGE queries for 12 node types + 16 edge types	`GOSPELIB-INGEST-SPEC.md` § Cypher Constants
Index creation	Stage 0 — primary + secondary indices	`GOSPELIB-INGEST-SPEC.md` § Index Schema
Lexicon pipeline	Stage 2 — `:Word` nodes, `DERIVES_FROM`, `RELATED_TO`	`GOSPELIB-INGEST-SPEC.md` § Stage 2
Scripture text pipeline	Stage 3 — `:Passage`, `:Witness`, `:WordAlignment` + all edges	`GOSPELIB-INGEST-SPEC.md` § Stage 3
TG/BD/Index pipelines	Stage 4 — concurrent file I/O, sequential writes	`GOSPELIB-INGEST-SPEC.md` § Stage 4
Commentary pipelines	Stage 5 — verse + scholarly commentary	`GOSPELIB-INGEST-SPEC.md` § Stage 5
Pending resolution	Stage 6 — promote `:PendingPassage` stubs	`GOSPELIB-INGEST-SPEC.md` § Stage 6
Pipeline runner	`IngestRunner` orchestrating stages 0–6 + report	`GOSPELIB-INGEST-SPEC.md` § Runner
CLI interface	Click commands: `run`, `--dry-run`, `--only`, `--reset`	`GOSPELIB-INGEST-SPEC.md` § CLI
Run report	JSON report with timing, counts, errors	`GOSPELIB-INGEST-SPEC.md` § Report
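The runner and run report rows above can be pictured with a minimal sketch. It is not the project's `IngestRunner`: the `run_stages` function, the `Report` fields, and the stub stages with their counts are assumptions made for illustration, and nothing here touches FalkorDB. It only shows stages running in numeric order, an `--only`-style filter, a dry-run switch, and a report that collects timing, counts, and errors.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

# A stage takes the dry-run flag and returns how many items it wrote.
Stage = Callable[[bool], int]


def _stub(count: int) -> Stage:
    """Build a placeholder stage; real stages would write to the graph."""
    def run(dry_run: bool) -> int:
        # In dry-run mode nothing is written, so report zero writes.
        return 0 if dry_run else count
    return run


# Hypothetical stage table: stage number -> (name, callable). Counts are made up.
STAGES: dict[int, tuple[str, Stage]] = {
    0: ("indices", _stub(4)),
    1: ("book_registry", _stub(1)),
    2: ("lexicon", _stub(10)),
    3: ("scripture_text", _stub(25)),
}


@dataclass
class Report:
    counts: dict[str, int] = field(default_factory=dict)
    timings: dict[str, float] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


def run_stages(only: Optional[set[int]] = None, dry_run: bool = False) -> Report:
    """Run stages in numeric order, recording counts, timing, and errors."""
    report = Report()
    for number in sorted(STAGES):
        if only is not None and number not in only:
            continue
        name, stage = STAGES[number]
        start = time.perf_counter()
        try:
            report.counts[name] = stage(dry_run)
        except Exception as exc:
            # Record the failure and stop, since later stages build on earlier ones.
            report.errors.append(f"stage {number} ({name}): {exc}")
            break
        finally:
            report.timings[name] = time.perf_counter() - start
    return report
```

Calling `run_stages(only={2, 3})` runs just the lexicon and scripture text stubs, which mirrors how an `--only` flag could narrow a run while keeping stage order fixed.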

Issues

ID	Title	Status	Notes
M01-001	Implement Pydantic v2 Models for All 7 Schema Families	✅ Done	Re-exporting from gospelib-schemas is the intended design; the models/ package covers all 7 schema families
M01-002	Implement Book Registry Loader	✅ Done	BookRegistry class with load/resolve/is_known methods
M01-003	Implement ID Derivation Functions	✅ Done	All 12 ID functions in ids.py
M01-004	Implement Schema-Dispatched File Loader	✅ Done	loader.py (150+ lines) with _SCHEMA_MAP dispatch dict, discover_files(), FileLoadError
M01-005	Implement FalkorDB Client, Batch Writer, and Retry Logic	✅ Done	db/client.py + db/batch.py with async connection pooling, cache-aside, and retry logic
M01-006	Implement Cypher MERGE Constants for All Node and Edge Types	✅ Done	db/cypher.py (517 lines) with MERGE templates for all 12 node types and 16+ edge types
M01-007	Implement FalkorDB Index and Schema Creation (Stage 0)	✅ Done	db/schema.py (200+ lines) with primary and secondary index creation
M01-008	Implement Lexicon Pipeline (Stage 2)	✅ Done	pipelines/lexicon.py with Word nodes, DERIVES_FROM/RELATED_TO edges
M01-009	Implement Scripture Text Pipeline (Stage 3)	✅ Done	pipelines/scripture_text.py with Passage/Witness/WordAlignment processing
M01-010	Implement TG, BD, and Scripture Index Pipelines (Stage 4)	✅ Done	pipelines/topical_guide.py, bible_dictionary.py, scripture_index.py
M01-011	Implement Commentary Pipelines (Stage 5)	✅ Done	pipelines/verse_commentary.py and scholarly.py
M01-012	Implement Pending Reference Resolution (Stage 6)	✅ Done	pipelines/pending_resolution.py + pending.py
M01-013	Implement Pipeline Runner (Stage Orchestrator)	✅ Done	runner.py (660 lines) with STAGES map and run_pipeline() orchestration
M01-014	Implement CLI Interface with Click	✅ Done	main.py with Click run command, --dry-run / --only / --reset flags
M01-015	Implement Ingest Run Report	✅ Done	report.py with IngestReport, timing, counts, errors, write()
M01-016	Implement Unit Tests for Registry, IDs, Models, and Batch Helpers	✅ Done	12+ unit test files: test_registry.py, test_ids.py, test_models.py, test_loader.py, test_cypher.py, test_batch.py, test_client.py, etc.
M01-017	Implement Integration Tests for All Pipelines	✅ Done	14 integration test files covering every pipeline stage plus idempotency and graph writes
M01-018	Create Minimal Test Fixtures for All 7 Schema Families	✅ Done	data/fixtures/ with bd/, commentary/, corpus/, cross-references/, index/, lexicon/, scholarly/, tg/
M01-019	Implement Smoke Test for Full Pipeline	✅ Done	tests/smoke_test.py
M01-020	Corpus Schema Compliance Audit	✅ Done	Audit CLI command and schema compliance validation implemented
M01-021	Data Quality Checks (Cross-Refs, Strong's, Encoding)	✅ Done	Data quality checks for orphaned Strong's numbers, missing cross-refs, and encoding issues implemented
M01-022	Generate Test Fixtures from Real Corpus Files	✅ Done	Script to auto-extract minimal fixtures from live corpus files implemented

Progress: 22 Done · 0 Partial · 0 To Do (100%)
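The schema-dispatched loader noted under M01-004 follows a common pattern: look up a validator by schema name, then build the model from the payload. The sketch below illustrates that pattern only. The top-level `"schema"` and `"data"` keys, the two model classes, and the `load_document` function are assumptions, and stdlib dataclasses stand in for the Pydantic models so the example runs without third-party packages.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


class FileLoadError(Exception):
    """Raised when a corpus file cannot be parsed or validated."""


@dataclass
class LexiconEntry:  # stand-in for a Pydantic model (hypothetical fields)
    strongs: str
    lemma: str


@dataclass
class Passage:  # stand-in for a Pydantic model (hypothetical fields)
    ref: str
    text: str


# Dispatch table: schema name -> constructor that validates the payload.
_SCHEMA_MAP: dict[str, Callable[..., Any]] = {
    "lexicon": LexiconEntry,
    "passage": Passage,
}


def load_document(raw: str, source: str = "<memory>") -> Any:
    """Parse JSON text and build the model selected by its "schema" key."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise FileLoadError(f"{source}: invalid JSON: {exc}") from exc
    if not isinstance(doc, dict):
        raise FileLoadError(f"{source}: expected a JSON object")
    schema = doc.get("schema")
    if not isinstance(schema, str) or schema not in _SCHEMA_MAP:
        raise FileLoadError(f"{source}: unknown schema {schema!r}")
    try:
        return _SCHEMA_MAP[schema](**doc.get("data", {}))
    except TypeError as exc:  # missing fields, unexpected fields, or non-object data
        raise FileLoadError(f"{source}: {exc}") from exc


def load_file(path: Path) -> Any:
    """Read a UTF-8 JSON file and dispatch it through load_document."""
    return load_document(path.read_text(encoding="utf-8"), source=str(path))
```

Wrapping every failure in one exception type lets a caller collect per-file errors for the run report instead of aborting on the first bad file.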

Epic: Ingest Test Suite

Story Area	Scope	Spec Reference
Unit tests	Registry, ID derivation, Pydantic models (valid + invalid), batch helpers	`GOSPELIB-INGEST-SPEC.md` § Testing
Integration tests	Each pipeline against real FalkorDB (testcontainers)	`GOSPELIB-INGEST-SPEC.md` § Testing
Test fixtures	Minimal corpus files in `data/fixtures/`	GOSPELIB-SCHEMAS.md
Smoke test	Full pipeline on all fixtures	`GOSPELIB-INGEST-SPEC.md` § Testing
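Because the pipeline writes with MERGE queries, re-running it should leave the graph unchanged, and the integration suite covers idempotency for that reason. The following pytest-style sketch shows the shape of such a check. It is only an illustration: `FakeGraph`, `merge_node`, and `ingest_lexicon` are hypothetical in-memory stand-ins, whereas the real tests run against FalkorDB through testcontainers.

```python
import copy


class FakeGraph:
    """In-memory stand-in for a graph store with MERGE-like upserts (hypothetical)."""

    def __init__(self) -> None:
        self.nodes: dict[tuple[str, str], dict] = {}

    def merge_node(self, label: str, node_id: str, **props) -> bool:
        """Create the node if absent, else update its properties. Returns True if created."""
        key = (label, node_id)
        created = key not in self.nodes
        self.nodes.setdefault(key, {}).update(props)
        return created


def ingest_lexicon(graph: FakeGraph, entries: list[dict]) -> int:
    """Upsert one Word node per entry; return how many nodes were newly created."""
    return sum(graph.merge_node("Word", e["strongs"], lemma=e["lemma"]) for e in entries)


def test_lexicon_ingest_is_idempotent() -> None:
    graph = FakeGraph()
    # Sample data for illustration only.
    entries = [
        {"strongs": "G3056", "lemma": "logos"},
        {"strongs": "G2316", "lemma": "theos"},
    ]
    assert ingest_lexicon(graph, entries) == 2  # first run creates both nodes
    snapshot = copy.deepcopy(graph.nodes)
    assert ingest_lexicon(graph, entries) == 0  # second run creates nothing
    assert graph.nodes == snapshot              # and leaves the graph unchanged
```

The same two-run structure carries over to a containerized test: ingest the fixtures, record node and edge counts, ingest again, and compare.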

Issues

(Issues for this epic are M01-016 through M01-019 in the main M01 issue list above.)

Epic: Corpus Validation

Story Area	Scope	Spec Reference
Schema compliance audit	Validate every corpus JSON file against Pydantic models	GOSPELIB-SCHEMAS.md
Data quality checks	Missing cross-refs, orphaned Strong's numbers, encoding issues	GOSPELIB-SCHEMAS.md § Cross-References
Fixture generation	Create minimal test fixtures from real corpus files	GOSPELIB-SCHEMAS.md
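The data quality checks named above reduce to a few small comparisons. The sketch below is one possible shape for them, not the project's audit command: the function names are hypothetical, and the mojibake pattern is a heuristic that catches only the most common case of UTF-8 text decoded as Latin-1 or Windows-1252.

```python
import re

# Heuristic: UTF-8 bytes misread as Latin-1/cp1252 tend to produce "Ã" followed by a
# character in U+0080..U+00BF, or the sequence "â€" (from curly quotes and dashes).
_MOJIBAKE_RE = re.compile(r"Ã[\x80-\xbf]|â€")


def find_orphaned_strongs(referenced: set[str], lexicon: set[str]) -> set[str]:
    """Strong's numbers referenced by the text that have no lexicon entry."""
    return referenced - lexicon


def find_missing_cross_refs(targets: set[str], known_passages: set[str]) -> set[str]:
    """Cross-reference targets that do not resolve to a known passage."""
    return targets - known_passages


def find_encoding_issues(text: str) -> list[str]:
    """Return human-readable descriptions of suspicious encoding artifacts."""
    issues = []
    if "\ufffd" in text:
        issues.append("replacement character U+FFFD")
    if _MOJIBAKE_RE.search(text):
        issues.append("possible mojibake")
    return issues
```

For example, `find_orphaned_strongs({"G3056", "H0430"}, {"G3056"})` returns `{"H0430"}`, which would flag an alignment pointing at a lexicon entry that was never ingested.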

Issues

(Issues for this epic are M01-020 through M01-022 in the main M01 issue list above.)

Document References

Doc	Contains	Use When Writing Stories For
MVP.md	Feature scope, tier breakdown, success criteria, budget	Acceptance criteria, scope boundaries
TECH-SPEC.md	Architecture, service boundaries, data stores, API catalog	Technical implementation details
GOSPELIB-SCHEMAS.md	All 7 schema families, node/edge types, validation rules	Data models, Pydantic models, graph schema
`GOSPELIB-INGEST-SPEC.md`	7-stage pipeline, Cypher templates, batch strategy, CLI	Ingest pipeline stories
REPO-MAP.md	Directory structure, naming conventions, dependency rules	All stories (coding standards)
Business	LEGAL.md, POLICY-TERMS.md, executive summary, market research, GTM	Launch readiness, legal/compliance stories

Sprint Mapping

Sprint	Weeks	Primary Focus
S1	3–4	Pydantic models, book registry, ID generators, file loader
S2	5–6	FalkorDB client, Cypher constants, Stage 0–2 (indices + lexicon)
S3	7–8	Stages 3–6 (scripture text → pending resolution), CLI, report

Sprint Load Warnings

No explicit load warnings for S1–S3. S3 is the densest sprint in M01, though: it covers Stages 3–6 plus the CLI and run report in two weeks, completing all remaining pipeline stages.

Release Info

Release	Tag	Contains
`v0.1.0-alpha`	`ingest/v0.1.0-alpha`	Full ingest pipeline operational — complete corpus ingested into local FalkorDB

Relevant Risks

Risk	Impact	Mitigation
Ingest pipeline data quality issues	Blocks all downstream features	Corpus validation epic in P0; dry-run + schema enforcement
FalkorDB performance at corpus scale	Slow content API, bad UX	Benchmark after M01; index tuning; caching layer in M02
Missing specification documents	GOSPELIB-MIGRATION-SPEC.md not written	Document corpus migration process before M01

Cross-Cutting Concerns

Testing

Layer	Framework	When	Spec Reference
Python unit/integration	pytest + testcontainers	Every PR	`GOSPELIB-INGEST-SPEC.md` § Testing

Documentation

Doc	Update Trigger
Getting Started	M01 complete — document local setup
Running Ingest	M01 complete — document pipeline operation
ADRs	Each major technical decision

CI/CD

Addition	Detail
Python test containers in CI	FalkorDB + PostgreSQL service containers for ingest integration tests

Issue Dependency Graph

Foundation (S1, no blockers):
  M01-001 (Models)     M01-002 (Registry)    M01-003 (IDs)
  M01-005 (DB Client)  M01-006 (Cypher)      M01-015 (Report)

Infrastructure (S2):
  M01-005 ──► M01-007 (Indices)
  M01-001 ──► M01-004 (Loader)

Pipelines (S2–S3):
  M01-001 ─┐
  M01-005 ─┼──► M01-008 (Lexicon) ──► M01-009 (Scripture) ──► M01-011 (Commentary) ──┐
  M01-006 ─┘                                                                         │
  M01-001 ─┐                                                                         │
  M01-003 ─┼──► M01-010 (TG/BD/Index) ──────────────────────────────────────────────┤
  M01-005 ─┘                                                                         │
                                                                                      ▼
  M01-002 ──────────────────────────────────────────────────────► M01-012 (Pending)

Orchestration (S3):
  M01-004 ─┐
  M01-007 ─┤
  M01-008 ─┤
  M01-009 ─┼──► M01-013 (Runner) ──► M01-014 (CLI)
  M01-010 ─┤                              ▲
  M01-011 ─┤                              │
  M01-012 ─┘                         M01-015 (Report)

Testing (S1–S3):
  M01-001 ──► M01-018 (Fixtures) ──► M01-017 (Integration) ──► M01-019 (Smoke)
  M01-022 (Fixture Gen) ──► M01-018       │                         ▲
                                           └─────────────────► M01-019
  M01-016 (Unit Tests) — independent of pipeline issues
  M01-013 (Runner) ──► M01-019

Validation (S2):
  M01-001 ──► M01-020 (Schema Audit) ──► M01-021 (Quality Checks)
  M01-004 ──► M01-020

Legend: A ──► B means A blocks B (B is blocked by A)
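A graph like this can be turned into a workable issue order with a topological sort. The sketch below uses Python's standard `graphlib` module on a subset of the edges above (the Infrastructure and Validation sections plus Runner to CLI). The `BLOCKED_BY` mapping and `work_order` function are illustrative names, and the remaining edges are left out for brevity.

```python
from graphlib import TopologicalSorter

# Subset of the edges above, written as {issue: set of issues that block it}.
BLOCKED_BY: dict[str, set[str]] = {
    "M01-004": {"M01-001"},             # Models block Loader
    "M01-007": {"M01-005"},             # DB Client blocks Indices
    "M01-020": {"M01-001", "M01-004"},  # Models and Loader block Schema Audit
    "M01-021": {"M01-020"},             # Schema Audit blocks Quality Checks
    "M01-014": {"M01-013"},             # Runner blocks CLI
}


def work_order(blocked_by: dict[str, set[str]]) -> list[str]:
    """Return issues in an order where every blocker precedes what it blocks."""
    return list(TopologicalSorter(blocked_by).static_order())


if __name__ == "__main__":
    for issue in work_order(BLOCKED_BY):
        print(issue)
```

`static_order()` raises `graphlib.CycleError` if the mapping contains a cycle, so the same call doubles as a consistency check on the dependency data.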

Dependencies

Upstream (what M01 needs)

M00: Tech Prep — testing infrastructure, dev environment validation, corpus v1→v2 migration, structured logging

Downstream (what depends on M01)

M02: Content API — depends on M01 FalkorDB client + Stage 3 (scripture text ingested data)
M05: Search & Staging — Typesense sync extends the ingest pipeline built here
M06: Interlinear & Lexicon — depends on word alignment + lexicon data ingested here
M08: Knowledge Graph — depends on cross-ref + topical guide edges ingested here

Summary

Metric	Count
Total Issues	22
Sub-Issues	3
Total Estimate (pts)	102
Sprints	S1–S3
Dependencies (blocking)	61
Dependencies (blocked by)	56
