Ingest Service

The Ingest service is a Python CLI tool that reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected knowledge graph to FalkorDB. It is the sole authoritative writer to the graph database.

Quick Reference

Property     Value
Language     Python 3.12
Framework    Click CLI
Package      gospelib_ingest
Entry point  gospelib-ingest CLI command
Target DB    FalkorDB (port 6379)
Type         CLI tool (no HTTP port)
Deployment   K8s Job (full rebuild) / CronJob (incremental)

Responsibilities

  • Validate — Every JSON source file is validated with Pydantic before any database write
  • Transform — Convert corpus JSON into FalkorDB graph nodes and edges
  • Write — Produce a query-ready FalkorDB graph (idempotent via MERGE)
  • Connect — Generate all edges derivable from source data (cross-references, topics, lexicon links)
  • Report — Emit a structured run report at completion

Running Locally

cd services/ingest
uv sync

# Full ingest
uv run gospelib-ingest run

# Dry run (validate without writing)
uv run gospelib-ingest run --dry-run

# Show help
uv run gospelib-ingest --help

The service expects FalkorDB to be running on port 6379:

pnpm infra:up

14-Stage Pipeline

The ingest pipeline executes stages sequentially in dependency order:

Stage  Pipeline            Description
0      Schema              Create graph indices and constraints
1      Validation          Pass-through (reserved for future use)
2      Lexicon             Hebrew/Greek lexicon entries → Word nodes + DERIVES_FROM/RELATED_TO
3      Scripture Text      Canonical text + interlinear → Passage, Witness, WordAlignment nodes
3.5    Cross-References    Standalone cross-reference files → CROSS_REF edges
4      Reference Data      TG, BD, Index → TGEntry, BDEntry, IndexTopic nodes + CITES edges
4.5    Proper Names        ProperName nodes and MENTIONS edges
4.6    Versification       VersificationScheme nodes and MAPS_TO edges
4.7    Theographic         Event, PeopleGroup nodes and relationship edges
5      Commentary          Clarke, BYU commentaries → Commentary nodes + ANNOTATES edges
6      Pending Resolution  Promote PendingPassage nodes to resolved CROSS_REF edges
7      Density             Materialize xrefCount/commentaryCount/entityMentionCount on verses

Additional self-registering stages for church content (talks, curriculum, books, periodicals, hymns, proclamations, publications, scholarly works) and typed entity projection (Person, Place, Event, PeopleGroup) are loaded via the stage registry at runtime.

Why Sequential?

FalkorDB runs as a Redis module, and write queries against a single graph are serialized at the server, so concurrent writes don't improve throughput. The pipeline uses:

  • ThreadPoolExecutor(max_workers=4) for parallel file I/O and Pydantic validation
  • Sequential database writes within each stage for MERGE consistency
  • UNWIND batch writes to minimize network round-trips
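
The split between parallel reading and sequential writing could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `load_json`, `load_all`, `write_sequentially`, and the `write_batch` callable are hypothetical names, and Pydantic validation is left out so the example needs only the standard library. Only `ThreadPoolExecutor(max_workers=4)` is taken from the list above.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def load_json(path: Path) -> dict:
    """Read and parse one corpus file (model validation would also happen here)."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def load_all(paths: list[Path], max_workers: int = 4) -> list[dict]:
    """Parse files in parallel; Executor.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_json, paths))


def write_sequentially(
    records: list[dict],
    batch_size: int,
    write_batch: Callable[[list[dict]], None],
) -> int:
    """Send batches one at a time from the calling thread; return the batch count."""
    batches = 0
    for start in range(0, len(records), batch_size):
        write_batch(records[start:start + batch_size])
        batches += 1
    return batches


def demo() -> tuple[list[str], list[int]]:
    """Create five tiny files, load them in parallel, then 'write' in batches of 2."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            path = Path(tmp) / f"verse-{i}.json"
            path.write_text(json.dumps({"id": f"v{i}"}), encoding="utf-8")
            paths.append(path)
        records = load_all(paths)
    sizes: list[int] = []
    write_sequentially(records, batch_size=2, write_batch=lambda b: sizes.append(len(b)))
    return [r["id"] for r in records], sizes
```

Keeping the write loop on a single thread means the order of batches is deterministic, which makes a failed run easier to reason about.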

Architecture

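# services/ingest/src/gospelib_ingest/pipelines/base.py
from __future__ import annotations  # StageReport import is not shown in this excerpt

from abc import ABC, abstractmethod
from pathlib import Path

class BasePipeline(ABC):
    @abstractmethod
    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> StageReport:
        ...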

Each stage is a concrete implementation of BasePipeline. The orchestrator runs all stages in order, collecting reports. Additional stages self-register via the stage registry decorator.
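
To make the relationship between stages, the registry, and the orchestrator concrete, here is one way it might fit together. This is a sketch under assumptions: `StageReport`'s fields, the `register_stage` decorator, the `STAGE_REGISTRY` dict, and `run_all` are all hypothetical names invented for illustration. Only `BasePipeline` and its `run` signature come from the excerpt above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StageReport:
    """Hypothetical report shape; the real fields are not documented here."""
    stage: str
    nodes_written: int = 0
    edges_written: int = 0


class BasePipeline(ABC):
    @abstractmethod
    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> StageReport:
        ...


# Hypothetical registry: maps a sortable stage order to a pipeline class.
STAGE_REGISTRY: dict[float, type[BasePipeline]] = {}


def register_stage(order: float):
    """Class decorator that records a pipeline class under its stage order."""
    def decorator(cls: type[BasePipeline]) -> type[BasePipeline]:
        STAGE_REGISTRY[order] = cls
        return cls
    return decorator


@register_stage(2)
class LexiconPipeline(BasePipeline):
    def run(self, graph_client, data_dir, dry_run=False):
        # A real stage would read files under data_dir and write via graph_client.
        return StageReport(stage="lexicon")


@register_stage(0)
class SchemaPipeline(BasePipeline):
    def run(self, graph_client, data_dir, dry_run=False):
        return StageReport(stage="schema")


def run_all(graph_client, data_dir: Path, dry_run: bool = False) -> list[StageReport]:
    """Instantiate and run every registered stage in ascending stage order."""
    return [
        STAGE_REGISTRY[order]().run(graph_client, data_dir, dry_run)
        for order in sorted(STAGE_REGISTRY)
    ]
```

Registration order does not matter in this sketch: stages run sorted by their numeric key, so a fractional stage such as 3.5 would slot in between 3 and 4.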

Key Design Decisions

  • Idempotent via MERGE — All graph writes use Cypher MERGE (keyed on id property), not CREATE. Re-running ingest never creates duplicates.
  • Pydantic validation first — Every source file is fully validated before any write occurs. If validation fails, no partial writes happen.
  • UNWIND batch writes — Nodes and edges are batched into UNWIND queries for efficient bulk writing.
  • Pending reference resolution — Cross-reference targets that don't exist during early stages are stored as PendingPassage nodes and resolved in Stage 6 (Pending Resolution).

Environment Variables

Variable                      Default                 Description
GOSPELIB_INGEST_FALKORDB_URL  redis://localhost:6379  FalkorDB connection URL
GOSPELIB_INGEST_GRAPH_NAME    gospelib                FalkorDB graph name
GOSPELIB_INGEST_DATA_DIR      ./data                  Path to corpus data directory
GOSPELIB_INGEST_BATCH_SIZE    500                     UNWIND batch size
GOSPELIB_INGEST_LOG_LEVEL     INFO                    Logging level
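
For readers who want to see how these variables and defaults map onto a typed configuration object, here is a standard-library sketch. The `IngestSettings` class and `load_settings` function are hypothetical. The service may well load its configuration differently, and only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

PREFIX = "GOSPELIB_INGEST_"


@dataclass(frozen=True)
class IngestSettings:
    """Defaults mirror the documented environment variable defaults."""
    falkordb_url: str = "redis://localhost:6379"
    graph_name: str = "gospelib"
    data_dir: Path = Path("./data")
    batch_size: int = 500
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> IngestSettings:
    """Build settings from `env`, falling back to the defaults for unset variables."""
    defaults = IngestSettings()
    return IngestSettings(
        falkordb_url=env.get(PREFIX + "FALKORDB_URL", defaults.falkordb_url),
        graph_name=env.get(PREFIX + "GRAPH_NAME", defaults.graph_name),
        data_dir=Path(env.get(PREFIX + "DATA_DIR", str(defaults.data_dir))),
        batch_size=int(env.get(PREFIX + "BATCH_SIZE", defaults.batch_size)),
        log_level=env.get(PREFIX + "LOG_LEVEL", defaults.log_level),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"GOSPELIB_INGEST_BATCH_SIZE": "1000"})`.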

CLI Commands

# Full pipeline run
gospelib-ingest run

# Dry run — validate only, no writes
gospelib-ingest run --dry-run

# Run a specific stage
gospelib-ingest run --stage lexicon

# Custom data directory
gospelib-ingest run --data-dir /path/to/corpus

# Verbose output
gospelib-ingest run --verbose

Docker

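FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml ./
# Install dependencies first so this layer is cached across source changes
RUN uv sync --no-dev --no-install-project
COPY src/ ./src/
# Install the project itself (and its console script) into /app/.venv
RUN uv sync --no-dev

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app /app
# uv installs into /app/.venv, so put its bin directory on PATH
ENV PATH="/app/.venv/bin:$PATH"
ENTRYPOINT ["gospelib-ingest"]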

Deep Dive

For detailed pipeline internals, Cypher patterns, and troubleshooting: