Running the Ingest Pipeline

The ingest pipeline reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected property graph into FalkorDB. It is the sole authoritative writer to the database.

Prerequisites

  • FalkorDB running locally (port 6379) — start it with pnpm infra:up
  • Docker (for the recommended method) or Python 3.12+ with uv installed

Quick Start

# Recommended — runs in Docker, logs flow to Grafana
pnpm ingest

# Alternative — runs locally without observability
cd services/ingest && uv sync && uv run gospelib-ingest run

Running via pnpm ingest uses Docker Compose, which means:

  • Logs are tailed by Alloy and appear in the Ingest Pipeline dashboard in Grafana (http://localhost:3000)
  • FalkorDB connectivity uses the Docker network (no host port mapping issues)
  • The observability stack must already be running (start it with pnpm infra:observability)

CLI Options

Option          Default                         Description
--data-dir      ../../data                      Root directory for source JSON files
--registry      ../../data/book_registry.json   Path to the book registry
--db-host       localhost                       FalkorDB host (env: FALKORDB_HOST)
--db-port       6379                            FalkorDB port (env: FALKORDB_PORT)
--db-password   (none)                          FalkorDB password (env: FALKORDB_PASSWORD)
--graph         gospelib                        FalkorDB graph name (env: FALKORDB_GRAPH)
--only          (all)                           Ingest only one schema type
--reset         false                           Drop and recreate the graph before ingest (destructive)
--dry-run       false                           Parse and validate all files; do not write to DB
--report        ./ingest-report.json            Path for the JSON run report
--log-level     INFO                            Logging verbosity: DEBUG, INFO, WARNING, ERROR
--node-batch    500                             Node UNWIND batch size
--edge-batch    200                             Edge UNWIND batch size
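
Several connection options in the table fall back to environment variables. The following is a minimal Python sketch of that fallback. It assumes an explicit flag wins over the environment variable, which in turn wins over the default; the table names the variables but does not state the precedence, so treat the order as an assumption. The resolve_db_settings helper is hypothetical and not part of the CLI.

```python
import os
from typing import Mapping, Optional

# Environment variable names and defaults as listed in the options table.
DB_OPTIONS = {
    "host": ("FALKORDB_HOST", "localhost"),
    "port": ("FALKORDB_PORT", "6379"),
    "password": ("FALKORDB_PASSWORD", None),
    "graph": ("FALKORDB_GRAPH", "gospelib"),
}


def resolve_db_settings(
    flags: Optional[Mapping[str, str]] = None,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Hypothetical helper: flag value, else environment variable, else default."""
    flags = flags or {}
    settings = {}
    for name, (env_var, default) in DB_OPTIONS.items():
        # Assumed precedence: flag > environment variable > default.
        settings[name] = flags.get(name) or env.get(env_var) or default
    # The port arrives as text from flags or the environment; store it as an int.
    settings["port"] = int(settings["port"])
    return settings


if __name__ == "__main__":
    print(resolve_db_settings(env={"FALKORDB_HOST": "falkordb"}))
```

Passing env explicitly, as in the usage line, keeps the sketch easy to try without touching the real environment.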

Common Invocations

Ingest only lexicon files

uv run gospelib-ingest run --only lexicon

Validate all files without writing

uv run gospelib-ingest run --dry-run --log-level DEBUG

Reset the graph and re-ingest everything

Warning: This destroys all existing data in the graph. Only use in development.

uv run gospelib-ingest run --reset

Run with verbose logging

uv run gospelib-ingest run --log-level DEBUG

What Happens During Ingest

The pipeline runs through seven ordered stages:

Stage  Name                 Produces
0      Index creation       FalkorDB indices (no data)
1      Book registry load   In-memory book lookup table
2      Lexicon              :Word nodes, DERIVES_FROM / RELATED_TO edges
3      Scripture Text       :Passage, :Witness, :WordAlignment nodes + edges
4      Reference nodes      :TGEntry, :BDEntry, :Person, :Place, :IndexTopic nodes
5      Commentary           :VerseNote, :Commentary, :Section nodes
6      Pending resolution   Promotes :PendingPassage stubs to :Passage where possible

See the Ingest Internals guide for details on each stage.
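
The last stage, pending resolution, can be pictured with a small in-memory model. The sketch below is an illustration only and not the pipeline's code. It assumes a :PendingPassage stub can be promoted exactly when a passage with the same ID exists after the earlier stages; the resolve_pending name and the ID format are hypothetical.

```python
def resolve_pending(
    known_passages: set[str], pending: list[str]
) -> tuple[list[str], list[str]]:
    """Split pending stub IDs into promotable and still-unresolved ones.

    Assumption: a stub is promotable when a passage with the same ID exists.
    """
    promoted = [pid for pid in pending if pid in known_passages]
    unresolved = [pid for pid in pending if pid not in known_passages]
    return promoted, unresolved


if __name__ == "__main__":
    known = {"gen.1.1", "gen.1.2"}
    stubs = ["gen.1.2", "gen.99.1"]
    promoted, unresolved = resolve_pending(known, stubs)
    print("promoted:", promoted)
    print("unresolved:", unresolved)
```

In this model, whatever remains in the unresolved list is what a run would report as an unresolvable pending reference.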

Run Report

After every run, the pipeline writes a JSON report to the path given by --report (default: ./ingest-report.json). The report includes:

  • Total nodes and edges created
  • Per-stage counts and timings
  • Validation errors encountered
  • Unresolvable pending references
  • Whether the run was a reset or dry run

The report is also printed as a summary table to the console.
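
For scripted checks, the report can be read with a few lines of Python. The report's schema is not shown in this guide, so the key names in the sketch below (validation_errors, unresolved_pending) are hypothetical placeholders and need adjusting to match the actual file.

```python
import json
from pathlib import Path


def count_issues(report: dict) -> dict:
    """Count problems recorded in a run report.

    Both key names are hypothetical placeholders; the real schema may differ.
    """
    return {
        "validation_errors": len(report.get("validation_errors", [])),
        "unresolved_pending": len(report.get("unresolved_pending", [])),
    }


def load_report(path: str = "./ingest-report.json") -> dict:
    """Read the JSON report from the default --report location."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Inline sample so the sketch runs without a real report file.
    sample = {"validation_errors": [], "unresolved_pending": ["gen.99.1"]}
    print(count_issues(sample))
```

Against a real run, count_issues(load_report()) would be the usual call.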

Running in Kubernetes

Full ingest (staging only)

kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging

Incremental ingest

kubectl create job ingest-manual-$(date +%s) \
--from=cronjob/ingest-incremental \
-n gospelib-staging

Monitor the job

kubectl logs -f job/ingest-manual-XXXXXXX -n gospelib-staging

Verify It Worked

After a successful ingest:

  1. Check the run report for zero errors

  2. Query FalkorDB directly:

    redis-cli -p 6379
    > GRAPH.QUERY gospelib "MATCH (n) RETURN labels(n)[0] AS label, count(n) ORDER BY count(n) DESC"

  3. Hit the content service API:

    curl http://localhost:8100/api/v1/passages/gen.1.1

Troubleshooting

Cannot connect to FalkorDB

Ensure the infrastructure is running:

pnpm infra:up
docker ps | grep falkordb

Validation errors

Run with --dry-run --log-level DEBUG to see which files fail Pydantic validation. Fix the source JSON before re-running.
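
Before a full dry run, a quick well-formedness pass can catch files that are not valid JSON at all. The sketch below is only a rough pre-check written with the standard library. It does not replace the pipeline's Pydantic validation, and the find_malformed_json helper is hypothetical.

```python
import json
from pathlib import Path


def find_malformed_json(data_dir: str) -> list[tuple[str, str]]:
    """Return (path, error message) pairs for files that fail to parse as JSON."""
    failures = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Record the failure and keep scanning the remaining files.
            failures.append((str(path), str(exc)))
    return failures
```

Calling find_malformed_json("../../data") from services/ingest would scan the default --data-dir. A file that passes this check can still fail schema validation.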

Ingest is slow

  • Increase --node-batch and --edge-batch for fewer round-trips (at the cost of higher memory per batch)
  • The full corpus should ingest in under 10 minutes on commodity hardware
  • FalkorDB is single-threaded — parallel writes do not help
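
The batch-size trade-off can be made concrete with a little arithmetic. The sketch below assumes rows are sent in fixed-size chunks with one UNWIND query per chunk. The batched helper, the id property, and the Cypher text are illustrative and not taken from the pipeline.

```python
import math
from typing import Iterator, Sequence


def batched(rows: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def round_trips(total_rows: int, size: int) -> int:
    """Number of queries needed to send `total_rows` rows in batches of `size`."""
    return math.ceil(total_rows / size)


# Illustrative Cypher only; the pipeline's real queries are not shown in this guide.
EXAMPLE_QUERY = "UNWIND $rows AS row MERGE (w:Word {id: row.id})"

if __name__ == "__main__":
    rows = [{"id": f"w{i}"} for i in range(1200)]
    print(len(list(batched(rows, 500))))  # 3 batches at the default --node-batch
    print(round_trips(1200, 1000))        # 2 round-trips with a larger batch
```

Doubling the batch size roughly halves the number of round-trips, while each query carries a proportionally larger parameter payload.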