Running the Ingest Pipeline

The ingest pipeline reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected property graph into FalkorDB. It is the sole authoritative writer to the database.

Prerequisites

FalkorDB running locally (port 6379) — start it with pnpm infra:up
Docker (for the recommended method) or Python 3.12+ with uv installed

Quick Start

# Recommended — runs in Docker, logs flow to Grafana
pnpm ingest

# Alternative — runs locally without observability
cd services/ingest && uv sync && uv run gospelib-ingest run

Running via pnpm ingest uses Docker Compose, which means:

Logs are tailed by Alloy and appear in the Ingest Pipeline dashboard in Grafana (http://localhost:3000)
FalkorDB connectivity uses the Docker network (no host port mapping issues)
The observability stack must be running first (pnpm infra:observability)

CLI Options

Option	Default	Description
`--data-dir`	`../../data`	Root directory for source JSON files
`--registry`	`../../data/book_registry.json`	Path to the book registry
`--db-host`	`localhost`	FalkorDB host (env: `FALKORDB_HOST`)
`--db-port`	`6379`	FalkorDB port (env: `FALKORDB_PORT`)
`--db-password`	(none)	FalkorDB password (env: `FALKORDB_PASSWORD`)
`--graph`	`gospelib`	FalkorDB graph name (env: `FALKORDB_GRAPH`)
`--only`	(all)	Ingest only one schema type
`--reset`	`false`	Drop and recreate the graph before ingest (destructive)
`--dry-run`	`false`	Parse and validate all files; do not write to DB
`--report`	`./ingest-report.json`	Path for the JSON run report
`--log-level`	`INFO`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`
`--node-batch`	`500`	Node UNWIND batch size
`--edge-batch`	`200`	Edge UNWIND batch size
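
Several connection options fall back to an environment variable when the flag is omitted. The sketch below illustrates one plausible precedence (explicit flag, then environment variable, then the documented default) using argparse. It is not the pipeline's actual CLI code: `build_parser` is a hypothetical name, and the resolution order is an assumption.

```python
import argparse
import os


def build_parser(env=None):
    """Hypothetical parser mirroring four of the documented options.

    Assumed precedence: explicit flag > environment variable > default.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="gospelib-ingest run")
    parser.add_argument("--db-host", default=env.get("FALKORDB_HOST", "localhost"))
    parser.add_argument("--db-port", type=int,
                        default=int(env.get("FALKORDB_PORT", "6379")))
    parser.add_argument("--db-password", default=env.get("FALKORDB_PASSWORD"))
    parser.add_argument("--graph", default=env.get("FALKORDB_GRAPH", "gospelib"))
    return parser


# The environment supplies the host, the flag overrides the port,
# and the graph name falls through to its default.
parser = build_parser({"FALKORDB_HOST": "falkordb"})
args = parser.parse_args(["--db-port", "6380"])
print(args.db_host, args.db_port, args.graph)  # falkordb 6380 gospelib
```

Passing the environment as a plain dict keeps the example deterministic; calling `build_parser()` with no argument reads the real `os.environ`.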

Common Invocations

Ingest only lexicon files

uv run gospelib-ingest run --only lexicon

Validate all files without writing

uv run gospelib-ingest run --dry-run --log-level DEBUG

Reset the graph and re-ingest everything

Warning: This destroys all existing data in the graph. Only use in development.

uv run gospelib-ingest run --reset

Run with verbose logging

uv run gospelib-ingest run --log-level DEBUG

What Happens During Ingest

The pipeline runs through seven ordered stages:

Stage	Name	Produces
0	Index creation	FalkorDB indices (no data)
1	Book registry load	In-memory book lookup table
2	Lexicon	`:Word` nodes, `DERIVES_FROM` / `RELATED_TO` edges
3	Scripture Text	`:Passage`, `:Witness`, `:WordAlignment` nodes + edges
4	Reference nodes	`:TGEntry`, `:BDEntry`, `:Person`, `:Place`, `:IndexTopic` nodes
5	Commentary	`:VerseNote`, `:Commentary`, `:Section` nodes
6	Pending resolution	Promotes `:PendingPassage` stubs to `:Passage` where possible

See the Ingest Internals guide for details on each stage.

Run Report

After every run, the pipeline writes a JSON report to the path given by --report (default: ./ingest-report.json). The report includes:

Total nodes and edges created
Per-stage counts and timings
Validation errors encountered
Unresolvable pending references
Whether the run was a reset or dry run

The report is also printed as a summary table to the console.
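
Because the report is plain JSON, a script or CI step can check it after a run. The following is a minimal sketch, not part of the pipeline. The report's field names are not documented here, so `validation_errors` is a hypothetical key and the assumption that it holds a list is also a guess; inspect a real report and pass the key your version uses.

```python
import json
from pathlib import Path


def report_is_clean(path, error_key="validation_errors"):
    """Return True when the report lists no validation errors.

    NOTE: `error_key` is a hypothetical field name, assumed to hold
    a list. Adjust it to match an actual ingest report.
    """
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    if error_key not in report:
        # Surface the real keys so the caller can pick the right one.
        raise KeyError(f"{error_key!r} not found; top-level keys: {sorted(report)}")
    return len(report[error_key]) == 0
```

A call such as `report_is_clean("./ingest-report.json")` could then gate a deployment step or a follow-up job.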

Running in Kubernetes

Full ingest (staging only)

kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging

Incremental ingest

kubectl create job ingest-manual-$(date +%s) \
  --from=cronjob/ingest-incremental \
  -n gospelib-staging

Monitor the job

kubectl logs -f job/ingest-manual-XXXXXXX -n gospelib-staging

Verify It Worked

After a successful ingest:

Check the run report for zero errors

Query FalkorDB directly:

redis-cli -p 6379
> GRAPH.QUERY gospelib "MATCH (n) RETURN labels(n)[0] AS label, count(n) ORDER BY count(n) DESC"

Hit the content service API:

curl http://localhost:8100/api/v1/passages/gen.1.1

Troubleshooting

Cannot connect to FalkorDB

Ensure the infrastructure is running:

pnpm infra:up
docker ps | grep falkordb
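
If the container is listed but the pipeline still cannot connect, a quick TCP probe can tell a port-mapping problem apart from a stopped service. This is a minimal sketch using only the standard library. It checks that something accepts connections on the port, not that it is FalkorDB; `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host="localhost", port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


print("reachable" if can_connect() else "not reachable")
```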

Validation errors

Run with --dry-run --log-level DEBUG to see which files fail Pydantic validation. Fix the source JSON before re-running.
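
Before a full dry run, it can help to rule out files that are not even well-formed JSON. The sketch below does only that, with the standard library; it does not replicate the pipeline's Pydantic schema validation. `find_malformed_json` is a hypothetical helper, and it assumes the source files use a `.json` extension under the data directory.

```python
import json
from pathlib import Path


def find_malformed_json(data_dir):
    """Return (path, message) pairs for files that fail to parse as JSON.

    This is a syntax check only; schema errors still need --dry-run.
    """
    failures = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((path, str(exc)))
    return failures
```

For example, `find_malformed_json("../../data")` would scan the default data directory relative to services/ingest.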

Ingest is slow

Increase --node-batch and --edge-batch for fewer round-trips (at the cost of higher memory per batch)
The full corpus should ingest in under 10 minutes on commodity hardware
FalkorDB is single-threaded — parallel writes do not help
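
The trade-off behind the batch-size options can be sketched as follows: each batch becomes one UNWIND query, so larger batches mean fewer round-trips but a larger payload per query. This is an illustration under assumptions, not the pipeline's implementation. The Cypher text, the `Example` label, and the helper names are placeholders.

```python
from math import ceil


def batches(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def round_trips(total, size):
    """Number of batched queries needed for `total` rows."""
    return ceil(total / size)


# Illustrative Cypher: one UNWIND query per batch. The label and
# property are placeholders, not the pipeline's real schema.
UNWIND_NODES = "UNWIND $rows AS row CREATE (n:Example {id: row.id})"

rows = [{"id": i} for i in range(1200)]
for batch in batches(rows, 500):
    # A real client would send (UNWIND_NODES, {"rows": batch}) here.
    print(len(batch))          # 500, 500, 200
print(round_trips(1200, 500))  # 3
```

With these illustrative numbers, doubling the batch size from 500 to 1000 cuts the query count from 3 to 2, while each query carries up to twice as many rows.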
