Running the Ingest Pipeline
The ingest pipeline reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected property graph into FalkorDB. It is the sole authoritative writer to the database.
Prerequisites
- FalkorDB running locally (port 6379); start it with pnpm infra:up
- Docker (for the recommended method) or Python 3.12+ with uv installed
Quick Start
# Recommended — runs in Docker, logs flow to Grafana
pnpm ingest
# Alternative — runs locally without observability
cd services/ingest && uv sync && uv run gospelib-ingest run
Running via pnpm ingest uses Docker Compose, which means:
- Logs are tailed by Alloy and appear in the Ingest Pipeline dashboard in Grafana (http://localhost:3000)
- FalkorDB connectivity uses the Docker network (no host port mapping issues)
- The observability stack must be running first (pnpm infra:observability)
CLI Options
| Option | Default | Description |
|---|---|---|
| --data-dir | ../../data | Root directory for source JSON files |
| --registry | ../../data/book_registry.json | Path to the book registry |
| --db-host | localhost | FalkorDB host (env: FALKORDB_HOST) |
| --db-port | 6379 | FalkorDB port (env: FALKORDB_PORT) |
| --db-password | (none) | FalkorDB password (env: FALKORDB_PASSWORD) |
| --graph | gospelib | FalkorDB graph name (env: FALKORDB_GRAPH) |
| --only | (all) | Ingest only one schema type |
| --reset | false | Drop and recreate the graph before ingest (destructive) |
| --dry-run | false | Parse and validate all files; do not write to DB |
| --report | ./ingest-report.json | Path for the JSON run report |
| --log-level | INFO | Logging verbosity: DEBUG, INFO, WARNING, ERROR |
| --node-batch | 500 | Node UNWIND batch size |
| --edge-batch | 200 | Edge UNWIND batch size |
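Several options list an environment variable next to their default. The sketch below shows one common way such a fallback can be wired up with argparse. It is an illustration only: the real CLI may use a different framework, and the precedence shown here (command-line flag, then environment variable, then built-in default) is an assumption rather than documented behavior. The function name build_parser and the program name are hypothetical.

```python
import argparse
import os


def build_parser(env=os.environ):
    """Hypothetical parser mirroring a few options from the table above."""
    parser = argparse.ArgumentParser(prog="gospelib-ingest-sketch")
    # Each default is read from the environment first, then falls back
    # to the documented default value.
    parser.add_argument("--db-host", default=env.get("FALKORDB_HOST", "localhost"))
    parser.add_argument("--db-port", type=int,
                        default=int(env.get("FALKORDB_PORT", "6379")))
    parser.add_argument("--graph", default=env.get("FALKORDB_GRAPH", "gospelib"))
    parser.add_argument("--dry-run", action="store_true")
    return parser


# Assumed precedence: an explicit flag wins over the environment,
# which wins over the built-in default.
args = build_parser({"FALKORDB_HOST": "falkordb"}).parse_args(["--db-port", "6380"])
print(args.db_host, args.db_port, args.graph, args.dry_run)
# falkordb 6380 gospelib False
```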
Common Invocations
Ingest only lexicon files
uv run gospelib-ingest run --only lexicon
Validate all files without writing
uv run gospelib-ingest run --dry-run --log-level DEBUG
Reset the graph and re-ingest everything
This destroys all existing data in the graph. Only use in development.
uv run gospelib-ingest run --reset
Run with verbose logging
uv run gospelib-ingest run --log-level DEBUG
What Happens During Ingest
The pipeline runs through seven ordered stages:
| Stage | Name | Produces |
|---|---|---|
| 0 | Index creation | FalkorDB indices (no data) |
| 1 | Book registry load | In-memory book lookup table |
| 2 | Lexicon | :Word nodes, DERIVES_FROM / RELATED_TO edges |
| 3 | Scripture Text | :Passage, :Witness, :WordAlignment nodes + edges |
| 4 | Reference nodes | :TGEntry, :BDEntry, :Person, :Place, :IndexTopic nodes |
| 5 | Commentary | :VerseNote, :Commentary, :Section nodes |
| 6 | Pending resolution | Promotes :PendingPassage stubs to :Passage where possible |
See the Ingest Internals guide for details on each stage.
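As a rough mental model for stage 6, the sketch below splits pending references into those that can be promoted and those that cannot. It assumes that "where possible" means a passage with the same identifier exists once the earlier stages have finished. That is one plausible reading, not a description of the actual implementation. The function name resolve_pending and the identifier gen.99.1 are hypothetical.

```python
def resolve_pending(known_passages, pending_refs):
    """Split pending references into promotable and unresolvable ones.

    Assumption: a stub is promotable when its identifier matches a
    passage that the earlier stages already created.
    """
    resolved = [ref for ref in pending_refs if ref in known_passages]
    unresolved = [ref for ref in pending_refs if ref not in known_passages]
    return resolved, unresolved


known = {"gen.1.1", "gen.1.2"}
pending = ["gen.1.2", "gen.99.1"]  # gen.99.1 is a made-up, unresolvable reference
resolved, unresolved = resolve_pending(known, pending)
print(resolved)    # ['gen.1.2']
print(unresolved)  # ['gen.99.1']
```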
Run Report
After every run, the pipeline writes a JSON report to the path given by --report (default: ./ingest-report.json). The report includes:
- Total nodes and edges created
- Per-stage counts and timings
- Validation errors encountered
- Unresolvable pending references
- Whether the run was a reset or dry run
The report is also printed as a summary table to the console.
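Because the report is plain JSON, a script or CI step can check it after a run. The sketch below is a minimal example of that idea. The report schema is not documented on this page, so the keys validation_errors and unresolved_references are hypothetical placeholders; substitute the keys your report actually contains.

```python
import json
from pathlib import Path


def load_report(path="./ingest-report.json") -> dict:
    """Read the run report from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def report_is_clean(report: dict) -> bool:
    """Return True when neither (hypothetical) problem list has entries."""
    # Both key names are assumptions, not the documented schema.
    return not report.get("validation_errors") and not report.get("unresolved_references")


# In-memory sample so the sketch runs without a real report file.
sample = {"validation_errors": [], "unresolved_references": ["gen.99.1"]}
print(report_is_clean(sample))  # False
```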
Running in Kubernetes
Full ingest (staging only)
kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging
Incremental ingest
kubectl create job ingest-manual-$(date +%s) \
--from=cronjob/ingest-incremental \
-n gospelib-staging
Monitor the job
kubectl logs -f job/ingest-manual-XXXXXXX -n gospelib-staging
Verify It Worked
After a successful ingest:
- Check the run report for zero errors
- Query FalkorDB directly:
  redis-cli -p 6379
  > GRAPH.QUERY gospelib "MATCH (n) RETURN labels(n)[0] AS label, count(n) ORDER BY count(n) DESC"
- Hit the content service API:
  curl http://localhost:8100/api/v1/passages/gen.1.1
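The label-count check can also be scripted. The sketch below assumes the third-party falkordb Python client is installed (pip install falkordb) and that FalkorDB is reachable on localhost:6379. The helper names and the choice of Word and Passage as expected labels are illustrative, taken from the stage table rather than from any official check.

```python
LABEL_COUNT_QUERY = (
    "MATCH (n) RETURN labels(n)[0] AS label, count(n) "
    "ORDER BY count(n) DESC"
)


def fetch_label_counts(host="localhost", port=6379, graph_name="gospelib"):
    """Run the label-count query and return (label, count) pairs."""
    # Imported lazily so the pure helper below works without the client.
    from falkordb import FalkorDB

    graph = FalkorDB(host=host, port=port).select_graph(graph_name)
    result = graph.query(LABEL_COUNT_QUERY)
    return [(row[0], row[1]) for row in result.result_set]


def missing_labels(counts, expected=("Word", "Passage")):
    """Return expected labels that have no nodes in the graph."""
    present = {label for label, count in counts if count > 0}
    return [label for label in expected if label not in present]


# Example with canned rows; use fetch_label_counts() against a live database.
print(missing_labels([("Passage", 10), ("Word", 0)]))  # ['Word']
```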
Troubleshooting
Cannot connect to FalkorDB
Ensure the infrastructure is running:
pnpm infra:up
docker ps | grep falkordb
Validation errors
Run with --dry-run --log-level DEBUG to see which files fail Pydantic validation. Fix the source JSON before re-running.
Ingest is slow
- Increase --node-batch and --edge-batch for fewer round-trips (at the cost of higher memory per batch)
- The full corpus should ingest in under 10 minutes on commodity hardware
- FalkorDB is single-threaded — parallel writes do not help
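To see why larger batches reduce round-trips, the sketch below chunks a list of rows and counts the resulting batches. It assumes each batch is sent as one UNWIND query, which is consistent with the option descriptions but not confirmed here. The helper names chunked and round_trips, and the node counts used, are illustrative.

```python
import math
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def round_trips(total, batch_size):
    """Number of batches needed, assuming one query per batch."""
    return math.ceil(total / batch_size)


nodes = [{"id": i} for i in range(1200)]
print([len(batch) for batch in chunked(nodes, 500)])  # [500, 500, 200]

# Quadrupling the batch size cuts the number of queries by a factor of four.
print(round_trips(100_000, 500), round_trips(100_000, 2000))  # 200 50
```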