Observability
GospeLib's observability stack is built on Grafana's open-source ecosystem. The same components run locally and in production, so local development is intended to expose the same signals as grafana.gospelib.com.
Signals
| Signal | Backend | Collector | Retention (local) | Retention (prod) |
|---|---|---|---|---|
| Logs | Loki | Alloy | 7 days | 90 days (S3) |
| Metrics | Prometheus | Alloy | 7 days | 15 days |
| Traces | Tempo | Alloy (OTLP) | local disk | local disk |
| Profiles | Pyroscope | Alloy (Go) / SDK (Python) | local disk | local disk |
| Frontend | Loki + Tempo | Alloy (Faro receiver) | 7 days | 90 days |
| Errors | Sentry | SDK (per-service) | — | per plan |
Architecture
                            ┌────────────┐
                            │  Browser   │
                            │ (Faro SDK) │
                            └─────┬──────┘
                                  │ POST /collect :12347
┌─────────────────────────────────┼───────────────────────────────┐
│  Docker Compose                 │                               │
│                                 ▼                               │
│  ┌──────────┐    scrape /metrics      ┌────────────┐            │
│  │  Alloy   │◄────────────────────────│  Services  │            │
│  │          │    scrape /debug/pprof  │    (Go)    │            │
│  │          │◄────────────────────────│            │            │
│  │          │    docker log tailing   │            │            │
│  │          │◄────────────────────────│            │            │
│  │          │    OTLP gRPC :4317      │            │            │
│  │          │◄────────────────────────│            │            │
│  └────┬─────┘                         └────────────┘            │
│       │                                                         │
│       │ push                          ┌────────────┐            │
│       ├──────────────────────────────►│    Loki    │ :3100      │
│       ├──────────────────────────────►│ Prometheus │ :9090      │
│       ├──────────────────────────────►│   Tempo    │ (internal) │
│       └──────────────────────────────►│ Pyroscope  │ :4040      │
│                                       └─────┬──────┘            │
│                                             │                   │
│  ┌──────────┐    query datasources    ┌─────▼──────┐            │
│  │ Grafana  │◄───────────────────────►│  Backends  │            │
│  │  :3000   │                         └────────────┘            │
│  └──────────┘                                                   │
│                                       ┌────────────┐            │
│                                       │  Services  │            │
│                                       │  (Python)  │            │
│                                       │  push ─────┼─► Pyroscope│
│                                       └────────────┘            │
└─────────────────────────────────────────────────────────────────┘
Alloy is the single collection agent. It tails Docker logs, scrapes Prometheus metrics and Go pprof endpoints, receives OTLP traces, receives browser telemetry via the built-in Faro receiver, and pushes everything to the local backends. Python services push profiles directly to Pyroscope via the SDK (Python's GIL makes pull-based CPU profiling impractical).
Local Setup
Start the stack
pnpm infra:observability
This starts Loki, Prometheus, Tempo, Pyroscope, Alloy, Grafana, and the FalkorDB Browser — all behind the observability Docker Compose profile. No environment variables are required; all backends default to the local containers.
Access
| Tool | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / localdev |
| Prometheus | http://localhost:9090 | — |
| Loki | http://localhost:3100 | — |
| Pyroscope | http://localhost:4040 | — |
| Alloy UI | http://localhost:12345 | — |
| Faro | http://localhost:12347/collect | — (POST only) |
| FalkorDB | http://localhost:3004 | — |
Stop the stack
pnpm dev:stack:stop
Reset (wipe data volumes)
docker compose -f infra/docker/compose.yml -f infra/docker/compose.dev.yml \
--profile observability down -v
Grafana Datasources
Four datasources are auto-provisioned via infra/grafana/provisioning/datasources/datasources.yaml:
| Datasource | Type | Default URL |
|---|---|---|
| Loki | loki | http://loki:3100 |
| Prometheus | prometheus | http://prometheus:9090 |
| Tempo | tempo | http://tempo:3200 |
| Pyroscope | grafana-pyroscope-datasource | http://pyroscope:4040 |
Loki → Tempo linking is configured: log lines containing a trace_id field render a clickable link to the corresponding trace in Tempo. Tempo → Loki linking is also configured for trace-to-log correlation.
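For the log-to-trace link to resolve, a service has to write the trace ID into each log line as a parseable field. The following is a minimal sketch of one way to do that with Python's standard `logging` module. The field names `level`, `latency_ms`, and `trace_id` follow the queries and linking description in this document; `JsonFormatter` and `build_logger` are hypothetical helpers, not GospeLib code, and the `service` and `env` stream labels are assumed to be attached by the collector rather than written into the log body.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    # Optional fields a caller may pass through logging's `extra=` argument.
    OPTIONAL_FIELDS = ("trace_id", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for key in self.OPTIONAL_FIELDS:
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("content", buffer)
    log.error(
        "passage lookup failed",
        extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "latency_ms": 612},
    )
    print(buffer.getvalue(), end="")
```

In a container the stream would normally be stdout, so that Docker log tailing picks the lines up; the in-memory buffer here only keeps the sketch self-contained.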
Dashboards
Pre-built dashboards are stored in infra/grafana/dashboards/ and auto-loaded by Grafana:
| Dashboard | Key panels |
|---|---|
| Service Overview | Request rate, error rate, latency P50/P95/P99 per service |
| FalkorDB | Query latency, memory usage, key count |
| PostgreSQL | Connection pool, query latency, replication lag |
| Redis / ElastiCache | Hit rate, memory, evictions |
| Typesense | Search latency, index size, request rate |
| Kubernetes | Pod CPU/memory, restart count, node health |
| Ingest Pipeline | Job status, nodes created, processing time |
| AI Service | Token usage, response latency, model breakdown |
Querying
Logs (Loki)
Use the Explore tab with the Loki datasource:
# All errors from the gateway in the last hour
{service="gateway"} |= "error" | json | level="error"
# Slow requests (> 500ms)
{service="content"} | json | latency_ms > 500
# Requests for a specific passage
{service="content"} |= "gen.1.1"
# Logs from a specific environment
{env="development"} | json
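The same LogQL can be run outside Grafana through Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint and makes no network call. The helper names `selector` and `query_range_url` are hypothetical, and the base URL is the local Loki address from the access table.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def selector(**labels: str) -> str:
    """Build a LogQL stream selector, e.g. {env="staging", service="content"}."""
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes so the value stays a valid string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


def query_range_url(
    base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Return a query_range URL; start and end are nanosecond Unix timestamps."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    slow = selector(service="content") + " | json | latency_ms > 500"
    print(query_range_url("http://localhost:3100", slow, 1_700_000_000_000_000_000,
                          1_700_003_600_000_000_000))
```

The printed URL can be fetched with any HTTP client while the local stack is running.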
Traces (Tempo)
Search by service name, trace ID, or duration. Trace spans show the full request lifecycle across services — gateway → content → FalkorDB.
Profiles (Pyroscope)
Use the Explore tab with the Pyroscope datasource. Select a service name and profile type:
- process_cpu — where CPU time is spent
- allocs — heap allocation hotspots
- goroutine — goroutine counts (Go services)
- mutex / block — lock contention (Go services)
Python services report CPU and wall-time profiles via the Pyroscope SDK.
Frontend Observability (Faro)
The web app (apps/web) is instrumented with Grafana Faro, providing:
- Error capture — JavaScript exceptions with stack traces, console errors
- Web Vitals — Core Web Vitals (LCP, FID, CLS) and resource timings
- Session tracking — persistent session IDs correlated with backend traces
- Browser tracing — frontend spans connected to backend traces via W3C traceparent
- Session replay — DOM recording that lets you play back exactly what the user saw when an error occurred
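Browser-to-backend correlation depends on the W3C `traceparent` header, which has the form `version-traceid-parentid-flags` in lowercase hex. As a rough illustration of what a backend does with it, the sketch below parses a version `00` header and derives a child header that keeps the same trace ID. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical; a real service would normally rely on its OpenTelemetry SDK for this, and the sketch deliberately rejects future header versions that carry extra fields.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

Because every hop reuses the trace ID, a frontend span and the backend spans it triggers end up in the same trace in Tempo.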
How it works
The Faro Web SDK runs in the browser and POSTs telemetry to Alloy's built-in faro.receiver on port 12347. Alloy routes:
- Frontend logs (errors, console output, web vitals) → Loki
- Frontend traces (navigation, fetch, user interactions) → Tempo
Source maps are automatically downloaded by Alloy from the origin server, so stack traces in Grafana show original TypeScript source locations.
Configuration
Faro is enabled by setting NEXT_PUBLIC_FARO_COLLECTOR_URL. In local dev, the Docker Compose stack defaults this to http://localhost:12347/collect. In production, point it at the central Alloy/Faro endpoint.
To disable Faro (e.g., for performance testing), leave NEXT_PUBLIC_FARO_COLLECTOR_URL empty.
Viewing session replays
In Grafana, use Explore → Loki and filter for {service_name="gospelib-web"}. Frontend errors include session IDs and metadata that let you correlate with the user's session. The Grafana Frontend Observability plugin (when installed) provides a dedicated session browser with replay playback.
Alert Rules
Configured in infra/k8s/base/monitoring/prometheus-alerts.yaml:
| Alert | Condition | Severity |
|---|---|---|
| Service down | Health check fails for > 2 minutes | Critical |
| Error rate spike | HTTP 5xx rate > 5% for 5 minutes | Critical |
| High latency | P99 > 2s for 10 minutes | Warning |
| Pod restart loop | > 3 restarts in 10 minutes | Critical |
| DB connection pool exhausted | Active connections > 80% max | Warning |
| Disk usage high | > 80% on any PV | Warning |
| FalkorDB memory pressure | Used memory > 80% limit | Warning |
| Certificate expiry | TLS cert expires within 14 days | Warning |
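Most conditions in the table pair a threshold with a duration ("> 5% for 5 minutes"). The sketch below is a simplified model of that pattern, not the actual rule definitions: the alert becomes pending when the condition first holds and fires only if it keeps holding for the full duration, with any dip resetting the timer. The function names and the sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def error_ratio(total: float, errors: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total > 0 else 0.0


def fires(
    samples: Iterable[Tuple[float, float]], threshold: float, hold_seconds: float
) -> bool:
    """Return True if the value stays above `threshold` for `hold_seconds`.

    `samples` is a time-ordered sequence of (unix_seconds, value) pairs,
    one per rule evaluation.
    """
    pending_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_seconds:
                return True
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return False


if __name__ == "__main__":
    # One evaluation per minute, 8% errors throughout: fires at the 5-minute mark.
    sustained = [(t, 0.08) for t in range(0, 301, 60)]
    print(fires(sustained, threshold=0.05, hold_seconds=300))
```

The duration is what keeps a single bad scrape from paging anyone: a brief spike that recovers within the window never leaves the pending state.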
Multi-Environment Architecture
The production deployment at grafana.gospelib.com uses a single-backend, multi-env model:
- One Loki, one Prometheus, one Tempo, one Pyroscope — shared across staging and production
- Each environment stamps an env label on all telemetry (staging, production)
- Grafana dashboards use a $env template variable to switch between environments
- Local dev stays self-contained by default — pnpm infra:observability runs everything locally
To opt in to sending local telemetry to the central stack (e.g., reproducing a bug that needs team visibility), override the backend URLs in .env.local:
GOSPELIB_LOKI_PUSH_URL=https://<central>/loki/api/v1/push
GOSPELIB_MIMIR_PUSH_URL=https://<central>/api/v1/write
GOSPELIB_TEMPO_OTLP_ENDPOINT=<central>:443
GOSPELIB_PYROSCOPE_URL=https://<central>
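The opt-in works as a per-variable override: anything left unset keeps pointing at the local containers. The sketch below illustrates that resolution logic only; it is not how the stack actually reads its configuration. The variable names come from the block above, while the local default values are assumptions pieced together from the datasource table (the Tempo OTLP port in particular is a guess), and `resolve_backends` and `overridden` are hypothetical helpers.

```python
import os
from typing import Dict, List, Mapping

# Assumed local defaults; the real values live in the Compose and Alloy config.
LOCAL_DEFAULTS: Dict[str, str] = {
    "GOSPELIB_LOKI_PUSH_URL": "http://loki:3100/loki/api/v1/push",
    "GOSPELIB_MIMIR_PUSH_URL": "http://prometheus:9090/api/v1/write",
    "GOSPELIB_TEMPO_OTLP_ENDPOINT": "tempo:4317",  # assumption: default OTLP gRPC port
    "GOSPELIB_PYROSCOPE_URL": "http://pyroscope:4040",
}


def resolve_backends(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the effective backend URLs: overrides win, blanks fall back."""
    resolved = {}
    for name, default in LOCAL_DEFAULTS.items():
        value = env.get(name, "").strip()
        resolved[name] = value or default
    return resolved


def overridden(resolved: Mapping[str, str]) -> List[str]:
    """List the variables that point somewhere other than the local default."""
    return sorted(n for n, v in resolved.items() if v != LOCAL_DEFAULTS[n])


if __name__ == "__main__":
    backends = resolve_backends()
    for name, url in backends.items():
        print(f"{name}={url}")
    print("non-local:", overridden(backends) or "none")
```

Overriding a single variable, for example only the Pyroscope URL, sends that one signal to the central stack and leaves the rest local.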
Central Stack Deployment
The central observability stack (grafana.gospelib.com) is deployed separately from application environments. It runs in the production Kubernetes cluster under the monitoring namespace:
| Component | Helm Chart | Notes |
|---|---|---|
| Grafana + Prometheus | kube-prometheus-stack | Includes node-exporter, kube-state-metrics |
| Loki | grafana/loki-stack | S3-backed storage, Promtail DaemonSet |
| Tempo | grafana/tempo | OTLP gRPC receiver |
| Pyroscope | grafana/pyroscope | Pull (Go) + push (Python) ingestion |
Ingress is configured at grafana.gospelib.com with cert-manager TLS via letsencrypt-prod. See infra/k8s/base/monitoring/ for the full Kubernetes manifests.
Configuration Files
| File | Purpose |
|---|---|
| infra/docker/compose.dev.yml | Local container definitions |
| infra/alloy/config.alloy | Alloy collection pipeline |
| infra/grafana/provisioning/datasources/ | Grafana datasource auto-provisioning |
| infra/grafana/provisioning/dashboards/ | Grafana dashboard auto-loading config |
| infra/grafana/dashboards/ | Dashboard JSON files |
| infra/loki/loki.yaml | Loki storage and schema config |
| infra/tempo/tempo.yaml | Tempo storage config |
| infra/prometheus/prometheus.yml | Prometheus config (scraping via Alloy) |
| infra/k8s/base/monitoring/ | Kubernetes monitoring manifests |
| apps/web/lib/faro.ts | Faro SDK initialization |
| .env.example | Default observability env vars |