Observability

GospeLib's observability stack is built on Grafana's open-source ecosystem. The same stack runs locally and in production, so local development is intended to match what grafana.gospelib.com provides.

Signals

Signal	Backend	Collector	Retention (local)	Retention (prod)
Logs	Loki	Alloy	7 days	90 days (S3)
Metrics	Prometheus	Alloy	7 days	15 days
Traces	Tempo	Alloy (OTLP)	local disk	local disk
Profiles	Pyroscope	Alloy (Go) / SDK (Python)	local disk	local disk
Frontend	Loki + Tempo	Alloy (Faro receiver)	7 days	90 days
Errors	Sentry	SDK (per-service)	—	per plan

Architecture

                            ┌────────────┐
                            │  Browser   │
                            │  (Faro SDK)│
                            └─────┬──────┘
                                  │ POST /collect :12347
┌─────────────────────────────────┼───────────────────────────────┐
│  Docker Compose                 │                               │
│                                 ▼                               │
│  ┌──────────┐    scrape /metrics      ┌────────────┐            │
│  │  Alloy   │◄────────────────────────│  Services  │            │
│  │          │    scrape /debug/pprof  │  (Go)      │            │
│  │          │◄────────────────────────│            │            │
│  │          │    docker log tailing   │            │            │
│  │          │◄────────────────────────│            │            │
│  │          │    OTLP gRPC :4317      │            │            │
│  │          │◄────────────────────────│            │            │
│  └────┬─────┘                         └────────────┘            │
│       │                                                         │
│       │  push                         ┌────────────┐            │
│       ├──────────────────────────────►│  Loki      │ :3100      │
│       ├──────────────────────────────►│  Prometheus│ :9090      │
│       ├──────────────────────────────►│  Tempo     │ (internal) │
│       └──────────────────────────────►│  Pyroscope │ :4040      │
│                                       └─────┬──────┘            │
│                                             │                   │
│  ┌──────────┐    query datasources    ┌─────▼──────┐            │
│  │ Grafana  │◄───────────────────────►│  Backends  │            │
│  │  :3000   │                         └────────────┘            │
│  └──────────┘                                                   │
│                                       ┌────────────┐            │
│                                       │  Services  │            │
│                                       │  (Python)  │            │
│                                       │   push ────┼─► Pyroscope│
│                                       └────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Alloy is the single collection agent. It tails Docker logs, scrapes Prometheus metrics and Go pprof endpoints, receives OTLP traces, receives browser telemetry via the built-in Faro receiver, and pushes everything to the local backends. Python services instead push profiles directly to Pyroscope via the SDK, because Python has no standard-library equivalent of Go's pprof HTTP endpoints for Alloy to scrape.

Local Setup

Start the stack

pnpm infra:observability

This starts Loki, Prometheus, Tempo, Pyroscope, Alloy, Grafana, and the FalkorDB Browser — all behind the observability Docker Compose profile. No environment variables are required; all backends default to the local containers.

Access

Tool	URL	Credentials
Grafana	http://localhost:3000	admin / localdev
Prometheus	http://localhost:9090	—
Loki	http://localhost:3100	—
Pyroscope	http://localhost:4040	—
Alloy UI	http://localhost:12345	—
Faro	http://localhost:12347/collect	— (POST only)
FalkorDB	http://localhost:3004	—
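
After starting the stack, it can help to confirm that the main tools are reachable before opening Grafana. The following is a minimal sketch, not part of the repository: the hosts and ports come from the table above, while the readiness paths (`/api/health`, `/-/ready`, `/ready`) are the upstream defaults for Grafana, Prometheus, and Loki and are assumed to be unchanged here. The names `READINESS_URLS`, `http_status`, and `check_stack` are illustrative.

```python
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# Hosts and ports are taken from the Access table; the paths are the
# upstream default readiness endpoints (assumed, not GospeLib-specific).
READINESS_URLS: Dict[str, str] = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
}


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for url, or 0 on any error.

    urlopen raises HTTPError for non-2xx responses, so those also map to 0.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except (URLError, OSError):
        return 0


def check_stack(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each tool name to True when its readiness endpoint returns 200."""
    return {name: fetch(url) == 200 for name, url in READINESS_URLS.items()}
```

Calling `check_stack()` with no arguments performs real HTTP requests against localhost; passing a custom `fetch` function makes the check testable without a running stack.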

Stop the stack

pnpm dev:stack:stop

Reset (wipe data volumes)

docker compose -f infra/docker/compose.yml -f infra/docker/compose.dev.yml \
  --profile observability down -v

Grafana Datasources

Four datasources are auto-provisioned via infra/grafana/provisioning/datasources/datasources.yaml:

Datasource	Type	Default URL
Loki	`loki`	`http://loki:3100`
Prometheus	`prometheus`	`http://prometheus:9090`
Tempo	`tempo`	`http://tempo:3200`
Pyroscope	`grafana-pyroscope-datasource`	`http://pyroscope:4040`

Loki → Tempo linking is configured: log lines containing a trace_id field render a clickable link to the corresponding trace in Tempo. Tempo → Loki linking is also configured for trace-to-log correlation.
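
The log-to-trace link depends on a trace ID being extractable from each log line. The sketch below illustrates that matching step in isolation. It assumes JSON log lines with a `trace_id` field holding a 32-character lowercase hex ID (the W3C trace ID format); the actual pattern lives in the provisioned datasources.yaml and may differ. `TRACE_ID_PATTERN` and `extract_trace_id` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical pattern: a JSON "trace_id" field with a 32-char hex value.
TRACE_ID_PATTERN = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID found in a JSON log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


example = '{"level": "error", "msg": "lookup failed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}'
found = extract_trace_id(example)
```

Log lines without a matching field yield `None`, which corresponds to a log line that renders without a trace link.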

Dashboards

Pre-built dashboards are stored in infra/grafana/dashboards/ and auto-loaded by Grafana:

Dashboard	Key panels
Service Overview	Request rate, error rate, latency P50/P95/P99 per service
FalkorDB	Query latency, memory usage, key count
PostgreSQL	Connection pool, query latency, replication lag
Redis / ElastiCache	Hit rate, memory, evictions
Typesense	Search latency, index size, request rate
Kubernetes	Pod CPU/memory, restart count, node health
Ingest Pipeline	Job status, nodes created, processing time
AI Service	Token usage, response latency, model breakdown

Querying

Logs (Loki)

Use the Explore tab with the Loki datasource:

# All errors from the gateway in the last hour
{service="gateway"} |= "error" | json | level="error"

# Slow requests (> 500ms)
{service="content"} | json | latency_ms > 500

# Requests for a specific passage
{service="content"} |= "gen.1.1"

# Logs from a specific environment
{env="development"} | json
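
The queries above rely on services writing one JSON object per log line, with fields such as `level` and `latency_ms` that `| json` can extract. As a rough sketch of what a Python service might emit, assuming those field names (the real services' log schema is not shown in this document, and `JsonFormatter` and `build_logger` are illustrative names):

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "logger": record.name,
        }
        # Optional fields supplied through `extra=`; the names mirror the
        # fields used in the LogQL examples and are assumptions.
        for field in ("latency_ms", "trace_id"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout for log tailing."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("content").info("passage served", extra={"latency_ms": 742})` would then produce a line that the slow-request query can match. Labels like `service` and `env` are stream labels rather than JSON fields, so they are presumably attached by the collector and not by the service itself.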

Traces (Tempo)

Search by service name, trace ID, or duration. Trace spans show the full request lifecycle across services — gateway → content → FalkorDB.

Profiles (Pyroscope)

Use the Explore tab with the Pyroscope datasource. Select a service name and profile type:

process_cpu — where CPU time is spent
allocs — heap allocation hotspots
goroutine — goroutine counts (Go services)
mutex / block — lock contention (Go services)

Python services report CPU and wall-time profiles via the Pyroscope SDK.

Frontend Observability (Faro)

The web app (apps/web) is instrumented with Grafana Faro, providing:

Error capture — JavaScript exceptions with stack traces, console errors
Web Vitals — Core Web Vitals (LCP, FID, CLS) and resource timings
Session tracking — persistent session IDs correlated with backend traces
Browser tracing — frontend spans connected to backend traces via W3C traceparent
Session replay — DOM recording that lets you play back exactly what the user saw when an error occurred
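
Browser-to-backend correlation hinges on the W3C `traceparent` header, which has the form `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. The sketch below shows how a backend could parse and re-emit that header. It handles only version `00` and is not taken from the GospeLib services, which would normally rely on their OpenTelemetry SDK for this; `TraceContext`, `parse_traceparent`, and `format_traceparent` are illustrative names.

```python
import re
from typing import NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    parent_id: str  # 16 lowercase hex characters
    sampled: bool   # lowest bit of the trace-flags byte


# Version 00 of the header: version-traceid-parentid-flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # The specification treats all-zero IDs as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a TraceContext back into a version-00 traceparent header."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{flags}"
```

Because the trace ID is shared across the browser span and every downstream service span, the same ID is what ties frontend sessions to backend traces in Tempo.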

How it works

The Faro Web SDK runs in the browser and POSTs telemetry to Alloy's built-in faro.receiver on port 12347. Alloy routes:

Frontend logs (errors, console output, web vitals) → Loki
Frontend traces (navigation, fetch, user interactions) → Tempo

Source maps are automatically downloaded by Alloy from the origin server, so stack traces in Grafana show original TypeScript source locations.

Configuration

Faro is enabled by setting NEXT_PUBLIC_FARO_COLLECTOR_URL. In local dev, the Docker Compose stack defaults this to http://localhost:12347/collect. In production, point it at the central Alloy/Faro endpoint.

To disable Faro (e.g., for performance testing), leave NEXT_PUBLIC_FARO_COLLECTOR_URL empty.

Viewing session replays

In Grafana, use Explore → Loki and filter for {service_name="gospelib-web"}. Frontend errors include session IDs and metadata that let you correlate with the user's session. The Grafana Frontend Observability plugin (when installed) provides a dedicated session browser with replay playback.

Alert Rules

Configured in infra/k8s/base/monitoring/prometheus-alerts.yaml:

Alert	Condition	Severity
Service down	Health check fails for > 2 minutes	Critical
Error rate spike	HTTP 5xx rate > 5% for 5 minutes	Critical
High latency	P99 > 2s for 10 minutes	Warning
Pod restart loop	> 3 restarts in 10 minutes	Critical
DB connection pool exhausted	Active connections > 80% max	Warning
Disk usage high	> 80% on any PV	Warning
FalkorDB memory pressure	Used memory > 80% limit	Warning
Certificate expiry	TLS cert expires within 14 days	Warning
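
Most conditions in the table combine a threshold with a hold duration: the condition must stay true for the whole window before the alert fires. The following sketch models that behaviour for the "Error rate spike" row. It is a simplification for illustration only; the real rule is a PromQL expression in prometheus-alerts.yaml, which is not reproduced here, and `error_rate_alert_fires` and its sample format are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Each sample: (timestamp in seconds, total requests, HTTP 5xx responses).
Sample = Tuple[float, int, int]


def error_rate_alert_fires(
    samples: Iterable[Sample],
    threshold: float = 0.05,   # 5% error rate, from the table above
    hold_seconds: float = 300, # 5 minutes, from the table above
) -> bool:
    """Return True once the 5xx rate has exceeded threshold continuously
    for at least hold_seconds. Samples must be in time order."""
    breach_started: Optional[float] = None
    for timestamp, total, errors in samples:
        rate = errors / total if total else 0.0
        if rate > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= hold_seconds:
                return True
        else:
            # Any sample at or below the threshold resets the hold window.
            breach_started = None
    return False
```

The reset on recovery is the important part: a short dip below the threshold restarts the window, which is why brief spikes do not page anyone.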

Multi-Environment Architecture

The production deployment at grafana.gospelib.com uses a single-backend, multi-env model:

One Loki, one Prometheus, one Tempo, one Pyroscope — shared across staging and production
Each environment stamps an env label on all telemetry (staging, production)
Grafana dashboards use a $env template variable to switch between environments
Local dev stays self-contained by default — pnpm infra:observability runs everything locally

To opt in to sending local telemetry to the central stack (e.g., reproducing a bug that needs team visibility), override the backend URLs in .env.local:

GOSPELIB_LOKI_PUSH_URL=https://<central>/loki/api/v1/push
GOSPELIB_MIMIR_PUSH_URL=https://<central>/api/v1/write
GOSPELIB_TEMPO_OTLP_ENDPOINT=<central>:443
GOSPELIB_PYROSCOPE_URL=https://<central>
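
The override mechanism amounts to "use the variable if set, otherwise fall back to the local backend". A minimal sketch of that resolution follows. The variable names come from the listing above, but the local default values are assumptions built from the ports in this document and the upstream default API paths; the real defaults are defined in the Compose files and `.env.example`. `LOCAL_DEFAULTS` and `resolve_backends` are illustrative names.

```python
import os
from typing import Dict, Mapping

# Assumed local defaults; the authoritative values live in the Compose
# configuration and .env.example, not in this sketch.
LOCAL_DEFAULTS: Dict[str, str] = {
    "GOSPELIB_LOKI_PUSH_URL": "http://localhost:3100/loki/api/v1/push",
    "GOSPELIB_MIMIR_PUSH_URL": "http://localhost:9090/api/v1/write",
    "GOSPELIB_TEMPO_OTLP_ENDPOINT": "localhost:4317",
    "GOSPELIB_PYROSCOPE_URL": "http://localhost:4040",
}


def resolve_backends(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the backend endpoints, preferring non-empty overrides in env."""
    return {
        name: env.get(name) or default
        for name, default in LOCAL_DEFAULTS.items()
    }
```

In this sketch an empty value is treated the same as an unset one, so clearing a variable in `.env.local` returns that signal to the local backend.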

Central Stack Deployment

The central observability stack (grafana.gospelib.com) is deployed separately from application environments. It runs in the production Kubernetes cluster under the monitoring namespace:

Component	Helm Chart	Notes
Grafana + Prometheus	`kube-prometheus-stack`	Includes node-exporter, kube-state-metrics
Loki	`grafana/loki-stack`	S3-backed storage, Promtail DaemonSet
Tempo	`grafana/tempo`	OTLP gRPC receiver
Pyroscope	`grafana/pyroscope`	Pull (Go) + push (Python) ingestion

Ingress is configured at grafana.gospelib.com with cert-manager TLS via letsencrypt-prod. See infra/k8s/base/monitoring/ for the full Kubernetes manifests.

Configuration Files

File	Purpose
`infra/docker/compose.dev.yml`	Local container definitions
`infra/alloy/config.alloy`	Alloy collection pipeline
`infra/grafana/provisioning/datasources/`	Grafana datasource auto-provisioning
`infra/grafana/provisioning/dashboards/`	Grafana dashboard auto-loading config
`infra/grafana/dashboards/`	Dashboard JSON files
`infra/loki/loki.yaml`	Loki storage and schema config
`infra/tempo/tempo.yaml`	Tempo storage config
`infra/prometheus/prometheus.yml`	Prometheus config (scraping via Alloy)
`infra/k8s/base/monitoring/`	Kubernetes monitoring manifests
`apps/web/lib/faro.ts`	Faro SDK initialization
`.env.example`	Default observability env vars
