Operations Overview

GospeLib runs as containerized microservices on Kubernetes, backed by managed AWS data stores. This section covers everything needed to build, ship, observe, and recover the platform across staging and production environments.

What Operations Covers

Concern	Summary	Guide
CI/CD	GitHub Actions + Nx affected builds, Docker image pipeline, ArgoCD GitOps	CI/CD Pipeline
Deployment architecture	Service inventory, data stores, environment topology, prerequisites	Deployment Overview
Staging	k3s on EC2, Terraform provisioning, ArgoCD auto-sync from `stage` branch	Deploy to Staging
Production	EKS cluster, approval gates, release tags, cost breakdown	Deploy to Production
Observability	Grafana + Loki + Prometheus + Tempo + Pyroscope, frontend RUM via Faro	Monitoring & Observability
Rollback	Service rollback, database recovery, infrastructure state restore	Rollback Procedures
Infrastructure	Terraform modules, Docker Compose topology, AWS resource layout	Infrastructure

Environment Overview

GospeLib maintains two deployed environments. Staging mirrors production wherever it matters -- same Docker images, same Kubernetes manifests, same database engines, same monitoring stack. The differences are the cluster type (k3s on EC2 versus managed EKS), instance sizes, replica counts, and resource limits.

Staging (staging.gospelib.com)        Production (gospelib.com)
  EC2 t3.medium running k3s             EKS managed cluster
  1 replica per service                  2 replicas per service
  RDS db.t3.micro                        RDS db.r6g.large
  ElastiCache t3.micro                   ElastiCache r6g.large
  ~$2.50/mo (free tier)                  ~$213/mo
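
The comparison above can also be written down as data, which makes the intended differences easy to list. This is a minimal sketch: the `Environment` class, its field names, and `differing_fields` are hypothetical and not taken from the GospeLib codebase; only the values come from the comparison above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical record of one deployed environment (illustrative names)."""
    host: str
    cluster: str
    replicas_per_service: int
    rds_instance: str
    cache_instance: str


# Values copied from the staging/production comparison.
STAGING = Environment(
    host="staging.gospelib.com",
    cluster="k3s on EC2 t3.medium",
    replicas_per_service=1,
    rds_instance="db.t3.micro",
    cache_instance="t3.micro",
)

PRODUCTION = Environment(
    host="gospelib.com",
    cluster="EKS managed cluster",
    replicas_per_service=2,
    rds_instance="db.r6g.large",
    cache_instance="r6g.large",
)


def differing_fields(a: Environment, b: Environment) -> dict[str, tuple]:
    """Return {field: (a_value, b_value)} for every field where a and b differ."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}
```

Calling `differing_fields(STAGING, PRODUCTION)` returns one entry per row of the comparison, which is a quick way to confirm that nothing beyond sizing and cluster type has drifted.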

Deployment Flow

A typical change moves through the pipeline in this order:

1. Push or PR -- CI runs lint, tests (JS/Python/Go in parallel), and build
2. Merge to `stage` -- CD builds affected Docker images, pushes to ECR, updates Kustomize image tags
3. ArgoCD syncs staging -- manifests applied to the k3s cluster automatically
4. Bake period -- candidate build runs on staging for at least 24 hours
5. Release tag -- `release/v*.*.*` triggers production CD with an approval gate
6. ArgoCD syncs production -- EKS cluster updated, health checks polled
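
The two gates in this flow, the release tag pattern and the 24-hour bake period, can be sketched as small checks. This is an illustration only, not the pipeline's actual implementation: `parse_release_tag` and `bake_complete` are hypothetical helpers, and the sketch assumes each `*` in `release/v*.*.*` stands for a numeric version component.

```python
import re
from datetime import datetime, timedelta

# Assumption: each "*" in release/v*.*.* is a numeric version component.
RELEASE_TAG = re.compile(r"release/v(\d+)\.(\d+)\.(\d+)")

# From the flow above: a candidate runs on staging for at least 24 hours.
MIN_BAKE = timedelta(hours=24)


def parse_release_tag(tag: str):
    """Return (major, minor, patch) for a release tag, or None if it does not match."""
    match = RELEASE_TAG.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def bake_complete(deployed_at: datetime, now: datetime) -> bool:
    """True once the build has been on staging for the minimum bake period."""
    return now - deployed_at >= MIN_BAKE
```

For example, `parse_release_tag("release/v1.4.0")` yields a version tuple, while a bare `v1.4.0` is rejected because it lacks the `release/` prefix.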

For emergency fixes that cannot wait for the full cycle, see the hotfix procedure.

Observability Signals

Every service emits logs, metrics, traces, and profiles through a unified collection agent (Alloy) into the Grafana stack. The same tooling runs locally via `pnpm infra:observability` and in production at grafana.gospelib.com.

Signal	Backend	Local access
Logs	Loki	`http://localhost:3100`
Metrics	Prometheus	`http://localhost:9090`
Traces	Tempo	via Grafana Explore
Profiles	Pyroscope	`http://localhost:4040`
Frontend	Loki + Tempo (Faro)	`http://localhost:12347/collect`
Errors	Sentry	per-plan dashboard
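
After starting the local stack, it can be handy to confirm that each backend is listening. The sketch below probes the local URLs from the table using only the Python standard library. `is_reachable` and `check_local_stack` are hypothetical helpers, and the probe only checks that something answers HTTP at the address; it does not use any backend-specific health endpoint.

```python
from http.client import HTTPException
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Local endpoints copied from the table above. Traces and errors have no
# direct local URL in the table, so they are omitted here.
LOCAL_ENDPOINTS = {
    "logs": "http://localhost:3100",
    "metrics": "http://localhost:9090",
    "profiles": "http://localhost:4040",
    "frontend": "http://localhost:12347/collect",
}


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a server answers HTTP at url, regardless of status code."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server replied with an error status, so it is up.
        return True
    except (URLError, OSError, HTTPException):
        # Connection refused, DNS failure, timeout, or a non-HTTP reply.
        return False


def check_local_stack(probe=is_reachable) -> dict[str, bool]:
    """Probe every local endpoint; probe is injectable so this can run offline."""
    return {signal: probe(url) for signal, url in LOCAL_ENDPOINTS.items()}
```

Passing a custom `probe` keeps the function testable without a running stack; with the default, `check_local_stack()` performs real HTTP requests against localhost.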

Recovery Playbook

When something goes wrong, the Rollback Procedures guide provides a decision matrix:

Bad code deploy (no DB change) -- `kubectl rollout undo` or ArgoCD rollback, recovery in under 2 minutes
Bad code + reversible migration -- roll back migration, then roll back code, under 10 minutes
Corrupted graph data -- restore FalkorDB from backup or re-ingest from corpus, 30-60 minutes
Infrastructure misconfiguration -- restore previous Terraform state and re-apply, under 15 minutes
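
The decision matrix above can be expressed as a lookup table, for instance to drive an on-call helper or a runbook generator. This is a sketch under stated assumptions: the `Incident` enum, the `Recovery` record, and `recovery_for` are hypothetical names; the steps and time limits are copied from the list above, using the upper bound where a range is given.

```python
from dataclasses import dataclass
from enum import Enum


class Incident(Enum):
    """Hypothetical incident categories, one per row of the decision matrix."""
    BAD_DEPLOY = "bad code deploy, no DB change"
    BAD_DEPLOY_WITH_MIGRATION = "bad code plus reversible migration"
    CORRUPTED_GRAPH = "corrupted graph data"
    INFRA_MISCONFIG = "infrastructure misconfiguration"


@dataclass(frozen=True)
class Recovery:
    steps: tuple[str, ...]   # ordered actions to take
    max_minutes: int         # upper bound on expected recovery time


PLAYBOOK = {
    Incident.BAD_DEPLOY: Recovery(
        ("kubectl rollout undo or ArgoCD rollback",), 2),
    Incident.BAD_DEPLOY_WITH_MIGRATION: Recovery(
        ("roll back migration", "roll back code"), 10),
    Incident.CORRUPTED_GRAPH: Recovery(
        ("restore FalkorDB from backup or re-ingest from corpus",), 60),
    Incident.INFRA_MISCONFIG: Recovery(
        ("restore previous Terraform state", "re-apply"), 15),
}


def recovery_for(incident: Incident) -> Recovery:
    """Look up the recovery steps and time bound for an incident category."""
    return PLAYBOOK[incident]
```

Keeping the steps as an ordered tuple preserves the one ordering constraint the matrix states: for a reversible migration, the migration is rolled back before the code.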

Infrastructure as Code

All cloud resources are defined in `infra/terraform/` with reusable modules for EKS, RDS, ElastiCache, S3, ECR, CloudFront, Route53, and Secrets Manager. Kubernetes manifests use Kustomize with a base + overlay pattern, and the local development stack runs entirely through Docker Compose. See Infrastructure for the full layout.
