Operations Overview
GospeLib runs as containerized microservices on Kubernetes, backed by managed AWS data stores. This section covers everything needed to build, ship, observe, and recover the platform across staging and production environments.
What Operations Covers
| Concern | Summary | Guide |
|---|---|---|
| CI/CD | GitHub Actions + Nx affected builds, Docker image pipeline, ArgoCD GitOps | CI/CD Pipeline |
| Deployment architecture | Service inventory, data stores, environment topology, prerequisites | Deployment Overview |
| Staging | k3s on EC2, Terraform provisioning, ArgoCD auto-sync from stage branch | Deploy to Staging |
| Production | EKS cluster, approval gates, release tags, cost breakdown | Deploy to Production |
| Observability | Grafana + Loki + Prometheus + Tempo + Pyroscope, frontend RUM via Faro | Monitoring & Observability |
| Rollback | Service rollback, database recovery, infrastructure state restore | Rollback Procedures |
| Infrastructure | Terraform modules, Docker Compose topology, AWS resource layout | Infrastructure |
Environment Overview
GospeLib maintains two deployed environments. Staging mirrors production closely: same Docker images, same Kubernetes manifests (with different resource limits), same database engines, same monitoring stack. The differences are the cluster type (k3s on EC2 versus managed EKS), instance sizes, and replica counts.
| Staging (staging.gospelib.com) | Production (gospelib.com) |
|---|---|
| EC2 t3.medium running k3s | EKS managed cluster |
| 1 replica per service | 2 replicas per service |
| RDS db.t3.micro | RDS db.r6g.large |
| ElastiCache t3.micro | ElastiCache r6g.large |
| ~$2.50/mo (free tier) | ~$213/mo |
Deployment Flow
A typical change moves through the pipeline in this order:
- Push or PR -- CI runs lint, tests (JS/Python/Go in parallel), and build
- Merge to `stage` -- CD builds affected Docker images, pushes to ECR, updates Kustomize image tags
- ArgoCD syncs staging -- manifests applied to the k3s cluster automatically
- Bake period -- candidate build runs on staging for at least 24 hours
- Release tag -- `release/v*.*.*` triggers production CD with an approval gate
- ArgoCD syncs production -- EKS cluster updated, health checks polled
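The two promotion conditions in this flow (a matching release tag and a completed bake period) can be sketched as a small pre-flight check. This is an illustration only: it assumes the `release/v*.*.*` pattern means three numeric version components, the function names are hypothetical, and the source does not say whether the real pipeline enforces the bake period automatically. The manual approval gate is not modeled.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed reading of the release/v*.*.* glob: three numeric components.
RELEASE_TAG = re.compile(r"^release/v(\d+)\.(\d+)\.(\d+)$")
MIN_BAKE = timedelta(hours=24)  # "at least 24 hours" on staging


def parse_release_tag(tag: str):
    """Return (major, minor, patch) for a release tag, or None if it does not match."""
    match = RELEASE_TAG.match(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def bake_complete(deployed_to_staging_at: datetime, now: datetime) -> bool:
    """True once the candidate has run on staging for the minimum bake period."""
    return now - deployed_to_staging_at >= MIN_BAKE


def may_promote(tag: str, deployed_to_staging_at: datetime, now: datetime) -> bool:
    """Both conditions must hold before production CD would be triggered."""
    return parse_release_tag(tag) is not None and bake_complete(deployed_to_staging_at, now)


if __name__ == "__main__":
    staged = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical timestamp
    print(may_promote("release/v1.4.2", staged, staged + timedelta(hours=25)))
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic and easy to test.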
For emergency fixes that cannot wait for the full cycle, see the hotfix procedure.
Observability Signals
Every service emits logs, metrics, traces, and profiles through a unified collection agent (Alloy) into the Grafana stack. The same tooling runs locally via pnpm infra:observability and in production at grafana.gospelib.com.
| Signal | Backend | Local access |
|---|---|---|
| Logs | Loki | http://localhost:3100 |
| Metrics | Prometheus | http://localhost:9090 |
| Traces | Tempo | via Grafana Explore |
| Profiles | Pyroscope | http://localhost:4040 |
| Frontend | Loki + Tempo (Faro) | http://localhost:12347/collect |
| Errors | Sentry | per-plan dashboard |
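To confirm the local stack is up after running pnpm infra:observability, a readiness probe along these lines may help. It is a sketch, not part of the repository: the base URLs come from the table above, `/ready` and `/-/ready` are the standard readiness paths for Loki and Prometheus, and the function names and the injectable `opener` parameter are assumptions made for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Base URLs are taken from the table; only Loki and Prometheus are probed here.
READY_ENDPOINTS = {
    "loki": "http://localhost:3100/ready",
    "prometheus": "http://localhost:9090/-/ready",
}


def is_ready(url: str, opener=urlopen, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as not ready.
        return False


def check_stack(endpoints=READY_ENDPOINTS, opener=urlopen) -> dict:
    """Probe every endpoint and map backend name to readiness."""
    return {name: is_ready(url, opener) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ready in check_stack().items():
        print(f"{name}: {'ready' if ready else 'not ready'}")
```

The `opener` argument exists only so the probe can be exercised without a running stack; in normal use the default `urlopen` is sufficient.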
Recovery Playbook
When something goes wrong, the Rollback Procedures guide provides a decision matrix:
- Bad code deploy (no DB change) -- `kubectl rollout undo` or ArgoCD rollback, recovery in under 2 minutes
- Bad code + reversible migration -- roll back migration, then roll back code, under 10 minutes
- Corrupted graph data -- restore FalkorDB from backup or re-ingest from corpus, 30-60 minutes
- Infrastructure misconfiguration -- restore previous Terraform state and re-apply, under 15 minutes
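The matrix above maps naturally onto a lookup table. The sketch below encodes the four scenarios and their recovery targets exactly as listed; the type and function names are hypothetical, and the authoritative steps remain those in the Rollback Procedures guide.

```python
from dataclasses import dataclass
from enum import Enum


class Incident(Enum):
    BAD_DEPLOY = "bad code deploy (no DB change)"
    BAD_DEPLOY_WITH_MIGRATION = "bad code + reversible migration"
    CORRUPTED_GRAPH = "corrupted graph data"
    INFRA_MISCONFIG = "infrastructure misconfiguration"


@dataclass(frozen=True)
class Recovery:
    steps: tuple       # ordered actions, as described in the list above
    max_minutes: int   # upper bound of the stated recovery time


PLAYBOOK = {
    Incident.BAD_DEPLOY: Recovery(
        ("kubectl rollout undo or ArgoCD rollback",), 2),
    Incident.BAD_DEPLOY_WITH_MIGRATION: Recovery(
        ("roll back migration", "roll back code"), 10),
    Incident.CORRUPTED_GRAPH: Recovery(
        ("restore FalkorDB from backup or re-ingest from corpus",), 60),
    Incident.INFRA_MISCONFIG: Recovery(
        ("restore previous Terraform state", "re-apply"), 15),
}


def plan(incident: Incident) -> Recovery:
    """Look up the recovery steps and time budget for an incident type."""
    return PLAYBOOK[incident]


if __name__ == "__main__":
    recovery = plan(Incident.BAD_DEPLOY_WITH_MIGRATION)
    print(" -> ".join(recovery.steps), f"(under {recovery.max_minutes} min)")
```

Keeping the steps as an ordered tuple preserves the one ordering constraint the list calls out: the migration is rolled back before the code.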
Infrastructure as Code
All cloud resources are defined in infra/terraform/ with reusable modules for EKS, RDS, ElastiCache, S3, ECR, CloudFront, Route53, and Secrets Manager. Kubernetes manifests use Kustomize with a base + overlay pattern, and the local development stack runs entirely through Docker Compose. See Infrastructure for the full layout.
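The base + overlay idea can be illustrated with a recursive dictionary merge: each environment overlay overrides only the fields that differ from the shared base. This is a deliberately simplified sketch and not Kustomize's actual patching algorithm. The manifest contents and the service name are hypothetical; only the replica counts (1 for staging, 2 for production) come from the environment table.

```python
from copy import deepcopy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a copy of base with overlay values merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base manifest shared by both environments.
BASE = {
    "kind": "Deployment",
    "metadata": {"name": "example-service"},
    "spec": {"replicas": 1},
}

# Overlays carry only what differs per environment.
STAGING_OVERLAY = {"spec": {"replicas": 1}}
PRODUCTION_OVERLAY = {"spec": {"replicas": 2}}

if __name__ == "__main__":
    print(apply_overlay(BASE, PRODUCTION_OVERLAY))
```

Because the merge works on a deep copy, the base is never mutated, which mirrors how a shared base directory stays untouched while each overlay layers its own changes on top.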