Rollback Procedures

When a deployment introduces issues, you need to roll back quickly. This guide covers rollback procedures for every layer of the stack.

Decision Matrix

| Scenario | Action | Recovery Time |
|---|---|---|
| Bad code deploy (no DB change) | kubectl rollout undo or ArgoCD rollback | < 2 minutes |
| Bad code + reversible DB migration | Roll back migration → roll back code | < 10 minutes |
| Bad code + irreversible DB migration | Fix forward with new code | Varies |
| Infrastructure misconfiguration | terraform apply with previous state | < 15 minutes |
| Corrupted FalkorDB graph | Restore from backup or re-ingest | 30–60 minutes |
| Total environment failure | Terraform destroy + recreate | 1–2 hours |

Service Rollback (Application Code)

Roll back a service to its previous image:

Option A: ArgoCD Rollback (Preferred)

# View deployment history
argocd app history gospelib-production

# Roll back to a previous revision
argocd app rollback gospelib-production <REVISION_NUMBER>
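
Argo CD generally refuses a rollback while automated sync is enabled on the application, and the Full Release Rollback section of this guide notes that ArgoCD syncs automatically. The following is a minimal sketch of the full sequence under that assumption. The `argocd_rollback` helper name and the 300-second timeout are illustrative choices, not part of the existing tooling.

```shell
# Hypothetical helper: pause auto-sync, roll back, then wait for health.
# Each step runs only if the previous one succeeded.
argocd_rollback() {
  app="$1"
  revision="$2"
  argocd app set "$app" --sync-policy none &&
    argocd app rollback "$app" "$revision" &&
    argocd app wait "$app" --health --timeout 300
}
```

Usage would look like `argocd_rollback gospelib-production <REVISION_NUMBER>`. Automated sync stays off afterwards; re-enable it with `argocd app set gospelib-production --sync-policy automated` once a fixed revision is ready to deploy.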

Option B: kubectl Rollback

# Roll back the most recent deployment
kubectl rollout undo deployment/gospelib-gateway -n gospelib-production

# Verify the rollback
kubectl rollout status deployment/gospelib-gateway -n gospelib-production
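
`kubectl rollout undo` returns to the immediately preceding revision. When that revision is also bad, a specific one can be targeted with `--to-revision`. A sketch follows; the `rollback_to_revision` helper and the 120-second timeout are assumptions for illustration.

```shell
# List recorded revisions first:
#   kubectl rollout history deployment/gospelib-gateway -n gospelib-production

# Hypothetical helper: undo to a chosen revision, then wait for the rollout.
rollback_to_revision() {
  deployment="$1"
  namespace="$2"
  revision="$3"
  kubectl rollout undo "deployment/${deployment}" -n "$namespace" \
    --to-revision="$revision" &&
    kubectl rollout status "deployment/${deployment}" -n "$namespace" \
      --timeout=120s
}
```

For example, `rollback_to_revision gospelib-gateway gospelib-production 3` would return the gateway to revision 3 and exit non-zero if the rollout does not finish within the timeout.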

Option C: Pin to a Known-Good Image

cd infra/k8s/overlays/production
kustomize edit set image gospelib-gateway=$ECR_URL/gospelib-gateway:<KNOWN_GOOD_SHA>
kustomize build . | kubectl apply -f -
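
After pinning, it can help to confirm which image the deployment now references. A small sketch using a JSONPath query; the `current_image` helper name is hypothetical.

```shell
# Hypothetical helper: print the image(s) in a deployment's pod template.
current_image() {
  deployment="$1"
  namespace="$2"
  kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}
```

Running `current_image gospelib-gateway gospelib-production` should print an image reference ending in the known-good SHA.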

Full Release Rollback

Roll back ALL services to the previous release:

# Find the previous release tag
git tag -l 'release/v*' --sort=-v:refname | head -5
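
The listing above shows the five newest tags. If the newest tag is the release currently deployed (an assumption about the tagging workflow), the rollback target is the second entry. A sketch, with `previous_release_tag` as a hypothetical helper name:

```shell
# Hypothetical helper: print the release tag just before the newest one.
previous_release_tag() {
  # Tags are version-sorted newest first, so line 2 is the prior release.
  git tag -l 'release/v*' --sort=-v:refname | sed -n '2p'
}
```

The output is empty when fewer than two release tags exist, so check for that before using it as a rollback target.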

Option A: Revert the Kustomize Commit

git log --oneline infra/k8s/overlays/production/kustomization.yaml | head -5
git revert <COMMIT_SHA>
git push origin main
# ArgoCD syncs automatically

Option B: ArgoCD Full History Rollback

argocd app rollback gospelib-production <PREVIOUS_REVISION>

Database Migration Rollback

PostgreSQL

# Check current migration version
migrate -source "file://services/auth/migrations" \
-database "$PG_URL" version

# Roll back one migration
migrate -source "file://services/auth/migrations" \
-database "$PG_URL" down 1

# Roll back to a specific version
migrate -source "file://services/auth/migrations" \
-database "$PG_URL" goto 5

Warning: Down migrations that drop columns or tables destroy data, and re-running the up migration does not bring it back. If data has been dropped, you must restore from a backup or fix forward.
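
Because a destructive down migration cannot be reversed, one precaution is to take a manual RDS snapshot immediately before running it. The sketch below assumes the auth database lives on the `gospelib-production` RDS instance used in the restore procedure later in this guide. The `snapshot_then_migrate_down` name and the snapshot naming scheme are hypothetical.

```shell
# Hypothetical helper: snapshot the database, then roll back N migrations.
# The migration runs only if the snapshot was created and became available.
snapshot_then_migrate_down() {
  steps="${1:-1}"
  snapshot_id="pre-rollback-$(date -u +%Y%m%d-%H%M%S)"
  aws rds create-db-snapshot \
    --db-instance-identifier gospelib-production \
    --db-snapshot-identifier "$snapshot_id" &&
    aws rds wait db-snapshot-available \
      --db-snapshot-identifier "$snapshot_id" &&
    migrate -source "file://services/auth/migrations" \
      -database "$PG_URL" down "$steps"
}
```

Called as `snapshot_then_migrate_down 1`, this trades a few minutes of snapshot time for a restore point taken right before the schema change.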

FalkorDB

FalkorDB does not have traditional migrations. Rollback options:

  1. Additive changes (new indices, new node types): Leave them in place; they are harmless.
  2. Destructive changes (removed nodes, changed schema): Restore from backup.
  3. Nuclear option: Re-run the full ingest from corpus data with --reset.

# Full re-ingest (staging only; destroys and rebuilds the graph)
kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging

Infrastructure Rollback (Terraform)

cd infra/terraform/environments/production

# Find the previous state version in S3
aws s3api list-object-versions \
--bucket gospelib-terraform-state \
--prefix infrastructure/terraform.tfstate

# Download the previous version
aws s3api get-object \
--bucket gospelib-terraform-state \
--key infrastructure/terraform.tfstate \
--version-id <PREVIOUS_VERSION_ID> \
/tmp/terraform.tfstate.backup

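# Push the old state back. Terraform rejects a state whose serial is lower
# than the remote one, so restoring an older version needs -force.
terraform state push -force /tmp/terraform.tfstate.backup

# Review the plan, then apply to revert infrastructure
terraform plan
terraform apply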
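
The previous version ID can also be looked up without reading the full listing. S3 returns the versions of a key newest first, so the second entry is the state written before the current one. A sketch using the bucket and key from above; the `previous_state_version_id` helper name is hypothetical.

```shell
# Hypothetical helper: print the version ID of the state before the current one.
previous_state_version_id() {
  # Filter to the exact key, then take index 1 (index 0 is the latest version).
  aws s3api list-object-versions \
    --bucket gospelib-terraform-state \
    --prefix infrastructure/terraform.tfstate \
    --query "Versions[?Key=='infrastructure/terraform.tfstate'] | [1].VersionId" \
    --output text
}
```

The immediately preceding version is not necessarily the last known-good state. If several applies happened since the last healthy deployment, pick the version by its `LastModified` timestamp from the full listing instead.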

Disaster Recovery

Backup Schedule

| Data Store | Method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL | RDS automated snapshots | Daily | 14 days (prod) |
| PostgreSQL | Manual snapshot before releases | Per release | 30 days |
| FalkorDB | Redis BGSAVE → S3 | Every 6 hours | 7 days |
| Typesense | Snapshot API → S3 | Daily | 7 days |
| Corpus data | Git | Every commit | Permanent |
| Terraform state | S3 versioning | Every apply | 90 days |

Restore PostgreSQL from Snapshot

# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier gospelib-production \
--query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
--output table

# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier gospelib-production-restored \
--db-snapshot-identifier <SNAPSHOT_ID> \
--db-instance-class db.t3.medium

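# After verification, swap instance names. Without --apply-immediately,
# a rename waits for the next maintenance window.
aws rds modify-db-instance \
--db-instance-identifier gospelib-production \
--new-db-instance-identifier gospelib-production-old \
--apply-immediately

# Run this only after the first rename has completed and the name is free
aws rds modify-db-instance \
--db-instance-identifier gospelib-production-restored \
--new-db-instance-identifier gospelib-production \
--apply-immediately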
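
When the most recent snapshot is the one to restore, the listing step of this procedure can be narrowed to a single identifier with a JMESPath query. A sketch, with `latest_snapshot_id` as a hypothetical helper name:

```shell
# Hypothetical helper: print the identifier of the newest available snapshot.
latest_snapshot_id() {
  # Keep only finished snapshots, sort by creation time, take the last one.
  aws rds describe-db-snapshots \
    --db-instance-identifier gospelib-production \
    --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-1].DBSnapshotIdentifier" \
    --output text
}
```

The result can be passed as `--db-snapshot-identifier "$(latest_snapshot_id)"` in the restore command. If the newest snapshot was taken after the corruption occurred, choose an older one from the full table instead.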

Verify the Rollback

After any rollback:

# Check all pods are running
kubectl get pods -n gospelib-production

# Hit health endpoints (-f makes curl exit non-zero on HTTP error responses)
curl -fsS https://api.gospelib.com/health
curl -fsS https://api.gospelib.com/ready

# Monitor logs for errors
kubectl logs -f -l app=gospelib-gateway -n gospelib-production --since=5m

Monitor Grafana dashboards for 15 minutes to confirm stability.
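
The 15-minute watch can be backed by a simple poller that stops at the first failed health check. This is a sketch that complements the dashboards rather than replacing them. The `watch_health` name, the 30-second interval, and the 10-second request timeout are assumptions.

```shell
# Hypothetical helper: poll both health endpoints for a fixed duration.
watch_health() {
  duration="${1:-900}"  # total seconds to watch (15 minutes by default)
  interval="${2:-30}"   # seconds between rounds of checks
  elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    for url in https://api.gospelib.com/health https://api.gospelib.com/ready; do
      # -f turns HTTP error responses into a non-zero exit status.
      if ! curl -fsS --max-time 10 -o /dev/null "$url"; then
        echo "FAILED: $url after ${elapsed}s" >&2
        return 1
      fi
    done
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "healthy for ${duration}s"
}
```

Running `watch_health` with no arguments covers the full 15 minutes. A non-zero exit status signals that the rollback did not stabilize the service.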