K0nsult — Disaster Recovery Plan

RPO (Recovery Point Objective): 24h (daily Fly.io PostgreSQL backup)
RTO (Recovery Time Objective): Redeploy from git repository
Redundancy: Two machines in same region
Backup Frequency: Daily (Fly.io automatic snapshots)

Backup Strategy

K0nsult relies on Fly.io's built-in backup infrastructure for all persistent data.

PostgreSQL: Fly.io performs automatic daily snapshots of the attached Postgres cluster. Snapshots are retained according to Fly.io's standard retention policy.
Application code: All source code is stored in a git repository. Any machine can be fully redeployed from git in under 10 minutes.
Static assets: All HTML, JS, and CSS files are part of the git repository and served from the application container.
Environment variables: Managed via fly secrets. A secure offline copy is maintained by 0n40i4.
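
The 24h RPO only holds while a recent snapshot actually exists. The sketch below checks a snapshot's age against that limit. The function name and the RPO_SECONDS variable are illustrative and not part of K0nsult's tooling; the snapshot timestamp has to be supplied by the operator as epoch seconds.

```shell
# Sketch: check that a snapshot is recent enough to satisfy the 24h RPO.
# RPO_SECONDS and snapshot_within_rpo are illustrative names (assumptions).
RPO_SECONDS=$((24 * 60 * 60))

# Usage: snapshot_within_rpo <snapshot_epoch_seconds> [now_epoch_seconds]
# Returns 0 if the snapshot is between 0 and 24 hours old, 1 otherwise.
snapshot_within_rpo() {
  local snapshot_epoch=${1:-0}
  local now_epoch=${2:-$(date +%s)}
  local age=$((now_epoch - snapshot_epoch))
  [ "$age" -ge 0 ] && [ "$age" -le "$RPO_SECONDS" ]
}
```

Snapshot creation times can be read from fly volumes snapshots list <volume-id>, after finding the volume ID with fly volumes list -a k0nsult-db.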

Failover Architecture

K0nsult runs with 2 machines in the same Fly.io region. If one machine fails, Fly.io automatically routes traffic to the surviving instance.

Fly.io's built-in proxy handles health-check-based routing
The /health endpoint is used for liveness checks
If both machines fail, Fly.io will attempt automatic restart
If the entire region is down, manual intervention is required (see Recovery Steps)
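
Automatic failover only helps while both machines are actually running. A minimal sketch for confirming that, assuming fly machine list --json prints one object per machine with a "state" field (the helper names are hypothetical):

```shell
# Count machines in the "started" state from `fly machine list --json` on stdin.
# Assumption: each machine object carries a single "state" field.
started_machine_count() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"started"' | wc -l | tr -d '[:space:]'
}

# Succeeds when at least two machines are started, which is the
# redundancy level this plan relies on.
has_redundancy() {
  [ "$(started_machine_count)" -ge 2 ]
}
```

Example: fly machine list -a k0nsult --json | has_redundancy || echo "fewer than 2 machines started"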

Recovery Steps

Follow these steps in order when a service disruption is detected:

Check Fly.io status

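Determine whether the issue is platform-wide or application-specific. Check the Fly.io status page (status.flyio.net) for platform incidents, then inspect the state of the app's machines: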

fly status -a k0nsult

Restart PostgreSQL

If the database is unresponsive, restart the Postgres cluster.

fly postgres restart -a k0nsult-db

Redeploy application

If the app container is corrupt or misconfigured, redeploy from the latest git commit. Run the command from a clean checkout, because fly deploy builds from the local working directory.

fly deploy -a k0nsult

Verify health endpoint

Confirm the application is responding correctly after recovery.

curl https://k0nsult.fly.dev/health

Expected response: {"ok":true,"status":"operational"}
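
After a restart or redeploy the endpoint may take a moment to come back, so a single curl can give a false alarm. The sketch below polls until the expected body appears. The URL and expected body are taken from this plan; the retry count and delay are illustrative defaults, and the function name is hypothetical.

```shell
# Poll the health endpoint until it returns the expected body or attempts run out.
HEALTH_URL="https://k0nsult.fly.dev/health"
EXPECTED='{"ok":true,"status":"operational"}'

# Usage: wait_for_health [attempts] [delay_seconds]
wait_for_health() {
  local attempts=${1:-10} delay=${2:-5} i=1 body
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -sS: quiet but report errors, --max-time: bound each request
    body=$(curl -fsS --max-time 5 "$HEALTH_URL" 2>/dev/null) || body=""
    if [ "$body" = "$EXPECTED" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

The comparison is an exact string match, so it will report a failure if the service ever changes the whitespace or field order of its response.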

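Important: If the database needs to be restored from a snapshot, expect up to 24 hours of data loss (RPO). Fly.io restores a volume snapshot into a new Postgres cluster (fly postgres create --snapshot-id <snapshot-id>) rather than in place, so the app must be pointed at the new cluster afterwards. Coordinate with 0n40i4 before restoring.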

Contact Escalation

If automated recovery fails or manual intervention is needed, contact the following in order:

0n40i4 (Tomasz Obara), System Owner / Primary Contact: gota@tlen.pl
Roxkon (Konrad Rycerz), Infrastructure / Secondary Contact: konradrycerz@gmail.com

Failure Scenarios

Single machine failure

Fly.io automatically reroutes to the second machine. No manual action needed. Monitor logs for root cause.

Database connection failure

Step 1: Check the cluster state with fly status -a k0nsult-db. Step 2: Restart Postgres with fly postgres restart -a k0nsult-db. Step 3: If the failure persists, check the connection string in fly secrets.
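
For the last step, note that fly secrets list shows secret names but not their values, so it can only confirm that a connection string is set. A small sketch, assuming the app reads its connection string from DATABASE_URL (the name fly postgres attach sets by default; this plan does not state which name K0nsult uses):

```shell
# Reads `fly secrets list` output on stdin and succeeds if a secret
# named DATABASE_URL is listed.
# Assumption: DATABASE_URL is the secret holding the connection string.
has_database_url_secret() {
  grep -qw 'DATABASE_URL'
}
```

Example: fly secrets list -a k0nsult | has_database_url_secret || echo "DATABASE_URL is not set"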

Full application crash

Run fly deploy to redeploy from git. If the latest commit is broken, deploy a known-good commit: fly deploy --image registry.fly.io/k0nsult:sha-xxxxxxx
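
Under incident pressure it is easy to paste the placeholder tag by mistake. The wrapper below is a hypothetical guard around the rollback command: it refuses an empty reference, the sha-xxxxxxx placeholder, and any image outside the k0nsult registry path before calling fly deploy.

```shell
# Deploy a previously built image instead of rebuilding from the working tree.
# deploy_known_good is an illustrative helper, not existing K0nsult tooling.
# Usage: deploy_known_good registry.fly.io/k0nsult:<tag>
deploy_known_good() {
  local image=${1:-}
  case "$image" in
    ""|*xxxxxxx*)
      echo "refusing to deploy: pass a real image reference, not the placeholder" >&2
      return 2
      ;;
    registry.fly.io/k0nsult:*)
      ;;
    *)
      echo "refusing to deploy: $image is not a k0nsult image" >&2
      return 2
      ;;
  esac
  fly deploy -a k0nsult --image "$image"
}
```

Image references for earlier releases can usually be looked up with fly releases -a k0nsult --image.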

Data corruption / accidental deletion

Restore from the latest Fly.io Postgres snapshot. Contact 0n40i4 immediately. Maximum data loss: 24 hours.
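
A restore is destructive enough to warrant reviewing the commands before running them. The sketch below only prints a candidate command sequence for a given snapshot ID. It assumes the snapshot is restored into a new cluster; k0nsult-db-restored is a hypothetical name, and the detach/attach sequence should be checked against the current Fly.io documentation before use.

```shell
# Print (do not run) the commands for restoring a snapshot into a new cluster.
# k0nsult-db-restored is a hypothetical name for the new Postgres app.
# Usage: print_restore_plan <snapshot-id>
print_restore_plan() {
  local snapshot_id=${1:-}
  if [ -z "$snapshot_id" ]; then
    echo "usage: print_restore_plan <snapshot-id>" >&2
    return 2
  fi
  cat <<EOF
fly postgres create --name k0nsult-db-restored --snapshot-id $snapshot_id
fly postgres detach k0nsult-db -a k0nsult
fly postgres attach k0nsult-db-restored -a k0nsult
EOF
}
```

Snapshot IDs are listed by fly volumes snapshots list <volume-id>; the volume ID comes from fly volumes list -a k0nsult-db.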

DNS / certificate failure

Check Fly.io certificate status: fly certs show k0nsult.fly.dev -a k0nsult. Fly.io renews certificates automatically via Let's Encrypt, so an expired certificate should normally resolve itself. To force reissuance: fly certs add k0nsult.fly.dev

DR Testing Schedule

Monthly: Verify /health endpoint responds from both machines
Quarterly: Simulate single machine failure by stopping one instance
Twice a year: Full recovery drill (deploy from scratch, restore DB snapshot, verify all endpoints)
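
A plain request to /health is answered by whichever machine the proxy picks, so it does not prove that both machines respond. One way to run the monthly check per machine is to pin each request with Fly.io's fly-force-instance-id request header. The function name below is hypothetical, and the machine IDs have to be supplied by the operator.

```shell
# Check /health on one specific machine by pinning the request to it.
# Usage: check_machine_health <machine-id>
check_machine_health() {
  local machine_id=${1:-} body
  if [ -z "$machine_id" ]; then
    echo "usage: check_machine_health <machine-id>" >&2
    return 2
  fi
  body=$(curl -fsS --max-time 5 \
    -H "fly-force-instance-id: $machine_id" \
    "https://k0nsult.fly.dev/health" 2>/dev/null) || body=""
  if [ "$body" = '{"ok":true,"status":"operational"}' ]; then
    echo "$machine_id ok"
  else
    echo "$machine_id FAILED" >&2
    return 1
  fi
}
```

Machine IDs are shown by fly machine list -a k0nsult. For the quarterly drill, one instance can be stopped with fly machine stop <machine-id> -a k0nsult and started again with fly machine start.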
