PSI DataSync — Disaster Recovery

This page covers recovery from the big, bad scenarios — the ones that don’t fit in the operations runbook. If the state index is gone, if Azure Files is corrupted, if the whole service needs to be rebuilt from scratch — read here.

RTO / RPO — pending formal definition

These numbers are proposed starting points. They must be confirmed with the customer and recorded against a contract before they can be treated as binding commitments. Plan task P4-5 tracks this.

Metric | Proposed | Rationale
RTO (time to restore service) | 2 hours | App Service + SQL restore typically 30-60 min; factoring buffer for validation and agent reconnect
RPO (max tolerable data loss) | 0 for files, 24 hours for state index | Measurement files themselves are replicated to Azure Files with LRS durability (≥ 11 nines). The state index is a derived secondary: it can be rebuilt from Azure Files contents via reconciliation.
Target for agent reconnect | 10 minutes from server recovery | Agent heartbeat interval is 60s; 10 min covers auto-recovery + clock skew

What is actually backed up

Critical data (customer-visible)

Dataset | Where | Backup | Recovery path
Intego measurement files | Azure Files psargostorage/argodatastore | LRS (3x locally redundant) + Azure Files snapshots daily | Point-in-time restore via portal; snapshots retained 7 days
Agent state on each ARGO | DPAPI-encrypted state.dat | None (ARGO local disk only) | Re-enroll via dashboard approval

Operational data (can be rebuilt if lost)

Dataset | Where | Backup | Recovery path
State index (synced_files, sync_runs, …) | Azure SQL DataSync | Automatic PITR for S0, 7 days retention | Restore to new DB, point App Service at it, reconcile drift from Azure Files
Agent logs (90 days) | Azure SQL agent_logs | Same PITR as above | Restored together with state index
Alert rules and history | Azure SQL (same) | Same PITR | Restored together
Agent release binaries | Azure Files /agent-releases/ | LRS + Azure Files snapshots | Same snapshot path as measurement data
App Service config (connection strings, keys) | Azure App Service config | Azure automatic (per-resource) | az webapp config backup/restore
Key Vault secrets | ps-datasync-kv (or equivalent) | Azure Key Vault soft-delete (90 days) | Portal restore
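
The PITR recovery path for the state index (restore to a new DB) could look roughly like the sketch below. It is not an established procedure for this service: the destination database name and the timestamp passed in are illustrative, and the server and resource group are simply the ones used elsewhere on this page. It defaults to a dry run that only prints the az commands.

```shell
# Sketch: PITR restore of the state index into a new database.
# ASSUMPTIONS: destination DB name and restore timestamp are supplied by the
# operator; nothing here has been rehearsed against the real environment.
export MSYS_NO_PATHCONV=1
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Show the oldest point in time the DataSync database can be restored to.
show_earliest_restore_point() {
  run az sql db show --name DataSync --server procserv-proddata \
    --resource-group ProcServices-Prod-Data \
    --query earliestRestoreDate --output tsv
}

# Restore DataSync as of a UTC timestamp ($1) into a new database ($2).
restore_state_index() {
  run az sql db restore --name DataSync --server procserv-proddata \
    --resource-group ProcServices-Prod-Data \
    --time "$1" --dest-name "$2"
}
```

Usage: call show_earliest_restore_point first, then restore_state_index <utc-timestamp> <new-db-name> with DRY_RUN=0 once the printed command looks right. Pointing the App Service at the restored database and reconciling drift remain separate manual steps.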

Not backed up (intentional)

  • API keys hashed in the runners table — only the hash is stored. If lost, agents must re-enroll.
  • SAS tokens — short-lived, regenerated on demand.
  • Per-agent DPAPI encryption keys — tied to each ARGO’s local machine account.

Recovery procedures

Scenario A: State index lost (Azure SQL unusable)

Trigger: Azure SQL DB DataSync is corrupted or deleted and cannot be recovered via PITR.

Impact: Agents continue heartbeating and syncing files to Azure Files, but the server can’t track what’s been synced. Sync runs fail because /compare and /complete need the state index. Sync eventually stops.

Recovery:

  1. Create a new DataSync database on procserv-proddata:
    MSYS_NO_PATHCONV=1 az sql db create --name DataSync --server procserv-proddata \
      --resource-group ProcServices-Prod-Data --service-objective S0 --max-size 250GB
  2. Let the server run its startup migrations:
    MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
    The startup sequence in src/server/index.js runs every migration idempotently (seedRunnerGroups, addHealthCheckColumns, addLastActivityAt, addRuntimeStatusColumns, addChecksumColumns), so the schema comes up clean.
  3. Re-seed runners and jobs from source control or dashboard — if the jobs table is empty, configure via the dashboard or bulk-insert from seed-jobs.sql.
  4. Reconcile each job against Azure Files to repopulate synced_files:
    POST /api/admin/runners/:runnerId/jobs/:jobId/reconcile
    
    Reconciliation enumerates what actually exists on Azure and writes the state index accordingly. Note: this generates real Azure Files list transactions and can take several minutes per job when a job has 20K+ files. Reconcile one job at a time and watch progress before starting the next.
  5. Verify agents reconnect — within 10 minutes, all six ARGOs should heartbeat again. If not, see ops runbook runner-offline section.
  6. Verify sync resumes: trigger a manual sync from the dashboard on one job and confirm that only new or changed files are transferred, i.e. that the delta is computed correctly against the new state index.
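
Step 4 can be scripted so that jobs are reconciled strictly one at a time. The following is a minimal sketch rather than existing tooling: BASE_URL, ADMIN_TOKEN, the bearer-token header, and the input format (one "runnerId jobId" pair per line) are all assumptions; only the endpoint path comes from this page. It defaults to a dry run that prints the requests it would make.

```shell
# Sketch: reconcile jobs sequentially (Scenario A, step 4).
# ASSUMPTIONS (not from this runbook): BASE_URL, ADMIN_TOKEN, bearer auth,
# and the "runnerId jobId" input format.
BASE_URL="${BASE_URL:-https://datasync.example.invalid}"
DRY_RUN="${DRY_RUN:-1}"

# Build the reconcile URL for one runner/job pair.
reconcile_url() {
  printf '%s/api/admin/runners/%s/jobs/%s/reconcile\n' "$BASE_URL" "$1" "$2"
}

# Read "runnerId jobId" pairs from stdin and process them in order.
reconcile_all() {
  local runner job url
  while read -r runner job; do
    # Skip blank lines and comment lines.
    case "$runner" in ''|'#'*) continue ;; esac
    url="$(reconcile_url "$runner" "$job")"
    if [ "$DRY_RUN" = "1" ]; then
      echo "POST $url"
    else
      # --fail returns non-zero on HTTP errors, so the loop stops early.
      curl --fail --silent --show-error -X POST \
        -H "Authorization: Bearer ${ADMIN_TOKEN:?set ADMIN_TOKEN}" \
        "$url" || return 1
      echo "requested reconcile: runner=$runner job=$job"
    fi
  done
}
```

Usage: reconcile_all < jobs.txt prints the planned requests; set DRY_RUN=0 to send them. This page does not say whether the endpoint returns before reconciliation finishes, so if it turns out to be asynchronous, wait for each job to complete before letting the loop continue.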

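Expected MTTR: 60-90 minutes (DB create: 5 min; startup + migrations: 1 min; reconcile: ~5 min per job × 20 jobs). Note that these itemized figures sum to about 106 minutes if every job takes the full 5 minutes and jobs run serially, so the 60-90 minute range assumes most reconciles finish faster. Treat it as unverified until the drill produces real numbers.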
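
As a quick check of the estimate against the proposed RTO, the arithmetic can be written out. The inputs are the figures quoted on this page; the only assumption is that jobs are reconciled serially, as step 4 advises.

```shell
# Sketch: sum the itemized Scenario A estimates (minutes) and compare the
# result to the proposed RTO. Assumes strictly serial reconciliation.
db_create_min=5
startup_min=1
reconcile_min_per_job=5
job_count=20
rto_min=120   # proposed RTO of 2 hours

mttr_serial_min=$(( db_create_min + startup_min + reconcile_min_per_job * job_count ))
headroom_min=$(( rto_min - mttr_serial_min ))

echo "serial estimate: ${mttr_serial_min} min (proposed RTO: ${rto_min} min)"
echo "headroom: ${headroom_min} min"
```

The serial estimate comes to 106 minutes, leaving 14 minutes of headroom. That figure excludes re-seeding runners and jobs (step 3), the agent reconnect window, and the final verification, so the margin against a 2-hour RTO is thin.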

Scenario B: App Service unusable (deploy rollback / infra failure)

Trigger: App Service psi-datasync is responding 5xx, the App Service itself won’t start, or a bad deploy broke something and rollback doesn’t help.

Impact: Agents can’t compare manifests or report completion. Files still land in Azure Files (direct SMB) but state index isn’t updated. On the next heartbeat, agents log “server unreachable” but continue trying.

Recovery:

  1. Check deployment history:
    MSYS_NO_PATHCONV=1 az webapp log deployment list --name psi-datasync \
      --resource-group PS-WEBAPPS -o table
  2. Roll back via zip redeploy — find the previous successful deploy.zip from CI and redeploy:
    MSYS_NO_PATHCONV=1 az webapp deployment source config-zip \
      --name psi-datasync --resource-group PS-WEBAPPS --src <prev-deploy.zip>
  3. If the App Service is dead at the infrastructure level, redeploy from source:
    cd C:/git/PSISync
    git checkout <known-good-commit>
    # Use the GitHub Actions workflow or manual az webapp deploy
  4. Database connection survives — the DataSync SQL DB is on a different server (procserv-proddata), so App Service restoration doesn’t require DB changes.

Expected MTTR: 20-30 minutes for rollback; 60 minutes for full rebuild from source.
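
After a rollback or rebuild it helps to confirm that the 5xx responses have actually stopped. The sketch below polls the site until it does. APP_URL is hypothetical, since this page gives neither the public URL nor a dedicated health endpoint, and "any status below 500" is an assumed definition of recovered.

```shell
# Sketch: wait until the App Service stops returning 5xx after a rollback.
# ASSUMPTION: APP_URL is a placeholder; substitute the real site URL.
APP_URL="${APP_URL:-https://psi-datasync.example.invalid/}"

# Succeed for any HTTP status below 500. curl reports 000 when it got no
# response at all, which counts as unhealthy.
is_healthy_status() {
  case "$1" in
    000|5[0-9][0-9]) return 1 ;;
    [1-4][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll APP_URL up to $1 times, $2 seconds apart.
wait_until_healthy() {
  local attempts="$1" delay="$2" code="" i=1
  while [ "$i" -le "$attempts" ]; do
    code="$(curl --silent --output /dev/null --max-time 10 \
      --write-out '%{http_code}' "$APP_URL")"
    if is_healthy_status "$code"; then
      echo "healthy after $i attempt(s): HTTP $code"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "still unhealthy after $attempts attempts (last: HTTP $code)" >&2
  return 1
}
```

Usage: wait_until_healthy 30 10 polls for up to about five minutes. A 401 or 403 counts as healthy here because it shows the app is answering; tighten the check if a proper health endpoint exists.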

Scenario C: Azure Files share lost

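Trigger: Accidental share deletion, catastrophic storage account compromise, or a datacenter-level failure (LRS keeps all three replicas within a single datacenter, so it does not protect against losing that datacenter).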

Impact: Years of accumulated measurement data potentially unrecoverable. This is the customer-critical scenario.

Recovery:

  1. DO NOT modify anything yet — call the customer immediately and escalate to Azure support. First-response matters.
  2. Check for share snapshots with a read-only listing (az storage share snapshot creates a new snapshot rather than listing existing ones, so do not use it here):
    MSYS_NO_PATHCONV=1 az storage share list --account-name psargostorage \
      --include-snapshots --query "[?name=='argodatastore']" --output table
  3. Restore from latest snapshot via portal (Shares → argodatastore → Snapshots → Restore).
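  4. If snapshots are gone, open a support ticket. If soft delete was enabled on the storage account, a deleted share (together with its snapshots) can still be restored within the configured retention period.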
  5. Post-recovery reconciliation — run the reconciliation endpoint for every job to rebuild the state index from whatever survived.

Expected MTTR: 4+ hours (snapshot restore is bounded by total data size).
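
When several snapshots exist, step 3 needs the most recent one. A small helper for picking it from a list of snapshot timestamps might look like this. It assumes one ISO 8601 UTC timestamp per line, all in the same format, so that plain lexicographic order matches chronological order.

```shell
# Sketch: pick the most recent snapshot from a list of timestamps on stdin.
# ASSUMPTION: identically formatted ISO 8601 UTC timestamps, one per line.
latest_snapshot() {
  # Drop blank lines, sort bytewise, keep the last (newest) entry.
  grep -v '^[[:space:]]*$' | LC_ALL=C sort | tail -n 1
}
```

Usage: pipe the snapshot timestamps from the share listing into latest_snapshot. Which field of the az output carries the timestamp should be confirmed against real output before relying on this.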

Scenario D: Agent loses DPAPI state (machine replacement / disk failure)

Trigger: ARGO PC is replaced, wiped, or the disk fails. state.dat is gone.

Impact: That ARGO’s existing registration is invalid. Must re-enroll.

Recovery (per ARGO):

  1. Install the agent on the new/rebuilt ARGO (see ops runbook release promotion).
  2. The agent automatically registers and appears in the dashboard as pending.
  3. Approve via dashboard → provision new API key and SAS.
  4. Re-pair to the correct runner_id — if the ARGO’s RunnerId is fixed in appsettings.json (per the current deploy-agent-all.ps1 flow), it’ll come back as the same logical runner. State index entries for that runner are unaffected.
  5. On first sync cycle, agent may re-copy recent files the state index didn’t know were already there — Azure Files dedup handles this.

Expected MTTR: 15-30 minutes per ARGO.

Scenario E: Database quota exhausted (re-occurrence of 2026-04-23)

See ops runbook — Database full. Included here because it’s a mini-DR scenario that’s happened before.

Dependencies and failure cascade map

         Azure SQL (DataSync)         Azure Files (argodatastore)
                │                              │
                │                              │
                ▼                              ▼
          App Service (psi-datasync)
                │
         ┌──────┼──────────┐
         ▼      ▼          ▼
       ARGO1  ARGO2  ...  ARGO6
  • Azure SQL DataSync down → App Service returns 5xx → agents can’t compare or complete, but still copy files to Azure Files. No data loss, but the state index drifts.
  • Azure Files down → agents can’t copy; they retry. Data loss only if source ARGOs also fail before Azure Files recovers (very unlikely).
  • App Service down → agents can’t reach the server but keep retrying. Files continue to land in Azure Files via direct SMB. The state index drifts.
  • All ARGOs down → no new data generated; old data in Azure Files is unaffected.

Key insight: The measurement files themselves — the only truly customer-critical data — are written directly to Azure Files by the agent. They don’t flow through the App Service or Azure SQL. So most of the failure modes above create operational pain but not customer data loss.

Rehearsal schedule

Plan task P4-4 covers the initial drill. Document real MTTR numbers from that drill here afterward.

Recommended cadence once the drill establishes a baseline:

  • Quarterly — Scenario A (state index restore) rehearsed on a non-prod copy.
  • Semi-annually — Scenario B (App Service rebuild) rehearsed.
  • Annually — Scenario C (Azure Files share) tabletop only (real restore is too disruptive to practice).

Not-yet-implemented improvements

From the 2026-04-22 remediation plan, these are known gaps that affect recoverability but haven’t been addressed yet:

  • Auto-update (P1-4) — if an agent version has a bug that breaks sync, rolling it back requires manual TeamViewer on six machines. Auto-update would make rollback a config flag.
  • Correlation ID across /compare and /complete (P2-3) — after an agent restart mid-run, we can’t currently reconcile which compare maps to which completion. A correlation ID would let us recover orphaned runs more accurately.
  • Full integration tests for core sync path (P3-1) — gives us confidence that a DR rebuild actually works before we need it.

Contacts

See the ops runbook — Contacts. For DR specifically:

  • Azure support — open via portal; we have a Standard support plan (verify on the subscription).
  • Customer: TBD per plan task P4-5. For data-critical incidents, the customer must be informed within 1 hour of detection.