This page covers recovery from the big, bad scenarios — the ones that don’t fit in the operations runbook. If the state index is gone, if Azure Files is corrupted, if the whole service needs to be rebuilt from scratch — read here.
RTO / RPO — pending formal definition
These numbers are proposed starting points. They must be confirmed with the customer and recorded against a contract before they can be treated as binding commitments. Plan task P4-5 tracks this.
| Metric | Proposed | Rationale |
| --- | --- | --- |
| RTO (time to restore service) | 2 hours | App Service + SQL restore typically 30-60 min; factoring buffer for validation and agent reconnect |
| RPO (max tolerable data loss) | 0 for files, 24 hours for state index | Measurement files themselves are replicated to Azure Files with LRS durability (≥ 11 nines). The state index is a derived secondary: it can be rebuilt from Azure Files contents via reconciliation. |
| Target for agent reconnect | 10 minutes from server recovery | Agent heartbeat interval is 60s; 10 min covers auto-recovery + clock skew |
Point-in-time restore via portal; snapshots retained 7 days
| Agent state on each ARGO | DPAPI-encrypted state.dat | None (ARGO local disk only) | Re-enroll via dashboard approval |
Operational data (can be rebuilt if lost)
| Dataset | Where | Backup | Recovery path |
| --- | --- | --- | --- |
| State index (synced_files, sync_runs, …) | Azure SQL DataSync | Automatic PITR for S0, 7-day retention | Restore to new DB, point App Service at it, reconcile drift from Azure Files |
| Agent logs (90 days) | Azure SQL agent_logs | Same PITR as above | Restored together with state index |
| Alert rules and history | Azure SQL (same) | Same PITR | Restored together |
| Agent release binaries | Azure Files /agent-releases/ | LRS + Azure Files snapshots | Same snapshot path as measurement data |
| App Service config (connection strings, keys) | Azure App Service config | Azure automatic (per-resource) | az webapp config backup/restore |
| Key Vault secrets | ps-datasync-kv (or equivalent) | Azure Key Vault soft-delete (90 days) | Portal restore |
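The "restore to new DB" path for the state index could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: it assumes `az sql db restore` with `--dest-name` and `--time`, reuses the server and resource group named in Scenario A, and treats the destination database name and the restore timestamp as placeholders you supply. It only prints the command unless `DRY_RUN=0` is set.

```shell
# Sketch: point-in-time restore of the DataSync database into a NEW database.
# Assumptions (not from this page): the destination name and the restore
# timestamp are placeholders; server/resource group are those used in Scenario A.

# Print the restore command for a given UTC timestamp and destination name.
pitr_restore_cmd() {
  local restore_time="$1"   # UTC timestamp inside the 7-day PITR window
  local dest_name="$2"      # hypothetical name for the restored database
  printf '%s\n' "az sql db restore --name DataSync --server procserv-proddata --resource-group ProcServices-Prod-Data --dest-name ${dest_name} --time ${restore_time}"
}

# Run the command only when DRY_RUN=0; otherwise just show it.
run_pitr_restore() {
  local cmd
  cmd="$(pitr_restore_cmd "$1" "$2")"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    MSYS_NO_PATHCONV=1 bash -c "$cmd"
  else
    echo "DRY RUN: $cmd"
  fi
}
```

For example, `run_pitr_restore 2026-04-23T06:00:00Z DataSync-restored` (both values illustrative) prints the command for review. Restoring into a new database rather than over the existing one keeps the damaged copy available for inspection.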
Not backed up (intentional)
API keys hashed in the runners table — only the hash is stored. If lost, agents must re-enroll.
SAS tokens — short-lived, regenerated on demand.
Per-agent DPAPI encryption keys — tied to each ARGO’s local machine account.
Recovery procedures
Scenario A: State index lost (Azure SQL unusable)
Trigger: Azure SQL DB DataSync is corrupted, deleted, or otherwise unrecoverable from PITR.
Impact: Agents continue heartbeating and syncing files to Azure Files, but the server can’t track what’s been synced. Sync runs fail because /compare and /complete need the state index. Sync eventually stops.
Recovery:
Create a new DataSync database on procserv-proddata:
MSYS_NO_PATHCONV=1 az sql db create --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --service-objective S0 --max-size 250GB
Let the server run its startup migrations:
MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
The startup sequence in src/server/index.js runs every migration idempotently (seedRunnerGroups, addHealthCheckColumns, addLastActivityAt, addRuntimeStatusColumns, addChecksumColumns), so the schema comes up clean.
Re-seed runners and jobs from source control or dashboard — if the jobs table is empty, configure via the dashboard or bulk-insert from seed-jobs.sql.
Reconcile each job against Azure Files to repopulate synced_files:
POST /api/admin/runners/:runnerId/jobs/:jobId/reconcile
Reconciliation enumerates what actually exists on Azure and writes the state index accordingly. Note: this generates real Azure Files list transactions and can run for minutes per job with 20K+ files. Scope to one job at a time, watch progress.
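The "one job at a time" advice above can be sketched as a small sequential loop. Everything beyond the endpoint path is an assumption for illustration: the `BASE_URL` value, the `X-Admin-Key` header, and the "runnerId jobId" input format are hypothetical, and the sketch does not know whether the endpoint returns before or after reconciliation finishes. By default it only prints the requests it would send.

```shell
# Sketch: reconcile jobs one at a time against the endpoint named above.
# Assumptions (hypothetical, not from this page): BASE_URL, the X-Admin-Key
# auth header, and the "runnerId jobId" pair list format.

BASE_URL="${BASE_URL:-https://psi-datasync.example.invalid}"

# Build the reconcile URL for one runner/job pair.
reconcile_url() {
  printf '%s/api/admin/runners/%s/jobs/%s/reconcile\n' "$BASE_URL" "$1" "$2"
}

# Read "runnerId jobId" pairs from stdin and reconcile them sequentially.
# With DRY_RUN=1 (the default) the requests are only printed.
reconcile_all() {
  local runner_id job_id url
  while read -r runner_id job_id; do
    # Skip blank or incomplete lines.
    [ -n "$runner_id" ] && [ -n "$job_id" ] || continue
    url="$(reconcile_url "$runner_id" "$job_id")"
    if [ "${DRY_RUN:-1}" = "0" ]; then
      # --fail makes curl return non-zero on HTTP errors; stop at the first failure.
      curl --fail --silent --show-error -X POST \
        -H "X-Admin-Key: ${ADMIN_KEY:?set ADMIN_KEY}" "$url" || return 1
    else
      echo "POST $url"
    fi
  done
}
```

Feeding it pairs with `printf '1 10\n2 20\n' | reconcile_all` (IDs illustrative) lists the calls in order. Stopping at the first failure keeps Azure Files list transactions bounded while you investigate.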
Scenario B: App Service unusable (deploy rollback / infra failure)
Trigger: App Service psi-datasync is responding 5xx, the App Service itself won’t start, or a bad deploy broke something and rollback doesn’t help.
Impact: Agents can’t compare manifests or report completion. Files still land in Azure Files (direct SMB) but state index isn’t updated. On the next heartbeat, agents log “server unreachable” but continue trying.
Recovery:
Check deployment history:
MSYS_NO_PATHCONV=1 az webapp deployment list --name psi-datasync \
  --resource-group PS-WEBAPPS -o table
Roll back via zip redeploy — find the previous successful deploy.zip from CI and redeploy:
If the App Service is dead at the infrastructure level, redeploy from source:
cd C:/git/PSISync
git checkout <known-good-commit>
# Use the GitHub Actions workflow or manual az webapp deploy
Database connection survives — the DataSync SQL DB is on a different server (procserv-proddata), so App Service restoration doesn’t require DB changes.
Expected MTTR: 20-30 minutes for rollback; 60 minutes for full rebuild from source.
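The zip-redeploy step above does not show its command on this page. One way to do it, assuming the known-good deploy.zip has already been downloaded from CI to a local path, is `az webapp deploy` with `--type zip`. The sketch below is an illustration under that assumption, not the team's confirmed procedure; it prints the command unless `DRY_RUN=0` is set.

```shell
# Sketch: redeploy a previously built artifact to the App Service.
# Assumption: the known-good deploy.zip has already been downloaded from CI
# to a local path; the path argument is a placeholder.

redeploy_zip() {
  local zip_path="$1"
  # Refuse to continue if the artifact is missing.
  if [ ! -f "$zip_path" ]; then
    echo "artifact not found: $zip_path" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "0" ]; then
    MSYS_NO_PATHCONV=1 az webapp deploy --name psi-datasync \
      --resource-group PS-WEBAPPS --src-path "$zip_path" --type zip
  else
    echo "DRY RUN: az webapp deploy --name psi-datasync --resource-group PS-WEBAPPS --src-path $zip_path --type zip"
  fi
}
```

Checking that the file exists first avoids a confusing CLI error in the middle of an incident.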
Scenario C: Azure Files share lost
Trigger: Accidental share deletion, catastrophic storage account compromise, or loss of the single datacenter that holds all LRS replicas.
Impact: Years of accumulated measurement data potentially unrecoverable. This is the customer-critical scenario.
Recovery:
DO NOT modify anything yet — call the customer immediately and escalate to Azure support. First-response matters.
Check for share snapshots:
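# List existing snapshots of the share (list-handles only shows open file handles, not snapshots)
MSYS_NO_PATHCONV=1 az storage share list --account-name psargostorage \
  --include-snapshots --query "[?name=='argodatastore']" --output table
# Take a new snapshot of the share's current state (only possible if the share still exists)
MSYS_NO_PATHCONV=1 az storage share snapshot --account-name psargostorage --name argodatastore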
Restore from latest snapshot via portal (Shares → argodatastore → Snapshots → Restore).
If snapshots are gone, open a support ticket — Azure retains soft-deleted data for a short window.
Post-recovery reconciliation — run the reconciliation endpoint for every job to rebuild the state index from whatever survived.
Expected MTTR: 4+ hours (snapshot restore is bounded by total data size).
Scenario D: An agent loses its DPAPI state (machine replacement / disk failure)
Trigger: ARGO PC is replaced, wiped, or the disk fails. state.dat is gone.
Impact: That ARGO’s existing registration is invalid, so the machine must re-enroll.
Recovery:
The agent automatically registers and appears in the dashboard as pending.
Approve via dashboard → provision new API key and SAS.
Re-pair to the correct runner_id — if the ARGO’s RunnerId is fixed in appsettings.json (per the current deploy-agent-all.ps1 flow), it’ll come back as the same logical runner. State index entries for that runner are unaffected.
On first sync cycle, agent may re-copy recent files the state index didn’t know were already there — Azure Files dedup handles this.
Expected MTTR: 15-30 minutes per ARGO.
Scenario E: Database quota exhausted (re-occurrence of 2026-04-23)
Azure SQL DataSync down → App Service returns 5xx → agents can’t compare or complete, but still copy files to Azure Files. No data loss, state index drift.
Azure Files down → agents can’t copy; they retry. Data loss only if source ARGOs also fail before Azure Files recovers (very unlikely).
App Service down → agents can’t talk to server but still retry. Files continue to land in Azure Files via direct SMB. State index drift.
All ARGOs down → no new data generated; old data in Azure Files is unaffected.
Key insight: The measurement files themselves — the only truly customer-critical data — are written directly to Azure Files by the agent. They don’t flow through the App Service or Azure SQL. So most of the failure modes above create operational pain but not customer data loss.
Rehearsal schedule
Plan task P4-4 covers the initial drill. Document real MTTR numbers from that drill here afterward.
Recommended cadence once the drill establishes a baseline:
Quarterly — Scenario A (state index restore) rehearsed on a non-prod copy.
Semi-annually — Scenario B (App Service rebuild) rehearsed.
Annually — Scenario C (Azure Files share) tabletop only (real restore is too disruptive to practice).
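When recording drill results, a small helper can make the comparison against the target explicit. This is a sketch under stated assumptions: times are whole minutes, and the default of 120 is the proposed (not yet contractual) 2-hour RTO from the table at the top of this page.

```shell
# Sketch: compare a drill's measured recovery time against a target.
# Assumption: times are recorded as whole minutes; 120 is the proposed
# (not yet contractual) 2-hour RTO.

# Prints whether the measured time is within or over the target, with the margin.
check_mttr() {
  local measured_min="$1" target_min="${2:-120}"
  if [ "$measured_min" -le "$target_min" ]; then
    echo "within target by $((target_min - measured_min)) min"
  else
    echo "over target by $((measured_min - target_min)) min"
  fi
}
```

For instance, `check_mttr 95` prints `within target by 25 min`, while `check_mttr 240` (Scenario C's lower bound of 4 hours) prints `over target by 120 min`.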
Not-yet-implemented improvements
From the 2026-04-22 remediation plan, these are known gaps that affect recoverability but haven’t been addressed yet:
Auto-update (P1-4) — if an agent version has a bug that breaks sync, rolling it back requires manual TeamViewer on six machines. Auto-update would make rollback a config flag.
Correlation ID across /compare and /complete (P2-3) — after an agent restart mid-run, we can’t currently reconcile which compare maps to which completion. A correlation ID would let us recover orphaned runs more accurately.
Full integration tests for core sync path (P3-1) — gives us confidence that a DR rebuild actually works before we need it.