PSI DataSync — Operations Runbook

This is the on-call companion to the DataSync overview. If an ARGO stops syncing, if alerts fire, or if the dashboard shows red, start here.

For architecture, data model, and “how does it work” questions, read the overview first. This page is specifically about what to do when something is wrong.

Sync Fidelity Audit — the “is everything actually working?” check

The Sync Fidelity Audit is the named, repeatable end-to-end health check. Run it after any deploy, any install, or any incident to get a definitive yes/no on whether the system is healthy top-to-bottom.

# One-liner from any workstation with curl + jq + the CI secret
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh
 
# Require all agents to be at or above a specific version (post-rollout gate)
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh --min-version 1.0.93.0
 
# Dashboard equivalent: GET /api/admin/fidelity-audit (Bearer or X-CI-Secret auth)

What it checks

Check	Validates
`AGENT_ALIVE`	Each active runner heartbeated within the last 10 min.
`AGENT_VERSION`	(When `--min-version` is set) Every runner is at or above the required version.
`JOBS_SCHEDULED`	Every enabled job has a sync run within 2× its schedule window.
`JOBS_SUCCEEDED`	Latest run per job is `success`, `partial`, or `running` — NOT `failed`.
`NO_STALE_ORPHANS`	Zero sync runs stuck at `running` beyond the 30-min orphan threshold.
`CHECKSUMS_POPULATED`	Recent `sync_run_files.checksum` rows are populated — proves the P0-1 integrity verification is live on current agents.
`STATUS_HEARTBEAT_RICH`	At least one heartbeat in the lookback window carried `active_jobs` — proves P2-1 real-status reporting is live.
`NO_ACTIVE_ALERTS`	Zero unresolved critical alerts on the dashboard.
`BACKUP_AT_RISK`	No geometry job has a failed latest sync with an unrecovered backlog. A fail here means source files may be purged by the ARGO before they reach Azure; see `backup_at_risk` (data-loss risk) under Incident playbook.
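
Two of the checks above reduce to simple comparisons. The sketch below illustrates them under stated assumptions: it is not the audit's actual code (which this runbook does not show), and the function names `version_ge` and `job_on_schedule` are hypothetical. Versions are compared numerically per component so that 1.0.100.0 ranks above 1.0.93.0.

```shell
#!/bin/sh
# Sketch only: illustrates AGENT_VERSION and JOBS_SCHEDULED, not the real audit code.

# version_ge A B: succeed when dotted numeric version A >= B,
# comparing component by component (missing components count as 0).
version_ge() {
  a="$1"; b="$2"
  while [ -n "$a" ] || [ -n "$b" ]; do
    # Take the first dotted component of each, defaulting to 0.
    x="${a%%.*}"; y="${b%%.*}"
    [ -n "$x" ] || x=0
    [ -n "$y" ] || y=0
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
    # Drop the component just compared.
    case "$a" in *.*) a="${a#*.}" ;; *) a="" ;; esac
    case "$b" in *.*) b="${b#*.}" ;; *) b="" ;; esac
  done
  return 0  # every component was equal
}

# job_on_schedule LAST_RUN_EPOCH NOW_EPOCH WINDOW_SECONDS:
# succeed when the last run is no older than 2x the schedule window.
job_on_schedule() {
  [ $(( $2 - $1 )) -le $(( 2 * $3 )) ]
}
```

For example, `version_ge 1.0.92.9 1.0.93.0` fails, which is the condition the `--min-version` gate is meant to catch after a rollout.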

Exit codes

0 — all checks pass or warn (safe to proceed with deploys)
1 — one or more checks failed (do not proceed until you have read the report; the automated CI deploy gate only warns on this code, because failed checks are usually fleet-side state, not a broken deploy)
2 — the audit itself failed (network, auth, server error); this is the only code the CI deploy gate fails the deploy on
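
The mapping from exit code to CI behavior can be sketched as a small wrapper. This is a minimal illustration, not the actual pipeline step: `gate_action` is a hypothetical helper, and treating any unlisted exit code as a failure is an assumption.

```shell
#!/bin/sh
# gate_action CODE: map a fidelity-audit exit code to the CI behavior described above.
gate_action() {
  case "$1" in
    0) echo "proceed" ;;  # all checks passed or warned
    1) echo "warn" ;;     # checks failed: surface the report, do not fail the deploy
    *) echo "fail" ;;     # 2 (audit itself broke); unknown codes are treated the same (assumption)
  esac
}

# Usage in a pipeline step (not executed here):
#   rc=0
#   scripts/fidelity-audit.sh --min-version 1.0.93.0 || rc=$?
#   [ "$(gate_action "$rc")" = "fail" ] && exit 1
```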

When to run

Immediately after any installer rollout (with --min-version set to the new version).
As a post-deploy gate in CI after server changes.
Daily from a scheduled task to catch drift.
During any incident as the definitive “is it fixed?” check.

Quick health check (first 60 seconds)

# 1. Is the server up?
curl -s https://datasync.progressivesurface.com/api/health
# Expect: {"status":"healthy","database":"connected", ...}
 
# 2. Are all six ARGOs heartbeating?
curl -s -H "X-CI-Secret: $DATASYNC_CI_UPLOAD_SECRET" \
  https://datasync.progressivesurface.com/api/diag/fleet | jq '.runners[].last_heartbeat'
# Expect: ISO timestamps within the last 5 minutes
 
# 3. Any unresolved alerts?
# Dashboard → Alerts tab:
# https://datasync.progressivesurface.com → login → Alerts

If (1) is not 200 or database is not connected, go to Server down below. If an ARGO’s last heartbeat is older than 10 minutes, go to Runner offline. If there are active critical alerts, the alert detail tells you which ARGO + job.
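
The triage rules in this paragraph can be sketched as two helpers. This is an illustration under assumptions: the function names are hypothetical, the "late" label for the 5 to 10 minute band is not a term the system uses, and the body match assumes the compact JSON shown in the expected output.

```shell
#!/bin/sh
# health_ok HTTP_CODE BODY: succeed only for a 200 whose body reports a connected database.
health_ok() {
  [ "$1" = "200" ] || return 1
  case "$2" in
    *'"database":"connected"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# heartbeat_state AGE_SECONDS: classify a runner by seconds since its last heartbeat.
heartbeat_state() {
  if [ "$1" -le 300 ]; then
    echo "ok"        # within the expected 5 minutes
  elif [ "$1" -le 600 ]; then
    echo "late"      # overdue, but not yet past the 10-minute line
  else
    echo "offline"   # older than 10 minutes: go to Runner offline
  fi
}

# Usage (not executed here): capture the status code and body separately.
#   code=$(curl -s -o /tmp/health.json -w '%{http_code}' \
#     https://datasync.progressivesurface.com/api/health)
#   health_ok "$code" "$(cat /tmp/health.json)" || echo "go to Server down"
```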

Incident playbook

`backup_at_risk` (data-loss risk)

This is the highest-priority alert. The ARGO Intego measurement PCs cap their local storage at a fixed file count and auto-purge the oldest files. DataSync exists to archive that data to Azure before the ARGO purges it. A backup_at_risk alert means a geometry job’s most recent sync failed, left a real backlog of un-synced files, and has not recovered for over a full schedule cycle — so those files are racing the ARGO’s purge clock.

The alert fires critical (Teams + email) and also shows as a BACKUP_AT_RISK fail in the Fidelity Audit.

Respond immediately:

Identify the job — the alert message names the ARGO + job and the pending file count.
Find out why the sync is failing — check that job’s recent runs on the dashboard, and the agent logs (GET /api/admin/logs?runnerId=<id>&level=Warning,Error). Common causes: server 5xx (check server health first), the Intego source share unreachable, or Azure Files auth.
Once the underlying fault is cleared, the agent re-syncs on its next cycle; the alert auto-resolves when a successful run lands. To force it, trigger the job from the dashboard (job → Sync now).
Confirm nothing was lost: run the direct source-vs-Azure audit (scripts/audit-argo-pc.ps1 on the ARGO PC). If the ARGO already purged un-synced files, they are recoverable only if a copy still exists somewhere on the source.

The check is idle-safe: a geometry job with no production (e.g. a weekend) produces no runs at all and is never flagged. Only a failed sync with a backlog trips it.
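
The three conditions behind the alert, and why an idle job never trips it, can be expressed as one predicate. This is a sketch of the rule as described in this section, not the server's implementation; `at_risk` and its argument order are hypothetical.

```shell
#!/bin/sh
# at_risk STATUS PENDING_FILES SECONDS_SINCE_FAILURE SCHEDULE_SECONDS
# Succeeds only when all three conditions from the text hold.
at_risk() {
  [ "$1" = "failed" ] || return 1   # the latest sync must have failed
  [ "$2" -gt 0 ] || return 1        # and left a real backlog of un-synced files
  [ "$3" -gt "$4" ]                 # with no recovery for over one full schedule cycle
}
```

An idle job has no latest run at all, so the status is empty and the first test fails: `at_risk "" 0 999999 3600` is false no matter how long the job has been quiet.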

Server down (health endpoint not 200)

Check App Service is running:

MSYS_NO_PATHCONV=1 az webapp show --name psi-datasync --resource-group PS-WEBAPPS \
  --query "{state:state, enabled:enabled}" -o table

Tail live logs for a fatal error on startup:

MSYS_NO_PATHCONV=1 az webapp log tail --name psi-datasync --resource-group PS-WEBAPPS

If the log shows a database connection failure, check the Azure SQL server and the managed-identity auth:

MSYS_NO_PATHCONV=1 az sql db show --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --query "status" -o tsv
# Expect: Online

If the log shows “database has reached its size quota” → jump to Database full below.

Last resort: restart the App Service. Safe, ~30s downtime:

MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS

Database full

Known failure mode — the 2 GB Basic-tier cap was hit on 2026-04-23. Tier is now S0 (250 GB cap) with the retention sweeper running daily; this shouldn’t recur organically, but:

Symptoms: Every INSERT returns RequestError: The database 'DataSync' has reached its size quota. Agent log ingestion rolls back. Sync /complete fails so state index drifts.

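Immediate mitigation (one-liner that raises the size quota so INSERTs succeed again; it does not reclaim space). S0 is capped at 250 GB per the note above, so a 500 GB max size will likely also require a higher service tier: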

MSYS_NO_PATHCONV=1 az sql db update --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --max-size 500GB

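Root cause fix: force a retention sweep. Note that the command below deletes agent logs older than 30 days, which is tighter than the 90-day agent_logs retention listed under Where logs live; adjust the cutoff if you need the full window: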

# Connect with sqlcmd using Azure AD auth and delete beyond retention:
sqlcmd -S procserv-proddata.database.windows.net -d DataSync -G \
  -Q "DELETE FROM agent_logs WHERE timestamp < DATEADD(DAY, -30, GETUTCDATE())"
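
A single large DELETE runs as one transaction, which can be slow on a big table. A batched variant is one option; the sketch below only generates the T-SQL and is an assumption-laden illustration, not the sweeper's logic. `build_purge_sql` and the batch size are hypothetical; the table and column names come from the command above.

```shell
#!/bin/sh
# build_purge_sql DAYS BATCH: print T-SQL that deletes agent_logs rows older than
# DAYS days, BATCH rows at a time, until a pass deletes nothing.
build_purge_sql() {
  days="$1"; batch="$2"
  # Accept only plain non-negative integers so nothing else can reach the SQL text.
  for n in "$days" "$batch"; do
    case "$n" in
      ''|*[!0-9]*) echo "days and batch must be non-negative integers" >&2; return 1 ;;
    esac
  done
  cat <<SQL
SET NOCOUNT ON;
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
  DELETE TOP (${batch}) FROM agent_logs
  WHERE timestamp < DATEADD(DAY, -${days}, GETUTCDATE());
  SET @rows = @@ROWCOUNT;
END
SQL
}

# Usage (not executed here):
#   sqlcmd -S procserv-proddata.database.windows.net -d DataSync -G \
#     -Q "$(build_purge_sql 30 5000)"
```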

Then restart the App Service once the quota is cleared:

MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS

Runner offline (single ARGO not heartbeating)

Confirm the ARGO PC is up — ping it, TeamViewer in, or check the shop-floor UPS monitoring.
Is the Windows service running on that ARGO?
```
# On the ARGO PC
Get-Service PSIDataSync
```
- If Stopped: Start-Service PSIDataSync, then check Event Viewer → Application for why it stopped.
- If Running but no heartbeat: check outbound connectivity:
```
Test-NetConnection datasync.progressivesurface.com -Port 443
```
Check the agent’s local log on the ARGO:
```
C:\ProgramData\PSI\DataSync\logs\agent-<today>.log
```
Look for Heartbeat failed, 401, DataSync:ServerUrl, or auth-related errors.
If the service crashes repeatedly on startup, the DPAPI state may be corrupted:
```
Stop-Service PSIDataSync
Remove-Item "C:\ProgramData\PSI\DataSync\state.dat" -Force
Start-Service PSIDataSync
```
The agent will re-enroll and appear as pending in the dashboard — approve it to restore service.

Sync stalled (runner online, but no progress)

The agent heartbeats with status: running but /complete never fires, or runs are stuck at running:

Check the orphan sweeper is active. It logs when it closes stale runs, so a healthy system shows entries daily at most:
```
MSYS_NO_PATHCONV=1 az webapp log download --name psi-datasync --resource-group PS-WEBAPPS --log-file /tmp/logs.zip
```
Search for [OrphanSweeper] Closed. Runs that exceed 30 minutes without a last_activity_at touch are closed.

If you see [OrphanSweeper] Sweep failed: Connection is closed. repeating every 5 minutes, the server has hit a transient Azure SQL disconnect and (pre-2026-05-18) was unable to recover. Post-2026-05-18 the schedulers re-fetch the pool on every tick, so this should self-heal — but if you ever see this pattern persisting, restart the App Service to force a fresh boot:
```
MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
```
Then re-run the Fidelity Audit to confirm NO_STALE_ORPHANS passes.
From the dashboard, check the sync run’s files — a single file stuck retrying can hold up the run.
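
Once the log archive is downloaded, the two sweeper patterns mentioned above can be counted in one pass. This is a minimal sketch: `sweeper_summary` is a hypothetical helper, and it assumes only that the quoted strings appear verbatim somewhere in each log line.

```shell
#!/bin/sh
# sweeper_summary: read server log text on stdin and print "closed=N failed=M".
sweeper_summary() {
  awk '
    /\[OrphanSweeper\] Closed/ { closed++ }
    /\[OrphanSweeper\] Sweep failed: Connection is closed\./ { failed++ }
    END { printf "closed=%d failed=%d\n", closed, failed }
  '
}

# Usage (not executed here): stream every file in the archive through the counter.
#   unzip -p /tmp/logs.zip | sweeper_summary
```

A rising `failed` count with `closed=0` matches the persistent-disconnect pattern that calls for a restart.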

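If Azure Files is the bottleneck (timeouts on upload), check Azure service health and the share’s request metrics in the Azure Portal. The command below only confirms the share is reachable and prints its provisioned quota: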

MSYS_NO_PATHCONV=1 az storage share show --name argodatastore --account-name psargostorage \
  --query "properties.quota" -o tsv

Auth expired (Azure Files credentials)

SAS tokens issued to agents rotate automatically, but the underlying storage account key must be rotated manually.

Symptoms: Access is denied or network path errors in per-file results on the dashboard. azureFilesStatus heartbeat field shows auth_failure.

Resolution — rotate the key:

In Azure Portal → psargostorage → Access keys → Rotate key1.

Update the App Service configuration:

MSYS_NO_PATHCONV=1 az webapp config appsettings set --name psi-datasync \
  --resource-group PS-WEBAPPS --settings AZURE_STORAGE_KEY="<new-key>"

Restart the App Service (triggers re-issue of SAS tokens to agents on next heartbeat):

MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS

Verify all agents recover: watch the dashboard; azureFilesStatus should tick back to ok within 10 minutes.

All agents failed

If all six ARGOs go offline or sync-fail at roughly the same time, it’s almost never the agents — look at:

Server health endpoint (above) — if the server is dying, the agents can’t do anything.
Azure Files share status — if the share is offline, every agent fails identically.
VPN between PSI shop floor and Azure — if the ExpressRoute or site-to-site VPN dropped, agents can’t reach the server.
Azure SQL down — cascades into the server being unable to serve any request.

Where logs live

Surface	Location	Retention
Server stdout/stderr	`az webapp log tail --name psi-datasync --resource-group PS-WEBAPPS`	App Service rolling (~7 days)
Server file logs (downloadable)	`az webapp log download --name psi-datasync --resource-group PS-WEBAPPS --log-file <zip>`	Same
Agent logs (per ARGO)	`C:\ProgramData\PSI\DataSync\logs\agent-YYYY-MM-DD.log`	14 days (auto-purged on agent startup)
Agent logs (centralized)	Azure SQL `agent_logs` table, queryable via dashboard → Logs	90 days (retention sweeper)
Sync run history	Azure SQL `sync_runs` + `sync_run_files`, queryable via dashboard → Runs	180 days (retention sweeper)
Alert history	Azure SQL `alerts` table, dashboard → Alerts	No auto-purge (low volume)

Routine operations

Seal / unseal a date folder

A geometry date folder is sealed when no new files have been seen in 24 hours. Sealed folders are skipped by agents to save work. If Intego writes to a previously-quiet date folder later, the files are invisible until unsealed.

Check seal status (dashboard): Runner Detail → Job → “Sealed folders” list.

Force-seal a folder manually:

POST /api/admin/runners/:runnerId/jobs/:jobId/seal/:folderDate

(requires dashboard Bearer token)

Unseal a folder so agents rescan it:

POST /api/admin/runners/:runnerId/jobs/:jobId/unseal/:folderDate
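
The two endpoints can be wrapped in a small helper. This is a sketch under assumptions: it assumes the admin API is served from the same host as the dashboard, `DATASYNC_BASE` and `DATASYNC_TOKEN` are hypothetical environment variables, and the folder date is passed through untouched because its exact format is not stated here.

```shell
#!/bin/sh
# Base URL of the server; assumed to be the dashboard host.
DATASYNC_BASE="${DATASYNC_BASE:-https://datasync.progressivesurface.com}"

# seal_url ACTION RUNNER_ID JOB_ID FOLDER_DATE: print the admin endpoint URL.
seal_url() {
  case "$1" in
    seal|unseal) ;;
    *) echo "action must be seal or unseal" >&2; return 1 ;;
  esac
  printf '%s/api/admin/runners/%s/jobs/%s/%s/%s\n' "$DATASYNC_BASE" "$2" "$3" "$1" "$4"
}

# set_seal ACTION RUNNER_ID JOB_ID FOLDER_DATE: POST to that endpoint with the
# dashboard Bearer token. -f makes curl exit non-zero on HTTP errors.
set_seal() {
  url=$(seal_url "$@") || return 1
  curl -sS -f -X POST -H "Authorization: Bearer $DATASYNC_TOKEN" "$url"
}

# Usage (not executed here):
#   DATASYNC_TOKEN=... set_seal unseal <runnerId> <jobId> <folderDate>
```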

Credential rotation (SAS + storage account key)

The storage account key is the master credential. Rotate it at least every 90 days, or immediately on any compromise signal.

Portal → psargostorage → Access keys → Regenerate key2 (never both at once).
Update the App Service (see Auth expired above).
Wait for all six ARGOs to heartbeat azureFilesStatus: ok.
After 24 hours (well past any cached SAS lifetime), regenerate key1 if you started with key2.

Reconcile an agent’s state index

If a sync failed silently (files in Azure but not recorded in the state index), force a reconciliation:

POST /api/admin/runners/:runnerId/jobs/:jobId/reconcile

This enumerates the agent’s files on Azure and repopulates synced_files. See the reconciliation safeguards — it generates real Azure Files list transactions, so scope it to a single job and watch the progress.

Promote a new agent release

Auto-update is currently disabled (see overview for context) — releases must be installed manually:

CI builds PSIDataSync-Setup-{version}.exe and uploads to the release channel.
Download from dashboard → Agent Releases, OR copy from W:\scripts\PSIDataSync\PSIDataSync-Setup-{version}.exe.
TeamViewer into each ARGO PC, copy the .exe to C:\scripts\, run as admin.
The installer stops the service, swaps binaries, preserves DPAPI state (no re-enrollment needed), starts the service.

Contacts and escalation

Roles

Role	Primary	Secondary	Notes
Application owner	Adam Devereaux — adevereaux@progressivesurface.com	TBD	End-to-end responsibility: code, infra, ARGO health, customer liaison
On-call rotation	Not yet formalized. Informal = whoever owns the app.		Must be formalized before contracting any customer SLA — see P4-5
Azure infra	Shared with PSI IT team		For SQL/AppService/Files incidents needing subscription-level access
Customer (data-critical)	TBD per P4-5	TBD	Must be informed within 1 hour of a data-loss-class incident
Security incidents	Follow PSI security contact chain		Storage-account-key leak, credential compromise, etc.

Escalation ladder

Minor / non-urgent (single-ARGO failure, transient alert): ticket in the normal work queue. No page.
Sev-2 / operational (multiple ARGOs down, DB performance, sync stalled > 4h): email the application owner. Follow up in Teams if no response within 30 min.
Sev-1 / customer-impacting (all ARGOs down, data loss risk, DB full, Azure Files share compromised): phone/Teams the application owner immediately. If no response within 15 min, escalate to IT leadership.
Security incident: follow the PSI security incident chain. Do NOT post details in Teams — credentials may be in scope.

Alert channels

Teams webhook — configured on the App Service as TEAMS_WEBHOOK_URL. Destination channel: TBD, confirm in dashboard alert rule config.
Email — recipient list configured per alert rule in the dashboard. TBD confirm the distribution list.
No PagerDuty / Opsgenie today. If customer SLA requires on-call paging (P4-5), integrate one before signing.

Change-control contacts

Production deploys — application owner can push to main directly (deploy is auto from push); no change-advisory board required.
Customer-visible schema changes — coordinate with customer contact first.
Credential rotation — notify the application owner before rotating the storage account key.

Known failure modes

DB size quota (2026-04-23)

Basic-tier 2 GB cap hit; all INSERTs failed for ~3 hours. Resolved by bumping to Standard S0 (250 GB) and adding the daily retention sweeper. Watchlist: database_size in az sql db list-usages should stay well below quota.
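
"Well below quota" can be turned into a number for a scheduled check. The sketch below is illustrative only: the helper names are hypothetical, and the default 80 percent threshold is an assumption, not a documented limit. Feed it the current and limit values reported for database_size.

```shell
#!/bin/sh
# usage_percent USED_BYTES QUOTA_BYTES: integer percentage of the quota consumed.
usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# near_quota USED_BYTES QUOTA_BYTES [THRESHOLD_PCT]: succeed when usage is at or
# above the threshold (defaults to 80, an assumed value).
near_quota() {
  [ "$(usage_percent "$1" "$2")" -ge "${3:-80}" ]
}
```

With the S0 cap of 250 GB (268435456000 bytes), `near_quota 230000000000 268435456000` succeeds because usage is at 85 percent.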

Auto-update fragility

Self-replace from within the running service fails on Session 0, file locks, staging dir permissions. Only ARGO4 ever updated cleanly via auto-apply. Until that’s fixed (plan task P1-4), every agent release is a manual TeamViewer install on six machines.

Lockfile drift in CI

npm ci sometimes fails during the “Reinstall dev dependencies for cache” step after npm prune --omit=dev. Mitigated by continue-on-error: true so the deploy itself isn’t blocked — but it does mean that if the REAL npm ci fails, we won’t catch it as visibly. Watch the actual build phase, not the trailing cache-warming step.

Verification checklist after any server or agent change

curl .../api/health returns 200 with fresh uptime
active_jobs column in runners shows non-null during active syncs (P2-1)
checksum column in sync_run_files populated for recent geometry runs (P0-1)
All six ARGOs heartbeating within the last 5 minutes
No new unresolved alerts on the dashboard
BUILD_LOG.md entry in C:\git\PSISync\ describing what changed

Related

PSI DataSync — Overview (architecture, data model)
Plan file: C:\Users\AMD\.claude\plans\glittery-strolling-peach.md (remediation tasks)
BUILD_LOG.md in the PSISync repo (deployment history)
