PSI DataSync — Operations Runbook

This is the on-call companion to the DataSync overview. If an ARGO stops syncing, if alerts fire, or if the dashboard shows red, start here.

For architecture, data model, and “how does it work” questions, read the overview first. This page is specifically about what to do when something is wrong.

Sync Fidelity Audit — the “is everything actually working?” check

The Sync Fidelity Audit is the named, repeatable end-to-end health check. Run it after any deploy, any install, or any incident to get a definitive yes/no on whether the system is healthy top-to-bottom.

# One-liner from any workstation with curl + jq + the CI secret
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh
 
# Require all agents to be at or above a specific version (post-rollout gate)
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh --min-version 1.0.93.0
 
# Dashboard equivalent: GET /api/admin/fidelity-audit (Bearer or X-CI-Secret auth)

What it checks

Check | Validates
AGENT_ALIVE | Each active runner heartbeated within the last 10 min.
AGENT_VERSION | (When --min-version is set) Every runner is at or above the required version.
JOBS_SCHEDULED | Every enabled job has a sync run within 2× its schedule window.
JOBS_SUCCEEDED | Latest run per job is success, partial, or running (NOT failed).
NO_STALE_ORPHANS | Zero sync runs stuck at running beyond the 30-min orphan threshold.
CHECKSUMS_POPULATED | Recent sync_run_files.checksum rows are populated, which proves the P0-1 integrity verification is live on current agents.
STATUS_HEARTBEAT_RICH | At least one heartbeat in the lookback window carried active_jobs, which proves P2-1 real-status reporting is live.
NO_ACTIVE_ALERTS | Zero unresolved critical alerts on the dashboard.
BACKUP_AT_RISK | No geometry job has a failed latest sync with an unrecovered backlog. A fail here means source files may be purged by the ARGO before they reach Azure; see backup_at_risk (data-loss risk) under Incident playbook.
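
The time-based checks in the table reduce to simple threshold comparisons. The sketch below restates three of them as shell predicates over ages in seconds. The function names are illustrative and are not taken from scripts/fidelity-audit.sh, whose actual implementation is not shown here.

```sh
#!/bin/sh
# Illustrative predicates only; inputs are ages in seconds.

# AGENT_ALIVE: last heartbeat within the last 10 minutes.
agent_alive() { [ "$1" -le 600 ]; }

# JOBS_SCHEDULED: last run within 2x the job's schedule window.
# $1 = seconds since last run, $2 = schedule window in seconds.
job_on_schedule() { [ "$1" -le $(( 2 * $2 )) ]; }

# NO_STALE_ORPHANS: a run still "running" with no activity for over 30 minutes.
run_is_orphan() { [ "$1" -gt 1800 ]; }

# Example: a job on a 15-minute schedule whose last run was 40 minutes ago.
if job_on_schedule 2400 900; then
  echo "JOBS_SCHEDULED pass"
else
  echo "JOBS_SCHEDULED fail"
fi
```

With a 15-minute schedule the allowed gap is 30 minutes, so a 40-minute gap prints JOBS_SCHEDULED fail.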

Exit codes

  • 0: all checks pass or warn (safe to proceed with deploys)
  • 1: one or more checks failed (read the report before proceeding; when running by hand, treat it as a block. The CI deploy gate only warns on this code, because failed checks are usually fleet-side state, not a broken deploy)
  • 2: the audit itself failed (network, auth, or server error). This is the code the CI deploy gate fails the deploy on.
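
The CI behavior described for these codes can be sketched as a small wrapper. This is a minimal illustration, not the actual pipeline step: gate_decision is a hypothetical helper, and treating any unexpected exit code the same as 2 is an assumption.

```sh
#!/bin/sh
# Hypothetical helper: map the audit's exit code to a CI gate outcome.
gate_decision() {
  case $1 in
    0) echo proceed ;;  # all checks pass or warn
    1) echo warn ;;     # checks failed: surface the report, do not fail the deploy
    *) echo fail ;;     # 2 (audit itself failed) or anything unexpected (assumption)
  esac
}

# Usage in a pipeline step (not executed here):
#   scripts/fidelity-audit.sh; decision=$(gate_decision $?)
gate_decision 1
```

The last line prints warn, matching the rule that a 1 does not fail the deploy in CI.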

When to run

  • Immediately after any installer rollout (with --min-version set to the new version).
  • As a post-deploy gate in CI after server changes.
  • Daily from a scheduled task to catch drift.
  • During any incident as the definitive “is it fixed?” check.
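
The --min-version gate implies a numeric, field-by-field comparison of dotted versions. A minimal sketch of that comparison follows; version_ge is a hypothetical helper written for illustration, and how the audit script actually compares versions is not stated here.

```sh
#!/bin/sh
# Sketch: succeed when version $1 is at or above version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
  _have=$1 _want=$2
  while [ -n "$_have" ] || [ -n "$_want" ]; do
    _h=${_have%%.*}; _w=${_want%%.*}
    [ "${_h:-0}" -gt "${_w:-0}" ] && return 0
    [ "${_h:-0}" -lt "${_w:-0}" ] && return 1
    # Drop the field just compared; clear the string when no dot is left.
    case $_have in *.*) _have=${_have#*.} ;; *) _have= ;; esac
    case $_want in *.*) _want=${_want#*.} ;; *) _want= ;; esac
  done
  return 0
}

if version_ge 1.0.100.0 1.0.93.0; then
  echo "meets minimum"
else
  echo "below minimum"
fi
```

The numeric comparison matters: as plain strings, 1.0.100.0 sorts before 1.0.93.0, which would wrongly flag a newer agent as out of date.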

Quick health check (first 60 seconds)

# 1. Is the server up?
curl -s https://datasync.progressivesurface.com/api/health
# Expect: {"status":"healthy","database":"connected", ...}
 
# 2. Are all six ARGOs heartbeating?
curl -s -H "X-CI-Secret: $DATASYNC_CI_UPLOAD_SECRET" \
  https://datasync.progressivesurface.com/api/diag/fleet | jq '.runners[].last_heartbeat'
# Expect: ISO timestamps within the last 5 minutes
 
# 3. Any unresolved alerts?
# Dashboard → Alerts tab, or via API:
# https://datasync.progressivesurface.com → login → Alerts

If (1) is not 200 or database is not connected, go to Server down below. If an ARGO’s last heartbeat is older than 10 minutes, go to Runner offline. If there are active critical alerts, the alert detail tells you which ARGO + job.
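
The routing in the paragraph above can be written down as a small decision function. This is a sketch under stated assumptions: triage is a hypothetical helper, and its three inputs (HTTP status of /api/health, the database field from that response, and the age in seconds of the oldest heartbeat) are expected to be gathered separately.

```sh
#!/bin/sh
# Hypothetical helper: pick the playbook section from the three quick checks.
# $1 = HTTP status of /api/health, $2 = "database" field, $3 = oldest heartbeat age (s)
triage() {
  http_code=$1 db_state=$2 oldest_heartbeat_age=$3
  if [ "$http_code" != 200 ] || [ "$db_state" != connected ]; then
    echo "Server down"
  elif [ "$oldest_heartbeat_age" -gt 600 ]; then
    echo "Runner offline"
  else
    echo "Check alerts"
  fi
}

# One way to collect the first input (not executed here):
#   curl -s -o /dev/null -w '%{http_code}' https://datasync.progressivesurface.com/api/health
triage 200 connected 900
```

The example prints Runner offline, because the server is healthy but one heartbeat is 15 minutes old.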

Incident playbook

backup_at_risk (data-loss risk)

This is the highest-priority alert. The ARGO Intego measurement PCs cap their local storage at a fixed file count and auto-purge the oldest files. DataSync exists to archive that data to Azure before the ARGO purges it. A backup_at_risk alert means a geometry job’s most recent sync failed, left a real backlog of un-synced files, and has not recovered for over a full schedule cycle — so those files are racing the ARGO’s purge clock.

The alert fires critical (Teams + email) and also shows as a BACKUP_AT_RISK fail in the Fidelity Audit.

Respond immediately:

  1. Identify the job — the alert message names the ARGO + job and the pending file count.
  2. Find out why the sync is failing — check that job’s recent runs on the dashboard, and the agent logs (GET /api/admin/logs?runnerId=<id>&level=Warning,Error). Common causes: server 5xx (check server health first), the Intego source share unreachable, or Azure Files auth.
  3. Once the underlying fault is cleared, the agent re-syncs on its next cycle; the alert auto-resolves when a successful run lands. To force it, trigger the job from the dashboard (job → Sync now).
  4. Confirm nothing was lost: run the direct source-vs-Azure audit (scripts/audit-argo-pc.ps1 on the ARGO PC). If the ARGO has already purged un-synced files, they are recoverable only if a copy still exists somewhere on the source.

The alert is idle-safe: a geometry job with no production (e.g. a weekend) produces no runs at all and is never flagged. Only a failed sync with a backlog trips it.
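
The alert condition described in this section combines three tests. The predicate below is a minimal restatement of that description, not the server's implementation; the function name and argument order are made up for illustration.

```sh
#!/bin/sh
# Illustrative predicate for the backup_at_risk condition.
# $1 = status of the job's latest run ("" when the job has no runs)
# $2 = pending (un-synced) file count
# $3 = seconds since that run failed
# $4 = the job's schedule cycle in seconds
backup_at_risk() {
  [ "$1" = failed ] &&
    [ "$2" -gt 0 ] &&
    [ "$3" -gt "$4" ]
}

# A failed run with 120 pending files, two hours old, on an hourly schedule.
if backup_at_risk failed 120 7200 3600; then
  echo "at risk"
else
  echo "not at risk"
fi
```

A job with no runs at all has an empty status, so the first test fails and the job is never flagged, which is the idle-safe property.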

Server down (health endpoint not 200)

  1. Check App Service is running:
    MSYS_NO_PATHCONV=1 az webapp show --name psi-datasync --resource-group PS-WEBAPPS \
      --query "{state:state, enabled:enabled}" -o table
  2. Tail live logs for a fatal error on startup:
    MSYS_NO_PATHCONV=1 az webapp log tail --name psi-datasync --resource-group PS-WEBAPPS
  3. If the log shows a database connection failure, check the Azure SQL server and the managed-identity auth:
    MSYS_NO_PATHCONV=1 az sql db show --name DataSync --server procserv-proddata \
      --resource-group ProcServices-Prod-Data --query "status" -o tsv
    # Expect: Online
  4. If the log shows “database has reached its size quota” → jump to Database full below.
  5. Last resort: restart the App Service. Safe, ~30s downtime:
    MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS

Database full

Known failure mode: the 2 GB Basic-tier cap was hit on 2026-04-23. The tier is now S0 (250 GB cap) with the retention sweeper running daily, so this shouldn’t recur organically. If it does:

Symptoms: Every INSERT returns RequestError: The database 'DataSync' has reached its size quota. Agent log ingestion rolls back. Sync /complete fails so state index drifts.

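Immediate mitigation (one-liner to raise the size quota; this buys headroom but does not reclaim space):

# S0 tops out at 250 GB, so 500 GB also needs a tier bump (S3 or higher);
# confirm the cost of the higher tier is acceptable before running this.
MSYS_NO_PATHCONV=1 az sql db update --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --service-objective S3 --max-size 500GB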

Root cause fix — force a retention sweep:

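# Connect with sqlcmd using Azure AD auth and delete agent logs older than 30 days.
# This is stricter than the 90-day agent_logs retention listed under Where logs live; use -90 to match the sweeper.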
sqlcmd -S procserv-proddata.database.windows.net -d DataSync -G \
  -Q "DELETE FROM agent_logs WHERE timestamp < DATEADD(DAY, -30, GETUTCDATE())"

Then restart the App Service once the quota is cleared:

MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
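
On a large agent_logs table a single DELETE can run long and hold locks for its whole duration. One common alternative is to delete in batches. The sketch below only prints the T-SQL for such a loop; retention_delete_sql, the default batch size of 5000, and the @deleted variable are illustrative choices, while the table and column names are taken from the sqlcmd command above.

```sh
#!/bin/sh
# Hypothetical helper: print a batched retention delete for agent_logs.
# $1 = retention in days, $2 = rows per batch (default 5000, an arbitrary choice)
retention_delete_sql() {
  days=$1 batch=${2:-5000}
  cat <<EOF
DECLARE @deleted INT = 1;
WHILE @deleted > 0
BEGIN
  DELETE TOP ($batch) FROM agent_logs
  WHERE timestamp < DATEADD(DAY, -$days, GETUTCDATE());
  SET @deleted = @@ROWCOUNT;
END
EOF
}

# Print the statement for review before running it anywhere.
retention_delete_sql 30
```

After reviewing the output, it could be passed to the same sqlcmd invocation in place of the single DELETE, for example with -Q "$(retention_delete_sql 30)".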

Runner offline (single ARGO not heartbeating)

  1. Confirm the ARGO PC is up — ping it, TeamViewer in, or check the shop-floor UPS monitoring.
  2. Is the Windows service running on that ARGO?
    # On the ARGO PC
    Get-Service PSIDataSync
    • If Stopped: Start-Service PSIDataSync, then check Event Viewer → Application for why it stopped.
    • If Running but no heartbeat: check outbound connectivity:
      Test-NetConnection datasync.progressivesurface.com -Port 443
  3. Check the agent’s local log on the ARGO:
    C:\ProgramData\PSI\DataSync\logs\agent-<today>.log
    
    Look for Heartbeat failed, 401, DataSync:ServerUrl, or auth-related errors.
  4. If the service crashes repeatedly on startup, the DPAPI state may be corrupted:
    Stop-Service PSIDataSync
    Remove-Item "C:\ProgramData\PSI\DataSync\state.dat" -Force
    Start-Service PSIDataSync
    The agent will re-enroll and appear as pending in the dashboard — approve it to restore service.

Sync stalled (runner online, but no progress)

The agent heartbeats with status: running but /complete never fires, or runs are stuck at running:

  1. Check the orphan sweeper is active (it logs only when it closes stale runs, so expect entries no more than about once a day):

    MSYS_NO_PATHCONV=1 az webapp log download --name psi-datasync --resource-group PS-WEBAPPS --log-file /tmp/logs.zip

    Search for [OrphanSweeper] Closed. Runs that exceed 30 minutes without a last_activity_at touch are closed.

    If you see [OrphanSweeper] Sweep failed: Connection is closed. repeating every 5 minutes, the server has hit a transient Azure SQL disconnect and (pre-2026-05-18) was unable to recover. Post-2026-05-18 the schedulers re-fetch the pool on every tick, so this should self-heal — but if you ever see this pattern persisting, restart the App Service to force a fresh boot:

    MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS

    Then re-run the Fidelity Audit to confirm NO_STALE_ORPHANS passes.

  2. From the dashboard, check the sync run’s files — a single file stuck retrying can hold up the run.

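  3. If Azure Files is the bottleneck (timeouts on upload), check Azure service health and the share’s request metrics. The command below is only a quick reachability check; it returns the share quota, not request metrics: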

    MSYS_NO_PATHCONV=1 az storage share show --name argodatastore --account-name psargostorage \
      --query "properties.quota" -o tsv
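
Step 1 amounts to counting two log patterns. A minimal sketch, assuming the downloaded log content is streamed to stdin (for example with unzip -p /tmp/logs.zip): sweeper_status is a hypothetical helper, the threshold of three repeated failures is an arbitrary stand-in for "persisting", and the sample log line is made up.

```sh
#!/bin/sh
# Hypothetical helper: classify orphan-sweeper health from log lines on stdin.
sweeper_status() {
  awk '
    index($0, "[OrphanSweeper] Sweep failed: Connection is closed.") { failed++ }
    index($0, "[OrphanSweeper] Closed") { closed++ }
    END {
      if (failed >= 3) print "restart-needed"
      else if (closed > 0) print "active"
      else print "no-sweeper-entries"
    }'
}

# Real use (not executed here): unzip -p /tmp/logs.zip | sweeper_status
printf '%s\n' '2026-05-20 [OrphanSweeper] Closed 1 stale run' | sweeper_status
```

A result of no-sweeper-entries is not necessarily a fault, since the sweeper only logs when it closes something.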

Auth expired (Azure Files credentials)

SAS tokens issued to agents rotate automatically, but the underlying storage account key must be rotated manually.

Symptoms: Access is denied or network path errors in per-file results on the dashboard. azureFilesStatus heartbeat field shows auth_failure.

Resolution — rotate the key:

  1. In Azure Portal → psargostorage → Access keys → Rotate key1.
  2. Update the App Service configuration:
    MSYS_NO_PATHCONV=1 az webapp config appsettings set --name psi-datasync \
      --resource-group PS-WEBAPPS --settings AZURE_STORAGE_KEY="<new-key>"
  3. Restart the App Service (triggers re-issue of SAS tokens to agents on next heartbeat):
    MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
  4. Verify all agents recover: watch the dashboard; azureFilesStatus should tick back to ok within 10 minutes.

All agents failed

If all six ARGOs go offline or sync-fail at roughly the same time, it’s almost never the agents — look at:

  1. Server health endpoint (above) — if the server is dying, the agents can’t do anything.
  2. Azure Files share status — if the share is offline, every agent fails identically.
  3. VPN between PSI shop floor and Azure — if the ExpressRoute or site-to-site VPN dropped, agents can’t reach the server.
  4. Azure SQL down — cascades into the server being unable to serve any request.

Where logs live

Surface | Location | Retention
Server stdout/stderr | az webapp log tail --name psi-datasync --resource-group PS-WEBAPPS | App Service rolling (~7 days)
Server file logs (downloadable) | az webapp log download --name psi-datasync --resource-group PS-WEBAPPS --log-file <zip> | Same
Agent logs (per ARGO) | C:\ProgramData\PSI\DataSync\logs\agent-YYYY-MM-DD.log | 14 days (auto-purged on agent startup)
Agent logs (centralized) | Azure SQL agent_logs table, queryable via dashboard → Logs | 90 days (retention sweeper)
Sync run history | Azure SQL sync_runs + sync_run_files, queryable via dashboard → Runs | 180 days (retention sweeper)
Alert history | Azure SQL alerts table, dashboard → Alerts | No auto-purge (low volume)

Routine operations

Seal / unseal a date folder

A geometry date folder is sealed when no new files have been seen in 24 hours. Sealed folders are skipped by agents to save work. If Intego writes to a previously-quiet date folder later, the files are invisible until unsealed.

Check seal status (dashboard): Runner Detail → Job → “Sealed folders” list.

Force-seal a folder manually:

POST /api/admin/runners/:runnerId/jobs/:jobId/seal/:folderDate

(requires dashboard Bearer token)

Unseal a folder so agents rescan it:

POST /api/admin/runners/:runnerId/jobs/:jobId/unseal/:folderDate
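
Both routes share one shape, so a small helper can build the URL and reject anything other than seal or unseal. This is a sketch under assumptions: seal_endpoint is hypothetical, the admin API is assumed to live on the same host as the health endpoint, and the format of :folderDate is not specified here, so the placeholders are left as placeholders.

```sh
#!/bin/sh
# Hypothetical helper: build the seal/unseal URL from the documented route.
# $1 = seal|unseal, $2 = runner id, $3 = job id, $4 = folder date
seal_endpoint() {
  action=$1 runner_id=$2 job_id=$3 folder_date=$4
  case $action in
    seal|unseal) ;;
    *) echo "action must be seal or unseal" >&2; return 2 ;;
  esac
  # Host is assumed to match the health-check URL used earlier in this runbook.
  echo "https://datasync.progressivesurface.com/api/admin/runners/$runner_id/jobs/$job_id/$action/$folder_date"
}

seal_endpoint unseal RUNNER_ID JOB_ID FOLDER_DATE
```

The printed URL would then be used with a POST and the dashboard Bearer token, for example curl -s -X POST -H "Authorization: Bearer $TOKEN" "$(seal_endpoint unseal ...)", where TOKEN is a placeholder variable name.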

Credential rotation (SAS + storage account key)

The storage account key is the master credential. Rotate it at least every 90 days, or immediately on any compromise signal.

  1. Portal → psargostorage → Access keys → Regenerate key2 (never both at once).
  2. Update the App Service (see Auth expired above).
  3. Wait for all six ARGOs to heartbeat azureFilesStatus: ok.
  4. After 24 hours (well past any cached SAS lifetime), regenerate key1 if you started with key2.

Reconcile an agent’s state index

If a sync failed silently (files in Azure but not recorded in the state index), force a reconciliation:

POST /api/admin/runners/:runnerId/jobs/:jobId/reconcile

This enumerates the agent’s files on Azure and repopulates synced_files. See the reconciliation safeguards — it generates real Azure Files list transactions, so scope it to a single job and watch the progress.

Promote a new agent release

Auto-update is currently disabled (see overview for context) — releases must be installed manually:

  1. CI builds PSIDataSync-Setup-{version}.exe and uploads to the release channel.
  2. Download from dashboard → Agent Releases, OR copy from W:\scripts\PSIDataSync\PSIDataSync-Setup-{version}.exe.
  3. TeamViewer into each ARGO PC, copy the .exe to C:\scripts\, run as admin.
  4. The installer stops the service, swaps binaries, preserves DPAPI state (no re-enrollment needed), starts the service.

Contacts and escalation

Roles

Role | Primary | Secondary | Notes
Application owner | Adam Devereaux (adevereaux@progressivesurface.com) | TBD | End-to-end responsibility: code, infra, ARGO health, customer liaison
On-call rotation | Not yet formalized. Informal = whoever owns the app. | | Must be formalized before contracting any customer SLA (see P4-5)
Azure infra | Shared with PSI IT team | | For SQL/AppService/Files incidents needing subscription-level access
Customer (data-critical) | TBD per P4-5 | TBD | Must be informed within 1 hour of a data-loss-class incident
Security incidents | Follow PSI security contact chain | | Storage-account-key leak, credential compromise, etc.

Escalation ladder

  1. Minor / non-urgent (single-ARGO failure, transient alert): ticket in the normal work queue. No page.
  2. Sev-2 / operational (multiple ARGOs down, DB performance, sync stalled > 4h): email the application owner. Follow up in Teams if no response within 30 min.
  3. Sev-1 / customer-impacting (all ARGOs down, data loss risk, DB full, Azure Files share compromised): phone/Teams the application owner immediately. If no response within 15 min, escalate to IT leadership.
  4. Security incident: follow the PSI security incident chain. Do NOT post details in Teams — credentials may be in scope.

Alert channels

  • Teams webhook — configured on the App Service as TEAMS_WEBHOOK_URL. Destination channel: TBD, confirm in dashboard alert rule config.
  • Email — recipient list configured per alert rule in the dashboard. TBD confirm the distribution list.
  • No PagerDuty / Opsgenie today. If customer SLA requires on-call paging (P4-5), integrate one before signing.

Change-control contacts

  • Production deploys — application owner can push to main directly (deploy is auto from push); no change-advisory board required.
  • Customer-visible schema changes — coordinate with customer contact first.
  • Credential rotation — notify the application owner before rotating the storage account key.

Known failure modes

DB size quota (2026-04-23)

Basic-tier 2 GB cap hit; all INSERTs failed for ~3 hours. Resolved by bumping to Standard S0 (250 GB) and adding the daily retention sweeper. Watchlist: database_size in az sql db list-usages should stay well below quota.
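
"Well below quota" is easier to watch with a number attached. The sketch below turns a used/limit pair into a percentage and a simple level; both helpers are hypothetical, the 80% threshold is an arbitrary assumption, and the two inputs are expected to come from the database_size entry of az sql db list-usages in the same unit.

```sh
#!/bin/sh
# Hypothetical helpers: $1 = current size, $2 = quota, both in the same unit.
db_usage_pct() { echo $(( $1 * 100 / $2 )); }

db_usage_level() {
  pct=$(db_usage_pct "$1" "$2")
  # 80% is an assumed alert threshold, not a documented one.
  if [ "$pct" -ge 80 ]; then echo alert; else echo ok; fi
}

# Example: 225 GB used of a 250 GB quota.
echo "$(db_usage_pct 225 250)% $(db_usage_level 225 250)"
```

The example prints 90% alert. Integer division rounds down, so small databases report 0% until they pass 1% of quota.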

Auto-update fragility

Self-replace from within the running service fails on Session 0, file locks, staging dir permissions. Only ARGO4 ever updated cleanly via auto-apply. Until that’s fixed (plan task P1-4), every agent release is a manual TeamViewer install on six machines.

Lockfile drift in CI

npm ci sometimes fails during the “Reinstall dev dependencies for cache” step after npm prune --omit=dev. Mitigated by continue-on-error: true so the deploy itself isn’t blocked — but it does mean that if the REAL npm ci fails, we won’t catch it as visibly. Watch the actual build phase, not the trailing cache-warming step.

Verification checklist after any server or agent change

  • curl .../api/health returns 200 with fresh uptime
  • active_jobs column in runners shows non-null during active syncs (P2-1)
  • checksum column in sync_run_files populated for recent geometry runs (P0-1)
  • All six ARGOs heartbeating within the last 5 minutes
  • No new unresolved alerts on the dashboard
  • BUILD_LOG.md entry in C:\git\PSISync\ describing what changed

Related documents

  • PSI DataSync — Overview (architecture, data model)
  • Plan file: C:\Users\AMD\.claude\plans\glittery-strolling-peach.md (remediation tasks)
  • BUILD_LOG.md in the PSISync repo (deployment history)