This is the on-call companion to the DataSync overview. If an ARGO stops syncing, if alerts fire, or if the dashboard shows red, start here.
For architecture, data model, and “how does it work” questions, read the overview first. This page is specifically about what to do when something is wrong.
Sync Fidelity Audit — the “is everything actually working?” check
The Sync Fidelity Audit is the named, repeatable end-to-end health check. Run it after any deploy, any install, or any incident to get a definitive yes/no on whether the system is healthy top-to-bottom.
# One-liner from any workstation with curl + jq + the CI secret
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh
# Require all agents to be at or above a specific version (post-rollout gate)
DATASYNC_CI_UPLOAD_SECRET=xxx scripts/fidelity-audit.sh --min-version 1.0.93.0
# Dashboard equivalent: GET /api/admin/fidelity-audit (Bearer or X-CI-Secret auth)
What it checks
| Check | Validates |
| --- | --- |
| AGENT_ALIVE | Each active runner heartbeated within the last 10 min. |
| AGENT_VERSION | (When --min-version is set) Every runner is at or above the required version. |
| JOBS_SCHEDULED | Every enabled job has a sync run within 2× its schedule window. |
| JOBS_SUCCEEDED | Latest run per job is success, partial, or running (NOT failed). |
| NO_STALE_ORPHANS | Zero sync runs stuck at running beyond the 30-min orphan threshold. |
| CHECKSUMS_POPULATED | Recent sync_run_files.checksum rows are populated; proves the P0-1 integrity verification is live on current agents. |
| STATUS_HEARTBEAT_RICH | At least one heartbeat in the lookback window carried active_jobs; proves P2-1 real-status reporting is live. |
| NO_ACTIVE_ALERTS | Zero unresolved critical alerts on the dashboard. |
| BACKUP_AT_RISK | No geometry job has a failed latest sync with an unrecovered backlog. A fail here means source files may be purged by the ARGO before they reach Azure; see backup_at_risk (data-loss risk) in the Incident playbook. |
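The time-based checks above reduce to simple age comparisons. The following is a minimal sketch of that arithmetic, assuming all times are epoch seconds. The function names are hypothetical and are not taken from fidelity-audit.sh, and the exact boundary handling (inclusive vs. exclusive) is an assumption.

```shell
# Illustrative only: hypothetical helpers, not the audit script's internals.
# All times are epoch seconds.

HEARTBEAT_MAX_AGE=600   # AGENT_ALIVE: 10 minutes
ORPHAN_THRESHOLD=1800   # NO_STALE_ORPHANS: 30 minutes

# agent_alive NOW LAST_HEARTBEAT -> prints pass or fail
agent_alive() {
  if [ $(( $1 - $2 )) -le "$HEARTBEAT_MAX_AGE" ]; then echo pass; else echo fail; fi
}

# job_scheduled NOW LAST_RUN SCHEDULE_SECONDS
# pass if the last run landed within 2x the schedule window
job_scheduled() {
  if [ $(( $1 - $2 )) -le $(( 2 * $3 )) ]; then echo pass; else echo fail; fi
}

# stale_orphan NOW LAST_ACTIVITY RUN_STATUS
# yes if a run still marked running has been idle beyond the orphan threshold
stale_orphan() {
  if [ "$3" = "running" ] && [ $(( $1 - $2 )) -gt "$ORPHAN_THRESHOLD" ]; then
    echo yes
  else
    echo no
  fi
}
```

For example, an hourly job (3600 s) passes JOBS_SCHEDULED as long as its last run is no more than 7200 s old.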
Exit codes
0 — all checks pass or warn (safe to proceed with deploys)
1 — one or more checks failed (block the deploy, read the report)
2 — audit itself failed (network, auth, server error — this is what the CI deploy gate fails the deploy on; a 1 only warns, because failed checks are usually fleet-side state, not a broken deploy)
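The exit-code handling described above could be wired into a pipeline step roughly as follows. This is a sketch under stated assumptions: gate_decision is a hypothetical helper name, and treating unknown exit codes as an audit failure is an assumption, not documented behavior.

```shell
# Illustrative CI wrapper: maps the audit's exit code to a gate decision.
# gate_decision is a hypothetical helper, not part of fidelity-audit.sh.
gate_decision() {
  case "$1" in
    0) echo "proceed" ;;       # all checks pass or warn
    1) echo "warn" ;;          # checks failed: usually fleet-side state, surface the report
    2) echo "fail-deploy" ;;   # audit itself failed: network, auth, server error
    *) echo "fail-deploy" ;;   # unknown code: treated as an audit failure (assumption)
  esac
}

# Example use in a pipeline step (not executed here):
#   scripts/fidelity-audit.sh; code=$?
#   [ "$(gate_decision "$code")" = "fail-deploy" ] && exit 1
```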
When to run
Immediately after any installer rollout (with --min-version set to the new version).
As a post-deploy gate in CI after server changes.
Daily from a scheduled task to catch drift.
During any incident as the definitive “is it fixed?” check.
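When checking a rollout by hand, the --min-version comparison has to be numeric per field (1.0.100.0 is newer than 1.0.93.0, even though it sorts lower as a string). A minimal sketch of such a comparison, assuming dotted numeric versions; version_at_least is a hypothetical helper and may not match how the audit compares versions internally.

```shell
# version_at_least CURRENT MINIMUM -> exit status 0 if CURRENT >= MINIMUM.
# Assumes dotted numeric versions such as 1.0.93.0; missing fields count as 0.
# Hypothetical helper for manual checks, not the audit's own implementation.
version_at_least() {
  cur=$1
  min=$2
  while [ -n "$cur" ] || [ -n "$min" ]; do
    c=${cur%%.*}
    m=${min%%.*}
    c=${c:-0}
    m=${m:-0}
    if [ "$c" -gt "$m" ]; then return 0; fi
    if [ "$c" -lt "$m" ]; then return 1; fi
    # Drop the field just compared and move to the next one.
    case "$cur" in *.*) cur=${cur#*.} ;; *) cur= ;; esac
    case "$min" in *.*) min=${min#*.} ;; *) min= ;; esac
  done
  return 0
}

# Example:
#   version_at_least 1.0.92.5 1.0.93.0 || echo "runner is below the rollout version"
```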
Quick health check (first 60 seconds)
# 1. Is the server up?
curl -s https://datasync.progressivesurface.com/api/health
# Expect: {"status":"healthy","database":"connected", ...}
# 2. Are all six ARGOs heartbeating?
curl -s -H "X-CI-Secret: $DATASYNC_CI_UPLOAD_SECRET" \
  https://datasync.progressivesurface.com/api/diag/fleet | jq '.runners[].last_heartbeat'
# Expect: ISO timestamps within the last 5 minutes
# 3. Any unresolved alerts?
# Dashboard → Alerts tab, or via API:
# https://datasync.progressivesurface.com → login → Alerts
If (1) is not 200 or database is not connected, go to Server down below.
If an ARGO’s last heartbeat is older than 10 minutes, go to Runner offline.
If there are active critical alerts, the alert detail tells you which ARGO + job.
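The first triage rule (health is not 200, or the database is not connected) can be scripted for a quick yes/no. A sketch under assumptions: triage_health is a hypothetical helper, and the string match relies on the compact JSON shape shown in the expected output; a jq-based check would be more robust if the real response is formatted differently.

```shell
# triage_health HTTP_CODE BODY -> prints "ok" or "server-down".
# Hypothetical helper; matches the compact JSON form shown in the health example.
triage_health() {
  code=$1
  body=$2
  if [ "$code" != "200" ]; then
    echo "server-down"
    return
  fi
  case "$body" in
    *'"database":"connected"'*) echo "ok" ;;
    *) echo "server-down" ;;
  esac
}

# Example capture with curl (not executed here):
#   code=$(curl -s -o /tmp/health.json -w '%{http_code}' \
#     https://datasync.progressivesurface.com/api/health)
#   triage_health "$code" "$(cat /tmp/health.json)"
```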
Incident playbook
backup_at_risk (data-loss risk)
This is the highest-priority alert. The ARGO Intego measurement PCs cap
their local storage at a fixed file count and auto-purge the oldest files.
DataSync exists to archive that data to Azure before the ARGO purges it.
A backup_at_risk alert means a geometry job’s most recent sync failed,
left a real backlog of un-synced files, and has not recovered for over a
full schedule cycle — so those files are racing the ARGO’s purge clock.
The alert fires critical (Teams + email) and also shows as a BACKUP_AT_RISK
fail in the Fidelity Audit.
Respond immediately:
Identify the job — the alert message names the ARGO + job and the
pending file count.
Find out why the sync is failing — check that job’s recent runs on
the dashboard, and the agent logs (GET /api/admin/logs?runnerId=<id>&level=Warning,Error).
Common causes: server 5xx (check server health first), the Intego
source share unreachable, or Azure Files auth.
Once the underlying fault is cleared, the agent re-syncs on its next
cycle; the alert auto-resolves when a successful run lands. To force it,
trigger the job from the dashboard (job → Sync now).
Confirm nothing was lost — run the direct source-vs-Azure audit
(scripts/audit-argo-pc.ps1 on the ARGO PC). If the ARGO already purged
un-synced files, they are only recoverable if they still exist anywhere
on the source.
It is idle-safe: a geometry job with no production (e.g. a weekend) produces
no runs at all and is never flagged — only a failed sync with a backlog
trips it.
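The alert condition described in this section can be summarized as a three-part predicate: the latest sync failed, files are pending, and more than one schedule cycle has passed without recovery. The sketch below mirrors that prose only; it is not the server's code, and the function name, argument order, and units (seconds) are assumptions.

```shell
# backup_at_risk LATEST_STATUS PENDING_FILES SECONDS_SINCE_FAILURE SCHEDULE_SECONDS
# Prints "at-risk" or "ok". Hypothetical helper mirroring the prose above.
backup_at_risk() {
  latest=$1
  pending=$2
  age=$3
  cycle=$4
  # Idle-safe: a job with no runs has no latest status and is never flagged.
  if [ -z "$latest" ]; then
    echo ok
    return
  fi
  if [ "$latest" = "failed" ] && [ "$pending" -gt 0 ] && [ "$age" -gt "$cycle" ]; then
    echo at-risk
  else
    echo ok
  fi
}
```

A failed run with an empty backlog, or a failure younger than one schedule cycle, stays "ok" under this reading.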
Server down (health endpoint not 200)
Check App Service is running:
MSYS_NO_PATHCONV=1 az webapp show --name psi-datasync --resource-group PS-WEBAPPS \
  --query "{state:state, enabled:enabled}" -o table
Tail live logs for a fatal error on startup:
MSYS_NO_PATHCONV=1 az webapp log tail --name psi-datasync --resource-group PS-WEBAPPS
If the log shows a database connection failure, check the Azure SQL server and the managed-identity auth:
MSYS_NO_PATHCONV=1 az sql db show --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --query "status" -o tsv
# Expect: Online
If the log shows “database has reached its size quota” → jump to Database full below.
Last resort: restart the App Service. Safe, ~30s downtime:
MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
Database full
Known failure mode — the 2 GB Basic-tier cap was hit on 2026-04-23. Tier is now S0 (250 GB cap) with the retention sweeper running daily; this shouldn’t recur organically, but:
Symptoms: Every INSERT returns RequestError: The database 'DataSync' has reached its size quota. Agent log ingestion rolls back. Sync /complete fails so state index drifts.
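Immediate mitigation (one-liner that raises the size cap to buy headroom; it does not reclaim space, and a cap above the current tier's maximum may also require changing the service objective):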
MSYS_NO_PATHCONV=1 az sql db update --name DataSync --server procserv-proddata \
  --resource-group ProcServices-Prod-Data --max-size 500GB
Root cause fix — force a retention sweep:
# Connect with sqlcmd using Azure AD auth and delete beyond retention:
sqlcmd -S procserv-proddata.database.windows.net -d DataSync -G \
  -Q "DELETE FROM agent_logs WHERE timestamp < DATEADD(DAY, -30, GETUTCDATE())"
Then restart the App Service once the quota is cleared:
MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
Runner offline (single ARGO not heartbeating)
Confirm the ARGO PC is up — ping it, TeamViewer in, or check the shop-floor UPS monitoring.
Is the Windows service running on that ARGO?
# On the ARGO PC
Get-Service PSIDataSync
If Stopped: Start-Service PSIDataSync, then check Event Viewer → Application for why it stopped.
If Running but no heartbeat: check outbound connectivity:
Search for [OrphanSweeper] Closed. Runs that exceed 30 minutes without a last_activity_at touch are closed.
If you see [OrphanSweeper] Sweep failed: Connection is closed. repeating every 5 minutes, the server has hit a transient Azure SQL disconnect and (pre-2026-05-18) was unable to recover. Post-2026-05-18 the schedulers re-fetch the pool on every tick, so this should self-heal — but if you ever see this pattern persisting, restart the App Service to force a fresh boot:
MSYS_NO_PATHCONV=1 az webapp restart --name psi-datasync --resource-group PS-WEBAPPS
Then re-run the Fidelity Audit to confirm NO_STALE_ORPHANS passes.
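To judge whether the sweep failure is persisting rather than transient, it can help to count occurrences in a saved log excerpt. A minimal sketch, assuming the log has been captured to a file or pipe; the helper names and the threshold (for example, 3 consecutive 5-minute ticks) are assumptions.

```shell
# count_sweep_failures: reads server log lines on stdin and prints how many
# contain the failure message quoted above.
count_sweep_failures() {
  # grep -c exits non-zero when the count is 0; keep the printed 0 and succeed.
  grep -c -F '[OrphanSweeper] Sweep failed: Connection is closed.' || true
}

# sweeper_stuck THRESHOLD: reads log lines on stdin; exit status 0 when the
# failure count is at or above THRESHOLD. The threshold is an assumption.
sweeper_stuck() {
  n=$(count_sweep_failures)
  [ "$n" -ge "$1" ]
}

# Example on a saved excerpt (not executed here):
#   sweeper_stuck 3 < server.log && echo "pattern persisting: restart the App Service"
```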
From the dashboard, check the sync run’s files — a single file stuck retrying can hold up the run.
If Azure Files is the bottleneck (timeouts on upload), check Azure service health and the share’s request metrics:
MSYS_NO_PATHCONV=1 az storage share show --name argodatastore --account-name psargostorage \
  --query "properties.quota" -o tsv
Auth expired (Azure Files credentials)
SAS tokens issued to agents rotate automatically, but the underlying storage account key rotates manually.
Symptoms: Access is denied or network path errors in per-file results on the dashboard. The azureFilesStatus heartbeat field shows auth_failure.
Azure SQL agent_logs table, queryable via dashboard → Logs
90 days (retention sweeper)
Sync run history
Azure SQL sync_runs + sync_run_files, queryable via dashboard → Runs
180 days (retention sweeper)
Alert history
Azure SQL alerts table, dashboard → Alerts
No auto-purge (low volume)
Routine operations
Seal / unseal a date folder
A geometry date folder is sealed when no new files have been seen in 24 hours. Sealed folders are skipped by agents to save work. If Intego writes to a previously-quiet date folder later, the files are invisible until unsealed.
Check seal status (dashboard): Runner Detail → Job → “Sealed folders” list.
Force-seal a folder manually:
POST /api/admin/runners/:runnerId/jobs/:jobId/seal/:folderDate
(requires dashboard Bearer token)
Unseal a folder so agents rescan it:
POST /api/admin/runners/:runnerId/jobs/:jobId/unseal/:folderDate
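The two endpoints above differ only in the action segment, so a small dry-run helper can build the call and let you review it before running. This is a sketch under assumptions: seal_request and DASHBOARD_TOKEN are hypothetical names, the YYYY-MM-DD form of folderDate is assumed, and it assumes unseal takes the same Bearer auth as seal.

```shell
# seal_request ACTION RUNNER_ID JOB_ID FOLDER_DATE
# Prints the curl command for a seal or unseal call without running it.
# BASE_URL is the host used in the health-check examples; DASHBOARD_TOKEN is a
# placeholder variable name, and the folderDate format is an assumption.
BASE_URL="https://datasync.progressivesurface.com"

seal_request() {
  action=$1
  runner=$2
  job=$3
  folder=$4
  case "$action" in
    seal|unseal) ;;
    *) echo "usage: seal_request seal|unseal RUNNER_ID JOB_ID FOLDER_DATE" >&2
       return 2 ;;
  esac
  # $DASHBOARD_TOKEN is printed literally so the token never lands in output.
  printf 'curl -s -X POST -H "Authorization: Bearer $DASHBOARD_TOKEN" %s/api/admin/runners/%s/jobs/%s/%s/%s\n' \
    "$BASE_URL" "$runner" "$job" "$action" "$folder"
}
```

Printing the command rather than executing it keeps a manual review step in front of a call that changes what agents scan.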
Credential rotation (SAS + storage account key)
The storage account key is the master credential. Rotate it at least every 90 days, or immediately on any compromise signal.
Portal → psargostorage → Access keys → Regenerate key2 (never both at once).
Update the App Service (see Auth expired above).
Wait for all six ARGOs to heartbeat azureFilesStatus: ok.
After 24 hours (well past any cached SAS lifetime), regenerate key1 if you started with key2.
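The two timing rules in this procedure (a 90-day rotation interval and a 24-hour wait before touching the second key) can be checked with simple date arithmetic. A minimal sketch, assuming epoch-second inputs; the helper names are hypothetical, and where the last-rotation timestamp comes from is left to the operator.

```shell
# rotation_due NOW LAST_ROTATED -> exit status 0 when the key is 90 or more days old.
rotation_due() {
  [ $(( ($1 - $2) / 86400 )) -ge 90 ]
}

# second_key_ready NOW FIRST_KEY_REGENERATED -> exit status 0 once 24 hours
# have passed since the first key was regenerated. This covers the timing
# rule only; every ARGO must also report azureFilesStatus: ok first.
second_key_ready() {
  [ $(( $1 - $2 )) -ge 86400 ]
}

# Example (not executed here):
#   rotation_due "$(date +%s)" "$last_rotated" && echo "storage key rotation is due"
```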
Reconcile an agent’s state index
If a sync failed silently (files in Azure but not recorded in the state index), force a reconciliation:
POST /api/admin/runners/:runnerId/jobs/:jobId/reconcile
This enumerates the agent’s files on Azure and repopulates synced_files. See the reconciliation safeguards — it generates real Azure Files list transactions, so scope it to a single job and watch the progress.
Promote a new agent release
Auto-update is currently disabled (see overview for context) — releases must be installed manually:
CI builds PSIDataSync-Setup-{version}.exe and uploads to the release channel.
Download from dashboard → Agent Releases, OR copy from W:\scripts\PSIDataSync\PSIDataSync-Setup-{version}.exe.
TeamViewer into each ARGO PC, copy the .exe to C:\scripts\, run as admin.
The installer stops the service, swaps binaries, preserves DPAPI state (no re-enrollment needed), starts the service.
Not yet formalized. Informal = whoever owns the app.
Must be formalized before contracting any customer SLA — see P4-5
Azure infra
Shared with PSI IT team
For SQL/AppService/Files incidents needing subscription-level access
Customer (data-critical)
TBD per P4-5
TBD
Must be informed within 1 hour of a data-loss-class incident
Security incidents
Follow PSI security contact chain
Storage-account-key leak, credential compromise, etc.
Escalation ladder
Minor / non-urgent (single-ARGO failure, transient alert): ticket in the normal work queue. No page.
Sev-2 / operational (multiple ARGOs down, DB performance, sync stalled > 4h): email the application owner. Follow up in Teams if no response within 30 min.
Sev-1 / customer-impacting (all ARGOs down, data loss risk, DB full, Azure Files share compromised): phone/Teams the application owner immediately. If no response within 15 min, escalate to IT leadership.
Security incident: follow the PSI security incident chain. Do NOT post details in Teams — credentials may be in scope.
Alert channels
Teams webhook — configured on the App Service as TEAMS_WEBHOOK_URL. Destination channel: TBD, confirm in dashboard alert rule config.
Email — recipient list configured per alert rule in the dashboard. TBD confirm the distribution list.
No PagerDuty / Opsgenie today. If customer SLA requires on-call paging (P4-5), integrate one before signing.
Change-control contacts
Production deploys: the application owner can push to main directly (deploys run automatically on push); no change-advisory board is required.
Customer-visible schema changes: coordinate with the customer contact first.
Credential rotation: notify the application owner before rotating the storage account key.
Known failure modes
DB size quota (2026-04-23)
Basic-tier 2 GB cap hit; all INSERTs failed for ~3 hours. Resolved by bumping to Standard S0 (250 GB) and adding the daily retention sweeper. Watchlist: database_size in az sql db list-usages should stay well below quota.
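"Well below quota" is easier to watch as a percentage. A minimal sketch of that arithmetic, assuming the used and maximum sizes have already been read (in bytes) from az sql db list-usages or the portal; the helper names and the 70% warning threshold are assumptions, not values from this runbook.

```shell
# db_usage_percent USED_BYTES MAX_BYTES -> integer percent of quota used.
db_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# db_usage_status USED_BYTES MAX_BYTES [WARN_PERCENT]
# WARN_PERCENT defaults to 70; that threshold is an assumption.
db_usage_status() {
  pct=$(db_usage_percent "$1" "$2")
  if [ "$pct" -ge "${3:-70}" ]; then
    echo "warn ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}
```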
Auto-update fragility
Self-replace from within the running service fails on Session 0, file locks, staging dir permissions. Only ARGO4 ever updated cleanly via auto-apply. Until that’s fixed (plan task P1-4), every agent release is a manual TeamViewer install on six machines.
Lockfile drift in CI
npm ci sometimes fails during the “Reinstall dev dependencies for cache” step after npm prune --omit=dev. Mitigated by continue-on-error: true so the deploy itself isn’t blocked — but it does mean that if the REAL npm ci fails, we won’t catch it as visibly. Watch the actual build phase, not the trailing cache-warming step.
Verification checklist after any server or agent change
curl .../api/health returns 200 with fresh uptime
active_jobs column in runners shows non-null during active syncs (P2-1)
checksum column in sync_run_files populated for recent geometry runs (P0-1)
All six ARGOs heartbeating within the last 5 minutes
No new unresolved alerts on the dashboard
BUILD_LOG.md entry in C:\git\PSISync\ describing what changed