PRGJSMES Operational Runbook

What to do when something goes wrong with PRGJSMES in production. Keep this page open during Line 1 pilot week.

Production URL: https://psmes.progressivesurface.com
Health endpoint: https://psmes.progressivesurface.com/api/health
Staging URL: https://prgjsmes-prod-staging.azurewebsites.net
Azure portal: prgjsmes-prod
Action group: ag-prgjsmes-oncall → emails adevereaux@progressivesurface.com


General incident response flow

Alert fires  →  acknowledge  →  triage severity  →  execute playbook  →  communicate  →  post-incident notes
  1. Acknowledge — reply “on it” in #prgjsmes-line1-pilot Teams channel so operators know help is coming
  2. Triage — check the health endpoint, App Insights live metrics, which specific alert fired
  3. Execute — match the alert to one of the playbooks below
  4. Communicate — update Teams when resolved, or escalate
  5. Post-mortem — add to docs/knowledge-base/sessions/ or docs/knowledge-base/bug-patterns.md if a new pattern

Alert playbooks

🔴 Alert: prgjsmes-health-check-failing (Sev 1)

What it means: /api/health has been returning non-2xx for 5 minutes. The endpoint internally calls db.Database.CanConnectAsync() so this means the app can’t reach the DB, OR the app itself crashed.

Immediate steps:

  1. Curl the endpoint yourself:

    curl -v https://psmes.progressivesurface.com/api/health
    • 200 OK → alert is stale; clear it
    • 503 with "status":"unhealthy" → app is up, DB is unreachable → go to step 2
    • Connection refused / timeout → app is down → go to step 3
    • 401/403 → Easy Auth is blocking the unauthenticated request; re-test from inside the VPN or use a valid token
  2. DB unreachable path:

    • Portal → procserv-proddata → check if SQL server is reachable
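    • Verify the Managed Identity for prgjsmes-prod still has access. Note that az sql db show -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES only confirms the database exists and reports its status; it does not test the identity's database permissions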
    • Check private endpoint PS-ProdData-SQL-Private at 10.160.140.4 is healthy (portal → private endpoints)
    • If all green and DB is unreachable, check az sql db show for status — if “Paused”, resume via portal (should be off per our config but verify)
  3. App is down path:

    • Portal → prgjsmes-prod → restart the app (Overview → Restart)
    • Watch health endpoint for 60 seconds after restart
    • If still down: check App Service → Log stream for Program.cs startup errors
    • If startup is failing, check recent deploys → consider re-swapping to prior slot (see Rollback below)

Escalation: if DB auth issue is confirmed and needs a role change on the Managed Identity, page Adam.
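
The status-code branches in step 1 can be scripted so that triage always starts from the same reading. This is a minimal sketch, assuming bash and curl are available on the responder's machine. The function names classify_health and probe_health are illustrative and not part of PRGJSMES. It looks only at the HTTP status code, so a 503 should still be confirmed against the "status":"unhealthy" body by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps an HTTP status code from /api/health to the
# runbook branch described in step 1. "000" is what curl's %{http_code}
# reports when no HTTP response was received (refused, timeout, DNS).
classify_health() {
  case "$1" in
    2??)     echo "healthy: alert is stale, clear it" ;;
    503)     echo "app up, DB unreachable: go to step 2" ;;
    401|403) echo "blocked by Easy Auth: re-test from the VPN or with a token" ;;
    000)     echo "no response: app is down, go to step 3" ;;
    *)       echo "unexpected status $1: inspect the response manually" ;;
  esac
}

# Probes the endpoint with a 10-second timeout and prints the matching branch.
probe_health() {
  local url="${1:-https://psmes.progressivesurface.com/api/health}"
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
  classify_health "$code"
}
```

Running probe_health with no arguments checks the production health endpoint. Pass a different URL as the first argument to check another slot.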


🔴 Alert: prgjsmes-http-5xx-storm (Sev 1)

What it means: >10 HTTP 5xx responses in 5 minutes. Code is throwing unhandled exceptions in requests.

Immediate steps:

  1. App Insights → Live Metrics (psi-webapps-insights) → filter by cloud role prgjsmes-prod

  2. Failures tab → sort by count; find the top failing operation

  3. Click a failure → see the exception type, message, stack trace

  4. Decide severity:

    • Single operator hit a bug (e.g., one cart with weird data) → fix forward, not urgent
    • All operators hitting it → rollback. Swap staging back to production:
      az webapp deployment slot swap \
        -g PS-WEBAPPS -n prgjsmes-prod \
        --slot staging --target-slot production
  5. Communicate in Teams #prgjsmes-line1-pilot — “we’re seeing X errors, investigating”

Escalation: if cause is a recent deploy and rollback isn’t enough (e.g., DB state also changed), page Dakota for code fix.


🟡 Alert: prgjsmes-plan-cpu-high (Sev 2)

What it means: App Service plan CPU > 80% for 15 minutes.

Immediate steps:

  1. App Insights → Performance → find the slow request (probably a fat query)
  2. SQL DB metrics (PRGJSMES) → check cpu_percent — if also high, DB is the bottleneck
  3. Short-term: scale up the plan temporarily:
    az appservice plan update -g PS-WEBAPPS -n psi-asp-windows --sku P2v3
  4. Long-term: investigate the slow query; add an index if appropriate (expand-contract not needed for CREATE INDEX)

Not pilot-blocking unless sustained for hours — operators feel slowness but it’s not an outage.


🟡 Alert: prgjsmes-plan-memory-high (Sev 2)

What it means: Plan memory > 85% for 15 minutes.

Immediate steps:

  1. App Service → Overview → Restart (frees accumulated memory)
  2. App Insights → Usage → check for memory leak patterns (growing over time)
  3. If it recurs within an hour of the restart, escalate to a memory dump investigation via the App Service “Diagnose and solve problems” blade

🟡 Alert: prgjsmes-db-cpu-high (Sev 2)

What it means: SQL DB CPU > 80% for 15 minutes.

Immediate steps:

  1. Portal → PRGJSMES DB → Query Performance Insight → top queries by CPU
  2. Note the query SQL text; find the corresponding EF Core method
  3. Short-term: bump the DB SKU temporarily (serverless auto-scales up to 2 vCore today; if 2 is maxed, bump to GP_S_Gen5_4):
    az sql db update -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES --service-objective GP_S_Gen5_4
  4. Long-term: index tuning or query rewrite

🟢 Alert: prgjsmes-db-storage-high (Sev 3)

What it means: DB storage > 80% of max (currently 32 GB cap).

Immediate steps:

  1. Not urgent; you have time. Bump the max size via az sql db update -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES --max-size 64GB
  2. Longer-term: identify the large tables (App Insights → QA dashboard has some old data reports)

🟡 Alert: prgjsmes-db-connection-failed (Sev 2)

What it means: >10 failed SQL connections in 5 minutes. Usually means auth, firewall, or connection string problems.

Immediate steps:

  1. Check the health endpoint — if it’s green, this might be a specific user/service hitting a firewall rule
  2. SQL server → Firewalls and virtual networks → confirm expected IPs are allowed
  3. Check Managed Identity status hasn’t been disabled
  4. If it just started after a deploy, check appsettings.json for a typo in the connection string (though Key Vault refs should prevent this)

Common situations (not alert-driven)

Operator says “the page won’t load”

  1. Have them refresh (Ctrl+F5 to bust cache)
  2. Have them try another PRGJSMES page — if only one page breaks, it’s a component bug
  3. If all pages fail: check the health endpoint; if that’s fine, have them check their network (Wi-Fi dropped?) and MSAL token (sign out and back in)
  4. If health is failing → follow the health-check-failing playbook

Operator says “the scale isn’t reading”

Not an infra issue — it’s a hardware integration.

  1. Check /scale-setup — is their terminal mapped to a scale?
  2. Is the scale powered on, connected, reachable via IP?
  3. Use Admin → Printers, Scales & Terminals → Test on the scale row. The modal runs a pre-flight reachability check (scale-type aware: I1 round-trip for Mettler, pure TCP-connect for Relay — A&D HID scales emit only on PRINT, so passive-read pre-flight would always false-negative) and shows one of two states in its own row above the live SSE stream:
    • Green “Scale is reachable” → TCP path open. The live stream will then either populate (Mettler auto-stream) or wait silently for PRINT (A&D HID). If a Mettler scale is reachable but no readings appear in the live area, the problem is in the SSE delivery path — but note SSE now uses fetch() + Bearer (not EventSource), so 401s no longer silently break the stream.
    • Red “Scale unreachable — Cannot reach …” → TCP connect failed. Check VNet routing from App Service to the scale subnet, scale/Pi power, the scale_relay service on the Pi, and IP/port in the Scale row.
  4. USB mode: does the browser have Web Serial permission (Site Settings → Serial)?
  5. See wiki/hardware/mettler-toledo-scales.md for MT-SICS protocol details

Operator says “labels aren’t printing”

  1. Check /admin/printers — is the printer registered?
  2. Can you test-print from Admin → Printers → test-print action?
  3. Is the printer IP reachable on TCP:9100 from the App Service (via VNet integration)?
  4. App Insights → Failures → search for PrintService — any timeouts?
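
The TCP:9100 check in step 3 can be run as a quick connect test. This is a sketch, assuming a bash shell on a host that shares the App Service's network path to the printer (for example a machine inside the integrated VNet). A laptop on a different network answers a different question. The name tcp_reachable is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if a TCP connection to host:port opens
# within the timeout, non-zero otherwise. Relies on bash's /dev/tcp
# redirection (bash only, not plain sh) and the coreutils `timeout` command.
tcp_reachable() {
  local host="$1" port="$2" seconds="${3:-3}"
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example call; 192.0.2.10 is a documentation placeholder, not a real printer:
# tcp_reachable 192.0.2.10 9100 && echo "port 9100 open" || echo "port 9100 unreachable"
```

A successful connect only shows that the port is open. It does not prove the printer will accept a job, so still run the test-print action from step 2.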

Operator says “I can’t close this lot”

Usually a validation failure. Check:

  • Parts balance (pre-grit / thermal: parts ran = parts received - exceptions)
  • Supervisor initials (must match the Supervisors table)
  • Powder lot status (cross-line lockout active?)
  • If App Insights shows no exception, it’s a UI-side validation — walk through the form fields
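
The parts-balance rule above is plain arithmetic, so it can be checked quickly while talking to the operator. This is a minimal sketch with illustrative function names. It assumes the three counts are whole numbers read off the form, and it does not reproduce the app's own validation code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks the parts-balance rule from the list above,
#   parts ran = parts received - exceptions
# Arguments: received, ran, exceptions (integers). Returns 0 when balanced.
parts_balance_ok() {
  local received="$1" ran="$2" exceptions="$3"
  [ "$ran" -eq $(( received - exceptions )) ]
}

# Prints the discrepancy (ran minus expected) so it can be quoted back
# to the operator; 0 means the lot balances.
parts_balance_gap() {
  local received="$1" ran="$2" exceptions="$3"
  echo $(( ran - (received - exceptions) ))
}
```

For example, with 100 parts received, 95 ran, and 3 exceptions, parts_balance_gap prints -2, meaning two parts are unaccounted for.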

Operator says “two operators are fighting over the same cart”

Multi-operator contention on the same cart/lot. The app uses a Math.Max non-regressive status pattern (order.StatusId = Math.Max(order.StatusId, newStatus)), so a status can never move backwards, but weight entry may still race. Manual resolution:

  • Identify which save actually persisted (check the DB via Admin/query)
  • The other operator’s entry is lost; they need to redo it
  • If this happens repeatedly, it’s a poka-yoke gap — log to bug-patterns.md

Rollback procedures

Rollback a bad deploy

The blue-green pattern gives you instant rollback if the bad deploy was the latest:

az webapp deployment slot swap \
  -g PS-WEBAPPS -n prgjsmes-prod \
  --slot staging --target-slot production

Takes ~30 seconds. The previous production code (sitting in staging slot since the last deploy) comes back as production. Staging now has the broken code — leave it there for diagnosis.

Caveat: only works if you haven’t deployed anything else since the bad deploy. If staging has been overwritten by a newer build, use the procedure under “Rollback when slot re-swap isn’t available” below.
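
After the swap it helps to confirm that the restored code is actually serving, mirroring the 60-second watch used after a restart in the health-check playbook. This is a sketch, assuming bash and curl. The names wait_until and health_ok are illustrative, and the 12 x 5 second polling interval is an assumption, not a documented setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a check command up to `attempts` times, sleeping
# `delay` seconds between tries. Returns 0 as soon as the check succeeds,
# 1 if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Fails when /api/health returns an HTTP error (4xx/5xx, via curl -f)
# or when no response arrives within 10 seconds.
health_ok() {
  curl -fsS -o /dev/null --max-time 10 \
    "https://psmes.progressivesurface.com/api/health"
}

# Usage after the swap command above (12 tries x 5 s is roughly 60 s):
# wait_until 12 5 health_ok && echo "rollback healthy" || echo "still failing"
```

If the check never succeeds, treat the rollback as failed and move on to the slower rollback options.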

Rollback when slot re-swap isn’t available

  1. git revert -m 1 <merge-commit-sha> on a new branch (drop -m 1 if the PR was squash-merged, since that commit has a single parent)
  2. Open a PR, merge, let normal deploy run

Or for extreme emergencies, restore from daily App Service backup:

  1. Portal → prgjsmes-prod → Backups → select a daily backup from psiappbackups storage
  2. Restore (takes 5-10 minutes)

Rollback a bad SQL migration

If the migration was additive: no rollback needed, just stop using the new column/table.

If the migration was destructive (broke old code): this should be nearly impossible because the policy + linter enforce expand-contract. But if it slipped through:

  1. Immediately restore the affected DB from Point-in-Time Restore to a new DB:
    az sql db restore -g ProcServices-Prod-Data -s procserv-proddata \
      -n PRGJSMES --dest-name PRGJSMES-recovery \
      --time "<pre-migration timestamp>"
  2. Validate data in PRGJSMES-recovery
  3. Rename production DB aside (PRGJSMES-broken), rename recovery to PRGJSMES
  4. Restart App Service
  5. Post-incident review mandatory
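
Steps 1 and 3 can be rehearsed as a dry run before anything touches production. This is a sketch under stated assumptions: the run wrapper and the function names are illustrative. The source database is passed with -n, which is how az sql db restore is understood to name it. The rename step uses az sql db rename, which this runbook does not itself specify. With DRY_RUN unset or set to 1, commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of steps 1 and 3 with a dry-run switch. DRY_RUN defaults to 1,
# so commands are printed rather than executed unless DRY_RUN=0 is set.
RG="ProcServices-Prod-Data"
SERVER="procserv-proddata"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: restore to a new database. $1 is the pre-migration timestamp.
restore_to_recovery() {
  run az sql db restore -g "$RG" -s "$SERVER" -n PRGJSMES \
    --dest-name PRGJSMES-recovery --time "$1"
}

# Step 3: move the broken DB aside, then promote the recovery copy.
# Assumption: `az sql db rename` is used for both renames.
# Only call this after step 2 (manual validation) has passed.
promote_recovery() {
  run az sql db rename -g "$RG" -s "$SERVER" -n PRGJSMES --new-name PRGJSMES-broken
  run az sql db rename -g "$RG" -s "$SERVER" -n PRGJSMES-recovery --new-name PRGJSMES
}
```

Reviewing the printed commands with a second person before setting DRY_RUN=0 is a cheap guard against renaming the wrong database.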

Full Line 1 rollback to Access

If PRGJSMES is unusable and operators need to keep running production:

  1. Line 1 supervisor switches operators to paper travelers (kept in reserve at Line 1)
  2. End of shift: operators enter that shift’s data into the original Line 1 Access DB
  3. Preserve the PRGJSMES partial data; do not truncate it
  4. Formal incident review within 24 hours before re-attempting

Monitoring + where to look

Need | Tool
Is the app up? | curl /api/health or the Azure portal prgjsmes-prod Overview
What’s slow? | App Insights → psi-webapps-insights → Performance
What’s failing? | App Insights → Failures (top exceptions, request paths)
Live traffic? | App Insights → Live Metrics
DB health? | Portal → PRGJSMES DB → Overview + Query Performance Insight
Alerts | Portal → Monitor → Alerts → scope prgjsmes-prod or PRGJSMES
Logs | Log Analytics → DefaultWorkspace-...-NCUS → query AppServiceHTTPLogs, AppServiceAppLogs, SQLInsights
Audit (who did what in SQL) | Log Analytics → SQLSecurityAuditEvents
Cost surprises | Portal → Cost Management → subscription budget psi-sub-monthly-budget

Example KQL queries

// Find recent 5xx responses
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where ScStatus >= 500
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, CIp, UserAgent
| order by TimeGenerated desc
 
// Find slow queries
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "QueryStoreRuntimeStatistics"
| where duration_d > 1000
| project TimeGenerated, statement_s, duration_d, cpu_time_d
 
// App exceptions
exceptions
| where cloud_RoleName == "prgjsmes-prod"
| where timestamp > ago(1h)
| summarize count() by type, outerMessage
| order by count_ desc

Escalation + contacts

Tier | Who | When
1 | Line 1 supervisor | First-line operator issues
2 | Adam (adevereaux@progressivesurface.com) | Infrastructure, access, DB, Azure
2 | Dakota | Application bugs, code fixes, deploy issues
3 | Vendor support (rare) | Azure platform-level incidents (https://portal.azure.com → Help + support)

On-call for pilot week: Adam + Dakota both reachable by Teams during business hours; Adam by email overnight via ag-prgjsmes-oncall action group.


Related documents

  • PRGJSMES application — architecture, readiness status
  • Schema Change Policy — expand-contract rule
  • CD + Slot Swap — deploy flow, rollback mechanism
  • PRGJSMES/docs/DEVELOPMENT.md — developer workflow
  • PRGJSMES/docs/plans/2026-04-28-line1-pilot-launch.md — pilot launch plan + Day 1 support

Last updated: 2026-04-24. Maintainers: Adam Devereaux, Dakota Cooper.