PRGJSMES Operational Runbook
What to do when something goes wrong with PRGJSMES in production. Keep this page open during Line 1 pilot week.
Production URL: https://psmes.progressivesurface.com
Health endpoint: https://psmes.progressivesurface.com/api/health
Staging URL: https://prgjsmes-prod-staging.azurewebsites.net
Azure portal: prgjsmes-prod
Action group: ag-prgjsmes-oncall → emails adevereaux@progressivesurface.com
General incident response flow
Alert fires → acknowledge → triage severity → execute playbook → communicate → post-incident notes
- Acknowledge: reply “on it” in the #prgjsmes-line1-pilot Teams channel so operators know help is coming
- Triage: check the health endpoint, App Insights live metrics, and which specific alert fired
- Execute: match the alert to one of the playbooks below
- Communicate: update Teams when resolved, or escalate
- Post-mortem: add to docs/knowledge-base/sessions/ or docs/knowledge-base/bug-patterns.md if a new pattern
Alert playbooks
🔴 Alert: prgjsmes-health-check-failing (Sev 1)
What it means: /api/health has been returning non-2xx for 5 minutes. The endpoint internally calls db.Database.CanConnectAsync() so this means the app can’t reach the DB, OR the app itself crashed.
Immediate steps:
1. Curl the endpoint yourself:
   curl -v https://psmes.progressivesurface.com/api/health
   - 200 OK → alert is stale; clear it
   - 503 with "status":"unhealthy" → app is up, DB is unreachable → go to step 2
   - Connection refused / timeout → app is down → go to step 3
   - 401/403 → Easy Auth is blocking unauthenticated requests; re-test from inside the VPN or use a valid token
2. DB unreachable path:
   - Portal → procserv-proddata → check if SQL server is reachable
   - Verify Managed Identity for prgjsmes-prod has access: az sql db show -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES (this runs under your own login, so it confirms the database exists and is online rather than proving the app identity's permissions)
   - Check private endpoint PS-ProdData-SQL-Private at 10.160.140.4 is healthy (portal → private endpoints)
   - If all green and DB is unreachable, check the az sql db show output for status: if “Paused”, resume via portal (auto-pause should be off per our config, but verify)
3. App is down path:
   - Portal → prgjsmes-prod → restart the app (Overview → Restart)
   - Watch health endpoint for 60 seconds after restart
   - If still down: check App Service → Log stream for Program.cs startup errors
   - If startup is failing, check recent deploys → consider re-swapping to prior slot (see Rollback below)
Escalation: if DB auth issue is confirmed and needs a role change on the Managed Identity, page Adam.
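The branching in step 1 can be captured in a small helper for quick triage. This is a minimal sketch, not part of the deployed tooling: the function name classify_health and its output labels are hypothetical. It assumes the status code comes from curl's `-w '%{http_code}'` write-out, which prints 000 when no HTTP response was received (connection refused or timeout). It looks only at the status code, so a 503 should still be confirmed against the "status":"unhealthy" body.

```shell
# Hypothetical triage helper: maps an HTTP status code from /api/health
# to the playbook path described in the steps above.
classify_health() {
  case "$1" in
    200)     echo "healthy: alert is stale, clear it" ;;
    503)     echo "db-unreachable: go to step 2" ;;
    000)     echo "app-down: go to step 3" ;;
    401|403) echo "auth-blocked: re-test from inside the VPN or with a valid token" ;;
    *)       echo "unknown: inspect the response manually" ;;
  esac
}

# Usage (performs a real request, so it is left commented out here):
# code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#   https://psmes.progressivesurface.com/api/health)
# classify_health "$code"
```

Keeping the classification separate from the request makes it easy to paste a status code seen elsewhere (for example in App Insights) and get the same answer.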
🔴 Alert: prgjsmes-http-5xx-storm (Sev 1)
What it means: >10 HTTP 5xx responses in 5 minutes. Code is throwing unhandled exceptions in requests.
Immediate steps:
1. App Insights → Live Metrics (psi-webapps-insights) → filter by cloud role prgjsmes-prod
2. Failures tab → sort by count; find the top failing operation
3. Click a failure → see the exception type, message, stack trace
4. Decide severity:
   - Single operator hit a bug (e.g., one cart with weird data) → fix forward, not urgent
   - All operators hitting it → rollback. Swap staging back to production:
     az webapp deployment slot swap \
       -g PS-WEBAPPS -n prgjsmes-prod \
       --slot staging --target-slot production
5. Communicate in Teams #prgjsmes-line1-pilot: “we’re seeing X errors, investigating”
Escalation: if cause is a recent deploy and rollback isn’t enough (e.g., DB state also changed), page Dakota for code fix.
🟡 Alert: prgjsmes-plan-cpu-high (Sev 2)
What it means: App Service plan CPU > 80% for 15 minutes.
Immediate steps:
- App Insights → Performance → find the slow request (probably a fat query)
- SQL DB metrics (PRGJSMES) → check cpu_percent: if also high, DB is the bottleneck
- Short-term: scale up the plan temporarily:
  az appservice plan update -g PS-WEBAPPS -n psi-asp-windows --sku P2v3
- Long-term: investigate the slow query; add an index if appropriate (expand-contract not needed for CREATE INDEX)
Not pilot-blocking unless sustained for hours — operators feel slowness but it’s not an outage.
🟡 Alert: prgjsmes-plan-memory-high (Sev 2)
What it means: Plan memory > 85% for 15 minutes.
Immediate steps:
- App Service → Overview → Restart (frees accumulated memory)
- App Insights → Usage → check for memory leak patterns (growing over time)
- If it recurs within an hour of the restart, escalate to a memory dump investigation via the App Service “Diagnose and solve problems” blade
🟡 Alert: prgjsmes-db-cpu-high (Sev 2)
What it means: SQL DB CPU > 80% for 15 minutes.
Immediate steps:
- Portal → PRGJSMES DB → Query Performance Insight → top queries by CPU
- Note the query SQL text; find the corresponding EF Core method
- Short-term: bump the DB SKU temporarily (serverless auto-scales up to 2 vCore today; if 2 is maxed, bump to GP_S_Gen5_4):
  az sql db update -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES --service-objective GP_S_Gen5_4
- Long-term: index tuning or query rewrite
🟢 Alert: prgjsmes-db-storage-high (Sev 3)
What it means: DB storage > 80% of max (currently 32 GB cap).
Immediate steps:
- Not urgent, you have time. Bump the max size via az sql db update --max-size 64GB
- Longer-term: identify the large tables (App Insights → QA dashboard has some old data reports)
🟡 Alert: prgjsmes-db-connection-failed (Sev 2)
What it means: >10 failed SQL connections in 5 minutes. Usually means auth, firewall, or connection string problems.
Immediate steps:
- Check the health endpoint — if it’s green, this might be a specific user/service hitting a firewall rule
- SQL server → Firewalls and virtual networks → confirm expected IPs are allowed
- Check Managed Identity status hasn’t been disabled
- If it just started after a deploy, check appsettings.json for a typo in the connection string (though Key Vault refs should prevent this)
Common situations (not alert-driven)
Operator says “the page won’t load”
- Have them refresh (Ctrl+F5 to bust cache)
- Have them try another PRGJSMES page — if only one page breaks, it’s a component bug
- If all pages fail: check the health endpoint; if that’s fine, have them check their network (Wi-Fi dropped?) and MSAL token (sign out and back in)
- If health is failing → follow the health-check-failing playbook
Operator says “the scale isn’t reading”
Not an infra issue — it’s a hardware integration.
- Check /scale-setup: is their terminal mapped to a scale?
- Is the scale powered on, connected, reachable via IP?
- Use Admin → Printers, Scales & Terminals → Test on the scale row. The modal runs a pre-flight reachability check (scale-type aware: I1 round-trip for Mettler, pure TCP-connect for Relay; A&D HID scales emit only on PRINT, so a passive-read pre-flight would always false-negative) and shows one of two states in its own row above the live SSE stream:
  - Green “Scale is reachable” → TCP path open. The live stream will then either populate (Mettler auto-stream) or wait silently for PRINT (A&D HID). If a Mettler scale is reachable but no readings appear in the live area, the problem is in the SSE delivery path; note that SSE now uses fetch() + Bearer (not EventSource), so 401s no longer silently break the stream.
  - Red “Scale unreachable — Cannot reach …” → TCP connect failed. Check VNet routing from App Service to the scale subnet, scale/Pi power, the scale_relay service on the Pi, and IP/port in the Scale row.
- USB mode: does the browser have Web Serial permission (Site Settings → Serial)?
- See wiki/hardware/mettler-toledo-scales.md for MT-SICS protocol details
Operator says “labels aren’t printing”
- Check /admin/printers: is the printer registered?
- Can you test-print from Admin → Printers → test-print action?
- Is the printer IP reachable on TCP:9100 from the App Service (via VNet integration)?
- App Insights → Failures → search for PrintService: any timeouts?
Operator says “I can’t close this lot”
Usually a validation failure. Check:
- Parts balance (pre-grit / thermal: parts ran = parts received - exceptions)
- Supervisor initials (must match the Supervisors table)
- Powder lot status (cross-line lockout active?)
- If App Insights shows no exception, it’s a UI-side validation — walk through the form fields
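The parts-balance rule in the checklist above is plain arithmetic, so it can be checked by hand or with a throwaway helper while talking to the operator. A minimal sketch, assuming whole-number counts: the function name parts_balance_ok is hypothetical, and the real validation lives in the app and may apply rules beyond this one.

```shell
# Hypothetical helper: checks the balance rule from the checklist
#   parts ran = parts received - exceptions
# Arguments: ran received exceptions (non-negative integers).
parts_balance_ok() {
  ran=$1; received=$2; exceptions=$3
  [ "$ran" -eq $((received - exceptions)) ]
}

# Example: 100 received, 3 exceptions, 97 ran
if parts_balance_ok 97 100 3; then
  echo "balanced"
else
  echo "out of balance"
fi
```

If the numbers the operator reads out do not balance, the lot will not close until one of the three fields is corrected.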
Operator says “two operators are fighting over the same cart”
Multi-operator contention on the same cart/lot. The app applies a non-regressive Math.Max status pattern (order.StatusId = Math.Max(order.StatusId, newStatus)), so status cannot move backwards, but weight entry may still race. Manual resolution:
- Identify which save actually persisted (check the DB via Admin/query)
- The other operator’s entry is lost; they need to redo it
- If this happens repeatedly, it’s a poka-yoke gap — log to bug-patterns.md
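To see why status is safe under contention while weight entry is not, the Math.Max rule can be sketched outside the app. This is an illustration only, written in shell to match the other commands in this runbook: next_status is a hypothetical name, and the status IDs used are made-up integers.

```shell
# Sketch of the non-regressive status rule: the stored status never moves
# backwards, mirroring Math.Max(order.StatusId, newStatus) from the text.
# next_status is a hypothetical name; arguments are integer status IDs.
next_status() {
  current=$1; requested=$2
  if [ "$requested" -gt "$current" ]; then
    echo "$requested"
  else
    echo "$current"
  fi
}

next_status 3 5   # prints 5: forward move is accepted
next_status 5 3   # prints 5: a late, stale update cannot regress the status
```

Whichever order two status updates arrive in, the result is the same. A weight value has no such ordering, so the later save simply replaces the earlier one, which is why the lost entry has to be redone by hand.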
Rollback procedures
Rollback a bad deploy
The blue-green pattern gives you instant rollback if the bad deploy was the latest:
az webapp deployment slot swap \
  -g PS-WEBAPPS -n prgjsmes-prod \
  --slot staging --target-slot production
Takes ~30 seconds. The previous production code (sitting in staging slot since the last deploy) comes back as production. Staging now has the broken code; leave it there for diagnosis.
Caveat: only works if you haven’t deployed anything else since the bad deploy. If staging has been overwritten by a newer build, use “Rollback when slot re-swap isn’t available” below.
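After a swap it can help to confirm recovery the same way the restart step in the health-check playbook does, by watching /api/health for about a minute. A minimal sketch, assuming curl is available on the machine you run it from: wait_until_healthy and probe_prod are hypothetical names, and the 12 x 5 s polling interval is an assumption chosen to approximate that 60-second watch.

```shell
# Hypothetical helper: polls a probe command until it succeeds or the
# attempts run out. Returns 0 on first success, 1 if never healthy.
# Arguments: attempts delay_seconds probe_command [args...]
wait_until_healthy() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still unhealthy after $attempts attempt(s)"
  return 1
}

# Probe for the production health endpoint; curl -f turns HTTP errors
# (such as the 503 "unhealthy" response) into a non-zero exit code.
probe_prod() {
  curl -fsS --max-time 5 -o /dev/null https://psmes.progressivesurface.com/api/health
}

# Usage after a swap or restart (left commented: it makes real requests):
# wait_until_healthy 12 5 probe_prod
```

Passing the probe as a command keeps the loop reusable: the same function can watch the staging URL by swapping in a different probe.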
Rollback when slot re-swap isn’t available
- git revert <merge-commit-sha> on a new branch (add -m 1 if the commit is a true merge commit rather than a squash merge)
- Open a PR, merge, let normal deploy run
Or for extreme emergencies, restore from daily App Service backup:
- Portal → prgjsmes-prod → Backups → select a daily backup from psiappbackups storage
- Restore (takes 5-10 minutes)
Rollback a bad SQL migration
If the migration was additive: no rollback needed, just stop using the new column/table.
If the migration was destructive (broke old code): this should be nearly impossible because the policy + linter enforce expand-contract. But if it slipped through:
- Immediately restore the affected DB from Point-in-Time Restore to a new DB:
  az sql db restore -g ProcServices-Prod-Data -s procserv-proddata \
    --dest-name PRGJSMES-recovery -n PRGJSMES \
    --time "<pre-migration timestamp>"
- Validate data in PRGJSMES-recovery
- Rename production DB aside (PRGJSMES-broken), rename recovery to PRGJSMES
- Restart App Service
- Post-incident review mandatory
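The rename-and-restart steps are order-sensitive (production must move aside before recovery takes its name), so it may be worth printing the commands for review before running anything. A sketch under assumptions: the runbook does not say how the rename is performed, and this uses `az sql db rename` and `az webapp restart` as one option. The function name print_db_cutover_plan is hypothetical; the resource names come from elsewhere in this runbook.

```shell
# Hypothetical dry-run helper: prints the az commands for the rename-and-restart
# cutover so they can be reviewed first. Nothing is executed.
# Arguments: resource_group sql_server database_name
print_db_cutover_plan() {
  rg=$1; server=$2; db=$3
  # 1. Move the damaged production DB aside.
  echo "az sql db rename -g $rg -s $server -n $db --new-name ${db}-broken"
  # 2. Give the validated recovery DB the production name.
  echo "az sql db rename -g $rg -s $server -n ${db}-recovery --new-name $db"
  # 3. Restart the app so it reconnects to the renamed DB.
  echo "az webapp restart -g PS-WEBAPPS -n prgjsmes-prod"
}

print_db_cutover_plan ProcServices-Prod-Data procserv-proddata PRGJSMES
```

Run the printed commands one at a time only after PRGJSMES-recovery has been validated; between the two renames the app has no database to connect to, so expect the health check to fail briefly.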
Full Line 1 rollback to Access
If PRGJSMES is unusable and operators need to keep running production:
- Line 1 supervisor switches operators to paper travelers (kept in reserve at Line 1)
- End of shift: operators enter that shift’s data into the original Line 1 Access DB
- Preserve the PRGJSMES partial data; do not truncate it
- Formal incident review within 24 hours before re-attempting
Monitoring + where to look
| Need | Tool |
|---|---|
| Is the app up? | curl /api/health or the Azure portal prgjsmes-prod Overview |
| What’s slow? | App Insights → psi-webapps-insights → Performance |
| What’s failing? | App Insights → Failures (top exceptions, request paths) |
| Live traffic? | App Insights → Live Metrics |
| DB health? | Portal → PRGJSMES DB → Overview + Query Performance Insight |
| Alerts | Portal → Monitor → Alerts → scope prgjsmes-prod or PRGJSMES |
| Logs | Log Analytics → DefaultWorkspace-...-NCUS → query AppServiceHTTPLogs, AppServiceAppLogs, SQLInsights |
| Audit (who did what in SQL) | Log Analytics → SQLSecurityAuditEvents |
| Cost surprises | Portal → Cost Management → subscription budget psi-sub-monthly-budget |
Example KQL queries
// Find recent 5xx responses
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where ScStatus >= 500
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, CIp, UserAgent
| order by TimeGenerated desc
// Find slow queries
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "QueryStoreRuntimeStatistics"
| where duration_d > 1000
| project TimeGenerated, statement_s, duration_d, cpu_time_d
// App exceptions
exceptions
| where cloud_RoleName == "prgjsmes-prod"
| where timestamp > ago(1h)
| summarize count() by type, outerMessage
| order by count_ desc
Escalation + contacts
| Tier | Who | When |
|---|---|---|
| 1 | Line 1 supervisor | First-line operator issues |
| 2 | Adam (adevereaux@progressivesurface.com) | Infrastructure, access, DB, Azure |
| 2 | Dakota | Application bugs, code fixes, deploy issues |
| 3 | Vendor support (rare) | Azure platform-level incidents — https://portal.azure.com → Help + support |
On-call for pilot week: Adam + Dakota both reachable by Teams during business hours; Adam by email overnight via ag-prgjsmes-oncall action group.
Related docs
- PRGJSMES application: architecture, readiness status
- Schema Change Policy: expand-contract rule
- CD + Slot Swap: deploy flow, rollback mechanism
- PRGJSMES/docs/DEVELOPMENT.md: developer workflow
- PRGJSMES/docs/plans/2026-04-28-line1-pilot-launch.md: pilot launch plan + Day 1 support
Last updated: 2026-04-24. Maintainers: Adam Devereaux, Dakota Cooper.