PRGJSMES Operational Runbook

What to do when something goes wrong with PRGJSMES in production. Keep this page open during Line 1 pilot week.

Production URL: https://psmes.progressivesurface.com
Health endpoint: https://psmes.progressivesurface.com/api/health
Staging URL: https://prgjsmes-prod-staging.azurewebsites.net
Azure portal: prgjsmes-prod
Action group: ag-prgjsmes-oncall → emails adevereaux@progressivesurface.com

General incident response flow

Alert fires  →  acknowledge  →  triage severity  →  execute playbook  →  communicate  →  post-incident notes

Acknowledge: reply “on it” in the #prgjsmes-line1-pilot Teams channel so operators know help is coming
Triage: check the health endpoint, App Insights Live Metrics, and which specific alert fired
Execute: match the alert to one of the playbooks below
Communicate: update Teams when resolved, or escalate
Post-incident notes: add to docs/knowledge-base/sessions/, or to docs/knowledge-base/bug-patterns.md if it is a new pattern

Alert playbooks

🔴 Alert: `prgjsmes-health-check-failing` (Sev 1)

What it means: /api/health has been returning non-2xx for 5 minutes. The endpoint internally calls db.Database.CanConnectAsync(), so a failure means either the app can’t reach the DB or the app itself has crashed.

Immediate steps:

1. Curl the endpoint yourself:
```
curl -v https://psmes.progressivesurface.com/api/health
```
- 200 OK → alert is stale; clear it
- 503 with "status":"unhealthy" → app is up, DB is unreachable → go to step 2
- Connection refused / timeout → app is down → go to step 3
- 401/403 → Easy Auth is blocking the unauthenticated request; re-test from inside the VPN or use a valid token
2. DB unreachable path:
- Portal → procserv-proddata → check if SQL server is reachable
- Verify the Managed Identity for prgjsmes-prod still has access. Note that az sql db show -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES only confirms the database exists and reports its status; it does not list the identity's database roles
- Check private endpoint PS-ProdData-SQL-Private at 10.160.140.4 is healthy (portal → private endpoints)
- If everything above is green and the DB is still unreachable, check the status field in the az sql db show output; if it reads “Paused”, resume via the portal (auto-pause should be off per our config, but verify)
3. App is down path:
- Portal → prgjsmes-prod → restart the app (Overview → Restart)
- Watch the health endpoint for 60 seconds after restart
- If still down: check App Service → Log stream for Program.cs startup errors
- If startup is failing, check recent deploys → consider re-swapping to the prior slot (see Rollback procedures below)
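
The status-to-path mapping in the first step can be scripted for quick triage. This is a minimal sketch, not part of the deployed tooling: classify_health is a hypothetical helper name, and it looks only at the HTTP status code, so it does not inspect the "status":"unhealthy" body.

```shell
#!/bin/sh
# Map an HTTP status code from /api/health to the matching runbook path.
# "000" is what curl's %{http_code} prints when no HTTP response arrived
# (connection refused, timeout, DNS failure).
classify_health() {
  case "$1" in
    200)     echo "stale-alert: clear it" ;;
    503)     echo "db-unreachable: follow the DB unreachable path" ;;
    401|403) echo "easy-auth: re-test from inside the VPN or with a valid token" ;;
    000)     echo "app-down: follow the App is down path" ;;
    *)       echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Example invocation (performs a real request, so it is left commented out):
# classify_health "$(curl -s -o /dev/null -m 10 -w '%{http_code}' \
#   https://psmes.progressivesurface.com/api/health)"
```

Any status outside the four listed cases falls through to the last branch, which is a prompt to read the full curl -v output by hand.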

Escalation: if DB auth issue is confirmed and needs a role change on the Managed Identity, page Adam.

🔴 Alert: `prgjsmes-http-5xx-storm` (Sev 1)

What it means: >10 HTTP 5xx responses in 5 minutes. Code is throwing unhandled exceptions in requests.

Immediate steps:

App Insights → Live Metrics (psi-webapps-insights) → filter by cloud role prgjsmes-prod
Failures tab → sort by count; find the top failing operation
Click a failure → see the exception type, message, stack trace
Decide severity:
- Single operator hit a bug (e.g., one cart with weird data) → fix forward, not urgent
- All operators hitting it → rollback. Swap staging back to production:
```
az webapp deployment slot swap \
  -g PS-WEBAPPS -n prgjsmes-prod \
  --slot staging --target-slot production
```
Communicate in Teams #prgjsmes-line1-pilot — “we’re seeing X errors, investigating”

Escalation: if cause is a recent deploy and rollback isn’t enough (e.g., DB state also changed), page Dakota for code fix.

🟡 Alert: `prgjsmes-plan-cpu-high` (Sev 2)

What it means: App Service plan CPU > 80% for 15 minutes.

Immediate steps:

App Insights → Performance → find the slow request (probably a fat query)
SQL DB metrics (PRGJSMES) → check cpu_percent — if also high, DB is the bottleneck

Short-term: scale up the plan temporarily:

az appservice plan update -g PS-WEBAPPS -n psi-asp-windows --sku P2v3

Long-term: investigate the slow query; add an index if appropriate (expand-contract not needed for CREATE INDEX)

Not pilot-blocking unless sustained for hours — operators feel slowness but it’s not an outage.

🟡 Alert: `prgjsmes-plan-memory-high` (Sev 2)

What it means: Plan memory > 85% for 15 minutes.

Immediate steps:

App Service → Overview → Restart (frees accumulated memory)
App Insights → Usage → check for memory leak patterns (growing over time)
If it recurs within an hour of the restart, escalate to a memory dump investigation via the App Service “Diagnose and solve problems” blade

🟡 Alert: `prgjsmes-db-cpu-high` (Sev 2)

What it means: SQL DB CPU > 80% for 15 minutes.

Immediate steps:

Portal → PRGJSMES DB → Query Performance Insight → top queries by CPU
Note the query SQL text; find the corresponding EF Core method
Short-term: bump the DB SKU temporarily (serverless auto-scales up to 2 vCore today; if 2 is maxed, bump to GP_S_Gen5_4):
```
az sql db update -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES --service-objective GP_S_Gen5_4
```
Long-term: index tuning or query rewrite

🟢 Alert: `prgjsmes-db-storage-high` (Sev 3)

What it means: DB storage > 80% of max (currently 32 GB cap).

Immediate steps:

Not urgent; you have time. Bump the max size via az sql db update -g ProcServices-Prod-Data -s procserv-proddata -n PRGJSMES --max-size 64GB
Longer-term: identify the large tables (App Insights → QA dashboard has some old data reports)

🟡 Alert: `prgjsmes-db-connection-failed` (Sev 2)

What it means: >10 failed SQL connections in 5 minutes. Usually means auth, firewall, or connection string problems.

Immediate steps:

Check the health endpoint — if it’s green, this might be a specific user/service hitting a firewall rule
SQL server → Firewalls and virtual networks → confirm expected IPs are allowed
Check Managed Identity status hasn’t been disabled
If it just started after a deploy, check appsettings.json for a typo in the connection string (though Key Vault refs should prevent this)

Common situations (not alert-driven)

Operator says “the page won’t load”

Have them refresh (Ctrl+F5 to bust cache)
Have them try another PRGJSMES page — if only one page breaks, it’s a component bug
If all pages fail: check the health endpoint; if that’s fine, have them check their network (Wi-Fi dropped?) and MSAL token (sign out and back in)
If health is failing → follow the health-check-failing playbook

Operator says “the scale isn’t reading”

Not an infra issue — it’s a hardware integration.

Check /scale-setup — is their terminal mapped to a scale?
Is the scale powered on, connected, reachable via IP?
Use Admin → Printers, Scales & Terminals → Test on the scale row. The modal runs a pre-flight reachability check (scale-type aware: I1 round-trip for Mettler, pure TCP-connect for Relay — A&D HID scales emit only on PRINT, so passive-read pre-flight would always false-negative) and shows one of two states in its own row above the live SSE stream:
- Green “Scale is reachable” → TCP path open. The live stream will then either populate (Mettler auto-stream) or wait silently for PRINT (A&D HID). If a Mettler scale is reachable but no readings appear in the live area, the problem is in the SSE delivery path — but note SSE now uses fetch() + Bearer (not EventSource), so 401s no longer silently break the stream.
- Red “Scale unreachable — Cannot reach …” → TCP connect failed. Check VNet routing from App Service to the scale subnet, scale/Pi power, the scale_relay service on the Pi, and IP/port in the Scale row.
USB mode: does the browser have Web Serial permission (Site Settings → Serial)?
See wiki/hardware/mettler-toledo-scales.md for MT-SICS protocol details

Operator says “labels aren’t printing”

Check /admin/printers — is the printer registered?
Can you test-print from Admin → Printers → test-print action?
Is the printer IP reachable on TCP:9100 from the App Service (via VNet integration)?
App Insights → Failures → search for PrintService — any timeouts?
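
A raw TCP connect is a quick way to tell whether port 9100 is open at all. The sketch below assumes a Linux host with bash and coreutils timeout that has a route to the printer subnet. tcp_reachable is a hypothetical helper, and 192.0.2.10 is a placeholder address. A pass from a workstation does not prove the App Service VNet path is open; it only rules out the printer itself being off or misaddressed.

```shell
#!/bin/sh
# Succeed if a TCP connection to host:port opens within the timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 3).
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
tcp_reachable() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (placeholder address, left commented out):
# tcp_reachable 192.0.2.10 9100 && echo "port open" || echo "port closed or filtered"
```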

Operator says “I can’t close this lot”

Usually a validation failure. Check:

Parts balance (pre-grit / thermal: parts ran = parts received - exceptions)
Supervisor initials (must match the Supervisors table)
Powder lot status (cross-line lockout active?)
If App Insights shows no exception, it’s a UI-side validation — walk through the form fields
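
The parts-balance rule is simple arithmetic and can be checked by hand or with a helper like the one below. This is an illustrative sketch only: parts_balanced is a hypothetical name, the figures in the usage note are made up, and the application's own validation remains authoritative.

```shell
#!/bin/sh
# Check the pre-grit / thermal balance: parts ran = parts received - exceptions.
# $1 = parts ran, $2 = parts received, $3 = exceptions.
# Prints "balanced" and returns 0, or prints the signed gap and returns 1.
parts_balanced() {
  pb_expected=$(( $2 - $3 ))
  if [ "$1" -eq "$pb_expected" ]; then
    echo "balanced"
  else
    echo "off by $(( $1 - pb_expected ))"
    return 1
  fi
}
```

For example, parts_balanced 117 120 3 prints balanced, while parts_balanced 115 120 3 prints off by -2, meaning two parts are unaccounted for.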

Operator says “two operators are fighting over the same cart”

Multi-operator contention on the same cart/lot. The app applies a non-regressive Math.Max status pattern (order.StatusId = Math.Max(order.StatusId, newStatus)), so a late save cannot move a status backwards, but weight entry may still race. Manual resolution:

Identify which save actually persisted (check the DB via Admin/query)
The other operator’s entry is lost; they need to redo it
If this happens repeatedly, it’s a poka-yoke gap — log to bug-patterns.md
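
The Math.Max pattern mentioned above can be illustrated with plain integer arithmetic. The shell sketch below only mirrors the logic of the expression; next_status is a hypothetical helper, and the status IDs in the usage note are invented for the example.

```shell
#!/bin/sh
# Return the larger of the current and requested status IDs, mirroring
# order.StatusId = Math.Max(order.StatusId, newStatus).
# $1 = current status ID, $2 = requested status ID.
next_status() {
  if [ "$2" -gt "$1" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}
```

With a current status of 3, next_status 3 5 yields 5. If a second operator's stale save then requests 4, next_status 5 4 still yields 5, so the status never regresses. The same guarantee does not extend to weight fields, which is why a lost entry has to be redone by hand.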

Rollback procedures

Rollback a bad deploy

The blue-green pattern gives you instant rollback if the bad deploy was the latest:

az webapp deployment slot swap \
  -g PS-WEBAPPS -n prgjsmes-prod \
  --slot staging --target-slot production

Takes ~30 seconds. The previous production code (sitting in staging slot since the last deploy) comes back as production. Staging now has the broken code — leave it there for diagnosis.

Caveat: this only works if you haven’t deployed anything else since the bad deploy. If staging has been overwritten by a newer build, use the procedure under “Rollback when slot re-swap isn’t available” below.
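
After a swap or restart, the health endpoint can be polled until it reports 200. This is a minimal sketch, assuming curl is available. wait_healthy and probe_health are hypothetical helper names, and the 60-second default is borrowed from the restart step in the health-check playbook.

```shell
#!/bin/sh
# Print the HTTP status of a URL; "000" means no HTTP response was received.
probe_health() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Poll until the endpoint returns 200 or the time budget is used up.
# $1 = URL, $2 = seconds to keep trying (default 60),
# $3 = seconds between probes (default 5).
# Returns 0 on the first 200, 1 if the budget runs out.
wait_healthy() {
  wh_url=$1
  wh_budget=${2:-60}
  wh_interval=${3:-5}
  wh_waited=0
  while [ "$wh_waited" -le "$wh_budget" ]; do
    wh_code=$(probe_health "$wh_url") || true
    echo "t+${wh_waited}s status=${wh_code}"
    if [ "$wh_code" = "200" ]; then
      return 0
    fi
    sleep "$wh_interval"
    wh_waited=$(( wh_waited + wh_interval ))
  done
  return 1
}

# Example (performs real requests, so it is left commented out):
# wait_healthy https://psmes.progressivesurface.com/api/health 60 5
```

A non-zero return after the budget expires is the cue to move on to the Log stream check rather than keep waiting.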

Rollback when slot re-swap isn’t available

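git revert -m 1 <merge-commit-sha> on a new branch (-m 1 selects the mainline parent and is required when the SHA is a true merge commit; omit it for a squash-merged commit)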
Open a PR, merge, let normal deploy run

Or for extreme emergencies, restore from daily App Service backup:

Portal → prgjsmes-prod → Backups → select a daily backup from psiappbackups storage
Restore (takes 5-10 minutes)

Rollback a bad SQL migration

If the migration was additive: no rollback needed, just stop using the new column/table.

If the migration was destructive (broke old code): this should be nearly impossible because the policy + linter enforce expand-contract. But if it slipped through:

Immediately restore the affected DB to a new DB using Point-in-Time Restore (the source database is passed with --name):

az sql db restore -g ProcServices-Prod-Data -s procserv-proddata \
  --dest-name PRGJSMES-recovery --name PRGJSMES \
  --time "<pre-migration timestamp>"

Validate data in PRGJSMES-recovery
Rename production DB aside (PRGJSMES-broken), rename recovery to PRGJSMES
Restart App Service
Post-incident review mandatory
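
The rename-and-restart steps above can be rehearsed as a dry run before touching production. This is a sketch under stated assumptions: az sql db rename and az webapp restart are existing Azure CLI commands, but the run wrapper and promote_recovery function are hypothetical helpers. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
RG=ProcServices-Prod-Data
SERVER=procserv-proddata

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Move the damaged database aside, promote the validated recovery copy,
# then restart the app. Stops at the first command that fails.
promote_recovery() {
  run az sql db rename -g "$RG" -s "$SERVER" -n PRGJSMES --new-name PRGJSMES-broken &&
  run az sql db rename -g "$RG" -s "$SERVER" -n PRGJSMES-recovery --new-name PRGJSMES &&
  run az webapp restart -g PS-WEBAPPS -n prgjsmes-prod
}
```

Calling promote_recovery with the default settings prints the three commands in order so they can be reviewed. Only run it with DRY_RUN=0 once the data in PRGJSMES-recovery has been validated.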

Full Line 1 rollback to Access

If PRGJSMES is unusable and operators need to keep running production:

Line 1 supervisor switches operators to paper travelers (kept in reserve at Line 1)
End of shift: operators enter that shift’s data into the original Line 1 Access DB
Preserve the PRGJSMES partial data; do not truncate it
Formal incident review within 24 hours before re-attempting

Monitoring + where to look

| Need | Tool |
| --- | --- |
| Is the app up? | `curl /api/health` or the Azure portal `prgjsmes-prod` Overview |
| What’s slow? | App Insights → `psi-webapps-insights` → Performance |
| What’s failing? | App Insights → Failures (top exceptions, request paths) |
| Live traffic? | App Insights → Live Metrics |
| DB health? | Portal → `PRGJSMES` DB → Overview + Query Performance Insight |
| Alerts | Portal → Monitor → Alerts → scope `prgjsmes-prod` or `PRGJSMES` |
| Logs | Log Analytics → `DefaultWorkspace-...-NCUS` → query `AppServiceHTTPLogs`, `AppServiceAppLogs`, `SQLInsights` |
| Audit (who did what in SQL) | Log Analytics → `SQLSecurityAuditEvents` |
| Cost surprises | Portal → Cost Management → subscription budget `psi-sub-monthly-budget` |

Example KQL queries

// Find recent 5xx responses
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where ScStatus >= 500
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, CIp, UserAgent
| order by TimeGenerated desc
 
// Find slow queries
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "QueryStoreRuntimeStatistics"
| where duration_d > 1000
| project TimeGenerated, statement_s, duration_d, cpu_time_d
 
// App exceptions
exceptions
| where cloud_RoleName == "prgjsmes-prod"
| where timestamp > ago(1h)
| summarize count() by type, outerMessage
| order by count_ desc

Escalation + contacts

| Tier | Who | When |
| --- | --- | --- |
| 1 | Line 1 supervisor | First-line operator issues |
| 2 | Adam (adevereaux@progressivesurface.com) | Infrastructure, access, DB, Azure |
| 2 | Dakota | Application bugs, code fixes, deploy issues |
| 3 | Vendor support (rare) | Azure platform-level incidents: https://portal.azure.com → Help + support |

On-call for pilot week: Adam + Dakota both reachable by Teams during business hours; Adam by email overnight via ag-prgjsmes-oncall action group.

Related docs

PRGJSMES application — architecture, readiness status
Schema Change Policy — expand-contract rule
CD + Slot Swap — deploy flow, rollback mechanism
PRGJSMES/docs/DEVELOPMENT.md — developer workflow
PRGJSMES/docs/plans/2026-04-28-line1-pilot-launch.md — pilot launch plan + Day 1 support

Last updated: 2026-04-24. Maintainers: Adam Devereaux, Dakota Cooper.
