Pipeline Infrastructure
How the PSI data pipeline runs: server, schedulers, databases, and data flow.
Server
All pipeline jobs run on PS-PROXY.AD.PTIHOME.com (Windows Server), which hosts:
- GitHub Actions self-hosted runner — labeled
[self-hosted, ps-proxy], executes workflow.ymlfiles frompsi-data-pipelinerepo - Prefect Server — local OSS instance at
http://ps-proxy:4200, with apsi-pipelinework pool - PSI.UniData.API — Windows service providing REST access to AFTEC/UniData (see UniData API)
Runner requirements:
- Python 3.12 (installed via
actions/setup-python@v5) - Network access to DFS shares (
\\ad.ptihome.com\DFS\*), MRP-PROD (UniData), and Azure SQL - Azure AD service principal credentials (stored in GitHub secrets and system env vars)
- Playwright + Chromium (for Rockwell lifecycle scraper fallback)
Two Schedulers (Dual-Fire Issue)
Both GitHub Actions and Prefect are active and scheduling the same jobs. This is a known issue — 6 jobs currently fire from both schedulers at the same UTC times.
GitHub Actions (Primary — 10 workflows)
Source: psi-data-pipeline/.github/workflows/*.yml
All workflows use shell: powershell on the [self-hosted, ps-proxy] runner.
| Workflow | Schedule (UTC) | Tables |
|---|---|---|
sync-aftec-raw.yml | 0 6 * * * (daily) | 31 raw AFTEC tables |
nightly-data-build.yml | 0 6 30 * * * (daily) | 17 analytics tables |
sync-bom-data.yml | 0 8 * * 1-5 (weekdays) | bom_parts |
sync-kg-data.yml | 0 7 * * * (daily) | kg_* tables |
sync-prims.yml | 0 5 * * 1-5 (weekdays) | prims_installations |
sync-dept-schedules.yml | 0 14 * * 1-5 (weekdays) | dept_schedules, floor_space |
sync-rockwell.yml | 0 10 * * 1 (Mondays) | component_lifecycle, project_components |
sync-product-taxonomy.yml | 0 7 * * 1 (Mondays) | product_classes, product_catalog |
enrich-dell-warranty.yml | 0 12 * * 6 (Saturdays) | dell_warranty |
setup-env-vars.yml | Manual only | (system config) |
Prefect (6 deployments — same scripts, same schedules)
Dashboard: http://ps-proxy:4200/dashboard
Work pool: psi-pipeline
Registration script: psi-data-pipeline/prefect_deployments.py
| Deployment | Schedule (UTC) | Also in GH Actions? |
|---|---|---|
| sync-aftec-raw | 0 6 * * * | Yes — identical |
| sync-bom-data | 0 8 * * 1-5 | Yes — identical |
| sync-prims | 0 5 * * 1-5 | Yes — identical |
| sync-dept-schedules | 0 14 * * 1-5 | Yes — identical |
| sync-product-taxonomy | 0 7 * * 1 | Yes — identical |
| enrich-dell-warranty | 0 12 * * 6 | Yes — identical |
Not in Prefect: nightly-data-build, sync-rockwell (Component Lifecycle), sync-kg-data, setup-env-vars.
How the decorators work
Each pipeline script has a safe import:
try:
from prefect import flow
except ImportError:
def flow(fn=None, **kwargs): # no-op if Prefect not installed
return fn if fn else lambda f: fThis means the scripts work identically whether invoked by Prefect or by GitHub Actions — the @flow decorator only activates if Prefect is installed in the Python environment.
Resolution needed
The dual-fire situation should be resolved by either:
- Disabling Prefect schedules and keeping GH Actions as the sole scheduler (simpler)
- Migrating to Prefect and removing GH Actions cron triggers (better observability, retry logic, dashboard)
- Splitting responsibility — some jobs via Prefect, others via GH Actions (complex, not recommended)
Until resolved, the scripts are idempotent (INSERT OR REPLACE) so duplicate runs don’t corrupt data, but they do waste API calls and DFS I/O.
Data Flow: Two-Tier SQLite
Both databases live on the DFS share at \\ad.ptihome.com\DFS\Schedule\SS123\LEADTIME\.
psi_ingest.db (Persistent Scraper Cache)
Each scraper workflow owns its tables and writes directly here. Data persists across nightly rebuilds.
Owned by: sync-prims, sync-rockwell, sync-product-taxonomy, sync-dept-schedules, sync-kg-data
psi_analytics.db (Rebuilt Nightly)
The nightly-data-build.yml workflow rebuilds this from scratch every night at 1:30 AM ET. It imports AFTEC CSV files, computes derived features, then imports scraper-owned tables from psi_ingest.db.
Atomic deployment: Writes to .tmp file, renames over live DB when complete.
Azure SQL: PSI_Analytics (Final Destination)
All tables sync to procserv-proddata.database.windows.net / PSI_Analytics via MSAL service principal auth. This is what Power BI, Redbook Dashboard, and PSI Explorer connect to.
Four-step scraper pattern
Every scraper workflow follows:
- Script writes to
psi_ingest.db publish_ingest_tables.pycopies topsi_analytics.dbsync_table_to_azure.pysyncs to Azure SQLverify_passthrough_tables.pyconfirms row counts match across all three tiers
Authentication
| Credential | Where stored | Used by |
|---|---|---|
| Azure AD Service Principal (tenant, client ID, secret) | GH secrets + PS-PROXY system env vars | All Azure SQL writes |
| UniData service account | AD extensionAttribute2 | BOM API calls via PSI.UniData.API |
| Rockwell API (embedded public credentials) | Hardcoded in rockwell_lifecycle.py | Rockwell lifecycle lookups |
Related
- Data Brain — master data reference, table schemas, row counts
- UniData API — REST API on PS-PROXY, BOM/Parts/Projects endpoints
psi-data-pipeline/docs/pipeline-architecture.md— in-repo architecture docpsi-data-pipeline/CLAUDE.md— detailed schema and migration plan