Pipeline Infrastructure

How the PSI data pipeline runs: server, schedulers, databases, and data flow.


Server

All pipeline jobs run on PS-PROXY.AD.PTIHOME.com (Windows Server), which hosts:

  • GitHub Actions self-hosted runner — labeled [self-hosted, ps-proxy], executes workflow .yml files from psi-data-pipeline repo
  • Prefect Server — local OSS instance at http://ps-proxy:4200, with a psi-pipeline work pool
  • PSI.UniData.API — Windows service providing REST access to AFTEC/UniData (see UniData API)

Runner requirements:

  • Python 3.12 (installed via actions/setup-python@v5)
  • Network access to DFS shares (\\ad.ptihome.com\DFS\*), MRP-PROD (UniData), and Azure SQL
  • Azure AD service principal credentials (stored in GitHub secrets and system env vars)
  • Playwright + Chromium (for Rockwell lifecycle scraper fallback)

Two Schedulers (Dual-Fire Issue)

Both GitHub Actions and Prefect are active and scheduling the same jobs. This is a known issue — 6 jobs currently fire from both schedulers at the same UTC times.

GitHub Actions (Primary — 10 workflows)

Source: psi-data-pipeline/.github/workflows/*.yml

All workflows use shell: powershell on the [self-hosted, ps-proxy] runner.

WorkflowSchedule (UTC)Tables
sync-aftec-raw.yml0 6 * * * (daily)31 raw AFTEC tables
nightly-data-build.yml0 6 30 * * * (daily)17 analytics tables
sync-bom-data.yml0 8 * * 1-5 (weekdays)bom_parts
sync-kg-data.yml0 7 * * * (daily)kg_* tables
sync-prims.yml0 5 * * 1-5 (weekdays)prims_installations
sync-dept-schedules.yml0 14 * * 1-5 (weekdays)dept_schedules, floor_space
sync-rockwell.yml0 10 * * 1 (Mondays)component_lifecycle, project_components
sync-product-taxonomy.yml0 7 * * 1 (Mondays)product_classes, product_catalog
enrich-dell-warranty.yml0 12 * * 6 (Saturdays)dell_warranty
setup-env-vars.ymlManual only(system config)

Prefect (6 deployments — same scripts, same schedules)

Dashboard: http://ps-proxy:4200/dashboard Work pool: psi-pipeline Registration script: psi-data-pipeline/prefect_deployments.py

DeploymentSchedule (UTC)Also in GH Actions?
sync-aftec-raw0 6 * * *Yes — identical
sync-bom-data0 8 * * 1-5Yes — identical
sync-prims0 5 * * 1-5Yes — identical
sync-dept-schedules0 14 * * 1-5Yes — identical
sync-product-taxonomy0 7 * * 1Yes — identical
enrich-dell-warranty0 12 * * 6Yes — identical

Not in Prefect: nightly-data-build, sync-rockwell (Component Lifecycle), sync-kg-data, setup-env-vars.

How the decorators work

Each pipeline script has a safe import:

try:
    from prefect import flow
except ImportError:
    def flow(fn=None, **kwargs):  # no-op if Prefect not installed
        return fn if fn else lambda f: f

This means the scripts work identically whether invoked by Prefect or by GitHub Actions — the @flow decorator only activates if Prefect is installed in the Python environment.

Resolution needed

The dual-fire situation should be resolved by either:

  1. Disabling Prefect schedules and keeping GH Actions as the sole scheduler (simpler)
  2. Migrating to Prefect and removing GH Actions cron triggers (better observability, retry logic, dashboard)
  3. Splitting responsibility — some jobs via Prefect, others via GH Actions (complex, not recommended)

Until resolved, the scripts are idempotent (INSERT OR REPLACE) so duplicate runs don’t corrupt data, but they do waste API calls and DFS I/O.


Data Flow: Two-Tier SQLite

Both databases live on the DFS share at \\ad.ptihome.com\DFS\Schedule\SS123\LEADTIME\.

psi_ingest.db (Persistent Scraper Cache)

Each scraper workflow owns its tables and writes directly here. Data persists across nightly rebuilds.

Owned by: sync-prims, sync-rockwell, sync-product-taxonomy, sync-dept-schedules, sync-kg-data

psi_analytics.db (Rebuilt Nightly)

The nightly-data-build.yml workflow rebuilds this from scratch every night at 1:30 AM ET. It imports AFTEC CSV files, computes derived features, then imports scraper-owned tables from psi_ingest.db.

Atomic deployment: Writes to .tmp file, renames over live DB when complete.

Azure SQL: PSI_Analytics (Final Destination)

All tables sync to procserv-proddata.database.windows.net / PSI_Analytics via MSAL service principal auth. This is what Power BI, Redbook Dashboard, and PSI Explorer connect to.

Four-step scraper pattern

Every scraper workflow follows:

  1. Script writes to psi_ingest.db
  2. publish_ingest_tables.py copies to psi_analytics.db
  3. sync_table_to_azure.py syncs to Azure SQL
  4. verify_passthrough_tables.py confirms row counts match across all three tiers

Authentication

CredentialWhere storedUsed by
Azure AD Service Principal (tenant, client ID, secret)GH secrets + PS-PROXY system env varsAll Azure SQL writes
UniData service accountAD extensionAttribute2BOM API calls via PSI.UniData.API
Rockwell API (embedded public credentials)Hardcoded in rockwell_lifecycle.pyRockwell lifecycle lookups

  • Data Brain — master data reference, table schemas, row counts
  • UniData API — REST API on PS-PROXY, BOM/Parts/Projects endpoints
  • psi-data-pipeline/docs/pipeline-architecture.md — in-repo architecture doc
  • psi-data-pipeline/CLAUDE.md — detailed schema and migration plan