Pipeline Infrastructure

How the PSI data pipeline runs: server, schedulers, databases, and data flow.

Server

All pipeline jobs run on PS-PROXY.AD.PTIHOME.com (Windows Server), which hosts:

GitHub Actions self-hosted runner — labeled [self-hosted, ps-proxy], executes workflow .yml files from psi-data-pipeline repo
Prefect Server — local OSS instance at http://ps-proxy:4200, with a psi-pipeline work pool
PSI.UniData.API — Windows service providing REST access to AFTEC/UniData (see UniData API)

Runner requirements:

Python 3.12 (installed via actions/setup-python@v5)
Network access to DFS shares (\\ad.ptihome.com\DFS\*), MRP-PROD (UniData), and Azure SQL
Azure AD service principal credentials (stored in GitHub secrets and system env vars)
Playwright + Chromium (for Rockwell lifecycle scraper fallback)

Two Schedulers (Dual-Fire Issue)

Both GitHub Actions and Prefect are active and scheduling the same jobs. This is a known issue — 6 jobs currently fire from both schedulers at the same UTC times.

GitHub Actions (Primary — 10 workflows)

Source: psi-data-pipeline/.github/workflows/*.yml

All workflows use shell: powershell on the [self-hosted, ps-proxy] runner.

Workflow	Schedule (UTC)	Tables
`sync-aftec-raw.yml`	`0 6 * * *` (daily)	31 raw AFTEC tables
`nightly-data-build.yml`	`0 6 30 * * *` (daily)	17 analytics tables
`sync-bom-data.yml`	`0 8 * * 1-5` (weekdays)	bom_parts
`sync-kg-data.yml`	`0 7 * * *` (daily)	kg_* tables
`sync-prims.yml`	`0 5 * * 1-5` (weekdays)	prims_installations
`sync-dept-schedules.yml`	`0 14 * * 1-5` (weekdays)	dept_schedules, floor_space
`sync-rockwell.yml`	`0 10 * * 1` (Mondays)	component_lifecycle, project_components
`sync-product-taxonomy.yml`	`0 7 * * 1` (Mondays)	product_classes, product_catalog
`enrich-dell-warranty.yml`	`0 12 * * 6` (Saturdays)	dell_warranty
`setup-env-vars.yml`	Manual only	(system config)

Prefect (6 deployments — same scripts, same schedules)

Dashboard: http://ps-proxy:4200/dashboard Work pool: psi-pipeline Registration script: psi-data-pipeline/prefect_deployments.py

Deployment	Schedule (UTC)	Also in GH Actions?
sync-aftec-raw	`0 6 * * *`	Yes — identical
sync-bom-data	`0 8 * * 1-5`	Yes — identical
sync-prims	`0 5 * * 1-5`	Yes — identical
sync-dept-schedules	`0 14 * * 1-5`	Yes — identical
sync-product-taxonomy	`0 7 * * 1`	Yes — identical
enrich-dell-warranty	`0 12 * * 6`	Yes — identical

Not in Prefect: nightly-data-build, sync-rockwell (Component Lifecycle), sync-kg-data, setup-env-vars.

How the decorators work

Each pipeline script has a safe import:

try:
    from prefect import flow
except ImportError:
    def flow(fn=None, **kwargs):  # no-op if Prefect not installed
        return fn if fn else lambda f: f

This means the scripts work identically whether invoked by Prefect or by GitHub Actions — the @flow decorator only activates if Prefect is installed in the Python environment.

Resolution needed

The dual-fire situation should be resolved by either:

Disabling Prefect schedules and keeping GH Actions as the sole scheduler (simpler)
Migrating to Prefect and removing GH Actions cron triggers (better observability, retry logic, dashboard)
Splitting responsibility — some jobs via Prefect, others via GH Actions (complex, not recommended)

Until resolved, the scripts are idempotent (INSERT OR REPLACE) so duplicate runs don’t corrupt data, but they do waste API calls and DFS I/O.

Data Flow: Two-Tier SQLite

Both databases live on the DFS share at \\ad.ptihome.com\DFS\Schedule\SS123\LEADTIME\.

psi_ingest.db (Persistent Scraper Cache)

Each scraper workflow owns its tables and writes directly here. Data persists across nightly rebuilds.

Owned by: sync-prims, sync-rockwell, sync-product-taxonomy, sync-dept-schedules, sync-kg-data

psi_analytics.db (Rebuilt Nightly)

The nightly-data-build.yml workflow rebuilds this from scratch every night at 1:30 AM ET. It imports AFTEC CSV files, computes derived features, then imports scraper-owned tables from psi_ingest.db.

Atomic deployment: Writes to .tmp file, renames over live DB when complete.

Azure SQL: PSI_Analytics (Final Destination)

All tables sync to procserv-proddata.database.windows.net / PSI_Analytics via MSAL service principal auth. This is what Power BI, Redbook Dashboard, and PSI Explorer connect to.

Four-step scraper pattern

Every scraper workflow follows:

Script writes to psi_ingest.db
publish_ingest_tables.py copies to psi_analytics.db
sync_table_to_azure.py syncs to Azure SQL
verify_passthrough_tables.py confirms row counts match across all three tiers

Authentication

Credential	Where stored	Used by
Azure AD Service Principal (tenant, client ID, secret)	GH secrets + PS-PROXY system env vars	All Azure SQL writes
UniData service account	AD `extensionAttribute2`	BOM API calls via PSI.UniData.API
Rockwell API (embedded public credentials)	Hardcoded in `rockwell_lifecycle.py`	Rockwell lifecycle lookups

Data Brain — master data reference, table schemas, row counts
UniData API — REST API on PS-PROXY, BOM/Parts/Projects endpoints
psi-data-pipeline/docs/pipeline-architecture.md — in-repo architecture doc
psi-data-pipeline/CLAUDE.md — detailed schema and migration plan

PSI Knowledge Base

Explorer

pipeline-infrastructure

Pipeline Infrastructure

Server

Two Schedulers (Dual-Fire Issue)

GitHub Actions (Primary — 10 workflows)

Prefect (6 deployments — same scripts, same schedules)

How the decorators work

Resolution needed

Data Flow: Two-Tier SQLite

psi_ingest.db (Persistent Scraper Cache)

psi_analytics.db (Rebuilt Nightly)

Azure SQL: PSI_Analytics (Final Destination)

Four-step scraper pattern

Authentication

Graph View

Table of Contents

PSI Knowledge Base

Explorer

pipeline-infrastructure

Pipeline Infrastructure

Server

Two Schedulers (Dual-Fire Issue)

GitHub Actions (Primary — 10 workflows)

Prefect (6 deployments — same scripts, same schedules)

How the decorators work

Resolution needed

Data Flow: Two-Tier SQLite

psi_ingest.db (Persistent Scraper Cache)

psi_analytics.db (Rebuilt Nightly)

Azure SQL: PSI_Analytics (Final Destination)

Four-step scraper pattern

Authentication

Related

Graph View

Table of Contents