Data Flows — OL Data Platform
Generated 2026-06-24 13:33 UTC · c4gen dev
Each scenario below replays one interaction as a C4 Dynamic diagram. Amber steps are asynchronous (queued / scheduled / event-driven).
How to read these diagrams
These are C4 model diagrams (C4-PlantUML). Read them top-down: System Context (the whole SOA) → Container (one system's runtime units) → Dynamic (a single data flow, step by step).
- People are rounded boxes; systems and containers are rectangles; databases and queues have distinct shapes.
- Each arrow is a data flow labelled with what moves.
- Solid arrows are synchronous (request/response, caller blocks).
- Amber dashed arrows are asynchronous (queued, scheduled, or event-driven — caller does not block).
App-database ingest into the raw lake (asynchronous)
The Dagster daemon triggers the per-connection sync_and_stage schedule in the lakehouse location, which materializes the Airbyte assets. Airbyte extracts a SOA app's Postgres or MySQL database and lands raw Iceberg tables (carrying Airbyte's _airbyte_* metadata columns) in the lake on S3. Evidenced by the OpenMetadata raw tables raw__mitlearn__app__postgres__* and raw__mitxonline__openedx__mysql__*.
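
The raw table names above appear to follow a double-underscore convention of raw, system, component, engine, then source table. The following is a minimal sketch of parsing such names, assuming that five-part layout holds for app-database sources. `RawTable`, `parse_raw_table`, and the example table name `users_user` are hypothetical and not taken from the platform code. Names with fewer segments, such as raw__edxorg__s3__*, do not fit this layout and are rejected.

```python
from typing import NamedTuple


class RawTable(NamedTuple):
    system: str     # e.g. "mitlearn"
    component: str  # e.g. "app" or "openedx"
    engine: str     # e.g. "postgres" or "mysql"
    table: str      # source table name (may itself contain "__")


def parse_raw_table(name: str) -> RawTable:
    """Split a raw-layer table name on double underscores.

    Assumes the layout raw__<system>__<component>__<engine>__<table>.
    """
    # maxsplit=4 keeps any "__" inside the source table name intact.
    parts = name.split("__", 4)
    if len(parts) != 5 or parts[0] != "raw":
        raise ValueError(f"not a five-part raw table name: {name!r}")
    _, system, component, engine, table = parts
    return RawTable(system, component, engine, table)


if __name__ == "__main__":
    # "users_user" is a made-up table name used only for illustration.
    print(parse_raw_table("raw__mitlearn__app__postgres__users_user"))
```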
dbt transform raw to dimensional marts (asynchronous)
After staging, the lakehouse location runs the dbt project. dbt executes its model DAG in Trino, reading the raw tables and writing the staging → intermediate → dimensional/marts layers in the Iceberg lake. OpenMetadata lineage confirms that dim_course_run integrates raw__{micromasters,mitxonline,xpro}__app__postgres, raw__mitx__openedx__mysql, raw__edxorg__s3 and raw__irx__edxorg__bigquery.
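
To illustrate the layering, here is a small sketch that expands the brace shorthand used above and derives a valid build order with the standard library's `graphlib`. It is not how dbt resolves its DAG internally. The single `staging` and `intermediate` nodes are hypothetical stand-ins for the many models in each layer, and `expand_braces` and `build_order` are illustrative helpers.

```python
import re
from graphlib import TopologicalSorter


def expand_braces(pattern: str) -> list[str]:
    """Expand one {a,b,c} group: 'raw__{x,y}__app' -> ['raw__x__app', 'raw__y__app']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


# Raw sources named in the lineage statement above.
RAW_SOURCES = [
    *expand_braces("raw__{micromasters,mitxonline,xpro}__app__postgres"),
    "raw__mitx__openedx__mysql",
    "raw__edxorg__s3",
    "raw__irx__edxorg__bigquery",
]

# Maps each node to the set of nodes it reads from.
# "staging" and "intermediate" are placeholders, not real model names.
LINEAGE = {
    "staging": set(RAW_SOURCES),
    "intermediate": {"staging"},
    "dim_course_run": {"intermediate"},
}


def build_order(lineage: dict[str, set[str]]) -> list[str]:
    """Return one order in which every node comes after its inputs."""
    return list(TopologicalSorter(lineage).static_order())


if __name__ == "__main__":
    for node in build_order(LINEAGE):
        print(node)
```

The raw sources come out first in no guaranteed relative order, followed by the staging, intermediate, and mart nodes.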
Content webhook back into MIT Learn (asynchronous)
Today the platform's reverse path is an HMAC-signed content webhook, not Hightouch (Hightouch lives on the mit-learn side). After an Open edX course export, the openedx location POSTs to MIT Learn's /api/v1/webhooks/content_files/ to trigger ingest of the new content files.
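
A minimal sketch of HMAC signing and verification for such a webhook, using only the Python standard library. The header name `X-Signature`, the SHA-256 digest, the hex encoding, and the payload fields are all assumptions for illustration; the actual scheme used between the platform and MIT Learn is not specified here. No HTTP request is sent.

```python
import hashlib
import hmac
import json

# Hypothetical header name; the real integration may use a different one.
SIGNATURE_HEADER = "X-Signature"


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the exact request body bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_request(secret: bytes, payload: dict) -> tuple[bytes, dict[str, str]]:
    """Serialize the payload once and sign those exact bytes."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        SIGNATURE_HEADER: sign_body(secret, body),
    }
    return body, headers


def verify(secret: bytes, body: bytes, headers: dict[str, str]) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    received = headers.get(SIGNATURE_HEADER, "")
    return hmac.compare_digest(received, sign_body(secret, body))


if __name__ == "__main__":
    # Placeholder secret and payload, for illustration only.
    body, headers = build_request(b"shared-secret", {"course_id": "example"})
    print(verify(b"shared-secret", body, headers))
```

Signing the serialized bytes rather than the parsed payload avoids mismatches caused by key ordering or whitespace differences between sender and receiver.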
OpenMetadata catalog & lineage harvest (asynchronous)
The data_platform location runs the OpenMetadata metadata workflow against Trino, and the lakehouse location uploads dbt docs artifacts to S3 for OpenMetadata to ingest. Together these produce the cross-system table/column lineage and PII classification seen in the catalog.
TARGET: consolidate app ETL onto the platform
Goal state: heavy ETL currently embedded in each SOA app's Celery workers (MITx Online edX-sync & certificate generation; MIT Learn edx_content catalog/resource ingest) is relocated onto Dagster on this platform, leaving the apps to serve requests. These edges are tagged target and are not yet implemented.
Ingestion sources (ETL)
Every external source this platform ingests, with transport, cadence, the data landed, and the code path that defines it. ⚠️ marks brittle linkages (HTML/token scrapes, hardcoded URLs).
| Source | Transport | Cadence | Data | Source of truth |
|---|---|---|---|---|
| Bootcamps (app Postgres) | Airbyte Postgres | daily | raw__bootcamps__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Canvas | Canvas API common-cartridge export → S3 | every 6h | course exports | dg_projects/canvas |
| Emeritus / IRX BigQuery | Airbyte BigQuery | scheduled | raw__irx__edxorg__bigquery__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Forums (mitx/mitxonline/xpro) | Airbyte / mongodump | daily | forum discussion data | dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py |
| ⚠️ MIT Climate / MITPE / OLL / podcasts | dlt (REST/CSV/RSS) | scheduled | catalog + feed content | dg_projects/data_loading/data_loading/defs |
| MIT Learn (app Postgres) | Airbyte Postgres | every 24h | raw__mitlearn__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| MITx Online (Open edX MySQL) | Airbyte MySQL | every 12h | raw__mitxonline__openedx__mysql__* | dg_projects/lakehouse/lakehouse/definitions.py |
| MITx Online (app Postgres) | Airbyte Postgres | every 6h | raw__mitxonline__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Mailgun | Airbyte REST | every 6h | email events | dg_projects/lakehouse/lakehouse/definitions.py |
| MicroMasters (app Postgres) | Airbyte Postgres | daily | raw__micromasters__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| OCW Studio (app Postgres) | Airbyte Postgres | daily | raw__ocw__studio__* | dg_projects/lakehouse/lakehouse/definitions.py |
| ODL Video Service | REST API (Dagster asset) | every 10m | video + transcript metadata | dg_projects/learning_resources/learning_resources/assets/ovs_videos.py |
| Salesforce | Airbyte REST | every 24h | CRM objects | dg_projects/lakehouse/lakehouse/definitions.py |
| Sloan Executive Ed | OAuth REST API (Dagster asset) | daily | courses | dg_projects/learning_resources |
| Tracking logs (S3) | Airbyte S3 + DuckDB normalize | scheduled | Open edX tracking events | dg_projects/openedx/openedx/assets/openedx.py |
| edX.org archives + tracking logs | GCS/S3 → dlt | scheduled | raw__edxorg__s3__* | dg_projects/data_loading/data_loading/defs |
| xPRO (app Postgres + Open edX MySQL) | Airbyte | every 12h | raw__xpro__{app__postgres,openedx__mysql}__* | dg_projects/lakehouse/lakehouse/definitions.py |
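
The Cadence column mixes forms such as "every 6h", "every 10m", "daily", and "scheduled". As a hedged sketch, the helper below normalizes those strings into `timedelta` values for comparison. `parse_cadence` is hypothetical, and treating "scheduled" as an unspecified interval is an assumption, since the table does not state one.

```python
import re
from datetime import timedelta
from typing import Optional

# Unit suffixes seen in the table, mapped to timedelta keyword names.
_UNITS = {"m": "minutes", "h": "hours"}


def parse_cadence(text: str) -> Optional[timedelta]:
    """Convert a cadence label to an interval, or None if none is stated."""
    text = text.strip().lower()
    if text == "daily":
        return timedelta(days=1)
    match = re.fullmatch(r"every (\d+)([mh])", text)
    if match:
        return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})
    # "scheduled" and anything unrecognized carry no explicit interval.
    return None


if __name__ == "__main__":
    for label in ("every 6h", "every 10m", "daily", "scheduled"):
        print(label, "->", parse_cadence(label))
```

Under this reading, "daily" and "every 24h" describe the same interval.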