Data Flows — OL Data Platform

Generated 2026-06-24 13:33 UTC · c4gen dev

Each scenario below replays one interaction as a C4 Dynamic diagram. Amber steps are asynchronous (queued / scheduled / event-driven).

How to read these diagrams

These are C4 model diagrams (C4-PlantUML). Read them top-down: System Context (the whole SOA) → Container (one system's runtime units) → Dynamic (a single data flow, step by step).

  • People are rounded boxes; systems and containers are rectangles; databases and queues have distinct shapes.
  • Each arrow is a data flow labelled with what moves.
  • Solid arrows are synchronous (request/response, caller blocks).
  • Amber dashed arrows are asynchronous (queued, scheduled, or event-driven — caller does not block).
  • Drag to pan, scroll to zoom. Boxes with a link drill into the next level.

App-database ingest into the raw lake (asynchronous)

The Dagster daemon triggers the per-connection sync_and_stage schedule in the lakehouse location, which materializes the Airbyte assets. Airbyte extracts a SOA app's Postgres or MySQL database and lands raw Iceberg tables (carrying _airbyte_* columns) in the lake on S3. Evidenced by the OpenMetadata raw tables raw__mitlearn__app__postgres__* and raw__mitxonline__openedx__mysql__*.
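As a rough illustration of how the raw-table prefixes above identify their origin, the sketch below matches a table name against the two wildcard patterns quoted in this section. The helper name source_for and the example table name are hypothetical; only the two patterns come from this page.

```python
from fnmatch import fnmatchcase

# Wildcard patterns quoted in this document; the labels are illustrative.
RAW_PATTERNS = {
    "MIT Learn (app Postgres)": "raw__mitlearn__app__postgres__*",
    "MITx Online (Open edX MySQL)": "raw__mitxonline__openedx__mysql__*",
}


def source_for(table_name):
    """Return the label of the first pattern the table name matches, else None."""
    for label, pattern in RAW_PATTERNS.items():
        if fnmatchcase(table_name, pattern):
            return label
    return None


# Hypothetical table name, used only to exercise the helper.
label = source_for("raw__mitlearn__app__postgres__example_table")
```

A name that matches neither pattern returns None, which makes unmapped raw tables easy to spot.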

dbt transform raw to dimensional marts (asynchronous)

After staging, the lakehouse location runs the dbt project; dbt executes its model DAG in Trino, reading raw and writing staging → intermediate → dimensional/marts in the Iceberg lake. OpenMetadata lineage confirms dim_course_run integrates raw__{micromasters,mitxonline,xpro}__app__postgres, raw__mitx__openedx__mysql, raw__edxorg__s3 and raw__irx__edxorg__bigquery.
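The brace shorthand in the lineage list above stands for several raw sources at once. A minimal sketch of expanding it into explicit prefixes follows; expand_braces is a hypothetical helper written for this example (it does not handle nested braces) and is not part of dbt or OpenMetadata.

```python
import re


def expand_braces(pattern):
    """Expand each {a,b,c} group in pattern into one string per option."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that any later brace group in the tail is expanded too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


# Upstreams of dim_course_run exactly as written in the paragraph above.
UPSTREAMS = [
    "raw__{micromasters,mitxonline,xpro}__app__postgres",
    "raw__mitx__openedx__mysql",
    "raw__edxorg__s3",
    "raw__irx__edxorg__bigquery",
]

sources = [name for pattern in UPSTREAMS for name in expand_braces(pattern)]
```

Under this reading, dim_course_run draws on six distinct raw sources: three app Postgres databases plus the MITx Open edX MySQL, edX.org S3, and IRX BigQuery sources.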

Content webhook back into MIT Learn (asynchronous)

Today the platform's reverse path is HMAC-signed content webhooks (not Hightouch — Hightouch lives on the mit-learn side). After an Open edX course export, the openedx location POSTs to MIT Learn's /api/v1/webhooks/content_files/ to trigger ingest of the new content files.
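To make "HMAC-signed" concrete, here is a minimal sketch of signing and verifying a webhook body with a shared secret. The hash algorithm (SHA-256), the hex encoding, the payload shape, and the secret are all assumptions for illustration; this page does not state how the openedx location or MIT Learn actually compute or transmit the signature.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC of the raw request body (SHA-256 is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the HMAC on the receiving side and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


# Hypothetical secret and payload; the real webhook fields are not shown here.
secret = b"example-shared-secret"
body = json.dumps({"course_id": "example-course"}, sort_keys=True).encode("utf-8")
signature = sign(secret, body)
```

The sender would attach the signature to the POST to /api/v1/webhooks/content_files/, and the receiver would call verify on the raw bytes it received before acting on them. Signing the exact bytes sent, rather than a re-serialized copy, avoids mismatches caused by key ordering or whitespace.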

OpenMetadata catalog & lineage harvest (asynchronous)

The data_platform location runs the OpenMetadata metadata workflow against Trino, while the lakehouse location uploads dbt docs artifacts to S3 that OpenMetadata ingests — together producing the cross-system table/column lineage and PII classification seen in the catalog.

TARGET: consolidate app ETL onto the platform

Goal state: heavy ETL currently embedded in each SOA app's Celery workers (MITx Online edX-sync & certificate generation; MIT Learn edx_content catalog/resource ingest) is relocated onto Dagster on this platform, leaving the apps to serve requests. These edges are tagged target and are not yet implemented.

Ingestion sources (ETL)

Every external source the edx_content / default Celery workers pull from, with transport and cadence. ⚠️ marks brittle linkages (HTML/token scrapes, hardcoded URLs).

| Source | Transport | Cadence | Data | Source of truth |
|---|---|---|---|---|
| Bootcamps (app Postgres) | Airbyte Postgres | daily | raw__bootcamps__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Canvas | Canvas API common-cartridge export → S3 | every 6h | course exports | dg_projects/canvas |
| Emeritus / IRX BigQuery | Airbyte BigQuery | scheduled | raw__irx__edxorg__bigquery__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Forums (mitx/mitxonline/xpro) | Airbyte / mongodump | daily | forum discussion data | dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py |
| ⚠️ MIT Climate / MITPE / OLL / podcasts | dlt (REST/CSV/RSS) | scheduled | catalog + feed content | dg_projects/data_loading/data_loading/defs |
| MIT Learn (app Postgres) | Airbyte Postgres | every 24h | raw__mitlearn__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| MITx Online (Open edX MySQL) | Airbyte MySQL | every 12h | raw__mitxonline__openedx__mysql__* | dg_projects/lakehouse/lakehouse/definitions.py |
| MITx Online (app Postgres) | Airbyte Postgres | every 6h | raw__mitxonline__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| Mailgun | Airbyte REST | every 6h | email events | dg_projects/lakehouse/lakehouse/definitions.py |
| MicroMasters (app Postgres) | Airbyte Postgres | daily | raw__micromasters__app__postgres__* | dg_projects/lakehouse/lakehouse/definitions.py |
| OCW Studio (app Postgres) | Airbyte Postgres | daily | raw__ocw__studio__* | dg_projects/lakehouse/lakehouse/definitions.py |
| ODL Video Service | REST API (Dagster asset) | every 10m | video + transcript metadata | dg_projects/learning_resources/learning_resources/assets/ovs_videos.py |
| Salesforce | Airbyte REST | every 24h | CRM objects | dg_projects/lakehouse/lakehouse/definitions.py |
| Sloan Executive Ed | OAuth REST API (Dagster asset) | daily | courses | dg_projects/learning_resources |
| Tracking logs (S3) | Airbyte S3 + DuckDB normalize | scheduled | Open edX tracking events | dg_projects/openedx/openedx/assets/openedx.py |
| edX.org archives + tracking logs | GCS/S3 → dlt | scheduled | raw__edxorg__s3__* | dg_projects/data_loading/data_loading/defs |
| xPRO (app Postgres + Open edX MySQL) | Airbyte | every 12h | raw__xpro__{app__postgres,openedx__mysql}__* | dg_projects/lakehouse/lakehouse/definitions.py |
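
The Cadence values in the table mix three spellings ("every 6h", "every 10m", "daily") with the unspecific "scheduled". A small sketch of normalizing them to intervals for comparison follows; cadence_to_interval is a hypothetical helper, and treating "daily" as exactly 24 hours is an assumption.

```python
import re
from datetime import timedelta

# Unit suffixes that appear in the table, mapped to timedelta keyword names.
_UNITS = {"m": "minutes", "h": "hours"}


def cadence_to_interval(cadence):
    """Map a Cadence cell to a timedelta, or None when no interval is stated."""
    if cadence == "daily":
        return timedelta(days=1)  # assumption: "daily" means every 24 hours
    match = re.fullmatch(r"every (\d+)([mh])", cadence)
    if match:
        amount, unit = int(match.group(1)), _UNITS[match.group(2)]
        return timedelta(**{unit: amount})
    return None  # e.g. "scheduled": the table gives no interval


fastest = min(
    (cadence_to_interval(c) for c in ["every 6h", "every 10m", "daily"]),
)
```

Rows marked "scheduled" map to None because the table does not say how often they run; their actual schedules would have to be read from the files listed under Source of truth.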