Containers — OL Data Platform
Generated 2026-06-24 13:33 UTC · c4gen dev
The runtime/deployable units inside OL Data Platform and how data moves between them and adjacent systems.
Containers
| Container | Technology | Responsibility |
|---|---|---|
| Dagster Webserver | Dagster 1.13.9 (dagster-webserver, K8s) | Serves the Dagster UI/GraphQL and loads each code location's gRPC server. Entry point for operators and the launch surface for jobs. |
| Dagster Daemon | Dagster daemon (DagsterDaemonScheduler + QueuedRunCoordinator) | Runs schedules, sensors, and the run queue. Backed by pooled Postgres run/event/schedule storage to avoid connection exhaustion. |
| Code Location — lakehouse | Dagster code location (dg project) | The hub. Builds one sync_and_stage job+schedule per Airbyte connection, runs the full dbt project, pushes Superset datasets, and runs nightly Iceberg maintenance. |
| Code Locations — openedx / edxorg / legacy_openedx | Dagster code locations (dg projects) | Ingest Open edX course archives, tracking logs, and forum mongodumps from S3/GCS/MySQL; normalize tracking logs (DuckDB); POST content webhooks to MIT Learn after course exports. |
| Code Locations — canvas / learning_resources / b2b_organization / student_risk_probability | Dagster code locations (dg projects) | Domain pipelines: Canvas course exports, Sloan Exec-Ed + ODL Video API assets, per-org B2B CSV exports to S3, and a logistic-regression student cheating-risk scoring asset. |
| Code Location — data_platform | Dagster code location (dg project) | Platform-level: Slack run-failure alerting and the OpenMetadata ingestion asset (Trino metadata harvest workflow). |
| Airbyte | Airbyte OSS (api-airbyte.odl.mit.edu) | Connector platform extracting SOA app Postgres/MySQL, forum, and SaaS sources into the raw Iceberg layer. Connections are orchestrated as Dagster assets from the lakehouse location. |
| dlt extractors (data_loading) | dlt → DuckDB → Iceberg/Glue | Custom dlt (data load tool) sources that load edX.org S3 TSVs, MIT Climate, MIT Professional Ed, Open Learning Library, and podcast RSS into the raw layer. |
| dbt (ol_dbt) | dbt-trino (profile open_learning) | Transforms raw → staging → intermediate → marts/dimensional/reporting, executed by Trino. Run from Dagster via DbtCliResource; emits docs artifacts for OpenMetadata. |
| Trino / Starburst Galaxy | Starburst Galaxy (Trino) — catalog ol_data_lake_production | Query/compute engine over the Iceberg lake. dbt and Superset run their SQL here; host mitol-ol-data-lake-production.trino.galaxy.starburst.io. |
| S3 / Iceberg Data Lake | AWS S3 + Apache Iceberg (Glue catalogs) | The lake: landing zone (ol-data-lake-landing-zone), raw Iceberg (ol_warehouse_*_raw), then staging/intermediate/dimensional/mart layers. |
| Dagster Postgres | PostgreSQL (pooled storage classes) | Dagster run, event-log, and schedule storage. Uses PooledPostgresRunStorage to cap connections. |
| Apache Superset | Superset (Trino + StarRocks datasources) | BI layer. Dagster syncs dbt mart/reporting/dimensional models as Superset datasets; analysts build dashboards with row-level-security policies. |
| OpenMetadata | OpenMetadata (catalog + lineage) | Data catalog. Harvests Trino metadata directly and ingests dbt artifacts from S3 to build column- and table-level lineage and to assign PII tags. |
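
The data movement described in the table can be read as a small directed graph. The sketch below is a minimal illustration, not output of c4gen: the `FLOWS` edge list and the `downstream` helper are hypothetical names introduced here, and the edges are a simplified reading of the Responsibility column (for example, dbt is modeled as issuing SQL to Trino, which writes transformed layers back to the lake).

```python
from collections import deque

# Hypothetical, simplified edge list: "key feeds each value".
# Edges are read off the Responsibility column above, not generated by tooling.
FLOWS: dict[str, list[str]] = {
    "Airbyte": ["S3 / Iceberg Data Lake"],
    "dlt extractors (data_loading)": ["S3 / Iceberg Data Lake"],
    "S3 / Iceberg Data Lake": ["Trino / Starburst Galaxy"],
    # Trino writes dbt-built layers back to the lake and serves BI and catalog reads.
    "Trino / Starburst Galaxy": [
        "S3 / Iceberg Data Lake",
        "Apache Superset",
        "OpenMetadata",
    ],
    # dbt runs its SQL on Trino and emits docs artifacts for OpenMetadata.
    "dbt (ol_dbt)": ["Trino / Starburst Galaxy", "OpenMetadata"],
}


def downstream(start: str, flows: dict[str, list[str]] = FLOWS) -> list[str]:
    """Return every container reachable from `start`, in breadth-first order."""
    seen = {start}          # guards against the lake <-> Trino cycle
    order: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in flows.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Which containers can be affected by a change in an Airbyte connection?
    for name in downstream("Airbyte"):
        print(name)
```

A traversal like this is one way to answer impact questions such as "what sits downstream of an Airbyte connection"; under the assumed edges, the answer is the lake, Trino, Superset, and OpenMetadata.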