Containers — OL Data Platform

Generated 2026-06-24 13:33 UTC · c4gen dev

This page lists the runtime/deployable units inside the OL Data Platform and describes how data moves between them and adjacent systems.

Containers

| Container | Technology | Responsibility |
| --- | --- | --- |
| Dagster Webserver | Dagster 1.13.9 (dagster-webserver, K8s) | Serves the Dagster UI/GraphQL and loads each code location's gRPC server. Entry point for operators and the launch surface for jobs. |
| Dagster Daemon | Dagster daemon (DagsterDaemonScheduler + QueuedRunCoordinator) | Runs schedules, sensors, and the run queue. Backed by pooled Postgres run/event/schedule storage to avoid connection exhaustion. |
| Code Location — lakehouse | Dagster code location (dg project) | The hub. Builds one sync_and_stage job+schedule per Airbyte connection, runs the full dbt project, pushes Superset datasets, and runs nightly Iceberg maintenance. |
| Code Locations — openedx / edxorg / legacy_openedx | Dagster code locations (dg projects) | Ingest Open edX course archives, tracking logs, and forum mongodumps from S3/GCS/MySQL; normalize tracking logs (DuckDB); POST content webhooks to MIT Learn after course exports. |
| Code Locations — canvas / learning_resources / b2b_organization / student_risk_probability | Dagster code locations (dg projects) | Domain pipelines: Canvas course exports, Sloan Exec-Ed + ODL Video API assets, per-org B2B CSV exports to S3, and a logistic-regression student cheating-risk scoring asset. |
| Code Location — data_platform | Dagster code location (dg project) | Platform-level: Slack run-failure alerting and the OpenMetadata ingestion asset (Trino metadata harvest workflow). |
| Airbyte | Airbyte OSS (api-airbyte.odl.mit.edu) | Connector platform extracting SOA app Postgres/MySQL, forum, and SaaS sources into the raw Iceberg layer. Connections are orchestrated as Dagster assets from the lakehouse location. |
| dlt extractors (data_loading) | dlt → DuckDB → Iceberg/Glue | Custom data-load-tool sources for edX.org S3 TSVs, MIT Climate, MIT Professional Ed, Open Learning Library, and podcast RSS into raw. |
| dbt (ol_dbt) | dbt-trino (profile open_learning) | Transforms raw → staging → intermediate → marts/dimensional/reporting, executed by Trino. Run from Dagster via DbtCliResource; emits docs artifacts for OpenMetadata. |
| Trino / Starburst Galaxy | Starburst Galaxy (Trino) — catalog ol_data_lake_production | Query/compute engine over the Iceberg lake. dbt and Superset run their SQL here; host mitol-ol-data-lake-production.trino.galaxy.starburst.io. |
| S3 / Iceberg Data Lake | AWS S3 + Apache Iceberg (Glue catalogs) | The lake: landing zone (ol-data-lake-landing-zone), raw Iceberg (ol_warehouse_*_raw), then staging/intermediate/dimensional/mart layers. |
| Dagster Postgres | PostgreSQL (pooled storage classes) | Dagster run, event-log, and schedule storage. Uses PooledPostgresRunStorage to cap connections. |
| Apache Superset | Superset (Trino + StarRocks datasources) | BI layer. Dagster syncs dbt mart/reporting/dimensional models as Superset datasets; analysts build dashboards with row-level-security policies. |
| OpenMetadata | OpenMetadata (catalog + lineage) | Data catalog. Harvests Trino metadata directly and ingests dbt artifacts from S3 to build column- and table-level lineage and PII tags. |
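
The data movement implied by the Responsibility column can be sketched as a small dependency graph. The example below is illustrative only and is not part of the platform's code: the node names are shortened labels for the containers above, the edges are a simplified reading of the table (for example, Trino is folded into the dbt node), and `UPSTREAM`, `flow_order`, and `downstream_of` are hypothetical names introduced for this sketch. It uses only the Python standard library.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified model: each key is a container from the table and
# its value is the set of containers it reads from. Edges are inferred from
# the Responsibility column, not taken from any real configuration.
UPSTREAM = {
    "Airbyte": set(),
    "dlt extractors": set(),
    "raw Iceberg layer": {"Airbyte", "dlt extractors"},
    "dbt (ol_dbt)": {"raw Iceberg layer"},
    "Apache Superset": {"dbt (ol_dbt)"},
    "OpenMetadata": {"dbt (ol_dbt)"},
}


def flow_order(upstream):
    """Return containers ordered so every producer precedes its consumers."""
    return list(TopologicalSorter(upstream).static_order())


def downstream_of(upstream, source):
    """Return every container that directly or transitively reads from `source`."""
    affected = set()
    frontier = [source]
    while frontier:
        current = frontier.pop()
        for node, deps in upstream.items():
            if current in deps and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected


if __name__ == "__main__":
    # Print one valid producer-before-consumer ordering of the containers.
    print(" -> ".join(flow_order(UPSTREAM)))
    # Print the containers that would be affected by a stalled Airbyte sync.
    print(sorted(downstream_of(UPSTREAM, "Airbyte")))
```

A graph like this is mainly useful for impact questions, such as which containers sit downstream of a failed extractor. The actual orchestration order is defined by the Dagster code locations, not by this model.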