Data Flows — OCW Studio

Generated 2026-06-24 15:49 UTC · c4gen dev

Each scenario below replays one interaction as a C4 Dynamic diagram. Amber steps are asynchronous (queued / scheduled / event-driven).

How to read these diagrams

These are C4 model diagrams (C4-PlantUML). Read them top-down: System Context (the whole SOA) → Container (one system's runtime units) → Dynamic (a single data flow, step by step).

  • People are rounded boxes; systems and containers are rectangles; databases and queues have distinct shapes.
  • Each arrow is a data flow labelled with what moves.
  • Solid arrows are synchronous (request/response, caller blocks).
  • Amber dashed arrows are asynchronous (queued, scheduled, or event-driven — caller does not block).
  • Drag to pan, scroll to zoom. Boxes with a link drill into the next level.

Author edits and saves content (synchronous save + asynchronous content sync)

The author edits in the SPA, which calls the DRF API through Nginx. Django persists to Postgres and (for media) S3, then enqueues a content-sync task; the publish-queue worker commits the content to the site's GitHub repo.
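The split between the blocking save and the deferred sync can be sketched as follows. This is a minimal illustration only: `FakeStore`, `save_content`, `run_sync_worker`, and the in-process `queue.Queue` are hypothetical stand-ins for Postgres, the DRF view, the worker, and the real task queue, not names taken from the OCW Studio codebase.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Hypothetical stand-in for Postgres (rows) and the site's GitHub repo (commits)."""
    rows: dict = field(default_factory=dict)
    commits: list = field(default_factory=list)


# Hypothetical stand-in for the queue that carries content-sync tasks.
sync_queue = queue.Queue()


def save_content(store, content_id, body):
    """Synchronous half: persist the content, enqueue a sync task, return at once."""
    store.rows[content_id] = body
    sync_queue.put(content_id)  # the caller does not wait for the GitHub commit
    return "saved"


def run_sync_worker(store):
    """Asynchronous half: drain queued tasks and commit each item to the repo."""
    processed = 0
    while not sync_queue.empty():
        content_id = sync_queue.get()
        store.commits.append((content_id, store.rows[content_id]))
        processed += 1
    return processed
```

In this sketch the author's request returns as soon as `save_content` finishes; the commit only appears once `run_sync_worker` runs, which mirrors the amber (asynchronous) step in the diagram.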

Publish a site via Concourse + Hugo (asynchronous build)

A publish request enqueues publish_website_backend_{draft,live}. The worker syncs content to GitHub, then upserts/triggers the Concourse site pipeline. Concourse runs Hugo (pulling themes, manifest, and S3 resources), uploads the built site to the preview/publish buckets, purges Fastly, and POSTs status back to the pipeline_status webhook.
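As a rough sketch of the flow above, the task name can be derived from the publish version and the stages listed in order. Only the `publish_website_backend_{draft,live}` naming and the `pipeline_status` webhook come from the description; the helper names, the step strings, and the assumption that draft builds go to the preview bucket while live builds go to the publish bucket are illustrative.

```python
PUBLISH_VERSIONS = ("draft", "live")

# Ordered stages as described in the text; the wording of each step is illustrative.
PIPELINE_STEPS = (
    "sync content to GitHub",
    "upsert and trigger the Concourse site pipeline",
    "run Hugo (themes, manifest, S3 resources)",
    "upload built site to bucket",
    "purge Fastly",
    "POST status to the pipeline_status webhook",
)


def publish_task_name(version):
    """Return the queued task name for a publish version ('draft' or 'live')."""
    if version not in PUBLISH_VERSIONS:
        raise ValueError(f"unknown publish version: {version!r}")
    return f"publish_website_backend_{version}"


def target_bucket(version):
    """Assumption: draft builds land in the preview bucket, live builds in publish."""
    if version not in PUBLISH_VERSIONS:
        raise ValueError(f"unknown publish version: {version!r}")
    return "preview" if version == "draft" else "publish"
```

For example, `publish_task_name("live")` yields `publish_website_backend_live`, and an unrecognized version raises `ValueError` rather than enqueueing an unknown task.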

Live publish notifies MIT Learn (cross-service, asynchronous)

On a successful live build the Concourse pipeline POSTs an Open Catalog webhook (site prefix + version + key) to MIT Learn, which then runs its learning_resources ETL to ingest the OCW site JSON from the published S3 bucket into its catalog/search index.
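The webhook call described above might look roughly like the request built below. The JSON field names (`prefix`, `version`, `webhook_key`), the helper name, and the example URL are assumptions for illustration; the source only states that the payload carries a site prefix, a version, and a key. The request is constructed but not sent.

```python
import json
import urllib.request


def build_catalog_webhook(url, site_prefix, version, key):
    """Build (but do not send) a POST request carrying the three payload fields."""
    body = json.dumps(
        {
            "prefix": site_prefix,   # hypothetical field name
            "version": version,      # hypothetical field name
            "webhook_key": key,      # hypothetical field name
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would pass the result to `urllib.request.urlopen`; because the flow is asynchronous, the pipeline does not depend on MIT Learn finishing its ETL before the webhook returns.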

Google Drive import & video pipeline (asynchronous)

An on-demand import walks the site's Drive folders, streams files into the S3 storage bucket, creates WebsiteContent resources, and, for videos, submits an AWS MediaConvert job. When the job completes, MediaConvert calls back via webhook; the video is then uploaded to YouTube and transcripts are ordered from 3Play.
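The per-file branching in this flow can be sketched as a small planner. The folder names `files_final` and `videos_final` appear in the ingestion table on this page; the `DriveFile` type, the action labels, and the choice to skip files outside those two folders are assumptions made for the example.

```python
from dataclasses import dataclass

FILE_FOLDER = "files_final"
VIDEO_FOLDER = "videos_final"


@dataclass
class DriveFile:
    """Hypothetical minimal view of a file found while walking the Drive folders."""
    name: str
    folder: str


def plan_import(files):
    """Return the ordered (action, file name) steps an import would take."""
    plan = []
    for f in files:
        if f.folder not in (FILE_FOLDER, VIDEO_FOLDER):
            continue  # assumption: other folders are ignored
        plan.append(("upload_to_s3", f.name))
        plan.append(("create_resource", f.name))
        if f.folder == VIDEO_FOLDER:
            # Videos additionally get a transcode job (MediaConvert in the text).
            plan.append(("submit_transcode_job", f.name))
    return plan
```

The YouTube upload and the 3Play transcript order are left out of the plan on purpose: in the described flow they happen later, after the transcode job's completion callback arrives.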

Ingestion sources (ETL)

Every external source the edx_content / default Celery workers pull from, with transport and cadence. ⚠️ marks brittle linkages (HTML/token scrapes, hardcoded URLs).

| Source | Transport | Cadence | Data | Source of truth |
|---|---|---|---|---|
| 3Play Media transcripts | REST + callback webhook | on order + beat refresh (hourly) | PDF + WebVTT transcripts | videos/threeplay_api.py |
| Google Drive (media import) | REST (Drive API v3, service account) | on-demand per site | course media (files_final / videos_final) | gdrive_sync/api.py |
| S3 import bucket sync | S3 (aws s3 sync) | every ~24h (AWS_S3_SYNC_INTERVAL) | imported content into storage bucket | content_sync/pipelines/definitions/concourse/s3_bucket_sync_pipeline.py |
| ocw-hugo-themes / ocw-hugo-projects | Git (build-time) | per build | Hugo themes + per-starter config | websites/constants.py |