Data Flows — OCW Studio
Generated 2026-06-24 15:49 UTC · c4gen dev
Each scenario below replays one interaction as a C4 Dynamic diagram. Amber steps are asynchronous (queued / scheduled / event-driven).
How to read these diagrams
These are C4 model diagrams (C4-PlantUML). Read them top-down: System Context (the whole SOA) → Container (one system's runtime units) → Dynamic (a single data flow, step by step).
- People are rounded boxes; systems and containers are rectangles; databases and queues have distinct shapes.
- Each arrow is a data flow labelled with what moves.
- Solid arrows are synchronous (request/response, caller blocks).
- Amber dashed arrows are asynchronous (queued, scheduled, or event-driven — caller does not block).
Author edits and saves content (synchronous save + asynchronous content sync)
The author edits in the SPA, which calls the DRF API through Nginx. Django persists to Postgres and (for media) S3, then enqueues a content-sync task; the publish-queue worker commits the content to the site's GitHub repo.
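The split between the blocking save and the queued sync can be sketched with the standard library alone. This is an illustrative model, not the project's code: the `database` dict, the `github_commits` list, the in-process `publish_queue`, and the function names are all hypothetical stand-ins for Postgres, the site's GitHub repo, the publish queue, and the real task.

```python
import queue
import threading

# Hypothetical stand-ins: a dict for Postgres, a list for the GitHub repo,
# and an in-process queue for the publish queue.
database = {}
github_commits = []
publish_queue = queue.Queue()


def save_content(site, path, body):
    """Synchronous part: persist, enqueue a sync task, return immediately."""
    database[(site, path)] = body
    publish_queue.put((site, path))  # the caller does not wait for the commit
    return {"status": "saved", "path": path}


def sync_worker():
    """Asynchronous part: drain the queue and commit content to the repo."""
    while True:
        task = publish_queue.get()
        if task is None:  # sentinel: stop the worker
            publish_queue.task_done()
            break
        site, path = task
        github_commits.append((site, path, database[(site, path)]))
        publish_queue.task_done()


worker = threading.Thread(target=sync_worker, daemon=True)
worker.start()
response = save_content("course-1", "pages/syllabus.md", "# Syllabus")
publish_queue.join()  # only this demo waits; the response was already built
publish_queue.put(None)
worker.join()
```

The point of the sketch is that `response` is produced before the worker has committed anything; the commit to the repo happens later and independently of the request.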
Publish a site via Concourse + Hugo (asynchronous build)
A publish request enqueues publish_website_backend_{draft,live}. The worker syncs content to GitHub, then upserts/triggers the Concourse site pipeline. Concourse runs Hugo (pulling themes, manifest, and S3 resources), uploads the built site to the preview/publish buckets, purges Fastly, and POSTs status back to the pipeline_status webhook.
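As a rough sketch, the ordering of the publish flow above can be written out as data. The task-name pattern comes from the text; the mapping of draft to the preview bucket and live to the publish bucket is an assumption, and `publish_task_name`, `target_bucket`, and `publish_plan` are hypothetical helper names.

```python
VERSIONS = ("draft", "live")


def publish_task_name(version):
    """Expand publish_website_backend_{draft,live} for one version."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    return f"publish_website_backend_{version}"


def target_bucket(version):
    """Assumed mapping: draft builds go to preview, live builds to publish."""
    return "preview" if version == "draft" else "publish"


def publish_plan(site, version):
    """Return the (actor, step) pairs in the order described in the text."""
    bucket = target_bucket(version)
    return [
        ("api", f"enqueue {publish_task_name(version)}"),
        ("worker", f"sync {site} content to GitHub"),
        ("worker", "upsert and trigger the Concourse site pipeline"),
        ("concourse", "run Hugo with themes, manifest, and S3 resources"),
        ("concourse", f"upload built site to the {bucket} bucket"),
        ("concourse", "purge Fastly"),
        ("concourse", "POST status to the pipeline_status webhook"),
    ]


plan = publish_plan("course-1", "live")
```

Only the first step runs inside the request; everything from the worker onward is asynchronous.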
Live publish notifies MIT Learn (cross-service, asynchronous)
On a successful live build the Concourse pipeline POSTs an Open Catalog webhook (site prefix + version + key) to MIT Learn, which then runs its learning_resources ETL to ingest the OCW site JSON from the published S3 bucket into its catalog/search index.
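A minimal sketch of the notification step, assuming a JSON body: the text says the webhook carries the site prefix, version, and key, but the field names `prefix`, `version`, and `webhook_key` and the function `build_catalog_webhook` are hypothetical and not taken from either service.

```python
import json


def build_catalog_webhook(site_prefix, version, build_succeeded, webhook_key):
    """Return the JSON body to POST, or None when no notification is due.

    Only a successful live build notifies the catalog; draft builds and
    failed builds produce no webhook.
    """
    if version != "live" or not build_succeeded:
        return None
    payload = {
        "prefix": site_prefix,       # assumed field name
        "version": version,          # assumed field name
        "webhook_key": webhook_key,  # assumed field name
    }
    return json.dumps(payload, sort_keys=True)


body = build_catalog_webhook("courses/example-course", "live", True, "secret")
```

The receiver would then use the prefix to locate the site JSON in the published S3 bucket before running its ETL.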
Google Drive import & video pipeline (asynchronous)
An on-demand import walks the site's Drive folders, streams files into the S3 storage bucket, creates WebsiteContent resources, and, for videos, submits an AWS MediaConvert job. When the job completes, MediaConvert calls back via webhook; the video is then uploaded to YouTube and transcripts are ordered from 3Play.
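The branching in this flow can be sketched as follows. The folder names `files_final` and `videos_final` appear in the ingestion table; deciding "is this a video" by folder is an assumption made for illustration, and `DriveFile`, `plan_import`, and `on_transcode_complete` are hypothetical names.

```python
from dataclasses import dataclass

IMPORT_FOLDERS = ("files_final", "videos_final")


@dataclass
class DriveFile:
    name: str
    folder: str


def plan_import(files):
    """List the steps taken for each file during an on-demand import."""
    steps = []
    for f in files:
        if f.folder not in IMPORT_FOLDERS:
            continue  # anything outside the import folders is ignored
        steps.append((f.name, "stream to S3 storage bucket"))
        steps.append((f.name, "create WebsiteContent resource"))
        if f.folder == "videos_final":
            steps.append((f.name, "submit MediaConvert job"))
    return steps


def on_transcode_complete(name):
    """Steps triggered later, when the MediaConvert callback arrives."""
    return [(name, "upload to YouTube"), (name, "order 3Play transcripts")]


steps = plan_import([
    DriveFile("notes.pdf", "files_final"),
    DriveFile("lecture1.mp4", "videos_final"),
    DriveFile("scratch.txt", "drafts"),
])
```

The two functions are deliberately separate: the import returns once the job is submitted, and the YouTube and 3Play steps run only after the callback.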
Ingestion sources (ETL)
Every external source the edx_content / default Celery workers pull from, with transport and cadence. ⚠️ marks brittle linkages (HTML/token scrapes, hardcoded URLs).
| Source | Transport | Cadence | Data | Source of truth |
|---|---|---|---|---|
| 3Play Media transcripts | REST + callback webhook | on order + beat refresh (hourly) | PDF + WebVTT transcripts | videos/threeplay_api.py |
| Google Drive (media import) | REST (Drive API v3, service account) | on-demand per site | course media (files_final / videos_final) | gdrive_sync/api.py |
| S3 import bucket sync | S3 (aws s3 sync) | every ~24h (AWS_S3_SYNC_INTERVAL) | imported content into storage bucket | content_sync/pipelines/definitions/concourse/s3_bucket_sync_pipeline.py |
| ocw-hugo-themes / ocw-hugo-projects | Git (build-time) | per build | Hugo themes + per-starter config | websites/constants.py |