Data Flows — OCW Studio

Generated 2026-06-24 15:49 UTC · c4gen dev

Each scenario below replays one interaction as a C4 Dynamic diagram. Amber steps are asynchronous (queued / scheduled / event-driven).

How to read these diagrams

These are C4 model diagrams (C4-PlantUML). Read them top-down: System Context (the whole SOA) → Container (one system's runtime units) → Dynamic (a single data flow, step by step).

People are rounded boxes; systems and containers are rectangles; databases and queues have distinct shapes.
Each arrow is a data flow labelled with what moves.
Solid arrows are synchronous (request/response, caller blocks).
Amber dashed arrows are asynchronous (queued, scheduled, or event-driven — caller does not block).
Drag to pan, scroll to zoom. Boxes with a link drill into the next level.

Author edits and saves content (synchronous save + asynchronous sync)

The author edits in the SPA, which calls the DRF API through Nginx. Django persists to Postgres and (for media) S3, then enqueues a content-sync task; the publish-queue worker commits the content to the site's GitHub repo.
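The split between the blocking save and the queued sync can be sketched as below. This is a minimal illustration, not the OCW Studio code: the dict standing in for Postgres, the in-process `queue.Queue` standing in for the publish queue, the function names, and the `content/<id>.md` repo layout are all hypothetical.

```python
import queue


def save_content(db: dict, sync_queue: queue.Queue, content_id: str, body: str) -> dict:
    """Synchronous half: persist, enqueue a sync task, then respond."""
    db[content_id] = body        # stand-in for the Postgres write
    sync_queue.put(content_id)   # enqueue only the id; the caller does not wait
    return {"id": content_id, "status": "saved"}


def run_sync_worker_once(db: dict, sync_queue: queue.Queue, repo: dict) -> str:
    """Asynchronous half: take one task and 'commit' the content to the repo."""
    content_id = sync_queue.get_nowait()
    path = f"content/{content_id}.md"  # hypothetical repo layout
    repo[path] = db[content_id]        # re-read the saved row, then write it
    return path
```

The response to the author is returned before the worker runs, so the repo stand-in stays empty until `run_sync_worker_once` is called.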

Publish a site via Concourse + Hugo (asynchronous build)

A publish request enqueues publish_website_backend_draft or publish_website_backend_live, depending on the target version. The worker syncs content to GitHub, then creates or updates the Concourse site pipeline and triggers it. Concourse runs Hugo (pulling themes, manifest, and S3 resources), uploads the built site to the preview/publish buckets, purges Fastly, and POSTs status back to the pipeline_status webhook.
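The ordering of the publish flow can be written down as a small sketch. The two task names come from the text; everything else is an assumption made for illustration, in particular the draft-to-preview and live-to-publish bucket mapping and the step labels, which are not real function or job names.

```python
# Assumed mapping: draft builds go to the preview bucket, live builds to publish.
BUCKET_FOR_VERSION = {"draft": "preview", "live": "publish"}


def publish_task_name(version: str) -> str:
    """Return the queued task name for a publish version."""
    if version not in BUCKET_FOR_VERSION:
        raise ValueError(f"unknown publish version: {version!r}")
    return f"publish_website_backend_{version}"


def pipeline_steps(version: str) -> list[str]:
    """List the flow's steps in order (labels are illustrative only)."""
    bucket = BUCKET_FOR_VERSION[version]
    return [
        "sync_content_to_github",
        "upsert_concourse_pipeline",
        "trigger_concourse_pipeline",
        "hugo_build",
        f"upload_to_{bucket}_bucket",
        "purge_fastly",
        "post_pipeline_status",
    ]
```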

Live publish notifies MIT Learn (cross-service, asynchronous)

On a successful live build the Concourse pipeline POSTs an Open Catalog webhook (site prefix + version + key) to MIT Learn, which then runs its learning_resources ETL to ingest the OCW site JSON from the published S3 bucket into its catalog/search index.
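A rough sketch of the notification step follows. The text only says the webhook carries a site prefix, a version, and a key, so the JSON field names (`prefix`, `version`, `webhook_key`) and both function names here are hypothetical and do not describe the actual MIT Learn contract.

```python
import json


def should_notify(version: str, build_succeeded: bool) -> bool:
    """Only a successful live build triggers the cross-service webhook."""
    return version == "live" and build_succeeded


def build_catalog_webhook_body(prefix: str, version: str, key: str) -> str:
    """Serialize the three values named in the text (field names are assumed)."""
    payload = {"prefix": prefix, "version": version, "webhook_key": key}
    return json.dumps(payload, sort_keys=True)
```

Draft builds and failed live builds produce no webhook, which keeps unpublished content out of the downstream catalog.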

Google Drive import & video pipeline (asynchronous)

An on-demand import walks the site's Drive folders, streams files into the S3 storage bucket, creates WebsiteContent resources, and, for videos, submits an AWS MediaConvert job. When the job completes, MediaConvert calls back via webhook; the video is then uploaded to YouTube and transcripts are ordered from 3Play.
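The branching in this flow can be sketched as a per-file plan. The folder names `files_final` and `videos_final` appear in the ingestion table below; treating the folder name as the video test, and all step labels and function names, are assumptions made for illustration.

```python
IMPORT_FOLDERS = ("files_final", "videos_final")


def plan_import(folder: str) -> list[str]:
    """Steps run during the import itself for one file in a Drive folder."""
    if folder not in IMPORT_FOLDERS:
        raise ValueError(f"not an import folder: {folder!r}")
    steps = ["stream_to_s3", "create_website_content"]
    if folder == "videos_final":  # assumed: videos are identified by folder
        steps.append("submit_mediaconvert_job")
    return steps


def plan_after_transcode() -> list[str]:
    """Steps run later, once the MediaConvert completion webhook arrives."""
    return ["upload_to_youtube", "order_3play_transcript"]
```

Keeping the post-transcode steps in a separate function mirrors the flow: they are not part of the import run, and only start after the callback.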

Ingestion sources (ETL)

Every external source the edx_content / default Celery workers pull from, with transport and cadence. ⚠️ marks brittle linkages (HTML/token scrapes, hardcoded URLs).

Source	Transport	Cadence	Data	Source of truth
3Play Media transcripts	REST + callback webhook	on order + beat refresh (hourly)	PDF + WebVTT transcripts	`videos/threeplay_api.py`
Google Drive (media import)	REST (Drive API v3, service account)	on-demand per site	course media (files_final / videos_final)	`gdrive_sync/api.py`
S3 import bucket sync	S3 (aws s3 sync)	every ~24h (AWS_S3_SYNC_INTERVAL)	imported content into storage bucket	`content_sync/pipelines/definitions/concourse/s3_bucket_sync_pipeline.py`
ocw-hugo-themes / ocw-hugo-projects	Git (build-time)	per build	Hugo themes + per-starter config	`websites/constants.py`