Introduction

This document is meant to be one-stop shopping for your MIT OL DevOps oncall needs.

Please update this doc as you handle incidents whenever you're oncall.

Style Guide

There should be a table of contents at the top of the document with links to each product heading. Your editor likely has a plugin to make this automatic.

Each product gets its own top level heading.

Entries that are keyed to a specific alert should have the relevant text in a second level heading under the product. Boil the alert down to the most relevant searchable text and omit specifics that will vary. For instance:

"[Prometheus]: [FIRING:1] DiskUsageWarning mitx-production (xqwatcher filesystem /dev/root ext4 ip-10-7-0-78 integrations/linux_hos"

would boil down to DiskUsageWarning xqwatcher because the rest will change and make finding the right entry more difficult.

Each entry should have at least two sections, Diagnosis and Mitigation. Use bold face for the section titles. This lets the oncall read only as much of the Diagnosis as they need to identify the issue, then focus on putting out the fire.

Products

ODL Open Discussions

InvalidAccessKeyNonProd qa (odl-open-discussions warning)

Diagnosis

You get an alert like "[Prometheus]: [FIRING:1] InvalidAccessKeyNonProd qa (odl-open-discussions warning)".

Mitigation

In the mitodl/ol-infrastructure Github repository, change directory to src/mit/ol-infrastructure/src/ol_infrastructure/applications/open_discussions and run `pulumi up`.
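A minimal sketch of that sequence, assuming a local checkout at the path above and that you still need to select the stack matching the alert's environment (the stack name below is a placeholder; pulumi stack ls will show the real ones):

cd src/mit/ol-infrastructure/src/ol_infrastructure/applications/open_discussions
pulumi stack ls                       # find the stack for the environment named in the alert (qa here)
pulumi stack select <qa-stack-name>   # placeholder stack name
pulumi up                             # review the preview, then confirm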

SaltStack

MemoryUsageWarning operations-

Diagnosis

You get an alert like: [Prometheus]: [FIRING:1] MemoryUsageWarning operations-qa (memory ip-10-1-3-33 integrations/linux_host warning).

You'll need an account and ssh key set up on the saltstack master hosts. This should happen when you join the team.

Now, ssh into the salt master appropriate to the environment you received the alert for. The IP address is cited in the alert. So, for the above:

(Substitute your username and the appropriate environment if not qa, e.g. production)

ssh -l cpatti salt-qa.odl.mit.edu

Next, check free memory:

mdavidson@ip-10-1-3-33:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.5G        7.2G        120M         79M        237M         66M
Swap:            0B          0B          0B

In this case, the machine only has 120M free, which isn't great.
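If you want to see what's holding the memory before restarting anything, a quick check with standard Linux tooling (not specific to this runbook) is:

ps aux --sort=-%mem | head -n 10    # top memory consumers; if salt-master processes dominate, the mitigation below applies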

Mitigation

We probably need to restart the Salt master service. Use the systemctl command for that:

root@ip-10-1-3-33:~#  systemctl restart salt-master

Now, wait a minute and then check free memory again. There should be significantly more available:

root@ip-10-1-3-33:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.5G        1.9G        5.3G         80M        280M        5.3G
Swap:            0B          0B          0B

If what you see is something like the above, you're good to go. Problem solved (for now!)

XQueueWatcher

DiskUsageWarning xqwatcher

Diagnosis

This happens every few months if the xqueue watcher nodes hang around for that long.

Mitigation

From the salt-production master:

sudo ssh -i /etc/salt/keys/aws/salt-production.pem ubuntu@10.7.0.78
sudo su -

root@ip-10-7-0-78:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        20G   16G  3.9G  81% /           <<<<<<<<<<<<<<<<<<<<<<<<<< offending filesystem
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           1.9G  560K  1.9G   1% /dev/shm
tmpfs           389M  836K  389M   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/loop1       56M   56M     0 100% /snap/core18/2751
/dev/loop2       25M   25M     0 100% /snap/amazon-ssm-agent/6312
/dev/loop0       25M   25M     0 100% /snap/amazon-ssm-agent/6563
/dev/loop3       54M   54M     0 100% /snap/snapd/19361
/dev/loop4       64M   64M     0 100% /snap/core20/1950
/dev/loop6       56M   56M     0 100% /snap/core18/2785
/dev/loop5       54M   54M     0 100% /snap/snapd/19457
/dev/loop7       92M   92M     0 100% /snap/lxd/24061
/dev/loop8       92M   92M     0 100% /snap/lxd/23991
/dev/loop10      64M   64M     0 100% /snap/core20/1974
tmpfs           389M     0  389M   0% /run/user/1000

root@ip-10-7-0-78:~# cd /edx/var           <<<<<<<<<<<<<<<<<<< intuition / memory

root@ip-10-7-0-78:/edx/var# du -h | sort -hr | head
8.8G    .
8.7G    ./log
8.2G    ./log/xqwatcher         <<<<<<<<<<<< Offender
546M    ./log/supervisor
8.0K    ./supervisor
4.0K    ./xqwatcher
4.0K    ./log/aws
4.0K    ./aws
root@ip-10-7-0-78:/edx/var# cd log/xqwatcher/
root@ip-10-7-0-78:/edx/var/log/xqwatcher# ls -tlrha
total 8.2G
drwxr-xr-x 2 www-data xqwatcher 4.0K Mar 11 08:35 .
drwxr-xr-x 5 syslog   syslog    4.0K Jul 14 00:00 ..
-rw-r--r-- 1 www-data www-data  8.2G Jul 14 14:12 xqwatcher.log             <<<<<<<<< big file

root@ip-10-7-0-78:/edx/var/log/xqwatcher# rm xqwatcher.log

root@ip-10-7-0-78:/edx/var/log/xqwatcher# systemctl restart supervisor.service
Job for supervisor.service failed because the control process exited with error code.
See "systemctl status supervisor.service" and "journalctl -xe" for details.
root@ip-10-7-0-78:/edx/var/log/xqwatcher# systemctl restart supervisor.service       <<<<<<<<<<<<  Restart it twice because ???

root@ip-10-7-0-78:/edx/var/log/xqwatcher# systemctl status supervisor.service
● supervisor.service - supervisord - Supervisor process control system
     Loaded: loaded (/etc/systemd/system/supervisor.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-07-14 14:12:51 UTC; 4min 48s ago
       Docs: http://supervisord.org
    Process: 1114385 ExecStart=/edx/app/supervisor/venvs/supervisor/bin/supervisord --configuration /edx/app/supervisor/supervisord.conf (code=exited, status=0/SUCCESS)
   Main PID: 1114387 (supervisord)
      Tasks: 12 (limit: 4656)
     Memory: 485.8M
     CGroup: /system.slice/supervisor.service
             ├─1114387 /edx/app/supervisor/venvs/supervisor/bin/python /edx/app/supervisor/venvs/supervisor/bin/supervisord --configuration /edx/app/supervisor/supervisord.conf
             └─1114388 /edx/app/xqwatcher/venvs/xqwatcher/bin/python -m xqueue_watcher -d /edx/app/xqwatcher

root@ip-10-7-0-78:/edx/var/log/xqwatcher# ls -lthra
total 644K
drwxr-xr-x 5 syslog   syslog    4.0K Jul 14 00:00 ..
drwxr-xr-x 2 www-data xqwatcher 4.0K Jul 14 14:12 .
-rw-r--r-- 1 www-data www-data  636K Jul 14 14:17 xqwatcher.log                <<<<<<<< New file being written to
root@ip-10-7-0-78:/edx/var/log/xqwatcher# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        20G  7.4G   12G  38%                  <<<<<<<<<<< acceptable utilization

OVS

[Prometheus]: [FIRING:1] InvalidAccessKeyProduction apps-production (odl-video-service critical)

Diagnosis

This happens sometimes when the application instance's S3 credentials become out of date.

Mitigation

Use the AWS EC2 web console and navigate to the EC2 -> Auto Scaling Group pane. Search on: odl-video-service-production

Once you have the right ASG, click on the "Instance Refresh" tab and then click the "Start Instance Refresh" button.

Be sure to un-check the "Enable Skip Matching" box, or your instance refresh will most likely not do anything at all.
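If you prefer the command line, something like the following should be equivalent (a sketch, not verified against this ASG; the group name is whatever the console search above surfaces, and SkipMatching is explicitly disabled to mirror the un-check step):

aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName, 'odl-video-service-production')].AutoScalingGroupName"
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <asg-name-from-above> \
  --preferences '{"SkipMatching": false}'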

Request by developer to add videos

Diagnosis

N/A - developer request

Mitigation

Use the AWS EC2 web console and find instances of type odl-video-service-production - detailed instructions for accessing the instance can be found here.

The only difference in this case is that the user is admin rather than ubuntu. Stop when you get a shell prompt and rejoin this document.

First, run:

sudo docker compose ps

This lists the running Compose services. In our case, we're looking for app. This isn't strictly necessary here since we know what we're looking for, but it's good to look before you leap anyway.

You should see something like:

admin@ip-10-13-3-50:/etc/docker/compose$ sudo docker compose ps
NAME                IMAGE                                 COMMAND                  SERVICE             CREATED             STATUS              PORTS
compose-app-1       mitodl/ovs-app:v0.69.0-5-gf76af37     "/bin/bash -c ' slee…"   app                 3 weeks ago         Up 3 weeks          0.0.0.0:8087->8087/tcp, :::8087->8087/tcp, 8089/tcp
compose-celery-1    mitodl/ovs-app:v0.69.0-5-gf76af37     "/bin/bash -c ' slee…"   celery              3 weeks ago         Up 3 weeks          8089/tcp
compose-nginx-1     pennlabs/shibboleth-sp-nginx:latest   "/usr/bin/supervisor…"   nginx               3 weeks ago         Up 3 weeks          0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp

Now run:

sudo docker compose exec -it app /bin/bash

which should get you a new, less colorful shell prompt.

At this point you can run the manage.py command the developer gave you in Slack. In my case, this is what I ran and the output I got:

mitodl@486c7fbba98b:/src$ python ./manage.py add_hls_video_to_edx --edx-course-id course-v1:xPRO+DECA_Boeing+SPOC_R0
Attempting to post video(s) to edX...
Video successfully added to edX – VideoFile: CCADE_V11JW_Hybrid_Data_Formats_v1.mp4 (105434), edX url: https://courses.xpro.mit.edu/api/val/v0/videos/

You're all set!

Bootcamp Ecommerce

[Prometheus]: [FIRING:1] AlternateInvalidAccessKeyProduction production (bootcamp-ecommerce critical)

Diagnosis

N/A

Mitigation

You need to refresh the credentials the salt-proxy is using for Heroku to manage this app.

  • ssh to the salt production server: ssh salt-production.odl.mit.edu
  • Run the salt proxy command to refresh creds: salt proxy-bootcamps-production state.sls heroku.update_heroku_config. You should see output similar to the following:
cpatti@ip-10-0-2-195:~$ sudo salt proxy-bootcamps-production state.sls heroku.update_heroku_config
proxy-bootcamps-production:
----------
          ID: update_heroku_bootcamp-ecommerce_config
    Function: heroku.update_app_config_vars
        Name: bootcamp-ecommerce
      Result: True
     Comment:
     Started: 14:43:58.916128
    Duration: 448.928 ms
     Changes:
              ----------
              new:
                  ----------

** 8< snip 8< secret squirrel content elided **

Summary for proxy-bootcamps-production
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1
Total run time: 448.928 ms
cpatti@ip-10-0-2-195:~$

OpenEdX Residential MITx

Task handler raised error: "OperationalError(1045, "Access denied for user 'v-edxa-fmT0KbL5X'@'10.7.0.237' (using password: YES)

Diagnosis

If the oncall receives this page, the instances' credentials for accessing Vault and the secrets it contains have lapsed.

Mitigation

Fixing this issue currently requires an instance refresh, as the newly launched instances will have all the necessary credentials.

From the EC2 console, on the left-hand side, click "Auto Scaling Groups", then type 'edxapp-web-mitx-' followed by the environment into the search box, e.g. 'edxapp-web-mitx-production'. This should yield one result with something like 'edxapp-web-autoscaling-group-XXXX' in the 'Name' column. Click that.

Now click the "Instance Refresh" tab.

Click "Start instance refresh".

Be sure to un-check the "Enable Skip Matching" box, or your instance refresh will most likely not do anything at all.

Monitor the instance refresh to ensure it completes successfully. If you have been receiving multiple similar pages, they should stop coming in. If they continue, please escalate this incident as this problem is user visible and thus high impact to customers.
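If you'd rather start and watch the refresh from a terminal, a hedged AWS CLI sketch (the group name is whatever the console search above surfaced):

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <edxapp-web-autoscaling-group-XXXX> \
  --preferences '{"SkipMatching": false}'
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <edxapp-web-autoscaling-group-XXXX> \
  --max-records 1    # Status should move from Pending to InProgress to Successful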

XPro

ApiException hubspot_xpro.tasks.sync_contact_with_hubspot

Diagnosis

This error is thrown when the Hubspot API key has expired.

You'll see an error similar to this one in Sentry.

Mitigation

The fix for this is to generate a new API key in Hubspot and then get that key into Vault, triggering the appropriate pipeline deployment afterwards.

First, generate a new API key in Hubspot. You can do this by logging into Hubspot using the username/password and TOTP token found in [Vault](https://vault-production.odl.mit.edu/ui/vault/secrets/platform-secrets/kv/hubspot/details?version=1).

Once you're logged in, click "Open" next to "MIT XPro" in the Accounts list.

Then, click on the gear icon in the upper right corner of the page and select "Integrations" -> "Private Apps" in the sidebar on the left.

You should then see the XPro private app and beneath that a link for "View Access Token". Click that, then click on the "Manage Token" link.

On this screen, you should see a "Rotate" button; click it to generate a new API key.

Now that you've generated your new API token, you'll need to get that token into Vault using SOPS. You can find the right secrets file for this in Github here.
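A rough sketch of the SOPS step, assuming you have the repository checked out and the decryption keys configured (the file path is a placeholder for the secrets file linked above):

sops path/to/xpro/secrets.yaml    # placeholder path; sops decrypts into $EDITOR, you add the new Hubspot token, and it re-encrypts on save
git add path/to/xpro/secrets.yaml
git commit -m "Rotate XPro Hubspot API token"
git push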

The process for deploying secrets deserves its own document; for now, after adding the new API token to the SOPS-decrypted secrets file, commit it to Github and ensure it runs through the appropriate pipelines and ends up in Vault.

You can find the ultimate home of the XPro Hubspot API key in Vault here.

Once the new API token is in the correct spot, you'll need to ensure that new token gets deployed to production in Heroku by tracking its progress in this pipeline.

You will likely need to close Concourse Github workflow issues to make this happen. See its user guide for details.

Once that's complete, you should have mitigated this issue. Keep checking that Sentry page to ensure that the Last Seen value reflects something appropriately long ago and you can resolve this ticket.

If you are asked to run a sync to Hubspot, inform the requester, preferably in the #product-xpro Slack channel, that the process will take quite a long time. If it's time critical, they may ask you to run only parts of the sync. You can find documentation on the command you'll run here.

Since XPro runs on Heroku, you'll need a Heroku console shell to run the management command. You can get to that shell by logging into Heroku with the Heroku CLI and running:

heroku run /bin/bash -a xpro-production

It takes a while but you will eventually get your shell prompt.

From there, run the following command to sync all variants:

./manage.py sync_db_to_hubspot create

If you're asked to run only one, for example deals, you can consult the documentation linked above and see that you should add the --deals flag to the invocation.
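For example, a deals-only sync would look something like this (the flag name comes from the paragraph above; double-check it against the linked documentation before running):

./manage.py sync_db_to_hubspot create --deals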

Be sure to inform the requester of what you see for output and add it to the ticket for this issue if there is one.

If you see the command fail with an exception, note the HTTP response code. In particular, a 401 means that the API key is likely out of date. A 409 signals a conflict (e.g. a duplicate email) that will likely be handled by conflict-resolution code and thus can probably be ignored.

MITXOnline

Cybersource credentials potentially out of date

Diagnosis

Often we will get a report like this indicating that one of our Cybersource credentials is out of date.

Mitigation

Since we have no access to the Cybersource web UI, we must send email to sbmit@mit.edu to validate the status of the current credential or request a new one.

Grading Celery Task Failed (STUB entry. Needs love)

Diagnosis

Usually we'll get reports from our users telling us that grading tasks have failed.

At that point, surf to Celery Monitoring and log in with your Keycloak Platform Engineering realm credentials.

Then, get the course ID for the failed grading tasks and search for it in Celery Monitoring by entering the course key in the kwargs input, surrounded by {' and '}, for example {'course-v1:MITxT+14.310x+1T2024'}.

Mitigation

You may well be asked to run the compute_graded management command on the LMS for mitxonline. (TODO: Needs details. How do we get there? etc.)

Reddit

[Prometheus]: [FIRING:1] DiskUsageWarning production-apps (reddit filesystem /dev/nvme0n1p1 ext4 ip-10-13-1-59 integrations/linux_

[Pingdom] Open Discussions production home page has an alert

Diagnosis

We often get low disk errors on our reddit nodes, but in this case the low disk alert was paired with a pingdom alert on open-discussions. This may mean that pgbouncer is in trouble on reddit, likely because its credentials are out of date.

You can get a view into what's happening by logging into the node cited in the disk usage ticket and typing:

salt reddit-production* state.sls reddit.config,pgbouncer
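If you want to confirm pgbouncer's state directly on the node first (assuming it runs as a systemd service there, which this entry doesn't spell out), standard checks would be:

systemctl status pgbouncer
journalctl -u pgbouncer --since "1 hour ago" | tail -n 50    # look for authentication failures that would point at stale credentials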

Mitigation

Once you've determined that pgbouncer is indeed sad, you can try a restart / credential refresh with the following command:

salt reddit-production* state.sls reddit.config