Pulumi AWS Classic 5 To 6 Upgrade
So, this is roughly what I believe is happening, so you guys can know too.
- I updated / tested a bunch of stacks, mostly CI or sometimes QA, yesterday with the `pulumi-aws` 6.5.0 upgrade. That is a two part thing: the Python package but also the provider plugin (rough sketch of both parts below).
- As part of that upgrade, pulumi changed some of the state/stack structures and they are no longer backwards compatible, in particular those around RDS instances.
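For reference, the "two part" bit looks roughly like this, assuming a poetry-managed project (these exact commands are my sketch, not a prescribed procedure):

```bash
# Part 1: the Python SDK pin in the project environment.
poetry add "pulumi-aws==6.5.0"   # updates pyproject.toml / poetry.lock

# Part 2: the resource provider plugin the pulumi engine actually runs.
# It is downloaded on demand to match the SDK; check what's installed with:
pulumi plugin ls
```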
- So I upgraded a bunch of stacks to the new provider by hand; they migrated nicely and everything was good.
- Concourse comes around, something triggers it, and it applies the stacks again, but with the old provider. That works okay the first time, but on the second run it decides it has lost track of the RDS instance and tries to recreate it.
- If it DOES try to recreate it, it thankfully fails: the RDS instance is still there, pulumi has just lost track of it, so a duplicate-resource error gets thrown and the apply/up fails.
- But, now the stack is broken and requires manual intervention to revive it.
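(Side note: a quick way to check which side of the upgrade a stack's state is on is to export it and look at the recorded aws provider version. This is a sketch; the jq path assumes the usual `pulumi stack export` layout.)

```bash
# Print the aws provider version(s) recorded in the stack's current state.
pr pulumi stack export -s <stackname> \
  | jq -r '.deployment.resources[] | select(.type == "pulumi:providers:aws") | .inputs.version'
```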
- So what I do is go into the history/checkpoint area in S3, find the most recent checkpoint that references the 6.5.0 provider (the one I upgraded to), and pull that down.
- Checkpoints can’t be imported directly; they need to be fixed first. Fix one with this: https://gist.github.com/clstokes/977b7bd00b37e0a564f707f0ebe36e08
- Then, to revive the stack (the whole flow is sketched right after these steps):
  - `pr pulumi stack import -s <stackname> --file <fixed stack checkpoint file>`
  - `poetry install`, using a pre-6.5.0-upgrade environment if necessary (I keep multiple ol-inf envs for exactly this kind of thing, so I just switch between my 5.4 and 6.5 envs).
  - `pulumi plugin rm resource aws 6.5.0` to uninstall the PROVIDER.
  - `pr pulumi up --refresh -s <stackname>` x2. It shouldn’t be trying to create a database anymore.
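Pulled together, the recovery flow looks roughly like this. It is a sketch: the bucket name and history paths are placeholders (the exact layout varies by backend and CLI version), and the jq transform is my reading of what the gist automates.

```bash
BUCKET="s3://<state-bucket>"
STACK="<stackname>"

# 1. Browse the history area and find the newest checkpoint still referencing 6.5.0.
aws s3 ls --recursive "${BUCKET}/.pulumi/history/" | grep "${STACK}"

# 2. Pull it down and sanity-check that it mentions the 6.5.0 provider.
aws s3 cp "${BUCKET}/.pulumi/history/<path-to-checkpoint>.json" ./checkpoint.json
grep -c '6.5.0' checkpoint.json

# 3. Convert the checkpoint into an importable deployment (what the gist automates).
jq '{version: .version, deployment: .checkpoint.latest}' checkpoint.json > fixed.json

# 4. Import the fixed state, drop the 6.5.0 plugin, then up --refresh twice.
pr pulumi stack import -s "${STACK}" --file fixed.json
pulumi plugin rm resource aws 6.5.0
pr pulumi up --refresh -s "${STACK}"
pr pulumi up --refresh -s "${STACK}"
```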
- Many of the stacks were fine because they either:
- didn’t have RDS resources
- didn’t get up’d by concourse in between.
- BUT! BUT! When the PR making this upgrade was merged to `main`, it triggered nearly everything. Very sad. :disappointed:
- But the code has been merged, so that is fine, right? Wrong.
- The resources that concourse uses to do the needful are not upgraded yet. Specifically these:
  - These actually have a complicated silent dependency between them, but basically `mitodl/ol-infrastructure` is a base layer for the other two.
  - These builds also kicked off when the PR was merged (maybe? I did run them by hand too…), but they didn’t publish to dockerhub before concourse started trying to update CI/QA stacks automatically.
- EVEN IF they had published to dockerhub before, it wouldn’t matter because servers-be-cachin’.
- The concourse workers were holding on to their old versions and efficiently re-using them, ignorant of the new versions available out on dockerhub.
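You can see the staleness directly on a worker. The image name below is an assumption on my part; substitute whatever the pipelines actually pull.

```bash
# Digest of the image a worker has cached locally (image name is hypothetical).
docker image inspect --format '{{index .RepoDigests 0}}' mitodl/ol-infrastructure:latest

# A fresh pull is what actually picks up a newly published digest.
docker pull mitodl/ol-infrastructure:latest
```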
- Solve this by doing an instance refresh on the concourse workers. New servers do a fresh `docker pull` of the resources they need.
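Kicking that off looks roughly like this; the auto scaling group name is a placeholder for whatever the concourse workers actually live in.

```bash
# Start a rolling replacement of the concourse worker instances.
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <concourse-worker-asg>

# Watch progress; replacement workers do fresh docker pulls as jobs land on them.
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <concourse-worker-asg>
```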
- Addendum: Made a neat script to pause concourse pipelines en masse. Should have run this before #8 :disappointed:
```bash
for pl in $(fly -t pr-inf ps --json | jq -r '.[].name'); do
  fly -t pr-inf pp -p "$pl"
done
```
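And the companion for afterwards, assuming fly’s `up` alias for unpause-pipeline (same pattern as `pp` for pause-pipeline):

```bash
# Unpause everything again once the new images are in place.
for pl in $(fly -t pr-inf ps --json | jq -r '.[].name'); do
  fly -t pr-inf up -p "$pl"
done
```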