Misbehaving consul cluster incident from 20231004-20231905
Overview
At some point the operations-production consul cluster fell over and stopped being useful. The root cause was not immediately known, but was almost definitely related to an upgrade to 1.16.2 from ??? (there is no good record indicating which version the cluster was on before all this went down...).
Key takeaway from the logs was this cryptic message about being unable to restore snapshot:
{"@level":"error","@message":"failed to restore snapshot","@module":"agent.server.raft","@timestamp":"2023-10-04T14:17:40.200037Z","error":"failed to restore snapshot 1156-86945697-1696429059987: failed inserting acl token: missing value for index 'accessor'"}
A restore like this happens any time a consul server restarts: it gets a copy of the raft state from the other servers currently running and restores it. Here, though, the restore was failing and the server was crashing.
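To see how often a server is hitting this, the restore failures can be filtered out of the JSON logs. This is only a minimal sketch: the function name snapshot_restore_failures is made up for illustration, and the journalctl example assumes consul runs under a systemd unit named "consul" and logs in the JSON format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read JSON-formatted consul server logs on stdin
# and print only the lines reporting a failed snapshot restore.
snapshot_restore_failures() {
  # -F matches the string literally, so the quotes and "@" need no escaping.
  grep -F '"@message":"failed to restore snapshot"'
}

# Example usage (assumption: systemd unit named "consul", JSON logging on):
#   journalctl -u consul --no-pager -o cat | snapshot_restore_failures
```

Piping the result through wc -l gives a quick count of how many times the restore has failed on that node.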
Initial Response
Tobias was able to revive the cluster by downgrading it to 1.14.10 and it was then able to restore the snapshot it received from other nodes / on the filesystem (unclear how broken the cluster was at this time).
Research
Looking up that message returned one very-not-promising result from the hashicorp forums.
Resolution
Ultimately I spent a lot of time reading and pursuing dead ends, but what I believe ultimately resolved the issue was the following:
- Step through each consul server in the cluster and ensure:
  a. It is on version 1.14.10.
  b. It has this acl stanza in 00-default.json: "acl": {"enabled": true, "default_policy": "allow", "enable_token_persistence": false}
  c. Restart the servers one at a time to ensure the quorum never drops below 3 (or 2 if you're in non-prod).
- Now you should be able to issue acl commands using the consul cli.
  a. They won't work though, because you don't have a token and in this particular case the cluster says it is not eligible for bootstrapping.
  b. At some point in history this very special cluster had ACLs enabled and then disabled? Possibly?
- We need to reset/recover the ACL master token. Follow the procedure here.
- The output of that command should be "Bootstrap Token (Global Management)", an AccessorID and a SecretID. Export the secret ID in your terminal as CONSUL_HTTP_TOKEN.
- Then you can do consul acl token list and one of them should be labeled "Master Token".
  a. Export the SecretID from the Master Token as CONSUL_HTTP_TOKEN or just save it off somewhere.
- Repeat step 1: loop through all the servers and remove the acl stanza, restarting one at a time to ensure quorum.
- Once all nodes are running again and ACL is disabled, start upgrading the nodes, one at a time, to 1.16.2:
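# Download and unpack the 1.16.2 binary
wget https://releases.hashicorp.com/consul/1.16.2/consul_1.16.2_linux_amd64.zip
unzip consul_1.16.2_linux_amd64.zip
mv consul consul_1_16_2
# Stop the running server and swap in the new binary
systemctl stop consul
cp consul_1_16_2 /usr/local/bin/consul
# Back up the data directory, then clear the old raft/serf state before starting on the new version
cd /var/lib
cp -r consul consul_bak
cd consul
rm -rf serf/ raft/ server_metadata.json checkpoint-signature
systemctl start consul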
- Verify quorum: consul operator raft list-peers
- Verify version: consul members and consul version
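Since the rolling restarts and upgrades above depend on never dropping below quorum, it can help to check the voter count after each node comes back instead of eyeballing the peer list. The following is a rough sketch, not a vetted tool: count_voters is a hypothetical helper, and it assumes the list-peers output is a whitespace-separated table whose header row contains a column named "Voter" with true/false values.

```shell
#!/bin/sh
# Hypothetical helper: read `consul operator raft list-peers` output on
# stdin and print how many peers are listed as voters.
# Assumption: the header row has a "Voter" column and data rows hold
# "true" or "false" in that same position.
count_voters() {
  awk '
    # Header row: remember which column is labelled "Voter".
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Voter") col = i; next }
    # Data rows: count peers whose Voter column is "true".
    col && $col == "true" { n++ }
    # n + 0 prints 0 rather than an empty line when nothing matched.
    END { print n + 0 }
  '
}

# Example usage: stop before touching the next node if fewer than 3
# voters are present (use 2 for non-prod, per the steps above).
#   voters=$(consul operator raft list-peers | count_voters)
#   [ "$voters" -ge 3 ] || echo "only $voters voters, do not restart another node"
```

The column is looked up by header name rather than by fixed position, so the sketch keeps working if other columns are present, as long as the "Voter" header still lines up with its data.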