💾 Archived View for arcticfire.sytes.net › posts › time_bomb.gmi captured on 2024-09-29 at 00:28:29. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2024-03-21)
-=-=-=-=-=-=-
The first week of last December a developer made a simple mistake. They inadvertently committed their .nmprc to the git repository they were working on. This commit triggered the Continuous Integration process, which built the image, and the Continuous Delivery process which deployed the image to the development cluster.
The image crashed repeatedly without ever successfully starting. The developer spun the number of pods down to 0 to stop the pod from crashing and investigated. They saw their mistake, deleted the .npmrc file, and even rewrote the git history so that the file was no longer in the history, and avoid the problem from returning. Their commit triggered another run of CI and CD to deploy the new image, and their image spun up. They tested and moved on with life.
The time bomb had been set.
Fast forward to last night. Major networking patching was done on the network settings for the development Kubernetes cluster. All deployments were spun down to zero to stop everything and the patches were applied. All deployments were then spun back up.
The time bomb had been triggered.
I had blocked off most of the day to meet with the new folks that had been added to my team in the recent reorganization. And even more meetings to meet with the product owners of the teams they had come from to make sure I was understanding the new work I was getting. A few minutes into the first one, I was pulled out to a separate meeting by my boss to help him prepare for a meeting with senior leadership about some new initiatives I was receiving and just coming up to speed on.
A few minutes into that one, we were all called by the leader of the team that owns the artifact repository. It wasn't working correctly, and breaking roughly 90% of the CI/CD pipelines. We all jumped on the call to help triage.
Being Friday before a big release window, traffic was higher than normal on the pipelines, and with all the teams trying to get last minute changes ready for release window. Tempers were high, patience was low. One of the CI systems hit 300+ jobs in the queue as people realized it was an intermittent issue, and just kept spamming retry on their jobs until it worked.
Investigating the logs in the artifact repository, it seemed the account that the pipeline uses to connect to the artifact repository was getting locked out periodically. We were seeing messages in the log as to when it would be unlocked. Looking at the timestamps, it looked like the account was being locked for only 60 seconds, but it was happening repeatedly.
We worked with the vendor to temporarily disable this locking behavior so that the pipelines could run successfully. We still had to track down where the failed login attempts were coming from.
Further log digging pointed at a Kubernetes host serving north of of 100 pods, things had to be narrowed down further. The service account that the pipeline uses should only be used by the pipeline, but it has been known be used by closely adjacent teams that need something in a hurry, as getting their own service account can be a 3-6 week processes (3 cheers for decreased security through bureaucracy).
We identified 3 namespaces where those credentials were used. We didn't know which one was responsible, but it felt like we were on to something. Pulled in people from those teams to try and see if we could find the issue. As they were well aware of the issues, they suddenly felt the eyes of everyone on them that they may have been at fault. Several hours of deconstructing their applications to find out why they may be calling the artifact repository from a running application, and how they could have bad credentials.
It all turned up nothing. One of the teams hadn't touched their app in years and tore it down while on the call. The problem continued. Another one had the secret in their namespace but couldn't find any place it used it. The third was using it, and it was correct.
While this effort was going on, somebody looked closer at when locks occurred, and realized each one was 309 or 310 seconds after the last one. That eliminated something like cron, because that would be triggered on a regular interval. Even if a job took 20 seconds to generate an error, each execution would happen 5 minutes apart if they were on a 5 minute cron schedule, so the errors would all happen exactly 5 minutes apart. (just 20 seconds after cron triggered the job)
What we needed was something happened after 5 minutes, but the time to execute was also added to that 5 minutes. Then it dawned on me, Kubernetes crashloopbackoff. Searching around I confirmed it tops out at 5 minutes for the version of Kubernetes we were using. 9-10 seconds for a pod to spin up, generate an error, and Kubernetes would kill it and wait 5 minutes before trying again, giving us our timing.
That host still had 6 pods rolling through crashloopbackoffs (yay development clusters) so we dove through the logs in each one. And there was one throwing an error that it couldn't connect to npmjs.com. That's odd since it shouldn't be connecting to npmjs.com as that's blocked at the firewall for everything except the artifact repository. But it was the closest piece of evidence we had so we spun down the pod.
The errors went away.
We spun the pod back up. The error happened. Then it happened 19 seconds later. Then it happened 39 seconds later. This was our offender, as we watched the crashloopbackoff steadily increase the time of the retries.
We immediately dug into the code of this image. The image entrypoint called a script which executed npx web-server. So rather than running an image that had a web server baked in like our Apache, Nginx, or Caddy servers, it decided to pull one from our artifact repository at runtime and execute it.
It was why running containers usually don't reach out to the artifact repository, but this one was. The question was, why was it using the service account with incorrect settings, and where did it get it?
Nothing in the source code repository indicated it, so we moved the container to a test server with just docker. Starting it up there confirmed once again that it was throwing an incorrect login attempt at the artifact repository. Executing into the image found the .npmrc file that had been inadvertently checked in.
It contained an _auth line with a hashed password, and was overriding the npmrc file in the base image that just contained information on our private artifact repository. The _auth was obviously the hashed password of the developer that made the mistake. They probably had their username in another global npmrc on their machine. Why they couldn't have put their password in the same file that had their user, I don't really know.
That repository normally wouldn't have normally required authentication to just pull from it, but the .npmrc file had always-auth set. The bigger question was where did the service id come from. The name wasn't anywhere in the file, or elsewhere in the container.
Again the hint was in the .npmrc file, it had a blank email line. Looking through the users in the artifact repository, everyone had an email address set by ldap, except the service account. The artifact repository seemed to want to match the blank email address that was sent in with the auth of the .npmrc with the blank email of the service account.
That seemed to be our answer. We gave the service account an email address, and removed the image from the image registry. When the developer pushed the version with the .npmrc file in it, it never started up, and probably had locked our artifact repository. But because they were there and spun it down right away, it never really caused an issue.
But they only set the replica sets to zero, not deleted the deployment. So the deployment still existed when last night all the other pods were set to zero so the system could be patched. Then all the pods, including the one from last December, were set back to 1.
This resurected the image, and once again it began trying to start up, and pulling the image with a user's authentication settings and a blank email. The artifact repository equated the blank email with the service account the pipeline used, but the _auth line didn't match its password. The artifact repository locked the account to avoid a password cracking attempt. The locked account broke pipelines.
And that's how a couple minute mistake 2 months ago caused a dozen people to lose an entire day's work today.