💾 Archived View for buetow.org › gemfeed › 2021-10-22-defensive-devops.gmi captured on 2022-04-28 at 17:38:24. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2022-01-08)
-=-=-=-=-=-=-
c=====e H ____________ _,,_H__ (__((__((___() //| | (__((__((___()()_____________________________________// |ACME | (__((__((___()()()------------------------------------' |_____| ASCII Art by Clyde Watson
Published by Paul at 2021-10-22
I have seen many different setups and infrastructures during my carreer. My roles always included front-line ad-hoc fire fighting production issues. This often involves identifying and fixing these under time pressure, without the comfort of 2-week-long SCRUM sprints and without an exhaustive QA process. I also wrote a lot of code (Bash, Ruby, Perl, Go, and a little Java), and I followed the typical software development process, but that did not always apply to critical production issues.
Unfortunately, no system is 100% reliable, and you can never be prepared for a subset of the possible problem-space. IT infrastructures can be complex. Not even mentioning Kubernetes yet, a Microservice-based infrastructure can complicate things even further. You can take care of 99% of all potential problems by following all DevOps best practices. Those best practices are not the subject of this blog post; this post is about the sub 1% of the issues arising from nowhere you can't be prepared for.
Is there a software bug in a production, even though the software passed QA (after all, it is challenging to reproduce production behaviour in an artificial testing environment) and the software didn't show any issues running in production until a special case came up just now after it got deployed to production a week ago? Are there multiple hardware failure happening which causes loss of service redundancy or data inaccessibility? Is the automation of external customers connected to our infrastructure putting unexpectedly extra pressure on your grid, driving higher latencies and putting the SLAs at risk? You bet the solution is: Sysadmins, SREs and DevOps Engineers to the rescue.
You agree that fixing production issues this way is not proactive but rather reactive. I prefer to call it defensive, though, as you "defend" your system against a production issue. But, at the same time, you have to take a cautious (defensive) approach to fix it, as you don't want to make things worse.
Over time, I have compiled a list of fire-fighting automation strategies, which I would like to share here.
Defensive DevOps is a term I invented by myself. I define it this way:
That sounds a bit crazy, but this is, unfortunately, in rare occasions the reality. As the question is not whether production issues will happen, the question is WHEN they will happen. Every large provider, such as Google, Netflix, and so on, suffered significant outages before, and I firmly believe that their engineers know what they are doing. But you can prepare for the unexpected only to a certain degree.
Do you have to solve problem X? The best solution would be to fully automate it away, correct? No, the best way is to fix problem X manually first. Does the problem appear on one server or on thousand servers? The scale does not matter here. The point is that you should fix the problem at least once manually, so you understand the problem and how to solve it before implementing automation around it.
You should also have a short meeting with your team. Every person may has a different perspective and can give valuable input for determining the best strategy. But, again, keep the session short and efficient. Focus on the facts. After all, you are the domain expert and you probably know what you are doing.
Once you understand the problem, fix it on a different server again. This time maybe write a small program or script. Semi-automate the process, but don't fully automate it yet. Start the semi-automated solution manually on a couple of more servers and observe the result. You want to gain more confidence that this really solved the problem. This can take a couple of hours manually running it over and over again. During that process, you will improve your script iteratively.
You have to develop code directly on a production system. This sounds a bit controversial, but you want to get a working solution ASAP, and there is a very high chance that you can't reproduce problem X in a development or QA environment. Or at least it will consume significant effort and time to reproduce the problem, and by the time your code is ready, it's already too late. So the most practical solution is to directly develop your solution against a production system with the problem at hand.
You might not have your full-featured IDE available on a production system, but a text editor, such as Vim (or Neovim), is sufficient for writing scripts. Some editors allow you to edit files remotely. With Vim you can accomplish it with "vim scp://SERVER///path/to/file.sh". Every time you save the file, it will be automatically uploaded via SCP to the server. From there, you can execute it directly. This comes with the additional benefits of still having access to all the Vim plugins installed locally, which you might not have installed on any production machines. This approach also removes any network delays you might experience when running your editor directly on a remote machine.
Unfortunately, it will be a bit more complicated when you rely on code reviews (e.g. in a FIPS environment). Pair-programming could be the solution here.
You want to triple-check that your script is not damaging your system even further. You might introduce a bug to the code, so there should always be a way to roll back any permanent change it causes. You have to program it in a defensive style:
Furthermore, when you write Bash script, always run the tool ShellSheck (https://shellshock.io/) on it. This helps to catch many potential issues before applying it in production.
You probably won't have time for writing unit tests. But what you can do is to pedantically test your code manually. But you have to do the testing on a production machine. So how can you test your code in production without causing more damage?
Your script should be idempotent. This means you can run it infinite times in a row, and you will always get the same result. For example, in the first run of the script, a file A get's renamed to A.backup. The second time you run the script, it attempts to do the same, but it recognises that A has already been renamed to A.backup and then it is skipping that step. This is very helpful for manually testing, as it means that you can re-run the script every time you extended it. You should dry-run the script at least once before running it for real. You can apply the same principle for almost all features you add to the code.
You may also want to inject manual negative testing into your script. For example, you want to run a particular function F in your script but only if a certain pre-condition is met, and you want to ensure that the code branching works as expected. The pre-condition check could be pretty complex (e.g. N log messages containing a specific warning string are found in the applications logs, but only on the cluster leader server). You can flip the switch directly in the code manually (e.g. run F only, when the pre-condition isn't met) and then perform a dry run of the script and study the output. Once done, flip the switch back to its correct configuration. For double insurance, test the same on a different server type (e.g. on a follower and not on a leader system).
By following these principles, you test every line of code while you are developing on it.
At one point, you will be tired of manually running your script and also confident enough to automate it. You could deploy it with a configuration management system such as puppet Puppet and schedule a periodic execution via cron, a systemd timer or even a separate background daemon process. You have to be extremely careful here. The more you automate, the more damage you can cause. You don't want to automate it on all servers involved at once, but you want to slowly ramp up the automation.
First, automate it only on one single server and monitor the result closely. At first, only automate running the script in dry mode. Also, don't forget that you still should log everything that the script is doing. Once everything looks fine, you can automate the script on the canary server for real. It shouldn't be a disaster if something goes wrong as usually systems are designed in a HA fashion, where the same data is still at least on another server available. In the worst-case scenario, you could recover data from there or from the local backup files your script created.
Now, you can add a handful more canary servers to the automation. You should keep close attention to what the automation is doing. You could use a tool like DTail for distributed log file following. At this point, you could also think of deploying a monitoring check (e.g. Icinga) to see whether your script is not terminating abnormally or logging warnings or errors.
DTail - The distributed log tail program
From there, you could automate the solution on more and more servers. Best, ramp up the automation to a handful of systems, and later to a whole line of servers (e.g. all secondary servers of a given cluster). And afterwards, automate it on all servers.
Remember, whenever something goes wrong, you will have plenty of logs and backup files available. The disaster recovery would involve extending your script to take care of that too or writing a new script for rolling back the backups.
If possible, don't deploy any automation shortly before out of office hours, such as in the evening, before holidays or weekends. The only exception would be that you, or someone else, will be available to monitor the automation out of office hours. If it is a critical issue, someone, for example, the on-call person, could take over. Or ask your boss to work now but to take off another day to compensate.
You should add an easy off-switch to your automation so that everyone from your team knows how to pause it if something goes wrong in order to adjust the automation accordingly. Of course, you should still follow all the principles mentioned in this blog post when making any changes.
For every major incident, you need to follow up with an incident retrospective. A blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again in the future.
This usually means creating one or more tickets, which will be dealt with soon. Once the permanent fix is deployed, you can remove your ad-hoc automation and monitoring around it and focus on your regular work again.
E-Mail me your comments to paul at buetow dot org!