💾 Archived View for dioskouroi.xyz › thread › 29435664 captured on 2021-12-04 at 18:04:22. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
________________________________________________________________________________
Can you specify exactly what you mean by "control system" here? Are you talking about software that actually sends signals to hardware, which eventually make it to physical equipment that does something?
I'm trying to understand the actual environment. When you say "deployment", what is changed, where does it start, and how far does it propagate?
For example, would one option for zero downtime be to have replicated (2 or more) "control systems" beyond some "layer" (sorry, it's hard to be precise without knowing more), enforcing synchronicity between them while having only one actually controlling at any time? Then, when you are patching or updating, you freeze one, update the other, and then switch over to the updated one? Not advocating a solution, just trying to understand the situation by throwing out an example to talk around.
I'm not an expert in this at all, but if what I'm talking about above is even close to being on track, I'd recommend this book for starters:
https://www.amazon.com/Introduction-Embedded-Systems-Cyber-P...
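To make the replicated-controller example above a bit more concrete, here is a rough Python sketch of the freeze/update/switch idea. Everything in it (the `Controller` and `ReplicatedPair` names, the `state` dictionary, the version strings) is made up for illustration; a real control system would also need health checks, rollback, and a handover that is safe for the physical equipment.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller replica; not a real product or API."""
    name: str
    version: str
    state: dict = field(default_factory=dict)


class ReplicatedPair:
    """Two replicas kept in sync, with exactly one controlling at a time."""

    def __init__(self, active: Controller, standby: Controller):
        self.active = active
        self.standby = standby

    def write(self, key, value):
        # Only the active replica drives the equipment; the standby
        # just mirrors the state so it can take over later.
        self.active.state[key] = value
        self.standby.state[key] = value

    def upgrade(self, new_version: str):
        # 1. The active replica stays frozen at its current version
        #    and keeps controlling while the standby is patched.
        self.standby.version = new_version
        # 2. Resynchronize the patched standby from the active replica.
        self.standby.state = dict(self.active.state)
        # 3. Switch roles: the patched replica takes control. The old
        #    active is now the standby and can be patched the same way.
        self.active, self.standby = self.standby, self.active


pair = ReplicatedPair(Controller("A", "1.0"), Controller("B", "1.0"))
pair.write("valve", "open")
pair.upgrade("1.1")
print(pair.active.name, pair.active.version, pair.active.state)
```

After `upgrade`, replica B is in control on the new version with the mirrored state, and replica A is still on the old version as the standby.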
Loading, patching, and deploying without downtime is not very complicated on the surface of it. Almost all cloud product/service providers do this with fault-tolerant network design, routing/load-balancing, distributed/fault-tolerant datastores, and blue-green deployments driven by continuous integration/delivery (CI/CD) pipelines.
The hard part is being very strict about ensuring that every change is safe, and/or being able to rapidly and automatically restore a working state, so that you stay within a very low error budget. Each additional '9' of uptime (99.9...%) is an order of magnitude or more harder.
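To put numbers on the error budget, here is a small sketch of how much downtime per year each extra '9' leaves you. The function name is made up, and it assumes a 365-day year and counts any unavailability as downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with this many nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 7):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Each additional nine shrinks the budget by a factor of ten, which is why every step up gets so much harder to meet.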
Telephony central office switches are an example of the sort of system you are asking about. A typical switch is designed to run for 30 years with less than 30 minutes of downtime in all that time. Ericsson AXE and Western Electric are two such systems I have worked with. A more recent example would be the Erlang language and the related OTP environment.
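As a back-of-the-envelope check on that design target, this sketch converts 30 minutes of downtime over 30 years into an availability figure. It assumes 365-day years; the variable names are just for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

downtime_minutes = 30
lifetime_minutes = 30 * MINUTES_PER_YEAR

# Fraction of the switch's lifetime it is allowed to be down.
unavailability = downtime_minutes / lifetime_minutes
availability = 1 - unavailability

# Number of "nines" as a continuous value: -log10 of the unavailability.
nines = -math.log10(unavailability)

print(f"availability: {availability:.7%}")
print(f"roughly {nines:.2f} nines")
```

Under these assumptions the target works out to somewhere between five and six nines, i.e. about one minute of downtime per year on average.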
Some topics for you to research:
12 Factor Applications
Site Reliability Engineering
Chaos Engineering
Kubernetes