
Ask HN: High availability control system software architectures

Author: zeus_hammer

Score: 6

Comments: 4

Date: 2021-12-03 21:20:14

________________________________________________________________________________

chermi wrote at 2021-12-04 02:20:49:

Can you specify exactly what you mean by "control system" here? Are you talking about software that actually sends signals to hardware that eventually makes it out to physical equipment that does something?

I'm trying to understand the actual environment. When you say "deployment", what is changed, where does it start, and how far does it propagate?

For example, would one option for zero downtime be to have replicated (2 or more) "control systems" behind some "layer" (sorry, it's hard to be precise without knowing more), enforcing synchronicity between them while only one is actually controlling at any time? Then, when you are patching or updating, you freeze one, update the other, then switch over. Not advocating a solution, just trying to understand the situation by throwing out an example to talk around.
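
To make that freeze/update/switch idea concrete, here's a rough Python sketch (all the names and classes are made up, just to show the shape of an active/standby pair, not a real control system):

    # Toy active/standby pair with a zero-downtime update:
    # sync state to the standby, patch it, then hand over control.

    class Controller:
        def __init__(self, name, version):
            self.name = name
            self.version = version
            self.state = {}

        def sync_from(self, other):
            # replicate the active controller's state onto the standby
            self.state = dict(other.state)

        def upgrade(self, new_version):
            # stand-in for loading the patched control software
            self.version = new_version

    def switchover_update(active, standby, new_version):
        standby.sync_from(active)      # freeze a consistent snapshot
        standby.upgrade(new_version)   # patch the non-controlling replica
        return standby, active         # roles swap: standby now controls

    a = Controller("ctrl-a", "1.0")
    b = Controller("ctrl-b", "1.0")
    active, standby = switchover_update(a, b, "1.1")
    print(active.name, active.version)   # ctrl-b 1.1 is now in control

The hard part is of course everything the sketch hides: keeping the two continuously state-synchronized and making the handover bumpless for whatever physical process is being controlled.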

I'm not an expert in this at all, but if what I'm talking about above is even close to being on track, I'd recommend this book for starters:

https://www.amazon.com/Introduction-Embedded-Systems-Cyber-P...

karmakaze wrote at 2021-12-04 02:24:16:

On the surface of it, loading, patching, and deploying without downtime is not very complicated. Basically all cloud product/service providers do this with fault-tolerant network design, routing/load-balancing, distributed/fault-tolerant datastores, and blue-green continuous integration/delivery (CI/CD) pipelines.

The hard part is being very strict about ensuring that every change is safe, and/or being able to rapidly and automatically restore a working state so you stay within a very low error budget. Each additional '9' of uptime (99.9, 99.99, ...) is an order of magnitude harder.
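
For the restore/rollback half, a blue-green switch gated on health checks is the usual shape. Here's a toy Python version (the Environment/Router classes are just stand-ins for whatever your platform actually provides, e.g. a load balancer API):

    # Toy blue-green deploy with automatic rollback; not tied to any real platform.

    class Environment:
        def __init__(self, name, release):
            self.name, self.release = name, release

        def deploy(self, release):
            self.release = release

        def health_check(self):
            # replace with a real probe: HTTP 200s, error rate, queue depth...
            return True

    class Router:
        def __init__(self, live):
            self.live = live

        def route_traffic_to(self, env):
            self.live = env

    def blue_green_deploy(blue, green, new_release, router):
        green.deploy(new_release)        # update the idle environment
        router.route_traffic_to(green)   # shift traffic over
        if all(green.health_check() for _ in range(5)):
            return green                 # new release stays live
        router.route_traffic_to(blue)    # automatic rollback to known-good
        return blue

    blue = Environment("blue", "v1")
    green = Environment("green", "v1")
    router = Router(blue)
    live = blue_green_deploy(blue, green, "v2", router)
    print(live.name, live.release)       # "green v2", or "blue v1" after rollback

Staying inside the error budget is then mostly a question of how quickly and how reliably that rollback path fires.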

GianFabien wrote at 2021-12-04 01:37:43:

Telephony central office switches are an example of the sort of systems you are asking about. A typical switch is designed to run for 30 years with less than 30 minutes of downtime in all that time. Ericsson AXE and Western Electric are two such systems I have worked with. A more recent example would be the Erlang language and the related OTP environment.
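
The pattern behind that longevity is worth calling out: many small supervised workers that are allowed to crash and get restarted, rather than one monolith that must never fail. A toy Python loop (this is not OTP, just an illustration of the supervisor idea):

    # Toy supervisor: restart a crashing worker instead of taking the system down.
    import random

    def worker(job_id):
        if random.random() < 0.3:
            raise RuntimeError(f"job {job_id} crashed")
        return f"job {job_id} done"

    def supervise(jobs, max_restarts=3):
        for job_id in jobs:
            for _ in range(max_restarts + 1):
                try:
                    print(worker(job_id))
                    break
                except RuntimeError as err:
                    print(f"supervisor: {err}; restarting")
            else:
                print(f"supervisor: giving up on job {job_id}")

    supervise(range(5))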

yuppie_scum wrote at 2021-12-04 00:23:47:

Some topics for you to research:

12 Factor Applications

Site Reliability Engineering

Chaos Engineering

Kubernetes