💾 Archived View for dioskouroi.xyz › thread › 24929710 captured on 2020-10-31 at 00:53:24. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

SRE Teams: Hash

Author: andriosr

Score: 118

Comments: 90

Date: 2020-10-29 12:22:53

________________________________________________________________________________

pram wrote at 2020-10-29 13:04:15:

I worked at a major software company that was implementing an “SRE” org and it was frankly pathetic. It basically had its genesis in some higher up hearing that Google does this thing, so we probably should too. That then later turned into a bunch of random DevOps/Sys Admins drafted into the new org reading the Google SRE book. It was a shamelessly cargo cult approach.

After a year all we really had was hundreds of Confluence pages specifying policy for this and that, a bunch of ceremony added around releases, and arbitrary SLA/SLOs defined. Almost zero actual technical work was accomplished. They then started filling the team with new grads who had neither devops nor sysadmin experience, kind of cementing the perception that it was a non-technical role.

Either way I’m not a fan of the entire concept after being subjected to that suffering!

andriosr wrote at 2020-10-29 13:42:36:

Blindly replicating what Google is doing is really problematic, I saw something similar at a few companies. The principles are interesting, but they have to consider each company's reality.

Google is putting a lot of marketing money on the topic, but from what I saw they are trying to alert that there is no one size fits all for this.

jeffbee wrote at 2020-10-29 13:53:54:

So far I haven't seen any companies replicating what Google SRE does, blindly or intentionally. Much as this article describes, organizations I've seen are just inventing a role they perceive to exist in their organization and slapping the label "SRE" upon it. Often these roles are directly contradictory to the way Google SRE works: they are just carrying pagers for products they aren't allowed to improve, they own build/test/release for some reason even though that clearly falls in the domain of the developer, SREs are setting service levels which is obviously the responsibility of product.

I think people should realize that when you go into the hiring market looking for SREs, you are about to pay more, not less, for a highly qualified software engineer who is also a systems expert. You want these people to not just operate product, but to be closely involved in architecture and development so the product turns out to be operable. Your SREs should be working in the product code on the bits that ordinary developers might overlook or not be qualified to write, because they lack expertise in system or distributed systems behavior.

What this article is calling "real world" is just an attempt to redefine the term.

sumtechguy wrote at 2020-10-29 14:07:27:

An SRE with no authority to do anything is just an overpaid totally frustrated paperweight. They have to be able to change things on both sides of the 'fence'; if they can only do one or the other or neither, then their role is not an SRE, no matter what sorts of titles you give them.

Any org I go into that says 'we want an SRE because google has one', I can not help myself, I blurt out 'you are not google. what works for them will not work for you, you need to find your own way of doing this that fits your org. Trying to mimic google without understanding why they are the way they are will damage your org'

user5994461 wrote at 2020-10-29 16:40:50:

>>> An SRE with no authority to do anything is just an overpaid totally frustrated paperweight.

That matches my experience at most companies, the SRE/DevOps is just the sucker who is on call.

dodobirdlord wrote at 2020-10-29 18:27:51:

I think authority is a critical differentiator between the SRE described in the Google book and SRE as it’s practiced elsewhere. Google SREs can impose moratoriums on feature development or deployments if they judge that a product is becoming critically unreliable (which has a well-defined meaning based on SLO breaches). SRE teams will also sometimes “offboard” a product that is insufficiently respectful of the finite attention that an SRE team needs to spread across multiple products, returning full production control to the development team along with the pager and all responsibilities thereof.
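To make the "well-defined meaning based on SLO breaches" concrete, here is a minimal sketch of an error-budget check. The 99.9% target in the usage note, the `SLOWindow` type, and the "freeze once the budget is spent" rule are all assumptions for illustration, not Google's actual policy or tooling.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts observed over one SLO window (for example, 30 days)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SLOWindow, slo_target: float) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means nothing has been spent; 0.0 or below means the SLO is breached.
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = window.total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        # No traffic (or a 100% target): any failure at all exhausts the budget.
        return 1.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


def deploys_frozen(window: SLOWindow, slo_target: float) -> bool:
    """Hypothetical policy: freeze feature deploys once the budget is spent."""
    return error_budget_remaining(window, slo_target) <= 0.0
```

With a 0.999 target and one million requests, roughly a thousand failures are tolerated, so `deploys_frozen(SLOWindow(1_000_000, 1500), 0.999)` reports a freeze while 500 failures would not.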

sumtechguy wrote at 2020-10-30 13:00:34:

I think one more differentiator is the dedicated position. Rotation is not SRE. It can be helpful to help people understand both sides. But just rotating your team into the role every few weeks with no real 'group in charge' just seems to make them mad.

apple4ever wrote at 2020-10-30 03:46:38:

> An SRE with no authority to do anything is just an overpaid totally frustrated paperweight

That's exactly how it is implemented in my company. We are called SREs but everything is driven by management (and by that I mean a single VP who is omnipresent).

throwayws wrote at 2020-10-29 17:54:23:

I thought an SRE is contacted when your house burns down (due to scaling, for example) and to improve operations where much more load/customers are expected, no? Where the house is expected to fall but where that is not allowed to happen.

jeffbee wrote at 2020-10-29 19:29:53:

No, you bring the SREs in to consult on designing a fire-resistant house before you turn a shovel. Calling them after you've already designed, built, and burned the house is likely to leave you disappointed.

zimbatm wrote at 2020-10-29 13:59:50:

I feel like there is an analogy with self-help books. Every personality and company is different and they have to do the work to find out what works best for them. Internalizing the meaning of things and leading your own thinking is probably as important as the practices themselves.

Adopting another company's practices is often lazy thinking. There are no silver bullets.

the_70x wrote at 2020-10-29 13:56:21:

and the fact that not all the companies run the same workloads or scale as google

gowld wrote at 2020-10-29 19:14:44:

Marketing money? besides paying people to publish their work?

dilyevsky wrote at 2020-10-29 15:12:03:

You just stumbled upon the same issue google had - sre-swe is really tough to hire for. The concept is good but you need software engineers w/ systems background and they’re relatively rare and most good ones are already at faang

mjayhn wrote at 2020-10-29 15:29:16:

I mean Google could stop being completely elitist with SREs needing to come from SWE backgrounds and hire the thousands of SRE/Sysadmin/Ops types who do know systems very well but have a lot/bit less SWE background and let their SWEs contribute to mentoring them on SWE stuff.

Pipe dream of mine, for sure.

There's been at least 3 G SREs that have posted in here about what they do day to day and it's literally no different from any other non-FAANG SRE/Ops role any of us do at various scales except for the gatekeeping.

dilyevsky wrote at 2020-10-29 15:55:59:

Google does hire SRE SEs and I interviewed many - the focus on coding is a lot less and they get grilled more on systems etc. it’s important to have a healthy mixture of both or you run into eng culture issues like described by many posters here

kazen44 wrote at 2020-10-29 20:29:40:

Also, a systems engineer usually has a different way of thinking about a problem than a software engineer and vice versa.

This helps tremendously in troubleshooting and building stable products.

Having a combination of systems, networking and programming "perspective" on a team seems to result in very robust systems all around.

jldugger wrote at 2020-10-29 16:17:32:

> Google does hire SRE SEs and I interviewed many

For those who aren't familiar with the google acronym soup: SE here means systems engineer, not software engineer (SWE).

joshuamorton wrote at 2020-10-29 18:05:25:

Fwiw about half of the sres I know at Google come from sysadmin or help desk roles within Google.

krooj wrote at 2020-10-29 17:56:23:

Is this TEAM? There's a nasty issue with companies trying to emulate Google on "all the things" and it usually doesn't work, since often times these organizational structures are a natural evolution of where a company came from, but you aren't that company...

Best SRE/QE model I've experienced was with a couple of these guys being embedded into a dev team (reporting structure didn't really matter), such that they participated in the team meetings and rituals, becoming domain experts. One fellow eventually learned the SAML2.0 spec inside out and was invaluable for catching bugs.

I suppose being embedded but reporting up through a different org did have its benefit: namely, the consistent dissemination of best practices across the company, but with a judicious eye to applicable context that the current do-it-all-swe cannot with the current SRE ivory tower model.

The do-it-all-swe model has yielded trash QE/observability in my experience.

pram wrote at 2020-10-29 21:19:16:

Yes, lol

tannhaeuser wrote at 2020-10-29 14:24:26:

Sounds truly horrible, and meets my experience of SOA around 2005, but let's not forget that younger devs don't have the luxury of having tinkered with stuff hands-on as it was introduced, such as SQL, pre-jquery js, SOA, Unix/Linux to form an intuition.

C1sc0cat wrote at 2020-10-29 14:55:25:

Is that not a problem with the training / education of entry level?

This was the problem the Raspberry Pi was designed to solve: letting GCSE and A Level students actually get hands-on experience, so that when you go to university you have at least the basics.

SV_BubbleTime wrote at 2020-10-29 14:51:27:

> Almost zero actual technical work was accomplished.

This is 100% my problem with this article.

At no point is it ever implied WHAT or WHY or WHERE actual work is being done. Only HOW they organize it. It looks to me from the article they are using every “cool” tool and seem to care more about the how than the why. And that’s fine if it’s the point of the article which I guess it is, but it seems like it’s “reliably” structuring a husk.

I defend the costs of IT and in this case the costs of an SRE approach - but these aren’t the things that make you money and keep your business afloat.

When I spend a couple weeks making a system wide event bus to get messages from n to m that’s cool and I need it... but customers don’t pay me for that. They pay me to fix a problem they have.

I’m very cautious of articles like this that say all “the right” things but don’t even hint at solving customer problems.

namanaggarwal wrote at 2020-10-29 12:51:17:

What I have seen in the companies that I have worked in

- Initially every engineer does everything

- Growth gives birth to independent SRE teams. Now we have engineers who do only SRE

- We realised SRE is bottle neck for product teams. SRE teams are removed and all SRE are embedded into the product teams to do dedicated work for them.

- Other team members are now required to learn SRE stuff, especially now that terraform etc are there so we are back to everyone is a SRE

Note that FAANG level companies go back to independent infra, SRE etc, maybe because of the niche and custom software they have.

saalweachter wrote at 2020-10-29 13:55:48:

Is it, uhh, common knowledge that the important thing isn't how the team is organized or how they exactly do their job but that there are engineers paid and incentivized and empowered to keep the status quo running flawlessly (or as close thereto as possible) rather than being paid and incentivized and empowered to launch cool new features or make the money graphs go up?

The problem at most companies is that if you have two engineers, Alice and Betsy, and Alice launches a change that generated marketing buzz that gets spun into sales, and Betsy launches a change that prevented the site from going down over the seasonal rush, Alice gets promoted and Betsy gets told to work on more impactful projects.

Which isn't to say that Alice's work isn't important or promotion worthy, but you need an incentive structure that rewards both.

mjayhn wrote at 2020-10-29 14:55:48:

In my experience, Alice moves on to a cool new project to greenfield while Betsy gets stuck supporting Alice's creation until they eventually get on-call burnout and quit. Alice checks in only when there's enough noise made either by angry customers, cto or angry SREs.

That's my 10yrs+ ops experience, at least, and this is hugely exacerbated if Alice is put on such a high pedestal that they don't even have to be on-call with Betsy at any point of the lifecycle.

CoolGuySteve wrote at 2020-10-29 15:29:31:

I think ultimately this dynamic is why Google can't keep most products around for longer than 5 years.

codingdave wrote at 2020-10-29 13:55:32:

That is not necessarily an unhealthy cycle...

Organizations succeed by delivering customer value. Re-orgs of teams to remove pain points and blockers is a sign of an organization flexible enough to handle their growth. Even if you go back and forth a few times between similar states, the larger point is that you are willing to do what is needed to continue to effectively deliver value to your customers.

vsareto wrote at 2020-10-29 14:26:59:

>the larger point is that you are willing to do what is needed

They seem to be unable to organize and delegate work effectively even when it's needed and instead rely on polyglot-engineers to bridge that gap by overloading their responsibilities.

tupputuppu wrote at 2020-10-30 07:15:35:

Going back and forth isn't progress or keeping up with growth.

andriosr wrote at 2020-10-29 13:03:30:

I have a very similar view. But I see more and more companies trying to build a Google-like SRE structure, where the team actually operates more mature services highly critical to the business. Tricky not to become traditional ops teams, but seems doable.

SRE becoming bottleneck with scale is something I see a lot. Usually regular product engineers that are more familiar with infra start doing it full-time for all the company.

vsareto wrote at 2020-10-29 13:12:47:

It's great to know that we still have no idea what we're all doing in tech.

Moto7451 wrote at 2020-10-29 13:06:11:

This matches my experience at my work. I am an early employee and we did one of everything back then, was told we had dedicated Ops/SRE and lost permissions to do stuff about half way through my tenure, and now this year it’s back to everyone is doing their own Ops/SRE.

That’s been over the course of about a decade.

C1sc0cat wrote at 2020-10-29 15:01:04:

_cough_ back in the 80's I was employed by a division of British Telecom explicitly partially as DEV/OP on a billing system as I had SYSAD experience on the platform.

Later on, in the mid 90's, the BT worldwide intranet was run by a full service team that handled hosting, admin and development

p_l wrote at 2020-10-29 12:53:51:

I suspect the last one is the size of workforce, and even then "embedding in product teams" seems a common mantra mentioned from FAANG companies.

gowld wrote at 2020-10-29 19:17:25:

> We realised SRE is bottle neck for product teams.

This means that management decided velocity is more important than reliability?

dragonwriter wrote at 2020-10-29 19:20:42:

> > We realised SRE is bottle neck for product teams.

> This means that management decided velocity is more important than reliability?

No. Given the organizational response ("SRE teams are removed and all SRE are embedded into the product teams to do dedicated work for them."), it seems to be like the motivation for Agile cross-functional teams instead of separate Analysis, Design, Coding, Testing, ... teams, and like DevOps, and like DevSecOps, etc.: yet another instance of realizing that _throwing products over internal walls in organizational handoffs produces inefficiencies_.

draw_down wrote at 2020-10-29 13:48:03:

> We realised SRE is bottle neck for product teams

It isn’t clear what to do at this point. Embedding is one way, I don’t prefer it personally. My current employer keeps the teams separate but they expose platforms that other teams consume. Teams have a runner who’s responsible for helping other teams and responding to timely issues. Run is a rotation, not a permanent role.

I think this is a good way, but it might require a bigger size company/tech org than “non FAANG” implies.

martius wrote at 2020-10-29 14:07:09:

I've been an SRE in a company which is not a FAANG and I'm now an SRE at Google.

While there is more structure at Google, it's basically the same job :)

The implementation varies between companies, and I understand this is what these articles try to capture. My point is that as an individual contributor in an organization where "SRE" is really a thing, I believe that the role is often quite well defined.

andriosr wrote at 2020-10-29 17:11:42:

That is really interesting. What are the top 3 differences you saw between the two worlds?

martius wrote at 2020-10-29 17:49:46:

The biggest difference is that the software I keep running in production is 100% made in-house. Previously, I was running a system built on top of open-source software (Hadoop and Kafka, in particular).

It has a lot of benefits:

* we have an influence on the priorities of the project, and make sure our issues are prioritized. You can't be sure of the evolution of an open-source project.

* continuous rollouts. In my previous job each major version bump of Kafka or Hadoop could become a complex migration to plan a long time in advance. Now, we still have migrations to do, but often the scope is smaller and the transition smoother.

* strong culture and policies. We hold ourselves accountable. In my previous job, we would often mitigate incidents and forget about them, not write the post mortem or really root cause the incident. Usually because we knew/believed it would not be solved anytime soon (conflicting priorities, or whatever else). I've never seen an incident swept under the rug at Google.

throwaway38489 wrote at 2020-10-29 14:18:28:

I'm an SRE at Google and when my friend asked what I do and I explained it to him he said

- "oh, so like QA in my company?"

I've asked what QA does at his company and well yeah, exactly. I've read the books and know about the SLA/SLO and honestly it doesn't make any sense for smaller companies with few services.

I code up alerts for various metrics, act upon said alerts, sometimes do the fixes myself if it's simple, otherwise just hand it off to devs by logging a bug. The remaining time I code tools for other SREs to use.

the_only_law wrote at 2020-10-29 14:20:32:

This is the sort of reason I don't put much stock in titles, personally.

toomuchtodo wrote at 2020-10-29 15:14:55:

Are you expected to know the product codebase fairly well, debug complex bugs on the fly when paged, and push fixes out in a short window of time (<1 hour)? Also interested in what tooling you develop for the use by other SREs (if you can share).

martius wrote at 2020-10-29 15:36:03:

Teams have different levels of engagement on different services.

In my case:

* I have a good in-depth knowledge of one product I'm oncall for. My team may implement some design changes required (ex: for scaling) and I contribute code to this project.

* I have shallow knowledge of 2 other products, I'm trained at identifying a trigger that causes an outage and mitigate it (common mitigations are: move traffic, increase quota temporarily, rollback a release). If I can't solve an incident without in-depth knowledge of these products, I escalate to the devteam, and they will be responsible for root-causing.

A mitigation that requires a patch, cherry pick and rollout would be a terrible (and risky) thing to do, so I'd rather find any alternative first.

toomuchtodo wrote at 2020-10-29 15:40:22:

Thanks for the reply, I appreciate it!

bovermyer wrote at 2020-10-29 13:00:31:

My commentary has two parts.

First, about the topic. The title, "Real-world SRE: What not FAANG companies are doing," implies that the series is going to look at how the majority of companies (as in, not startups) are implementing SRE. This is not the approach; rather, this specific article divides the world into either FAANG or SV-style startups, and ignores everything else. That's not a microcosm I'm interested in.

Second, about the writing style. The stuttering caused by using sentence fragments everywhere is really jarring. It's so bad that I couldn't focus on the actual content. Normally I'm not much of a stickler for grammar, but in this case, it's distracting.

tn890 wrote at 2020-10-29 14:20:44:

>implies that the series is going to look at how the majority of companies (as in, not startups) are implementing SRE

That's exactly what I was expecting as well.

>either FAANG or SV-style startups, and ignores everything else

SV in a nutshell. Also obviously only North America exists.

jwcrux wrote at 2020-10-29 14:26:55:

> Hash is a Brazilian fintech building the next-generation of payments infrastructure.

In general I agree with the point you're making, though.

nwsm wrote at 2020-10-29 16:10:07:

>Also obviously only North America exists.

Did you... read the article?

slowhand09 wrote at 2020-10-29 14:53:37:

SV? Playing catch-up here.

joshmarlow wrote at 2020-10-29 14:54:51:

I think in this context, SV === Silicon Valley

andriosr wrote at 2020-10-29 13:07:29:

Really interesting, great feedback, thanks!

This is the first issue, so this is very welcome to make sure the following are better. My goal is to actually make it for a diverse set of companies, this is only the first one.

Noted on the style! I'm probably biased by all the SEO/marketing stuff I read. Should improve readability for humans instead of machines (Google), thanks, will work on that.

quickthrower2 wrote at 2020-10-29 13:12:17:

Forget SEO! The way you wrote this comment is nice. Write like that.

I’m not sure you can even “write for SEO” in 2020. Maybe in 2012 you could. If you are talking about keyword stuffing etc.

Making the article readable and people want to share it is paramount.

czbond wrote at 2020-10-29 13:59:28:

One definitely can write with SEO in mind to make considerable differences in repeatable, organic traffic (in conjunction with numerous other accretive layers).

bovermyer wrote at 2020-10-29 13:11:42:

I'm glad to hear this. The concept is really solid, and I'll be paying attention to future entries in the series!

NationalPark wrote at 2020-10-29 14:59:43:

Is the "everything else" really something you want to read about? So what if the majority of businesses are using FTP to upload new php files from the "website_latest_final_2" directory? We _know_ that's not good, what is there to learn?

bovermyer wrote at 2020-10-29 15:18:20:

The arrogance and ignorance in this comment are staggering.

I want to read about transitions. I want to see how late-adopter and mainstream companies go from old, manual processes to modern, automated processes.

As someone who's been in both slow-moving federal government agencies and bootstrapped startups with the latest tech, I've seen a wide variety of tech journeys happen, and it's fascinating.

stedaniels wrote at 2020-10-29 13:50:36:

As another point on the metrics... I had exactly the same interpretation of the title and also left disappointed.

schoolornot wrote at 2020-10-29 14:26:44:

Blind mimicry of Google SRE as-it-is-written in the book is one of the most difficult ideologies I've had to deal with at work. Small 5-person teams that historically have helped other teams throughout their projects are now bent on everyone adopting their stacks, their metrics, their security policies, and anything outside that realm becomes "unsupported". These teams are tech obsessed with the latest great thing, except it's not great. Or at least it isn't if it's not aligned with the business.

It's been mentioned in other comments here but I'm super fond of SRE embeds for reasons already shared here.

decafninja wrote at 2020-10-29 13:37:54:

I work at a bank. Our primary SRE practice probably consists of throwing overwhelming amounts of bureaucracy and red tape at any change to make it difficult. Sometimes it works. Sometimes it doesn't.

bkq wrote at 2020-10-29 13:54:57:

Having worked with some banks on software deployments I feel this pain. Waiting weeks for a CR to be processed just so they can apply a single patch. Diminishing returns in this regard are a real thing.

user5994461 wrote at 2020-10-29 16:29:35:

Worked in another bank. Developers could deploy anything in less than 10 minutes, just need to write a simple configuration file to deploy and get a review. It was a breeze.

It was all automated and self-service, there was something like 700k jobs deployed a day when I left. I suppose it's similar to borg at Google/Facebook but way older.

bobbydreamer wrote at 2020-10-29 17:52:45:

Bank doesn't have any change governance, peer checking and approvals. Wow

user5994461 wrote at 2020-10-29 20:30:49:

Every change must be reviewed and approved by another developer. There's a file at the top of directories to specify which teams own that part of the codebase, only those can approve.
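The per-directory ownership rule described above can be sketched roughly as follows. This is only an illustration: the `OWNERS` mapping, the team names, and the "nearest enclosing directory wins" lookup are assumptions, since the comment does not say how the bank's system resolves nested owner files.

```python
from pathlib import PurePosixPath
from typing import Dict, Set

# Hypothetical ownership data, as if parsed from one owner file per directory.
# The empty key stands for the repository root.
OWNERS: Dict[str, Set[str]] = {
    "": {"platform-team"},
    "payments": {"payments-team"},
    "payments/ledger": {"ledger-team"},
}


def owning_teams(changed_file: str, owners: Dict[str, Set[str]]) -> Set[str]:
    """Return the teams listed for the nearest enclosing directory."""
    directory = PurePosixPath(changed_file).parent
    # Walk from the file's own directory up towards the repository root.
    for candidate in [directory, *directory.parents]:
        key = "" if str(candidate) == "." else str(candidate)
        if key in owners:
            return owners[key]
    return set()


def can_approve(author: str, reviewer: str, reviewer_team: str,
                changed_file: str,
                owners: Dict[str, Set[str]] = OWNERS) -> bool:
    """A change needs approval from another developer on an owning team."""
    if author == reviewer:
        return False  # self-approval is never allowed
    return reviewer_team in owning_teams(changed_file, owners)
```

Under these assumptions a reviewer from `ledger-team` can approve a change to `payments/ledger/entry.py`, while a reviewer from `payments-team` cannot, because the more specific directory entry takes precedence.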

andriosr wrote at 2020-10-29 13:55:16:

Cool, I did some projects with large banks, saw something similar. The problem is that the bureaucracy solves some of the reliability at the cost of speed. That is the biggest problem with most large banks IMO

BossingAround wrote at 2020-10-29 13:09:29:

My anecdotal experience is "SRE" means a sysadmin who has learned Python for some automation (of course using some automated management tools, such as Ansible or Terraform). Is this the case for most "non-FAANG" and "non-startup" companies?

mjayhn wrote at 2020-10-29 15:46:42:

This is what an SRE actually is to me, but not to G. Google seems to think they need to create SREs out of SWEs. It took me far, far longer to learn distsys, network engineering, storage and everything else that goes into a platform than it did for me to learn to write grpc endpoints in golang.

I've been in ops for 10+ years building platforms. I'd love to be able to spend 6 hours a day coding but when I'm juggling 10-20 different products that have 1-10+ different languages involved (go, python, nodejs, javascript, ruby, ansible, terraform, puppet, chef) you really have a hard time getting super good at one thing. So most of us have to get good at one language just to pass the code tests, then it languishes because when launching a product I then spend 1-3 months writing something completely different out of biz necessity.

The most important skill to me for an SRE/Ops is curiosity and pattern recognition, and #1 humility/empathy.

singron wrote at 2020-10-30 02:13:54:

Google SREs are split into SWE-SRE and System-SRE. They have different interview processes and job ladders. Usually they have the same job functions day to day, but anecdotally it seemed like SWE-SREs were SWEs who got into SRE and System-SREs had more traditional sysadmin background. Disclosure: was a Google SRE

fuball63 wrote at 2020-10-29 13:17:58:

Similarly, I find "devops" means "sysadmin that uses Jenkins sometimes".

chasd00 wrote at 2020-10-29 13:49:45:

i once had a junior dev sheepishly ask me what "devops" means. I thought it was pretty brave of her to ask and acknowledge some ignorance. I told her it means whatever the speaker wants it to mean and don't worry about it, concentrate on writing good code, getting better, and getting the job done.

chowned wrote at 2020-10-29 16:45:26:

I appreciate the humor, but I think that was a bit of a disservice for her. Because even though you have a cheeky definition, the term _does_ get used fairly frequently in the industry, and it would've been helpful to explain how it's commonly used, even if you don't agree with it.

darkwater wrote at 2020-10-29 13:34:03:

I don't really get how you could possibly be a sysadmin or a system engineer or an SRE or $WHATEVER if you don't use Jenkins or any other CI tool, if you don't use tools like Terraform or Ansible (which I personally don't like as a tool for infra mgmt but it's OK for config mgmt or image baking) etc etc. I mean, I was a sysadmin 15 years ago moving pizza boxes and cabling switches in a cold datacenter and gradually my job changed to this new reality (while moving between 4-5 small/medium companies, no corporate or FAANG). Is there really somebody out there whose job didn't change like this?

weeeeelp wrote at 2020-10-29 16:48:07:

There are loads of what you could call "classical" sysadmins/application administrator type people around that never started to do automation, cause sometimes there's no need or willingness to do so.

Especially in the small-to-medium sized companies, where the business is bread-and-butter type and not anything hightech - they might have had some infrastructure and processes for years, can't see the reason to move to cloud, they had the same "IT guy" doing desktops and servers since forever, stuff just works and we don't want changes, etc..

There is a massive amount of IT work being done - work that you, me or any other denizen of HN would probably rate as terrible and smelling of middle ages. But it gets the job done, even if it's being done manually. And some people just don't know any better.

Cthulhu_ wrote at 2020-10-29 13:22:27:

Or "Person with too many responsibilities"

dilyevsky wrote at 2020-10-29 15:39:28:

Yes that is what it’s like in companies whose infrastructure is not part of their value add but a cost center (e.g. medical/ai startups, finance, etc). If you’re an sre either don’t go there or ask for fy money

icco wrote at 2020-10-29 13:05:11:

As the author of a book titled "Real-World SRE" (https://amzn.to/35HFPic), this title made me do a double take.

That being said, the article was nice, although I wish there was more detail.

andriosr wrote at 2020-10-29 13:11:41:

Really cool! Interesting to hear that. I thought about diving deep on details, but was concerned it would become too heavy for a blog post to read every week. Will experiment on that.

jeffbee wrote at 2020-10-29 13:26:30:

Why would SREs be responsible for build and test infra? I don't see how that fits the mission, unless the mission is redefined as "things developers perceive as being beneath them."

martius wrote at 2020-10-29 14:08:40:

It can be the role of SREs to build and provide the infrastructure improving the reliability of rollouts, allowing the rest of the company to build reliable products without compromising the velocity.

temp0826 wrote at 2020-10-29 14:25:39:

Well yah, that’s exactly it. Ego thing. Things will roll down hill until it hits someone who can actually fathom a coherent story for the org.

user5994461 wrote at 2020-10-29 16:36:29:

SRE provide the gitlab/jenkins? Sounds normal to me.

chowned wrote at 2020-10-29 16:43:48:

That's exactly it. For some reason, developers really want to _just_ write code, even though it's for a web service and that web service doesn't exist in a perfect vacuum. (Yes, I do ops, and yes, I'm bitter and jaded.)

beny23 wrote at 2020-10-30 09:24:15:

I think

> Product teams that opt-out or don’t have their application ready for SRE run on their own. They have full access to the resources they need. The interface with the SRE team is minimal in these cases.

and

> Custom features lose the SRE team support.

is something that sounds great in a small team, but I do not think it would scale. It lends itself to the platform just being a loosely coupled collective of independent fiefdoms of Not Invented Heres, which then means each product team needs its own SRE specialists that can then constantly engage in infighting, which IMHO would lead to the reliability of the system as a whole going down rather than up.

Personally, I think there is much value in a "platform" where infrastructure (compute, storage, queues, monitoring, alerting, auditing, etc) is provided in a centralised fashion, owned and operated by platform teams which provide self-service touchpoints for the service teams but do not provide full independence to run their own thing

newyorker2 wrote at 2020-10-29 14:42:28:

It is strange that I read every article from the domain *.substack.com with a preliminary conclusion that it was either written directly by GPT-3 or influenced by it.

hn_20591249 wrote at 2020-10-29 13:36:34:

The article headline was pretty misleading. The team they chose self-describes as having the "Google SRE model", which doesn't really speak to what non-SV companies are doing if they are just copying SV companies.

parliament32 wrote at 2020-10-29 17:10:32:

"Everything is code managed by Git, from infrastructure resources to monitoring alerts. This allows for an extreme speed for any change, including infrastructure and alerts."

So... you're pushing alerts into a Git repo and pulling them somewhere else to display them? I'd call this many things, but "extreme speed" is not one of them.

tikhonj wrote at 2020-10-29 17:41:22:

Reading that, I would assume it's the definition/configuration for alerts that's managed in Git, not the alerts themselves. "Infrastructure as code".

I rather like that approach myself—I would not want logic and configuration living in some stateful system _outside_ Git.
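To make the "alert definitions in Git" reading concrete, here is a minimal sketch of alert rules kept as plain data in a repository, plus a check that could run before a change is merged. The names `AlertRule`, `RULES`, `ALLOWED_SEVERITIES`, and `validate_rules`, as well as the example rules and expressions, are all hypothetical and not taken from the article or from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One alert definition, stored in the repository next to the code."""
    name: str
    expression: str   # query text in whatever language the monitoring system uses
    for_minutes: int  # how long the condition must hold before firing
    severity: str


# Hypothetical severities and rules, for illustration only.
ALLOWED_SEVERITIES = {"page", "ticket"}

RULES = [
    AlertRule("HighErrorRate", "error_ratio > 0.01", 5, "page"),
    AlertRule("DiskAlmostFull", "disk_used_ratio > 0.9", 30, "ticket"),
]


def validate_rules(rules):
    """Return a list of human-readable problems; an empty list means the rules pass."""
    problems = []
    seen_names = set()
    for rule in rules:
        if rule.name in seen_names:
            problems.append(f"duplicate alert name: {rule.name}")
        seen_names.add(rule.name)
        if rule.severity not in ALLOWED_SEVERITIES:
            problems.append(f"{rule.name}: unknown severity {rule.severity!r}")
        if rule.for_minutes <= 0:
            problems.append(f"{rule.name}: for_minutes must be positive")
        if not rule.expression.strip():
            problems.append(f"{rule.name}: empty expression")
    return problems


print(validate_rules(RULES))
```

Under this reading only the definitions go through review and version history; the alerts that fire at runtime never pass through Git.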

spondyl wrote at 2020-10-29 19:53:26:

Personally, I find it weird having 0 results for the word "customer" although SLOs are mentioned.

bobbydreamer wrote at 2020-10-29 18:40:34:

I work at a bank. I am currently working on a project with a few colleagues from different tech backgrounds (mainframe and non-mainframe) to prepare a white paper on SRE. I am a mainframe Db2 DBA. For me it's normal to support a project while a change is implemented in production and to monitor it until the first successful run, since I worked on it in the development phase, do a good handover to the production DBAs, and have answers for all their questions. So what they said about production monitoring, daily production meetings, and on-call support was all normal to me. In our team, under the name of PI planning, we are already working on automating or changing processes and updating documentation.

The scenario is not a developer throwing things over the wall to operations. First an architect team decides things, then it goes to the developers, then, for structure changes, to the DBAs; at this point almost 80% of things are decided. Only if there are perf issues will the logic be changed. Then come handovers, peer checks, approval, and then production. So there are lots of walls. An outage is a big thing, and implementation is a big thing when there is a structure change to a table that's handling 3000+ transactions per second. There are so many things to look out for. So I can't implement anything in production in a few minutes or a couple of days unless it's a fix.

On the concept of error budgeting, yeah, it's nice: the error budget is depleted, so let's stop feature development. From a banking perspective, what if there is a regulatory change? Well, that has to go out. I haven't seen any examples of companies saying they stopped features that had the potential to bring in revenue because they had depleted their error budgets. For example, a product company saying "we didn't release the next version because we depleted our error budget." Even Google hasn't showcased anything on the subject saying that new releases or next versions were delayed because they depleted their error budgets.
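For readers unfamiliar with the arithmetic, a minimal sketch of a request-based error budget follows. The function names and the `mandatory` exemption (standing in for something like a regulatory change) are assumptions made for illustration; they are not a policy described in the article or the SRE book.

```python
def allowed_failures(slo_target, total_requests):
    """Number of failed requests the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Failures still tolerated; zero or negative means the budget is spent."""
    return allowed_failures(slo_target, total_requests) - failed_requests


def may_release(slo_target, total_requests, failed_requests, mandatory=False):
    """Hypothetical release gate: feature releases stop once the budget is spent,
    but a change flagged as mandatory (assumed exemption) still goes out."""
    if mandatory:
        return True
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


# A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
print(may_release(0.999, 1_000_000, 400))                   # budget left
print(may_release(0.999, 1_000_000, 1_500))                 # budget spent
print(may_release(0.999, 1_000_000, 1_500, mandatory=True)) # exempt change
```

The exemption flag is exactly where the banking case bites: if most changes end up flagged as mandatory, the gate stops constraining anything.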

I will agree with half the book on SLOs: they are hard. Monitoring the performance of a customer journey is hard because the request travels through many tiers and technologies. Each tier or tech has its own monitoring system, but tracking the customer journey from e-banking or mobile to the mainframe is hard, with so many things in between. That's something to think about. I would like to know how big, complex organizations handle this, as it's the only way to know if something is broken; it's better to get notified by the monitoring team/system than by social media or the news.
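One reason per-tier monitoring does not add up to a customer-journey view: if a request must pass through every tier in series and the tiers fail independently (both are simplifying assumptions), the end-to-end availability is the product of the per-tier availabilities, which is lower than any single tier's number. A rough sketch, with hypothetical tier names and figures:

```python
import math


def end_to_end_availability(tier_availabilities):
    """Availability of a request path that needs every tier to succeed,
    assuming the tiers are in series and fail independently."""
    return math.prod(tier_availabilities)


# Hypothetical tiers and numbers, for illustration only.
tiers = {
    "mobile front end": 0.999,
    "api gateway": 0.9995,
    "middleware": 0.999,
    "mainframe": 0.9999,
}

journey = end_to_end_availability(tiers.values())
print(f"worst single tier: {min(tiers.values()):.4f}")
print(f"customer journey:  {journey:.4f}")
```

Each tier's own dashboard can look healthy against its own target while the journey as the customer experiences it falls short of it.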

Tooling and automation need to be centralized; otherwise there would be many duplications.

Currently we have people who are THE EXPERTS in their tech but are not aware of other techs. This needs to be improved. But an SRE has to be a sort of jack of all trades and also an expert.

Postmortems are locked up somewhere; they exist, but they're confidential. There is a difference between a search engine breaking down and a financial institution breaking down for technical reasons or because of a bug. Smaller things can be showcased, as there is always an RCA for an incident.

It talks about test automation, but at the moment there is a growing trend of more test analyst requirements. If every organization is going for test automation, why is there this trend?

Overall, adoption will be a hybrid. A centralized team is needed to gather information about all platforms and build a centralized monitoring application or team. There needs to be a catalog of automations, like an organization-level Git repo, where anyone can read the README and clone or fork and use a program as it fits their requirement. It would also be good to know the pool of experts and ask them for recommendations, as currently we only know them when we work with them.

Automation, which was not a priority, can be made a priority.

The book has lots of good ideas. It's a slow process to adopt, but it's completely useless to just change job roles, like the few companies that literally have too many vice presidents.

Hope is not a strategy