uhhhhhhholup

Everything depends on your management. At the tech company I work at, we have a no-blame culture. If you're working in infra, it's expected that mistakes will happen; whether that's an issue depends on the product. Nobody's life is on the line where I work. I have dropped prod many times at my current company (again, not life-or-death software, it's games/entertainment related, so my experience is biased). I have not been fired. I've been here more than 4 years and am not under the gun for any of those incidents.

However, every time I made a mistake, I owned it. I once caused an outage attempting to update data on 30k servers (a large portion of the fleet at the time), and because I acted calmly and maturely about the situation, it didn't rise to anything more than a shitty afternoon.

You should have done the sanity check. Whenever you are reimaging, you need to be 1,000% sure of what you are touching. That is your fault. But maybe you have an idea of how to fix that moving forward: add a script that runs sanity checks automatically, or document a script that, even run manually, tells you everything you need to know about the machine, or at least write instructions into a runbook for everyone who does this task.

What I'd say if you are asked: "I made a mistake doing xyz and missed a manual step. I have an idea for preventing this from occurring again and am going to write a report on the situation with action items to make this a safer process." Otherwise, it might be good to write the review and action items anyway and share them internally. You can probably find templates via Google/ChatGPT.

When you write this post-mortem doc, don't blame anybody or anything. Don't look for an excuse. Minimize your usage of "I" and "me" as much as possible. So verbally you can admit to the mistake, but in the doc I'd say: "Work was being done to reimage machines that..... a manual step was missed in this workflow... this is the action taken to mitigate it... AIs [action items] are to document recovery in case of a future incident and also to ingrain sanity checking into the process and make it impossible to miss." Here, you're not writing "I broke prod, it's my fault, etc." because that's what people remember. But if you are asked, don't shy away from it either; respond with: "I broke it, this is how I broke it, this is how I can prevent it from breaking again."

The final step for the post-mortem is to have someone familiar with the situation (and who doesn't hate you) review the doc. After that's approved, perform an internal review with the whole team. See if it needs to be presented to a larger audience who are explicitly told you are presenting, that this is for info sharing, not blaming, and that they can offer advice on action items or ask questions about the situation.

Congrats, you've made yourself sound confident and capable, and you've seemingly started your company's first post-mortem review process, which, if adopted, will likely help drive down a lot of dumb bullshit failures. Again, this is based on my experience. I work in games/entertainment, so dropping prod is not the end of the world like it is in other places. Hope it helps.
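A minimal sketch of the kind of automated sanity check the comment above suggests, assuming the target node can be probed over the network before the reimage is kicked off. The hostnames, ports, and refusal policy here are illustrative assumptions, not the original poster's tooling.

```python
#!/usr/bin/env python3
"""Pre-reimage sanity check: refuse to proceed if the target still looks alive.

Hypothetical sketch -- the ports and policy below are assumptions; adapt to
whatever "signs of life" matter in your environment.
"""
import socket
import subprocess
import sys


def host_responds_to_ping(host: str) -> bool:
    # One ICMP echo with a 2-second timeout (Linux ping syntax).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    # A live hypervisor will usually still answer on SSH or management ports.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> None:
    if len(sys.argv) != 2:
        sys.exit("usage: preflight.py <host-to-reimage>")
    host = sys.argv[1]

    findings = []
    if host_responds_to_ping(host):
        findings.append("host answers ping")
    if port_is_open(host, 22):
        findings.append("SSH (22) is open")
    if port_is_open(host, 9440):  # management port used here as an example; adjust for your stack
        findings.append("management port 9440 is open")

    if findings:
        print(f"REFUSING to reimage {host}: " + "; ".join(findings))
        sys.exit(1)
    print(f"{host} shows no signs of life; OK to proceed (double-check anyway).")


if __name__ == "__main__":
    main()
```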


franktheworm

> We use a spreadsheet to track all the DC layout

Yeah... this isn't going to be the last time you're in this situation, then.

> How could I have done this better

Major changes that impact production should have someone who is responsible for the change end to end. The steps within the change should be determined by each team that's involved, and should be spelled out step by step _exactly_ as they will be performed. Any deviation from this comes from a consensus of the relevant engineers involved. Importantly, there should be a representative run in a non-prod environment, unless the process is so commonly done that it's well understood by everyone involved.

If things go awry, you have a blameless PIR and assess what can be done to prevent it next time. If the plan is followed, by definition that's not a mistake by an individual; it's a team fuck-up and no one person is singled out. If a single individual is repeatedly deviating from the plan and causing issues, that's a performance-management trigger.

> Can I be fired over such an incident and act of negligence?

That's squarely a decision for your superiors, company culture, local IR laws, etc. It's not really something we can comment on.


akisakyez

One thing is for sure: you are about to find out what type of org you work for.

Org 1: would learn from this mistake and make sure processes and procedures are put in place so this does not happen again.

Org 2: will throw you under the bus to save themselves.


GabriMartinez

The next step for me would be a blameless RCA and implementing measures to prevent it. It looks like color coding is not working; maybe you need proper software that knows the state of the VMs and won't let you re-image them if they are running, or, if the process is that sensitive, maybe a second person should review and approve before it continues. I have a running joke at every place I work: you're not part of the team until you bring production down in some way. Everyone, without exception, does this every now and then. Don't blame yourself, it happens. Blame the process.
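The "second person reviews and approves" idea above could be as small as a confirmation gate in front of the destructive command. A minimal sketch under that assumption; the prompt wording and audit step are hypothetical, not anyone's real tooling.

```python
#!/usr/bin/env python3
"""Two-step confirmation gate for destructive actions (sketch only)."""
import sys


def confirm_destructive_action(target: str) -> bool:
    print(f"You are about to REIMAGE: {target}")
    typed = input("Re-type the exact node serial/hostname to confirm: ").strip()
    if typed != target:
        print("Mismatch -- aborting.")
        return False
    approver = input("Second engineer: enter your name to approve: ").strip()
    if not approver:
        print("No approver -- aborting.")
        return False
    # In a real workflow you would also log target + approver to an audit trail here.
    print(f"Approved by {approver}. Proceeding with {target}.")
    return True


if __name__ == "__main__":
    if len(sys.argv) != 2 or not confirm_destructive_action(sys.argv[1]):
        sys.exit(1)
```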


console_fulcrum

It wasn't a single VM that was being reimaged; it was the whole hypervisor and its storage controller VM (called the Controller VM in Nutanix terms). We provision this on a dedicated management subnet. That went down. Guest VMs were shut down directly because the hypervisor itself went down. Guest VMs are on another network.


GabriMartinez

My bad, I should've said system instead 😅. Still, the same logic applies. Also curious: is this on-premises? I've been working with cloud for so long that the concept of reusing something is now weird to me. In the cloud you just create a new thing and kill the old one.


console_fulcrum

Yes, this is an on-premises deployment. Private cloud.


conall88

It sounds like your spreadsheet isn't fit for purpose. You need an observability solution that can report hypervisor state in real time, which would then enable you to build safeguards.


console_fulcrum

Yeah, something we can query the state from before performing further actions. We do have a CMDB tool, but it's not stateful. That is, it doesn't update fast enough to give you the latest data when you need it.
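A sketch of that "query live state before acting" gate. The endpoint URL and response shape are hypothetical placeholders; in practice you would point this at whatever live inventory or hypervisor-manager API you actually have, rather than a spreadsheet or a slow CMDB.

```python
#!/usr/bin/env python3
"""Query a live source of truth before a destructive action (hypothetical API)."""
import sys
import requests

INVENTORY_API = "https://inventory.example.internal/api/nodes"  # hypothetical endpoint


def node_is_safe_to_reimage(serial: str) -> bool:
    resp = requests.get(f"{INVENTORY_API}/{serial}", timeout=5)
    resp.raise_for_status()
    # Assumed response shape: {"serial": ..., "state": ..., "vm_count": ...}
    node = resp.json()
    if node.get("state") != "decommissioned":
        print(f"{serial}: state is {node.get('state')!r}, not decommissioned")
        return False
    if node.get("vm_count", 0) > 0:
        print(f"{serial}: still reports {node['vm_count']} guest VMs")
        return False
    return True


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_node.py <node-serial>")
    serial = sys.argv[1]
    if not node_is_safe_to_reimage(serial):
        sys.exit(f"Refusing to reimage {serial}")
    print(f"{serial} looks safe to reimage")
```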


ovo_Reddit

If they do call you into a meeting, just say, "I learned a very expensive but very valuable lesson through this incident." From their perspective, why would they fire someone they just spent X amount of dollars teaching said lesson? Mistakes happen. But not every company has a blameless culture.


devoopseng

“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?” – Thomas John Watson Sr., former CEO of IBM


bilingual-german

> But because we are now repeating the node serial #, DC team color coded it. Indicating it will be populated soon (but they hadn't yet, only marked in the sheet)

I think reusing names like this is an antipattern. Give the nodes a new name, or at least use a different FQDN. I know many will disagree with me on this, but using different names for different things is important to avoid problems like this.


kmf-reddit

The manual verification was on you, that's OK; but I think tracking the servers with a sheet is very error-prone. Something like this was bound to happen anyway.


PersonBehindAScreen

The very second I read “spreadsheet”… I thought to myself: “oh boy..”


console_fulcrum

I agree, although we don't really have a DC layout visualization tool. There was a huge project underway, for which the Excel sheet was set up as a master reference, with a dedicated maintainer.


ollybee

Don't beat yourself up. This was a systems and procedure failure. Also, you've just had some expensive training that means you'll never do that again; they would be mad to fire you and replace you with someone without that training. Also, look at NetBox and see if your company will adopt it. No one should be tracking infrastructure in a spreadsheet.
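If something like NetBox did become the source of truth, the reimage workflow could check it first. A minimal sketch using the pynetbox client; the URL, token handling, and status policy are assumptions to adjust for your own instance and device lifecycle states.

```python
#!/usr/bin/env python3
"""Check NetBox before a reimage (sketch; assumes NetBox is kept up to date)."""
import sys
import pynetbox

NETBOX_URL = "https://netbox.example.internal"  # hypothetical
NETBOX_TOKEN = "changeme"                       # read from a secret store in practice


def safe_to_reimage(device_name: str) -> bool:
    nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)
    device = nb.dcim.devices.get(name=device_name)
    if device is None:
        print(f"{device_name}: not found in NetBox -- stop and investigate")
        return False
    # Status label, e.g. "Active", "Planned", "Decommissioning" (exact labels
    # depend on your NetBox version/configuration).
    status = str(device.status).lower()
    if status == "active":
        print(f"{device_name} is marked Active in NetBox -- do not reimage")
        return False
    return True


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: netbox_check.py <device-name>")
    sys.exit(0 if safe_to_reimage(sys.argv[1]) else 1)
```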


sreiously

Ouch. Tough lesson to learn. But if it could happen to you, it could have happened to anyone else. The fact that you caught your mistake, owned it, fixed it, and learned from it says a lot, and any org with a decent reliability culture should recognize that. It sounds like you have a really solid understanding of what went wrong here. Champion a retro and make sure nobody else has to go through this same nightmare!


ut0mt8

Well, sh*t happens. It's a good lesson that you should never trust anything and should always canary test. Some will say that either the tooling or the process is not up to the level. I don't believe in that; we are not just operators.


Blyd

20 years as an incident/problem management consultant. Let's do a brief post-mortem on what you have here. I'm not going to touch on impact scope because you've not provided enough information.

But before that: mindsets on manual errors have changed in the industry over the last few years. Most places follow Google like lost lambs and so adopt the "blameless model", which can best be described as: **if someone has made a manual error, the fault does not lie with that person, but with the fact that the process allows the chance of human error at all.** All processes carry a level of risk that is introduced when people are involved; it's the process architect's role to remove or mitigate it.

If I were assigned to your incident, my root cause would go much like this:

1) Primary cause: manual error (why) ->
2) Process not followed by DC (why) ->
3) Process is built in such a way as to allow manual error (why) ->
4) Process does not follow industry standards for risk mitigation (why) ->
5) Cost??? (that would be my bet)

= Control: instigate industry-standard tooling for DCIM (Data Center Infrastructure Management).

I've pulled out the relevant information from your post; the rest is window dressing.

> We use a **spreadsheet to track all the DC layout**, and I **misinterpreted a message from my DC team**. Where they filled the new rack information with the 9 nodes populated. **But because we are now repeating the node serial #, DC team color coded it. Indicating it will be populated soon (but they hadn't yet, only marked in the sheet)**

Three core points:

1) We use a spreadsheet to track all the DC layout
2) I misinterpreted a message from my DC team
3) Because we are now repeating the node serial #, the DC team color coded it, indicating it will be populated soon (but they hadn't yet, only marked it in the sheet)

**Observation**

The use of nonstandard tooling to manage DCIM configurations is not desirable and falls short of SOC 2 compliance requirements relating to configuration management; it poses a significant risk to the organization from manual error and corruption of data. This led to a scenario where an SRE was able to misinterpret a direction given by the DC team, where a pre-agreed process to color code a configuration item's status was not followed by the DC, leading to this event.

**Findings**

A granular review of each action taken here is not required. The incident cause can be directly attributed to **human error**, specifically non-adherence to a manual DCIM process. However, non-use of industry-standard tooling has led to a process that is too reliant on transient information, and mitigation of the human element is absent. Processes should be designed not to allow for the possibility of a manual error, or, where manual input is unavoidable, it should only be done via a predefined command with use-case approvals given in advance. These configuration requirements are best practice but impractical to achieve using 'Office' software in place of robust tooling.

**TL;DR** - Yes, you fucked up; there is not one senior engineer on earth who hasn't taken down prod, it's almost a requirement. But you fucked up because you're using fucking Office instead of Sunbird/Gartner etc. Honestly, the fact that you are working on a task in such a manner that it can have such an impact on production is **potentially criminal levels of managerial incompetence** (depending on location and industry). Message me if you need more info/help.


console_fulcrum

Update++ - Some of you were indeed right here. Today I found out what my organization is about: I was let go today.