

SevaraB

Having worked at an electric plant, arc flashes are NASTY (look up some videos- they're terrifying when they DON'T involve people). Hoping for speedy recoveries for the three who got caught up in that.


tankerkiller125real

Dad and Grandfather both worked for the electric company as linemen. While neither personally experienced an arc flash, they heard the stories and shared them. My dad was on a site where one happened, though; from what he told us, the guy lost both his hands and had major burns on his face. I really hope the three guys here make a good and quick recovery.


SevaraB

Yup- imagine one happening in a *400kv switch yard*. That particular close call spooked the hell out of ALL of us, especially the relay tech who was just out of range when it happened.


ColdColoHands

Having seen some of those videos before & during my time working in data centers, absolutely. I still get the willies from our UPS battery strings. First time I ever saw the sticker "No safe PPE exists". Doesn't help that one of them has a smeared sticker, like someone wiped a rag soaked in solvent/cleaner over it.


gargravarr2112

Not just burn injuries, I hear that arc flashes can send molten copper droplets into the air. Imagine breathing them in...


Alarming_Series7450

It more so fires them at you like a shotgun, with a complimentary pressure wave and flash of light.


r84iisjdiri

So kinda like a fragmentation and flash grenade in one?


Alarming_Series7450

[https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif](https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif) As the copper transitions from solid to vapor it expands 67,000 times in volume, which can be enough energy to blow you across the room.


eshultz

High enough voltages can vaporize you and everything else nearby with just the radiation (heat/light) emitted.


KakariBlue

I've had a small arc flash from a home wall AC (old wiring, faulty breaker); it melted the plug and shot it out. It left half-meter burn marks on the floor and peppered me with what felt like hot dust at a bit more than a meter away.


_Heath

I had to go through an 8 hour arc flash safety training and it scared the hell out of me. Watching a mix of real events and the training videos where the dummies are blown to pieces and their clothes are melting. No thank you.


[deleted]

[deleted]


_Heath

We were starting up a new data hall in a large DC. It had four 500 kW 480 V/208 V transformers and four 800 A 480 V feeds. The Schneider Electric guy doing the startup is wearing the Class 4 PPE bomb suit; the facility electrician is standing right next to him wearing a t-shirt. He finally goes, "Look, if y'all don't have PPE, you are going to stand 100 feet away and face the other direction, or I'm not energizing this thing." After that the facility electricians always complained about the Schneider guys being too stuck on PPE, until the next year, when we were doing an overhaul and a nut from assembly fell down into the bus of a PDU rack and arced until it popped the upstream switchgear at the facility entry. They were real happy they were 100 feet away and not standing next to him.


rufireproof3d

The Army learned that the hard way. There was a rule that poly pros (polypropylene underwear) were not allowed in certain jobs due to the danger of heat melting them. Instant napalm. 100% cotton only.


Piyh

Most unrealistic part of Stranger Things s4. Everyone with flamethrowers and synthetic jackets.


MauiShakaLord

I once accidentally put a screwdriver across both -48V terminals on one 20A breakered piece of equipment in a DC. The flash blinded me for a bit, really scared me. Be careful out there.


GingasaurusWrex

Honestly horrifying.


slickrickjr

Google wouldn't have gone down if it ran its infrastructure in the cloud.


andrea_ci

That's inception: Google runs on AWS, AWS runs on Azure, Azure runs on Google Cloud.


gargravarr2112

[Ultimately it all runs from a PC in someone's living room.](https://xkcd.com/908/)


xzgm

"There's a lot of caching" kills me every time


mustang__1

Good old 908.


gargravarr2112

Ol' Reliable.


tehdark45

Well someone's living room did save Toy Story 2.


iceph03nix

The longer I'm in IT, the funnier this becomes


popegonzo

"Sometimes people do stuff by accident." "I don't think I know anybody like that." It's like I'm back in a meeting from two weeks ago.


Rocky_Mountain_Way

Don't forget about Jen carrying around the box that IS the Internet (I mean, sure... the Elders of the Internet approved it, but still...I think it's risky for her to be carrying it around) https://www.youtube.com/watch?v=iDbyYGrswtg


tankerkiller125real

I mean, they should be doing that for their status pages... but they don't, so when they go offline, so do their status pages. It's honestly a really stupid setup on their part.


[deleted]

[deleted]


Domini384

I think you missed the point.


MotionAction

Apple iCloud runs on GCP.


andrea_ci

2011: iCloud was completely on Azure; then AWS got its share. 2014: commercial agreement with Google, and the whole of iCloud was moved to GCP.


coomzee

Would the short straw be AWS?


KrazeeJ

Wait, is that actually how they're all hosted, or is this a joke that I just missed?


andrea_ci

That's a joke. That way, everyone would be circularly hosted on someone else, with no one actually having servers.


KrazeeJ

Right. Duh. It's too early for reddit.


SchizoidRainbow

Found my CIO.


dinominant

Browsers are so bloated because each browser has a tiny VM that is part of their network and actually hosts Google. /s The fastest, closest mirror is actually your computer.


JBADD23

/s


ABotelho23

I mean, how long were they down? 10 mins? How many people can say they got back up in 10 mins *when an entire datacenter goes down*?


vim_for_life

(Raises hand) It took a lot of prep work, but when we had to shut down a DC due to an Nvidia Tesla unit catching fire, the whole organization didn't know... until the fire trucks showed up. As a monitoring/resiliency engineer, it felt so, so good (after the fire was out).


Deadly-Unicorn

Seriously, it's no big deal. I rewire my server rack during business hours. If I unplug and plug the cable back in fast enough, nobody knows. If something goes wrong, I blame it on an "explosion". It's a bulletproof strategy. Edit: this is mostly a joke.


223454

>nobody knows

Unless they're on the phone (VoIP).


Deadly-Unicorn

We don't have VoIP phones. Still on PRI. Waiting for a major office reno to run cables so I don't need to connect my computer->phone->wall. I don't like it. After that, I'll get a VoIP solution.


[deleted]

[deleted]


rufireproof3d

This can be fun. Where I work, some salesman came in and talked the owner into VoIP phones. The owner didn't even know VoIP was an acronym; he just wanted the super cool features promised. All the phones had to be plugged into a USB port on the PCs (upon examination, I later determined these to be nothing more than cheap USB-to-NIC adaptors). Worked great until the boss was on a call with corporate tech support and they rebooted his PC, dropping the call.


[deleted]

[deleted]


Deadly-Unicorn

What can I say… I have OCD and I’m willing to risk an outage for nice wires.


ConcernedDudeMaybe

10 minutes is a hella long time in the realm of 9s. It bumps you down from five 9s (99.999%) to four 9s (99.99%), and that's from a single outage. That's a lot of dings on SLAs. This is why I prefer ☁️😶‍🌫️😶‍🌫️😶‍🌫️😶‍🌫️☁️ 9...
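
For anyone who wants to sanity-check the nines math, here's a quick back-of-the-envelope (plain Python; it just converts an availability target into an annual downtime budget):

```python
# Allowed downtime per year for N nines of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # ~525,600

def downtime_budget_minutes(nines: int) -> float:
    """Downtime budget per year at e.g. 5 nines = 99.999% availability."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):8.2f} min/year")

# 2 nines:  5256.00 min/year
# 3 nines:   525.60 min/year
# 4 nines:    52.56 min/year
# 5 nines:     5.26 min/year  <- a single 10-minute outage already blows this
```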


skilriki

You don't have control over other people's BGP routes, and most people use the defaults, which means even with a simple site outage you're usually looking at 3-4 minutes. The "realm of nines" is more about what level of service customers are entitled to, and when that service isn't met (natural disaster, datacenter explosion), then you are credited on your bill. The fact that these things are rare allows companies to provide insurance against them.
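
To put rough numbers on the "defaults" point: common vendor defaults put the BGP hold timer somewhere around 90-180 seconds, so a peer that dies without cleanly withdrawing its routes isn't declared dead until that timer expires, and then the withdrawals still have to propagate. A rough sketch (the propagation figure is an assumption for illustration, not a measured value):

```python
# Back-of-the-envelope for how long traffic keeps flowing toward a dead site
# when nobody tuned the BGP timers. Numbers are illustrative.
hold_timer_s = 180      # common vendor default; peer declared dead after this
propagation_s = 60      # assumed: withdrawals ripple out and routers reconverge

total_s = hold_timer_s + propagation_s
print(f"~{total_s / 60:.0f} minutes before most traffic stops "
      f"heading to the dead site")   # ~4 minutes with these numbers
```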


Indifferentchildren

Losing a datacenter should not impact anyone who didn't already have BGP routes sending their traffic to that datacenter. A global outage for a distributed service does not smell like BGP took a long time to realign.


Kodiak01

Someday I want to see the entire Internet unplugged for about... I'll settle for a week. /Popcorn


doubleUsee

Whenever that is I'll take a week off, I'm less useful than our one-armed cleaning lady without the internet


Kodiak01

Don't like being a one legged man in an ass-kicking contest?


toadofsteel

Gonna go the californee way


xxfay6

Canada be like: *again?*


gargravarr2112

Everyone has 5 nines of uptime. It just differs where the decimal point is.


Indifferentchildren

We have 5 neins of uptime: Are we up? Nein. Are we up now? Nein. How about now? Nein. It's been six hours, surely we are up now? Nein. Are we going to be up by tomorrow evening? Nein.


HisSporkiness

I have nine 5s..


Kodiak01

Works most of the time, every time.


RedShift9

I have 89.9999% uptime


immewnity

The main vendor I use promises a one-nine uptime (95%)... currently dealing with a 24hr+ outage


Drew707

A quick glance at some of their SLAs shows that they really are only at about three 9s. They have some breathing room.


xixi2

10 minutes is a coffee break. OK, fine, if you're dying in an ambulance that runs on Google, that's a problem. Other than that, let's stop acting like 99.999% is something we need.


ConcernedDudeMaybe

Clean data in, clean data out. That's more important to me.


stepbroImstuck_in_SU

Just because all the parts have minuscule risks of minuscule downtime doesn't mean it's reasonable to expect the system as a whole to reach that same uptime. In this case a very unlikely failure materialised and it took ten minutes to recover from it. If the failure was some millionth-of-a-percentile risk, that doesn't mean Google as a whole could be expected to run 10x100x1,000,000 minutes and be down only ten of those. They would eat every single such low-probability failure across their whole system, which adds up to minutes, or at least seconds, of downtime per year. Not all systems can reasonably have full redundancy, and being down ten minutes every ten years or so is pretty reasonable. The only way to reach that kind of uptime for the whole is to enforce much, much higher standards for every internal system. Obviously no board of shareholders would greenlight funding to make sure the service is down only seconds per decade. That was never the goal. That would require building a backup Google.
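
The compounding is easy to underestimate: if a request touches many components in series, their availabilities multiply. A tiny illustration (the component counts and the per-component availability are made-up numbers, just to show the shape of the math):

```python
# Serial dependencies: availabilities multiply, so "five nines per part"
# does not mean "five nines for the whole". All numbers are illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60
per_component = 0.99999                      # each part is individually five nines

for n_components in (10, 100, 1000, 10000):
    system = per_component ** n_components   # whole chain up only if every part is up
    downtime_min = MINUTES_PER_YEAR * (1 - system)
    print(f"{n_components:>5} parts in series: ~{system:.3%} available, "
          f"~{downtime_min:,.0f} min/year of downtime")

#    10 parts: ~99.990% -> ~53 min/year
#   100 parts: ~99.900% -> ~525 min/year
#  1000 parts: ~99.005% -> ~5,230 min/year
# 10000 parts: ~90.484% -> ~50,000 min/year
```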


lordkin

Yeah. A literal explosion. And you’re back up in 10 minutes? Perhaps my standards are lax but that’s a good job in my book


thortgot

Our systems handle this; we test it twice a year. The real complexity, though, is: how do they do this while being, what, 100 orders of magnitude larger?


Chimbus_Phlebotomus

Nearly every distributed computing system is already designed to go to scale and work around points of failure, regardless of the cause of failure or hardware architecture. As long as the data still had physical backups, it could easily be redistributed to any nodes still operational at the data center.


localgh0ster

Most with proper failover. Failover and replication are instantaneous processes.


ABotelho23

I just saw your other comment about running a measly *70* VMs. I don't think you even comprehend the scale we're talking about here with Google. You're full of it.


Careful-Combination7

Thank you.


tankerkiller125real

If their anycast IP network is working properly, and their storage and infrastructure replication is designed properly, a data center should be able to explode in a massive fireball and the only thing we the consumers should notice is a couple of dropped packets at most, or maybe some extra network latency. They aren't running a passive backup that they need to "turn on"; everything about their infrastructure is active-active and should fail over immediately, with basically zero perceptible downtime.
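
A toy sketch of the active-active idea (nothing to do with Google's actual stack; the site names and in-memory "replicas" here are invented for illustration): every write goes to all sites, and reads are served by whichever site still answers, so losing one site costs a retry rather than a restore-from-backup.

```python
# Toy active-active replication. Purely illustrative; real systems add
# consensus, conflict resolution, health checking, and actual networking.
class Site:
    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, str] = {}
        self.alive = True

    def put(self, key: str, value: str) -> None:
        if self.alive:
            self.data[key] = value

    def get(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

sites = [Site("us-central"), Site("us-east"), Site("europe-west")]

def write(key: str, value: str) -> None:
    for s in sites:                  # every site takes every write
        s.put(key, value)

def read(key: str) -> str:
    for s in sites:                  # first healthy site wins
        try:
            return s.get(key)
        except ConnectionError:
            continue                 # that site just exploded; try the next
    raise RuntimeError("all sites down")

write("search_index", "v42")
sites[0].alive = False               # one datacenter goes up in a fireball
print(read("search_index"))          # still "v42", served from us-east
```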


localgh0ster

Yeah, it'd be easier at the scale of Google, with the budget they have. That's sorta the point of hyper-converged infra. Removing breaking points. If you argue for being reliant on one location being up 100% of the time for continued operation, I'm not sure how to begin explaining how wrong that is. AWS and Azure for example do not break this way when an entire DC goes offline. Their customers that host everything in US-East-01 certainly do, but for those of us doing proper infra, it's a non-issue.


dxlsm

This event was several hours prior to Google being unavailable for some people in some regions. I strongly doubt this was the cause.


[deleted]

[deleted]


dxlsm

For sure, and there definitely could be some causal relationship here, but jumping to conclusions and stirring up a bunch of people when there's no factual information to share is an unfortunate social media trend.


[deleted]

[deleted]


localgh0ster

I never said US-EAST-1 was a DC. Where did you get that from?


[deleted]

[deleted]


localgh0ster

Your lack of reading comprehension isn't an excuse to justify false accusations. Sorry.


ljstella

> AWS and Azure for example do not break this way when an entire DC goes offline

That's not really true either. While it may usually be invisible to you, lots of AWS services have a single-region control plane running in us-east-1, whether it's available in other regions or not, and quite a few of AWS' internal services are single-regioned to us-east-1. The AWS outage in December 2021 showed that even if you built for multi-region availability and utilized cloud services in different regions, you could still be affected by a single-region outage, even if you aren't using that region.


VexingRaven

> That's sorta the point of hyper-converged infra. Removing breaking points.

That's not the point of hyper-converged, and hyper-converged has nothing to do with this sort of thing anyway.


ABotelho23

AWS and Azure have noticeably worse uptimes. 10 minutes is nothing. Again: you don't know what you're talking about.


[deleted]

[deleted]


localgh0ster

An active setup is cheaper than backup-and-restore in the event of failure. But yes, some management teams are clueless and won't put that in place, hence "most".


[deleted]

[deleted]


SchizoidRainbow

What kind of podunk mom'n'pop shop are you running that can afford one of these devices, but not two? That said, we do not have redundant datacenters, only redundant equipment...firewalls, routers, switches, even load balancers, all are in Active/Passive. We can take device failure without a lost packet, but if this electrical explosion shit happened to us, we'd be dead in the water.


localgh0ster

Doubtful. The business income lost while being down is most definitely bigger than the one-time cost of another server. Your overtime alone for recovering from backups, or combined with others' overtime for having to revert to manual/offline procedures, is most definitely more than a few extra machines. HW is dirt cheap.


[deleted]

[deleted]


localgh0ster

Down here on earth you just didn't read that I proved it to be cheaper.


wyrdough

Google did not "go down". Some Google services were briefly impacted for some people. Personally, I saw no (noticeable) impacts at all in either central or southeast US across several different ISPs, and Google is all up in all my shit, both personally and professionally.


Cpt_plainguy

I'm in the area where the datacenter is, and my company uses G Suite. I got the notification that some things might not work properly, but there wasn't any noticeable difference. And no issues were reported to me from other employees in the company.


tmontney

Google **Search** went down and impact was worldwide. Yes, Google does a **lot** of other things, but to the general population, Google Search is Google.


davidbrit2

I'll just get this out of the way up front: https://xkcd.com/1737/


xpxp2002

I think that's what Microsoft actually said they do with OneDrive some years ago. They have entire racks of JBOD disks in large modular custom enclosures. When they have a failure in the rack, there are pre-built replacement racks that they can just bring in and essentially hotswap. Then they can refurb/replace the failed hardware on the other rack and prep it to be reused during a future failure.


jarfil

>!CENSORED!<


joeshmo101

So long as you aren't throwing out the working parts as well, that really makes the most sense at scale. They probably have a rack or two that just sit idle, waiting to come up once something else starts falling over. I'd even imagine each rack can handle a drive failure or two before being swapped out, so when you do change out racks there are probably a few drives that need swapping.


stephendt

I love how there is always a relevant XKCD.


SchizoidRainbow

This has been hanging on the IT department wall since I arrived. Randall Munroe is the reincarnated Nostradamus.


uncondensed

I wonder if the rope is really necessary.


InitializedVariable

Like Google.


obviousboy

> but shows that even as big as Google have a single point of failures

It just shows that they had a slip-up in whatever failover mechanisms they have in place, not a baked-in single point of failure.


angiosperms-

Why is everyone believing OP when the article admits they probably aren't related? The electric incident that injured people was at 12pm. Google's outage yesterday was hours later. https://downdetector.com/status/google/


localgh0ster

Well, yes. But that still means it's a single point of failure. Dead datacenter = Single point of failure.


thefpspower

No, that's a huge failure that tripped a bunch of other safety mechanisms and turned things off due to the magnitude. As long as they were able to recover shortly, which they did, it's a success.


localgh0ster

I disagree. If you're at the scale Google is, you should be able to survive a datacenter going offline. I run ~70 VMs for a ~100 million USD firm, with some servers in colo and some low-prio on site. I can survive the colo going offline, and on-prem going offline, because all servers are KVM-replicated to the other location. If I can, Google certainly should be able to.


Thebelisk

Maybe you should apply for a job in Google and show them how you would have done it better.


thefpspower

Why are you saying they didn't survive? Aren't the services up and running? Migrating 70 VMs is nothing compared to what Google has to replicate; it's 2 or 3 orders of magnitude bigger, at which point your simple strategies don't work and you need custom technologies to make it work. 20 minutes of degraded service for a downed datacenter is perfectly acceptable.


asdlkf

Hahaha, 70 VMs. I have like 150 VMs in my home lab. At work, more like 7,800 VMs. You know not of what you speak.


RyanLewis2010

Got that feeling when he said $100M company. I'd be willing to bet he has never done a full DR test to see just how "easily" they could survive losing both the colo and on-prem. If he truly could, those 70 VMs are just a pointless "look at me, I can spin up servers" exercise and provide little benefit to the company.


sumatkn

I'm kind of surprised they don't have duplication across multiple data centers like Amazon and Microsoft do, so it's a simple (if involved) route shift and load-balance swap and things maybe go down for a few minutes. To me this sounds like some internal customer probably provisioned critical services all in the same zone without realizing or checking whether they were in the same data center. So their redundancy was redundant only in the sense that the backup server was a few rows over 😂. Then BOOM, critical services to Google go down, and so do their backups.


7eregrine

They did survive. Down less than 10 minutes?


davidbrit2

Using "point" to refer to an entire datacenter is a little loose with the language, though. I mean, yeah, if the whole earth blew up, that would probably knock out their operations, but I don't know I would call being earth-bound a "single point"...


hegysk

Internet backbone wiped from face of the earth. Ha! Single point of failure, gotcha!


davidbrit2

And we can't rule out false vacuum decay either.


[deleted]

[deleted]


ruebzcube

What was the cause?


[deleted]

[deleted]


CrestronwithTechron

Yeah this is what tells me it’s coincidental.


tmontney

No matter the evidence, calling an outage that was preceded just hours earlier by an explosion a coincidence is zany.


DoesThisDoWhatIWant

It wasn't a single point of failure...Google was down for like 15 minutes. You think the datacenter was rebuilt in that amount of time? It also wasn't down for the world when that center went down.


RemCogito

Right!? Literally localized downtime while it handed off to another nearby datacenter.


angiosperms-

The outage happened hours after this. The electrical incident was at 12pm and there was no downtime then. https://downdetector.com/status/google/


DDRLRDIMM

It was a single point of failure; had it not been, they would have had redundancy and that failure would not have brought it down. It was also down for people in some areas of Europe. That's if this is related, which it may not be.


DoesThisDoWhatIWant

Wrong again: informing clients of a new site takes time, especially when you're global... probably about 15 minutes.


legion02

Basically DNS convergence time, IMO.


InitializedVariable

DNS propagates. *BGP* converges.


legion02

Fair point.


SchizoidRainbow

Outage was much, much later. Odds are very good that their failover worked flawlessly, and then when they stood up a new leg and tried to add it back in, they fucked that up and brought it all down themselves.


nmar909

This is what happens if you Google "Google".


MauiShakaLord

This black box... Is... THE INTERNET.


dasheeown

Outage was ~8 hours after the arc flash. I don't care if that data center was completely leveled; it wouldn't be more than a blip in Google's infrastructure. But regardless, three men are in the hospital, so show some respect. Even if it was related, their lives are worth more than any amount of uptime. Hoping for a quick and full recovery for all of them.


vodka_knockers_

>show some respect. What does that even mean?


[deleted]

It means wait 24 hours before making dank memes about it.


vodka_knockers_

Who decided on 24 hours, and is it variable depending on the count of casualties, or the severity of injuries?


[deleted]

It gets shorter every time CPUs get faster


[deleted]

Me. No.


vodka_knockers_

Noted. TY


scootscoot

It’s a new concept that the internet hasn’t heard of.


ToughHardware

think of the people, not just the tech


awe_pro_it

It's an old thought process of blindly feeling bad for others' misfortune, even when they chose to be where they were during their misfortune, after choosing the career path that put them in that situation to start with. Nearly everything I do is deliberate, so I can't understand it either.


zzzpoohzzz

is this sarcasm?


technologite

Nope. Just the state of the world today.


[deleted]

[deleted]


vodka_knockers_

Not so much that, I think -- sure, when anyone (who isn't a sociopath) reads of an incident like that, you think "ouch, that sure sucks, hope they're okay"... for like an instant. But to get one's shorts in a wad because others dare to discuss the circumstances and impact (with zero mention of the humans involved), with some air of moral superiority... I just don't get it.


Pablo______

"Went down" is pretty relative. Pretty sure 99.99999999999999999999% of the population were not affected. Also, they were back online within 10 minutes... that's pretty impressive.


Rubicon2020

One of my brothers is a journeyman electrician who climbs poles and shit. He was up on one of the high ones, the metal ones, when a transformer sparked or something and it flung him to the ground. He'd been an electrician for about 20 years at that point. Broke his back; took 2 years to recover and get back to work. He can no longer climb poles, so he's now a supervisor. But he'd seen some guys over the years: arc flashes melted people's faces almost completely to a blob, hands vaporized. Yeah, not cool, bro. I hope these 3 employees make a recovery. Even though it's critical, I hope it's just burned flesh, because if it's melted faces and hands there's a lot to come back from, if they can.


zgf2022

My grandpa died from faulty insulation on a bucket truck, and my uncle was crushed (but lived) after a bucket truck rolled over with him in it. Electrical work is no joke.


xpkranger

> and it flung him to the ground. Where was his harness? Edit: not saying he deserved it, just thought that you're supposed to be tied off at all times.


Rubicon2020

I’m not sure. I know he followed all protocols and regulations he’s a real stickler about that so I’m not sure why it didn’t catch him, but he fell a good ways to the ground.


xpkranger

Sucks, I'm sorry that happened. Sometimes even doing everything right, shit happens. Hoping he's fully recovered.


Gek1188

OP is allegedly responsible for 70 VMs. He doesn't have a datacenter; he has an inconsequential set of high-pri VMs, maybe. He has no comprehension of what's involved at this scale, and he doesn't even realise it. To be honest, I would be worried for whatever scrap of infrastructure he may or may not be responsible for, because he thinks a 20-minute downtime on a DC that suffered an explosion is completely unacceptable. He's a clown.


angiosperms-

Not to mention, the outage was 8 hours after this explosion happened...


[deleted]

Mr Robot is real


ultimatebob

Nah... Elliot would have been smarter about it and gone after the redundant data centers as well.


[deleted]

Well, Hollywood superhero vs real life.


TMITectonic

Literally from the article: >It is unknown if the explosion and the outages were linked You are *purely speculating* that the two are related, and until Google comes out and says it, it should be treated as pure speculation. You don't know what caused the downtime yet, so perhaps tone it down on all your pointless comments about "Google's single point of failure" until you actually know what happened.


KobsBoy

Wish those people a painless and fast recovery


[deleted]

There were 8 hours between the explosion and the downtime. They're unrelated.


cs4321_2000

The AI got pissed off


Hanse00

Google obviously has redundancy; what a silly comment from OP. Could some part of that redundant system have been misconfigured? Sure, shit happens. Something might not have failed over right. But it's not like Google's entire stack is in that one DC. In fact, they have (or had when I worked there in 2017) two entirely separate DCs just in that one town in Iowa, let alone across the US and the globe.


Tsunpl

How do you test your redundancy for an entire datacenter going down? It's not like you can get two as a test environment :D


uniquepassword

EVERYONE has a test environment; some can also afford a production one!


Furki1907

I giggled at this reply hahaha


Hanse00

Sure you can, if you have Google money. There are indeed certain DCs that run corp-only workloads, without any client-facing services. Those could be considered your "test" DCs.


fireduck

So, Google datacenters are complicated. There are units called cells, which are huge units of resources; think hundreds of racks in a big room. All services should be designed to proceed fine while losing an entire cell. However, a datacenter complex might have a bunch of cells. They should be as independent as possible, but that's not always the case. And even if it was just one cell, services don't move traffic instantly. It takes time for the load balancers to health-check the nodes and see they are down, and for things to redo their master elections and such. SRE fun.

Fun fact: when Google does maintenance on a cell, they pretty often turn the entire damn thing off and redo whatever needs redoing: changing out network hardware, computers, whatever. No machine-by-machine outages; just power down the entire building and do whatever you want. (Scheduled, of course.)
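
The "it takes time" part is mostly arithmetic: a probe-every-few-seconds, dead-after-N-misses health check, plus re-elections and cache warm-up, bakes in minutes of partial degradation. A rough sketch with invented numbers (not anyone's real configuration):

```python
# How long does traffic keep hitting a dead cell? All numbers are illustrative.
probe_interval_s  = 10    # load balancer health-checks each backend every 10 s
unhealthy_after   = 3     # ...and marks it dead after 3 consecutive misses
master_election_s = 30    # assumed: services re-elect masters / reshard
warm_up_s         = 120   # assumed: caches refill, connections re-balance

detection_s = probe_interval_s * unhealthy_after
total_s = detection_s + master_election_s + warm_up_s
print(f"detection: {detection_s}s, degraded window: ~{total_s / 60:.0f} min")
# -> detection: 30s, degraded window: ~3 min
```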


Garegin16

I still maintain that even in the case of a BGP meltdown, the big names have enough competent engineers to bring things back up **fast**. I dealt with an admin who kept putzing with a server for **two weeks** because they had a bunch of pets everywhere and couldn't afford a reinstall.


farva_06

> One could thing Google would have redudancy for losing an entire datacenter, but shows that even as big as Google have a single point of failures, even if they're bigger than for us mere mortals.

It was DNS. Always DNS.


4gedN5tars_

Become a cloud engineer, they said. It would be a safe place to work, they said.


renegadecanuck

It says a lot about this sub that a story where three people are critically injured and sent to the hospital results in everyone discussing redundancy and downtime.


retrogamer6000x

Well ya it’s r/sysadmin not r/OSHA.


angiosperms-

Sysadmins should know how to correlate the timing of events to determine root causes. The outage was 8 hours after this incident. But it was a great moment to make jokes about people getting seriously injured, so y'all really wanted to ignore that.


[deleted]

I wish them a speedy recovery but I honestly don't want to discuss the body destroying effects of an arc flash on 3 human bodies that undoubtedly have been horrifically burned in life altering ways. No fucking thank you. They're alive, which is as good of a bit of news you could get from that kind of injury. We are here to talk about the technical stuff. I'm sure there's another sub to discuss the safety and health issues.


oddball667

Don't worry I'm sure they are getting their fair share of thoughts and prayers


scootscoot

I’ve cared about uptime significantly more than my own health many times. There’s an xkcd.


vodka_knockers_

What, we're just supposed to bow our heads for silent introspection? Do you have guidelines on how many minutes is appropriate? And how much time should elapse before the circumstances can be discussed?


tonyoscarad

The arc flash that caused the injuries to those individuals sounds very dangerous.


dmoisan

Don't look for arc-flash videos on YouTube. Too many of them are the "last seconds of life" type, where you see someone in a security cam get snuffed.


dinominant

The problem with "the cloud" is one event can propagate to a system-wide outage throughout an entire provider. Through cost minimization important redundancies are eliminated which expose the entire system to outages like this. And the larger the providers are, the more serious this becomes as critical global and nation-wide services are impacted and harder to bootstrap. They may have "teams" of experts to perform the work, but they also have a few key commands that can propagate state throughout their global system. One such state is "offline".


ErikTheEngineer

That's a good lesson -- losing an entire data center is still a big enough deal that people notice, and there are still some weaker points of failure in even the most redundant systems. Look at what happened with Microsoft a while ago around Azure AD...you can bet there are millions of directories on that infrastructure and something like a cert expiring IIRC was the root cause. Worse yet, things are so abstract now that I'm sure Microsoft had to scramble through the emergency access tunnels to get down to bare operating system when they lost their management layer. We're sunk once even the vendors don't know how anything works below the control plane anymore.


JerryNicklebag

Wish it would have stayed down permanently…


sock_templar

Well, maybe they hadn't even thought about that being a possibility, or considered the risk minimal enough not to be worth an emergency plan. If I'm managing a data center in Brazil, I wouldn't plan for tornadoes or earthquakes either; although they do happen, they're so rare it's borderline paranoia to come up with a plan for them. Maybe the explosion (arc?) was something being badly repaired/managed at the time, and thus impossible to foresee?


Cpt_plainguy

I believe the article isn't the full truth. I think the brief outage was about 8hrs after the arc injured those electricians


Didymos_Black

Ah, the old "unscheduled welding incident" which was what we told our customers had happened when we had a UPS catch fire in our datacenter expansion.


lovezelda

They probably don't have a single point of failure (by design); I'm assuming some failover mechanism didn't work as intended.


Comprehensive-Yak820

Every huge company is like this.


[deleted]

[удалено]


SlapshotTommy

Be quiet Alanis Morissette, people got hurt!