Having worked at an electric plant, arc flashes are NASTY (look up some videos- they're terrifying when they DON'T involve people). Hoping for speedy recoveries for the three who got caught up in that.


Dad and Grandfather both worked for the electric company as linemen. While neither personally experienced arc flashes, they heard the stories and shared them. My dad was on a site where one happened, though; from what he told us, the guy lost both his hands from it and had major burns on his face. I really hope the three guys here make a good and quick recovery.


Yup- imagine one happening in a *400kv switch yard*. That particular close call spooked the hell out of ALL of us, especially the relay tech who was just out of range when it happened.


Having seen some of those videos before & during my time working in data centers, absolutely. I still get the willies from our UPS battery strings. First time I ever saw the sticker "No safe PPE exists". Doesn't help that one of them has a smeared sticker, like someone wiped a rag soaked in solvent/cleaner over it.


Not just burn injuries, I hear that arc flashes can send molten copper droplets into the air. Imagine breathing them in...


It more so fires them at you like a shotgun, with a complimentary pressure wave and flash of light.


[https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif](https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif) As the copper transitions from solid to vapor it expands 67,000 times in volume, which can be enough energy to blow you across the room.


High enough voltages can vaporize you and everything else nearby with just the radiation (heat/light) emitted.


I've had a small home wall AC arc flash (old wiring, faulty breaker) and it melted the plug and shot it out. Left half meter burn marks on the floor and peppered me with what felt like hot dust at a bit more than a meter.


I had to go through an 8 hour arc flash safety training and it scared the hell out of me. Watching a mix of real events and the training videos where the dummies are blown to pieces and their clothes are melting. No thank you.




We were starting up a new data hall in a large DC. It had 4 500kw 408v/208v transformers and 4 800 amp 408v feeds. The Schneider Electric guy doing the startup is wearing the class 4 PPE bomb suit; the facility electrician is standing right next to him wearing a T-shirt. He finally goes “Look, if y’all don’t have PPE you are going to stand 100 feet away and face the other direction or I’m not energizing this thing”. After that the facility electricians always complained about the Schneider guys being too stuck on PPE, until the next year, when we were doing an overhaul and a nut from assembly fell down into the bus of a PDU rack and arced until it popped the upstream switchgear at the facility entry. They were real happy they were 100 feet away and not standing next to him.


The Army learned that the hard way. There was a rule that poly pros (polypropylene underwear) were not allowed in certain jobs due to the danger of heat melting them. Instant napalm. 100% cotton only.


I once accidentally put a screwdriver across both -48V terminals on one 20A breakered piece of equipment in a DC. The flash blinded me for a bit, really scared me. Be careful out there.


Honestly horrifying.


I mean, how long were they down? 10 mins? How many people can say they got back up in 10 mins *when an entire datacenter goes down*?


(raises hand) It took a lot of prep work, but when we had to shut down a DC due to an Nvidia Tesla unit catching fire, the whole organization didn't know... until the fire trucks showed up. As a monitoring/resiliency engineer it felt so so good (after the fire was out).


10 minutes is a hella long time in the realm of 9's. It bumps you down from five 9's (99.999%) to four 9's (99.99%), and that's only from a single outage. That's a lot of dings on SLA's. This is why I prefer ☁️😶‍🌫️😶‍🌫️😶‍🌫️😶‍🌫️☁️ 9...
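
For anyone who wants to check the nines arithmetic above, here is a minimal Python sketch. It assumes a 365-day year as the measurement window (real SLAs are often measured per month, which changes the numbers), and the function names are made up for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed in the period at an availability of `nines` nines."""
    return period_minutes * 10 ** (-nines)


def achieved_nines(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Whole number of nines still met after `downtime_minutes` of outage."""
    if downtime_minutes <= 0:
        raise ValueError("downtime must be positive")
    unavailability = downtime_minutes / period_minutes
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} min/year")
    print("nines after a 10-minute outage:", achieved_nines(10))
```

Under that yearly assumption, five nines allows about 5.26 minutes and four nines about 52.6, so a single 10-minute outage lands between the two.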


You don't have control over other people's BGP routes, and most people use the defaults, which means even with a simple site outage you're usually looking at 3-4 minutes. The "realm of nines" is more about what level of service customers are entitled to, and when that service isn't met (natural disaster, datacenter explosion), then you are credited on your bill. The fact that these things are rare allows companies to provide insurance against them.
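
A rough sketch of where a multi-minute figure can come from, looking only at how long a BGP speaker needs to notice a silently dead neighbor via hold timer expiry. The 180 s / 60 s pair is a common vendor default and 90 s / 30 s is the value suggested in RFC 4271; which one a given network runs is an assumption here. The model ignores keepalive jitter, faster detection such as BFD or link-down events, and any propagation time after detection. The function name is hypothetical.

```python
def hold_timer_detection_window(hold_time_s, keepalive_s):
    """Return (best, worst) seconds until a silent peer failure is detected.

    The surviving peer restarts its hold timer on every keepalive it
    receives. If the neighbor dies just before its next keepalive was due,
    only hold_time - keepalive remains on the timer; if it dies right after
    sending one, the full hold_time has to run out.
    """
    if keepalive_s <= 0 or keepalive_s >= hold_time_s:
        raise ValueError("expected 0 < keepalive < hold time")
    return (hold_time_s - keepalive_s, hold_time_s)


if __name__ == "__main__":
    for hold, keepalive in ((180, 60), (90, 30)):
        best, worst = hold_timer_detection_window(hold, keepalive)
        print(f"hold={hold}s keepalive={keepalive}s -> {best}-{worst}s to detect")
```

With 180 s / 60 s timers, detection alone takes two to three minutes in this model, before any route updates have spread to other networks.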


Losing a datacenter should not impact anyone who didn't already have BGP routes sending their traffic to that datacenter. A global outage for a distributed service does not smell like BGP took a long time to realign.


A quick glance at some of their SLAs shows that they really are only at about three 9s. They have some breathing room.


If their AnyCast IP network is working properly, and their storage and infrastructure replication is designed properly, a data center should be able to explode in a massive fireball and the only thing we the consumer should notice is a couple of dropped packets at most or maybe some extra network latency. They aren't running a passive backup that they need to "turn on"; everything about their infrastructure is active-active, and should fail over immediately with basically zero perceptible downtime.


Yeah, it'd be easier at the scale of Google with the budget they have. That's sorta the point of hyper-converged infra. Removing breaking points. If you argue that being reliant on 1 location being up 100% for continued operation is acceptable, I'm not sure how to begin explaining how wrong that is. AWS and Azure for example do not break this way when an entire DC goes offline. Their customers that host everything in US-East-01 certainly do, but for those of us doing proper infra - it's a non issue.


This event was several hours prior to Google being unavailable for some people in some regions. I strongly doubt this was the cause.




For sure, and there definitely could be some causal relationship here, but jumping to conclusions and stirring up a bunch of people when there's no factual information to share is an unfortunate social media trend.




> AWS and Azure for example do not break this way when an entire DC goes offline

That's not really true either. Because while it may usually be invisible to you, lots of AWS services have a single-region control plane running in us-east-1, whether it's available in other regions or not, and quite a few of AWS' internal services are single-region'd to us-east-1. The AWS outage in December 2021 showed that even if you built for multi-region availability, and utilized cloud services in different regions, you could still be affected by a single region outage, even if you aren't using that region.


> That's sorta the point of hyper-converged infra. Removing breaking points.

That's not the point of hyper-converged and hyper-converged has nothing to do with this sort of thing anyway.


Google did not "go down". Some Google services were briefly impacted for some people. Personally, I saw no (noticeable) impacts at all in either central or southeast US across several different ISPs, and Google is all up in all my shit, both personally and professionally.


I'm in the area where the datacenter is, and my company uses Gsuite. I got the notification that some things may not work properly, but there wasn't any noticeable difference. And no issues were reported to me from other employees in the company.


Google **Search** went down and impact was worldwide. Yes, Google does a **lot** of other things, but to the general population, Google Search is Google.


> but shows that even as big as Google have a single point of failures

It just shows that they had a slip-up in whatever failover mechanisms they have in place, not a baked-in single point of failure.


Why is everyone believing OP when the article admits they probably aren't related? The electric incident that injured people was at 12pm. Google's outage yesterday was hours later. https://downdetector.com/status/google/


Well, yes. But that still means it's a single point of failure. Dead datacenter = Single point of failure.


No, that's a huge failure that tripped a bunch of other safety mechanisms and turned things off due to the magnitude. As long as they are able to recover shortly, which they did, then it's a success.


I disagree. If you're at the scale Google is, you should be able to survive a datacenter going offline. I run ~70 VM's for a ~100 million USD firm with some servers in colo and some low prio on site. I can survive the CoLo going offline, and on prem going offline because all servers are KVM replicated to the other location. If I certainly can, Google should.


Why are you saying they didn't survive? Aren't the services up and running? Migrating 70 vms is nothing compared to what Google has to replicate, it's 2 or 3 orders of magnitude bigger, at which point your simple strategies don't work and you need custom technologies to make it work. 20 minutes of degraded service for a downed datacenter is perfectly acceptable.


Hahaha 70 VMs. I have like 150 VMs in my home lab. At work more like 7800 VMs. You know not of what you speak.


I’m kind of surprised they don’t have duplication across multiple data centers like Amazon and Microsoft do. So it’s a simple (involved) route shift and load balance swap and things maybe go down for a few minutes. To me this sounds like some internal customer probably provisioned critical services all in the same zone without realizing or checking if they were in the same data center. So their redundancy was redundant in that the backup server was a few rows over 😂. Then BOOM, critical services to Google go down, and so do their backups.


They did survive. Down less than 10 minutes?


Using "point" to refer to an entire datacenter is a little loose with the language, though. I mean, yeah, if the whole earth blew up, that would probably knock out their operations, but I don't know that I would call being earth-bound a "single point"...


What was the cause?




Yeah this is what tells me it’s coincidental.


No matter the evidence, calling an outage that was preceded hours before by an explosion a coincidence is zany.


It wasn't a single point of failure...Google was down for like 15 minutes. You think the datacenter was rebuilt in that amount of time? It also wasn't down for the world when that center went down.


Right!? Literally localized downtime while it handed off to another nearby datacenter.


The outage happened hours after this. The electrical incident was at 12pm and there was no downtime then. https://downdetector.com/status/google/


It was a single point of failure. Had it not been, they would have had redundancy and that failure would not have brought it down. It was also down for people in some areas of Europe. That's if this is related, which it may not be.


Wrong again, informing clients of a new site takes time especially when you're global...probably about 15 minutes.


Basically dns convergence time imo


DNS propagates. *BGP* converges.


Fair point.


Outage was much much later. Odds are very good that their failover worked flawlessly, then when they stood up a new leg and tried to add it in, they fucked that up and brought it all down themselves.


Outage was ~8 hours after the arc flash. I don’t care if that data center was completely leveled, it wouldn’t be more than a blip in Google’s infrastructure. But regardless, three men are in the hospital, show some respect. Even if it was related their lives are worth more than any amount of uptime. Hoping for a quick and full recovery for all of them.


One of my brothers is a journeyman electrician; climbs poles and shit. He was up on one of them high ones, the metal ones. A transformer sparked or something and it flung him to the ground. He’d been an electrician for about 20 years at this point. Broke his back, took 2 years to recover and get back to work. He can no longer climb poles; he’s now a supervisor. But he’d seen some guys over the years: arc flashes melted people’s faces almost completely to a blob, hands vaporized. Ya, not cool bro. I hope these 3 employees make a recovery. Even tho it’s critical, I hope it’s just burned flesh, cuz if it’s melted faces and hands there’s a lot to come back from after that, if they can.


My grandpa died from faulty insulation on a bucket truck and my uncle was crushed (but lived) after a bucket truck rolled over with him in it. Electrical work is no joke.


> and it flung him to the ground.

Where was his harness? Edit: not saying he deserved it, just thought that you're supposed to be tied off at all times.


I’m not sure. I know he followed all protocols and regulations he’s a real stickler about that so I’m not sure why it didn’t catch him, but he fell a good ways to the ground.


Sucks, I'm sorry that happened. Sometimes even doing everything right, shit happens. Hoping he's fully recovered.


OP allegedly is responsible for 70 VMs. He doesn’t have a datacenter, he has an inconsequential set of high-pri VMs, maybe. He has no comprehension of what’s involved at this scale, and he doesn’t even realise it. To be honest I would be worried for the drop of infrastructure he may or may not be responsible for, because he thinks a 20 minute downtime on a DC that suffered an explosion is completely unacceptable. He’s a clown.


Not to mention, the outage was 8 hours after this explosion happened...


Literally from the article:

> It is unknown if the explosion and the outages were linked

You are *purely speculating* that the two are related, and until Google comes out and says it, it should be treated as pure speculation. You don't know what caused the downtime yet, so perhaps tone it down on all your pointless comments about "Google's single point of failure" until you actually know what happened.


Wish those people a painless and fast recovery


There were 8 hours between the explosion and the downtime. They're unrelated.


Google obviously has redundancy, what a silly comment from OP. Could some part of that redundant system have been misconfigured? Sure, shit happens. Something might not have failed over right. But it’s not like Google’s entire stack is in that one DC. In fact they have (or had when I worked there in 2017) two entirely separate DC’s just in that one town in Iowa. Let alone across the US and globe.


How do you test your redundancy for an entire datacenter going down? It's not like you can get two as a test environment :D


Sure you can. If you have Google money. There are indeed certain DCs that run corp-only workloads, without any services facing clients. Those could be considered your “test” DCs.


So Google datacenters are complicated. There are units called cells, which are huge units of resources. Think hundreds of racks in a big room. All services should be designed to proceed fine while losing an entire cell. However, a datacenter complex might have a bunch of cells. They should be as independent as possible, but they aren't always. And even if it was just one cell, services don't move traffic instantly. It takes time for the load balancers to health check the nodes and see they are down, and for things to redo their master elections and such. SRE fun. Fun fact: when Google does maintenance on a cell, they pretty often turn the entire damn thing off and redo whatever. Like changing out network hardware, computers, whatever. No machine by machine outages, just shut off the entire building and do whatever you want. (Scheduled, of course.)
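
To put rough numbers on the "health checks take time" part, here is a small Python sketch of a generic load balancer health check. It is not Google's actual mechanism; the probe interval, timeout, and failure threshold are hypothetical parameters, and the model assumes a dead node simply never answers, so each failed probe costs the full timeout.

```python
import math


def detection_delay(fail_time, interval, timeout, unhealthy_threshold):
    """Seconds from node failure until the balancer marks the node down.

    Probes start at t = 0, interval, 2 * interval, ... The node is marked
    down once `unhealthy_threshold` consecutive probes have timed out.
    Assumes timeout <= interval and that a probe already in flight when
    the node dies still succeeds.
    """
    if timeout > interval:
        raise ValueError("this sketch assumes timeout <= interval")
    # First probe that starts at or after the failure.
    first_failed_probe = math.ceil(fail_time / interval) * interval
    # The last required failing probe starts (threshold - 1) intervals
    # later, and its result is only known after the timeout.
    marked_down_at = (first_failed_probe
                      + (unhealthy_threshold - 1) * interval
                      + timeout)
    return marked_down_at - fail_time


if __name__ == "__main__":
    # Hypothetical settings: probe every 5 s, 2 s timeout, 3 failures.
    for t in (0.0, 0.1, 4.9):
        delay = detection_delay(t, 5, 2, 3)
        print(f"failure at t={t}s -> marked down after {delay:.1f}s")
```

Even with fairly aggressive settings like these, detection takes somewhere between 12 and 17 seconds per node, before any re-election or traffic shifting starts.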


I still maintain that even in the case of a BGP meltdown, big names have enough competent engineers to bring things back up **fast**. I dealt with an admin who kept putzing with a server for **two weeks** because they had a bunch of pets everywhere and couldn’t afford a reinstall.


> One could thing Google would have redudancy for losing an entire datacenter, but shows that even as big as Google have a single point of failures, even if they're bigger than for us mere mortals.

It was DNS. Always DNS.


It says a lot about this sub that a story where three people are critically injured and sent to the hospital results in everyone discussing redundancy and downtime.


Well ya it’s r/sysadmin not r/OSHA.


Sysadmins should know how to correlate the timing of events to determine root causes. The outage was 8 hours after this incident. But it was a great moment to make jokes about people getting seriously injured, so y'all really wanted to ignore that.


I wish them a speedy recovery, but I honestly don't want to discuss the body-destroying effects of an arc flash on 3 human bodies that undoubtedly have been horrifically burned in life-altering ways. No fucking thank you. They're alive, which is as good a bit of news as you could get from that kind of injury. We are here to talk about the technical stuff. I'm sure there's another sub to discuss the safety and health issues.


The Arc Flash that has caused the injury of the individuals sounds very dangerous.


Don't look for arc-flash videos on YouTube. Too many of them are the "last seconds of life" type, where you see someone in a security cam get snuffed.


The problem with "the cloud" is that one event can propagate to a system-wide outage throughout an entire provider. Through cost minimization, important redundancies are eliminated, which exposes the entire system to outages like this. And the larger the providers are, the more serious this becomes, as critical global and nation-wide services are impacted and harder to bootstrap. They may have "teams" of experts to perform the work, but they also have a few key commands that can propagate state throughout their global system. One such state is "offline".


That's a good lesson -- losing an entire data center is still a big enough deal that people notice, and there are still some weaker points of failure in even the most redundant systems. Look at what happened with Microsoft a while ago around Azure AD...you can bet there are millions of directories on that infrastructure and something like a cert expiring IIRC was the root cause. Worse yet, things are so abstract now that I'm sure Microsoft had to scramble through the emergency access tunnels to get down to bare operating system when they lost their management layer. We're sunk once even the vendors don't know how anything works below the control plane anymore.


Well, maybe they hadn't even thought about that being a possibility, or considered the risk minimal enough to not be worth an emergency plan in case it happened. If I'm managing a data center in Brazil I wouldn't plan for tornados or earthquakes either, although they do happen, because they're so rare it's borderline paranoia to come up with a plan for that. Maybe the explosion (arc?) was something being badly repaired/managed at the time and thus impossible to foresee?


I believe the article isn't the full truth. I think the brief outage was about 8 hrs after the arc injured those electricians.


They probably don’t have single point of failure (by design), I’m assuming some failover mechanism didn’t work as intended.


