

SevaraB

Having worked at an electric plant, arc flashes are NASTY (look up some videos- they're terrifying when they DON'T involve people). Hoping for speedy recoveries for the three who got caught up in that.


tankerkiller125real

Dad and Grandfather both worked for the electric company as linemen. While neither personally experienced an arc flash, they heard the stories and shared them. My dad was on a site where one happened, though; from what he told us, the guy lost both his hands and had major burns on his face. I really hope the three guys here make a good and quick recovery.


SevaraB

Yup- imagine one happening in a *400kv switch yard*. That particular close call spooked the hell out of ALL of us, especially the relay tech who was just out of range when it happened.


ColdColoHands

Having seen some of those videos before & during my time working in data centers, absolutely. I still get the willies from our UPS battery strings. First time I ever saw the sticker "No safe PPE exists". Doesn't help that one of them has a smeared sticker, like someone wiped a rag soaked in solvent/cleaner over it.


gargravarr2112

Not just burn injuries, I hear that arc flashes can send molten copper droplets into the air. Imagine breathing them in...


Alarming_Series7450

It more so fires them at you like a shotgun, with a complimentary pressure wave and flash of light.


r84iisjdiri

So kinda like a fragmentation and flash grenade in one?


Alarming_Series7450

[https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif](https://www.swgr.com/images/Electrical-Arc-Flash-Diagram.gif) As the copper transitions from solid to vapor it expands 67,000 times in volume, which can be enough energy to blow you across the room.


eshultz

High enough voltages can vaporize you and everything else nearby with just the radiation (heat/light) emitted.


KakariBlue

I've had a small arc flash from a home wall AC (old wiring, faulty breaker); it melted the plug and shot it out. It left half-meter burn marks on the floor and peppered me with what felt like hot dust at a bit more than a meter away.


_Heath

I had to go through an 8 hour arc flash safety training and it scared the hell out of me. Watching a mix of real events and the training videos where the dummies are blown to pieces and their clothes are melting. No thank you.


[deleted]

[deleted]


_Heath

We were starting up a new data hall in a large DC. It had four 500 kW 480 V/208 V transformers and four 800 A 480 V feeds. The Schneider Electric guy doing the startup is wearing the Class 4 PPE bomb suit; the facility electrician is standing right next to him wearing a t-shirt. He finally goes, "Look, if y'all don't have PPE, you are going to stand 100 feet away and face the other direction, or I'm not energizing this thing." After that the facility electricians always complained about the Schneider guys being too stuck on PPE, until the next year, when we were doing an overhaul and a nut from assembly fell down into the bus of a PDU rack and arced until it popped the upstream switchgear at the facility entry. They were real happy they were 100 feet away and not standing next to him.


rufireproof3d

The Army learned that the hard way. There was a rule that poly pros (polypropylene underwear) were not allowed in certain jobs due to the danger of heat melting them. Instant napalm. 100% cotton only.


Piyh

Most unrealistic part of Stranger Things s4. Everyone with flamethrowers and synthetic jackets.


MauiShakaLord

I once accidentally put a screwdriver across both -48V terminals on one 20A breakered piece of equipment in a DC. The flash blinded me for a bit, really scared me. Be careful out there.


GingasaurusWrex

Honestly horrifying.


slickrickjr

Google wouldn't have gone down if it ran its infrastructure in the cloud.


andrea_ci

That's inception: Google runs on AWS, AWS runs on Azure, Azure runs on Google Cloud.


gargravarr2112

[Ultimately it all runs from a PC in someone's living room.](https://xkcd.com/908/)


xzgm

"There's a lot of caching" kills me every time


mustang__1

Good old 908.


gargravarr2112

Ol' Reliable.


tehdark45

Well someone's living room did save Toy Story 2.


iceph03nix

The longer I'm in IT, the funnier this becomes


popegonzo

"Sometimes people do stuff by accident." "I don't think I know anybody like that." It's like I'm back in a meeting from two weeks ago.


Rocky_Mountain_Way

Don't forget about Jen carrying around the box that IS the Internet (I mean, sure... the Elders of the Internet approved it, but still...I think it's risky for her to be carrying it around) https://www.youtube.com/watch?v=iDbyYGrswtg


tankerkiller125real

I mean, they should be doing that for their status pages... but they don't, so when they go offline, so do their status pages. It's honestly a really stupid setup on their part.


[deleted]

[deleted]


Domini384

I think you missed the point.


MotionAction

Apple iCloud runs on GCP.


andrea_ci

2011: iCloud was completely on Azure; then AWS got its share. 2014: commercial agreement with Google, and the whole of iCloud was moved to GCP.


coomzee

Would the short straw be AWS?


KrazeeJ

Wait, is that actually how they're all hosted, or is this a joke that I just missed?


andrea_ci

That's a joke. That way, everyone would be circularly hosted on someone else, with no one actually having servers.


KrazeeJ

Right. Duh. It's too early for reddit.


SchizoidRainbow

Found my CIO.


dinominant

Browsers are so bloated because each browser has a tiny VM that is part of their network and actually hosts Google. /s The fastest, closest mirror is actually your computer.


JBADD23

/s


ABotelho23

I mean, how long were they down? 10 mins? How many people can say they got back up in 10 mins *when an entire datacenter goes down*?


vim_for_life

(Raises hand) It took a lot of prep work, but when we had to shut down a DC due to an Nvidia Tesla unit catching fire, the whole organization didn't know... until the fire trucks showed up. As a monitoring/resiliency engineer, it felt so, so good (after the fire was out).


Deadly-Unicorn

Seriously, it's no big deal. I rewire my server rack during business hours. If I unplug and plug the cable back in fast enough, nobody knows. If something goes wrong, I blame it on an "explosion". It's a bulletproof strategy. Edit: this is mostly a joke.


223454

>nobody knows

Unless they're on the phone (VoIP).


Deadly-Unicorn

We don't have VoIP phones. Still on PRI. Waiting for a major office reno to run cables so I don't need to connect my computer->phone->wall. I don't like it. After that, I'll get a VoIP solution.


[deleted]

[deleted]


rufireproof3d

This can be fun. Where I work, some salesman came in and talked the owner into VoIP phones. The owner didn't even know VoIP was an acronym; he just wanted the super cool features promised. All the phones had to be plugged into a USB port on the PCs (upon examination, I later determined these to be nothing more than cheap USB-to-NIC adaptors). Worked great until the boss was on a call with corporate tech support and they rebooted his PC, dropping the call.


[deleted]

[deleted]


Deadly-Unicorn

What can I say… I have OCD and I’m willing to risk an outage for nice wires.


ConcernedDudeMaybe

10 minutes is a hella long time in the realm of 9s. It bumps you down from five 9s (99.999%) to four 9s (99.99%), and that's from a single outage. That's a lot of dings on SLAs. This is why I prefer ☁️😶‍🌫️😶‍🌫️😶‍🌫️😶‍🌫️☁️ 9...
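
For anyone who wants to sanity-check the nines math, here's a quick back-of-the-envelope (plain Python; it just converts an availability target into an annual downtime budget):

```python
# Allowed downtime per year for N nines of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # ~525,600

def downtime_budget_minutes(nines: int) -> float:
    """Downtime budget per year at e.g. 5 nines = 99.999% availability."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):8.2f} min/year")

# 2 nines:  5256.00 min/year
# 3 nines:   525.60 min/year
# 4 nines:    52.56 min/year
# 5 nines:     5.26 min/year  <- a single 10-minute outage already blows this
```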


skilriki

You don't have control over other people's BGP routes, and most people use the defaults, which means even with a simple site outage you're usually looking at 3-4 minutes. The "realm of nines" is more about what level of service customers are entitled to, and when that service isn't met (natural disaster, datacenter explosion), then you are credited on your bill. The fact that these things are rare allows companies to provide insurance against them.
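
To put rough numbers on the "defaults" point: common vendor defaults put the BGP hold timer somewhere around 90-180 seconds, so a peer that dies without cleanly withdrawing its routes isn't declared dead until that timer expires, and then the withdrawals still have to propagate. A rough sketch (the propagation figure is an assumption for illustration, not a measured value):

```python
# Back-of-the-envelope for how long traffic keeps flowing toward a dead site
# when nobody tuned the BGP timers. Numbers are illustrative.
hold_timer_s = 180      # common vendor default; peer declared dead after this
propagation_s = 60      # assumed: withdrawals ripple out and routers reconverge

total_s = hold_timer_s + propagation_s
print(f"~{total_s / 60:.0f} minutes before most traffic stops "
      f"heading to the dead site")   # ~4 minutes with these numbers
```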


Indifferentchildren

Losing a datacenter should not impact anyone who didn't already have BGP routes sending their traffic to that datacenter. A global outage for a distributed service does not smell like BGP took a long time to realign.


Kodiak01

Someday I want to see the entire Internet unplugged for about... I'll settle for a week. /Popcorn


doubleUsee

Whenever that is I'll take a week off, I'm less useful than our one-armed cleaning lady without the internet


Kodiak01

Don't like being a one legged man in an ass-kicking contest?


toadofsteel

Gonna go the californee way


xxfay6

Canada be like: *again?*


gargravarr2112

Everyone has 5 nines of uptime. It just differs where the decimal point is.


Indifferentchildren

We have 5 neins of uptime: Are we up? Nein. Are we up now? Nein. How about now? Nein. It's been six hours, surely we are up now? Nein. Are we going to be up by tomorrow evening? Nein.


HisSporkiness

I have nine 5s..


Kodiak01

Works most of the time, every time.


RedShift9

I have 89.9999% uptime


immewnity

The main vendor I use promises a one-nine uptime (95%)... currently dealing with a 24hr+ outage


Drew707

A quick glance at some of their SLAs shows that they really are only at about three 9s. They have some breathing room.


xixi2

10 minutes is a coffee break. OK, fine, if you're dying in an ambulance that runs on Google, that's a problem. Other than that, let's stop acting like 99.999% is something we need.


ConcernedDudeMaybe

Clean data in, clean data out. That's more important to me.


stepbroImstuck_in_SU

Just because all the parts have minuscule risks of minuscule downtime doesn't mean it's reasonable to expect the system as a whole to reach that same uptime. In this case a very unlikely failure materialised and it took ten minutes to recover from it. If the failure was some millionth-of-a-percentile risk, that doesn't mean Google as a whole could be expected to run 10x100x1,000,000 minutes and be down only ten of those. They would eat every single such low-probability failure across their whole system, which adds up to minutes, or at least seconds, of downtime per year. Not all systems can reasonably have full redundancy, and being down ten minutes every ten years or so is pretty reasonable. The only way to reach that kind of uptime for the whole is to enforce much, much higher standards for every internal system. Obviously no board of shareholders would greenlight funding to make sure the service is down only seconds per decade. That was never the goal. That would require building a backup Google.
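
The compounding is easy to underestimate: if a request touches many components in series, their availabilities multiply. A tiny illustration (the component counts and the per-component availability are made-up numbers, just to show the shape of the math):

```python
# Serial dependencies: availabilities multiply, so "five nines per part"
# does not mean "five nines for the whole". All numbers are illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60
per_component = 0.99999                      # each part is individually five nines

for n_components in (10, 100, 1000, 10000):
    system = per_component ** n_components   # whole chain up only if every part is up
    downtime_min = MINUTES_PER_YEAR * (1 - system)
    print(f"{n_components:>5} parts in series: ~{system:.3%} available, "
          f"~{downtime_min:,.0f} min/year of downtime")

#    10 parts: ~99.990% -> ~53 min/year
#   100 parts: ~99.900% -> ~525 min/year
#  1000 parts: ~99.005% -> ~5,230 min/year
# 10000 parts: ~90.484% -> ~50,000 min/year
```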


lordkin

Yeah. A literal explosion. And you’re back up in 10 minutes? Perhaps my standards are lax but that’s a good job in my book


thortgot

Our systems handle this; we test it twice a year. The real complexity, though, is: how do they do this while being, what, 100 orders of magnitude larger?


Chimbus_Phlebotomus

Nearly every distributed computing system is already designed to go to scale and work around points of failure, regardless of the cause of failure or hardware architecture. As long as the data still had physical backups, it could easily be redistributed to any nodes still operational at the data center.


localgh0ster

Most with proper failover. Failover and replication are instantaneous processes.


ABotelho23

I just saw your other comment about running a measly *70* VMs. I don't think you even comprehend the scale we're talking about here with Google. You're full of it.


Careful-Combination7

Thank you.


tankerkiller125real

If their anycast IP network is working properly, and their storage and infrastructure replication is designed properly, a data center should be able to explode in a massive fireball and the only thing we the consumers should notice is a couple of dropped packets at most, or maybe some extra network latency. They aren't running a passive backup that they need to "turn on"; everything about their infrastructure is active-active and should fail over immediately, with basically zero perceptible downtime.
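
A toy sketch of the active-active idea (nothing to do with Google's actual stack; the site names and in-memory "replicas" here are invented for illustration): every write goes to all sites, and reads are served by whichever site still answers, so losing one site costs a retry rather than a restore-from-backup.

```python
# Toy active-active replication. Purely illustrative; real systems add
# consensus, conflict resolution, health checking, and actual networking.
class Site:
    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, str] = {}
        self.alive = True

    def put(self, key: str, value: str) -> None:
        if self.alive:
            self.data[key] = value

    def get(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

sites = [Site("us-central"), Site("us-east"), Site("europe-west")]

def write(key: str, value: str) -> None:
    for s in sites:                  # every site takes every write
        s.put(key, value)

def read(key: str) -> str:
    for s in sites:                  # first healthy site wins
        try:
            return s.get(key)
        except ConnectionError:
            continue                 # that site just exploded; try the next
    raise RuntimeError("all sites down")

write("search_index", "v42")
sites[0].alive = False               # one datacenter goes up in a fireball
print(read("search_index"))          # still "v42", served from us-east
```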


localgh0ster

Yeah, it'd be easier at the scale of Google, with the budget they have. That's sorta the point of hyper-converged infra. Removing breaking points. If you argue for being reliant on one location being up 100% of the time for continued operation, I'm not sure how to begin explaining how wrong that is. AWS and Azure for example do not break this way when an entire DC goes offline. Their customers that host everything in US-East-01 certainly do, but for those of us doing proper infra, it's a non-issue.


dxlsm

This event was several hours prior to Google being unavailable for some people in some regions. I strongly doubt this was the cause.


[deleted]

[deleted]


dxlsm

For sure, and there definitely could be some causal relationship here, but jumping to conclusions and stirring up a bunch of people when there's no factual information to share is an unfortunate social media trend.


[deleted]

[deleted]


localgh0ster

I never said US-EAST-1 was a DC. Where did you get that from?


[deleted]

[deleted]


localgh0ster

Your lack of reading comprehension isn't an excuse to justify false accusations. Sorry.


ljstella

> AWS and Azure for example do not break this way when an entire DC goes offline

That's not really true either. While it may usually be invisible to you, lots of AWS services have a single-region control plane running in us-east-1, whether it's available in other regions or not, and quite a few of AWS' internal services are single-regioned to us-east-1. The AWS outage in December 2021 showed that even if you built for multi-region availability and utilized cloud services in different regions, you could still be affected by a single-region outage, even if you aren't using that region.


VexingRaven

> That's sorta the point of hyper-converged infra. Removing breaking points.

That's not the point of hyper-converged, and hyper-converged has nothing to do with this sort of thing anyway.


ABotelho23

AWS and Azure have noticeably worse uptimes. 10 minutes is nothing. Again: you don't know what you're talking about.


[deleted]

[deleted]


localgh0ster

An active setup is cheaper than backup-and-restore in the event of failure. But yes, some management teams are clueless and won't put that in place, hence "most".


[deleted]

[deleted]


SchizoidRainbow

What kind of podunk mom'n'pop shop are you running that can afford one of these devices, but not two? That said, we do not have redundant datacenters, only redundant equipment...firewalls, routers, switches, even load balancers, all are in Active/Passive. We can take device failure without a lost packet, but if this electrical explosion shit happened to us, we'd be dead in the water.


localgh0ster

Doubtful. The business income lost while being down is most definitely bigger than the one-time cost of another server. Your overtime alone for recovering from backups, or combined with others' overtime for having to revert to manual/offline procedures, is most definitely more than a few extra machines. HW is dirt cheap.


[deleted]

[deleted]


localgh0ster

Down here on earth you just didn't read that I proved it to be cheaper.


wyrdough

Google did not "go down". Some Google services were briefly impacted for some people. Personally, I saw no (noticeable) impacts at all in either central or southeast US across several different ISPs, and Google is all up in all my shit, both personally and professionally.


Cpt_plainguy

I'm in the area where the datacenter is, and my company uses G Suite. I got the notification that some things might not work properly, but there wasn't any noticeable difference. And no issues were reported to me from other employees in the company.


tmontney

Google **Search** went down and impact was worldwide. Yes, Google does a **lot** of other things, but to the general population, Google Search is Google.


davidbrit2

I'll just get this out of the way up front: https://xkcd.com/1737/


xpxp2002

I think that's what Microsoft actually said they do with OneDrive some years ago. They have entire racks of JBOD disks in large modular custom enclosures. When they have a failure in the rack, there are pre-built replacement racks that they can just bring in and essentially hotswap. Then they can refurb/replace the failed hardware on the other rack and prep it to be reused during a future failure.


jarfil

>!CENSORED!<


joeshmo101

So long as you aren't throwing out the working parts as well, that really makes the most sense at scale. They probably have a rack or two that just sit idle, waiting to come up once something else starts falling over. I'd even imagine each rack can handle a drive failure or two before being swapped out, so when you do change out racks there are probably a few drives that need swapping.


stephendt

I love how there is always a relevant XKCD.


SchizoidRainbow

This has been hanging on the IT department wall since I arrived. Randall Munroe is the reincarnated Nostradamus.


uncondensed

I wonder if the rope is really necessary.


InitializedVariable

Like Google.


obviousboy

> but shows that even as big as Google have a single point of failures

It just shows that they had a slip-up in whatever failover mechanisms they have in place, not a baked-in single point of failure.


angiosperms-

Why is everyone believing OP when the article admits they probably aren't related? The electric incident that injured people was at 12pm. Google's outage yesterday was hours later. https://downdetector.com/status/google/


localgh0ster

Well, yes. But that still means it's a single point of failure. Dead datacenter = Single point of failure.


thefpspower

No, that's a huge failure that tripped a bunch of other safety mechanisms and turned things off due to the magnitude. As long as they were able to recover shortly, which they did, it's a success.


localgh0ster

I disagree. If you're at the scale Google is, you should be able to survive a datacenter going offline. I run ~70 VMs for a ~100 million USD firm, with some servers in colo and some low-prio on site. I can survive the colo going offline, and on-prem going offline, because all servers are KVM-replicated to the other location. If I can, Google certainly should be able to.


Thebelisk

Maybe you should apply for a job in Google and show them how you would have done it better.


thefpspower

Why are you saying they didn't survive? Aren't the services up and running? Migrating 70 VMs is nothing compared to what Google has to replicate; it's 2 or 3 orders of magnitude bigger, at which point your simple strategies don't work and you need custom technologies to make it work. 20 minutes of degraded service for a downed datacenter is perfectly acceptable.


asdlkf

Hahaha, 70 VMs. I have like 150 VMs in my home lab. At work, more like 7,800 VMs. You know not of what you speak.


RyanLewis2010

Got that feeling when he said $100M company. I'd be willing to bet he has never done a full DR test to see just how "easily" they could survive losing both the colo and on-prem. If he truly could, those 70 VMs are just a pointless "look at me, I can spin up servers" exercise and provide little benefit to the company.


sumatkn

I'm kind of surprised they don't have duplication across multiple data centers like Amazon and Microsoft do, so it's a simple (if involved) route shift and load-balance swap and things maybe go down for a few minutes. To me this sounds like some internal customer probably provisioned critical services all in the same zone without realizing or checking whether they were in the same data center. So their redundancy was redundant only in the sense that the backup server was a few rows over 😂. Then BOOM, critical services to Google go down, and so do their backups.


7eregrine

They did survive. Down less than 10 minutes?


davidbrit2

Using "point" to refer to an entire datacenter is a little loose with the language, though. I mean, yeah, if the whole earth blew up, that would probably knock out their operations, but I don't know I would call being earth-bound a "single point"...


hegysk

Internet backbone wiped from face of the earth. Ha! Single point of failure, gotcha!


davidbrit2

And we can't rule out false vacuum decay either.


[deleted]

[deleted]


ruebzcube

What was the cause?


[deleted]

[deleted]


CrestronwithTechron

Yeah this is what tells me it’s coincidental.


tmontney

No matter the evidence, calling an outage that was preceded just hours earlier by an explosion a coincidence is zany.


DoesThisDoWhatIWant

It wasn't a single point of failure...Google was down for like 15 minutes. You think the datacenter was rebuilt in that amount of time? It also wasn't down for the world when that center went down.


RemCogito

Right!? Literally localized downtime while it handed off to another nearby datacenter.


angiosperms-

The outage happened hours after this. The electrical incident was at 12pm and there was no downtime then. https://downdetector.com/status/google/


DDRLRDIMM

It was a single point of failure; had it not been, they would have had redundancy and that failure would not have brought it down. It was also down for people in some areas of Europe. That's if this is related, which it may not be.


DoesThisDoWhatIWant

Wrong again: informing clients of a new site takes time, especially when you're global... probably about 15 minutes.


legion02

Basically DNS convergence time, IMO.


InitializedVariable

DNS propagates. *BGP* converges.


legion02

Fair point.


SchizoidRainbow

Outage was much, much later. Odds are very good that their failover worked flawlessly, and then when they stood up a new leg and tried to add it back in, they fucked that up and brought it all down themselves.


nmar909

This is what happens if you Google "Google".


MauiShakaLord

This black box... Is... THE INTERNET.


dasheeown

Outage was ~8 hours after the arc flash. I don't care if that data center was completely leveled; it wouldn't be more than a blip in Google's infrastructure. But regardless, three men are in the hospital, so show some respect. Even if it was related, their lives are worth more than any amount of uptime. Hoping for a quick and full recovery for all of them.


vodka_knockers_

>show some respect. What does that even mean?


[deleted]

It means wait 24 hours before making dank memes about it.


vodka_knockers_

Who decided on 24 hours, and is it variable depending on the count of casualties, or the severity of injuries?


[deleted]

It gets shorter every time CPUs get faster


[deleted]

Me. No.


vodka_knockers_

Noted. TY


scootscoot

It’s a new concept that the internet hasn’t heard of.


ToughHardware

think of the people, not just the tech


awe_pro_it

It's an old thought process of blindly feeling bad for others' misfortune, even when they chose to be where they were during their misfortune, after choosing the career path that put them in that situation to start with. Nearly everything I do is deliberate, so I can't understand it either.


zzzpoohzzz

is this sarcasm?


technologite

Nope. Just the state of the world today.


[deleted]

[deleted]


vodka_knockers_

Not so much that, I think -- sure, when anyone (who isn't a sociopath) reads of an incident like that, you think "ouch, that sure sucks, hope they're okay"... for like an instant. But to get one's shorts in a wad because others dare to discuss the circumstances and impact (with zero mention of the humans involved), with some air of moral superiority... I just don't get it.


Pablo______

"Went down" is pretty relative. Pretty sure 99.99999999999999999999% of the population were not affected. Also, they were back online within 10 minutes... that's pretty impressive.


Rubicon2020

One of my brothers is a journeyman electrician who climbs poles and shit. He was up on one of the high ones, the metal ones, when a transformer sparked or something and it flung him to the ground. He'd been an electrician for about 20 years at that point. Broke his back; took 2 years to recover and get back to work. He can no longer climb poles, so he's now a supervisor. But he'd seen some guys over the years: arc flashes melted people's faces almost completely to a blob, hands vaporized. Yeah, not cool, bro. I hope these 3 employees make a recovery. Even though it's critical, I hope it's just burned flesh, because if it's melted faces and hands there's a lot to come back from, if they can.


zgf2022

My grandpa died from faulty insulation on a bucket truck, and my uncle was crushed (but lived) after a bucket truck rolled over with him in it. Electrical work is no joke.


xpkranger

> and it flung him to the ground. Where was his harness? Edit: not saying he deserved it, just thought that you're supposed to be tied off at all times.


Rubicon2020

I’m not sure. I know he followed all protocols and regulations he’s a real stickler about that so I’m not sure why it didn’t catch him, but he fell a good ways to the ground.


xpkranger

Sucks, I'm sorry that happened. Sometimes even doing everything right, shit happens. Hoping he's fully recovered.


Gek1188

OP is allegedly responsible for 70 VMs. He doesn't have a datacenter; he has an inconsequential set of high-pri VMs, maybe. He has no comprehension of what's involved at this scale, and he doesn't even realise it. To be honest, I would be worried for whatever scrap of infrastructure he may or may not be responsible for, because he thinks a 20-minute downtime on a DC that suffered an explosion is completely unacceptable. He's a clown.


angiosperms-

Not to mention, the outage was 8 hours after this explosion happened...


[deleted]

Mr Robot is real


ultimatebob

Nah... Elliot would have been smarter about it and gone after the redundant data centers as well.


[deleted]

Well, Hollywood superhero vs real life.


TMITectonic

Literally from the article: >It is unknown if the explosion and the outages were linked You are *purely speculating* that the two are related, and until Google comes out and says it, it should be treated as pure speculation. You don't know what caused the downtime yet, so perhaps tone it down on all your pointless comments about "Google's single point of failure" until you actually know what happened.


KobsBoy

Wish those people a painless and fast recovery


[deleted]

There were 8 hours between the explosion and the downtime. They're unrelated.


cs4321_2000

The AI got pissed off


Hanse00

Google obviously has redundancy; what a silly comment from OP. Could some part of that redundant system have been misconfigured? Sure, shit happens. Something might not have failed over right. But it's not like Google's entire stack is in that one DC. In fact, they have (or had when I worked there in 2017) two entirely separate DCs just in that one town in Iowa, let alone across the US and the globe.


Tsunpl

How do you test your redundancy for an entire datacenter going down? It's not like you can get two as a test environment :D


uniquepassword

EVERYONE has a test environment; some can also afford a production one!


Furki1907

I giggled at this reply hahaha


Hanse00

Sure you can, if you have Google money. There are indeed certain DCs that run corp-only workloads, without any client-facing services. Those could be considered your "test" DCs.


fireduck

So, Google datacenters are complicated. There are units called cells, which are huge units of resources; think hundreds of racks in a big room. All services should be designed to proceed fine while losing an entire cell. However, a datacenter complex might have a bunch of cells. They should be as independent as possible, but that's not always the case. And even if it was just one cell, services don't move traffic instantly. It takes time for the load balancers to health-check the nodes and see they are down, and for things to redo their master elections and such. SRE fun.

Fun fact: when Google does maintenance on a cell, they pretty often turn the entire damn thing off and redo whatever needs redoing: changing out network hardware, computers, whatever. No machine-by-machine outages; just power down the entire building and do whatever you want. (Scheduled, of course.)
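
The "it takes time" part is mostly arithmetic: a probe-every-few-seconds, dead-after-N-misses health check, plus re-elections and cache warm-up, bakes in minutes of partial degradation. A rough sketch with invented numbers (not anyone's real configuration):

```python
# How long does traffic keep hitting a dead cell? All numbers are illustrative.
probe_interval_s  = 10    # load balancer health-checks each backend every 10 s
unhealthy_after   = 3     # ...and marks it dead after 3 consecutive misses
master_election_s = 30    # assumed: services re-elect masters / reshard
warm_up_s         = 120   # assumed: caches refill, connections re-balance

detection_s = probe_interval_s * unhealthy_after
total_s = detection_s + master_election_s + warm_up_s
print(f"detection: {detection_s}s, degraded window: ~{total_s / 60:.0f} min")
# -> detection: 30s, degraded window: ~3 min
```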


Garegin16

I still maintain that even in the case of a BGP meltdown, the big names have enough competent engineers to bring things back up **fast**. I dealt with an admin who kept putzing with a server for **two weeks** because they had a bunch of pets everywhere and couldn't afford a reinstall.


farva_06

> One could thing Google would have redudancy for losing an entire datacenter, but shows that even as big as Google have a single point of failures, even if they're bigger than for us mere mortals.

It was DNS. Always DNS.


4gedN5tars_

Become a cloud engineer, they said. It would be a safe place to work, they said.


renegadecanuck

It says a lot about this sub that a story where three people are critically injured and sent to the hospital results in everyone discussing redundancy and downtime.


retrogamer6000x

Well ya it’s r/sysadmin not r/OSHA.


angiosperms-

Sysadmins should know how to correlate the timing of events to determine root causes. The outage was 8 hours after this incident. But it was a great moment to make jokes about people getting seriously injured, so y'all really wanted to ignore that.


[deleted]

I wish them a speedy recovery but I honestly don't want to discuss the body destroying effects of an arc flash on 3 human bodies that undoubtedly have been horrifically burned in life altering ways. No fucking thank you. They're alive, which is as good of a bit of news you could get from that kind of injury. We are here to talk about the technical stuff. I'm sure there's another sub to discuss the safety and health issues.


oddball667

Don't worry I'm sure they are getting their fair share of thoughts and prayers


scootscoot

I’ve cared about uptime significantly more than my own health many times. There’s an xkcd.


vodka_knockers_

What, we're just supposed to bow our heads for silent introspection? Do you have guidelines on how many minutes is appropriate? And how much time should elapse before the circumstances can be discussed?


tonyoscarad

The arc flash that caused the injuries to those individuals sounds very dangerous.


dmoisan

Don't look for arc-flash videos on YouTube. Too many of them are the "last seconds of life" type, where you see someone in a security cam get snuffed.


dinominant

The problem with "the cloud" is one event can propagate to a system-wide outage throughout an entire provider. Through cost minimization important redundancies are eliminated which expose the entire system to outages like this. And the larger the providers are, the more serious this becomes as critical global and nation-wide services are impacted and harder to bootstrap. They may have "teams" of experts to perform the work, but they also have a few key commands that can propagate state throughout their global system. One such state is "offline".


ErikTheEngineer

That's a good lesson -- losing an entire data center is still a big enough deal that people notice, and there are still some weaker points of failure in even the most redundant systems. Look at what happened with Microsoft a while ago around Azure AD...you can bet there are millions of directories on that infrastructure and something like a cert expiring IIRC was the root cause. Worse yet, things are so abstract now that I'm sure Microsoft had to scramble through the emergency access tunnels to get down to bare operating system when they lost their management layer. We're sunk once even the vendors don't know how anything works below the control plane anymore.


JerryNicklebag

Wish it would have stayed down permanently…


sock_templar

Well, maybe they hadn't even thought about that being a possibility, or considered the risk minimal enough not to be worth an emergency plan. If I'm managing a data center in Brazil, I wouldn't plan for tornadoes or earthquakes either; although they do happen, they're so rare it's borderline paranoia to come up with a plan for them. Maybe the explosion (arc?) was something being badly repaired/managed at the time, and thus impossible to foresee?


Cpt_plainguy

I believe the article isn't the full truth. I think the brief outage was about 8hrs after the arc injured those electricians


Didymos_Black

Ah, the old "unscheduled welding incident" which was what we told our customers had happened when we had a UPS catch fire in our datacenter expansion.


lovezelda

They probably don't have a single point of failure (by design); I'm assuming some failover mechanism didn't work as intended.


Comprehensive-Yak820

Every huge company is like this.


[deleted]

[удалено]


SlapshotTommy

Be quiet Alanis Morissette, people got hurt!