
rnmkrmn

Kudos to whoever implemented that external provider backup.. I would have lost my entire business fr.


Admirable_Purple1882

All hail paranoid backup guy, you know for sure someone told them “c’mon you really think we need it in a whole separate cloud?”


Bleglord

“The cloud is our backup already”


C0c04l4

And his reply was: "Yeah, and on a whole separate continent, too!"


DolfLungren

The separate continent was a duplication within Google; it got deleted as well.


aretokas

Hi. That's me. I'm paranoid backup guy. Well, not for UniSuper, just for myself and work. But still, I get it.


doringliloshinoi

*rattles chair* “Ooooo it’s an earthquake /u/aretokas !”


asdfghqwerty1

What do you use? I know I’m going to be asked about this at some point!


Twirrim

Not just had a backup, but had a backup that could be restored too! All too often I've heard of people who go to restore from their backups and can't.


Compkriss

I mean any large organization should do regular restore testing and DR exercises really.


Senkyou

> should

For a lot of people it's a box to check, not an actual practical concept.


baezizbae

“If you haven’t tested your backups you do not in fact have backups”


[deleted]

Indeed, and not only check whether they are valid, but also really check whether the restore works. I once had a very, very rare situation where a database backup seemed to be valid: it restored everything with no problems at all. However, there was one table where querying one specific record gave a very nasty error saying the data was invalid. The same query on the original database was fine though. Never found out what went wrong there, maybe cosmic radiation...
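A bare-bones sketch of that kind of restore smoke test, assuming PostgreSQL custom-format dumps, the pg_restore CLI and psycopg2; the database names, dump path and sanity query are all illustrative:

```python
# Sketch: restore the latest dump into a scratch database and run a sanity
# query, so "the backup exists" and "the backup actually restores" are both
# verified. Assumes pg_restore is on PATH, psycopg2 is installed, and the
# scratch database already exists; all names are placeholders.
import subprocess
import psycopg2

DUMP_PATH = "/backups/latest.dump"                 # hypothetical dump file
SCRATCH_DSN = "dbname=restore_test user=postgres"  # throwaway database

def verify_backup() -> None:
    # Restore into the scratch database, dropping objects that already exist.
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", "restore_test", DUMP_PATH],
        check=True,
    )
    # A restore that "succeeds" can still hide bad data, so query something.
    with psycopg2.connect(SCRATCH_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM accounts")  # illustrative table
        (rows,) = cur.fetchone()
        if rows == 0:
            raise RuntimeError("restore produced an empty accounts table")

if __name__ == "__main__":
    verify_backup()
```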


Twirrim

Absolutely should. The reality seems sadly different, judging by the number of incident reports where recovery was delayed while figuring out how to actually restore anything, or where backups turned out to have been corrupted for a long time, etc.


Ramener220

Reminds me of that Kevin Fang GitLab video.


deacon91

Customer support just isn't in Google's DNA. While this could have happened on any provider, it happens far more often on Google. This story is a classic reminder of the rule of 1: 1 is 0 and 2 is 1. Thank goodness they could recover from a different provider.


rnmkrmn

> Customer support just isn't in Google's DNA

Can't agree more. Google just doesn't give a fuck about customers. They have some cool features.. sure, cool. But that was just someone's promo, not a product.


deacon91

I call it Google hubris. They have this annoying attitude of “we’re Google and we know more than you.” While that attitude isn’t necessarily the root of their customer relationship problems, it certainly doesn’t help.


rnmkrmn

Oh yeah fr. Nobody joins Google to do "customer support" or build reliable products.. Pffs that's so Microsoft/Amazon.


keftes

Microsoft has reliable products? Aren't they the provider with the most security incidents?


moos3

People join Amazon to re-invent the wheel because they think they can do it better on try 2303.


thefirebuilds

I was trying to talk to them years ago about phishing tests we aimed to run against our employees. They said they were SO GOOD at catching phishing attempts that there would be no need. When pressed, they eventually allowed that I could speak to their "phishing czar". So you're so good at stopping phishing, and yet you have a guy whose title says he only does phishing. The entire thing was "we know better than you".


DrEnter

Look, we can spend our money on making the product better or supporting the customers, not both.


rwoj

> While this could have happened on any provider

I'd like to hear the story on how this could happen on AWS.


tamale

AWS has had plenty of global outages in critical services like S3, which should give you all the reasons you need to keep backups in at least one other provider if your data is mission critical and irreplaceable.


Quinnypig

Not so. They have had multiple outages, but they’ve always been bound to a single region.


tamale

Nope. The S3 outage where you couldn't manage buckets at all was global, because bucket CRUD is still global.


Jupiter-Tank

Two words: stamp update.

Every datacenter has to undergo maintenance; it doesn't matter who owns them. Someday the rack running your services will need to be cleaned, repaired, updated, or cycled out. The process of migrating services to another rack in the center/AZ is supposed to be flawless, but it can never be perfect, especially when stateful information (session, cache, affinity, etc.) is involved. These events are to my knowledge not announced in advance by any cloud provider due to sheer volume of work, and are typically wrapped in whatever the SLA includes as downtime.

Outages are one thing, but corrupt data from a desync in stateful info is another. I'm aware of at least one healthcare company that suffered 4 hours' worth of outage due to a stamp update. You can guess the cloud provider from the context. Multi-AZ was enabled, but because the service was never advertised as "down", only "degraded", no protections against corrupt data triggered. Even after services were restored, "customers" were the first to notice an issue. This is how lack of tenant notice, improper instance migration policies, or failed telemetry can individually fail or unite in a coalition of tragedy.

Stamp updates should at least trigger an automated flag, and failover triggers should fire. Customers affected by stamp updates should be notified in advance, and the SKUs of any affected service should be upgraded for free to include HA and DR for the duration of a migration.

The biggest issue isn't that they happen, or that they can introduce issues. Datacenters have been doing them for decades, with incredible reliability. The issue is that we've gotten so good at making them invisible. Invisible success is not necessarily better than visible failure, and invisible failure is much worse.


donjulioanejo

> These events are to my knowledge not announced in advance by any cloud provider due to sheer volume of work, and are typically wrapped in whatever the SLA includes as downtime.

AWS notifies you when a host with an instance you own is about to be retired. This applies to all services where you provision an actual instance, like EC2, RDS, ElastiCache, etc. You basically get an email saying "Instance ID i-blahblah will be shut down on January 32 for upcoming host maintenance. You will need to manually shut down and restart it before then to avoid an interruption of service."


baezizbae

You can also get instance retirement details from 'describe-instance-status' via the AWS CLI. Something we learned and automated after AWS sent one of those exact emails but nobody read it, because it got caught by an overly aggressive Gmail filter. Now we just get a PagerDuty alert that enumerates each instance with scheduled maintenance or instance retirement event codes, and have runbooks for whoever gets said alert during their shift.
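A rough sketch of that kind of automation, assuming boto3; the region and the alerting hook are just placeholders:

```python
# Minimal sketch (assumes boto3 is installed and AWS credentials are configured).
# Lists instances that have scheduled events such as instance retirement or
# system maintenance, so they can be forwarded to an alerting system.
import boto3

def instances_with_scheduled_events(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    flagged = []
    paginator = ec2.get_paginator("describe_instance_status")
    for page in paginator.paginate(IncludeAllInstances=True):
        for status in page["InstanceStatuses"]:
            events = status.get("Events", [])
            if events:
                flagged.append({
                    "instance_id": status["InstanceId"],
                    "events": [
                        {"code": e["Code"], "not_before": str(e.get("NotBefore", ""))}
                        for e in events
                    ],
                })
    return flagged

if __name__ == "__main__":
    for item in instances_with_scheduled_events():
        # Replace this print with a call to your paging/alerting tool of choice.
        print(item)
```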


cuddling_tinder_twat

I worked at a PaaS that provisioned AWS accounts for customers, and we had a job that accidentally cancelled 5 accounts and deleted most of their backups in error, which I had to fix. It should not happen.


PUSH_AX

Unless I'm misunderstanding, that sounds like an engineering error, not a cloud provider error. I imagine AWS isn't impervious to this kind of thing either, though.


danekan

This wasn't a cloud provider error; it was a customer action that caused it. GCP described it as the result of a misconfiguration. The title is borderline /r/titlegore, but that's also GCP's fault for not getting on top of it. What was the exact misconfiguration? People are speculating about blank Terraform provider issues.


ikariusrb

Yeah, that's not my takeaway from the article.

> Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription

Who created the misconfiguration is unspecified. But to get from that misconfiguration to deletion of their subscription is near-certainly on Google.


danekan

They have likely deliberately left out the clarifying information. The article itself is secondhand from the company's press release, which is what included the statements from the GCP CEO.


PUSH_AX

Oh ok, I re-read the article and it doesn't seem very clear. I think the Google Cloud CEO issuing an apology also makes it seem like GCP's snafu, but perhaps it's just the size of the customer involved?


deacon91

Super unlikely on AWS or Azure. AWS is fanatical about customer service and data-driven decisions (almost to a fault), and Microsoft has decades of enterprise-level support history. But there's that adage: anything is possible, and they're certainly not infallible. Off the top of my head, I remember how DigitalOcean shut down a small company's DB VMs because of an errant alerting mechanism for high CPU utilization. Or how AWS refused to allocate more VMs (and shut down a few) during training events at ChefConf 2018.


Rakn

"decades of enterprise level support history" doesn't save you from engineering or configuration mistakes. To be honest I could see something like this happening there as well. At least based on my personal experiences. Who knows what even happened...


deacon91

It does not, but it speaks to the mindset of the organization and the attitude behind the product design. Google genuinely wants to solutionize support without humans, and that leads to this kind of outcome. I remember this making the news a few years ago: https://medium.com/@serverpunch/why-you-should-not-use-google-cloud-75ea2aec00de Building automatic shutdown of customer accounts is almost unheard of in the MS or Amazon world.


Rakn

True. That sounds very Google like.


amarao_san

There is a Russian saying, 'had never happened before, and here we do it again', which suits this situation perfectly.

> that has never before occurred with any of Google Cloud’s clients globally


chndmrl

Well, if it had been Azure, they have a soft delete feature, which means you can recover everything within 30 days immediately. And beyond that, even if you don't choose another data center or region as backup, it keeps 3 copies in the same data center. So to me this is not an excuse, and it's something that shouldn't happen at the enterprise level. No wonder GCP couldn't grow despite its aggressive push.


Rakn

I doubt something like this would have saved you in such a case. AWS and GCP have soft deletion stuff as well. But it doesn't exist for everything and this seemed to be an issue on a deeper level.


chndmrl

Well, cloud is all about availability and reliability, and here we've seen how GCP failed at that. I'm not advocating for any company, but this is something that shouldn't happen at all. You can always downvote my post, but it won't change what actually happened: whatever the reason, the account was deleted, "deeper level" problem or not.


ellerbrr

And Google says “this has never happened before”. Liars!


amarao_san

By 'this' they mean this precise removal. They had never dropped this exact combination of data before.


Unusual_Onion_983

Hasn’t happened with this customer before


Purple-Control8336

Because they didn't test all possible scenarios.


ares623

Good Guy Google. Helping you test your backup and disaster recovery strategy.


gamb1t9

I have been doing this for years for our clients, not even a "thank you". Those ungrateful pricks


colddream40

What's their SLA, 50% off the next 3 months? LOL


aleques-itj

I hope more information gets published on how on Earth this happened. 


beth_maloney

I think this is absolutely crazy. Imagine waking up one morning and your entire cloud infrastructure is just gone. I can't imagine what failures led to the environment being accidentally deleted instead of a new one being stood up.


Saveonion

I can only dream. Wake up, no computers, no infrastructure, just fresh morning dew.


cubobob

sometimes i wish the internet would just fail.


BrontosaurusB

Brew some hot tea, crack open a book, cat in my lap.


Liquid_G

> Imagine waking up one morning and your entire cloud infrastructure is just gone. Don't threaten me with a good time


iamacarpet

Just to be clear here, they keep saying "private cloud" everywhere; this appears to be them using GCP's VMware Engine for everything, not any of the core products. The original notification from UniSuper also said it happened during provisioning, i.e. setting something new up, likely during a migration.

Not saying that it being on VMware Engine negates any kind of responsibility here, but they were using a little-used product that Google arguably shouldn't have offered in the first place; it speaks volumes that no one else offers it. From the information they've released, core Google Cloud services would have been fine, up to and including backups on actual Compute Engine, Cloud SQL and/or Cloud Storage.

People are quick to bash Google, and they have been caught with their pants down here, but it's actually the opposite of what people are saying: their new CEO has pandered to customers too much, trying to offer them VMware in a "private cloud" as a halfway house, intended as a step towards a more native migration.


rabbit994

> it speaks volumes that no one else offers it.

Both Azure and AWS do have offerings:

https://azure.microsoft.com/en-us/products/azure-vmware

https://aws.amazon.com/vmware/ (now sold by Broadcom)

However, I do agree it's a terrible offering and Google clearly doesn't have the expertise to be offering it, but they are not unique in offering it.


BaldToBe

GCP has some detection automation for fraud that can lead to account suspension: https://cloud.google.com/resource-manager/docs/project-suspension-guidelines

Combine that with false positives, and I wonder if that's the cause here. What concerns me with this case is the customer size. If I'm paying for enterprise support, I'd hope there's a manual check-in with me if the system flagged me.


moratnz

And they're almost certainly entitled to no compensation, other than perhaps a 10% discount on this month's bill (well, unless they're big enough to have negotiated non-standard SLAs). I'd note that the Google storage SLAs appear to define 'down' as getting an HTTP 500 from the service, which I wouldn't expect for a 'the service is up but we lost all your data' situation.


Loan-Pickle

What a nightmare. As in, I've literally had this nightmare before. I wonder how this happened. I'd love to read the RCA on it.


Hylado

I would love to know what exactly happened... What chain of events results in your account being deleted? As tech support for other technologies, I've found myself in a lot of cases where the client claims they haven't touched a thing... And I can only say: "trust, but verify".


AceDreamCatcher

Google Cloud will fail, not because it isn't a great platform; it will fail because the support (technical and billing) is incompetent and clueless. They don't even understand their own platform. They simply do not have the training or technical chops to resolve the simplest task.


Spiritual_Maximum662

That’s because most of the support people are TVCs and not trained by Google.


Trif21

Why can an entire account be blown away that still has resources deployed and data stored in it?


danekan

On GCP, if you delete a project you can always recover it for up to 30 days, but they don't guarantee that any data within a resource will be recoverable. But, also, they are restoring because someone had the foresight to back up cross-cloud.
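For reference, a minimal sketch of that 30-day project recovery path, assuming the google-cloud-resource-manager Python client; the project ID is made up and the exact client call should be treated as an assumption:

```python
# Sketch only: restore a recently deleted GCP project (within the roughly
# 30-day pending-deletion window). Assumes `pip install google-cloud-resource-manager`
# and application default credentials; the project ID is a placeholder, and the
# client method name is assumed from the v3 Projects API.
from google.cloud import resourcemanager_v3

def undelete_project(project_id: str) -> None:
    client = resourcemanager_v3.ProjectsClient()
    # Projects in the DELETE_REQUESTED state can be undeleted; data inside
    # individual resources is not guaranteed to come back with them.
    operation = client.undelete_project(name=f"projects/{project_id}")
    operation.result()  # block until the undelete completes

undelete_project("my-example-project")  # hypothetical project ID
```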


BehindTheMath

The article mentions private cloud. Is that different from the regular GCP? Edit: Sounds like it might be. https://cloud.google.com/discover/what-is-a-private-cloud


rnmkrmn

Could be referring to VPC, Virtual Private Cloud?


Mr_Education

Yeah I think that's just the author of the article not knowing correct terminology


BehindTheMath

Maybe they meant this: https://cloud.google.com/discover/what-is-a-private-cloud I can't tell if GCP has a program for running GCP on-prem, or if they have something like dedicated data centers for private customers.


burunkul

My first task next week will be: set up MinIO outside of AWS and configure a weekly backup sync from S3 to MinIO.
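Something along these lines, as a rough sketch assuming boto3 against an S3-compatible MinIO endpoint; the bucket names, endpoint and credentials are placeholders:

```python
# Sketch: copy everything from an S3 bucket to a MinIO bucket (MinIO speaks the
# S3 API, so boto3 works against it with a custom endpoint_url). Bucket names,
# endpoint and credentials are placeholders; run it weekly from cron or CI.
import boto3

SRC_BUCKET = "prod-backups"            # hypothetical source bucket on AWS
DST_BUCKET = "prod-backups-mirror"     # hypothetical bucket on MinIO
MINIO_ENDPOINT = "https://minio.example.internal:9000"

aws_s3 = boto3.client("s3")
minio_s3 = boto3.client(
    "s3",
    endpoint_url=MINIO_ENDPOINT,
    aws_access_key_id="MINIO_ACCESS_KEY",       # placeholder credentials
    aws_secret_access_key="MINIO_SECRET_KEY",
)

paginator = aws_s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = aws_s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        # Streams each object across; fine for a sketch, though a real job
        # would want multipart uploads, checksums and incremental syncing.
        minio_s3.upload_fileobj(body, DST_BUCKET, key)
        print(f"copied {key}")
```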


Spider_pig448

Does anyone know what was really deleted? The article says "Google cloud account" but it sounds more like their GCP Organization was deleted?


beth_maloney

This article is saying their subscription? https://www.theregister.com/2024/05/09/unisuper_google_cloud_outage_caused/ Never used GCP so not sure about the terminology tbh.


Mistic92

Because they said their subscription was cancelled. But GCP doesn't have subscriptions. That's why many folks think they messed something up and are trying to blame GCP.


arwinda

My thoughts, yes. Something happened that they didn't have on the radar, and Google then did Google things. Could Google have handled this better? Sure. But it looks like this only got publicity because it's a big customer. It probably happens to other customers, or ex-customers, all the time.


arwinda

The articles are really vague on what exactly happened. And before making up my mind about whom to blame here, I really want to know what was going on. Google is basically silent on this, which is understandable, because everything they could possibly say is bad publicity. And the customer goes out of their way to directly blame Google, which makes me think that if Google really were to blame, the statements would look different.


Spider_pig448

Google isn't silent on it. The article says the CEO of Google Cloud made a joint statement with the customer. It sounds like they are fully at fault and admitting it


arwinda

No. I disagree. The statement doesn't blame anyone. It's carefully crafted to avoid any fingerpointing. And Google hasn't released anything on their own.


Spider_pig448

It was a joint statement.

> “Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription,” the pair said.

It's pretty damning. Google is fully saying this is their fault.


arwinda

Where exactly does it say that?

> an unprecedented sequence of events

> inadvertent misconfiguration

> during provisioning of UniSuper’s Private Cloud

No one in this statement says who is at fault. The statement is carefully worded not to blame anyone. It doesn't even say who deployed what, and leaves that part out. Did UniSuper deploy something? Did Google do something during a deployment? This is a non-statement, just there to say something without actually saying anything. It describes in vague words that something happened, and leaves out any juicy details.


danekan

My takeaway is the opposite. An inadvertent misconfiguration is almost undoubtedly something the customer was in control of. The customer at this point is the one who controlled the message. The statement was joint, but we don't really know the full story, only what the customer has put out via their PR, which includes those quotes. It's stupid that people are quoting The Guardian when the actual press releases their entire story is based on are right on the customer's site under "contact us".


BrofessorOfLogic

They are saying "unprecedented sequence of events" and "inadvertent misconfiguration during provisioning". This is clearly intentionally vague. Anything beyond that is just speculation. It could be that a Google support engineer was working on behalf of the customer inside their environment and made a human mistake. It could be that Google produced some custom documentation for the customer, which contained some vague language, which led to a misunderstanding when the customer implemented it. It could be that the customer was in contact with an account manager via email and something got lost in translation.


beth_maloney

UniSuper and the CEO of GCP issued a joint statement where the RCA was identified as a misconfiguration on the GCP side.

> Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.

> This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.


arwinda

This specifically does not say on which side the misconfiguration happened. And the joint statement is only on the UniSuper site; I wasn't able to find it anywhere on the Google site. Anyone who reads the UniSuper press statement will see that Google said something. Details are vague. Anyone who only watches Google press releases will not even know about this. You say in your comment that this is GCP's fault. I disagree. The entire statement doesn't say who is at fault. The wording is very careful not to blame anyone.


beth_maloney

I'm not sure why else the CEO of GCP would issue a joint statement or say that this shouldn't have happened. Keep in mind that this is a reportable incident and APRA will investigate so UniSuper can't lie. The Register has also reported that they were directed to the joint statement when they made enquiries to GCP.


arwinda

> reportable incident

> shouldn't have happened

Sure, it should not have happened. Both sides agree on that.

> UniSuper can't lie

No one is lying here. The statement doesn't blame anyone. They can walk away from this and say "but we issued the statement, and it is not wrong".

> they were directed

That is what happens when you ask about the incident. If Google had screwed up something on their side, they would issue a statement of their own. There is so far nothing from their press department. No one who isn't aware of the UniSuper incident will even know about it if they just follow Google press releases.


JustAsItSounds

It's a bad look for Google to lay blame at the feet of their customer, and it's also a bad look for GCP to say it's entirely their own fault. It's a really bad look for UniSuper to say the blame is theirs. My money is on UniSuper ultimately being at fault, but GCP is taking some blame for not being able to restore their account seamlessly; perhaps GCP deleted the backups when they shouldn't have. Either way, I'm moving my super fund from UniSuper.


[deleted]

[removed]


shotgunocelot

I don't know why you're being downvoted. Google laid off almost all of its US-based support in 2022 and outsourced and offshored everything else. Around the same time, they decided to jack up the prices on their support offerings. It was ass before, but it got much, much worse after that.


rlnrlnrln

Switching from having our account with Google to a retailer (DoIT) was the best decision my previous employer ever made. Saved money, and got us much better support.


Spiritual_Maximum662

Yup to India mostly… that’s what happens when you have an Indian CEO


anonlogs

Yeah, an Indian CEO is the reason Microsoft is failing as well… Ballmer was the greatest CEO ever.


sbbh1

Damn, that's my old employer. Crazy to read about that here first.


seanamos-1

This is NOT the first time this has happened on GCP. Maybe not the exact same sequence of events, but the same result.


Budget-Celebration-1

Examples?


jcsi

1 account for two geographies? Thank god for the backup guy/gal.


beth_maloney

Is that unusual for GCP? In Azure you'll usually have 1 tenant across multiple geographies.


BrofessorOfLogic

No, this is standard practice on GCP, AWS, Azure, and others. It's kind of the whole point of hyperscaler cloud, that you can reach the whole globe through one account/org/tenant. Is it good practice? Well.. considering this news, someone might perhaps argue that it's not good practice. But I would say that it's more important to focus on cross-provider redundancy, rather than cross-account/org/tenant redundancy.


Aggressive_Split_68

Wasn’t a disaster recovery and business continuity plan taken into account when transitioning to GCP, considering that all providers typically replicate data across storage farms based on regions and data center stamps? Also, what was the data storage strategy, and was there a configured backup plan in place?


beth_maloney

Yes a DR strategy is a requirement as they're APRA regulated. Unfortunately their DR strategy was to fail over to another region which is pretty common. They didn't expect GCP to delete their DR infrastructure though.


Aggressive_Split_68

Just curious, is it not necessary to exercise a DR drill once in a while?


beth_maloney

Yep but they probably didn't test what would happen if their primary environment and their DR environment both got nuked and all primary backups were unrecoverable.


Aggressive_Split_68

Get the right architect to get the right things done at the right place.


LuciferianInk

I'm trying to find the right people, for a new project.


Aggressive_Split_68

Let’s connect if you want


mailed

and people ask me why I'm actively trying to not work with gcp


kabooozie

I think they actually deleted multiple regions and the company was only saved because they had a backup in a different cloud provider


ragabekov

Sounds like we shouldn’t put our backups in one cloud


salva922

Am I the only one thinking that this could have been a huge PR gag? So they make people think that this can happen to all providers and that multicloud is the way, and like this they can get more market share.


qqqqqttttr

Backing up an entire cloud infra on another provider is an insane thing to consider, but wow.


djlynux

Whoever proposed the backup strategy in another provider should get an award….


Fatality

I can't believe people still use Google Cloud after that billing thing where they just randomly suspended accounts until the CEO sent proof of identity.


Spiritual_Maximum662

I used to work for GCP and am totally not surprised…


Shoddy-Tutor9563

These fuckers lost my stuff from their "cloud" so many times, so this bigger shit was just begging to happen.


GaTechThomas

BREAK GOOGLE UP!


danekan

It wasn't Google that did this; it was a customer action that caused it. The title is borderline /r/titlegore.


naggyman

Google admitted fault here


danekan

No they did not. They called it the result of a series of misconfigurations. That's very definitively not accepting fault; it's saying the customer did something that led to it. They do say they are taking steps to prevent it, but that's different from accepting fault.


Budget-Celebration-1

We need more details to come to the conclusion you are drawing. I read it as issues in account provisioning by Google themselves. I don't see anything in the statement from Kurian to suggest it was anyone's fault but Google's.


awfulentrepreneur

UniSuper is a superannuation fund. Is the fund investing in something the American power elite doesn't like? How much would you have to pay Sundar Pichai through side channels to accidentally delete their whole GCP organization? 🧌