The accidental deletion by Google Cloud of Australian superannuation fund UniSuper’s entire cloud subscription, and the weeks-long service disruption that followed this unprecedented occurrence, garnered worldwide attention when details emerged last month – and caused system administrators, chief technology officers and executive leadership teams around the world to sit up and take notice.
Disaster was averted thanks to backups held with an as-yet-unconfirmed “additional service provider”, so it is worth reflecting on what lessons might be drawn from this unfortunate turn of events for organisations already committed to, or considering the benefits of, a migration to the cloud.
Lesson number one, and certainly the most important: UniSuper had replication across two Google Cloud geographies – meaning that, like all prudent institutions, they had planned for potential data loss and taken advantage of the kind of replication the cloud offers at the touch of a button. However, this did not protect them, because the misconfiguration affected both locations simultaneously.
It was UniSuper’s decision to adopt this approach, with backups held at a third party, that saved the day and allowed service to be restored – a rarely used, but clearly still very necessary, failsafe. This backup strategy (often called the ‘3-2-1 principle’: keep three copies of your data, on two different types of media, with one copy held off-site) has been around since long before the cloud, first developed in the days when data was backed up to tape and then taken to another physical location.
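To make the principle concrete, here is a minimal sketch of what a 3-2-1 check might look like in code; it is illustrative only, and the BackupCopy structure, provider and medium labels are hypothetical rather than drawn from UniSuper’s actual setup.

```python
# Illustrative sketch of the '3-2-1' rule: at least three copies of the data,
# on at least two different media, with at least one copy held outside the
# primary provider. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class BackupCopy:
    provider: str  # e.g. "google-cloud", "other-provider", "on-prem"
    medium: str    # e.g. "object-storage", "archive", "tape"


def satisfies_3_2_1(copies: list[BackupCopy], primary_provider: str) -> bool:
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    off_provider = any(c.provider != primary_provider for c in copies)
    return enough_copies and enough_media and off_provider


in_cloud_only = [
    BackupCopy("google-cloud", "object-storage"),  # primary data
    BackupCopy("google-cloud", "object-storage"),  # replica in a second geography
]
with_third_party = in_cloud_only + [BackupCopy("other-provider", "archive")]

print(satisfies_3_2_1(in_cloud_only, "google-cloud"))     # False
print(satisfies_3_2_1(with_third_party, "google-cloud"))  # True
```

Note how replication within a single provider, however many geographies it spans, fails the check on its own; it is the copy held elsewhere that satisfies the rule.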
All well and good in the end, of course, but why did it happen at all? When the story broke, an expert observer might have been forgiven for questioning how and why there were no controls in place to prevent such a catastrophic event, given the huge emphasis placed on automation and safeguarding around the provisioning of cloud services. We have grown used to very good instrumentation in this area – though apparently not in this case, which makes the failure all the more surprising.
The evidence suggests that this account deletion happened extremely swiftly. One might conclude that it was scripted in some way, and that the script was replicated almost instantaneously – both sites would have needed to truncate a serious amount of data, drop a serious amount of code and delete a serious amount of configuration. If so, a lot had to go wrong to precipitate the outcome we saw. How did nobody notice? Was there a lack of instrumentation, monitoring or alerts?
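As a purely hypothetical illustration of the kind of guard rail one might expect in front of any scripted, large-scale teardown – this is not Google’s tooling, and the threshold and alerting hook are invented – consider a check along these lines:

```python
# Hypothetical guard rail: refuse to proceed (and page someone) when a single
# scripted run would delete an unusually large share of an estate's resources.
def alert(message: str) -> None:
    # Stand-in for a real paging / ticketing integration.
    print(f"[ALERT] {message}")


def guarded_teardown(resources_to_delete: int, total_resources: int,
                     max_fraction: float = 0.10) -> bool:
    """Return True if the teardown may proceed, False if it should be halted."""
    if total_resources and resources_to_delete / total_resources > max_fraction:
        alert(f"Teardown would remove {resources_to_delete} of {total_resources} "
              "resources - halting for human review")
        return False
    return True


print(guarded_teardown(resources_to_delete=950, total_resources=1000))  # False
```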
In their joint statement following the account deletion, UniSuper CEO Peter Chun and Google Cloud CEO Thomas Kurian described it as an “isolated, ‘one-of-a-kind occurrence’”, and subsequent communications maintained this line.1 However, in the last week of May, Google released more details.2 The good news is that – as per previous statements – this particular sequence of events appears to have been a genuine one-off.
The issue was caused by an internal tool used in some deployments of an earlier version of a specific Google service (GCVE, Google Cloud VMware Engine, which allows VMware workloads to run in Google Cloud).3 All the processes were followed correctly – but the attribute used to determine the lifetime of the subscription was inadvertently left blank, so a default term of one year was applied. Accordingly, on its one-year anniversary, the subscription effectively ‘ended’ and the account was closed. There were no customer notifications and no alerts, because this was not a customer deletion request.
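Purely as an illustration – none of the names below reflect Google’s actual internal tooling – the sketch shows how a blank lifetime attribute that silently falls back to a default term can schedule a deletion that nobody asked for and nobody is told about:

```python
# Hypothetical illustration of a blank mandatory field falling back to a
# silent default, with no notification logic anywhere in the path.
from datetime import date, timedelta

DEFAULT_TERM_DAYS = 365  # default applied when the field is left blank


def provision_subscription(created: date, term_days: int | None) -> date:
    """Return the date on which the subscription will be torn down."""
    if term_days is None:              # operator left the field blank ...
        term_days = DEFAULT_TERM_DAYS  # ... so the default quietly applies
    return created + timedelta(days=term_days)


end_date = provision_subscription(date(2023, 1, 1), term_days=None)
print(end_date)  # roughly one year later, with nothing to warn the customer
```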
This particular subscription set-up is no longer possible – an update to the API deprecated the previous version, and deployments of this type are now managed in the usual way via customer dashboards, with all the expected default alerting.
What is interesting is the apparent lack of testing (or at least, missing test cases) of the original tool, the API it used, the defaults applied to mandatory fields, and the documentation describing the process to be followed. Ultimately this omission led to a bug in production – and a pretty costly one at that. Presumably there was also no process (automatic or otherwise) to inform a customer that a subscription was being deleted.
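To make that testing point concrete, here is a sketch of the sort of test case that appears to have been missing: one asserting that a mandatory field left blank is rejected (or at least surfaced), never silently defaulted. The provision_subscription helper is the same hypothetical function sketched above, now given safer behaviour.

```python
# A hypothetical test asserting that blank mandatory fields are rejected
# rather than silently defaulted.
import unittest
from datetime import date, timedelta


class BlankTermError(ValueError):
    """Raised when a mandatory field is left blank."""


def provision_subscription(created: date, term_days: int | None) -> date:
    # Safer behaviour than the earlier sketch: a blank mandatory field is an
    # error, not a silent fallback to a default term.
    if term_days is None:
        raise BlankTermError("subscription term is mandatory and was left blank")
    return created + timedelta(days=term_days)


class ProvisioningDefaultsTest(unittest.TestCase):
    def test_blank_term_is_rejected_not_defaulted(self):
        with self.assertRaises(BlankTermError):
            provision_subscription(date(2023, 1, 1), term_days=None)

    def test_explicit_term_sets_expected_end_date(self):
        self.assertEqual(provision_subscription(date(2023, 1, 1), term_days=30),
                         date(2023, 1, 31))


if __name__ == "__main__":
    unittest.main()
```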
This leads on to an assessment of the blurred line between cloud provider and customer. The shared responsibility model has been in place since the inception of the modern cloud, but it really came to the fore during this particular event, manifesting as what Google has described as “shared fate”.4 Both the customer and Google worked together to diagnose and then solve the problem, with the actions taken by the customer to protect their assets via a backup to a third party saving the day.
If you are not following this approach at your organisation – whether by using another cloud provider in a multi-cloud architecture and/or by ensuring regular backups – it is time to take note. Most of the benefits of a cloud migration are clear and well documented. Remove the headache of maintaining your hardware, operating systems, networking and so on by transferring those tasks to a specialist provider, leaving your business to focus on what it should prioritise: functionality and service that make you stand out from the crowd.
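As a starting point for the backup side of that advice, the sketch below copies objects from a Google Cloud Storage bucket to an S3 bucket held with a second provider, using the google-cloud-storage and boto3 client libraries. The bucket names are hypothetical, credentials for both providers are assumed to be configured in the environment, and a production version would add incremental sync, integrity checks and retention policies.

```python
# Minimal sketch: copy every object in a GCS bucket to an S3 bucket held with
# a second provider. Bucket names are hypothetical.
import boto3
from google.cloud import storage

SOURCE_BUCKET = "prod-data"           # hypothetical primary bucket
OFFSITE_BUCKET = "prod-data-offsite"  # hypothetical bucket at the second provider


def copy_to_second_provider() -> None:
    gcs = storage.Client()
    s3 = boto3.client("s3")
    for blob in gcs.list_blobs(SOURCE_BUCKET):
        # Stream each object across; fine for a sketch, not for terabytes.
        s3.put_object(Bucket=OFFSITE_BUCKET, Key=blob.name,
                      Body=blob.download_as_bytes())


if __name__ == "__main__":
    copy_to_second_provider()
```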
The UniSuper event makes it clear that you should not rely on cloud providers for everything – mistakes are still made, albeit extremely rarely. Ultimately, you still own your content and your data; without them you are in serious trouble, so look after them. Furthermore, your organisation has ultimate responsibility for both the availability and quality of the services provided to your customers, which is particularly critical given the ‘high stakes’ nature of financial services. So what does all this mean for your organisation and its relationship with the cloud? Is it safer within your four walls after all? I do not think so.
Cloud providers have been working hard to demonstrate the availability, security, compliance and resiliency of their services and the results speak for themselves. This is an extremely rare incident, and cloud providers remain ahead of the game when it comes to keeping services available, patched and secure to a level that others would struggle to match. Levels of automation are significant, ensuring deployment consistency and reliability.
Given the globalised nature of cloud adoption and the workloads deployed, it is fair to say that security incidents are also extremely rare. That said, innovation continues to accelerate at pace, with new services and features released on a regular cadence, so it is important to remain vigilant and to recognise that there will always be some risk, albeit slim, that unique conditions could combine to trigger something similar in future.
So keep on migrating – statistically it is the right thing to do. However, observing three guidelines will help ensure everything runs smoothly: keep an independent copy of your data outside your primary provider, in the spirit of the 3-2-1 principle; understand where your provider’s responsibility ends and yours begins, because you retain ownership of your data and accountability for your services; and keep instrumentation, monitoring and alerting in place so that when something unusual does happen, somebody notices.
REFERENCES
1 https://www.unisuper.com.au/about-us/media-centre/2024/a-joint-statement-from-unisuper-and-google-cloud
2 https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident
3 https://cloud.google.com/vmware-engine?hl=en
4 https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate