
Preparing for a Cloud Outage

Published August 2023

Nearly all of our clients, and a majority of companies overall, use the cloud for at least a portion of their infrastructure. Planning for cloud outages ahead of time minimizes the damage they cause.

In this post we will cover how to minimize damage and recover quickly after a cloud outage. 

6 Tips for Preparing for a Cloud Outage

1. Why do cloud outages occur?

Anything that can cause an on-prem outage can cause a cloud outage. 

A cloud service is outsourced infrastructure management, so the underlying hardware and software still fail: hard drives and network switches die, air conditioning units break down, power goes out, buggy code gets pushed to the software that manages cloud instances, and construction workers accidentally cut fiber cables.

2. What kind of damage can a cloud outage create?

If you’re using a cloud provider, you’re giving up some control of your product because you expect the provider to perform better for your use case than your own staff would. The worst-case scenario is that part of your product stops working and you can neither control nor prioritize its restoration.

Smaller companies that don’t have a dedicated account manager at the cloud provider typically receive lower priority.

3. How frequent are cloud outages?

Big cloud providers like AWS and GCP tend to have a major outage about once every one to two years. Smaller incidents, such as degraded disk performance or reduced network throughput, happen once every few months.

4. How can a company protect against a cloud outage?

Run full copies of your product across multiple geographic regions and multiple cloud providers.

Running your product in multiple geographic regions will also make your product faster because the consumer can be directed to servers closer to them.

We also recommend a robust backup plan that stores data outside of the cloud. Alongside it, create processes for restoring data from the backup to another location.
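As a rough illustration of one step in such a restore process, the sketch below copies a backup file to a second location and confirms the copy is intact by comparing checksums. The function names `sha256_of` and `restore_and_verify` and the paths passed to them are hypothetical, and a real restore would usually involve database or cluster tooling rather than a plain file copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup_file: Path, restore_dir: Path) -> bool:
    """Copy a backup into restore_dir and report whether the copy matches.

    Both arguments are hypothetical paths chosen by the caller, for example
    an off-cloud backup mount and a directory at the recovery site.
    """
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = restore_dir / backup_file.name
    shutil.copy2(backup_file, restored)
    # A mismatch means the restored copy is corrupt or incomplete.
    return sha256_of(backup_file) == sha256_of(restored)
```

Running a check like this on a schedule, not only during an outage, is one way to find out early that a backup cannot be restored.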

5. How do you recover from a cloud outage?

If the product runs in multiple regions, DNS should be set up to update automatically and point customers to a second site.
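The decision behind that automatic update can be sketched as a health check over an ordered list of sites, choosing the first one that responds. This is a minimal illustration and not how any particular DNS provider implements failover. The names `http_health_check` and `pick_active_site` are hypothetical, as is the assumption that each site exposes a URL returning HTTP 200 when healthy.

```python
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(url: str, timeout: float = 3.0) -> bool:
    """Treat a site as healthy if its health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def pick_active_site(
    sites: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all fail."""
    for site in sites:
        if is_healthy(site):
            return site
    return None
```

A monitoring job could call `pick_active_site` with the primary site listed first and update the DNS record whenever the result changes. Passing the health check in as a parameter keeps the selection logic testable without network access.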

If there are backups, then you can begin the process of restoring the product to another site.

If the product lives in a single site and there are no off-cloud backups, then you need to get on the phone with the cloud provider to encourage them to prioritize your product.

6. What types of organizations are most vulnerable to a cloud outage?

Organizations that have all of their product and backups with a single cloud provider at a single geographic location are most vulnerable to cloud outages.
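One way to check for this situation is to list where the product and its backups run and flag any concentration. The sketch below assumes a hypothetical inventory made of `Location` records. The provider and region names in the usage note are only examples.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Location:
    """Where one copy of the product or a backup lives (hypothetical record)."""
    provider: str
    region: str


def single_points_of_failure(
    product: Sequence[Location], backups: Sequence[Location]
) -> List[str]:
    """Return findings for copies concentrated in one provider or one location."""
    findings: List[str] = []
    everything = list(product) + list(backups)
    if not everything:
        return ["no product or backup locations listed"]
    providers = {loc.provider for loc in everything}
    # Region names are provider-specific, so a site is a (provider, region) pair.
    sites = {(loc.provider, loc.region) for loc in everything}
    if len(providers) == 1:
        findings.append("all product and backup copies use a single cloud provider")
    if len(sites) == 1:
        findings.append("all product and backup copies sit in a single geographic location")
    return findings
```

For example, `single_points_of_failure([Location("aws", "us-east-1")], [Location("aws", "us-east-1")])` returns both findings, while adding a backup at a second provider clears them.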

Final Thoughts

Remember that using cloud services is outsourcing all of the problems you currently have to a third party who specializes in infrastructure management. 

The problems all still happen; the cloud provider just does its best to mitigate or hide them. Don’t expect the full truth about the cause and extent of an outage to be posted on the cloud provider’s website.

Support on Your Environment

At Dattell we support data infrastructure in your environment, whether on-prem or in the cloud. We specialize in Kafka, Pulsar, Elasticsearch, and OpenSearch.

Published by

Dattell - Kafka & Elasticsearch Support

Benefit from the experience of our Kafka, Pulsar, Elasticsearch, and OpenSearch expert services to help your team deploy and maintain high-performance platforms that scale. We support Kafka, Elasticsearch, and OpenSearch both on-prem and in the cloud, whether on standalone clusters or running within Kubernetes. We’ve saved our clients $100M+ over the past six years. Without our guidance, companies tend to overspend on hardware or purchase unnecessary licenses. We typically save clients multiples of what our fees cost, in addition to building, optimizing, and supporting fault-tolerant, highly available architectures.