Fastly Outage - How to have a Plan B

Dan   

You may have heard that Fastly, one of the world’s largest providers of CDN services, had an outage of about 1 hour on the 8th June 2021. There has been much wailing and gnashing of teeth as some of the world's largest websites and services were down, including Reddit, CNN, The Guardian, Shopify stores, Stripe, and Spotify, to name just a few.

According to Fastly themselves, the outage was caused by a 'service misconfiguration' (Update: a bug triggered by a client changing their configuration), which propagated globally and took websites offline. When trying to access a website using the Fastly service, users were presented with a Varnish 503 Guru Meditation error (for those of us old enough to remember, Guru Meditation is a geek reference to the Commodore Amiga computer of the late 80s!). This generally occurs when there is an issue contacting the server that the website is actually hosted on. Additionally, there were some reports on Twitter of an 'unknown domain' error.

Essentially, Fastly took down its own network with a bad software update, something that has afflicted similar online platforms in the recent past, including Google, Amazon, and Cloudflare.

Why wasn’t there a Plan B?

Fastly is an excellent service, with an enviable reliability record. There is a reason why they're trusted by some of the world's largest websites to improve reliability and load times. However, the vast majority of Fastly clients had to sit tight and wait for Fastly to fix the issue. Luckily this took only an hour, but it could have been much longer.

Just like death and taxes, software outages are a certainty in life. The real story is not that Fastly had an outage; it's that these large websites had no contingency plan for a single point of failure. This is a major oversight in infrastructure planning on the part of their technical teams.

How to handle a CDN failure

The simple solution is to have a backup CDN provider, already configured and tested, that you can switch over to in the event of a failure with your primary provider. You can then use short expiry times on your DNS records to redirect users when the failure happens. This needn't be very expensive or complicated, though your individual circumstances may of course vary.

A Quick Introduction To DNS (Domain Name System)

Modern CDNs, like Fastly, Cloudflare, and Peakhour, operate as ‘reverse proxies’. This means they sit in between the end users of a website, and the website server itself. The way they achieve this is by way of DNS configuration.

When someone types a domain name into a browser, eg fastly.com, a request is sent to a DNS server with the host name (eg fastly.com) to find the IP address of the server to retrieve the content from. CDNs, like Fastly, get website admins to list the address of the CDN on the DNS server. By doing that, requests for a website go through the CDN first. The process is analogous to listing someone else’s number in the phone book so they take calls for you.

Each record on the DNS server has a TTL (Time To Live) associated with it. This TTL tells whoever asked for the IP address of a given hostname to remember the answer, and not to ask again until the TTL has passed. Typically DNS record TTLs will be 1 hour, but they can be shorter, eg 1 minute.
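To make the effect of the TTL concrete, here is a minimal sketch of a resolver cache in Python. It is a toy model rather than a real DNS client: the `TtlCache` class, the `records` dictionary standing in for the DNS server, and the example.com / example.net hostnames are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    address: str
    expires_at: float  # seconds, on the same clock that is passed to lookup()


class TtlCache:
    """Toy model of a resolver cache: reuse an answer until its TTL expires."""

    def __init__(self, authoritative, ttl_seconds):
        # 'authoritative' is a plain dict standing in for the DNS server (assumption).
        self.authoritative = authoritative
        self.ttl_seconds = ttl_seconds
        self.cache = {}

    def lookup(self, hostname, now):
        cached = self.cache.get(hostname)
        if cached is not None and now < cached.expires_at:
            # Still fresh: the DNS server is not asked again.
            return cached.address
        # Expired or never seen: ask the DNS server and remember the answer.
        address = self.authoritative[hostname]
        self.cache[hostname] = CachedAnswer(address, now + self.ttl_seconds)
        return address


# Hypothetical hostnames, with a 60 second TTL.
records = {"www.example.com": "primary-cdn.example.net"}
resolver = TtlCache(records, ttl_seconds=60)

first = resolver.lookup("www.example.com", now=0)
records["www.example.com"] = "backup-cdn.example.net"  # operator switches provider
during_ttl = resolver.lookup("www.example.com", now=30)  # still the old answer
after_ttl = resolver.lookup("www.example.com", now=61)   # picks up the new answer
```

In this model a visitor who looked up the site just before the switch keeps going to the old provider until their cached answer expires, which is why the TTL puts a floor under how quickly a DNS change can take effect.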

Switching providers in case of an outage

By keeping a short TTL in DNS, webmasters can simply switch the answer for a DNS request to that of another provider, meaning users can quickly be directed to an alternative CDN provider. Once service has been restored on the primary provider, DNS can be switched back so that normal traffic resumes. The key is that the alternative provider is configured, tested, and ready to go.

This switch can even be automated to minimise any outages. Premium DNS services, like Amazon’s Route 53, offer optional health checking of DNS answers, which allows a switch to happen nearly instantly. The only downtime suffered would be by those people already on the site, who have to wait for the TTL to expire before being directed to the backup CDN provider. In fact this is exactly what Peakhour.io does: in the event of a catastrophic outage we use DNS to switch to backup infrastructure so our clients are minimally affected.
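The decision logic behind a health-checked switch can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Route 53 or Peakhour implement it: `http_probe`, `choose_target`, the `failures_needed` threshold, and the example URLs are hypothetical names, and the step that actually updates the DNS record is left out because it depends on your DNS provider's API.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a non-5xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as the 503s seen in the outage.
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return False


def choose_target(primary, backup, probe, failures_needed=3):
    """Pick which provider DNS should point at.

    The primary is abandoned only after 'failures_needed' consecutive failed
    probes, so a single slow response does not trigger a switch. If the backup
    is unhealthy too, stay on the primary rather than flip between two bad options.
    """
    for _ in range(failures_needed):
        if probe(primary):
            return primary
    return backup if probe(backup) else primary


if __name__ == "__main__":
    # Hypothetical health check URLs, one served through each provider.
    target = choose_target(
        "https://primary-cdn.example.net/health",
        "https://backup-cdn.example.net/health",
        http_probe,
    )
    print("DNS should point at:", target)
```

With this approach, the worst case for a visitor is roughly the time taken to detect the failure plus the TTL of the record, so a short check interval only pays off when the TTL is short as well.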

Backup provider options

Now we've shown how switching CDN providers can easily be done, let's compare the major players to see how they might serve as a backup CDN for Fastly. The three things we'll look at are cost, features, and integration.

Simply route traffic to the origin

This would be the simplest and most cost effective option, assuming your origin server can handle the increased load that removing its CDN would entail. It also assumes that it's ok to lose any features that you may have been relying on, eg load balancing, WAF, edge scripting, image optimisation etc.
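A rough way to judge whether the origin could cope is to work backwards from your cache hit ratio: if the CDN answers a fraction h of requests from cache, the origin currently sees only (1 - h) of the traffic, so removing the CDN multiplies its load by about 1 / (1 - h). The Python sketch below does that arithmetic. The function names and the figures (75% hit ratio, 50 requests per second at the origin, capacity of 150) are illustrative assumptions, not measurements.

```python
def origin_load_multiplier(cache_hit_ratio):
    """How many times more requests the origin sees once the CDN is removed."""
    if not 0 <= cache_hit_ratio < 1:
        raise ValueError("cache_hit_ratio must be in the range [0, 1)")
    return 1 / (1 - cache_hit_ratio)


def direct_origin_rps(current_origin_rps, cache_hit_ratio):
    """Estimated requests per second at the origin with no CDN in front of it."""
    return current_origin_rps * origin_load_multiplier(cache_hit_ratio)


# Illustrative numbers only (assumptions).
multiplier = origin_load_multiplier(0.75)                 # 4.0
direct_rps = direct_origin_rps(50, cache_hit_ratio=0.75)  # 200.0
can_absorb = direct_rps <= 150                            # False
```

This ignores bandwidth, TLS termination, and any work the CDN does at the edge, so treat it as a first sanity check rather than a capacity plan.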

Cloudflare

Many people use Fastly because it uses Varnish, a richly featured, programmable cache that has several advanced features. If you rely on those features, eg cache tags, cache on cookie value, custom cache tags, then you have to be on Cloudflare's top plan, which is not cheap.

The other major drawback of Cloudflare is that, unless you are on the most expensive plans, you have to cede control of DNS to them by delegating your domain. Cloudflare DNS is a great service; however, it has the major drawback of caching negative DNS responses for an hour. So if you were switching from an A record to a CNAME record, or vice versa, you could be down for an hour regardless. Not ideal.

Akamai

Akamai has a highly respected, fully featured, and very expensive product. Maintaining a backup option with them will run into the $1000s a month; only you can decide whether it’s worth it.

CloudFront

Amazon's CDN offering is the third of the big three alternatives. Since it uses volume based billing, it could be an attractive standby option, as long as you don't mind missing out on cache by tag (sorry Magento and Drupal). It is also complicated to configure for dynamic content and may lack features that you need. In fact most people use CloudFront for static content, eg images, CSS, etc, and run a Varnish instance within AWS to provide easier to configure full page caching.

This is what the BBC did during the Fastly outage: they had their backup infrastructure on CloudFront and, as of the time of writing, hadn't switched back to Fastly.

Peakhour.io

Peakhour also uses volume based billing, with a minimum monthly charge of $20. We provide all the advanced caching features that Fastly does, as well as WAF and image optimisation as standard, all in the one service fee. We don't require you to cede control of DNS to us, and we're Australian owned and based.

Conclusion

CDNs, no matter how big, can fail. If your website is important then it needs a Plan B. We've shown how this Plan B works, and that it doesn't have to be expensive when using a provider like Peakhour.io.

Your business is worth it.


© PEAKHOUR.IO PTY LTD 2024   ABN 76 619 930 826    All rights reserved.