Checklist: What to Do When a Major CDN or Cloud Provider Has an Outage
2026-02-11

A practical incident runbook for site owners: triage steps, templates, quick workarounds and long-term fixes for CDN or cloud outages.

Immediate action: what to do in the first 10 minutes of a CDN or cloud outage

When Cloudflare, AWS, or another major cloud/CDN goes down, your site’s traffic, revenue, and reputation are at risk, and the clock starts the moment users hit errors. This runbook gives you a clear, prioritized set of triage steps, ready-to-use communication templates, temporary workarounds to restore service fast, and longer-term fixes to prevent repeat incidents.

Why this matters in 2026

Late 2025 and early 2026 saw several high-profile incidents where one provider’s outage rippled across thousands of sites. That trend accelerated enterprise interest in multi-CDN, multi-region failover, and AI-driven observability. But small and mid-size site owners still need practical, executable playbooks — not boardroom strategy. This guide is a compact incident runbook built for marketing teams, site owners, and devops-lite operators who must act fast.

At-a-glance incident checklist (first 10 minutes)

  1. Confirm the outage: Check provider status pages (Cloudflare, AWS, Fastly, your host). Look for a banner or incident ID.
  2. Check your monitoring: Open synthetic monitors and RUM dashboards. Confirm if errors are global or regional.
  3. Communicate immediately: Post a short status update on your status page, social, and internal Slack/Teams. (Templates below.)
  4. Identify the failure domain: Is it CDN/proxy, DNS, origin, or application? Use the quick tests below.
  5. Apply temporary workarounds: Disable CDN proxying, switch to direct-origin DNS, or publish a cached maintenance page.
  6. Open a ticket: With the provider(s) and your hosting partner. Record incident IDs and links.

Quick technical triage: commands and what they tell you

Run these checks from your laptop and from an external public machine (e.g., CI runner, an external VM) to rule out local network problems.

  • DNS resolution:
    dig +short example.com @8.8.8.8
    dig +trace example.com

    Use dig to confirm whether DNS records are resolving and which authoritative server is responding. If DNS fails to resolve globally, suspect your DNS provider or registrar. (A quick multi-resolver loop is sketched after this list.)

  • HTTP reachability:
    curl -I https://example.com --max-time 10
    curl -I --resolve example.com:443:203.0.113.45 https://example.com/

    The --resolve trick forces curl to connect to an IP while preserving the Host header — useful when testing origin reachability behind a CDN.

  • Traceroute/mtr:
    traceroute example.com
    mtr -r -c 20 example.com

    Traceroute exposes network path failures or blackholes, which often align with a provider backbone issue.

  • Check provider status & social: Provider status pages, official status accounts on social media, and DownDetector-like aggregators give early signals. In January 2026, spikes on these aggregators preceded many public outages.
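
If you need to compare several public resolvers in one pass, a small loop helps separate a local resolver problem from a provider-wide DNS failure. A minimal sketch (the resolver IPs and example.com are placeholders; substitute your own domain):

    # Query the same record against several public resolvers. Failures across
    # all of them point at your DNS provider; a single failing resolver points
    # at a local or ISP-level problem.
    for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
      echo "== ${resolver} =="
      dig +short +time=3 +tries=1 example.com A @"${resolver}" || echo "no response"
    done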

Decision matrix: which path to take

Based on the triage, pick an action path:

  • DNS not resolving: Fail over DNS to a secondary provider or restore A/ALIAS records. Use a low TTL (if preconfigured) or immediate failover via your registrar if supported. (A scripted Route 53 example follows this list.)
  • CDN/proxy failing (e.g., Cloudflare): Disable proxying for the domain (orange -> grey cloud) or switch to direct-origin DNS A record.
  • Origin unreachable but CDN up: Restore origin (scale up, restart services) or redirect traffic to a static cache hosted on a secondary bucket/site.
  • Partial region outage: Use geo-routing policies (Route 53, NS1, Fastly) to steer traffic to healthy regions.
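
If Route 53 hosts your zone, the "DNS not resolving" and "CDN/proxy failing" paths can be scripted with the AWS CLI. This is a sketch only, assuming you already know your hosted zone ID and origin IP; the zone ID and address below are placeholders:

    # Point example.com directly at the origin (placeholder 203.0.113.45) with a
    # short TTL so the change can be reverted quickly once the provider recovers.
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z0000000000000 \
      --change-batch '{
        "Comment": "Emergency failover: bypass CDN, serve from origin",
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "example.com",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.45"}]
          }
        }]
      }'

Keep an equivalent, tested command for whichever DNS vendor you actually use; the point is that the change becomes one auditable API call instead of a dashboard scramble.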

Practical temporary workarounds (get traffic serving within minutes)

These are pragmatic, sometimes blunt instruments — use them while you coordinate a safer long-term fix.

  1. Disable CDN proxying (Cloudflare example): log in and toggle the record to DNS-only. This sends traffic directly to your origin IP and often restores the site if only the CDN edge network is degraded. (An API version is sketched after this list.)
  2. Use /etc/hosts to test origin direct:
    # on macOS/Linux
    sudo -- sh -c 'echo "203.0.113.45 example.com" >> /etc/hosts'
    # then test
    curl -I https://example.com

    Use this to confirm your origin will serve correctly if you change DNS.

  3. Switch DNS to a secondary provider: If you have a preconfigured secondary DNS (NS1, Cloudflare secondary, or a registrar-hosted fallback), update NS records. Use a provider with low propagation time and fast UI or API for rapid action. See notes on domain portability and DNS failover.
  4. Serve a cached maintenance page from an S3/Cloud Storage bucket or Netlify/Vercel site and switch DNS for the root or subdomain. This preserves UX for marketing/sales pages quickly while APIs remain offline.
  5. Enable provider “Always Online” or cache-only modes if available. They are not perfect, but they can reduce visible errors for static-heavy sites and buy time while you work on a longer-lasting fix.
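
The dashboard toggle in step 1 can also be scripted, which is faster when many records are involved. A sketch against the Cloudflare v4 API; the zone ID, record ID, and the CF_API_TOKEN environment variable are placeholders you would fill in ahead of time:

    # Find the DNS record ID for example.com in the zone.
    curl -s "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records?name=example.com" \
      -H "Authorization: Bearer $CF_API_TOKEN"

    # Flip the record to DNS-only (grey cloud) so traffic bypasses the Cloudflare edge.
    curl -s -X PATCH \
      "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"proxied": false}'

Store the zone and record IDs in your runbook so nobody has to look them up mid-incident.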

Communication templates you can copy

Clear, consistent communications reduce support load and build trust. Use short, factual updates with next-steps and ETA when possible.

Status page / Public update (short)

Status: Degraded service — Investigating

Impact: Some users may see errors or slow pages. Core APIs may be affected.

What we're doing: Working with our CDN/cloud provider to restore service. Implementing direct-origin routing as a temporary workaround.

Next update: in 30 minutes or sooner.

Social / Tweet-length

We’re aware of site issues linked to a CDN/cloud provider outage and are working on a fix. Updates on our status page: [status.example.com]

Customer email (short, for paying customers)

Subject: Service update — temporary disruption

Hi — We’re responding to a third-party CDN/cloud provider outage that affects access to [example.com]. Our team is implementing a temporary routing change to restore service. We’ll update you within 60 minutes. We apologize for the disruption.

How to execute safe DNS/Proxy changes without breaking SSL

  1. Ensure origin serves a valid certificate: If you switch DNS away from a CDN that provided TLS, make sure the origin server has a certificate matching your domain (Let’s Encrypt or provider certificate). See security best practices for origin TLS and key handling.
  2. Use short TTLs pre-incident: Best practice is to keep critical records at a TTL of 60–300s when you expect to fail over quickly. If you didn’t, expect propagation lag.
  3. Automate certificate issuance: Use ACME/Certbot or an automated cert manager so origins can serve HTTPS immediately when you fail over; the developer automation guide covers broader patterns, and a sample Certbot invocation follows this list.
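
For step 3, a concrete starting point is Certbot on the origin, ideally run before any incident so the certificate is already in place when you fail over. A sketch assuming an nginx origin (the domains and email are placeholders; Apache or webroot setups use a different plugin flag):

    # Issue a certificate for the apex and www hosts using Certbot's nginx plugin.
    sudo certbot --nginx -d example.com -d www.example.com \
      --non-interactive --agree-tos -m admin@example.com

    # Confirm automatic renewal is wired up correctly.
    sudo certbot renew --dry-run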

Longer-term resilience: hardening your stack post-incident

After service is restored, move from firefighting to durable fixes. Treat this as a prioritized project with owners and deadlines.

  • Adopt multi-CDN or multi-edge strategies: Modern multi-CDN platforms and orchestration reduce single-vendor blast radius. Evaluate cost vs. availability tradeoffs and automation for failover. See notes on edge signals and orchestration.
  • Multi-region origins: Use global object storage plus origin failover across regions. For dynamic services, deploy multi-region application clusters with health checks and session migration strategies.
  • Failover DNS & health checks: Implement health-checked DNS failover (Route 53, NS1, or your DNS vendor) with automated monitors and runbooks to switch records on failure. (A CLI sketch follows this list.)
  • Reduce coupling to a single provider: Keep a minimal secondary stack (simple static site or bucket) that can host marketing and status pages within minutes. Store static assets in a secondary bucket as part of your secondary hosting target.
  • Practice runbooks: Run fire drills quarterly. Use chaos-testing on staging to validate DNS failover, proxy toggle, and origin direct serving.
  • Instrument better monitoring: Combine synthetic checks, RUM (real-user monitoring), and AI-backed observability tools introduced in 2025–2026 that can auto-surface root-cause signals.
  • Define SLOs and error budgets: Use them to make data-driven decisions about when to fail over and when to tolerate degraded performance. Tie SLO impact back to revenue metrics.
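
As an illustration of the health-checked failover item above, here is a Route 53 sketch using the AWS CLI; the IP, path, and IDs are placeholders, and most DNS vendors expose an equivalent API:

    # Create an HTTPS health check against the primary origin (placeholder IP and path).
    aws route53 create-health-check \
      --caller-reference "primary-origin-$(date +%s)" \
      --health-check-config '{
        "IPAddress": "203.0.113.45",
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/healthz",
        "FullyQualifiedDomainName": "example.com",
        "RequestInterval": 30,
        "FailureThreshold": 3
      }'

    # Then create PRIMARY and SECONDARY failover record sets in the hosted zone,
    # each with a SetIdentifier and Failover value; attach the HealthCheckId
    # returned above to the PRIMARY record so Route 53 flips traffic automatically.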

Post-incident: an actionable postmortem template

Run a blameless postmortem within 48–72 hours. Publish a public summary if customers were affected.

  1. Summary: One-paragraph overview, impact, duration, number of users affected.
  2. Timeline: Minute-by-minute timeline of detection, mitigation, and restoration actions. Include logs and monitoring charts.
  3. Root cause: Technical root cause plus contributing factors (e.g., cached control-plane misconfiguration, dependency on single DNS provider).
  4. Mitigations applied during incident: List temporary fixes and outcomes.
  5. Long-term fixes & owners: Specific action items, owners, and due dates (e.g., implement secondary DNS by 2026-03-01 — owner: Ops Lead).
  6. Learnings: What worked, what didn’t, and updated runbook links.
  7. Communication and transparency: Link to public status updates and customer communications.

Monitoring and SRE improvements to prioritize (2026 focus)

As vendor ecosystems adopted AI-assisted observability in 2025, the emphasis shifted to proactive detection. Prioritize:

  • Synthetic multi-location checks: Every critical path should be synthetically checked from multiple ISPs/regions. (A minimal example follows this list.)
  • Real-user monitoring: Capture Core Web Vitals and top errors during incidents to tie user impact to revenue metrics.
  • Incident automation: Automate remediation playbooks for routine failovers (DNS swap, proxy toggle). See a developer automation guide for patterns you can adapt.
  • Alerting by SLOs: Alert on SLO breaches rather than raw error spikes to reduce noise and focus on customer impact.
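
A synthetic check does not need a heavyweight platform to start with: a small script run from two or three machines outside your primary provider (a CI runner, a cheap VM in another region) already catches most provider-level failures. A minimal sketch, assuming a placeholder webhook URL for alerting:

    #!/usr/bin/env sh
    # Minimal synthetic check: fetch the homepage and alert when it errors or times out.
    # Run from cron on machines outside your primary provider's network.
    URL="https://example.com/"
    WEBHOOK="https://hooks.example.com/alerts"   # placeholder alerting endpoint

    status=$(curl -o /dev/null -s -w '%{http_code}' --max-time 10 "$URL")
    if [ "$status" -lt 200 ] || [ "$status" -ge 400 ]; then
      curl -s -X POST "$WEBHOOK" \
        -H "Content-Type: application/json" \
        --data "{\"text\": \"Synthetic check failed for $URL (HTTP $status)\"}"
    fi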

Example run-through: Cloudflare edge outage (mini case)

Scenario: Edge network has large-scale errors and many domains proxied through the CDN show 502/523 errors.

  1. Immediate: Post status page stating degraded service. Disable proxying for critical subdomains (dns-only).
  2. Triage: Use curl with --resolve and the Host header to confirm the origin responds. Confirm the origin TLS certificate. If the origin is healthy, toggle DNS to direct A/ALIAS records pointing at the origin IP(s).
  3. Workaround: Route marketing pages to a static S3 bucket for public-facing content to reduce load on origin.
  4. Aftermath: Implement a secondary CDN or multi-region origin, and create an automated script to switch proxy settings and DNS via provider APIs (a small wrapper is sketched below).
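
Tying the earlier snippets together, the automation in step 4 can start as a short wrapper: verify the origin responds directly, then flip the Cloudflare record to DNS-only. The IDs, token, and origin IP below are placeholders:

    #!/usr/bin/env sh
    # Emergency failover wrapper: check the origin first, then bypass the CDN edge.
    ORIGIN_IP="203.0.113.45"   # placeholder origin address
    ZONE_ID="ZONE_ID"          # placeholder Cloudflare zone ID
    RECORD_ID="RECORD_ID"      # placeholder DNS record ID

    # 1. Confirm the origin serves the site directly before shifting traffic to it.
    if ! curl -sfI --max-time 10 --resolve "example.com:443:${ORIGIN_IP}" https://example.com/ >/dev/null; then
      echo "Origin is not healthy; aborting failover." >&2
      exit 1
    fi

    # 2. Switch the record to DNS-only so traffic bypasses the degraded edge.
    curl -s -X PATCH \
      "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}" \
      -H "Authorization: Bearer ${CF_API_TOKEN}" \
      -H "Content-Type: application/json" \
      --data '{"proxied": false}'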

Checklists and runbook snippets you should add to your knowledge base

Store these as short, copy-pasteable actions in your KB or incident playbook. Keep them under titled sections (Triage, Communications, Workarounds, Postmortem).

  • Contact list: provider support links, escalation numbers, account IDs, and technical contact handles.
  • DNS rollback steps: exact API commands, NS records, and expected TTL propagation times.
  • Origin access: SSH keys, jump box addresses, and a sample command to validate origin TLS (see the sketch after this list).
  • Communication templates: status, social, and customer email (as above).
  • Runbook ownership: who executes each step and who approves the follow-up changes.
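
For the origin-access entry, the sample TLS check can be as simple as the following sketch (placeholder origin IP): it connects to the origin address while presenting your hostname via SNI, then prints the certificate subject and expiry.

    # Validate that the origin presents a valid certificate for example.com.
    echo | openssl s_client -connect 203.0.113.45:443 -servername example.com 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates

    # Or with curl, resolving the hostname to the origin IP explicitly:
    curl -vI --resolve example.com:443:203.0.113.45 https://example.com/ 2>&1 \
      | grep -E 'subject:|expire date:|HTTP/'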

Key takeaways — what to do now

  • Be prepared: Pre-provision a secondary DNS and a minimal secondary hosting target (static bucket/site).
  • Document runbooks: Keep short, tested playbooks for proxy-disable, DNS failover, and origin direct testing.
  • Automate safe failover: Use provider APIs and IaC to make failover repeatable and auditable.
  • Practice regularly: Run quarterly drills that include real DNS toggles in staging environments.
  • Maintain transparent communications: Quick, honest updates reduce customer churn and support overhead.

Final notes: the new normal (2026 and beyond)

Outages of large providers will continue to happen — but resilience is no longer only for big enterprises. In 2026, cost-effective multi-provider strategies, automated failovers, and AI-backed observability make it feasible for smaller publishers to maintain uptime and deliver consistent user experiences. Building and rehearsing a compact incident runbook like this one is the fastest way to shrink time-to-recover and protect SEO, revenue, and trust. See analysis of recent market moves and what SMBs should do in response to major provider changes in the cloud vendor merger playbook.

Call to action

Save this runbook to your documentation portal and add the checklists to your team’s incident playbook. If you want a prefilled, editable incident-runbook template (with provider API scripts and status-page copy), download our free kit or contact our team for a resilience audit.


Related Topics

#IncidentResponse #Ops #Documentation