Checklist: What to Do When a Major CDN or Cloud Provider Has an Outage
A practical incident runbook for site owners: triage steps, templates, quick workarounds and long-term fixes for CDN or cloud outages.
Immediate action: what to do in the first 10 minutes of a CDN or cloud outage
When Cloudflare, AWS, or another major cloud/CDN provider goes down, your site’s traffic, revenue, and reputation are at risk — and the clock starts the moment users hit errors. This runbook gives you a clear, prioritized set of triage steps, ready-to-use communication templates, temporary workarounds to restore service fast, and longer-term fixes to prevent repeat incidents.
Why this matters in 2026
Late 2025 and early 2026 saw several high-profile incidents where one provider’s outage rippled across thousands of sites. That trend accelerated enterprise interest in multi-CDN, multi-region failover, and AI-driven observability. But small and mid-size site owners still need practical, executable playbooks — not boardroom strategy. This guide is a compact incident runbook built for marketing teams, site owners, and devops-lite operators who must act fast.
At-a-glance incident checklist (first 10 minutes)
- Confirm the outage: Check provider status pages (Cloudflare, AWS, Fastly, your host). Look for a banner or incident ID.
- Check your monitoring: Open synthetic monitors and RUM dashboards. Confirm if errors are global or regional.
- Communicate immediately: Post a short status update on your status page, social, and internal Slack/Teams. (Templates below.)
- Identify the failure domain: Is it CDN/proxy, DNS, origin, or application? Use the quick tests below.
- Apply temporary workarounds: Disable CDN proxying, switch to direct-origin DNS, or publish a cached maintenance page.
- Open a ticket: With the provider(s) and your hosting partner. Record incident IDs and links.
Quick technical triage: commands and what they tell you
Run these checks from your laptop and from an external public machine (e.g., CI runner, an external VM) to rule out local network problems.
- DNS resolution (a multi-resolver sketch follows this list):
```bash
dig +short example.com @8.8.8.8
dig +trace example.com
```
Use dig to confirm whether DNS records are resolving and which authoritative server is responding. If DNS fails to resolve globally, suspect your DNS provider or registrar.
- HTTP reachability:
```bash
curl -I https://example.com --max-time 10
curl -I --resolve example.com:443:203.0.113.45 https://example.com/
```
The --resolve trick forces curl to connect to a specific IP while preserving the Host header — useful when testing origin reachability behind a CDN.
- Traceroute/mtr:
```bash
traceroute example.com
mtr -r -c 20 example.com
```
Traceroute and mtr show network path failures or blackholes, often aligned with a provider’s backbone issue.
- Check provider status & social: Provider status pages, their @ status account, and DownDetector-like aggregators give early signals. In January 2026, spikes on these sites preceded many public outages.
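The checks above can be strung into one quick sweep. Below is a minimal sketch, assuming bash and the public resolvers shown; the domain and URL are placeholders for your own, and you should repeat the run from an external VM or CI runner as well.

```bash
# Quick sweep: DNS via several public resolvers, then an HTTP status check.
DOMAIN="example.com"
for RESOLVER in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "== DNS via ${RESOLVER} =="
  dig +short "${DOMAIN}" @"${RESOLVER}"
done

# HTTP status and total time from this machine.
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  --max-time 10 "https://${DOMAIN}/"
```

If resolvers disagree, or HTTP fails only from some networks, you are likely looking at a partial or regional failure rather than a full outage.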
Decision matrix: which path to take
Based on the triage, pick an action path:
- DNS not resolving: Fail over to your secondary DNS provider or restore direct A/ALIAS records. A low TTL helps only if it was set before the incident; otherwise use immediate failover via your registrar if supported.
- CDN/proxy failing (e.g., Cloudflare): Disable proxying for the domain (orange -> grey cloud) or switch to a direct-origin DNS A record (a scripted Route 53 example follows this list).
- Origin unreachable but CDN up: Restore origin (scale up, restart services) or redirect traffic to a static cache hosted on a secondary bucket/site.
- Partial region outage: Use geo-routing policies (Route 53, NS1, Fastly) to steer traffic to healthy regions.
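If your DNS lives in Route 53, the "restore A records" and "switch to direct-origin DNS" paths above can be scripted. A minimal sketch, assuming placeholder values for the hosted zone ID and origin IP:

```bash
# Emergency UPSERT: point the apex A record straight at the origin with a short TTL.
# Z0ABCDEXAMPLE and 203.0.113.45 are placeholders for your hosted zone ID and origin IP.
cat > /tmp/direct-origin.json <<'EOF'
{
  "Comment": "Incident: route apex directly to origin",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "example.com.",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "203.0.113.45" }]
    }
  }]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0ABCDEXAMPLE \
  --change-batch file:///tmp/direct-origin.json
```

Keep the exact JSON you used in your runbook so the rollback (re-pointing at the CDN) is a copy-paste away.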
Practical temporary workarounds (get traffic serving within minutes)
These are pragmatic, sometimes blunt instruments — use them while you coordinate a safer long-term fix.
- Disable CDN proxying (Cloudflare example): Log in and toggle the record to DNS-only, or use the API call sketched after this list. This sends traffic directly to your origin IP and often restores the site if only the CDN edge network is degraded.
- Use /etc/hosts to test the origin directly:
```bash
# on macOS/Linux
sudo -- sh -c 'echo "203.0.113.45 example.com" >> /etc/hosts'
# then test
curl -I https://example.com
```
Use this to confirm your origin will serve correctly before you change DNS.
- Switch DNS to a secondary provider: If you have a preconfigured secondary DNS (NS1, Cloudflare secondary, or a registrar-hosted fallback), update NS records. Use a provider with low propagation time and fast UI or API for rapid action. See notes on domain portability and DNS failover.
- Serve a cached maintenance page from an S3/Cloud Storage bucket or Netlify/Vercel site and switch DNS for the root or subdomain. This preserves UX for marketing/sales pages quickly while APIs remain offline.
- Enable the provider’s “Always Online” or cache-only modes if available. They are not perfect, but they can reduce visible errors for static-heavy sites and buy time as a stopgap.
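For the proxy-disable workaround, the dashboard toggle can also be done via the API, which is faster to script during an incident. A minimal sketch, assuming Cloudflare’s v4 API and placeholder zone/record IDs and token:

```bash
# Find the DNS record ID for the hostname you want to un-proxy
# (the response contains an "id" for each matching record).
curl -s "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records?name=example.com" \
  -H "Authorization: Bearer ${CF_API_TOKEN}"

# Switch that record from proxied (orange cloud) to DNS-only (grey cloud).
curl -X PATCH \
  "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"proxied": false}'
```

Remember that with proxying off, your origin IP is exposed and the origin must serve a valid certificate for the domain (see the SSL section below).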
Communication templates you can copy
Clear, consistent communications reduce support load and build trust. Use short, factual updates with next-steps and ETA when possible.
Status page / Public update (short)
Status: Degraded service — Investigating
Impact: Some users may see errors or slow pages. Core APIs may be affected.
What we're doing: Working with our CDN/cloud provider to restore service. Implementing direct-origin routing as a temporary workaround.
Next update: in 30 minutes or sooner.
Social / Tweet-length
We’re aware of site issues linked to a CDN/cloud provider outage and are working on a fix. Updates on our status page: [status.example.com]
Customer email (short, for paying customers)
Subject: Service update — temporary disruption
Hi — We’re responding to a third-party CDN/cloud provider outage that affects access to [example.com]. Our team is implementing a temporary routing change to restore service. We’ll update you within 60 minutes. We apologize for the disruption.
How to execute safe DNS/Proxy changes without breaking SSL
- Ensure origin serves a valid certificate: If you switch DNS away from a CDN that provided TLS, make sure the origin server has a certificate matching your domain (Let’s Encrypt or provider certificate). See security best practices for origin TLS and key handling.
- Use short TTLs pre-incident: Keep critical records at a TTL of 60–300 seconds if you expect to need fast failover. If you didn’t, expect propagation lag.
- Automate certificate issuance: Use ACME/Certbot or an automated cert manager so origins can serve HTTPS instantly when you fail over; consider developer automation patterns from a developer automation guide.
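If the origin runs nginx (or can briefly bind port 80), issuing a certificate takes a couple of commands. A minimal sketch, assuming certbot is installed and DNS already points at the origin:

```bash
# Issue and install a certificate via the nginx plugin.
sudo certbot --nginx -d example.com -d www.example.com

# Or issue only (standalone mode binds port 80, so stop anything listening there first).
sudo certbot certonly --standalone -d example.com
```

Doing this before an incident, and keeping auto-renewal enabled, is what makes the “switch DNS away from the CDN” workaround safe to execute quickly.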
Longer-term resilience: hardening your stack post-incident
After service is restored, move from firefighting to durable fixes. Treat this as a prioritized project with owners and deadlines.
- Adopt multi-CDN or multi-edge strategies: Modern multi-CDN platforms and orchestration reduce single-vendor blast radius. Evaluate cost vs. availability tradeoffs and automation for failover. See notes on edge signals and orchestration.
- Multi-region origins: Use global object storage plus origin failover across regions. For dynamic services, deploy multi-region application clusters with health checks and session migration strategies.
- Failover DNS & health checks: Implement health-checked DNS failover (Route 53, NS1, or your DNS vendor) with automated monitors and runbooks to switch records on failure.
- Reduce coupling to a single provider: Keep a minimal secondary stack (a simple static site or bucket) that can host marketing and status pages within minutes, with static assets mirrored to a secondary bucket (see the sketch after this list).
- Practice runbooks: Run fire drills quarterly. Use chaos-testing on staging to validate DNS failover, proxy toggle, and origin direct serving.
- Instrument better monitoring: Combine synthetic checks, RUM (real-user monitoring), and AI-backed observability tools introduced in 2025–2026 that can auto-surface root-cause signals.
- Define SLOs and error budgets: Use them to make data-driven decisions about when to fail over and when to tolerate degraded performance. Tie SLO impact back to revenue metrics.
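As a concrete starting point for the secondary hosting target, a static copy of your key pages in an object storage bucket is cheap to keep warm. A minimal sketch using the AWS CLI, with a placeholder bucket name (you still need a public-read bucket policy or a CDN distribution in front before it can serve traffic):

```bash
# Create the bucket and enable static website hosting.
aws s3 mb s3://example-com-failover --region us-east-1
aws s3 website s3://example-com-failover \
  --index-document index.html --error-document error.html

# Sync a static export of marketing/status pages; run this on every deploy.
aws s3 sync ./static-export/ s3://example-com-failover/ --delete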
Post-incident: an actionable postmortem template
Run a blameless postmortem within 48–72 hours. Publish a public summary if customers were affected.
- Summary: One-paragraph overview, impact, duration, number of users affected.
- Timeline: Minute-by-minute timeline of detection, mitigation, and restoration actions. Include logs and monitoring charts.
- Root cause: Technical root cause plus contributing factors (e.g., cached control-plane misconfiguration, dependency on single DNS provider).
- Mitigations applied during incident: List temporary fixes and outcomes.
- Long-term fixes & owners: Specific action items, owners, and due dates (e.g., implement secondary DNS by 2026-03-01 — owner: Ops Lead).
- Learnings: What worked, what didn’t, and updated runbook links.
- Communication and transparency: Link to public status updates and customer communications.
Monitoring and SRE improvements to prioritize (2026 focus)
As vendor ecosystems adopted AI-assisted observability in 2025, the emphasis shifted to proactive detection. Prioritize:
- Synthetic multi-location checks: Every critical path should be checked synthetically from multiple ISPs/regions (a minimal sketch follows this list).
- Real-user monitoring: Capture Core Web Vitals and top errors during incidents to tie user impact to revenue metrics.
- Incident automation: Automate remediation playbooks for routine failovers (DNS swap, proxy toggle). See a developer automation guide for patterns you can adapt.
- Alerting by SLOs: Alert on SLO breaches rather than raw error spikes to reduce noise and focus on customer impact.
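A synthetic multi-location check does not need a vendor to get started: the same script run from CI runners or small VMs in different regions already gives you an outside-in view. A minimal sketch, assuming bash and placeholder URLs for your critical paths:

```bash
#!/usr/bin/env bash
# Exit non-zero if any critical path fails, so CI or cron can alert on it.
set -u
URLS=(
  "https://example.com/"
  "https://example.com/api/healthz"
  "https://example.com/checkout"
)
FAILED=0
for URL in "${URLS[@]}"; do
  CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$URL")
  if [[ "$CODE" != 2* && "$CODE" != 3* ]]; then
    echo "FAIL ${URL} -> HTTP ${CODE}"
    FAILED=1
  else
    echo "OK   ${URL} -> HTTP ${CODE}"
  fi
done
exit "$FAILED"
```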
Example run-through: Cloudflare edge outage (mini case)
Scenario: The CDN’s edge network is serving errors at scale, and many domains proxied through it show 502/523 errors.
- Immediate: Post a status page update stating degraded service. Disable proxying (DNS-only) for critical subdomains.
- Triage: Use curl with --resolve and the Host header to confirm the origin responds, and confirm the origin SSL certificate. If the origin is healthy, point DNS at the origin IP(s) with direct A/ALIAS records.
- Workaround: Route marketing pages to a static S3 bucket for public-facing content to reduce load on the origin.
- Aftermath: Implement secondary CDN or multi-region origin and create an automated script to switch proxy settings and DNS via provider APIs.
Checklists and runbook snippets you should add to your knowledge base
Store these as short, copy-pasteable actions in your KB or incident playbook. Keep them under titled sections (Triage, Communications, Workarounds, Postmortem).
- Contact list: provider support links, escalation numbers, account IDs, and technical contact handles.
- DNS rollback steps: exact API commands, NS records, and expected TTL propagation times.
- Origin access: SSH keys, jump box addresses, and a sample curl/openssl command to validate TLS (see the sketch after this list).
- Communication templates: status, social, and customer email (as above).
- Runbook ownership: who executes each step and who approves the follow-up changes.
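For the origin-access entry, a copy-pasteable TLS spot-check saves time mid-incident. A minimal sketch, with a placeholder origin IP:

```bash
# Confirm the origin presents a valid certificate for the domain.
curl -vI --resolve example.com:443:203.0.113.45 https://example.com/ 2>&1 \
  | grep -iE 'subject|issuer|expire'

# Or read the certificate subject, issuer, and validity dates directly.
openssl s_client -connect 203.0.113.45:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```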
Key takeaways — what to do now
- Be prepared: Pre-provision a secondary DNS and a minimal secondary hosting target (static bucket/site).
- Document runbooks: Keep short, tested playbooks for proxy-disable, DNS failover, and origin direct testing.
- Automate safe failover: Use provider APIs and IaC to make failover repeatable and auditable.
- Practice regularly: Run quarterly drills that include real DNS toggles in staging environments.
- Maintain transparent communications: Quick, honest updates reduce customer churn and support overhead.
Final notes: the new normal (2026 and beyond)
Outages of large providers will continue to happen — but resilience is no longer only for big enterprises. In 2026, cost-effective multi-provider strategies, automated failovers, and AI-backed observability make it feasible for smaller publishers to maintain uptime and deliver consistent user experiences. Building and rehearsing a compact incident runbook like this one is the fastest way to shrink time-to-recover and protect SEO, revenue, and trust. See analysis of recent market moves and what SMBs should do in response to major provider changes in the cloud vendor merger playbook.
Call to action
Save this runbook to your documentation portal and add the checklists to your team’s incident playbook. If you want a prefilled, editable incident-runbook template (with provider API scripts and status-page copy), download our free kit or contact our team for a resilience audit.
Related Reading
- Cost Impact Analysis: Quantifying Business Loss from Social Platform and CDN Outages
- Edge Signals, Live Events, and the 2026 SERP
- Edge Signals & Personalization: Analytics Playbook
- Domain Portability as a Growth Engine (DNS failover notes)
- Security Best Practices with Mongoose.Cloud
- Using Process Roulette & Chaos to Harden Production Services