This Black Friday, don’t bet on the cloud alone

insight
Nov 21, 2025
9 min read

Author

Dushyant Sahni

A Global Practice Leader for Private Equity, Horizontal Tech, and Management Consulting at Nagarro. A seasoned technology consultant, he specializes in several topics including AI's use in the SDLC, resilience engineering, and Cloud FinOps — helping enterprises build fluid, high-performing teams that deliver lasting value.

What the AWS & Cloudflare outages revealed about false resilience, and why AI raises the stakes


A single control plane failure brought millions of transactions to a standstill. What made the October outage remarkable was not the size of the disruption, but the layer where it began. A fault in one AWS region rippled outward because the control plane, responsible for routing, DNS, and the coordination fabric, stopped functioning while the systems above it continued to run. The applications remained operational, but the intelligence that connected them did not. The impact was felt instantly around the world. 

A few weeks later, the Cloudflare outage proved the same point: a failure inside its core proxy module triggered a cascading global outage for systems that depended on Cloudflare for web traffic management, even though the underlying infrastructure remained healthy. 

These outages signal that resilience is not defined by uptime; it is defined by control. Control has evolved from a technical safeguard into a core part of enterprise stewardship.

As Black Friday approaches, the question is not whether systems will fail; complex, interconnected environments ensure that they will. The real question is how quickly control can be re-established when they do. Resilience now exists in the space between technology and leadership. It is measured by visibility, coordination, and the ability to steady the business when the environment falters.

The Cloudflare and AWS outages aren't glitches. They are governance failures.

The AWS and recent Cloudflare outages exposed the same risk: third-party dependencies are not outside a business; they are part of it. Many organizations still treat the cloud as something managed elsewhere, rather than what it truly is: an extension of their own operating fabric.

 

Three resilience blind spots in modern architectures

1. Redundancy without independence is an illusion

Regional redundancy, by itself, cannot protect against a failure in centralized control. Workloads can be distributed across regions, but if they all depend on the same control plane, a single failure can still take them down. When identity, traffic management, and orchestration all pass through a single layer, that layer becomes the hidden point of failure.

The same applies to AI. When inference, model versioning, and API access are centralized through one provider, organizations inherit that provider's risks. When every model call relies on a single provider's API, a rate limit or service disruption affects every product using it. True resilience begins with disciplined design: emergency credentials, routing logic, and AI fallbacks that live outside the primary provider's fabric.
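
To make this concrete, here is a minimal Python sketch of provider-independent routing, assuming three interchangeable model endpoints. The function names are illustrative stand-ins, not any vendor's actual SDK:

```python
# Hypothetical wrappers around three independent model endpoints.
def call_primary_model(prompt: str) -> str:
    raise TimeoutError("primary provider rate-limited")  # simulate an outage

def call_secondary_model(prompt: str) -> str:
    return f"[secondary] answer to: {prompt}"

def call_local_model(prompt: str) -> str:
    return f"[local fallback] answer to: {prompt}"

# Ordered failover chain: each entry is operationally independent, so one
# provider's rate limit or outage cannot stall every call.
PROVIDERS = [call_primary_model, call_secondary_model, call_local_model]

def generate(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)
        except Exception as exc:  # rate limits, timeouts, 5xx responses
            last_error = exc      # fall through to the next provider
    raise RuntimeError("all model providers failed") from last_error

print(generate("Summarize today's order anomalies"))
```

The design choice that matters is the ordered chain of independent providers: no single outage can stall every call.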

 

2. Contain failures before they spread

A failure should end where it begins. Design systems with clear boundaries so one issue doesn’t cascade through the entire stack. This is as much about governance as it is about engineering. 

If an AI model slows, drifts, or produces bad output, it shouldn’t freeze a checkout flow or corrupt live data. Containment is a design choice; it keeps disruption local and recoverable. 
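
As a sketch of what that boundary can look like, assume a checkout flow that calls a recommendation model; the function names and the timeout value are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ai_recommendations(cart: list[str]) -> list[str]:
    """Hypothetical model call; it hangs here to simulate a degraded model."""
    time.sleep(2)
    return ["upsell-item"]

executor = ThreadPoolExecutor(max_workers=4)  # bulkhead: bounded AI capacity

def checkout(cart: list[str]) -> dict:
    # The model call sits behind a hard timeout. If it slows, drifts, or
    # fails, checkout completes without recommendations instead of freezing.
    future = executor.submit(ai_recommendations, cart)
    try:
        recs = future.result(timeout=0.5)
    except Exception:
        recs = []  # degrade gracefully; the failure ends here
    return {"items": cart, "recommendations": recs, "status": "confirmed"}

print(checkout(["sku-123"]))  # confirms immediately despite the slow model
```

The checkout confirms even though the model call hangs; the disruption stays local to the recommendation feature.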

 

3. Backups that aren’t tested don’t exist

Resilience isn’t built in an outage; it’s built through rehearsal. Test failovers, backup systems, and AI fallback logic until they’re second nature. Simulate API slowdowns, model failures, or degraded performance. That’s the way to find where control ends, and how to extend it. 
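
One practical way to rehearse is to wrap a dependency in a fault-injection layer and observe how the caller behaves. A minimal sketch, with assumed failure rates and a stand-in fraud check:

```python
import random
import time

# Minimal fault injection for gameday-style drills. The fault rates and
# added latency below are illustrative assumptions.
def with_chaos(func, slow_rate=0.2, fail_rate=0.1, added_latency=3.0):
    def wrapped(*args, **kwargs):
        roll = random.random()
        if roll < fail_rate:
            raise ConnectionError("injected provider failure")
        if roll < fail_rate + slow_rate:
            time.sleep(added_latency)  # injected slowdown
        return func(*args, **kwargs)
    return wrapped

def fraud_check(order_id: str) -> bool:
    return True  # stand-in for a real model or API call

# Run the dependency under injected faults and measure what the calling
# system actually does when the model misbehaves.
chaotic_fraud_check = with_chaos(fraud_check)
for i in range(5):
    start = time.monotonic()
    try:
        ok = chaotic_fraud_check(f"order-{i}")
        print(f"order-{i}: ok={ok} in {time.monotonic() - start:.2f}s")
    except ConnectionError as exc:
        print(f"order-{i}: {exc}")
```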

The AWS outage proved the point: a DNS failure in US-EAST-1 cut off access to healthy servers. Compute power was there, but control wasn’t. 

Black Friday requires a new kind of resilience

In finance, retail, and technology, downtime is measured not in minutes but in millions. During peak demand, a single stall can erase months of growth. Adding AI dependencies amplifies both risk and complexity.

If a recommender engine fails, sales drop. If fraud detection slows, losses rise. And when an AI-powered support system breaks, customer frustration spreads instantly. 

Some companies are already designing for “continuity of intelligence.” DoorDash’s RAG-based chatbot checks facts automatically and hands off complex cases to humans when confidence drops. LinkedIn uses an “LLM judge” that constantly evaluates AI response quality and retrains models in real time. These are resilience controls, not just clever features. They protect experience continuity as much as uptime. 

In healthcare and retail, resilience goes even deeper. Outages can threaten not just revenue but safety and trust. Here, control means autonomy, using local data caching, edge computing, and offline transactions so essential operations continue even when the cloud’s control plane goes dark. 

Cybersecurity for Black Friday

Four imperatives for building real resilience

1. Rehearse for failure

Assume something will fail, including a provider’s control plane. Maintain recovery environments that stay reachable even when the primary cloud is not. Run gamedays that simulate both infrastructure and AI failures, enabling teams to act decisively when automation stops responding.

2. Own the digital front door

Resilience starts with controlling the path home, and the keys to get in. Avoid single-provider routing. Use independent DNS and load-balancing services, and test them often. For AI, build in retry logic, circuit breakers, and cached responses. Keep critical credentials in secure, independent vaults to authenticate even if the primary IAM goes dark. 
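
Here is a simplified sketch of those mechanisms working together: retries with exponential backoff, a circuit breaker, and a cache of last-known-good responses. The thresholds and fallback message are assumptions, not a production design:

```python
import time

class CircuitBreaker:
    """Trips open after repeated failures; recloses after a cooling-off period."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()
cache: dict[str, str] = {}  # last-known-good responses

def ask_model(prompt: str) -> str:
    raise TimeoutError("provider unreachable")  # simulate an outage

def resilient_ask(prompt: str, retries: int = 2) -> str:
    if breaker.is_open():  # stop hammering a failing provider
        return cache.get(prompt, "Service is busy; please try again shortly.")
    for attempt in range(retries + 1):
        try:
            answer = ask_model(prompt)
            cache[prompt] = answer   # refresh the cache on success
            breaker.failures = 0
            return answer
        except Exception:
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff
    breaker.record_failure()
    return cache.get(prompt, "Service is busy; please try again shortly.")

print(resilient_ask("Where is my order?"))
```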

3. Secure the AI surface

AI systems create new control layers and new risks. Test for prompt injection, manipulation, and data leakage. Monitor model drift and degradation under load. For autonomous agents, enforce least privilege, validate every action, and log all decisions for auditability.      
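
For autonomous agents, that discipline can start with an allow-list checked before every action, with each decision logged whether it runs or not. A minimal sketch, using a hypothetical support agent and an illustrative refund limit:

```python
import json
import time

# Hypothetical allow-list: each agent may invoke only the actions it needs.
AGENT_PERMISSIONS = {
    "support-agent": {"lookup_order", "issue_refund"},
}
ACTION_LIMITS = {"issue_refund": {"max_amount": 100.0}}  # illustrative guardrail

def execute_agent_action(agent: str, action: str, params: dict) -> str:
    allowed = action in AGENT_PERMISSIONS.get(agent, set())
    within_limits = True
    if action == "issue_refund":
        within_limits = params.get("amount", 0) <= ACTION_LIMITS["issue_refund"]["max_amount"]
    decision = "allowed" if (allowed and within_limits) else "denied"
    # Every decision is logged for auditability, whether or not it runs.
    print(json.dumps({"ts": time.time(), "agent": agent, "action": action,
                      "params": params, "decision": decision}))
    if decision == "denied":
        raise PermissionError(f"{agent} may not perform {action} with {params}")
    return f"{action} executed"

execute_agent_action("support-agent", "lookup_order", {"order_id": "A-42"})
try:
    execute_agent_action("support-agent", "issue_refund", {"amount": 500.0})
except PermissionError as exc:
    print(exc)
```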

4. Lead with a resilience mindset

Resilience isn’t an engineering artifact; it’s a leadership discipline. Build recovery exercises into normal operations, assign ownership for AI incidents, and define escalation paths before they’re needed. The goal isn’t to prevent every failure, but to ensure the system bends without breaking. 

The real test of leadership

For CISOs and CIOs, the AWS outage was more than a cloud failure; it was a test of control. It demonstrated that resilience, security, and governance now operate as a single, interdependent discipline.

When the control plane collapses, visibility vanishes. When AI systems falter, decisions go dark. Resilience is what spans those gaps. Its true measure is not uptime, but the speed with which trust, control, and communication are restored under pressure.

As Black Friday approaches, the stakes are as reputational as they are technical. Customers rarely remember the failure itself; they remember the clarity, transparency, and steadiness that followed. In an environment where AI increasingly shapes every interaction, control, not raw availability, has become the defining signal of leadership.

Real resilience is not the absence of failure. It is the ability to govern complexity so that when systems fail, the enterprise continues to stand.



Also, read here to explore the internal blind spot: Shadow AI.

