This Black Friday, don’t bet on the cloud alone

insight
Nov 21, 2025
9 min read

Author

Dushyant Sahni

A Global Practice Leader for Private Equity, Horizontal Tech, and Management Consulting at Nagarro. A seasoned technology consultant, he specializes in several topics including AI's use in the SDLC, resilience engineering, and Cloud FinOps — helping enterprises build fluid, high-performing teams that deliver lasting value.

What the AWS & Cloudflare outages revealed about false resilience, and why AI raises the stakes


A single control plane failure brought millions of transactions to a standstill. What made the October outage remarkable was not the size of the disruption, but the layer where it began. A fault in one AWS region rippled outward because the control plane, responsible for routing, DNS, and the coordination fabric, stopped functioning while the systems above it continued to run. The applications remained operational, but the intelligence that connected them did not. The impact was felt instantly around the world. 

A few weeks later, the Cloudflare outage proved the same point: a failure inside its core proxy module triggered a cascading global outage for systems that depended on Cloudflare for web traffic management, even though the underlying infrastructure remained healthy. 

These outages signal that resilience is not defined by uptime; it is defined by control. Control has evolved from a technical safeguard into a core part of enterprise stewardship.

As Black Friday approaches, the question is not whether systems will fail; complex, interconnected environments ensure that they will. The real question is how quickly control can be re-established when they do. Resilience now exists in the space between technology and leadership. It is measured by visibility, coordination, and the ability to steady the business when the environment falters.

The Cloudflare and AWS outages aren't glitches. They are governance failures.

The AWS and recent Cloudflare outages exposed the same risk: third-party dependencies are not outside a business; they are part of it. Many organizations still treat the cloud as something managed elsewhere, rather than what it truly is: an extension of their own operating fabric.

 

Three resilience blind spots in modern architectures

1. Redundancy without independence is an illusion

Regional redundancy, by itself, cannot protect against a failure in centralized control. Workloads can be distributed across regions, but if they all depend on the same control plane, a single failure can still take them down. When identity, traffic management, and orchestration all pass through a single layer, that layer becomes the hidden point of failure.

The same applies to AI. When inference, model versioning, and API access are centralized through one provider, organizations inherit that provider's risks. When every model call relies on a single provider's API, a rate limit or service disruption affects every product using it. True resilience begins with disciplined design: emergency credentials, routing logic, and AI fallbacks that live outside the primary provider's fabric.
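
To make this concrete, here is a minimal Python sketch of provider-independent routing, assuming three interchangeable model endpoints. The function names are illustrative stand-ins, not any vendor's actual SDK:

```python
# Hypothetical wrappers around three independent model endpoints.
def call_primary_model(prompt: str) -> str:
    raise TimeoutError("primary provider rate-limited")  # simulate an outage

def call_secondary_model(prompt: str) -> str:
    return f"[secondary] answer to: {prompt}"

def call_local_model(prompt: str) -> str:
    return f"[local fallback] answer to: {prompt}"

# Ordered failover chain: each entry is operationally independent, so one
# provider's rate limit or outage cannot stall every call.
PROVIDERS = [call_primary_model, call_secondary_model, call_local_model]

def generate(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)
        except Exception as exc:  # rate limits, timeouts, 5xx responses
            last_error = exc      # fall through to the next provider
    raise RuntimeError("all model providers failed") from last_error

print(generate("Summarize today's order anomalies"))
```

The design choice that matters is the ordered chain of independent providers: no single outage can stall every call.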

 

2. Contain failures before they spread

A failure should end where it begins. Design systems with clear boundaries so one issue doesn’t cascade through the entire stack. This is as much about governance as it is about engineering. 

If an AI model slows, drifts, or produces bad output, it shouldn’t freeze a checkout flow or corrupt live data. Containment is a design choice; it keeps disruption local and recoverable. 
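
As a sketch of what that boundary can look like, assume a checkout flow that calls a recommendation model; the function names and the timeout value are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ai_recommendations(cart: list[str]) -> list[str]:
    """Hypothetical model call; it hangs here to simulate a degraded model."""
    time.sleep(2)
    return ["upsell-item"]

executor = ThreadPoolExecutor(max_workers=4)  # bulkhead: bounded AI capacity

def checkout(cart: list[str]) -> dict:
    # The model call sits behind a hard timeout. If it slows, drifts, or
    # fails, checkout completes without recommendations instead of freezing.
    future = executor.submit(ai_recommendations, cart)
    try:
        recs = future.result(timeout=0.5)
    except Exception:
        recs = []  # degrade gracefully; the failure ends here
    return {"items": cart, "recommendations": recs, "status": "confirmed"}

print(checkout(["sku-123"]))  # confirms immediately despite the slow model
```

The checkout confirms even though the model call hangs; the disruption stays local to the recommendation feature.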

 

3. Backups that aren’t tested don’t exist

Resilience isn’t built in an outage; it’s built through rehearsal. Test failovers, backup systems, and AI fallback logic until they’re second nature. Simulate API slowdowns, model failures, or degraded performance. That’s the way to find where control ends, and how to extend it. 
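
One practical way to rehearse is to wrap a dependency in a fault-injection layer and observe how the caller behaves. A minimal sketch, with assumed failure rates and a stand-in fraud check:

```python
import random
import time

# Minimal fault injection for gameday-style drills. The fault rates and
# added latency below are illustrative assumptions.
def with_chaos(func, slow_rate=0.2, fail_rate=0.1, added_latency=3.0):
    def wrapped(*args, **kwargs):
        roll = random.random()
        if roll < fail_rate:
            raise ConnectionError("injected provider failure")
        if roll < fail_rate + slow_rate:
            time.sleep(added_latency)  # injected slowdown
        return func(*args, **kwargs)
    return wrapped

def fraud_check(order_id: str) -> bool:
    return True  # stand-in for a real model or API call

# Run the dependency under injected faults and measure what the calling
# system actually does when the model misbehaves.
chaotic_fraud_check = with_chaos(fraud_check)
for i in range(5):
    start = time.monotonic()
    try:
        ok = chaotic_fraud_check(f"order-{i}")
        print(f"order-{i}: ok={ok} in {time.monotonic() - start:.2f}s")
    except ConnectionError as exc:
        print(f"order-{i}: {exc}")
```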

The AWS outage proved the point: a DNS failure in US-EAST-1 cut off access to healthy servers. Compute power was there, but control wasn’t. 

Black Friday requires a new kind of resilience

In finance, retail, and technology, downtime is measured not in minutes but in millions. During peak demand, a single stall can erase months of growth. Adding AI dependencies amplifies both risk and complexity.

If a recommender engine fails, sales drop. If fraud detection slows, losses rise. And when an AI-powered support system breaks, customer frustration spreads instantly. 

Some companies are already designing for “continuity of intelligence.” DoorDash’s RAG-based chatbot checks facts automatically and hands off complex cases to humans when confidence drops. LinkedIn uses an “LLM judge” that constantly evaluates AI response quality and retrains models in real time. These are resilience controls, not just clever features. They protect experience continuity as much as uptime. 

In healthcare and retail, resilience goes even deeper. Outages can threaten not just revenue but safety and trust. Here, control means autonomy, using local data caching, edge computing, and offline transactions so essential operations continue even when the cloud’s control plane goes dark. 

Cybersecurity for Black Friday

Four imperatives for building real resilience

1. Rehearse for failure

Assume something will fail, including a provider’s control plane. Maintain recovery environments that stay reachable even when the primary cloud is not. Run gamedays that simulate both infrastructure and AI failures, enabling teams to act decisively when automation stops responding.

2. Own the digital front door

Resilience starts with controlling the path home, and the keys to get in. Avoid single-provider routing. Use independent DNS and load-balancing services, and test them often. For AI, build in retry logic, circuit breakers, and cached responses. Keep critical credentials in secure, independent vaults to authenticate even if the primary IAM goes dark. 
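
Here is a simplified sketch of those mechanisms working together: retries with exponential backoff, a circuit breaker, and a cache of last-known-good responses. The thresholds and fallback message are assumptions, not a production design:

```python
import time

class CircuitBreaker:
    """Trips open after repeated failures; recloses after a cooling-off period."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()
cache: dict[str, str] = {}  # last-known-good responses

def ask_model(prompt: str) -> str:
    raise TimeoutError("provider unreachable")  # simulate an outage

def resilient_ask(prompt: str, retries: int = 2) -> str:
    if breaker.is_open():  # stop hammering a failing provider
        return cache.get(prompt, "Service is busy; please try again shortly.")
    for attempt in range(retries + 1):
        try:
            answer = ask_model(prompt)
            cache[prompt] = answer   # refresh the cache on success
            breaker.failures = 0
            return answer
        except Exception:
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff
    breaker.record_failure()
    return cache.get(prompt, "Service is busy; please try again shortly.")

print(resilient_ask("Where is my order?"))
```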

3. Secure the AI surface

AI systems create new control layers and new risks. Test for prompt injection, manipulation, and data leakage. Monitor model drift and degradation under load. For autonomous agents, enforce least privilege, validate every action, and log all decisions for auditability.      
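
For autonomous agents, that discipline can start with an allow-list checked before every action, with each decision logged whether it runs or not. A minimal sketch, using a hypothetical support agent and an illustrative refund limit:

```python
import json
import time

# Hypothetical allow-list: each agent may invoke only the actions it needs.
AGENT_PERMISSIONS = {
    "support-agent": {"lookup_order", "issue_refund"},
}
ACTION_LIMITS = {"issue_refund": {"max_amount": 100.0}}  # illustrative guardrail

def execute_agent_action(agent: str, action: str, params: dict) -> str:
    allowed = action in AGENT_PERMISSIONS.get(agent, set())
    within_limits = True
    if action == "issue_refund":
        within_limits = params.get("amount", 0) <= ACTION_LIMITS["issue_refund"]["max_amount"]
    decision = "allowed" if (allowed and within_limits) else "denied"
    # Every decision is logged for auditability, whether or not it runs.
    print(json.dumps({"ts": time.time(), "agent": agent, "action": action,
                      "params": params, "decision": decision}))
    if decision == "denied":
        raise PermissionError(f"{agent} may not perform {action} with {params}")
    return f"{action} executed"

execute_agent_action("support-agent", "lookup_order", {"order_id": "A-42"})
try:
    execute_agent_action("support-agent", "issue_refund", {"amount": 500.0})
except PermissionError as exc:
    print(exc)
```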

4. Lead with a resilience mindset

Resilience isn’t an engineering artifact; it’s a leadership discipline. Build recovery exercises into normal operations, assign ownership for AI incidents, and define escalation paths before they’re needed. The goal isn’t to prevent every failure, but to ensure the system bends without breaking. 

The real test of leadership

For CISOs and CIOs, the AWS outage was more than a cloud failure; it was a test of control. It demonstrated that resilience, security, and governance now operate as a single, interdependent discipline.

When the control plane collapses, visibility vanishes. When AI systems falter, decisions go dark. Resilience is what spans those gaps. Its true measure is not uptime, but the speed with which trust, control, and communication are restored under pressure.

As Black Friday approaches, the stakes are as reputational as they are technical. Customers rarely remember the failure itself; they remember the clarity, transparency, and steadiness that followed. In an environment where AI increasingly shapes every interaction, control, not raw availability, has become the defining signal of leadership.

Real resilience is not the absence of failure. It is the ability to govern complexity so that when systems fail, the enterprise continues to stand.



Also, read here to explore the internal blind spot: Shadow AI.

