01 · Introduction
Somewhere in your organisation, someone is asking: "What happens if something fails?" Maybe it's the CFO before signing off on the ExaCC purchase. Maybe it's you, staring at an architecture diagram at 11 PM, trying to remember whether RAC covers a full-site outage. (It doesn't.)
Oracle Exadata Cloud@Customer (ExaCC) doesn't pretend failures won't happen. It layers protection so that most failures never reach the user — and the catastrophic ones have a rehearsed escape route. Oracle Real Application Clusters (RAC) handles compute. Oracle Automatic Storage Management (ASM) handles disks and storage cells. Oracle Data Guard handles the whole site going dark. Each layer picks up where the last one stops.
If you want the rack-and-control-plane picture first, start with our ExaCC Architecture Overview. This article assumes you know the hardware exists — and focuses on what keeps it alive.
"We had RAC. We thought we were covered. Then the UPS failed."
Priya runs a four-node RAC cluster on ExaCC for a regional bank. When a single compute node failed, users didn't notice — Application Continuity did its job. Six months later, a UPS failure took out the entire rack. Every RAC node went down together.
RAC had done exactly what it was designed to do. It just wasn't designed for that failure. Data Guard at the secondary site did the job instead — but only because they'd configured it, tested it, and knew which runbook to open. The lesson wasn't "RAC failed." It was "we'd been solving the wrong problem."
- High Availability: One node dies inside the rack. Users keep working. You find out from a monitoring alert, not a phone call.
- Disaster Recovery: The whole site goes dark. A standby in another city takes over, provided you built it, sized it, and rehearsed the cutover.
- RTO and RPO: The two numbers that turn architecture debates into budget decisions. Get them from the business first.
02 · What Is the Difference Between High Availability and Disaster Recovery?
Here's the mistake I see most often: a team buys RAC, feels "highly available," and never budgets for a DR site. Or they build a beautiful standby database and leave the application pointing at hard-coded IP addresses that die on failover. HA and DR solve different problems. You usually need both — but they are not interchangeable.
Figure 1 · HA keeps the lights on locally. DR keeps the business alive when the whole site doesn't.
High Availability — the "something broke in the rack" problem
Think of HA as your day-to-day safety net. A compute node dies during month-end close. A flash drive throws errors. You're patching Grid Infrastructure on a rolling basis. The cluster detects it, moves work elsewhere, and life continues. If you've done the application-side work (Application Continuity, connection pools, service names), most users never know anything happened.
Disaster Recovery — the "we've lost the data center" problem
DR is what you reach for when HA has nothing left to protect. The UPS dies. The building floods. A ransomware event takes every node offline at once. Now you're not relocating a service inside a cluster — you're promoting a standby database in another city and praying your lag was acceptable.
RTO and RPO — the numbers your business actually cares about
Before you pick a protection mode or redundancy level, get these two numbers from the business — not from the DBA team guessing in a vacuum.
| Metric | Definition | Typical HA Target | Typical DR Target |
|---|---|---|---|
| Recovery Time Objective (RTO) | Maximum acceptable downtime before service is restored | Seconds to minutes | Minutes to hours |
| Recovery Point Objective (RPO) | Maximum acceptable amount of data loss, measured as a window of time | Zero for critical DBs | Zero (sync) or minimal (async lag) |
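Once the business has given you the two numbers, it helps to hold every drill against them. Here's a minimal sketch of that comparison. The class names, function name, and sample values are hypothetical and only illustrate the idea; they are not part of any Oracle tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business (sample values are hypothetical)."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


@dataclass(frozen=True)
class DrillResult:
    """What a failover drill actually measured."""
    downtime: timedelta   # from outage start until service was restored
    data_loss: timedelta  # window of committed changes that did not survive


def gaps(objectives: RecoveryObjectives, drill: DrillResult) -> list[str]:
    """Return the objectives the drill missed (an empty list means it passed)."""
    missed = []
    if drill.downtime > objectives.rto:
        missed.append(f"RTO missed by {drill.downtime - objectives.rto}")
    if drill.data_loss > objectives.rpo:
        missed.append(f"RPO missed by {drill.data_loss - objectives.rpo}")
    return missed


# Hypothetical targets: back within 15 minutes, no lost data.
targets = RecoveryObjectives(rto=timedelta(minutes=15), rpo=timedelta(0))
drill = DrillResult(downtime=timedelta(minutes=22), data_loss=timedelta(0))
print(gaps(targets, drill))  # ['RTO missed by 0:07:00']
```

The point of writing it down this way is that the targets come from outside the DBA team, and the drill either meets them or it doesn't.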
03 · How Oracle RAC Delivers High Availability on Exadata Cloud@Customer
On ExaCC, RAC is what stops a single dead server from becoming a production outage. Multiple compute nodes share the same datafiles and work as one database — not one active and one passive, but genuinely active-active. When one node goes away, the others keep going.
If you live in RAC day-to-day, our Advanced Oracle RAC Administration course goes deeper on cache fusion tuning, service relocation, and connection pool setup. This section is the "what actually happens when things break" version.
Cache Fusion — why RAC feels like one database
When Node A needs a data block that Node B already has in memory, RAC doesn't send both nodes to disk. Cache Fusion moves the block directly between SGAs over ExaCC's internal Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) fabric. That's why RAC on Exadata feels fast, and why node-to-node coordination typically stays well under a millisecond instead of waiting on disk I/O.
When a compute node dies — the eight-second story
Picture this: Node 2 in a three-node cluster stops responding. You didn't reboot it. Clusterware did its job before you opened a ticket. Here's the sequence, roughly:
Figure 2 · What happens in the first eight seconds after an ExaCC compute node failure
Application Continuity only works if you configure it: set FAILOVER_TYPE=TRANSACTION and RETRY_COUNT, and connect through a service name, not a hard-coded SID on one node. I've seen million-dollar RAC clusters defeated by a JDBC URL someone wrote in 2019.
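To make "service name, not SID" concrete, here's a rough sketch that assembles an Oracle Net connect descriptor as a string. The function name, host names, service name, and numeric values are all hypothetical. The descriptor keywords (RETRY_COUNT, RETRY_DELAY, CONNECT_TIMEOUT, TRANSPORT_CONNECT_TIMEOUT, LOAD_BALANCE, SERVICE_NAME) are standard Oracle Net parameters, but treat the values as placeholders to tune, not recommendations.

```python
def connect_descriptor(service_name, scan_hosts, port=1521,
                       retry_count=20, retry_delay=3,
                       connect_timeout=90, transport_connect_timeout=3):
    """Build a connect descriptor that targets a service, never a SID.

    One ADDRESS_LIST is emitted per SCAN host, so the same string can list
    the primary site first and the standby site second.
    """
    if not scan_hosts:
        raise ValueError("at least one SCAN host is required")
    address_lists = "".join(
        f"(ADDRESS_LIST=(LOAD_BALANCE=ON)"
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port})))"
        for host in scan_hosts
    )
    return (
        f"(DESCRIPTION="
        f"(CONNECT_TIMEOUT={connect_timeout})(RETRY_COUNT={retry_count})"
        f"(RETRY_DELAY={retry_delay})"
        f"(TRANSPORT_CONNECT_TIMEOUT={transport_connect_timeout})"
        f"{address_lists}"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


# Hypothetical names: one SCAN listener per site, one application service.
print(connect_descriptor("sales_svc", ["site-a-scan.example.com",
                                       "site-b-scan.example.com"]))
```

Note what is missing: the failover type. As far as I know that is an attribute of the database service itself, set on the server side, so it doesn't appear in the client string. The client's job is to find the service wherever it is currently running and to keep retrying while it moves.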
04 · What Role Does Oracle ASM Play in ExaCC Resilience?
RAC protects your compute. ASM protects your data on disk. On ExaCC, every database file lives inside ASM disk groups that stripe across Exadata Storage Cells and mirror through Failure Groups — so losing one cell doesn't mean losing your data.
Failure Group design is where a lot of "we thought we had HA" stories go wrong. Our ASM Failure Groups deep dive walks through the mapping in detail. The short version: one Failure Group per storage cell, always.
Striping spreads the load. Mirroring absorbs the hit.
ASM splits your datafiles into extents and spreads them across every storage cell in the rack — so no single disk becomes a bottleneck. Then it mirrors each extent across separate Failure Groups. If Cell 2 dies on a Friday night, the database keeps running from copies on Cell 1 and Cell 3. You fix the hardware Monday. The database never went offline.
Figure 3 · How ASM places mirrored copies across independent Exadata Storage Cells
| Redundancy Mode | Copies per Extent | Cell Failures Tolerated | Typical ExaCC Use Case |
|---|---|---|---|
| High Redundancy | 3-way mirror | 2 simultaneous cells | Production, financial, regulated workloads |
| Normal Redundancy | 2-way mirror | 1 cell | Dev/test, non-critical secondary systems |
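If you want to convince yourself of the "cell failures tolerated" column, a toy model is enough. The sketch below places mirror copies round-robin across cells, treats each cell as its own Failure Group, and then brute-forces every combination of failed cells. It is deliberately simplistic: real ASM placement, quorum requirements, and the free space needed for a rebuild are all ignored, and every name in it is made up for the illustration.

```python
from itertools import combinations


def place_extents(num_extents, cells, copies):
    """Toy placement: each extent gets `copies` copies on distinct cells."""
    if copies > len(cells):
        raise ValueError("need at least as many cells as mirror copies")
    return [
        {cells[(extent + k) % len(cells)] for k in range(copies)}
        for extent in range(num_extents)
    ]


def survives(placement, failed_cells):
    """True if every extent still has at least one copy on a healthy cell."""
    failed = set(failed_cells)
    return all(extent_cells - failed for extent_cells in placement)


def tolerated_failures(placement, cells):
    """Largest n such that losing ANY n cells leaves every extent readable."""
    n = 0
    while n < len(cells) and all(
        survives(placement, combo) for combo in combinations(cells, n + 1)
    ):
        n += 1
    return n


cells = ["cell1", "cell2", "cell3", "cell4", "cell5"]
normal = place_extents(100, cells, copies=2)  # 2-way mirror
high = place_extents(100, cells, copies=3)    # 3-way mirror

print(tolerated_failures(normal, cells))  # 1
print(tolerated_failures(high, cells))    # 2
```

The result matches the table only because no two copies of an extent ever share a cell. Put two copies in the same Failure Group and the numbers drop, which is why the one-Failure-Group-per-cell rule matters.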
Automatic rebalancing — the part that saves your weekend
When a drive or an entire cell fails, ASM doesn't wait for you to open a change ticket. It starts rebuilding missing mirror copies onto surviving cells — while the database stays open and serving traffic. You replace the hardware when Oracle's field engineer arrives. The data layer already healed itself.
05 · What Happens If an Entire Site Goes Down?
RAC and ASM are brilliant inside the building. They mean nothing when the building loses power.
That's the moment you stop talking about service relocation and start talking about failover — promoting a standby database somewhere else to become the new primary. It's stressful, it's irreversible (unlike a switchover), and it's exactly why you rehearse it twice a year instead of reading about it for the first time during an actual outage.
Figure 4 · Switchover is a scheduled handoff. Failover is an emergency promotion.
The nightmare scenario isn't failover itself — it's discovering your standby was 45 minutes behind on redo apply when the primary died. That's the RTO-versus-RPO trade-off in real life: how fast can you come back online, and how much committed data are you willing to leave behind?
06 · How Does Oracle Data Guard Protect Enterprise Databases on ExaCC?
Data Guard is your insurance policy for the whole site. It keeps a live copy of your production database at a second location — another data center, another city, sometimes an OCI region — by continuously shipping redo from primary to standby.
Configuration details matter here. Our Active Data Guard setup guide for Exadata covers the step-by-step. This section is about choosing the right protection mode and knowing what you're trading away.
Three protection modes — pick your poison carefully
Every protection mode is a trade-off between "how sure are we no data was lost" and "how much latency does every commit pay." There's no free lunch. Here's the honest comparison:
| Protection Mode | Replication Type | Data Loss Risk (RPO) | Performance Impact | Best For |
|---|---|---|---|---|
| Maximum Protection | Synchronous | Absolute zero | High: every commit waits for standby acknowledgement, and the primary shuts down if no synchronized standby is reachable | Core banking, trading systems |
| Maximum Availability | Synchronous (primary keeps running unsynchronized if the standby is unreachable) | Zero under normal conditions | Moderate | Most enterprise production workloads |
| Maximum Performance | Asynchronous | Minimal — depends on transport lag | None | DR with distance/latency constraints |
Table 1 · Data Guard protection modes — replication, RPO, and performance trade-offs
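The "performance impact" column is easier to reason about with a number attached. Here's a first-order estimate of what a commit pays under synchronous versus asynchronous transport. It is a back-of-the-envelope model, not a measurement: it assumes the local redo write and the trip to the standby overlap, and the function name and sample timings are hypothetical.

```python
def estimated_commit_ms(local_write_ms, round_trip_ms, standby_write_ms,
                        synchronous):
    """Rough commit latency in milliseconds under a simplified model.

    Asynchronous: the commit only waits for the local redo write.
    Synchronous: the commit also waits for the standby to acknowledge,
    modelled here as the slower of the two paths running in parallel.
    """
    if not synchronous:
        return local_write_ms
    return max(local_write_ms, round_trip_ms + standby_write_ms)


# Hypothetical timings: 0.5 ms redo writes on both sides.
for rtt in (1.0, 5.0, 40.0):
    sync_ms = estimated_commit_ms(0.5, rtt, 0.5, synchronous=True)
    async_ms = estimated_commit_ms(0.5, rtt, 0.5, synchronous=False)
    print(f"RTT {rtt} ms: sync ~{sync_ms} ms, async ~{async_ms} ms")
```

Under this model the round trip dominates as soon as the sites are more than a few milliseconds apart, which is the practical reason synchronous modes pair with nearby standbys and asynchronous transport with distant ones.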
Active Data Guard — your standby shouldn't sit idle
With Active Data Guard, the standby stays open for read-only queries while redo apply runs in the background. That means your month-end reporting, heavy analytics, and even RMAN backups can run against the standby instead of hammering production. It's one of the few DR investments that pays dividends before disaster strikes.
Fast-Start Failover (FSFO) adds automation. A Data Guard Observer — a small witness process in a third network zone — watches both sites. If the primary vanishes, the Observer and standby coordinate a failover without waiting for someone to find the runbook at 2 AM.
Figure 5 · Data Guard architecture — redo transport, Active DG workloads, and FSFO Observer
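The Observer's decision is easier to trust once you've seen it written as a rule. This is a toy version of that rule, not the broker's actual implementation: the class, field names, and default values are hypothetical, and the two limits only loosely mirror the broker's FastStartFailoverThreshold and FastStartFailoverLagLimit properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteView:
    """What the two witnesses currently know about the primary."""
    observer_sees_primary: bool
    standby_sees_primary: bool
    seconds_unreachable: float   # how long both have been cut off
    standby_lag_seconds: float   # redo the standby has not yet applied


def should_fail_over(view, threshold_seconds=30.0, lag_limit_seconds=0.0):
    """Decide whether an automatic failover is allowed right now."""
    # If either witness can still reach the primary, it may be alive;
    # promoting the standby now would risk two primaries (split-brain).
    if view.observer_sees_primary or view.standby_sees_primary:
        return False
    # Ride out short network blips before doing anything irreversible.
    if view.seconds_unreachable < threshold_seconds:
        return False
    # Only promote a standby whose lag is within the accepted data loss.
    return view.standby_lag_seconds <= lag_limit_seconds


blip = SiteView(False, False, seconds_unreachable=5, standby_lag_seconds=0)
outage = SiteView(False, False, seconds_unreachable=45, standby_lag_seconds=0)
partition = SiteView(True, False, seconds_unreachable=45, standby_lag_seconds=0)

print(should_fail_over(blip))       # False
print(should_fail_over(outage))     # True
print(should_fail_over(partition))  # False
```

The third case is the reason the Observer sits in its own network zone: when only the standby has lost sight of the primary, the problem is the link between sites, not the primary.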
07 · How Do Oracle RAC, ASM, and Data Guard Work Together?
Think of ExaCC resilience like a building with three fire doors. Each one closes a different kind of breach. You don't pick one — you stack them.
Figure 6 · Each layer handles failures the layer below it cannot — together they cover node, cell, and site outages
| Layer | Technology | Failure Scope | Typical RTO | Typical RPO |
|---|---|---|---|---|
| Compute | Oracle RAC + Application Continuity | Single node / NIC | Seconds | Zero |
| Storage | Oracle ASM (Failure Groups) | Disk / storage cell | None (online rebuild) | Zero |
| Site | Oracle Data Guard (+ FSFO) | Entire data center | Minutes – hours | Zero – near-zero |
08 · How Should Enterprises Design Business Continuity on ExaCC?
There's no single "correct" HA/DR design. A hospital can't tolerate the same latency trade-offs as a factory floor ERP. Here are four patterns we see repeatedly — and why each one makes sense for that industry.
Banking and financial services — zero data loss, no excuses
What keeps the CIO awake: A lost transaction isn't a bug report — it's a regulatory incident.
Typical ExaCC design: ASM High Redundancy on the primary rack. Synchronous Data Guard to a second ExaCC rack nearby (under 5 ms round-trip) in Maximum Availability mode. A third asynchronous standby in a distant OCI region for true geographic DR. Application Continuity on every connection pool. FSFO with an Observer in a third network zone.
Healthcare — clinical systems can't stutter
What keeps the CIO awake: A nurse entering vitals at bedside can't get an ORA-03113 because a node rebooted.
Typical ExaCC design: Dual- or quad-node RAC with Application Continuity fully wired into EHR connection pools. Data Guard in Maximum Performance to a secondary site — clinical writes can't wait on synchronous round-trips. Active Data Guard on the standby handles analytics and audit reporting so production stays fast.
Global manufacturing — the line stops, money burns
What keeps the CIO awake: ERP downtime stops assembly lines. Every minute has a dollar figure attached.
Typical ExaCC design: Multi-node RAC with separate ASM disk groups for production data vs backup archives. Maximum Availability Data Guard to a secondary facility — production runs at full speed, and the supply chain engine survives a regional network blip without manual intervention.
Government and defense — sovereignty plus distance
What keeps the CIO awake: Data cannot leave sovereign boundaries, but it also cannot exist in only one physical location.
Typical ExaCC design: ExaCC in on-premises sovereign data centers. Data Guard over dedicated encrypted dark fiber. FSFO Observers in three separate administrative zones to prevent split-brain when networks get unreliable — which they will, under stress.
09 · Common Misconceptions About HA and DR
These come up in almost every architecture review. They're understandable — and they're wrong in ways that hurt.
"Oracle RAC replaces the need for a Disaster Recovery site."
The reality: RAC protects against server crashes inside the same room. If the data center loses power, catches fire, or suffers a destructive cyber attack, all RAC nodes go offline simultaneously. RAC is HA — you still need Data Guard for DR.
"ASM mirroring eliminates the need for database backups."
The reality: ASM mirroring protects against hardware failure, not human error. A DROP TABLE with the PURGE option is mirrored to every failure group instantly. RMAN backups to OCI Object Storage or a Zero Data Loss Recovery Appliance remain mandatory.
"High Availability means absolutely zero downtime under every scenario."
The reality: HA dramatically reduces downtime for common infrastructure faults. Major architecture changes, structural database modifications, or complex application upgrades may still require planned maintenance windows.
"Disaster Recovery planning is only for mega-corporations."
The reality: Ransomware and infrastructure failures don't check revenue before striking. Small and mid-market ExaCC deployments need disciplined DR strategies just as much as global banks.
10 · Enterprise Best Practices for ExaCC Resilience
- Wire up Application Continuity before you need it. I've seen perfect RAC failovers produce angry users because the JDBC URL pointed at a SID, not a service. Set FAILOVER_TYPE=TRANSACTION and test it; don't assume.
- Give redo transport its own network path. Nothing kills a standby faster than sharing the redo NIC with bulk ETL or backup traffic. Isolate it. Monitor lag daily, not quarterly.
- Size the standby like you mean to run on it. A standby with half the OCPUs of primary will embarrass you during the one failover that actually matters.
- Drill failover without breaking replication. Snapshot Standby lets you test read-write against a clone of your DR database. Use it. A runbook nobody has executed is fiction.
- Alert on lag before lag becomes a crisis. Fifteen minutes of transport lag is a warning. Four hours is a board-level conversation. Configure OEM or Cloud Guard alerts and act on them.
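The lag thresholds in the last bullet are simple enough to encode directly. Here's a small sketch that classifies a lag value against the fifteen-minute and four-hour marks. The function names are hypothetical, and the parser assumes lag arrives as text in a "+DD HH:MM:SS" day-to-second form; adjust it to whatever your monitoring source actually emits.

```python
import re
from datetime import timedelta

WARNING = timedelta(minutes=15)  # "a warning"
CRITICAL = timedelta(hours=4)    # "a board-level conversation"


def parse_lag(text):
    """Parse a lag string such as '+00 00:20:00' into a timedelta.

    The '+DD HH:MM:SS' layout is an assumption about the input format.
    """
    match = re.fullmatch(r"\+(\d+) (\d{2}):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised lag value: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def classify_lag(lag):
    """Map a lag duration onto the alert levels described above."""
    if lag >= CRITICAL:
        return "critical"
    if lag >= WARNING:
        return "warning"
    return "ok"


for sample in ("+00 00:01:30", "+00 00:20:00", "+00 04:00:00"):
    print(sample, classify_lag(parse_lag(sample)))
```

Whatever tool raises the alert, the thresholds should live somewhere versioned and reviewable, next to the RPO they are meant to protect.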
11 · The Business Continuity Checklist
Print this. Walk your ExaCC deployment with a colleague. Be honest about the unchecked boxes — those are your real gaps.
- Is Oracle RAC configured across multiple nodes? (Verifies compute-tier high availability.)
- Are ASM Failure Groups explicitly mapped across distinct storage cell units? (Protects against storage server chassis failure.)
- Is Oracle Data Guard configured with an explicit protection mode matching business RPO? (Verifies site-level DR readiness.)
- Have you enabled and tested Application Continuity on application servers? (Ensures end-user transaction replay during node failures.)
- Is a Data Guard Fast-Start Failover (FSFO) Observer deployed in a separate, third network zone? (Enables hands-free automated failover.)
- Are RMAN database backups stored on a separate platform like OCI Object Storage or Zero Data Loss Recovery Appliance? (Protects against localized rack destruction.)
- Has a full role-reversal (planned switchover) drill been performed within the last 6 months? (Validates operational readiness.)
- Are application connection strings configured to automatically search for the active primary database service across both data centers? (Ensures application traffic follows the active instance.)
12 · The Short Version — 8 Things Every Enterprise Should Know
- HA and DR are distinct operational pillars. HA handles local failures within one data center; DR handles entire site losses across regions.
- Oracle RAC handles compute node crashes. Active-active multi-node setups keep connections running via Cache Fusion and Application Continuity.
- Oracle ASM prevents data loss from drive failures. Intelligent mirroring and automatic rebalancing protect storage blocks without operational disruption.
- Oracle Data Guard is your site protection layer. It continuously ships transactional redo logs across geographical boundaries to a standby destination.
- ExaCC combines these layers into one architecture. RAC, ASM, and Data Guard form a cohesive cloud-managed resilience stack.
- RTO and RPO determine your budget and setup. Establish clear recovery metrics before choosing Maximum Availability or Maximum Performance.
- Software settings matter as much as hardware. HA fails if application connection strings aren't tuned for seamless node transitions.
- Resilience requires regular testing. A DR plan is only as good as your last successful failover test.
13 · Frequently Asked Questions
Does Oracle RAC protect my data if an entire Exadata rack loses power?
No. RAC only protects against individual compute node failures within that rack. If the entire rack loses power, all instances go offline. Deploy Oracle Data Guard to replicate data to a separate system at another location.
What is the difference between a Data Guard Switchover and Failover?
Switchover is planned — primary and standby swap roles with zero data loss, usually for maintenance. Failover is emergency-only — executed when the primary is unexpectedly lost or destroyed.
Can I use Active Data Guard to offload read-heavy reports on ExaCC?
Yes. Active Data Guard keeps the standby open in read-only mode while applying changes from the primary — a common strategy for offloading reporting and backup tasks without impacting production.
How does ExaCC handle a failure of an entire storage cell?
ASM redundancy absorbs it. With Normal (2-way) or High (3-way) Redundancy, ASM continues serving reads and writes from mirrored extents on remaining healthy cells without interrupting the database.
What happens to active transactions when a compute node crashes?
With Application Continuity configured, uncommitted in-flight transactions are replayed against a surviving node. Users see a brief pause — not a connection error.
Is Maximum Performance mode safe for financial databases?
Generally no. Asynchronous replication introduces minor data-loss risk if the primary is destroyed suddenly. For financial workloads, Maximum Availability with synchronous replication is the recommended choice.
How often should we test our DR failover plan?
Execute a full DR drill at least once or twice per year. Use a Data Guard Snapshot Standby to test safely: the standby opens read-write for the drill while it keeps receiving redo from production, then is converted back and catches up.
Does ExaCC automate database patching without downtime?
Yes. ExaCC supports rolling patch updates. Clusterware patches individual compute nodes one at a time while remaining nodes continue handling production workloads.
14 · Conclusion
No single technology saves you on ExaCC. RAC, ASM, and Data Guard each handle a different kind of failure — and none of them work if the application layer isn't configured to follow the database when it moves.
The architects who sleep well aren't the ones who bought the most redundancy. They're the ones who tested failover last quarter, know their redo lag right now, and can tell you their RTO in plain English without opening a slide deck.
The question that separates a working ExaCC deployment from one that survives a real outage is still the same one DBAs have always asked: how does this fail — and how do we recover?
Our Exadata Expert course covers ExaCC HA/DR design hands-on — RAC failover labs, ASM Failure Group mapping, Active Data Guard setup, and the failover drills most teams never get around to doing on their own.