Oracle Exadata Cloud@Customer HA & DR Architecture Guide

01 · Introduction

Somewhere in your organisation, someone is asking: "What happens if something fails?" Maybe it's the CFO before signing off on the ExaCC purchase. Maybe it's you, staring at an architecture diagram at 11 PM, trying to remember whether RAC covers a full-site outage. (It doesn't.)

Oracle Exadata Cloud@Customer (ExaCC) doesn't pretend failures won't happen. It layers protection so that most failures never reach the user — and the catastrophic ones have a rehearsed escape route. Oracle Real Application Clusters (RAC) handles compute. Oracle Automatic Storage Management (ASM) handles disks and storage cells. Oracle Data Guard handles the whole site going dark. Each layer picks up where the last one stops.

If you want the rack-and-control-plane picture first, start with our ExaCC Architecture Overview. This article assumes you know the hardware exists — and focuses on what keeps it alive.

Field note · Production DBA · 14 years Exadata

"We had RAC. We thought we were covered. Then the UPS failed."

Priya runs a four-node RAC cluster on ExaCC for a regional bank. When a single compute node failed, users didn't notice — Application Continuity did its job. Six months later, a UPS failure took out the entire rack. Every RAC node went down together.

RAC had done exactly what it was designed to do. It just wasn't designed for that failure. Data Guard at the secondary site did the job instead — but only because they'd configured it, tested it, and knew which runbook to open. The lesson wasn't "RAC failed." It was "we'd been solving the wrong problem."

HA

One node dies inside the rack. Users keep working. You find out from a monitoring alert, not a phone call.

DR

The whole site goes dark. A standby in another city takes over — if you built it, sized it, and rehearsed the cutover.

RTO/RPO

The two numbers that turn architecture debates into budget decisions. Get them from the business first.

02 · What Is the Difference Between High Availability and Disaster Recovery?

High Availability (HA) handles localized, component-level failures within a single site. Disaster Recovery (DR) handles catastrophic events that take an entire infrastructure footprint offline and require failover to a remote location.

Here's the mistake I see most often: a team buys RAC, feels "highly available," and never budgets for a DR site. Or they build a beautiful standby database and leave the application pointing at hard-coded IP addresses that die on failover. HA and DR solve different problems. You usually need both — but they are not interchangeable.

Figure 1 · HA keeps the lights on locally. DR keeps the business alive when the whole site doesn't.

High Availability — the "something broke in the rack" problem

Think of HA as your day-to-day safety net. A compute node dies during month-end close. A flash drive throws errors. You're patching Grid Infrastructure on a rolling basis. The cluster detects it, moves work elsewhere, and life continues. If you've done the application-side work (Application Continuity, connection pools, service names), most users never know anything happened.

Disaster Recovery — the "we've lost the data center" problem

DR is what you reach for when HA has nothing left to protect. The UPS dies. The building floods. A ransomware event takes every node offline at once. Now you're not relocating a service inside a cluster — you're promoting a standby database in another city and praying your lag was acceptable.

RTO and RPO — the numbers your business actually cares about

Before you pick a protection mode or redundancy level, get these two numbers from the business — not from the DBA team guessing in a vacuum.

Metric	Definition	Typical HA Target	Typical DR Target
Recovery Time Objective (RTO)	Maximum acceptable downtime before service is restored	Seconds to minutes	Minutes to hours
Recovery Point Objective (RPO)	Maximum acceptable age of data that can be lost	Absolute zero for critical DBs	Zero (sync) or minimal (async lag)

03 · How Oracle RAC Delivers High Availability on Exadata Cloud@Customer

Oracle RAC is an active-active cluster where multiple compute nodes simultaneously access the same database files, synchronizing locks and data blocks through Cache Fusion over a low-latency Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) fabric.

On ExaCC, RAC is what stops a single dead server from becoming a production outage. Multiple compute nodes share the same datafiles and work as one database — not one active and one passive, but genuinely active-active. When one node goes away, the others keep going.

If you live in RAC day-to-day, our Advanced Oracle RAC Administration course goes deeper on cache fusion tuning, service relocation, and connection pool setup. This section is the "what actually happens when things break" version.

Component: Oracle RAC
Scope: Compute nodes in one rack
Interconnect: Cache Fusion over RoCE
Survives: Loss of N−1 nodes

Cache Fusion — why RAC feels like one database

When Node A needs a data block that Node B already has in memory, RAC doesn't send both nodes to disk. Cache Fusion ships the block directly between the instances' buffer caches (SGAs) over ExaCC's internal RoCE fabric. That's why RAC on Exadata feels fast: a block transfer between nodes typically completes in well under a millisecond, instead of waiting on a disk read.

When a compute node dies — the eight-second story

Picture this: Node 2 in a three-node cluster stops responding. You didn't reboot it. Clusterware did its job before you opened a ticket. Here's the sequence, roughly:

Figure 2 · What happens in the first eight seconds after an ExaCC compute node failure

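The part teams skip: RAC can recover perfectly and your app still falls over. Application Continuity only works when the database service is defined with FAILOVER_TYPE=TRANSACTION (or AUTO for Transparent Application Continuity), the connect string sets RETRY_COUNT and a retry delay, the client uses a replay-capable driver or connection pool, and the application connects through a service name rather than a hard-coded SID on one node. I've seen million-dollar RAC clusters defeated by a JDBC URL someone wrote in 2019.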
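A quick way to find those 2019-era URLs is to lint them. The sketch below is a minimal, hypothetical checker (the function name and the host, SID, and service names are made up for illustration) that separates the legacy host:port:SID form of a thin-driver JDBC URL from the service-name forms. It only inspects the string; it does not prove that Application Continuity is actually enabled on the service.

```python
import re

PREFIX = "jdbc:oracle:thin:@"

# Legacy form: jdbc:oracle:thin:@host:port:SID (pins the client to one instance).
SID_STYLE = re.compile(r"^jdbc:oracle:thin:@(?!//)[^:/()]+:\d+:[^:/]+$")
# Service form: jdbc:oracle:thin:@//host:port/service_name (leading // optional).
SERVICE_STYLE = re.compile(r"^jdbc:oracle:thin:@(//)?[^:/()]+(:\d+)?/[^/]+$")


def classify_jdbc_url(url: str) -> str:
    """Return 'sid', 'service', 'descriptor', or 'unknown' for a thin-driver URL."""
    url = url.strip()
    if url.startswith(PREFIX + "("):
        # A full connect descriptor can still pin a SID inside CONNECT_DATA.
        compact = url.upper().replace(" ", "")
        return "sid" if "(SID=" in compact else "descriptor"
    if SID_STYLE.match(url):
        return "sid"
    if SERVICE_STYLE.match(url):
        return "service"
    return "unknown"


def urls_pinned_to_a_sid(urls: list[str]) -> list[str]:
    """Return the URLs that would not follow a service to a surviving node."""
    return [u for u in urls if classify_jdbc_url(u) == "sid"]


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    candidates = [
        "jdbc:oracle:thin:@exacc-node1:1521:ORCL1",
        "jdbc:oracle:thin:@//exacc-scan:1521/sales_svc",
    ]
    for url in urls_pinned_to_a_sid(candidates):
        print("Pinned to a SID:", url)
```

Running something like this over your application config repositories before a failover drill is cheaper than finding the same URL during one.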

04 · What Role Does Oracle ASM Play in ExaCC Resilience?

Oracle ASM is the integrated volume manager and file system for Oracle database files. On ExaCC, ASM stripes data across all Exadata Storage Cells and mirrors extents through Failure Groups mapped to distinct storage servers.

RAC protects your compute. ASM protects your data on disk. On ExaCC, every database file lives inside ASM disk groups that stripe across Exadata Storage Cells and mirror through Failure Groups — so losing one cell doesn't mean losing your data.

Failure Group design is where a lot of "we thought we had HA" stories go wrong. Our ASM Failure Groups deep dive walks through the mapping in detail. The short version: one Failure Group per storage cell, always.

Striping spreads the load. Mirroring absorbs the hit.

ASM splits your datafiles into extents and spreads them across every storage cell in the rack — so no single disk becomes a bottleneck. Then it mirrors each extent across separate Failure Groups. If Cell 2 dies on a Friday night, the database keeps running from copies on Cell 1 and Cell 3. You fix the hardware Monday. The database never went offline.

Figure 3 · How ASM places mirrored copies across independent Exadata Storage Cells

Redundancy Mode	Copies per Extent	Cell Failures Tolerated	Typical ExaCC Use Case
High Redundancy	3-way mirror	2 simultaneous cells	Production, financial, regulated workloads
Normal Redundancy	2-way mirror	1 cell	Dev/test, non-critical secondary systems
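To see why the "Cell Failures Tolerated" column follows from the number of copies, here is a toy model in Python. It is not how ASM actually places extents: it ignores partner disks, free-space requirements, and voting-file quorum, and the cell names are invented. It only demonstrates the core rule that each mirror copy of an extent lands in a different Failure Group.

```python
from itertools import combinations


def place_extents(num_extents: int, cells: list[str], copies: int) -> dict[int, set[str]]:
    """Toy placement: each extent gets `copies` mirrors on distinct cells (failure groups)."""
    if copies > len(cells):
        raise ValueError("need at least as many cells as mirror copies")
    placement = {}
    for extent in range(num_extents):
        # Rotate the starting cell so extents are striped across every cell.
        start = extent % len(cells)
        placement[extent] = {cells[(start + i) % len(cells)] for i in range(copies)}
    return placement


def survives(placement: dict[int, set[str]], failed_cells) -> bool:
    """True if every extent still has at least one copy on a healthy cell."""
    failed = set(failed_cells)
    return all(copies - failed for copies in placement.values())


def max_cell_failures_tolerated(placement: dict[int, set[str]], cells: list[str]) -> int:
    """Largest k such that ANY k simultaneous cell failures leave all extents readable."""
    tolerated = 0
    for k in range(1, len(cells)):
        if all(survives(placement, combo) for combo in combinations(cells, k)):
            tolerated = k
        else:
            break
    return tolerated


if __name__ == "__main__":
    cells = [f"cell{i}" for i in range(1, 6)]  # hypothetical five-cell rack
    for label, copies in (("Normal", 2), ("High", 3)):
        placement = place_extents(1000, cells, copies)
        print(label, "redundancy tolerates", max_cell_failures_tolerated(placement, cells), "cell failure(s)")
```

The design point the model makes: tolerance comes from copies sitting in different Failure Groups, not from the raw number of disks. Two copies on the same cell would count for nothing here.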

Automatic rebalancing — the part that saves your weekend

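When a drive fails, ASM doesn't wait for you to open a change ticket. It starts rebuilding the missing mirror copies onto surviving disks while the database stays open and serving traffic. When an entire cell goes offline, ASM first waits out a configurable repair timer (the disk group's failgroup_repair_time attribute) in case the cell comes back, and rebalances if it doesn't. Either way, you replace the hardware when Oracle's field engineer arrives, and the database never stopped serving from the surviving mirrors.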

05 · What Happens If an Entire Site Goes Down?

RAC and ASM are brilliant inside the building. They mean nothing when the building loses power.

That's the moment you stop talking about service relocation and start talking about failover — promoting a standby database somewhere else to become the new primary. It's stressful, it's irreversible (unlike a switchover), and it's exactly why you rehearse it twice a year instead of reading about it for the first time during an actual outage.

Figure 4 · Switchover is a scheduled handoff. Failover is an emergency promotion.

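The nightmare scenario isn't failover itself. It's discovering that your standby was 45 minutes behind when the primary died. Which kind of lag it was matters: redo that never reached the standby (transport lag) is committed data you lose, and that is your RPO; redo that arrived but hasn't been applied yet (apply lag) is time you wait before the standby can open, and that is your RTO. How fast can you come back online, and how much committed data are you willing to leave behind?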
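The arithmetic behind that trade-off can be sketched in a few lines. This is a back-of-the-envelope model, not an Oracle formula. It assumes transport lag approximates the data you would lose, that the standby must apply its remaining backlog before opening, and that the backlog drains at a fixed multiple of the primary's redo generation rate (apply_speedup, a made-up parameter you would have to measure). All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyState:
    transport_lag_s: float  # redo generated on the primary but not yet received by the standby
    apply_lag_s: float      # how far the standby's applied data trails the primary


def assess_failover(state: StandbyState, rpo_s: float, rto_s: float,
                    role_change_s: float, apply_speedup: float) -> dict:
    """Estimate data-loss exposure and recovery time if the primary died right now."""
    if apply_speedup <= 0:
        raise ValueError("apply_speedup must be positive")
    # Redo that never arrived cannot be recovered: this is the RPO exposure.
    data_loss_s = state.transport_lag_s
    # Redo that arrived but is not applied yet must be applied before opening.
    backlog_s = max(state.apply_lag_s - state.transport_lag_s, 0.0)
    catch_up_s = backlog_s / apply_speedup
    recovery_s = catch_up_s + role_change_s
    return {
        "data_loss_s": data_loss_s,
        "recovery_s": recovery_s,
        "meets_rpo": data_loss_s <= rpo_s,
        "meets_rto": recovery_s <= rto_s,
    }


if __name__ == "__main__":
    # Hypothetical: 30 s of redo in flight, standby 45 minutes behind on apply.
    state = StandbyState(transport_lag_s=30, apply_lag_s=45 * 60)
    print(assess_failover(state, rpo_s=60, rto_s=15 * 60, role_change_s=120, apply_speedup=4.0))
```

Plug in your own measured lag and apply rate. If meets_rto flips to False with realistic numbers, that is a sizing conversation to have before the outage, not during it.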

06 · How Does Oracle Data Guard Protect Enterprise Databases on ExaCC?

Oracle Data Guard maintains one or more synchronized standby copies of your production database at a remote data center or OCI public cloud region, continuously shipping redo log changes from the primary ExaCC system to the standby.

Data Guard is your insurance policy for the whole site. It keeps a live copy of your production database at a second location — another data center, another city, sometimes an OCI region — by continuously shipping redo from primary to standby.

Configuration details matter here. Our Active Data Guard setup guide for Exadata covers the step-by-step. This section is about choosing the right protection mode and knowing what you're trading away.

Three protection modes — pick your poison carefully

Every protection mode is a trade-off between "how sure are we no data was lost" and "how much latency does every commit pay." There's no free lunch. Here's the honest comparison:

Protection Mode	Replication Type	Data Loss Risk (RPO)	Performance Impact	Best For
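Maximum Protection	Synchronous	Zero	High: every commit waits for the standby to acknowledge the redo, and the primary shuts down if no synchronized standby is reachable	Core banking, trading systems
Maximum Availability	Synchronous (primary keeps running unsynchronized if the standby becomes unreachable)	Zero while the standby is synchronized	Moderate: commit latency grows with network round-trip time	Most enterprise production workloads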
Maximum Performance	Asynchronous	Minimal — depends on transport lag	None	DR with distance/latency constraints

Table 1 · Data Guard protection modes — replication, RPO, and performance trade-offs
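Table 1 can be read as a small decision procedure. The Python sketch below encodes one possible reading of it, under stated assumptions: the function and parameter names are hypothetical, and the 5 ms round-trip cut-off for synchronous transport is a rule of thumb borrowed from the banking pattern later in this article, not an Oracle limit. The returned strings match the mode names the Data Guard broker uses (MaxProtection, MaxAvailability, MaxPerformance).

```python
def suggest_protection_mode(zero_data_loss_required: bool,
                            round_trip_ms: float,
                            primary_may_stall: bool,
                            max_sync_rtt_ms: float = 5.0) -> tuple[str, str]:
    """Return (mode, reason) as a starting point for a design discussion."""
    if not zero_data_loss_required:
        return ("MaxPerformance",
                "asynchronous transport; data-loss exposure equals transport lag")
    if round_trip_ms > max_sync_rtt_ms:
        # The business asked for zero loss, but the link cannot carry sync commits.
        return ("MaxPerformance",
                "link too slow for synchronous commits; revisit the RPO or add a nearby standby")
    if primary_may_stall:
        return ("MaxProtection",
                "zero data loss guaranteed; primary stops if no synchronized standby is reachable")
    return ("MaxAvailability",
            "zero data loss while synchronized; primary keeps running if the standby drops")


if __name__ == "__main__":
    # Hypothetical inputs: metro-distance standby, availability preferred over stalling.
    mode, reason = suggest_protection_mode(True, round_trip_ms=2.0, primary_may_stall=False)
    print(mode, "-", reason)
```

The second branch is the important one in practice: a zero-RPO requirement over a long-distance link is a contradiction the business has to resolve, usually with a nearby synchronous standby plus a distant asynchronous one.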

Active Data Guard — your standby shouldn't sit idle

With Active Data Guard, the standby stays open for read-only queries while redo apply runs in the background. That means your month-end reporting, heavy analytics, and even RMAN backups can run against the standby instead of hammering production. It's one of the few DR investments that pays dividends before disaster strikes.

Fast-Start Failover (FSFO) adds automation. A Data Guard Observer — a small witness process in a third network zone — watches both sites. If the primary vanishes, the Observer and standby coordinate a failover without waiting for someone to find the runbook at 2 AM.
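The Observer's job is easier to reason about as a decision rule. The sketch below is a deliberately simplified model, not the broker's real algorithm. It assumes failover needs both the Observer and the standby to have lost the primary for longer than a threshold, and that an asynchronous standby must also be within a lag limit. The two defaults are placeholders loosely modelled on the broker properties FastStartFailoverThreshold and FastStartFailoverLagLimit; check your own configuration for real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsfoView:
    observer_sees_primary: bool
    standby_sees_primary: bool
    seconds_since_primary_seen: float
    standby_lag_s: float
    synchronous_transport: bool


def should_fail_over(view: FsfoView, threshold_s: float = 30.0, lag_limit_s: float = 30.0) -> bool:
    """Simplified FSFO-style decision: fail over only when it is both necessary and safe."""
    # If either witness still reaches the primary, this is a partial network
    # problem, not a dead primary. Failing over now would risk split-brain.
    if view.observer_sees_primary or view.standby_sees_primary:
        return False
    # Ride out short blips before declaring the primary gone.
    if view.seconds_since_primary_seen < threshold_s:
        return False
    # With asynchronous transport, refuse to fail over to a standby that is too far behind.
    if not view.synchronous_transport and view.standby_lag_s > lag_limit_s:
        return False
    return True


if __name__ == "__main__":
    outage = FsfoView(False, False, seconds_since_primary_seen=45,
                      standby_lag_s=5, synchronous_transport=False)
    print("Fail over?", should_fail_over(outage))
```

The first check is why the Observer belongs in a third network zone: if it sits next to the primary, it shares the primary's fate and can never cast a useful vote.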

Figure 5 · Data Guard architecture — redo transport, Active DG workloads, and FSFO Observer

07 · How Do Oracle RAC, ASM, and Data Guard Work Together?

Think of ExaCC resilience like a building with three fire doors. Each one closes a different kind of breach. You don't pick one — you stack them.

Figure 6 · Each layer handles failures the layer below it cannot — together they cover node, cell, and site outages

Layer	Technology	Failure Scope	Typical RTO	Typical RPO
Compute	Oracle RAC + Application Continuity	Single node / NIC	Seconds	Zero
Storage	Oracle ASM (Failure Groups)	Disk / storage cell	None (online rebuild)	Zero
Site	Oracle Data Guard (+ FSFO)	Entire data center	Minutes – hours	Zero – near-zero

08 · How Should Enterprises Design Business Continuity on ExaCC?

There's no single "correct" HA/DR design. A hospital can't tolerate the same latency trade-offs as a factory floor ERP. Here are four patterns we see repeatedly — and why each one makes sense for that industry.

Banking and financial services — zero data loss, no excuses

What keeps the CIO awake: A lost transaction isn't a bug report — it's a regulatory incident.

Typical ExaCC design: ASM High Redundancy on the primary rack. Synchronous Data Guard to a second ExaCC rack nearby (under 5 ms round-trip) in Maximum Availability mode. A third asynchronous standby in a distant OCI region for true geographic DR. Application Continuity on every connection pool. FSFO with an Observer in a third network zone.

Healthcare — clinical systems can't stutter

What keeps the CIO awake: A nurse entering vitals at bedside can't get an ORA-03113 because a node rebooted.

Typical ExaCC design: Dual- or quad-node RAC with Application Continuity fully wired into EHR connection pools. Data Guard in Maximum Performance to a secondary site — clinical writes can't wait on synchronous round-trips. Active Data Guard on the standby handles analytics and audit reporting so production stays fast.

Global manufacturing — the line stops, money burns

What keeps the CIO awake: ERP downtime stops assembly lines. Every minute has a dollar figure attached.

Typical ExaCC design: Multi-node RAC with separate ASM disk groups for production data vs backup archives. Maximum Availability Data Guard to a secondary facility: commits stay synchronous while the link is healthy, and a regional network blip neither stalls the primary nor requires manual intervention.

Government and defense — sovereignty plus distance

What keeps the CIO awake: Data cannot leave sovereign boundaries, but it also cannot exist in only one physical location.

Typical ExaCC design: ExaCC in on-premises sovereign data centers. Data Guard over dedicated encrypted dark fiber. FSFO Observers in three separate administrative zones to prevent split-brain when networks get unreliable — which they will, under stress.

09 · Common Misconceptions About HA and DR

These come up in almost every architecture review. They're understandable — and they're wrong in ways that hurt.

Misconception 1

"Oracle RAC replaces the need for a Disaster Recovery site."

The reality: RAC protects against server crashes inside the same room. If the data center loses power, catches fire, or suffers a destructive cyber attack, all RAC nodes go offline simultaneously. RAC is HA — you still need Data Guard for DR.

Misconception 2

"ASM mirroring eliminates the need for database backups."

The reality: ASM mirroring protects against hardware failure, not human error. An accidental DROP TABLE ... PURGE or a bad bulk UPDATE is written to every mirror copy instantly. RMAN backups to OCI Object Storage or a Zero Data Loss Recovery Appliance remain mandatory.

Misconception 3

"High Availability means absolutely zero downtime under every scenario."

The reality: HA dramatically reduces downtime for common infrastructure faults. Major architecture changes, structural database modifications, or complex application upgrades may still require planned maintenance windows.

Misconception 4

"Disaster Recovery planning is only for mega-corporations."

The reality: Ransomware and infrastructure failures don't check revenue before striking. Small and mid-market ExaCC deployments need disciplined DR strategies just as much as global banks.

10 · Enterprise Best Practices for ExaCC Resilience

Wire up Application Continuity before you need it. I've seen perfect RAC failovers produce angry users because the JDBC URL pointed at a SID, not a service. Set FAILOVER_TYPE=TRANSACTION on the service and test it; don't assume.
Give redo transport its own network path. Nothing kills a standby faster than sharing the redo NIC with bulk ETL or backup traffic. Isolate it. Monitor lag daily, not quarterly.
Size the standby like you mean to run on it. A standby with half the OCPUs of primary will embarrass you during the one failover that actually matters.
Drill failover without breaking replication. Snapshot Standby temporarily opens your DR database read-write for testing while redo keeps arriving from the primary. Use it. A runbook nobody has executed is fiction.
Alert on lag before lag becomes a crisis. Fifteen minutes of transport lag is a warning. Four hours is a board-level conversation. Configure OEM or Cloud Guard alerts and act on them.
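The lag thresholds in the last item are easy to turn into a check. The sketch below assumes you already fetch the "transport lag" value from V$DATAGUARD_STATS, where it is reported as an interval string such as "+00 00:15:00" (days, then hours:minutes:seconds). The function names are hypothetical, and the thresholds are simply the fifteen-minute and four-hour figures from this list; tune them to your own RPO.

```python
import re

# Matches interval strings like "+00 00:15:00" (days hours:minutes:seconds).
_LAG_PATTERN = re.compile(r"^\+(\d+) (\d{1,2}):(\d{2}):(\d{2})$")


def lag_seconds(value: str) -> int:
    """Convert a '+DD HH:MI:SS' lag string into whole seconds."""
    match = _LAG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised lag value: {value!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def lag_severity(value: str, warn_s: int = 15 * 60, crit_s: int = 4 * 3600) -> str:
    """Classify a lag value as 'ok', 'warning', or 'critical'."""
    seconds = lag_seconds(value)
    if seconds >= crit_s:
        return "critical"
    if seconds >= warn_s:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for sample in ("+00 00:00:07", "+00 00:15:00", "+00 04:00:00"):
        print(sample, "->", lag_severity(sample))
```

Raising on an unparseable value is deliberate: a monitoring check that silently treats "no data" as "no lag" is worse than no check at all.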

11 · The Business Continuity Checklist

Print this. Walk your ExaCC deployment with a colleague. Be honest about the unchecked boxes — those are your real gaps.

Is Oracle RAC configured across multiple nodes? (Verifies compute-tier high availability.)
Are ASM Failure Groups explicitly mapped across distinct storage cell units? (Protects against storage server chassis failure.)
Is Oracle Data Guard configured with an explicit protection mode matching business RPO? (Verifies site-level DR readiness.)
Have you enabled and tested Application Continuity on application servers? (Ensures end-user transaction replay during node failures.)
Is a Data Guard Fast-Start Failover (FSFO) Observer deployed in a separate, third network zone? (Enables hands-free automated failover.)
Are RMAN database backups stored on a separate platform like OCI Object Storage or Zero Data Loss Recovery Appliance? (Protects against localized rack destruction.)
Has a full role-reversal (planned switchover) drill been performed within the last 6 months? (Validates operational readiness.)
Are application connection strings configured to automatically search for the active primary database service across both data centers? (Ensures application traffic follows the active instance.)
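For the last item on the checklist, the usual technique is a single Oracle Net connect descriptor that lists the SCAN address of each site, so the client retries across both until it finds the service. The Python helper below just assembles such a string. The host names, service name, and timeout values are placeholders, not recommendations, and the approach assumes the service is configured to run only on whichever database currently holds the primary role.

```python
def build_connect_descriptor(service_name: str,
                             scan_hosts: list[str],
                             port: int = 1521,
                             retry_count: int = 20,
                             retry_delay_s: int = 3,
                             connect_timeout_s: int = 90,
                             transport_connect_timeout_s: int = 3) -> str:
    """Build a connect descriptor with one ADDRESS_LIST per site (values are illustrative)."""
    if not scan_hosts:
        raise ValueError("at least one SCAN host is required")
    # One ADDRESS_LIST per site; the client walks them in order until one answers.
    address_lists = "".join(
        f"(ADDRESS_LIST=(LOAD_BALANCE=ON)(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port})))"
        for host in scan_hosts
    )
    return (
        "(DESCRIPTION="
        f"(CONNECT_TIMEOUT={connect_timeout_s})(RETRY_COUNT={retry_count})"
        f"(RETRY_DELAY={retry_delay_s})(TRANSPORT_CONNECT_TIMEOUT={transport_connect_timeout_s})"
        f"{address_lists}"
        # SERVICE_NAME, never SID: the service follows the primary role, a SID does not.
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


if __name__ == "__main__":
    # Hypothetical SCAN names for a primary and a DR site.
    print(build_connect_descriptor("sales_rw", ["exacc-dc1-scan", "exacc-dc2-scan"]))
```

The retry settings matter as much as the address lists: during a role change neither site offers the service for a short while, and the client has to keep trying rather than give up on the first refusal.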

12 · The Short Version — 8 Things Every Enterprise Should Know

HA and DR are distinct operational pillars. HA handles local failures within one data center; DR handles entire site losses across regions.
Oracle RAC handles compute node crashes. In an active-active cluster the surviving instances keep serving the database, while Application Continuity replays in-flight work for affected sessions.
Oracle ASM prevents data loss from drive and cell failures. Mirroring across Failure Groups and automatic rebalancing protect stored data without taking the database offline.
Oracle Data Guard is your site protection layer. It continuously ships redo across geographical boundaries to a standby database.
ExaCC combines these layers into one architecture. RAC, ASM, and Data Guard form a cohesive cloud-managed resilience stack.
RTO and RPO determine your budget and setup. Establish clear recovery metrics before choosing Maximum Availability or Maximum Performance.
Software settings matter as much as hardware. HA fails if application connection strings aren't set up to follow the service when it moves between nodes.
Resilience requires regular testing. A DR plan is only as good as your last successful failover test.

13 · Frequently Asked Questions

Does Oracle RAC protect my data if an entire Exadata rack loses power?

No. RAC only protects against individual compute node failures within that rack. If the entire rack loses power, all instances go offline. Deploy Oracle Data Guard to replicate data to a separate system at another location.

What is the difference between a Data Guard Switchover and Failover?

Switchover is planned — primary and standby swap roles with zero data loss, usually for maintenance. Failover is emergency-only — executed when the primary is unexpectedly lost or destroyed.

Can I use Active Data Guard to offload read-heavy reports on ExaCC?

Yes. Active Data Guard keeps the standby open in read-only mode while applying changes from the primary — a common strategy for offloading reporting and backup tasks without impacting production.

How does ExaCC handle a failure of an entire storage cell?

ASM redundancy absorbs it. With Normal (2-way) or High (3-way) Redundancy, ASM continues serving reads and writes from mirrored extents on remaining healthy cells without interrupting the database.

What happens to active transactions when a compute node crashes?

With Application Continuity configured, uncommitted in-flight transactions are replayed against a surviving node. Users see a brief pause — not a connection error.

Is Maximum Performance mode safe for financial databases?

Generally no. Asynchronous replication introduces minor data-loss risk if the primary is destroyed suddenly. For financial workloads, Maximum Availability with synchronous replication is the recommended choice.

How often should we test our DR failover plan?

Execute a full DR drill at least once or twice per year. Converting the standby to a Data Guard Snapshot Standby lets you test against a read-write copy while redo keeps shipping from production; the standby applies the accumulated redo once you convert it back.

Does ExaCC automate database patching without downtime?

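Largely, yes. ExaCC supports rolling patch updates: compute nodes are patched one at a time, and Clusterware relocates services so the remaining nodes keep handling production workloads. Sessions on the node being patched are drained or failed over, which is another reason to have Application Continuity configured.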

14 · Conclusion

No single technology saves you on ExaCC. RAC, ASM, and Data Guard each handle a different kind of failure — and none of them work if the application layer isn't configured to follow the database when it moves.

The architects who sleep well aren't the ones who bought the most redundancy. They're the ones who tested failover last quarter, know their redo lag right now, and can tell you their RTO in plain English without opening a slide deck.

The question that separates a working ExaCC deployment from one that survives a real outage is still the same one DBAs have always asked: how does this fail — and how do we recover?

Our Exadata Expert course covers ExaCC HA/DR design hands-on — RAC failover labs, ASM Failure Group mapping, Active Data Guard setup, and the failover drills most teams never get around to doing on their own.

ExaGuru — Oracle Cloud Training & Consulting

Exadata · ExaCC/ExaCS · OCI · Oracle DB Migration · Fusion ERP/HCM · Oracle Database 23ai & AI

Web: www.exaguru.com

Contact Us: +91-6394049607 · +91-9161111705
