How VCS campus clusters work

This section describes how VCS works with VxVM to provide high availability in a campus cluster environment.

In a campus cluster setup, VxVM automatically mirrors volumes across sites. To enhance read performance, VxVM reads from the plexes at the local site where the application is running. VxVM writes to the plexes at both sites.
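
This behavior relies on the disk group being configured for site awareness and site consistency. The following is a minimal sketch of that configuration, assuming two sites named siteA and siteB and a disk group named campusdg (all names are illustrative); disk site tagging and the complete procedure are described in the Veritas Volume Manager Administrator's Guide.

# Register this node's site name with VxVM (run on every node, using siteA or siteB as appropriate)
vxdctl set site=siteA

# Add both sites to the disk group and turn on site consistency so that volumes
# are mirrored across sites and a failed site is detached as a unit
vxdg -g campusdg addsite siteA
vxdg -g campusdg addsite siteB
vxdg -g campusdg set siteconsistent=on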

In the event of a storage failure at a site, VxVM detaches all the disks at the failed site from the disk group to maintain data consistency. When the failed storage comes back online, VxVM automatically reattaches the site to the disk group and recovers the plexes.

See Veritas Volume Manager Administrator's Guide for more information.
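
If you need to reattach a site manually (for example, if automatic reattachment did not occur), you can do so and watch the resynchronization. A minimal sketch, assuming the illustrative disk group campusdg and site siteA used above:

# Reattach the previously failed site to the disk group
vxdg -g campusdg reattachsite siteA

# Monitor the plex resynchronization tasks
vxtask list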

When service group or system faults occur, VCS fails over the service groups or the nodes based on the values you set for the service group attributes SystemZones and AutoFailOver.

See Service group attributes

For a campus cluster setup, you must define the SystemZones attribute so that the nodes at each site are grouped together. Depending on the value of the AutoFailOver attribute, VCS failover behavior is as follows:

  • AutoFailOver = 0: VCS does not fail over the service group or the node.

  • AutoFailOver = 1: VCS fails over the service group to another suitable node. VCS chooses to fail over the service group within the same site before choosing a node in the other site. By default, the AutoFailOver attribute value is set to 1.

  • AutoFailOver = 2: VCS fails over the service group if another suitable node exists in the same site. Otherwise, VCS waits for administrator intervention to initiate the service group failover to a suitable node in the other site. This configuration requires the HA/DR license enabled. Symantec recommends that you set the value of the AutoFailOver attribute to 2.

A sample definition of these service group attributes in the VCS main.cf file is as follows:

group oragroup1 (
    SystemList = { node1=0, node2=1, node3=2, node4=3 }
    SystemZones = { node1=0, node2=0, node3=1, node4=1 }
    AutoFailOver = 2
    ...
    )
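
You can also set these attributes on a running cluster from the command line. A minimal sketch, assuming the group and node names from the sample above:

# Make the cluster configuration writable
haconf -makerw

# Group node1 and node2 into zone 0 (one site) and node3 and node4 into zone 1 (the other site)
hagrp -modify oragroup1 SystemZones node1 0 node2 0 node3 1 node4 1

# Require administrator intervention for failover to the other site
hagrp -modify oragroup1 AutoFailOver 2

# Save the configuration and make it read-only again
haconf -dump -makero

# Verify the zone assignments
hagrp -display oragroup1 -attribute SystemZones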

Failure scenarios in campus cluster lists the possible failure scenarios and how a VCS campus cluster recovers from these failures.

Failure scenarios in campus cluster

Node failure 

  • A node in a site fails.

    If the value of the AutoFailOver attribute is set to 1, VCS fails over the Oracle service group to another system within the same site that is defined in the SystemZones attribute.

  • All nodes in a site fail.

    If the value of the AutoFailOver attribute is set to 1, VCS fails over the Oracle service group to a system in the other site that is defined in the SystemZones attribute.

    If the value of the AutoFailOver attribute is set to 2, VCS requires administrator intervention to initiate the Oracle service group failover to a system in the other site.

If the value of the AutoFailOver attribute is set to 0, VCS requires administrator intervention to initiate a failover in both node failure cases.
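
In the cases above that require administrator intervention, you typically initiate the failover manually. A minimal sketch, assuming the oragroup1 service group from the earlier sample and a surviving node named node3 at the other site:

# Clear the fault on the service group if it is marked FAULTED on the failed nodes
hagrp -clear oragroup1

# Bring the service group online on a node at the surviving site
hagrp -online oragroup1 -sys node3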

Application failure 

The behavior is similar to that of a node failure.

Storage failure - one or more disks at a site fail

VCS does not fail over the service group when such a storage failure occurs. 

VxVM detaches the site from the disk group if any volume in that disk group does not have at least one valid plex at the site where the disks failed.

VxVM does not detach the site from the disk group in the following cases:

  • None of the plexes are configured on the failed disks.
  • Some of the plexes are configured on the failed disks, and at least one plex for a volume survives at each site.

If only some of the disks that failed come online and if the vxrelocd daemon is running, VxVM relocates the remaining failed disks to any available disks. Then, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes.

If all the disks that failed come online, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes.
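
To check whether a site has been detached and to follow the recovery, you can inspect the VxVM object states and the active recovery tasks. A minimal sketch, again assuming the illustrative disk group campusdg:

# Display volume, plex, and subdisk records and their states
vxprint -g campusdg -ht

# List the resynchronization tasks that run after the site is reattached
vxtask list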

Storage failure - all disks at both sites fail 

VCS acts based on the DiskGroup agent's PanicSystemOnDGLoss attribute value. 

See Veritas Bundled Agents Reference Guide for more information. 
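
A minimal sketch of checking this attribute on the DiskGroup resource, assuming a resource named oradg_res (the resource name is illustrative; the valid values and their effects are described in that guide):

# Display the current PanicSystemOnDGLoss setting for the DiskGroup resource
hares -display oradg_res -attribute PanicSystemOnDGLoss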

Site failure 

All nodes and storage at a site fail. 

Depending on the value of the AutoFailOver attribute, VCS fails over the Oracle service group as follows: 

  • If the value is set to 1, VCS fails over the Oracle service group to a system in the other site that is defined in the SystemZones attribute.
  • If the value is set to 2, VCS requires administrator intervention to initiate the Oracle service group failover to a system in the other site.

Because the storage at the failed site is inaccessible, VCS imports the disk group in the application service group with all devices at the failed site marked as NODEVICE. 

When the storage at the failed site comes online, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes. 

Network failure (LLT interconnect failure) 

Nodes at each site lose connectivity to the nodes at the other site.

The failure of the private interconnects between the nodes can result in a split brain scenario and cause data corruption.

Review the details on other possible causes of split brain and how I/O fencing protects shared data from corruption. 

See About data protection 

Symantec recommends that you configure I/O fencing to prevent data corruption in campus clusters. 

See About I/O fencing in campus clusters 
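
A minimal sketch of verifying that I/O fencing is configured and running on a node; the commands only display status and assume that fencing has already been set up as described in the referenced section:

# Display the I/O fencing mode and the registered cluster nodes
vxfenadm -d

# Confirm that the fencing module (GAB port b) has membership on all nodes
gabconfig -a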

Network failure (LLT and storage interconnect failure) 

Nodes at each site lose connectivity to the storage and to the nodes at the other site.

Symantec recommends that you configure I/O fencing to prevent split brain and serial split brain conditions. 

  • If I/O fencing is configured:

    The site that loses the race commits suicide.

    See About I/O fencing in campus clusters

    When you restore the network connectivity, VxVM detects the storage at the failed site, reattaches the site to the disk group, and resynchronizes the plexes to recover the volumes.

  • If I/O fencing is not configured:

    If the application service group was online at site A during such a failure, the application service group remains online at the same site. Because the storage is inaccessible, VxVM detaches the disks at the failed site from the disk group. At site B, where the application service group is offline, VCS brings the application service group online and imports the disk group with all devices at site A marked as NODEVICE. So, the application service group is online at both sites and each site uses the local storage. This causes inconsistent data copies and leads to a site-wide split brain.

    When you restore the network connectivity between sites, a serial split brain may exist.

    See Veritas Volume Manager Administrator's Guide for details on how to recover from a serial split brain condition.
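
The recovery procedure in that guide identifies the disks whose configuration copies conflict and then reimports the disk group from a chosen copy. A rough sketch of the commands involved, assuming the illustrative disk group campusdg; the procedure for choosing a configuration copy is in the guide:

# Show the groups of disks whose configuration copies disagree after the serial split brain
vxsplitlines -g campusdg

# Import the disk group using the configuration copy from a chosen disk
# (replace <diskid> with a disk ID reported by vxsplitlines)
vxdg -o selectcp=<diskid> import campusdg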