The components of membership arbitration are the fencing module and the coordination points.
Each system in the cluster runs a kernel module called vxfen, or the fencing module. This module is responsible for ensuring valid and current cluster membership on a membership change through the process of membership arbitration: vxfen registers the system with the coordination points at startup and races for control of the coordination points when the membership changes.
Coordination points provide a lock mechanism to determine which nodes get to fence off data drives from other nodes. A node must eject a peer from the coordination points before it can fence the peer from the data drives. Racing for control of the coordination points to fence data disks is the key to understanding how fencing prevents split brain.
The coordination points can be disks, servers, or both. Typically, a cluster must have three coordination points.
Disks that act as coordination points are called coordinator disks. Coordinator disks are three standard disks or LUNs set aside for I/O fencing during cluster reconfiguration. Coordinator disks do not serve any other storage purpose in the VCS configuration.
You can configure I/O fencing to use either the DMP devices or the underlying raw character devices. The Veritas Volume Manager Dynamic Multipathing (DMP) feature allows coordinator disks to take advantage of the path failover and dynamic addition and removal capabilities of DMP. Depending on the disk device type that you use, you must define the I/O fencing SCSI-3 disk policy as either raw or dmp. The disk policy is dmp by default.
See the Veritas Volume Manager Administrator's Guide for details on the DMP feature.
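For illustration, the disk policy is set in the /etc/vxfenmode file on each node. The following two-line excerpt is a sketch of one common configuration (SCSI-3 fencing with the DMP policy); treat the exact file contents as an assumption and confirm them against your product documentation:

    vxfen_mode=scsi3
    scsi3_disk_policy=dmp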
The coordination point server (CP server) is a software solution that runs on a remote system or cluster. The CP server provides arbitration functionality by allowing the VCS cluster nodes to register themselves, check which other nodes are registered, unregister themselves, and forcefully unregister (preempt) other nodes; a sketch of these operations appears below.
Note: With the CP server, the fencing arbitration logic still remains on the VCS cluster.
Multiple VCS clusters running different operating systems can simultaneously access the CP server. The CP server and the VCS clusters communicate over TCP/IP.
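As a purely hypothetical illustration of these arbitration tasks, the following Python sketch models a node-side client that registers, unregisters, and preempts over a TCP/IP connection. The class name, method names, and request strings are invented for this sketch; they are not the CP server's actual wire protocol or API.

    import socket

    class CPServerClient:
        """Toy client for CP server-style arbitration (illustrative only)."""

        def __init__(self, host: str, port: int):
            self.addr = (host, port)  # the CP server is reached over TCP/IP

        def _send(self, request: str) -> str:
            # One request/response exchange per connection (an assumption).
            with socket.create_connection(self.addr) as s:
                s.sendall(request.encode())
                return s.recv(4096).decode()

        def register(self, key: str) -> str:
            return self._send(f"REGISTER {key}")    # join the membership

        def list_members(self) -> str:
            return self._send("LIST")               # check who is registered

        def unregister(self, key: str) -> str:
            return self._send(f"UNREGISTER {key}")  # leave cleanly

        def preempt(self, my_key: str, victim_key: str) -> str:
            return self._send(f"PREEMPT {my_key} {victim_key}")  # eject a peer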
The fencing module starts up as follows:
The coordinator disks are placed in a disk group. This allows the fencing startup script to use Veritas Volume Manager (VxVM) commands to easily determine which disks are coordinator disks and what paths exist to those disks. This disk group is never imported and is not used for any other purpose.
The startup script then generates the /etc/vxfentab file, with one line for each path to each coordinator disk. For example, if the user has configured three coordinator disks with two paths to each disk, the /etc/vxfentab file contains six individual lines, one per path, such as:
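The following listing is illustrative only; actual device names depend on the operating system and the storage configuration (Solaris-style raw device paths are assumed here):

    /dev/rdsk/c1t1d0s2
    /dev/rdsk/c2t1d0s2
    /dev/rdsk/c1t1d1s2
    /dev/rdsk/c2t1d1s2
    /dev/rdsk/c1t1d2s2
    /dev/rdsk/c2t1d2s2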
The fencing driver examines GAB port B for membership information. If no other systems are up and running, it is the first system up, and its coordinator disk configuration is considered the correct one. When a new member joins, it requests the coordinator disk configuration. The system with the lowest LLT ID responds with the list of coordinator disk serial numbers. If the lists match, the new member joins the cluster. If they do not match, vxfen enters an error state and the new member is not allowed to join. This process ensures that all systems communicate with the same coordinator disks.
The fencing driver then checks for a possible preexisting split brain. It does this by verifying that any system that has keys on the coordinator disks can also be seen in the current GAB membership. If this verification fails, the fencing driver prints a warning to the console and system log and does not start.
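A minimal Python sketch of these two startup checks, assuming the coordinator disk serial numbers, the systems holding keys, and the GAB membership are already available as plain collections (function and variable names are invented for illustration):

    def config_matches(my_serials, cluster_serials):
        # A joining node may proceed only if it sees exactly the same
        # coordinator disks as the system with the lowest LLT ID reports.
        return sorted(my_serials) == sorted(cluster_serials)

    def no_preexisting_split_brain(systems_with_keys, gab_membership):
        # Every system holding keys on the coordinator disks must also be
        # visible in the current GAB membership; otherwise vxfen refuses
        # to start.
        return set(systems_with_keys) <= set(gab_membership)

    # Example: node 2 holds keys but is not in the current membership,
    # which indicates a possible preexisting split brain.
    print(config_matches(["S1", "S2", "S3"], ["S3", "S2", "S1"]))  # True
    print(no_preexisting_split_brain({0, 1, 2}, {0, 1}))           # False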
Topology of coordinator disks in the cluster
Upon startup of the cluster, all systems register a unique key on the coordinator disks. The key is unique to the cluster and the node, and is based on the LLT cluster ID and the LLT system ID.
See About the I/O fencing registration key format
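As a hypothetical illustration of how such a key could be composed from the LLT cluster ID and LLT system ID (the real format is defined in the section referenced above), consider:

    def make_registration_key(cluster_id: int, node_id: int) -> bytes:
        # SCSI-3 persistent reservation keys are 8 bytes. The "VF" prefix
        # and the hex layout below are assumptions for this sketch only.
        return f"VF{cluster_id:04X}{node_id:02X}".encode("ascii")

    # Two nodes in the same cluster get keys that share the cluster part
    # but differ in the node part.
    print(make_registration_key(0xDEED, 0))  # b'VFDEED00'
    print(make_registration_key(0xDEED, 1))  # b'VFDEED01'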
When there is a perceived change in membership, membership arbitration works as follows:
The preempt and abort command allows only a registered system with a valid key to eject the key of another system. This ensures that even when multiple systems attempt to eject each other, each race has only one winner. The first system to issue a preempt and abort command wins and ejects the key of the other system. When the second system issues a preempt and abort command, it cannot perform the key eject because it is no longer a registered system with a valid key.
Each system repeats this race for all of the coordinator disks. The race is won by, and control is attained by, the system that ejects the other system's registration keys from a majority of the coordinator disks.
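The following minimal Python sketch models the race, assuming a toy preempt-and-abort primitive with the SCSI-3 semantics described above (only a system whose key is still registered can eject another system's key); the class and function names are invented for illustration:

    class CoordinatorDisk:
        """Toy stand-in for a coordinator disk's key registrations."""

        def __init__(self, keys):
            self.keys = set(keys)              # currently registered keys

        def preempt_and_abort(self, my_key, victim_key):
            if my_key not in self.keys:        # loser: already ejected
                return False
            self.keys.discard(victim_key)      # winner ejects the victim
            return True

    def race(my_key, peer_key, disks):
        # Control is attained by ejecting the peer's registration from a
        # majority of the coordinator disks.
        ejected = sum(d.preempt_and_abort(my_key, peer_key) for d in disks)
        return ejected > len(disks) // 2

    disks = [CoordinatorDisk({"A", "B"}) for _ in range(3)]
    print(race("A", "B", disks))  # True: system A wins on all three disks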
Note: Forcing a manual seed at this point allows the cluster to seed. However, when the fencing module checks the GAB membership against the systems that have keys on the coordinator disks, a mismatch occurs. vxfen detects a possible split brain condition, prints a warning, and does not start. In turn, HAD does not start. Administrative intervention is required.