When a system panics

There are several instances in which GAB will intentionally panic a system, including if it detects an internal protocol error or discovers an LLT node-ID conflict. Under severe load conditions, if VCS cannot heartbeat with GAB, it may induce a panic or a VCS restart.

This section describes the scenarios in which GAB panics a system.

Client process failure

If a client process fails to heartbeat to GAB, the process is killed. If the process hangs or cannot be killed, GAB halts the system. This could also occur if the node is overloaded and HAD cannot run on the node.

If the -k option is used in the gabconfig command, GAB tries to kill the client process until successful, which may have an impact on the entire cluster. If the -b option is used in gabconfig, GAB does not try to kill the client process. Instead, it panics the system when the client process fails to heartbeat. This option cannot be turned off once set.

HAD heartbeats with GAB at regular intervals. The heartbeat timeout is specified by HAD when it registers with GAB; the default is 15 seconds. If HAD gets stuck within the kernel and cannot heartbeat with GAB within the specified timeout, GAB tries to kill HAD by sending a SIGABRT signal. If it does not succeed, GAB sends a SIGKILL and closes the port. By default, GAB tries to kill HAD five times before closing the port. The number of times GAB tries to kill HAD is a kernel tunable parameter, gab_kill_ntries, and is configurable. The minimum value for this tunable is 3 and the maximum is 10.

This is an indication to other nodes that HAD on this node has been killed. Should HAD recover from its stuck state, it first processes pending signals. Here it will receive the SIGKILL first and get killed.

After sending a SIGKILL, GAB waits for a specific amount of time for HAD to get killed. If HAD survives beyond this time limit, GAB panics the system. This time limit is a kernel tunable parameter, gab_isolate_time, and is configurable. The minimum value for this timer is 16 seconds and the maximum is 4 minutes.
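The escalation described above can be sketched as a small model. This is only an illustration of the sequence as this section describes it, not GAB's implementation: the function names check_tunables and escalate are hypothetical, the assumption that every retry is a SIGABRT followed by a single SIGKILL is one reading of the text, and isolate_time_s must be supplied by the caller because its default is not stated here.

```python
from typing import List, Optional

# Bounds quoted in this section. The constant names are hypothetical.
KILL_NTRIES_RANGE = (3, 10)        # gab_kill_ntries: minimum 3, maximum 10
ISOLATE_TIME_RANGE_S = (16, 240)   # gab_isolate_time: 16 seconds to 4 minutes


def check_tunables(kill_ntries: int, isolate_time_s: int) -> None:
    """Raise ValueError if a value falls outside the documented range."""
    lo, hi = KILL_NTRIES_RANGE
    if not lo <= kill_ntries <= hi:
        raise ValueError(f"gab_kill_ntries must be {lo}..{hi}, got {kill_ntries}")
    lo, hi = ISOLATE_TIME_RANGE_S
    if not lo <= isolate_time_s <= hi:
        raise ValueError(f"gab_isolate_time must be {lo}..{hi} s, got {isolate_time_s}")


def escalate(abort_succeeds_on: Optional[int],
             had_exit_delay_s: Optional[float],
             isolate_time_s: int,
             kill_ntries: int = 5) -> List[str]:
    """Return the ordered actions taken after HAD misses its heartbeat timeout.

    abort_succeeds_on: 1-based SIGABRT attempt that kills HAD, or None if none does.
    had_exit_delay_s:  seconds HAD takes to die after SIGKILL, or None if it never dies.
    """
    check_tunables(kill_ntries, isolate_time_s)
    actions: List[str] = []
    for attempt in range(1, kill_ntries + 1):
        actions.append(f"SIGABRT #{attempt}")
        if abort_succeeds_on == attempt:
            return actions            # HAD was killed; no further escalation
    # All SIGABRT attempts failed: send SIGKILL and close the port.
    actions += ["SIGKILL", "close port"]
    # HAD surviving past gab_isolate_time leads to a system panic.
    if had_exit_delay_s is None or had_exit_delay_s > isolate_time_s:
        actions.append("panic")
    return actions
```

For example, escalate(None, None, 16) ends in "panic", while escalate(2, None, 120) stops after the second SIGABRT.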

Registration monitoring

The registration monitoring feature lets you configure GAB behavior when HAD is killed and does not reconnect after a specified time interval.

This scenario may occur in the following situations:

The system is very busy and the hashadow process cannot restart HAD.
The HAD and hashadow processes were killed by user intervention.
The hashadow process restarted HAD, but HAD could not register.
A hardware/DIMM failure caused termination of the HAD and hashadow processes.
Any other situation where the HAD and hashadow processes are not running.

When this occurs, the registration monitoring timer starts. GAB takes action if HAD does not register within the time defined by the VCS_GAB_RMTIMEOUT parameter, which is defined in the vcsenv file. The default value for VCS_GAB_RMTIMEOUT is 200 seconds.

When HAD cannot register after the specified time period, GAB logs a message every 15 seconds saying it will panic the system.

You can control GAB behavior in this situation by setting the VCS_GAB_RMACTION environment variable.

To configure GAB to panic the system in this situation, set:
VCS_GAB_RMACTION=panic

In this configuration, killing the HAD and hashadow processes results in a panic unless you start HAD within the registration monitoring timeout interval.

To configure GAB to log a message in this situation, set:
VCS_GAB_RMACTION=SYSLOG

The default value of this parameter is SYSLOG, which configures GAB to log a message when HAD does not reconnect after the specified time interval.

In this scenario, you can choose to restart HAD (using hastart) or unconfigure GAB (using gabconfig -U).

When you enable registration monitoring, GAB takes no action if the HAD process unregisters with GAB normally, that is, if you stop HAD using the hastop command.
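The registration monitoring rules can be summarized as a decision function. This is a minimal sketch under stated assumptions: the function rm_decision and its return labels are hypothetical, values are compared exactly as written in this section (panic and SYSLOG), and treating the exact timeout instant as expired is a choice made for the example.

```python
def rm_decision(seconds_since_had_exit: float,
                unregistered_normally: bool,
                rm_action: str = "SYSLOG",   # VCS_GAB_RMACTION default
                rm_timeout_s: int = 200      # VCS_GAB_RMTIMEOUT default
                ) -> str:
    """Return what GAB does about an unregistered HAD, per the rules above."""
    if rm_action not in ("panic", "SYSLOG"):
        raise ValueError(f"unexpected VCS_GAB_RMACTION value: {rm_action!r}")
    if unregistered_normally:
        return "none"        # HAD was stopped with hastop: no action
    if seconds_since_had_exit < rm_timeout_s:
        return "wait"        # registration monitoring timer still running
    if rm_action == "panic":
        return "panic"       # GAB panics the system
    return "syslog"          # GAB only logs a message


# HAD killed 250 seconds ago with the default settings: only a log message.
print(rm_decision(250, unregistered_normally=False))
```

With rm_action="panic" the same call returns "panic", which matches the warning that killing HAD and hashadow leads to a panic unless HAD is started within the timeout.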

Network failure

If a network partition occurs, a cluster can split into two or more separate sub-clusters. When two sub-clusters join as one, VCS designates one system to be ejected. GAB prints diagnostic messages and sends iofence messages to the system being ejected. The system receiving the iofence messages tries to kill the client process. The -k option applies here. If the -j option is used in gabconfig, the system is halted when the iofence message is received.

GAB panics the system on receiving an iofence message on a kernel client port; it tries to kill the client process only for clients running in user land. The -k option does not work for clients running in the kernel.
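The handling of an iofence message described above reduces to a short decision. The sketch below is illustrative only: on_iofence and its parameters are hypothetical names, and checking the kernel-client case before the -j setting is an assumption, since this section does not state an order.

```python
def on_iofence(client_in_kernel: bool, halt_on_iofence: bool) -> str:
    """Return the action taken by a system that receives an iofence message.

    client_in_kernel: True if the message arrives on a kernel client port.
    halt_on_iofence:  True if gabconfig was run with the -j option.
    """
    if client_in_kernel:
        return "panic"        # kernel clients cannot be killed; -k has no effect
    if halt_on_iofence:
        return "halt"         # -j: halt the system when the message is received
    return "kill client"      # user-land client: try to kill the process


# A user-land client such as HAD, without -j set:
print(on_iofence(client_in_kernel=False, halt_on_iofence=False))
```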

Quick reopen

If a system leaves the cluster and tries to rejoin before the new cluster is configured (the default is five seconds), the system is sent an iofence message with the reason set to "quick reopen". When the system receives the message, it tries to kill the client process.
