Fabric Monitoring and proactive error detection

In previous releases, DMP handled failed paths reactively: it disabled a path only when active I/O to the storage failed on that path. Using the Storage Networking Industry Association (SNIA) HBA API library, the vxesd daemon can now receive SAN fabric events from the HBA. This information lets DMP act proactively by checking the devices that SAN events identify as suspect, even when there is no active I/O. New I/O is directed to healthy paths while the suspect devices are verified.

During startup, vxesd queries the HBA (through the SNIA library) to obtain the SAN topology. The vxesd daemon determines the Port World Wide Names (PWWNs) that correspond to each device path visible to the operating system. After obtaining the topology, vxesd registers with the HBA for SAN event notification. If LUNs are disconnected from the SAN, the HBA notifies vxesd of the event and specifies the affected PWWNs. The vxesd daemon correlates this event information with the stored topology to determine which device paths are affected.
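The correlation step amounts to an index from PWWN to device paths, built once at startup and looked up whenever an event arrives. The following is a minimal conceptual sketch, not the vxesd implementation. The `DevicePath` type, the `build_topology` and `correlate` helpers, the device names, and the WWN values are all hypothetical. The sketch also assumes that each path is described by the PWWNs of the ports it crosses, such as an HBA port and an array port.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class DevicePath:
    """One OS-visible path to a LUN (hypothetical model, not a vxesd structure)."""
    name: str               # OS device path name (illustrative)
    pwwns: Tuple[str, ...]  # PWWNs of the ports the path crosses (assumed: HBA port, array port)


def build_topology(paths: Iterable[DevicePath]) -> Dict[str, Set[str]]:
    """Startup step: index every device path under each PWWN it traverses."""
    topology: Dict[str, Set[str]] = {}
    for path in paths:
        for pwwn in path.pwwns:
            topology.setdefault(pwwn, set()).add(path.name)
    return topology


def correlate(topology: Dict[str, Set[str]], event_pwwns: Iterable[str]) -> Set[str]:
    """Event step: return every device path behind the PWWNs named in a SAN event."""
    affected: Set[str] = set()
    for pwwn in event_pwwns:
        affected |= topology.get(pwwn, set())  # unknown PWWNs affect nothing
    return affected


# Made-up WWNs: two HBA ports, two array ports.
HBA1, HBA2 = "10:00:00:00:c9:00:00:01", "10:00:00:00:c9:00:00:02"
ARRAY_A, ARRAY_B = "50:06:01:60:00:00:00:01", "50:06:01:60:00:00:00:02"

topology = build_topology([
    DevicePath("c2t0d1", (HBA1, ARRAY_A)),
    DevicePath("c2t1d1", (HBA1, ARRAY_B)),
    DevicePath("c3t0d1", (HBA2, ARRAY_A)),
])
print(sorted(correlate(topology, [ARRAY_A])))  # ['c2t0d1', 'c3t0d1']
```

Indexing by every port on a path means that an event naming either the array port or the HBA port returns the same kind of answer: the set of paths that must be treated as suspect.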

The vxesd daemon sends the set of affected paths to the vxconfigd daemon (DDL) so that those device paths can be marked as suspect. While a path is marked as suspect, DMP does not send new I/O to it unless it is the last path to the device. In the background, the DMP restore task checks the accessibility of the paths on its next periodic cycle using a SCSI inquiry probe. If the SCSI inquiry fails, DMP disables the path to the affected LUNs and records this in the event log.
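The path-state rules in the previous paragraph can be modelled as a small state machine. This is a hedged sketch only, not DMP code. `PathState`, `mark_suspect`, `paths_for_new_io`, and `restore_pass` are hypothetical names, and a plain callable stands in for the SCSI inquiry probe. The text does not say what happens to a suspect path whose probe succeeds. The sketch assumes that the suspect mark is simply cleared.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List


class PathState(Enum):
    ENABLED = "enabled"
    SUSPECT = "suspect"
    DISABLED = "disabled"


def mark_suspect(states: Dict[str, PathState], affected: Iterable[str]) -> None:
    """Model of the DDL update: flag healthy paths reported by vxesd as suspect."""
    for path in affected:
        if states.get(path) is PathState.ENABLED:
            states[path] = PathState.SUSPECT


def paths_for_new_io(states: Dict[str, PathState]) -> List[str]:
    """Prefer enabled paths; use suspect paths only when no enabled path remains."""
    enabled = [p for p, s in states.items() if s is PathState.ENABLED]
    if enabled:
        return enabled
    return [p for p, s in states.items() if s is PathState.SUSPECT]


def restore_pass(states: Dict[str, PathState],
                 scsi_inquiry: Callable[[str], bool],
                 event_log: List[str]) -> None:
    """One periodic restore cycle: probe each suspect path, disable and log failures."""
    for path, state in list(states.items()):
        if state is not PathState.SUSPECT:
            continue
        if scsi_inquiry(path):
            states[path] = PathState.ENABLED  # assumption: a passing probe clears the mark
        else:
            states[path] = PathState.DISABLED
            event_log.append(f"disabled path {path}: SCSI inquiry failed")


states = {"pathA": PathState.ENABLED, "pathB": PathState.ENABLED}
mark_suspect(states, {"pathA"})
print(paths_for_new_io(states))  # ['pathB']

event_log: List[str] = []
restore_pass(states, lambda path: False, event_log)  # pathA fails its inquiry
print(states["pathA"].value)     # disabled
print(event_log)                 # ['disabled path pathA: SCSI inquiry failed']

mark_suspect(states, {"pathB"})
print(paths_for_new_io(states))  # ['pathB'] (last usable path still carries I/O)
```

The fallback in `paths_for_new_io` models the "unless it is the last path" rule. Without it, marking the only remaining path as suspect would block I/O to the device before the probe had confirmed a failure.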

If the LUNs are later reconnected, the HBA notifies vxesd of the SAN event. On its next test cycle, the DMP restore task probes the disabled paths with a SCSI inquiry and re-enables those that respond.

Note:

If vxesd receives an HBA LINK UP event, the DMP restore task is restarted and runs its SCSI probes immediately, without waiting for the next periodic cycle. Restarting the restore task also starts a new periodic cycle. Any disabled paths that are still inaccessible at the time of this first probe are retested on the next cycle (300 seconds later, by default).
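The scheduling behaviour described above, with periodic re-enable checks and an immediate restart on LINK UP, can be sketched as a tiny simulation. This is a conceptual model under stated assumptions, not the DMP restore task itself. The `RestoreTask` class and its `tick` and `on_link_up` methods are hypothetical. Time is a plain number of seconds, and `probe` is a callable standing in for the SCSI inquiry. The 300-second interval is the default stated in the note.

```python
from typing import Callable, Set

RESTORE_INTERVAL = 300  # seconds; default cycle length stated in the text


class RestoreTask:
    """Hypothetical model of the restore task's scheduling (not the real DMP code)."""

    def __init__(self, probe: Callable[[str], bool], interval: int = RESTORE_INTERVAL):
        self.probe = probe                # stands in for the SCSI inquiry
        self.interval = interval
        self.next_run = interval          # first periodic cycle, counted from t = 0
        self.disabled: Set[str] = set()

    def run_cycle(self, now: int) -> Set[str]:
        """Probe every disabled path, re-enable responders, schedule the next cycle."""
        recovered = {p for p in self.disabled if self.probe(p)}
        self.disabled -= recovered
        self.next_run = now + self.interval
        return recovered

    def tick(self, now: int) -> Set[str]:
        """Periodic behaviour: run a cycle only when one is due."""
        if now >= self.next_run:
            return self.run_cycle(now)
        return set()

    def on_link_up(self, now: int) -> Set[str]:
        """LINK UP: restart the task, probe immediately, and start a new cycle."""
        return self.run_cycle(now)


reachable: Set[str] = set()
task = RestoreTask(probe=lambda path: path in reachable)
task.disabled = {"pathA", "pathB"}

reachable.add("pathA")                 # only pathA is back when LINK UP arrives
first = task.on_link_up(now=100)       # {'pathA'}
next_due = task.next_run               # 400: a new cycle started at t = 100

reachable.add("pathB")                 # pathB comes back later
early = task.tick(now=250)             # set(): not due yet, so no probe runs
later = task.tick(now=400)             # {'pathB'}: picked up on the next cycle
```

The example shows why a path that recovers shortly after the LINK UP probe can stay disabled for up to one full interval: it is not retested until the new cycle comes due.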

Fabric Monitoring is enabled by default. It is controlled by the dmp_monitor_fabric tunable, whose value persists across reboots.

To disable the Fabric Monitoring functionality, use the following command:

# vxdmpadm settune dmp_monitor_fabric=off

To enable the Fabric Monitoring functionality, use the following command:

# vxdmpadm settune dmp_monitor_fabric=on

To display the current value of the dmp_monitor_fabric tunable, use the following command:

# vxdmpadm gettune dmp_monitor_fabric