README VERSION : 1.1
README CREATION DATE : 2012-09-20
PATCH-ID : 5.1.133.000
PATCH NAME : VRTSvcs 5.1 SP1RP3
BASE PACKAGE NAME : Veritas Cluster Server by Symantec
BASE PACKAGE VERSION : VRTSvcs 5.1SP1
SUPERSEDED PATCHES : NONE
REQUIRED PATCHES : NONE
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV : rhel5_x86_64,rhel6_x86_64,sles10_x86_64,sles11_x86_64
(P-PLATFORM , A-ARCHITECTURE , D-DISTRIBUTION , V-VERSION)
PATCH CATEGORY : CORE , CORRUPTION , PANIC
PATCH CRITICALITY : OPTIONAL
HAS KERNEL COMPONENT : NO
ID : NONE
REBOOT REQUIRED : NO

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Please refer to the release notes for installation instructions.

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
Please refer to the release notes for uninstallation instructions.

SPECIAL INSTALL INSTRUCTIONS:
-----------------------------
NONE

SUMMARY OF FIXED ISSUES:
-----------------------------------------
2411882 (2561722) The imf_register entry point failure count gets incremented even when the imf_unregister entry point fails.
2576778 (2896402) Resource unregistration executes with a wrong state when the hagrp -online/-offline or hares -online/-offline command is run.
2684821 (2684818) If a purely local attribute such as PreOnline is specified before SystemList in main.cf, it is rejected when HAD is started.
2692170 (2692173) The child service group can be online on the same node as the parent group when -nopre is used for an online remote firm dependency.
2722774 (2660011) Restart of an agent moves a critical resource, and hence the group, to the FAULTED state even if the value of the ManageFaults attribute is set to NONE at the service group level.
2817929 (2729816) Service group failover fails because ToQ is not cleared when OnlineRetryLimit is larger than 0.
2817933 (2729867) A global group does not fail over to the remote site after HAD is killed and the primary site node crashes.
2818046 (2558988) CurrentLimits is not updated when a node faults.
2818234 (2746802) The VCS engine should not clear MigrateQ and TargetCount when a failover service group is probed on a system.
2818373 (2732228) VCS is unable to shut down with the init script.
2818375 (2741299) CmdSlave dumped core with SIGSEGV.
2818535 (2199924) hacf dies while dumping the VCS configuration because some attributes contain strings longer than 4 KB.
2837027 (2647049) On SLES11, VCS logs do not reflect time zone changes.
2837278 (2832754) When a Global Cluster Option (GCO) is configured across clusters having duplicate system names, the command-line utility hagrp gives incorrect output with the "-clear", "-flush", and "-state" options.
2848182 (2746816) Remove the syslog() call from the SIGALRM handler.
2860833 (2735410) The High Availability Daemon (HAD) dumps core and gets restarted.
2909173 (2909184) Failure messages of resource unregistration with IMF appear in agent/engine logs after performing online/offline operations on the resource.

SUMMARY OF KNOWN ISSUES:
-----------------------------------------

KNOWN ISSUES :
--------------

FIXED INCIDENTS:
----------------

PATCH ID:5.1.133.000

* INCIDENT NO:2411882 TRACKING ID:2561722

SYMPTOM: The agent does not retry resource registration with IMF up to the value specified in the RegisterRetryLimit key of the IMF attributes.

DESCRIPTION: The agent maintains an internal count for each resource to track the number of times resource registration with IMF has been attempted. This count is incremented whenever an attempt to register a resource with IMF fails. At present, any failure in unregistering a resource with IMF also increments this internal count. This means that if resource unregistration fails, the agent may not subsequently retry resource registration with IMF as many times as specified in RegisterRetryLimit. This issue is also observed for multiple resources.
RESOLUTION: Symantec has modified the agent framework library so that resource registration is retried as many times as specified in RegisterRetryLimit, even if registration fails frequently.

* INCIDENT NO:2576778 TRACKING ID:2896402

SYMPTOM: Failure messages of resource unregistration with IMF appear in agent/engine logs after performing online/offline operations on the resource.

DESCRIPTION: When a resource is registered with IMF for monitoring, any online/offline operation triggers unregistration of the resource from IMF. During such operations, the agent may log an error message in the agent/engine logs stating that unregistration was attempted with a wrong state. This issue is also observed for multiple resources.

RESOLUTION: Symantec has modified the agent framework library so that these wrong-state unregistration error messages are no longer logged.

* INCIDENT NO:2684821 TRACKING ID:2684818

SYMPTOM: Veritas Cluster Server (VCS) may fail to implement the configured values of attributes such as PreOnline, ContainerInfo, TriggersEnabled, and AutoStartList.

DESCRIPTION: This behavior occurs if, in the main.cf file, you specify the value of an attribute before you specify the value of the SystemList attribute. When VCS starts, the hacf utility reads the main.cf file and generates a list of modify commands to load the configured values of various attributes. The sequence of the modify commands is the same as the sequence in which you specify the attributes in the main.cf file. For example, if you specify the PreOnline attribute before the SystemList attribute, the modify command for PreOnline precedes the modify command for SystemList. The modify command for PreOnline fails, and VCS therefore fails to implement the value of PreOnline. VCS logs an error in the hacf-err_A.log log.

RESOLUTION: Symantec has modified the hacf utility to execute the modify command for the SystemList attribute before the modify commands for the other attributes in the main.cf file.
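For illustration, a minimal main.cf fragment of the kind affected by incident 2684821 (the group, node, and attribute values here are hypothetical). Before this fix, placing PreOnline above SystemList caused the generated modify command for PreOnline to run first and fail; with the fix, hacf always executes the SystemList modify command first, so the ordering below is harmless:

```
group testgrp (
        PreOnline = 1                           // specified before SystemList
        SystemList = { node1 = 0, node2 = 1 }
        AutoStartList = { node1 }
        )
```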
* INCIDENT NO:2692170 TRACKING ID:2692173

SYMPTOM: Even if you set an "online remote" dependency between two service groups, the service groups come online on the same node.

DESCRIPTION: This behavior occurs if you use the 'hagrp -online -nopre -sys <system>' command to bring a child service group online on the node where its parent service group is online.

RESOLUTION: Symantec has modified VCS to reject the 'hagrp -online -nopre -sys' command until VCS executes the preonline trigger for the child group.

* INCIDENT NO:2722774 TRACKING ID:2660011

SYMPTOM: A resource moves to the FAULTED state even if the value of the ManageFaults attribute is set to NONE at the service group level. This causes the service group to fault if the resource is Critical.

DESCRIPTION: When the monitor entry point of an ONLINE resource reports the state as OFFLINE in a service group that has the ManageFaults attribute set to NONE, the resource must move to the ONLINE|ADMIN_WAIT state regardless of whether its Critical attribute is set. Similarly, a resource in the ONLINE|ADMIN_WAIT state must stay in that state. In the following scenarios, a resource in the ONLINE or ONLINE|ADMIN_WAIT state is instead moved to the FAULTED state, and hence the group:
1. The agent restarts by any means (due to abnormal death, or because the user manually stops and starts the agent using HA commands) and the subsequent monitor reports the state as OFFLINE.
2. The user runs the "hagrp -flush" command and the subsequent monitor reports the state as OFFLINE.

RESOLUTION: Symantec has modified the 'had' and 'hacf' binaries and the agent framework library to prevent the resource from moving to the FAULTED state on agent restart.

* INCIDENT NO:2817929 TRACKING ID:2729816

SYMPTOM: Even after the value of the OnlineRetryLimit attribute is exhausted, Veritas Cluster Server (VCS) does not fail over a service group.

DESCRIPTION: You can initiate a switch operation for a service group that is restarting.
If you specify a target, VCS populates that target in the ToQ attribute of the service group. Once the value of OnlineRetryLimit is exhausted, VCS evaluates a failover target for the service group. However, the evaluation fails because the failover target is already present in ToQ.

RESOLUTION: Symantec has fixed the code such that VCS aborts an online retry operation if any of the following attributes are configured: ToQ, FromQ, and MigrateQ. VCS fails over service groups based on the values of ToQ, FromQ, and MigrateQ.

* INCIDENT NO:2817933 TRACKING ID:2729867

SYMPTOM: When the High Availability Daemon (HAD) goes down and a node at the primary site crashes, Veritas Cluster Server (VCS) is unable to fail over a global service group.

DESCRIPTION: The global service group and the Cluster Service Group (CSG) are online on the same node. When HAD goes down, VCS sets the IntentOnline attribute of the global service group to 0. VCS fails over the CSG. When the CSG comes online, VCS attempts to fail over the global service group. However, the attempt fails because the value of the IntentOnline attribute is 0.

RESOLUTION: Symantec has introduced checks other than the value of the IntentOnline attribute to determine whether to fail over a global service group.

* INCIDENT NO:2818046 TRACKING ID:2558988

SYMPTOM: When a system rejoins a cluster after fault recovery, the value of the CurrentLimits attribute on that system is not correctly reflected on the other member nodes of the cluster.

DESCRIPTION: This behavior occurs because after the faulted system rejoins, it broadcasts an update message with incorrect syntax.

RESOLUTION: Symantec has modified the VCS engine code to fix this issue.

* INCIDENT NO:2818234 TRACKING ID:2746802

SYMPTOM: A failover service group does not auto-start after VCS is started.

DESCRIPTION: The issue occurs when an offline local dependency is configured between failover service groups.
Based on its configuration, if the auto-start logic tries to bring the child service group online on a system where its parent service group is already online, VCS marks the parent service group for migration and also initiates its offline. However, after the offline, VCS fails to bring the child service group as well as the parent service group online on different systems according to the offline local dependency. This is because VCS inadvertently unmarks the parent service group for migration and marks it to go online on the same system.

RESOLUTION: Symantec has fixed the code such that VCS brings the parent and child service groups online on different systems.

* INCIDENT NO:2818373 TRACKING ID:2732228

SYMPTOM: VCS is unable to shut down with the init script.

DESCRIPTION: When Veritas Cluster Server (VCS) shuts down with the init script, it executes the '/etc/init.d/vcs stop' command. This command attempts to stop HAD on the node by using the 'hastop -sysoffline' command, and completes its execution when HAD reaches the EXITED state. VCS then runs the 'hasys -wait' command, which waits to receive cluster information from HAD. However, since HAD is in the process of exiting, it is unable to send the cluster information. As a result, the 'hasys -wait' command does not reach completion.

RESOLUTION: Symantec has modified VCS to ensure that it does not run the 'hasys -wait' command when HAD is in the process of exiting. VCS now waits for 60 seconds for HAD to close the GAB port 'h'.

* INCIDENT NO:2818375 TRACKING ID:2741299

SYMPTOM: The CmdSlave process dumps core with a SIGSEGV.

DESCRIPTION: The CmdSlave process gets stuck in a tight loop when it gets an EBADF on a file descriptor. The CmdSlave process keeps waiting for events on this file descriptor and eventually dumps core with SIGSEGV.

RESOLUTION: Symantec has modified the VCS code to make the CmdSlave process exit if it gets an EBADF on a file descriptor.
* INCIDENT NO:2818535 TRACKING ID:2199924

SYMPTOM: Veritas Cluster Server (VCS) is unable to load the configuration if the size of an attribute is greater than 4 KB.

DESCRIPTION: You may want to set large attributes for VCS resources. Up to VCS 5.0 MP3, the size of an attribute was limited to 4 KB. If you specified an attribute greater than 4 KB in size, VCS was unable to dump or load the VCS configuration, and the hacf utility dumped core.

RESOLUTION: Symantec has fixed the hacf utility by creating a flexible method to handle large input strings. VCS now supports attribute values up to 8 KB (8192 bytes) in size, so the VCS High Availability Daemon successfully loads and saves the configuration.

* INCIDENT NO:2837027 TRACKING ID:2647049

SYMPTOM: On SLES11, Veritas Cluster Server (VCS) logs do not reflect time zone changes.

DESCRIPTION: VCS uses the strftime() function to format the time stamp for logs. On SLES11, strftime() does not update the global variables for the time zone.

RESOLUTION: Symantec has modified the code for VCS logging to explicitly call the tzset() function before strftime().

* INCIDENT NO:2837278 TRACKING ID:2832754

SYMPTOM: When a Global Cluster Option (GCO) is configured across clusters having duplicate system names, the command-line utility hagrp gives incorrect output with the "-clear", "-flush", and "-state" options.

DESCRIPTION: The system names across clusters in a GCO are duplicates, and a global service group is configured on such systems from each cluster. Therefore, when you query the state of the global service group, incorrect output is displayed. Moreover, when you flush or clear such a service group and specify the cluster as input with the hagrp command, the command gives an error message asking for the cluster name.

RESOLUTION: Symantec has fixed the HAD utility to handle cases where the system names are duplicated across clusters.
* INCIDENT NO:2848182 TRACKING ID:2746816

SYMPTOM: GAB terminates the HAD process.

DESCRIPTION: HAD maintains a periodic heartbeat with GAB. HAD also performs a periodic self-test to control the interval between successive heartbeats. The self-test is performed using the Unix alarm mechanism, which sends SIGALRM to the HAD process. The SIGALRM handler invokes the syslog() call, which can cause the HAD process to go into an infinite sleep. As a result, HAD stops heartbeating with GAB, and GAB terminates the HAD process.

RESOLUTION: Symantec has removed the syslog() call and changed the signal handler function of HAD such that it writes a message to the had_hb_delay.txt file in the log directory.

* INCIDENT NO:2860833 TRACKING ID:2735410

SYMPTOM: The High Availability Daemon (HAD) floods the engine logs, and after some time dumps core and restarts.

DESCRIPTION: When a client encounters certain connection issues, many warning messages are logged in the engine logs. When a client (for example, Veritas Cluster Manager - Java Console) registers for the log cache with HAD, HAD can dereference a NULL pointer if the engine logs are flooded. This causes HAD to dump core and restart.

RESOLUTION: Symantec has fixed the HAD code by adding proper NULL checks and has changed the log message to a debug log message.

* INCIDENT NO:2909173 TRACKING ID:2909184

SYMPTOM: Failure messages of resource unregistration with IMF appear in agent/engine logs after performing online/offline operations on the resource.

DESCRIPTION: When a resource is registered with IMF for monitoring, any online/offline operation triggers unregistration of the resource from IMF. During such operations, the agent may log an error message in the agent/engine logs stating that unregistration failed. This issue is also observed for multiple resources.
RESOLUTION: These failure messages are false positives, and no action is required.

INCIDENTS FROM OLD PATCHES:
---------------------------

Patch Id::5.1.132.000

* Incident no::2296172 Tracking ID ::2296148

Symptom::Even if you set the value of the AutoFailover attribute to 2, VCS may fail over a service group from a FAULTED node to a node in another SystemZone.

Description::If you set the value of the AutoFailover attribute to 2, then VCS must fail over service groups only to target nodes that are in the same SystemZone (as defined in the SystemZones attribute). However, if VCS does not find any eligible target node in the same SystemZone, it fails over the service group to a node in another SystemZone. This behavior is due to an error in the code.

Resolution::Symantec has fixed the code to ensure that if you set the AutoFailover attribute to 2, then VCS fails over service groups only to a node that is in the same SystemZone. If VCS does not find an eligible target node in the same SystemZone, VCS does not fail over the service group.

* Incident no::2330041 Tracking ID ::2252095

Symptom::When a child service group is auto-started, parallel parent service groups that have an online-global dependency are not auto-started.

Description::When a child service group is auto-started, parallel parent service groups that have an online-global dependency are not auto-started on the systems in the parent service group's AutoStartList attribute. For an online-local dependency, VCS performs a validation that ensures that the parent service group is auto-started only on nodes where the child service group has auto-started. This validation also restricts parallel parent service groups with an online-global dependency from auto-starting on all nodes.

Resolution::The validation has been modified, and VCS now allows parallel parent service groups that have an online-global or online-remote dependency to auto-start on all nodes.
* Incident no::2330980 Tracking ID ::2330978

Symptom::When you add a node to the SystemList attribute of a group, a VCS agent may report an incorrect state of a resource on the existing nodes in the SystemList.

Description::When you add a node to the SystemList of a group, the related agent must start monitoring resources from that group on the new node. For this purpose, the High Availability Daemon (HAD) module sends a snapshot (information related to the monitored resources in the group, including attribute values) to the agent on the new node. However, HAD also sends the snapshot to the existing nodes. As a result, the agent framework may incorrectly modify certain attributes, and the agent may report an incorrect state of a resource on an existing node.

Resolution::Symantec has modified HAD to send a snapshot only to the agent on a newly added node in the SystemList of a group.

* Incident no::2354932 Tracking ID ::2354931

Symptom::'hacli -cmd' triggers a had core dump.

Description::When had is running in '-onenode' mode (started using 'hastart -onenode'), hacli tries to send ucast messages to other systems (which are not part of the cluster, as had is running in -onenode mode). The attempt to send a ucast message to other systems causes had to dump core.

Resolution::Symantec has made changes so that hacli does not send ucast messages when had is running in '-onenode' mode or simulator mode.

* Incident no::2372072 Tracking ID ::2365444

Symptom::"hacf" dumps core.

Description::If 'hacf' cannot get its current working directory, it logs an error. At this point the Log object is not initialized.

Resolution::The source code of hacf has been changed to make sure the Log object is initialized before any logging is done.

* Incident no::2382463 Tracking ID ::2209536

Symptom::If Preferred Fencing is enabled, VCS fails to assign a node weight when the value of FencingWeight is set to 10000.

Description::When Preferred Fencing is enabled, the least node weight that can be assigned is 1.
Hence, VCS adds 1 to the value specified in the FencingWeight attribute. If the value of FencingWeight is set to 10000, VCS tries to set the node weight to 10001, which fails because the maximum node weight that can be assigned is 10000.

Resolution::The maximum value that can be assigned to the FencingWeight attribute is now 9999.

* Incident no::2382493 Tracking ID ::2204340

Symptom::A parent service group does not fail over if it is dependent on an online-local-firm parallel child service group.

Description::The parallel child service group has OnOnly resources. While auto-starting, the TargetCount of the child service group is elevated to the number of systems in the AutoStartList. When an OnOff resource is added to the child service group, TargetCount is further incremented such that TargetCount is greater than the number of systems in the SystemList. When the parent service group tries to fail over after a fault, the engine sees TargetCount > CurrentCount for the child service group. The engine concludes that the child service group has not completed failover and hence aborts the failover of the parent service group.

Resolution::Symantec has added validation to make sure that TargetCount for a parallel service group is less than or equal to the number of systems in the SystemList. For a failover service group, TargetCount must be less than or equal to 1.

* Incident no::2382592 Tracking ID ::2207990

Symptom::While displaying the value of ResourceInfo using 'hares -display', the 'Status' key was not displayed.

Description::'hares -display' imposed an upper cap of 20 characters. Any attribute value greater than 20 characters was truncated. Hence the 'Status' key was not shown, as the 20-character limit was exhausted by the other keys (such as State, Msg, and TS) of ResourceInfo.

Resolution::Symantec has removed the upper cap of 20 characters so that keys are not truncated by 'hares -display' and the full value of the attribute is shown.

* Incident no::2398807 Tracking ID ::2398802

Symptom::In /opt/VRTSvcs/bin/vcsenv, the file descriptor limit was set to 2048.
Description::In /opt/VRTSvcs/bin/vcsenv, the soft and hard limits of file descriptors were set to 2048. If the hard limit was set to a higher value, it was overridden with the lower value (2048).

Resolution::Symantec has made changes in /opt/VRTSvcs/bin/vcsenv so that if the hard limit of file descriptors is less than 2048, both limits (soft and hard) are set to 2048; otherwise, only the soft limit is set to 2048. Thus the hard limit of file descriptors is never decreased.

* Incident no::2399898 Tracking ID ::2198770

Symptom::In a firm or soft group dependency, if more than one parent service group is ONLINE in a cluster, then you cannot switch a child service group from one node to another.

Description::When you run the 'hagrp -switch' command, the VCS engine checks the state of a parent service group before switching a child service group. If more than one parent service group is ONLINE, the VCS engine rejects the command irrespective of the rigidity (soft or firm) or location (local, global, or remote) of the group dependency.

Resolution::Symantec has introduced the following check for the 'hagrp -switch' command: the VCS engine checks the state of the parent service groups and validates the switch operation for a child service group based on the location and rigidity of the dependency.

* Incident no::2416842 Tracking ID ::2416758

Symptom::The CPU consumption of the "had" process is very high. The had process does not respond to any HA command. HA commands hang in the pollsys() system call.

Description::The "had" process can accept a limited number of connections from clients. This limit (FD_SETSIZE) is determined by the operating system. However, the accept system call can return a file descriptor greater than the limit. In such a case, "had" cannot process this file descriptor using the select system call. As a result, "had" goes into an unrecoverable loop.

Resolution::Symantec has fixed the code to ensure that "had" closes a file descriptor that is greater than FD_SETSIZE.
This prevents the "had" process from going into an unrecoverable loop.

* Incident no::2426572 Tracking ID ::2433479

Symptom::A persistent (Operations=NONE) type of resource is reported OFFLINE instead of FAULTED.

Description::VCS assumes the state of a resource is OFFLINE before it is probed. If the first monitor cycle of the agent returns OFFLINE, then the persistent resource is not reported as FAULTED.

Resolution::Symantec has fixed the code to ensure that the default state of a persistent resource is ONLINE. If the first monitor cycle returns OFFLINE, VCS marks the resource as FAULTED.

* Incident no::2439695 Tracking ID ::2513764

Symptom::The vxfen module is loaded in memory even if the vxfen service is disabled.

Description::The VCS HAD process opens the vxfen device file even if the UseFence attribute is not set. This results in the vxfen module getting loaded into memory.

Resolution::Symantec has fixed the code to ensure that the HAD process does not open the vxfen device file unless UseFence is set to SCSI3.

* Incident no::2439772 Tracking ID ::2439767

Symptom::The remote cluster becomes unresponsive after a network interruption, and the cluster state is LOST_CONN.

Description::This issue occurs when the local wide-area connector (WAC) receives a TCP connection close request along with multiple IPM messages from the client. This might happen in the following cases:
- When the TCP traffic from one cluster to another cluster is blocked, the remote WAC does not receive any heartbeats. If the remote WAC does not receive heartbeats for a period that is greater than SocketTimeout, it closes the TCP connection with the local WAC. The TCP layer on the remote node pushes all the un-ACKed data along with the FIN to the peer (server).
- If the local WAC is unresponsive or not scheduled, it is unable to process incoming TCP packets or send heartbeats for a period greater than SocketTimeout. The remote WAC closes the connection.
When the local WAC is scheduled again, it receives all the messages along with the TCP FIN message. When the remote WAC receives multiple messages along with a TCP connection close request, it tries to clear the received messages and clean up the connection. However, this corrupts the internal data structures on the local WAC, causing it to send messages to the wrong clients.

Resolution::Symantec has modified the code to ensure that there is no data corruption. The connection is re-established when the network is up.

* Incident no::2477280 Tracking ID ::2477273

Symptom::After VCS detects a concurrency violation, it subsequently does not fail over the service group on fault.

Description::When VCS detects a concurrency violation, the service group is brought offline on the node that violated the concurrency. VCS inadvertently sets the value of the TargetCount attribute to 0, which subsequently prevents VCS from failing over the service group.

Resolution::Symantec has fixed the code to ensure that the value of the TargetCount attribute is not set to 0 if the service group is active in the cluster.

* Incident no::2477296 Tracking ID ::2159816

Symptom::A service group does not fail over when a node panics.

Description::When a service group is in the process of failing over, if a flush operation is performed on the target node while the service group is not active on the target node, the value of the TargetCount attribute is inadvertently set to 0. Hence the service group does not fail over when the node panics in the future.

Resolution::Symantec has fixed the code to ensure that the flush operation does not set the value of TargetCount to 0 when a failover service group is active in the cluster.

* Incident no::2483044 Tracking ID ::2438954

Symptom::'had' dumps core with SIGSEGV when the assertion gp->activecount()->gets32GL(nodename) == 0 in the check_failover function in "Resource.C" fails.
Description::If the engine receives a second offline message for already-offline resources while VCS is still offlining resources in the path of the faulted resource (that is, when the PathCount of the group is still positive), the engine dumps core at the above location.

Resolution::A solution for this problem is provided in 5.1SP1RP2 and in 6.0.

* Incident no::2407755 Tracking ID ::2245069

Symptom::An agent crashes while allocating or de-allocating memory.

Description::Any kind of memory allocation done between the fork and execve system calls results in memory corruption followed by an agent crash.

Resolution::Symantec has modified the agent framework library so that it does not perform any memory allocation between the fork and execve system calls (in the child context). This prevents the memory corruption and the agent crash. A function call that is not async-signal-safe has also been removed from the signal handler, to avoid an agent hang during signal handling when memory corruption happens.