README VERSION        : 1.0
README Creation Date  : 2011-09-14
Patch-ID              : 5.1.132.000
Patch Name            : VRTSvcs-5.1SP1RP2-SLES11
BASE PACKAGE NAME     : Veritas Cluster Server
BASE PACKAGE VERSION  : VRTSvcs 5.1SP1
Obsolete Patches      : NONE
Superseded Patches    : NONE
Required Patches      : NONE
Incompatible Patches  : NONE
Supported PADV        : sles11_x86_64
                        (P-Platform, A-Architecture, D-Distribution, V-Version)
Patch Category        : CORE, HANG, PERFORMANCE
Reboot Required       : YES
KNOWN ISSUES          :

FIXED INCIDENTS:
----------------

Patch Id::5.1.132.000

* Incident no::2296172 Tracking ID ::2296148

Symptom::Even if you set the value of the AutoFailover attribute to 2, VCS may
fail over a service group from a FAULTED node to a node in another SystemZone.

Description::If you set the value of the AutoFailover attribute to 2, VCS must
fail over service groups only to target nodes that are in the same SystemZone
(as defined in the SystemZones attribute). However, if VCS does not find any
eligible target node in the same SystemZone, it fails over the service group to
a node in another SystemZone. This behavior is due to an error in the code.

Resolution::Symantec has fixed the code to ensure that if you set the
AutoFailover attribute to 2, VCS fails over service groups only to a node in
the same SystemZone. If VCS does not find an eligible target node in the same
SystemZone, it does not fail over the service group.

* Incident no::2330041 Tracking ID ::2252095

Symptom::When a child service group is auto-started, parallel parent service
groups that have an online-global dependency are not auto-started.

Description::When a child service group is auto-started, parallel parent
service groups that have an online-global dependency are not auto-started on
the systems in the parent service group's AutoStartList attribute. For an
online-local dependency, VCS performs a validation to ensure that the parent
service group is auto-started only on nodes where the child service group has
auto-started. This validation also prevented parallel parent service groups
with an online-global dependency from auto-starting on all nodes.

Resolution::The validation has been modified: VCS now allows parallel parent
service groups that have an online-global or online-remote dependency to
auto-start on all nodes.

* Incident no::2330980 Tracking ID ::2330978

Symptom::When you add a node to the SystemList attribute of a group, a VCS
agent may report an incorrect state of a resource on the existing nodes in the
SystemList.

Description::When you add a node to the SystemList of a group, the related
agent must start monitoring resources from that group on the new node. For
this purpose, the High Availability Daemon (HAD) module sends a snapshot
(information related to the monitored resources in the group, including
attribute values) to the agent on the new node. However, HAD also sends the
snapshot to the existing nodes. As a result, the agent framework may
incorrectly modify certain attributes, and the agent may report an incorrect
state of a resource on an existing node.

Resolution::Symantec has modified HAD to send the snapshot only to the agent
on a newly added node in the SystemList of a group.

* Incident no::2354932 Tracking ID ::2354931

Symptom::'hacli -cmd' triggers a coredump of had.

Description::When had is running in '-onenode' mode (started using 'hastart
-onenode'), hacli tries to send unicast messages to other systems, which are
not part of the cluster because had is running in -onenode mode. The attempt
to send unicast messages to other systems causes had to dump core.
Resolution::Symantec has made changes so that hacli does not send unicast
messages when had is running in '-onenode' mode or in simulator mode.

* Incident no::2372072 Tracking ID ::2365444

Symptom::"hacf" dumps core.

Description::If 'hacf' cannot determine its current working directory, it logs
an error. At this point the log object is not yet initialized.

Resolution::The source code of hacf has been changed to make sure the log
object is initialized before any logging is done.

* Incident no::2382463 Tracking ID ::2209536

Symptom::If Preferred Fencing is enabled, VCS assigns a node weight of 10000
when the value of FencingWeight is set to 10000.

Description::When Preferred Fencing is enabled, the least node weight that can
be assigned is 1. Hence, VCS adds 1 to the value specified in the
FencingWeight attribute. If the value of FencingWeight is set to 10000, VCS
tries to set the node weight to 10001, which fails because the maximum node
weight that can be assigned is 10000.

Resolution::The maximum value that can be assigned to the FencingWeight
attribute is now 9999.

* Incident no::2382493 Tracking ID ::2204340

Symptom::A parent service group does not fail over if it depends on an
online-local-firm parallel child service group.

Description::The parallel child service group has OnOnly resources. While
auto-starting, the TargetCount of the child service group is raised to the
number of systems in the AutoStartList. When an OnOff resource is added to the
child service group, TargetCount is incremented further, so that TargetCount
becomes greater than the number of systems in the SystemList. When the parent
service group tries to fail over after a fault, the engine sees TargetCount >
CurrentCount for the child service group, concludes that the child service
group has not completed its failover, and therefore aborts the failover of the
parent service group.

Resolution::Symantec has added a validation to make sure that the TargetCount
of a parallel service group is less than or equal to the number of systems in
the SystemList. For a failover service group, TargetCount must be less than or
equal to 1.

* Incident no::2382592 Tracking ID ::2207990

Symptom::While displaying the value of ResourceInfo using 'hares -display',
the 'Status' key was not displayed.

Description::'hares -display' imposed an upper cap of 20 characters; any
attribute value longer than 20 characters was truncated. The 'Status' key was
therefore not shown, because the 20-character limit was already exhausted by
the other keys of ResourceInfo (such as State, Msg, and TS).

Resolution::Symantec has removed the 20-character cap so that keys are no
longer truncated and 'hares -display' shows the full attribute value.

* Incident no::2398807 Tracking ID ::2398802

Symptom::In /opt/VRTSvcs/bin/vcsenv, the file descriptor limit was set to
2048.

Description::In /opt/VRTSvcs/bin/vcsenv, both the soft and the hard limit for
file descriptors were set to 2048. If the hard limit was already set to a
higher value, it was overridden with the lower value (2048).

Resolution::Symantec has changed /opt/VRTSvcs/bin/vcsenv so that if the hard
limit for file descriptors is less than 2048, both limits (soft and hard) are
set to 2048; otherwise only the soft limit is set to 2048. The hard limit for
file descriptors is therefore never lowered. (An illustrative sketch of this
logic follows.)
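The limit-adjustment logic described above can be approximated with the
following minimal C sketch based on the POSIX getrlimit()/setrlimit()
interface. vcsenv itself is a shell script, so this is only an illustration of
the behavior, not the actual code; the constant name VCS_MIN_FDS is assumed.

    #include <stdio.h>
    #include <sys/resource.h>

    #define VCS_MIN_FDS 2048    /* illustrative name for the 2048 threshold */

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        if (rl.rlim_max < VCS_MIN_FDS) {
            /* Hard limit is below 2048: raise both the soft and the hard
             * limit (raising the hard limit normally requires privilege). */
            rl.rlim_cur = VCS_MIN_FDS;
            rl.rlim_max = VCS_MIN_FDS;
        } else {
            /* Hard limit is already >= 2048: set only the soft limit, so
             * that a higher hard limit is never lowered. */
            rl.rlim_cur = VCS_MIN_FDS;
        }
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }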
* Incident no::2399898 Tracking ID ::2198770

Symptom::In a firm or soft group dependency, if more than one parent service
group is ONLINE in the cluster, you cannot switch a child service group from
one node to another.

Description::When you run the 'hagrp -switch' command, the VCS engine checks
the state of the parent service groups before switching the child service
group. If more than one parent service group is ONLINE, the VCS engine rejects
the command irrespective of the rigidity (soft or firm) or location (local,
global, or remote) of the group dependency.

Resolution::Symantec has introduced the following check for the 'hagrp
-switch' command: the VCS engine checks the state of the parent service groups
and validates the switch operation for the child service group based on the
location and rigidity of the dependency.

* Incident no::2416842 Tracking ID ::2416758

Symptom::The CPU consumption of the "had" process is very high. The had
process does not respond to any HA command; HA commands hang in the pollsys()
system call.

Description::The "had" process can accept a limited number of connections from
clients. This limit (FD_SETSIZE) is determined by the operating system.
However, the accept system call can return a file descriptor greater than this
limit. In such a case "had" cannot process the file descriptor using the
select system call, and as a result "had" goes into an unrecoverable loop.

Resolution::Symantec has fixed the code to ensure that "had" closes any file
descriptor that exceeds FD_SETSIZE. This prevents the "had" process from going
into an unrecoverable loop. (An illustrative sketch of this pattern follows.)
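The following is a minimal, illustrative C sketch of the pattern described in
the resolution above: a select()-based server can only monitor descriptors
below FD_SETSIZE, so an accepted descriptor at or above that limit is closed
instead of being added to the descriptor set. The function and variable names
are assumptions; this is not the actual had source code.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Accept one client on 'listen_fd' and track it in 'fds' only if select()
     * can actually monitor it.  select() handles descriptors 0..FD_SETSIZE-1,
     * so anything at or above FD_SETSIZE is closed immediately instead of
     * leaving the process stuck with a descriptor it can never service. */
    static void accept_client(int listen_fd, fd_set *fds, int *maxfd)
    {
        int client_fd = accept(listen_fd, NULL, NULL);

        if (client_fd < 0) {
            perror("accept");
            return;
        }
        if (client_fd >= FD_SETSIZE) {
            fprintf(stderr, "too many client connections; rejecting\n");
            close(client_fd);
            return;
        }
        FD_SET(client_fd, fds);
        if (client_fd > *maxfd)
            *maxfd = client_fd;
    }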
* Incident no::2426572 Tracking ID ::2433479

Symptom::A persistent (Operations=NONE) type of resource is reported as
OFFLINE instead of FAULTED.

Description::VCS assumes that the state of a resource is OFFLINE before it is
probed. If the first monitor cycle of the agent returns OFFLINE, the
persistent resource is therefore not reported as FAULTED.

Resolution::Symantec has fixed the code to ensure that the default state of a
persistent resource is ONLINE. If the first monitor cycle returns OFFLINE, VCS
marks the resource as FAULTED.

* Incident no::2439695 Tracking ID ::2513764

Symptom::The vxfen module is loaded in memory even if the vxfen service is
disabled.

Description::The VCS HAD process opens the vxfen device file even if the
UseFence attribute is not set. This results in the vxfen module being loaded
into memory.

Resolution::Symantec has fixed the code to ensure that the HAD process does
not open the vxfen device file unless UseFence is set to SCSI3.

* Incident no::2439772 Tracking ID ::2439767

Symptom::The remote cluster becomes unresponsive after a network interruption
and the cluster state is LOST_CONN.

Description::This issue occurs when the local wide-area connector (WAC)
receives a TCP connection close request along with multiple IPM messages from
the client. This might happen in the following cases:
- When the TCP traffic from one cluster to another is blocked, the remote WAC
  does not receive any heartbeats. If the remote WAC does not receive
  heartbeats for a period greater than SocketTimeout, it closes the TCP
  connection with the local WAC. The TCP layer on the remote node pushes all
  the un-ACKed data, along with the FIN, to the peer (server).
- If the local WAC is unresponsive or not scheduled, it cannot process
  incoming TCP packets or send heartbeats for a period greater than
  SocketTimeout. The remote WAC closes the connection. When the local WAC is
  scheduled again, it receives all the pending messages along with the TCP FIN
  message.
When the local WAC receives multiple messages along with a TCP connection
close request, it tries to clear the received messages and clean up the
connection. However, this corrupts internal data structures on the local WAC,
causing it to send messages to the wrong clients.

Resolution::Symantec has modified the code to ensure that there is no data
corruption. The connection is re-established when the network is up.

* Incident no::2477280 Tracking ID ::2477273

Symptom::After VCS detects a concurrency violation, it subsequently does not
fail over the service group on a fault.

Description::When VCS detects a concurrency violation, the service group is
brought offline on the node that violated the concurrency. VCS inadvertently
sets the value of the TargetCount attribute to 0, which subsequently prevents
VCS from failing over the service group.

Resolution::Symantec has fixed the code to ensure that the value of the
TargetCount attribute is not set to 0 if the service group is active in the
cluster.

* Incident no::2477296 Tracking ID ::2159816

Symptom::A service group does not fail over when a node panics.

Description::When a service group is in the process of failing over, if a
flush operation is performed on the target node while the service group is not
active on that node, the value of the TargetCount attribute is inadvertently
set to 0. Hence the service group does not fail over when the node panics
later.

Resolution::Symantec has fixed the code to ensure that the flush operation
does not set the value of TargetCount to 0 when a failover service group is
active in the cluster.

* Incident no::2483044 Tracking ID ::2438954

Symptom::'had' dumps core with SIGSEGV when asserting against
gp->activecount()->gets32GL(nodename) == 0 in "Resource.C", in the
check_failover function.

Description::If the engine receives a second offline message for an already
offline resource while VCS is still offlining resources in the path of the
faulted resource (that is, while the PathCount of the group is still
positive), the engine dumps core at the above location.

Resolution::A solution for this problem is provided in 5.1SP1RP2 and in 6.0.

* Incident no::2407755 Tracking ID ::2245069

Symptom::An agent crashes while allocating or de-allocating memory.

Description::Any kind of memory allocation done between the fork and execve
system calls results in memory corruption, followed by an agent crash. (After
fork() in a multithreaded process, the child may inherit a heap left in an
inconsistent, locked state by other threads, so heap operations before execve
are unsafe.)

Resolution::Symantec has modified the agent framework library so that it does
not perform any memory allocation between the fork and execve system calls (in
the child context). This prevents the memory corruption and the agent crash. A
function call that is not async-signal-safe has also been removed from the
signal handler, to avoid an agent hang during signal handling when memory
corruption occurs. (An illustrative sketch of the fork/execve constraint
appears at the end of this file.)

Incidents from old Patches:
---------------------------
NONE
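As referenced in incident 2407755 above, the following is a minimal,
illustrative C sketch of the fork/execve constraint: everything the child
needs is prepared before fork(), and the child makes only async-signal-safe
calls (execve, write, _exit), so no heap allocation happens between fork and
execve. The helper command and function name are assumptions; this is not VCS
agent framework code.

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Run a helper command without allocating memory in the child.  argv,
     * envp, and the error message are all prepared in the parent, so the
     * child only calls execve(), write(), and _exit() -- all of which are
     * async-signal-safe -- between fork() and execve(). */
    static int spawn_helper(void)
    {
        char *const argv[] = { "/bin/true", NULL };   /* placeholder command */
        char *const envp[] = { NULL };
        static const char errmsg[] = "execve failed\n";
        int status = -1;
        pid_t pid = fork();

        if (pid < 0)
            return -1;                        /* fork failed */
        if (pid == 0) {
            /* Child: no malloc/free here; the heap may have been left in an
             * inconsistent state by other threads of the parent. */
            execve(argv[0], argv, envp);
            write(STDERR_FILENO, errmsg, sizeof(errmsg) - 1);
            _exit(127);
        }
        waitpid(pid, &status, 0);             /* parent reaps the child */
        return status;
    }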