* * * READ ME * * *
* * * Veritas Volume Manager 5.1 SP1 RP1 * * *
* * * P-patch 2 * * *
Patch Date: 2012-06-19

This document provides the following information:

* PATCH NAME
* PACKAGES AFFECTED BY THE PATCH
* BASE PRODUCT VERSIONS FOR THE PATCH
* OPERATING SYSTEMS SUPPORTED BY THE PATCH
* INCIDENTS FIXED BY THE PATCH
* INSTALLATION PRE-REQUISITES
* INSTALLING THE PATCH
* REMOVING THE PATCH


PATCH NAME
----------
Veritas Volume Manager 5.1 SP1 RP1 P-patch 2


PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxvm


BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
* Veritas Storage Foundation for Oracle RAC 5.1 SP1
* Veritas Storage Foundation Cluster File System 5.1 SP1
* Veritas Storage Foundation 5.1 SP1
* Veritas Storage Foundation High Availability 5.1 SP1
* Veritas Dynamic Multi-Pathing 5.1 SP1


OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
HP-UX 11i v3 (11.31)


INCIDENTS FIXED BY THE PATCH
----------------------------
This patch fixes the following Symantec incidents:

Patch ID: PHCO_42992, PHKL_42993

* 2280285 (Tracking ID: 2365486)

SYMPTOM:
In a two-node SFRAC configuration, when "vxdisk scandisks" is run after the ports are enabled, the system panics with the following stack:

PANIC STACK:
.unlock_enable_mem()
.unlock_enable_mem()
dmp_update_path()
dmp_decode_update_dmpnode()
dmp_decipher_instructions()
dmp_process_instruction_buffer()
dmp_reconfigure_db()
gendmpioctl()
vxdmpioctl()
rdevioctl()
spec_ioctl()
vnop_ioctl()
vno_ioctl()
common_ioctl()
ovlya_addr_sc_flih_main()

DESCRIPTION:
An improper order of lock acquisition and release during DMP reconfiguration, while I/O activity was running in parallel, led to the above panic.

RESOLUTION:
The locks are now released in the same order in which they are acquired.

* 2532440 (Tracking ID: 2495186)

SYMPTOM:
When the TCP protocol is used for replication, I/O throttling occurs due to memory flow control.

DESCRIPTION:
In some slow network configurations, the overall I/O throughput is throttled back because of the replication I/O.

RESOLUTION:
The replication I/O is kept outside the normal I/O code path to improve I/O throughput.

* 2563291 (Tracking ID: 2527289)

SYMPTOM:
In a Campus Cluster setup, a storage fault may lead to the DETACH of all the configured sites. This also results in I/O failure on all the nodes in the Campus Cluster.

DESCRIPTION:
On site-consistent disk groups, a site is detached when any volume in the disk group loses all the mirrors of that site. While processing the DETACH of the last mirror in a site, VxVM identifies that it is the last mirror and detaches the site, which in turn detaches all the objects of that site. In a Campus Cluster setup, a DCO volume is attached to every data volume created on a site-consistent disk group, and the usual configuration is one DCO mirror on each site. Loss of a single mirror of the DCO volume on any node results in the detach of that site. In a two-site configuration, this scenario can cause both DCO mirrors to be lost simultaneously: while the site detach for the first mirror is being processed, a DETACH is also signaled for the second mirror, which ends up detaching the second site as well. Other cases are protected by an existing check that prevents detaching the last mirror of a volume, but that check is subverted here by the type of storage failure.

RESOLUTION:
Before triggering a site detach, an explicit check is made to verify that the last ACTIVE site is not being detached.
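The following minimal C sketch illustrates the kind of guard the resolution describes: a detach is refused if it would remove the last ACTIVE site. The structure and function names are hypothetical and are not taken from the VxVM source.

#include <stddef.h>

enum site_state { SITE_ACTIVE, SITE_DETACHED };

struct site {
    enum site_state  state;
    struct site     *next;       /* next site in the disk group's site list */
};

/* Count the sites that are still ACTIVE in the disk group's site list. */
static int active_site_count(const struct site *sites)
{
    int n = 0;
    for (const struct site *s = sites; s != NULL; s = s->next)
        if (s->state == SITE_ACTIVE)
            n++;
    return n;
}

/* Refuse a detach that would leave no ACTIVE site in the disk group. */
static int site_detach(struct site *target, struct site *sites)
{
    if (target->state == SITE_ACTIVE && active_site_count(sites) <= 1)
        return -1;   /* explicit check: do not detach the last ACTIVE site */
    target->state = SITE_DETACHED;
    return 0;
}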
* 2621549 (Tracking ID: 2621465)

SYMPTOM:
When a failed disk that belongs to a site becomes accessible again, it cannot be reattached to the disk group.

DESCRIPTION:
Because the disk has a site tag name set, the 'vxdg adddisk' command invoked by the 'vxreattach' command needs the '-f' option to add the disk back to the disk group.

RESOLUTION:
The '-f' option is added to the 'vxdg adddisk' command when it is invoked by the 'vxreattach' command.

* 2626900 (Tracking ID: 2608849)

SYMPTOM:
1. Under a heavy I/O load on the logclient node, write I/Os on the VVR Primary logowner take a very long time to complete.
2. I/Os on the "master" and "slave" nodes hang when the "master" role is switched multiple times using the "vxclustadm setmaster" command.

DESCRIPTION:
1. VVR does not allow more than 2048 I/Os outstanding on the SRL volume. Any I/Os beyond this threshold are throttled. The throttled I/Os are restarted after every SRL header flush operation. While restarting the throttled I/Os, the I/Os coming from the logclient are given higher priority, causing the logowner I/Os to starve.
2. In the CVM reconfiguration code path, the RLINK ports are not cleanly deleted on the old logowner. This prevents the RLINKs from connecting, leading to both a replication and an I/O hang.

RESOLUTION:
The algorithm that restarts the throttled I/Os is modified to give both local and remote I/Os a fair chance to proceed. Additionally, the CVM reconfiguration code path is changed to delete the RLINK ports cleanly before switching the master role.

* 2626915 (Tracking ID: 2417546)

SYMPTOM:
Raw devices are lost after an OS reboot, and a permissions issue arises because the dmpnode permissions change from 660 to 600.

DESCRIPTION:
On reboot, while creating raw devices, the next available device number is generated. Due to a counting bug, VxVM ended up creating one device fewer than required. The change in dmpnode permissions also caused a permissions issue.

RESOLUTION:
This issue is addressed by source changes that keep the correct counters and set the device permissions appropriately.

* 2626920 (Tracking ID: 2061082)

SYMPTOM:
The "vxddladm -c assign names" command does not work if the dmp_native_support tunable is enabled.

DESCRIPTION:
If the dmp_native_support tunable is set to "on", VxVM does not allow the names of dmpnodes to be changed. This holds true even for devices on which native support is not enabled, such as VxVM-labeled or Third Party Devices. So there is no way to selectively change the names of devices for which native support is not enabled.

RESOLUTION:
This enhancement is addressed by a code change that allows names to be changed selectively for devices on which native support is not enabled.

* 2636094 (Tracking ID: 2635476)

SYMPTOM:
The DMP (Dynamic Multi-Pathing) driver does not automatically enable the failed paths of Logical Units (LUNs) that are restored.

DESCRIPTION:
DMP's restore daemon probes each failed path at a default interval of 5 minutes (tunable) to detect whether that path can be enabled. As part of enabling the path, DMP issues an open() on the path's device number. Owing to a bug in the DMP code, the open() was issued on a wrong device partition, which resulted in failure for every probe. Thus, the path remained in failed status at the DMP layer even though it had been enabled at the array side.

RESOLUTION:
The DMP restore daemon code path is modified to issue the open() on the appropriate device partition.
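For illustration, a user-space analogue of such a probe is sketched below. It is not the in-kernel DMP code; the device node shown is an example of an HP-UX agile (persistent) DSF.

#include <fcntl.h>
#include <unistd.h>

/* Return 1 if the device node opens successfully (path looks healthy),
 * else 0. O_NDELAY avoids blocking on devices that are not ready. */
static int probe_path(const char *devnode)
{
    int fd = open(devnode, O_RDONLY | O_NDELAY);
    if (fd < 0)
        return 0;    /* probe failed; leave the path marked FAILED */
    close(fd);
    return 1;        /* probe succeeded; the path can be re-enabled */
}

/* Example usage: probe the correct device node, as per the fix above. */
/* probe_path("/dev/rdisk/disk4"); */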
* 2643651 (Tracking ID: 2643634)

SYMPTOM:
If standard (non-clone) disks and cloned disks of the same disk group are seen on a host, and the standard (non-clone) disks have no enabled configuration copy of the disk group, the disk group import fails with the following error message:

# vxdg import 
VxVM vxdg ERROR V-5-1-10978 Disk group : import failed:
Disk group has no valid configuration copies

DESCRIPTION:
When VxVM imports such a mixed configuration of standard (non-clone) disks and cloned disks, the standard (non-clone) disks are selected as the members of the disk group in 5.0MP3RP5HF1 and 5.1SP1RP2. This happens without administrators being aware that a mixed configuration exists and that the standard (non-clone) disks are the ones selected for the import. The issue is hard to diagnose from the error message alone and takes time to investigate.

RESOLUTION:
Syslog message enhancements are made so that administrators can determine whether such a mixed configuration is seen on a host, and which disks are selected for the import.

* 2666175 (Tracking ID: 2666163)

SYMPTOM:
A small memory leak may be seen in vxconfigd, the VxVM configuration daemon, when a Serial Split Brain (SSB) error is detected during the import process.

DESCRIPTION:
The leak may occur when an SSB error is detected during the import process, because when the SSB error is returned from a function, a dynamically allocated memory area in that function is not freed. SSB detection is a VxVM feature whereby VxVM detects whether the configuration copy in the disk private region has unexpectedly become stale. A typical cause of the SSB error is a disk group imported on different systems at the same time, where configuration copy updates on both systems result in inconsistent copies. VxVM cannot identify which configuration copy is the most up-to-date in this situation. As a result, VxVM may detect an SSB error on the next import and show the details through a CLI message.

RESOLUTION:
Code changes are made to avoid the memory leak, and a small message fix has been made.

* 2695225 (Tracking ID: 2675538)

SYMPTOM:
Data corruption can be observed on a CDS (Cross-platform Data Sharing) disk as part of LUN resize operations. The following pattern is found in the data region of the disk:

cyl alt 2 hd sec

DESCRIPTION:
A CDS disk maintains a SUN VTOC in the zeroth block and a backup label at the end of the disk. The VTOC maintains the disk geometry information, such as the number of cylinders, tracks, and sectors per track. The backup label is a duplicate of the VTOC, and the backup label location is determined from the VTOC contents. As part of a resize, the VTOC is not updated to the new size, which results in a wrong calculation of the backup label location. If the wrongly calculated backup label location falls in the public data region rather than at the end of the disk as designed, data corruption occurs.

RESOLUTION:
The VTOC contents are updated appropriately during LUN resize operations to prevent the data corruption.
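The arithmetic behind this corruption can be shown with a small, self-contained example. The geometry numbers and structure below are illustrative assumptions, not the actual VTOC layout: with a stale cylinder count, the computed backup label location lands inside the enlarged public region instead of at the end of the disk.

#include <stdio.h>

struct geom { unsigned ncyl, nhead, nsect; };  /* cylinders/tracks/sectors */

/* The backup label location is derived from the disk size implied by the
 * VTOC geometry: ncyl * nhead * nsect sectors. */
static unsigned long long backup_label_sector(const struct geom *g)
{
    return (unsigned long long)g->ncyl * g->nhead * g->nsect - 1;
}

int main(void)
{
    struct geom stale   = { 1024, 16, 128 };  /* geometry before the resize */
    struct geom current = { 4096, 16, 128 };  /* actual size after the resize */

    /* With the stale VTOC the label lands at sector 2097151, well inside
     * the enlarged 8388608-sector disk, so writing it clobbers user data. */
    printf("stale label location: %llu\n", backup_label_sector(&stale));
    printf("actual disk end:      %llu\n", backup_label_sector(&current));
    return 0;
}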
* 2695227 (Tracking ID: 2674465)

SYMPTOM:
Data corruption is observed when DMP node names are changed by the following commands for DMP devices that are controlled by a third-party multi-pathing driver (e.g. MPxIO and PowerPath):

# vxddladm [-c] assign names
# vxddladm assign names file=
# vxddladm set namingscheme=

DESCRIPTION:
The above commands, when executed, re-assign names to the devices. Accordingly, the in-core DMP database should be updated for each device to map the new device name to the appropriate device number. Due to a bug in the code, the mapping of names to device numbers was not done appropriately, which resulted in subsequent I/Os going to a wrong device, thus leading to data corruption.

RESOLUTION:
The DMP routines responsible for mapping the names to the right device numbers are modified to fix this corruption problem.

* 2695228 (Tracking ID: 2688747)

SYMPTOM:
Under a heavy I/O load on the logclient node, writes on the VVR Primary logowner take a very long time to complete. The writes appear to be hung.

DESCRIPTION:
VVR does not allow more than a specific number of I/Os (4096) outstanding on the SRL volume. Any I/Os beyond this threshold are throttled. The throttled I/Os are restarted periodically. While restarting, the I/Os belonging to the logclient get higher preference than the logowner I/Os, which can eventually lead to starvation or an I/O hang on the logowner.

RESOLUTION:
The I/O scheduling algorithm for restarted I/Os is changed to make sure that throttled local I/Os get a chance to proceed under all conditions.

* 2701152 (Tracking ID: 2700486)

SYMPTOM:
If the VVR Primary and Secondary nodes have the same host name and there is a loss of heartbeats between them, the vradmind daemon can dump core if an active stats session already exists on the Primary node. The following stack trace is observed:

pthread_kill()
_p_raise()
raise.raise()
abort()
__assert_c99
StatsSession::sessionInitReq()
StatsSession::processOpReq()
StatsSession::processOpMsgs()
RDS::processStatsOpMsg()
DBMgr::processStatsOpMsg()
process_message()
main()

DESCRIPTION:
On a loss of heartbeats between the Primary and Secondary nodes, and a subsequent reconnect, RVG information is sent to the Primary by the Secondary node. In this case, if a stats session already exists on the Primary, a STATS_SESSION_INIT request is sent back to the Secondary. However, the code used the "hostname" (as returned by `uname -a`) to identify the Secondary node. Since both nodes had the same host name, the resulting STATS_SESSION_INIT request was received at the Primary itself, causing vradmind to dump core.

RESOLUTION:
The code is modified to use the 'virtual host-name' information contained in the RLINKs, rather than hostname(1m), to identify the Secondary node. In a scenario where both the Primary and Secondary have the same host name, virtual host names are used to configure VVR.

* 2702110 (Tracking ID: 2700792)

SYMPTOM:
vxconfigd, the VxVM volume configuration daemon, may dump core with the following stack during Cluster Volume Manager (CVM) startup with "hares -online cvm_clus -sys [node]":

dg_import_finish()
dg_auto_import_all()
master_init()
role_assume()
vold_set_new_role()
kernel_get_cvminfo()
cluster_check()
vold_check_signal()
request_loop()
main()

DESCRIPTION:
During CVM startup, vxconfigd accesses the disk group record's pointer of a pending record while the transaction on the disk group is in progress. At times, vxconfigd incorrectly accesses the stale pointer while processing the current transaction, resulting in a core dump.

RESOLUTION:
Code changes are made to access the appropriate pointer of the disk group record that is active in the current transaction. Also, the disk group record pointer is appropriately initialized to NULL.
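The sketch below shows the defensive pattern of that resolution in hypothetical form: the pending record pointer is validated against the current transaction before use and reset to NULL afterwards. The types and names are illustrative, not vxconfigd internals.

#include <stddef.h>

struct dg_record {
    int active;    /* nonzero while part of the current transaction */
};

struct transaction {
    struct dg_record *pending;   /* may go stale between transactions */
};

/* Only hand out a record that is active in the current transaction;
 * a stale pending pointer is treated as "no record". */
static struct dg_record *current_dg_record(struct transaction *tx)
{
    if (tx->pending != NULL && tx->pending->active)
        return tx->pending;
    return NULL;
}

/* Re-initialize the pointer when the transaction completes, so later
 * transactions can never dereference a stale record. */
static void transaction_done(struct transaction *tx)
{
    tx->pending = NULL;
}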
* 2703370 (Tracking ID: 2700086)

SYMPTOM:
In the presence of "Not-Ready" EMC devices on the system, multiple DMP (path disabled/enabled) event messages are seen in the syslog.

DESCRIPTION:
vxconfigd enables BCV devices that are in the Not-Ready state for I/O because the SCSI inquiry succeeds, but it soon finds that they cannot be used for I/O and disables those paths. This activity takes place whenever the "vxdctl enable" or "vxdisk scandisks" command is executed.

RESOLUTION:
The state of a BCV device that is in the "Not-Ready" state is no longer changed, which prevents the I/O attempts and DMP event messages.

* 2703373 (Tracking ID: 2698860)

SYMPTOM:
Mirroring a large VxVM volume that is configured on THIN LUNs and has a mounted VxFS file system on top fails with the following error:

# vxassist -b -g $disk_group_name mirror $volume_name
VxVM vxplex ERROR V-5-1-14671 Volume is configured on THIN luns and not mounted.
Use 'force' option, to bypass smartmove. To take advantage of smartmove for
supporting thin luns, retry this operation after mounting the volume.
VxVM vxplex ERROR V-5-1-407 Attempting to cleanup after failure ...

Truss output error:
statvfs("", 0xFFBFEB54) Err#79 EOVERFLOW

DESCRIPTION:
The statvfs system call is invoked internally during the mirroring operation to retrieve statistics for the VxFS file system hosted on the volume. However, the statvfs system call supports a maximum of 4294967295 (2^32-1) blocks, so if the file system has more blocks than that, an EOVERFLOW error occurs. This also causes vxplex to terminate with the above errors.

RESOLUTION:
The 64-bit version of statvfs, the statvfs64 system call, is used to resolve the EOVERFLOW and vxplex errors, as sketched below.
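The fix amounts to switching to the 64-bit interface. A minimal sketch follows, assuming an example mount point of /mnt/vxfs_vol; it shows the statvfs64() call that avoids the EOVERFLOW on very large file systems.

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs64 sv;

    /* statvfs() stores block counts in 32-bit fields and fails with
     * EOVERFLOW beyond 2^32-1 blocks; statvfs64() uses 64-bit fields. */
    if (statvfs64("/mnt/vxfs_vol", &sv) != 0) {   /* example mount point */
        perror("statvfs64");
        return 1;
    }
    printf("file system blocks: %llu\n", (unsigned long long)sv.f_blocks);
    return 0;
}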
* 2711758 (Tracking ID: 2710579)

SYMPTOM:
Data corruption can be observed on a CDS (Cross-platform Data Sharing) disk as part of operations such as LUN resize, disk FLUSH, and disk ONLINE. The following pattern is found in the data region of the disk:

cyl alt 2 hd sec

DESCRIPTION:
A CDS disk maintains a SUN VTOC in the zeroth block and a backup label at the end of the disk. The VTOC maintains the disk geometry information, such as the number of cylinders, tracks, and sectors per track. The backup label is a duplicate of the VTOC, and the backup label location is determined from the VTOC contents. If the contents of the SUN VTOC located in the zeroth sector are incorrect, this may result in a wrong calculation of the backup label location. If the wrongly calculated backup label location falls in the public data region rather than at the end of the disk as designed, data corruption occurs.

RESOLUTION:
Writing the backup label is suppressed to prevent the data corruption.

* 2713862 (Tracking ID: 2390998)

SYMPTOM:
When the 'vxdctl' or 'vxdisk scandisks' command is run after migrating SAN ports, the system panics with the following stack trace:

.disable_lock()
dmp_close_path()
dmp_do_cleanup()
dmp_decipher_instructions()
dmp_process_instruction_buffer()
dmp_reconfigure_db()
gendmpioctl()
vxdmpioctl()

DESCRIPTION:
SAN port migration ends up with two path nodes for the same device number, with one node marked NODE_DEVT_USED, which means the same device number has been reused by another node. When the DMP device is opened, the actual open count is modified on the new node (the one not marked NODE_DEVT_USED). If the caller is referencing the old node (marked NODE_DEVT_USED), it then modifies the layered open count on that old node. This results in inconsistent open reference counts for the nodes and causes a panic when the open counts are checked while closing the DMP device.

RESOLUTION:
The code is changed so that the actual open count and the layered open count are modified on the same node when a DMP device is opened or closed.

* 2741105 (Tracking ID: 2722850)

SYMPTOM:
Disabling/enabling controllers while I/O is in progress results in a DMP (Dynamic Multi-Pathing) thread hang with the following stack:

dmp_handle_delay_open
gen_dmpnode_update_cur_pri
dmp_start_failover
gen_update_cur_pri
dmp_update_cur_pri
dmp_process_curpri
dmp_daemons_loop

DESCRIPTION:
DMP takes an exclusive lock to quiesce a node to be failed over, and releases the lock to perform the update operations. These update operations presume that the node remains in quiesced status. A small timing window exists between the lock release and the update operations, in which other threads can break in and unquiesce the node, leading to a hang during the update operations.

RESOLUTION:
The quiesce counter of a node is corrected so that other threads cannot unquiesce it while a thread is performing the update operations.

* 2744219 (Tracking ID: 2729501)

SYMPTOM:
In a Dynamic Multi-Pathing environment, excluding a path also excludes other paths with matching substrings.

DESCRIPTION:
Excluding a path using "vxdmpadm exclude vxvm path=<>" excludes all the paths with a matching substring. This is caused by the use of strncmp() for the comparison. Also, the size of the h/w path defined in the structure is larger than what is actually fetched.

RESOLUTION:
The size of the h/w path in the structure is corrected, and strcmp() is used for the comparison in place of strncmp(). The sketch below demonstrates the difference.
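The pitfall is easy to reproduce in isolation. In the sketch below the path strings are invented examples: strncmp() bounded by the excluded path's length matches any longer path sharing that prefix, while strcmp() matches only the exact path.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *excluded = "0/0/1/0.1";   /* path the user asked to exclude */
    const char *other    = "0/0/1/0.10";  /* different path, same prefix */

    /* Buggy comparison: only the first strlen(excluded) characters are
     * compared, so "0/0/1/0.10" wrongly matches "0/0/1/0.1". */
    printf("strncmp: %s\n",
           strncmp(excluded, other, strlen(excluded)) == 0 ? "match"
                                                           : "no match");

    /* Fixed comparison: full-string equality, as in the resolution. */
    printf("strcmp:  %s\n",
           strcmp(excluded, other) == 0 ? "match" : "no match");
    return 0;
}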
* 2750454 (Tracking ID: 2423701)

SYMPTOM:
Upgrading VxVM changed the permissions of /etc/vx/vxesd from drwx------ to d---r-x--- during a live upgrade.

DESCRIPTION:
The '/etc/vx/vxesd' directory is shipped in VxVM with "drwx------" permissions. However, while the vxesd daemon is starting, if this directory is not present, it gets created with "d---r-x---" permissions.

RESOLUTION:
Changes are made so that when the vxesd daemon starts, '/etc/vx/vxesd' is created with 'drwx------' permissions.

* 2752178 (Tracking ID: 2741240)

SYMPTOM:
In a VxVM environment, "vxdg join", when executed during a heavy I/O load, fails with the message below:

VxVM vxdg ERROR V-5-1-4597 vxdg join [source_dg] [target_dg] failed
join failed : Commit aborted, restart transaction
join failed : Commit aborted, restart transaction

Half of the disks that were part of the source_dg become part of the target_dg, whereas the other half have no DG details.

DESCRIPTION:
VxVM implements "vxdg join" as a two-phase transaction. If the transaction fails after the first phase, during the second phase, half of the disks belonging to the source_dg become part of the target_dg and the other half of the disks are left in a complex, irrecoverable state. Also, under a heavy I/O load, the transaction retry limit can easily be exceeded.

RESOLUTION:
"vxdg join" is now designed as a one-phase atomic transaction, and the retry limit is eliminated.

* 2774907 (Tracking ID: 2771452)

SYMPTOM:
On a lossy, high-latency network, I/O hangs on the VVR Primary. Just before the I/O hang, the RLINK frequently connects and disconnects.

DESCRIPTION:
On a lossy, high-latency network, heartbeat timeouts cause the RLINK to be disconnected. As part of the RLINK disconnect, the communication port is deleted. During this process, the RVG is serialized and the I/Os are kept in a special queue, rv_restartq. The I/Os in rv_restartq are supposed to be restarted once the port deletion is successful. The port deletion involves the termination of all the communication server processes. Because of a bug in the port deletion logic, the global variable that keeps track of the number of communication server processes was decremented twice. This caused the port deletion process to hang, so the I/Os in rv_restartq were never restarted.

RESOLUTION:
The port deletion logic is fixed to make sure that the global variable that keeps track of the number of communication server processes is decremented correctly.


INSTALLING THE PATCH
--------------------
# swinstall -x autoreboot=true

After installing the patches, run swverify to make sure that the patches are installed correctly:

# swverify


REMOVING THE PATCH
------------------
To remove the patch, enter the following command:

# swremove -x autoreboot=true


SPECIAL INSTRUCTIONS
--------------------
NONE


OTHERS
------
NONE