* * * READ ME * * *
* * * Veritas Volume Manager 6.0 * * *
* * * P-patch 1 * * *

Patch Date: 2012-01-17


This document provides the following information:

* PATCH NAME
* PACKAGES AFFECTED BY THE PATCH
* BASE PRODUCT VERSIONS FOR THE PATCH
* OPERATING SYSTEMS SUPPORTED BY THE PATCH
* INCIDENTS FIXED BY THE PATCH
* INSTALLATION PRE-REQUISITES
* INSTALLING THE PATCH
* REMOVING THE PATCH


PATCH NAME
----------
Veritas Volume Manager 6.0 P-patch 1


PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxvm


BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
* Veritas Storage Foundation for Oracle RAC 6.0
* Veritas Storage Foundation Cluster File System 6.0
* Veritas Storage Foundation 6.0
* Veritas Storage Foundation High Availability 6.0
* Veritas Dynamic Multi-Pathing 6.0
* Symantec VirtualStore 6.0
* Veritas Storage Foundation for Sybase ASE CE 6.0


OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
Solaris 10 SPARC


INCIDENTS FIXED BY THE PATCH
----------------------------
This patch fixes the following Symantec incidents:

Patch ID: 147853-01

* 2614438 (Tracking ID: 2507947)

SYMPTOM:
I/O verification on some of the volumes failed when array-side switch ports
were disabled and enabled for several iterations.

DESCRIPTION:
When array-side switch ports are disabled and enabled a few times while I/O
is in progress, some or all of the volumes may stop responding to I/O. This
can happen because failover is not initiated for all the required paths; all
the paths of certain dmpnodes may remain in the blocked state and I/Os get
stuck in DMP's defer queue.

RESOLUTION:
Changes were made in DMP to check for path failover when a PR command fails
with a transport failure, to update the current primary path from all places
where a path's pending I/O count is decremented, and to trigger DMP's defer
queue processing after the current primary path is updated. The Engenio APM
was also changed to reduce the SCSI command timeout value to 20 seconds.

* 2616853 (Tracking ID: 2612190)

SYMPTOM:
When the storage of a shared instant DCO (with SF 6.0) is disconnected and
two nodes join one after another, VxVM utilities or vxconfigd get stuck. The
following kernel stack is observed for vxconfigd on the master node:

vol_kmsg_send_wait+0x1fc
cvm_obj_sendmsg_wait+0x9c
volmv_fmr_wait_iodrain+0xb8
vol_mv_precommit+0x590
vol_commit_iolock_objects+0x104
vol_ktrans_commit+0x298
volsioctl_real+0x2ac
fop_ioctl+0x20
ioctl+0x184
syscall_trap32+0xcc

DESCRIPTION:
The error case is not handled when a joiner node tries to read a DCO volume
that has been marked as BADLOG. This leads to a protocol hang in which the
joiner stops responding, so the master node hangs indefinitely waiting for
the response from the joiner.

RESOLUTION:
The code path where the joiner tries to read a DCO that was marked as BADLOG
has been fixed.

* 2626744 (Tracking ID: 2626199)

SYMPTOM:
The "vxdmpadm list dmpnode" command shows the path-type value as
"primary/secondary" for a LUN in an Active-Active array, as below, when it is
supposed to be NULL:

dmpdev = c6t0d3
state = enabled
...
array-type = A/A
###path = name state type transport ctlr hwpath aportID aportWWN attr
path = c23t0d3 enabled(a) secondary FC c30 2/0/0/2/0/0/0.0x50060e8005c0bb00 - - -

DESCRIPTION:
For a LUN under an Active-Active array, the path-type value is supposed to be
NULL. In this specific case, other commands like
"vxdmpadm getsubpaths dmpnode=<>" were showing the correct (NULL) value for
path-type.
RESOLUTION: The "vxdmpadm list dmpnode" code path failed to initialize the path- type variable and by default set path-type to "primary or secondary" even for Active-Active array LUN's. This is fixed by initializing the path-type variable to NULL. * 2626913 (Tracking ID: 2605444) SYMPTOM: vxdmpadm disable/enable primary path (EFI labelled) in A/PF array results in all paths getting disabled DESCRIPTION: Enabling an EFI labeled primary path is disabling the secondary path. When the primary path is disabled, a failover occurs on to secondary path. The name of the secondary path under goes a change dropping the slice s2 from the name (cxtxdxs2 becomes cxtxdx). The change in the name is not updated in the device property list. This inability in updating the list causes disabling of the secondary path when the primary path is enabled. RESOLUTION: The code path which changes the name of the secondary path is rectified to update the property list. * 2630063 (Tracking ID: 2518625) SYMPTOM: Custer reconfiguration hang is seen when the following sequence of events occur in that order 1. instant DCO(with SF6.0 release) loses complete storage from master node while I/O is going on. 2. a new node joins the cluster immediately. 3. Current master leaves the cluster. When the issue is seen, the kernel thread "volcvm_vxreconfd_thread" on master shows the following stack. delay+0x90() cvm_await_mlocks+0x114() volmvcvm_cluster_reconfig_exit+0x94() volcvm_master+0x970() volcvm_vxreconfd_thread+0x580() thread_start+4() the kernel thread "volcvm_vxreconfd_thread" on slave has the following stack. vol_kmsg_send_wait+0x1fc() cvm_slave_req_rebuild_done+0x78() volmvcvm_cluster_reconfig_exit+0xb0() volcvm_master+0x970() volcvm_vxreconfd_thread+0x580() thread_start+4() The following output of vxdctl is seen on all nodes. # vxdctl -c mode mode: enabled: cluster active - SLAVE master: node1 reconfig: master update DESCRIPTION: After storage failure, the DCO is marked as BADLOG in kernel on all existing nodes and process is initiated to mark the same in volume manager configuration. If a node joins before this process completes, the joining node would not mark the DCO as BADLOG. Subsequently a protocol is initiated that depends on this flag but due to the above mentioned inconsistency, the protocol hangs. RESOLUTION: If DCO is marked BADLOG on other nodes, mark the same on joining node. * 2630088 (Tracking ID: 2583410) SYMPTOM: The diagnostic utility vxfmrmap is not supported with instant DCO of SF6.0 release DESCRIPTION: The diagnostic utility vxfmrmap errors out with not supported when used with instant DCO of SF6.0 release. RESOLUTION: Added support for vxfmrmap to understand and print maps used in instant DCO of SF6.0 release. * 2630106 (Tracking ID: 2612544) SYMPTOM: VxVM commands take more time to run after disk group import operation. DESCRIPTION: If any vxvm command is executed immediately after a disk group import, there will be an increase in time taken for command completion. This happens as VxVM tries to complete recovery operations on the disk group that is imported and vxconfigd daemon will be busy with those operations. This increase in time is noticeable if large number of disks are present in the disk group. Once the recovery related task is completed, vxvm commands will work normally. RESOLUTION: The fix was implemented to reduce the number of vxvm commands issued after disk group import. 
* 2630109 (Tracking ID: 2618217)

SYMPTOM:
Possible data corruption because a few I/Os are not tracked after detaching a
mirror volume added for a snapshot (linked volumes) in FMR4. Applications
such as Oracle complain about data corruption.

DESCRIPTION:
When a mirror volume added for snapshot purposes (linked volumes) is detached
due to I/O errors, VxVM starts tracking subsequent I/Os on the volume in
order to properly resynchronize the volume contents. In FMR4 this tracking
was not enabled atomically, and during that window a few I/Os were not
tracked. This led to data corruption after snapshot resynchronization
operations.

RESOLUTION:
Tracking of I/Os is now enabled atomically so that all I/Os are tracked
properly after a detach.

* 2630110 (Tracking ID: 2621521)

SYMPTOM:
Break-off mirror snapshot volumes using the SF 6.0 based instant DCO could be
corrupted when any snapshot operation is done subsequently.

DESCRIPTION:
For volumes with the SF 6.0 based instant DCO, the snapshot maps are updated
from DRL logs. Volume Manager maintains the list of updates that are still to
be used for snapshot map updates. Due to a code issue, some of the valid DRL
updates were missed during snapshot map updates. In such cases the snapshot
maps become inconsistent, and a further snapshot operation causes corruption
in the volumes involved.

RESOLUTION:
The issue has been fixed by making sure that the list is updated atomically,
so that no intermediate state is visible to other I/Os performing DRL.

* 2630114 (Tracking ID: 2622147)

SYMPTOM:
The system hangs when load peaks abruptly on a volume with the SF 6.0 based
instant DCO.

DESCRIPTION:
When I/O load peaks abruptly on a volume with the instant DCO in SF 6.0, DRL
logging can get stuck. I/Os are resumed on the volume when DRL is done for
the whole batch of I/Os. When the load is such that DRL cannot be done for
the whole batch in a single pass, it is split into iterations. Due to a code
issue, the partitioning of DRL updates caused it to loop, updating the same
set of changes over and over on disk.

RESOLUTION:
DRL updates are no longer partitioned. They are either skipped when not
enough free memory is available, or done in a single batch.

* 2630115 (Tracking ID: 2627126)

SYMPTOM:
An I/O hang is observed on the system as many I/Os are stuck in the DMP
global queue.

DESCRIPTION:
Many I/Os and paths are stuck in dmp_delayq and dmp_path_delayq respectively,
and the DMP daemon cannot process them because of a race condition between
processing the dmp_delayq and waking up the DMP daemon. A lock is held while
processing the dmp_delayq and is released only for a very short duration. If
any path is busy during this duration, it returns an I/O error, leading to an
I/O hang.

RESOLUTION:
The global delay queue pointers are copied to local variables and the lock is
held only for this period; the I/Os in the queue are then processed using the
local queue variable. (A minimal sketch of this pattern appears after
incident 2643609 below.)

* 2630132 (Tracking ID: 2613524)

SYMPTOM:
I/O failure is observed on a NetApp array when configured in Asymmetric
Logical Unit Access (ALUA) mode.

DESCRIPTION:
The I/O failure was identified as a corner-case scenario hit during NetApp
pNATE longevity tests.

RESOLUTION:
Fixes were added in the Dynamic Multi-Pathing component to avoid the I/O
failure.

* 2630136 (Tracking ID: 2612470)

SYMPTOM:
Volume recovery and plex reattach using the vxrecover command leave some
plexes in the DETACHED state.

DESCRIPTION:
The vxrecover utility collects all volumes that need recovery or a plex
reattach and creates tasks for these.
In a specific corner case, if the first volume picked up is small enough that
its recovery completes within 5 seconds, any plex reattaches required for
this volume are skipped due to a bug in vxrecover.

RESOLUTION:
The specific corner case is identified in the vxrecover utility, and for such
a volume, if any plex needs to be reattached, the required task is created
before moving to the next volume.

WORKAROUND:
Re-run vxrecover manually. For shared disk groups, run this on the CVM master
node or use the -c option.

* 2637183 (Tracking ID: 2647795)

SYMPTOM:
With the SmartMove feature enabled, data corruption is seen on the file
system while moving a subdisk, because the subdisk contents are not copied
properly.

DESCRIPTION:
With the FS SmartMove feature enabled, the subdisk move operation queries
VxFS for the status of each region before deciding whether to synchronize it.
While getting the information about multiple such regions in one ioctl to
VxFS, if the start offset is not aligned to the region size, one I/O can span
two regions. VxVM was not properly checking the status of such regions and
skipped the synchronization of the region, causing data corruption.

RESOLUTION:
Code changes were made to properly check the region state even when the range
spans two bits in the FSMAP. (See the illustrative sketch after incident
2643609 below.)

* 2642760 (Tracking ID: 2637828)

SYMPTOM:
A system panic is seen in the vol_adminsio_done() function while running
operations like vxplex att/vxsd mv/vxsnap addmir. The kernel stack is given
below:

vol_adminsio_done+0x44b
voliod_iohandle+0xf2
voliod_loop+0x11ba
child_rip+0xa

DESCRIPTION:
Utilities like 'vxtask monitor' are signaled after a task update in the
kernel. Due to a bug, the structures used in signaling are freed first,
causing a NULL pointer dereference that leads to the panic.

RESOLUTION:
The code is rearranged so that cleanup does not happen before signaling is
done, avoiding this race condition. The bug is specific to the new
enhancement done in the adminio deprioritization feature and does not apply
to older releases.

* 2643609 (Tracking ID: 2553729)

SYMPTOM:
The following symptoms can be seen during an upgrade of VxVM (Veritas Volume
Manager):

i) The 'clone_disk' flag is seen on non-clone disks in the STATUS field when
'vxdisk -e list' is executed after an upgrade to 5.1SP1 from lower versions
of VxVM. For example:

DEVICE      TYPE           DISK        GROUP     STATUS
emc0_0054   auto:cdsdisk   emc0_0054   50MP3dg   online clone_disk
emc0_0055   auto:cdsdisk   emc0_0055   50MP3dg   online clone_disk

ii) Clone disk groups (dg) whose versions are less than 140 do not get
imported after an upgrade to VxVM versions 5.0MP3RP5HF1 or 5.1SP1RP2. For
example:

# vxdg -C import
VxVM vxdg ERROR V-5-1-10978 Disk group : import failed:
Disk group version doesn't support feature; see the vxdg upgrade command

DESCRIPTION:
While upgrading VxVM:

i) After an upgrade to 5.1SP1 or higher versions: if a dg created on lower
versions is deported and imported back on 5.1SP1 after the upgrade, the
"clone_disk" flag gets set on non-cloned disks because of the design change
in the UDID (unique disk identifier) of the disks.

ii) After an upgrade to 5.0MP3RP5HF1 or 5.1SP1RP2: import of a clone dg with
a version less than 140 fails.

RESOLUTION:
Code changes are made to ensure that:

i) The clone_disk flag does not get set for non-clone disks after the
upgrade.

ii) Clone disk groups with versions less than 140 get imported after the
upgrade.
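Incident 2630115 above describes a classic queue-drain fix: instead of
holding the lock for the whole drain, the shared queue head is detached into
a local variable under the lock and the entries are processed afterwards with
the lock released. The sketch below uses plain pthreads and hypothetical
names (it is not the actual DMP source) to show the pattern:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical deferred-I/O entry, loosely modelled on a delay queue. */
    struct deferred_io {
        struct deferred_io *next;
        int id;
    };

    static struct deferred_io *delayq_head;   /* shared global queue */
    static pthread_mutex_t delayq_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Producer side: a failed I/O is queued for the daemon to retry. */
    static void delayq_add(struct deferred_io *io)
    {
        pthread_mutex_lock(&delayq_lock);
        io->next = delayq_head;
        delayq_head = io;
        pthread_mutex_unlock(&delayq_lock);
    }

    /* Daemon side: detach the whole queue under the lock, then process the
     * local copy with the lock released, so the lock is held only briefly
     * and new arrivals never race with the drain. */
    static void delayq_process(void)
    {
        struct deferred_io *local;

        pthread_mutex_lock(&delayq_lock);
        local = delayq_head;                  /* copy the global pointer */
        delayq_head = NULL;                   /* empty the shared queue  */
        pthread_mutex_unlock(&delayq_lock);

        while (local != NULL) {
            struct deferred_io *io = local;
            local = local->next;
            printf("retrying deferred I/O %d\n", io->id);  /* resubmit */
        }
    }

    int main(void)
    {
        struct deferred_io a = { NULL, 1 }, b = { NULL, 2 };

        delayq_add(&a);
        delayq_add(&b);
        delayq_process();
        return 0;
    }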
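Incident 2637183 above comes down to range-to-bitmap rounding: when a queried
extent is not aligned to the region size it can overlap two bitmap bits, and
every overlapped region must be checked before a copy is skipped. The short C
sketch below uses a hypothetical map layout (it is not the actual VxFS FSMAP
format) to show the check:

    #include <stdint.h>
    #include <stdio.h>

    #define REGION_SIZE (64 * 1024)           /* illustrative region size */

    /* Returns non-zero if region 'r' is marked in-use in the bitmap. */
    static int region_in_use(const uint8_t *map, uint64_t r)
    {
        return (map[r / 8] >> (r % 8)) & 1;
    }

    /* A range may only be skipped (not copied) if EVERY region it overlaps
     * is unused. Using the last byte of the range covers ranges whose start
     * is not aligned to REGION_SIZE and which therefore span two regions. */
    static int range_can_be_skipped(const uint8_t *map,
                                    uint64_t off, uint64_t len)
    {
        uint64_t first = off / REGION_SIZE;
        uint64_t last  = (off + len - 1) / REGION_SIZE;   /* inclusive */

        for (uint64_t r = first; r <= last; r++)
            if (region_in_use(map, r))
                return 0;              /* an overlapped region is in use */
        return 1;
    }

    int main(void)
    {
        uint8_t map[2] = { 0x02, 0x00 };      /* only region 1 is in use */

        /* An unaligned 8 KiB range starting 4 KiB before the region-1
         * boundary overlaps regions 0 and 1, so it must not be skipped. */
        printf("skip? %d\n",
               range_can_be_skipped(map, REGION_SIZE - 4096, 8192));
        return 0;
    }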
* 2643647 (Tracking ID: 2643634)

SYMPTOM:
If standard (non-clone) disks and cloned disks of the same disk group are
seen on a host, the dg import fails with the following error message when the
standard (non-clone) disks have no enabled configuration copy of the disk
group:

# vxdg import
VxVM vxdg ERROR V-5-1-10978 Disk group : import failed:
Disk group has no valid configuration copies

DESCRIPTION:
When VxVM imports such a mixed configuration of standard (non-clone) disks
and cloned disks, the standard (non-clone) disks are selected as the members
of the disk group in 5.0MP3RP5HF1 and 5.1SP1RP2. This selection happens
without administrators being aware that there is a mixed configuration and
that the standard (non-clone) disks will be selected for the import. The
cause is hard to determine from the error message and takes time to
investigate.

RESOLUTION:
Syslog message enhancements are made in the code so that administrators can
determine whether such a mixed configuration is present on a host and which
disks are selected for the import.

* 2644354 (Tracking ID: 2647600)

SYMPTOM:
A system panic occurs with the following stack trace because of a NULL
pointer dereference:

strlen()
vol_do_prnt()
vol_cmn_err()
vol_rp_ack_timeout()
vol_rp_search_queues()
nmcom_server_proc()
nmcom_server_proc_enter()
vxvm_start_thread_enter()

DESCRIPTION:
Analysis of the panic shows it is caused by a missing entry in
volrv_msgs_names[]. The array has 12 entries, but the entry for
VOLRV_MSG_PRIM_SEQ is missing. The ack timeout happened when a
VOLRV_MSG_PRIM_SEQ message was sent.

RESOLUTION:
An entry was added: volrv_msgs_names[VOLRV_MSG_PRIM_SEQ] = { "prim_seq" }.


INSTALLING THE PATCH
--------------------
o Before the upgrade:
  (a) Stop I/Os to all the VxVM volumes.
  (b) Unmount any file systems on VxVM volumes.
  (c) Stop applications using any VxVM volumes.

For the Solaris 10 release, refer to the man pages for instructions on using
the 'patchadd' and 'patchrm' scripts provided with Solaris. Any other special
or non-generic installation instructions should be described below as special
instructions.

The following example installs the patch on a standalone machine:

example# patchadd 147853-01


REMOVING THE PATCH
------------------
The following example removes the patch from a standalone system:

example# patchrm 147853-01


SPECIAL INSTRUCTIONS
--------------------
You need to use the shutdown command to reboot the system after patch
installation or de-installation:

shutdown -g0 -y -i6

A Solaris 10 issue may prevent this patch from installing completely. Before
installing this VM patch, install the Solaris patch 119254-70 (or a later
revision). This Solaris patch fixes the packaging, installation and patch
utilities. [Sun Bug ID 6337009]

Download Solaris 10 patch 119254-70 (or later) from Sun at
http://sunsolve.sun.com


OTHERS
------
NONE