README VERSION : 1.0
README Creation Date : 2011-09-22
Patch-ID : 5.1.132.000
Patch Name : VRTSvxvm 5.1SP1RP2
BASE PACKAGE NAME : VRTSvxvm
BASE PACKAGE VERSION : 5.1.100.000
Obsolete Patches : NONE
Superseded Patches : 5.1.101.000
Required Patches : NONE
Incompatible Patches : NONE
Supported PADV : rhel5_x86_64, rhel6_x86_64, sles10_x86_64, sles11_x86_64
(P-Platform, A-Architecture, D-Distribution, V-Version)
Patch Category : CORE, CORRUPTION, HANG, MEMORYLEAK, PANIC, PERFORMANCE
Reboot Required : YES

Patch Installation Instructions:
--------------------------------
Please refer to the guide for installation instructions.

Patch Uninstallation Instructions:
----------------------------------
Please refer to the guide for uninstallation instructions.

Special Install Instructions:
-----------------------------
NONE

KNOWN ISSUES :

FIXED INCIDENTS:
----------------
Patch Id::5.1.132.000

* Incident no::2163809 Tracking ID ::2151894
Symptom::The internal testing utility volassert prints the message:
Volume TCv1-548914: recover_offset=0, expected 1024
Description::In an earlier incident the behavior of recover_offset was changed to reset it to zero after starting a volume. This works for normal volumes but not for RAID-5 volumes.
Resolution::The recover offset is now set to the end of the volume after grow/init operations.

* Incident no::2169348 Tracking ID ::2094672
Symptom::The master node hangs with a large number of outstanding I/Os during a node reconfiguration triggered by a node leave.
Description::The reconfiguration is stuck because the I/O is not drained completely. The master node is responsible for handling the I/O for both the primary and the slave nodes. When a slave node dies, the pending slave I/O on the master node is not cleaned up properly, which leaves some I/Os in the queue undeleted.
Resolution::The pending I/O is now cleaned up during node-failure and reconfiguration scenarios.

* Incident no::2198041 Tracking ID ::2196918
Symptom::When creating a space-optimized snapshot and specifying the cache-object size either as a percentage of the volume size or as an absolute size, the snapshot creation can fail with an error similar to the following:
"VxVM vxassist ERROR V-5-1-10127 creating volume snap-dvol2-CV01: Volume or log length violates disk group alignment"
Description::VxVM expects all virtual storage objects to have a size aligned to a value that is set diskgroup-wide. This value can be obtained with:
# vxdg list testdg|grep alignment
alignment: 8192 (bytes)
When the cache size is specified as a percentage, the computed value might not be aligned to the diskgroup alignment. If it is not aligned, creation of the cache volume can fail with the error message shown above.
Resolution::After computing the cache size from the specified percentage, it is aligned up to the diskgroup alignment value before the cache volume is created.
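The alignment fix above amounts to rounding the computed cache size up to the next multiple of the diskgroup alignment. The following is a minimal illustrative sketch in C, not the actual VxVM source; the function and variable names are hypothetical:

    /* Round a computed cache size up to the diskgroup alignment
     * (the value reported by "vxdg list <dg>") before creating the
     * cache volume.  Sizes are in bytes. */
    static unsigned long long
    align_up(unsigned long long cache_size, unsigned long long dg_align)
    {
        if (dg_align == 0)
            return cache_size;               /* nothing to align to */
        return ((cache_size + dg_align - 1) / dg_align) * dg_align;
    }

For example, a percentage-derived cache size of 1,000,000 bytes on a diskgroup with 8192-byte alignment would be rounded up to 1,007,616 bytes (123 x 8192).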
* Incident no::2204146 Tracking ID ::2200670
Symptom::Some disks are left detached and are not recovered by vxattachd.
Description::If the shared disk group is not imported or the node is not part of the cluster when storage connectivity to the failed node is restored, the vxattachd daemon is not notified of the connectivity restoration and does not trigger a reattach. Even if the disk group is later imported or the node joins the CVM cluster, the disks are not automatically reattached.
Resolution::i) Missing events for a deported diskgroup: the fix listens for the import event of the diskgroup and triggers a brute-force recovery for that specific diskgroup.
ii) Parallel recovery of volumes from the same disk: vxrecover now automatically serializes the recovery of objects that reside on the same disk to avoid back-and-forth head movements. An option has also been added to vxattachd and vxrecover to control the number of parallel recoveries allowed for objects from the same disk.

* Incident no::2205859 Tracking ID ::2196480
Symptom::Initialization of the VxVM cdsdisk layout fails on a disk of size less than 1 TB.
Description::The disk geometry is derived to fabricate the cdsdisk label during initialization of the VxVM cdsdisk layout on a disk of size less than 1 TB. The derived geometry was violating one of the following requirements:
(1) The cylinder size is aligned with 8 KB.
(2) The number of cylinders is less than 2^16.
(3) The last sector of the device is not included in the last cylinder.
(4) The number of heads is less than 2^16.
(5) The track size is less than 2^16.
Resolution::The issue has been resolved by making sure that the disk geometry used to fabricate the cdsdisk label satisfies all five requirements described above.

* Incident no::2211971 Tracking ID ::2190020
Symptom::Under heavy I/O load, dmp_daemon requests about 1 MB of contiguous memory, which in turn slows down the system due to continuous page swapping.
Description::dmp_daemon recalculates statistical information periodically (every 1 second by default). When the I/O load is high, the I/O statistics buffer allocation code path dynamically allocates roughly 1 MB of contiguous memory per CPU.
Resolution::To avoid repeated memory allocation/free calls in every DMP I/O statistics daemon interval, a two-buffer strategy was implemented for storing DMP statistics records. Two buffers of the same size are allocated at the beginning; one buffer is used for writing active records while the other is read by the I/O statistics daemon. The two buffers are swapped every statistics daemon interval.

* Incident no::2214184 Tracking ID ::2202710
Symptom::Transactions on an rlink are not allowed while an SRL to DCM flush is in progress.
Description::The existing implementation does not allow an rlink transaction to proceed if an SRL to DCM flush is in progress. When the SRL overflows, VVR starts reading from the SRL and marks the dirty regions in the corresponding DCMs of the data volumes; this is called an SRL to DCM flush. During this flush, transactions on the rlink are not allowed. The time needed to complete the flush depends on the SRL size and can range from minutes to many hours. If the user initiates any transaction on the rlink, it hangs until the SRL flush completes.
Resolution::The code behavior is changed to allow rlink transactions during an SRL flush. The fix stops the SRL flush to let the transaction go ahead and restarts the flush after the transaction completes.

* Incident no::2220064 Tracking ID ::2228531
Symptom::Vradmind hangs in vol_klog_lock() on the VVR (Veritas Volume Replicator) Secondary site. The stack trace might look like:
genunix:cv_wait+0x38()
vxio:vol_klog_lock+0x5c()
vxio:vol_mv_close+0xc0()
vxio:vol_close_object+0x30()
vxio:vol_object_ioctl+0x198()
vxio:voliod_ioctl()
vxio:volsioctl_real+0x2d4()
specfs:spec_ioctl()
genunix:fop_ioctl+0x20()
genunix:ioctl+0x184()
unix:syscall_trap32+0xcc()
Description::In this scenario a flag value should be set so that vradmind is signalled and woken up. Because the flag value is not set, vradmind sleeps indefinitely. A race condition between setting and resetting the flag values results in the hang.
Resolution::Code changes are made to hold a lock to avoid the race condition between setting and resetting of the flag values.
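The class of fix described for incident 2220064 is to protect the wakeup flag with a lock so that the setter and the waiter cannot race. Below is a minimal user-space analogue in C with hypothetical names (the in-kernel code uses the corresponding kernel primitives), offered only as an illustration of the pattern:

    #include <pthread.h>

    static pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flag_cv   = PTHREAD_COND_INITIALIZER;
    static int wakeup_pending;            /* the flag that was racing */

    void signal_waiter(void)
    {
        pthread_mutex_lock(&flag_lock);
        wakeup_pending = 1;               /* set under the lock ...   */
        pthread_cond_signal(&flag_cv);
        pthread_mutex_unlock(&flag_lock);
    }

    void wait_for_wakeup(void)
    {
        pthread_mutex_lock(&flag_lock);
        while (!wakeup_pending)           /* ... and tested under it  */
            pthread_cond_wait(&flag_cv, &flag_lock);
        wakeup_pending = 0;
        pthread_mutex_unlock(&flag_lock);
    }

Because the flag is only set, tested and cleared while the lock is held, the wakeup can no longer be lost between the test and the sleep.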
* Incident no::2223659 Tracking ID ::2218470
Symptom::When "service --status-all" or "service vxvm-recover status" is executed, duplicate instances of the Veritas Volume Manager (VxVM) daemons (vxattachd, vxcached, vxrelocd, vxvvrsecdgd and vxconfigbackupd) are invoked.
Description::The VxVM startup scripts for Linux are not Linux Standard Base (LSB) compliant. Hence, execution of the following command results in starting a new instance:
# vxvm-recover status
Resolution::The VxVM startup scripts are modified to be LSB compliant.

* Incident no::2232057 Tracking ID ::2230377
Symptom::Differences-based synchronization (diff sync) fails for volume/RVG sizes greater than 1 TB.
Description::The diff sync fails to calculate the correct RVG/volume size for objects over 1 TB, causing the diff sync to loop or fail.
Resolution::Hotfix binaries of vxrsyncd have been generated with the code fixes.

* Incident no::2232829 Tracking ID ::2232789
Symptom::With NetApp metro cluster disk arrays, takeover operations (toggling of LUN ownership within the NetApp filer) can lead to I/O failures on VxVM volumes. Example of an I/O error message at the VxVM level:
VxVM vxio V-5-0-2 Subdisk disk_36-03 block 24928: Uncorrectable write error
Description::During the takeover operation, the array fails the PGR and I/O SCSI commands on the secondary paths with transient error codes - 0x02/0x04/0x0a (NOT READY/LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) or 0x02/0x04/0x01 (NOT READY/LOGICAL UNIT IS IN PROCESS OF BECOMING READY) - which were not handled properly within VxVM.
Resolution::The required code logic was added to the APM so that SCSI commands failing with these transient errors are retried for the duration of the NetApp filer reconfiguration time (60 seconds) before the I/Os are failed on the VxVM volumes.

* Incident no::2234292 Tracking ID ::2152830
Symptom::Storage administrators sometimes create multiple copies/clones of the same device. Diskgroup import fails with a non-descriptive error message when multiple copies (clones) of the same device exist and the original device(s) are either offline or not available.
# vxdg import mydg
VxVM vxdg ERROR V-5-1-10978 Disk group mydg: import failed: No valid disk found containing disk group
Description::If the original devices are offline or unavailable, vxdg import picks up cloned disks for the import. The DG import fails by design unless the clones are tagged and the tag is specified during the DG import. While the import failure is expected, the error message is non-descriptive and does not suggest any corrective action to the user.
Resolution::A fix has been added to give a correct error message when duplicate clones exist during import. Details of the duplicate clones are also reported in the syslog. Example:
[At CLI level]
# vxdg import testdg
VxVM vxdg ERROR V-5-1-10978 Disk group testdg: import failed: DG import duplcate clone detected
[In syslog]
vxvm:vxconfigd: warning V-5-1-0 Disk Group import failed: Duplicate clone disks are detected, please follow the vxdg (1M) man page to import disk group with duplicate clone disks. Duplicate clone disks are:
c2t20210002AC00065Bd0s2 : c2t50060E800563D204d1s2
c2t50060E800563D204d0s2 : c2t50060E800563D204d1s2

* Incident no::2234882 Tracking ID ::2234821
Symptom::The host fails to re-enable a device after the device comes back online.
Description::When the device goes offline, the OS device is deleted, and when the same device comes back online with a new device name, DMP fails to re-enable the old dmpnodes.
The code enhancements made this work on SLES11; however, the same changes failed on RHEL5 because the udev environment variable {DEVTYPE} and $name were not being set correctly on RHEL5.
Resolution::The resolution is to remove ENV{DEVTYPE}=="disk" and replace $name with %k; this is certified to work on SLES11 and RHEL5.

* Incident no::2241149 Tracking ID ::2240056
Symptom::'vxdg move/split/join' may fail during high I/O load.
Description::During heavy I/O load a 'dg move' transaction may fail because of an open/close assertion, and a retry is done. Because the retry limit is set to 30, 'dg move' fails when the retry count hits the limit.
Resolution::The default transaction retry is changed to unlimited, and a new option is introduced for 'vxdg move/split/join' to set the transaction retry limit, as follows:
vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] move src_diskgroup dst_diskgroup objects ...
vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] split src_diskgroup dst_diskgroup objects ...
vxdg [-f] [-o verify|override] [-o transretry=retrylimit] join src_diskgroup dst_diskgroup

* Incident no::2248354 Tracking ID ::2245121
Symptom::Rlinks do not connect for NAT (Network Address Translation) configurations.
Description::When VVR (Veritas Volume Replicator) is replicating over a Network Address Translation (NAT) based firewall, rlinks fail to connect, resulting in replication failure. The rlinks do not connect because the exchange of VVR heartbeats fails. For NAT-based firewalls, the conversion of a mapped IPv6 (Internet Protocol version 6) address to an IPv4 (Internet Protocol version 4) address was not handled, which caused the VVR heartbeat exchange to use an incorrect IP address and the heartbeat to fail.
Resolution::Code fixes have been made to appropriately handle the exchange of VVR heartbeats behind a NAT-based firewall.

* Incident no::2253269 Tracking ID ::2263317
Symptom::The vxdg(1M) man page does not clearly describe the diskgroup import and destroy operations for the case in which the original diskgroup is destroyed and cloned disks are present.
Description::Diskgroup import with a dgid is considered a recovery operation. Therefore, while importing with a dgid, even though the original diskgroup has been destroyed, both the original and the cloned disks are considered available disks, and the original diskgroup is imported in such a scenario. The existing vxdg(1M) man page did not clearly describe this behavior.
Resolution::The vxdg(1M) man page has been modified to clearly describe the scenario.

* Incident no::2256728 Tracking ID ::2248730
Symptom::A command hangs if "vxdg import" is called from a script with STDERR redirected.
Description::If a script runs "vxdg import" with STDERR redirected, the script does not finish until the DG import and recovery are finished. The pipe between the script and vxrecover is not closed properly, which keeps the calling script waiting for vxrecover to complete.
Resolution::STDERR is closed in vxrecover and the output is redirected to /dev/console.
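The hang described for incident 2256728 is the classic case of a background process inheriting the write end of the caller's pipe: the script reading that pipe does not see end-of-file until every holder of the descriptor closes it. The following is a small, hedged C illustration of the kind of change described above (detach stderr from the inherited pipe and send it to the console); the function name is hypothetical and the exact mechanism in vxrecover may differ:

    #include <fcntl.h>
    #include <unistd.h>

    /* Re-point stderr at /dev/console so the caller's redirected
     * pipe is released and the calling script can continue. */
    static void detach_stderr(void)
    {
        int fd = open("/dev/console", O_WRONLY);
        if (fd < 0) {
            close(STDERR_FILENO);          /* at least drop the pipe */
            return;
        }
        dup2(fd, STDERR_FILENO);
        if (fd != STDERR_FILENO)
            close(fd);
    }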
* Incident no::2257706 Tracking ID ::2257678
Symptom::When running the vxinstall command to install VxVM on a Linux/Solaris system with the root disk on an LVM volume, the following error is reported:
# vxinstall
...
The system is encapsulated. Reinstalling the Volume Manager at this stage could leave your system unusable.
Please un-encapsulate before continuing with the reinstallation.
Cannot continue further.
#
Description::The vxinstall script checks whether the root device of the system is encapsulated (under VxVM control). The check was incorrectly coded, which led to LVM volumes also being detected as VxVM volumes. This false positive prevented vxinstall from proceeding and produced the error message shown above.
Resolution::The code was modified so that LVM volumes with "rootvol" in their name are not detected as VxVM encapsulated volumes.

* Incident no::2272956 Tracking ID ::2144775
Symptom::The failover_policy attribute is not persistent across reboots.
Description::The failover_policy attribute was not implemented to be persistent across reboots. Hence, on every reboot failover_policy switched back to the default.
Resolution::Code changes have been made so that failover_policy attribute settings persist across reboots.

* Incident no::2276958 Tracking ID ::2205108
Symptom::On VxVM 5.1SP1 or later, device discovery operations such as vxdctl enable, vxdisk scandisks and vxconfigd -k fail to claim new disks correctly. For example, if the user provisions five new disks, VxVM, instead of creating five different Dynamic Multi-Pathing (DMP) nodes, creates only one and includes the rest as its paths. The following message is also displayed on the console when this problem occurs:
NOTICE: VxVM vxdmp V-5-0-34 added disk array , datype =
Note that the cabinet serial number following "disk array" and the value of "datype" are not printed in the above message.
Description::VxVM's DDL (Device Discovery Layer) is responsible for appropriately claiming newly provisioned disks. Due to a bug in one of the routines within this layer, although the disks are claimed, their LSN (LUN Serial Number, a unique identifier of disks) is ignored, so every disk is wrongly categorized under a single DMP node.
Resolution::The problematic code within the DDL has been modified so that new disks are claimed appropriately.
WORKAROUND: If vxconfigd does not hang or dump core with this issue, a reboot can be a workaround to recover from this situation. Alternatively, break up and rebuild the DMP/DDL database on the devices with the following steps:
# vxddladm excludearray all
# mv /etc/vx/jbod.info /etc/vx/jbod.info.org
# vxddladm disablescsi3
# devfsadm -Cv
# vxconfigd -k
# vxddladm includearray all
# mv /etc/vx/jbod.info.org /etc/vx/jbod.info
# vxddladm enablescsi3
# rm /etc/vx/disk.info /etc/vx/array.info
# vxconfigd -k

* Incident no::2299691 Tracking ID ::2299670
Symptom::VxVM disk groups created on EFI (Extensible Firmware Interface) LUNs do not get auto-imported during system boot with VxVM version 5.1SP1 and later.
Description::While determining the disk format of EFI LUNs, the stat() system call on the corresponding DMP devices fails with an ENOENT ("No such file or directory") error because the DMP device nodes are not created in the root file system during system boot. This leads to failure of the auto-import of disk groups created on EFI LUNs.
Resolution::The VxVM code is modified to use the OS raw device nodes if stat() fails on the DMP device nodes.

* Incident no::2316309 Tracking ID ::2316297
Symptom::The following error messages are printed on the console every time the system boots:
VxVM vxdisk ERROR V-5-1-534 Device [DEVICE NAME]: Device is in use
Description::During system boot, while Volume Manager diskgroups are imported, the vxattachd daemon tries to online the disks. Since a disk may already be online, the attempt to re-online it gives the error message shown above:
VxVM vxdisk ERROR V-5-1-534 Device [DEVICE NAME]: Device is in use
Resolution::The solution is to check whether the disk is already in the "online" state and, if so, to skip the re-online.
* Incident no::2323999 Tracking ID ::2323925
Symptom::If the root disk is under VxVM control and the /etc/vx/reconfig.d/state.d/install-db file exists, the following messages are observed on the console:
UX:vxfs fsck: ERROR: V-3-25742: /dev/vx/dsk/rootdg/homevol:sanity check failed: cannot open /dev/vx/dsk/rootdg/homevol: No such device or address
UX:vxfs fsck: ERROR: V-3-25742: /dev/vx/dsk/rootdg/optvol:sanity check failed: cannot open /dev/vx/dsk/rootdg/optvol: No such device or address
Description::The vxvm-startup script checks for the /etc/vx/reconfig.d/state.d/install-db file. If the install-db file exists on the system, VxVM assumes that the Volume Manager is not configured and does not start the volume configuration daemon "vxconfigd". If the "install-db" file somehow exists on a VxVM rootable system, this causes the failure.
Resolution::If the install-db file exists on the system and the system is VxVM rootable, the following warning message is displayed on the console:
"This is a VxVM rootable system. Volume configuration daemon could not be started due to the presence of /etc/vx/reconfig.d/state.d/install-db file. Remove the install-db file to proceed"

* Incident no::2328219 Tracking ID ::2253552
Symptom::vxconfigd leaks memory while reading the default tunables related to smartmove (a VxVM feature).
Description::In vxconfigd, memory allocated for the default tunables related to the smartmove feature was not freed, causing a memory leak.
Resolution::The memory is released after its scope is over.

* Incident no::2337091 Tracking ID ::2255182
Symptom::If EMC CLARiiON arrays are configured with a different failovermode for each host controller (for example, one HBA has failovermode set to 1 while the other is set to 2), VxVM's vxconfigd daemon dumps core.
Description::DDL (VxVM's Device Discovery Layer) determines the array type depending on the failovermode setting. DDL expects the same array type to be returned across all the paths going to that array. This fundamental assumption of DDL is broken by differing failovermode settings, leading to the vxconfigd core dump.
Resolution::Validation code is added to DDL to detect such configurations, emit appropriate warning messages so that the user can take corrective action, and skip the later set of paths that report a different array type.

* Incident no::2349653 Tracking ID ::2349352
Symptom::Data corruption is observed on a DMP device with a single path during storage reconfiguration (LUN addition/removal).
Description::Data corruption can occur in the following configuration when new LUNs are provisioned or removed under VxVM while applications are online:
1. The DMP device naming scheme is EBN (enclosure based naming) and persistence=no.
2. The DMP device is configured with a single path, or the devices are controlled by a third-party multipathing driver (for example MPXIO or MPIO).
There is a possibility of a change in the names of the VxVM devices (DA records) when LUNs are removed or added and the following commands are then run, since persistent naming is turned off:
(a) vxdctl enable
(b) vxdisk scandisks
Execution of the above commands discovers all the devices and rebuilds the device attribute list with new DMP device names. The VxVM device records are then updated with these new attributes. Due to a bug in the code, the VxVM device records were mapped to the wrong DMP devices. Example:
The following are the devices before adding new LUNs:
sun6130_0_16 auto - - nolabel
sun6130_0_17 auto - - nolabel
sun6130_0_18 auto:cdsdisk disk_0 prod_SC32 online nohotuse
sun6130_0_19 auto:cdsdisk disk_1 prod_SC32 online nohotuse
The following are the devices after adding new LUNs:
sun6130_0_16 auto - - nolabel
sun6130_0_17 auto - - nolabel
sun6130_0_18 auto - - nolabel
sun6130_0_19 auto - - nolabel
sun6130_0_20 auto:cdsdisk disk_0 prod_SC32 online nohotuse
sun6130_0_21 auto:cdsdisk disk_1 prod_SC32 online nohotuse
The name of the VxVM device sun6130_0_18 has changed to sun6130_0_20.
Resolution::The code that updates the VxVM device records has been rectified.

* Incident no::2353325 Tracking ID ::1791397
Symptom::Replication does not start if an rlink detach and attach is done just after an SRL overflow.
Description::As the SRL overflows, VVR starts flushing writes from the SRL to the DCM (Data Change Map). If the rlink is detached before the complete SRL is flushed to the DCM, the rlink is left in the SRL flushing state. Because of this flushing state, attaching the rlink again does not start replication. The problem lies in the way the rlink flushing state is interpreted.
Resolution::The logic has been changed to correctly interpret the rlink flushing state.

* Incident no::2353327 Tracking ID ::2179259
Symptom::When using disks of size greater than 2 TB, if the disk encounters a media error at an offset beyond 2 TB while the disk still responds to SCSI inquiry, data corruption can occur in the case of a write operation.
Description::The I/O retry logic in DMP assumed that the I/O offset is within the 2 TB limit. Hence, when a disk larger than 2 TB encounters a media error at an offset beyond 2 TB while still responding to SCSI inquiry, the I/O would be reissued at a wrong offset within the 2 TB range, causing data corruption in the case of write I/Os.
Resolution::The I/O retry mechanism has been changed to work with offsets beyond 2 TB as well, so that no offset truncation happens that could lead to data corruption.

* Incident no::2353328 Tracking ID ::2194685
Symptom::vxconfigd dumps core in a scenario where array-side ports are disabled and enabled in a loop for several iterations.
(gdb) where
#0 0x081ca70b in ddl_delete_node ()
#1 0x081cae67 in ddl_check_migration_of_devices ()
#2 0x081d0512 in ddl_reconfigure_all ()
#3 0x0819b6d5 in ddl_find_devices_in_system ()
#4 0x0813c570 in find_devices_in_system ()
#5 0x0813c7da in mode_set ()
#6 0x0807f0ca in setup_mode ()
#7 0x0807fa5d in startup ()
#8 0x08080da6 in main ()
Description::When the array-side ports are disabled, the secondary paths get removed. The primary paths then reuse the devno of the removed secondary paths, which was not handled correctly in the migration code. Because of this, the DMP database gets corrupted and subsequent discoveries lead to the vxconfigd core dump.
Resolution::The issue was due to an incorrectly set DMP flag. The flag setting has been fixed to prevent the DMP database from getting corrupted in the mentioned scenario.

* Incident no::2353403 Tracking ID ::2337694
Symptom::"vxdisk -o thin list" displays the size as 0 for thin LUNs with a capacity of 2 TB or greater.
Description::The SCSI READ CAPACITY ioctl is invoked to get the disk capacity. SCSI READ CAPACITY returns data in the extended data format if the disk capacity is 2 TB or greater. This extended data was parsed incorrectly while calculating the disk capacity.
Resolution::The issue has been resolved by properly parsing the extended data returned by the SCSI READ CAPACITY ioctl for disks of 2 TB or greater.
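For reference, the extended READ CAPACITY data mentioned in incident 2353403 (the 16-byte-CDB form of the command) carries the last logical block address as an 8-byte big-endian value followed by a 4-byte block length. The sketch below is an illustrative C decoder for such a buffer, not the VxVM source:

    /* d points to READ CAPACITY(16) parameter data:
     * bytes 0-7  = returned (last) logical block address, big endian
     * bytes 8-11 = logical block length in bytes, big endian */
    static unsigned long long
    capacity_bytes_from_readcap16(const unsigned char *d)
    {
        unsigned long long last_lba = 0;
        unsigned int blklen, i;

        for (i = 0; i < 8; i++)
            last_lba = (last_lba << 8) | d[i];
        blklen = ((unsigned int)d[8] << 24) | (d[9] << 16) |
                 (d[10] << 8) | d[11];
        return (last_lba + 1) * (unsigned long long)blklen;
    }

Truncating the 64-bit block address to 32 bits is exactly the kind of mistake that makes a >2 TB LUN report a size of 0 or a wrapped-around value.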
* Incident no::2353404 Tracking ID ::2334757
Symptom::vxconfigd consumes a lot of memory when the DMP tunable dmp_probe_idle_lun is set to on. The "pmap" command on the vxconfigd process shows a continuously growing heap.
Description::The DMP path restoration daemon probes idle LUNs (idle LUNs are VxVM disks on which no I/O requests are scheduled) and generates notify events for vxconfigd. vxconfigd in turn sends notification of these events to its clients. If for any reason vxconfigd cannot deliver these events (because a client is busy processing an earlier event), it keeps the events queued internally. Because of this slow consumption of events by its clients, the memory consumption of vxconfigd grows.
Resolution::dmp_probe_idle_lun is now set to off by default.

* Incident no::2353405 Tracking ID ::2317540
Symptom::System panic due to kernel heap corruption during DMP device driver unload. The panic stack on Solaris (when kmem_flags is set to either 0x100 or 0xf) looks similar to:
vpanic()
kmem_error+0x4b4()
dmp_free_stats_table+0x118()
dmp_free_modules+0x24()
vxdmp`_fini+0x178()
moduninstall+0x148()
modunrload+0x6c()
modctl+0x54()
syscall_trap+0xac()
Description::During DMP kernel device driver unload, all the allocated kernel heap memory is freed. As part of freeing the allocated memory, DMP attempted to free more than the allocated size for one of the buffers, which leads to a system panic when kernel memory auditing is enabled.
Resolution::The source code is modified to free the kernel buffer with a size that matches the allocation size.

* Incident no::2353406 Tracking ID ::2313021
Symptom::In a Sun Cluster environment, nodes fail to join the CVM cluster after a reboot, displaying the following messages on the console:
<> vxio: [ID 557667 kern.notice] NOTICE: VxVM vxio V-5-3-1251 joinsio_done: Overlapping reconfiguration, failing the join for node 1. The join will be retried.
<> vxio: [ID 976272 kern.notice] NOTICE: VxVM vxio V-5-3-672 abort_joinp: aborting joinp for node 1 with err 11
<> vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-12144 CVM_VOLD_JOINOVER command received with error
Description::A reboot of a node within a CVM cluster involves a "node leave" followed by a "node join" reconfiguration. During CVM reconfiguration, each node exchanges reconfiguration messages with the other nodes using the UDP protocol. At the end of a CVM reconfiguration, the exchanged messages should be deleted from all the nodes in the cluster. However, due to a bug in CVM, the messages were not deleted as part of the "node leave" reconfiguration processing on some nodes, which resulted in failure of subsequent "node join" reconfigurations.
Resolution::After every CVM reconfiguration, the processed reconfiguration messages on all the nodes in the CVM cluster are deleted properly.

* Incident no::2353410 Tracking ID ::2286559
Symptom::The system panics in the DMP (Dynamic Multi Pathing) kernel module due to kernel heap corruption while a DMP path failover is in progress. The panic stack may look like:
vpanic
kmem_error+0x4b4()
gen_get_enabled_ctlrs+0xf4()
dmp_get_enabled_ctlrs+0xf4()
dmp_info_ioctl+0xc8()
dmpioctl+0x20()
dmp_get_enabled_cntrls+0xac()
vx_dmp_config_ioctl+0xe8()
quiescesio_start+0x3e0()
voliod_iohandle+0x30()
voliod_loop+0x24c()
thread_start+4()
Description::During path failover in DMP, the routine gen_get_enabled_ctlrs() allocates memory proportional to the number of enabled paths. However, while releasing the memory, the routine may end up freeing more memory because the number of enabled paths has changed in the meantime.
Resolution::Code changes have been made in the routines to free only the memory that was allocated.
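Incidents 2353405 and 2353410 are both instances of the same rule: a kernel buffer must be freed with exactly the length it was allocated with, and that length must not be recomputed from state that may have changed (such as the current number of enabled paths). The following is a hedged, Solaris-DDI-style sketch of the pattern; the structure and names are illustrative and are not the VxVM source:

    #include <sys/types.h>
    #include <sys/kmem.h>

    typedef struct ctlr_buf {
        size_t  cb_len;        /* length recorded at allocation time */
        caddr_t cb_data;
    } ctlr_buf_t;

    static ctlr_buf_t *
    ctlr_buf_alloc(size_t nctlrs, size_t recsize)
    {
        ctlr_buf_t *cb = kmem_zalloc(sizeof (*cb), KM_SLEEP);
        cb->cb_len  = nctlrs * recsize;
        cb->cb_data = kmem_alloc(cb->cb_len, KM_SLEEP);
        return (cb);
    }

    static void
    ctlr_buf_free(ctlr_buf_t *cb)
    {
        /* free with the recorded length, never with a fresh count */
        kmem_free(cb->cb_data, cb->cb_len);
        kmem_free(cb, sizeof (*cb));
    }

Because kmem_free() takes the size as an argument, passing a size derived from a value that has since changed corrupts the heap, which is what kernel memory auditing (kmem_flags) catches.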
* Incident no::2353421 Tracking ID ::2334534
Symptom::In a CVM (Cluster Volume Manager) environment, a node (SLAVE) joining the cluster gets stuck, leading to an unending join hang unless the join operation is stopped on the joining node (SLAVE) with the command '/opt/VRTS/bin/vxclustadm stopnode'. While the CVM join is hung in user land (also called a vxconfigd-level join), vxconfigd (the Volume Manager configuration daemon) on the CVM MASTER node does not respond to any VxVM command that communicates with the vxconfigd process. When the vxconfigd-level CVM join is hung in user land, "vxdctl -c mode" on the joining node (SLAVE) displays output such as:
bash-3.00# vxdctl -c mode
mode: enabled: cluster active - SLAVE
master: mtvat1000-c1d
state: joining
reconfig: vxconfigd in join
Description::As part of a CVM node join to the cluster, every node in the cluster first updates the current CVM membership information (which can be viewed with the command '/opt/VRTS/bin/vxclustadm nidmap') in the kernel and then sends a signal to vxconfigd in user land to use that membership while exchanging configuration records with the other nodes. Since each node receives the signal (SIGIO) from the kernel independently, the joining node's (SLAVE) vxconfigd can be ahead of the MASTER in its execution. Thus any request coming from the joining node (SLAVE) is denied by the MASTER with the error "VE_CLUSTER_NOJOINERS", i.e. the join operation is not currently allowed (error number 234), because the MASTER's vxconfigd has not yet received the updated membership from the kernel. While responding to the joining node (SLAVE) with the error "VE_CLUSTER_NOJOINERS", if there is any change in the current membership (a change in CVM node ID) as part of the node join, the MASTER node wrongly updates the internal vxconfigd data structure that is used to send responses to joining (SLAVE) nodes. Due to this wrong update, when the joining node later retries its request, the response from the MASTER is sent to a wrong node that does not exist in the cluster, and no response is sent to the joining node. The joining node (SLAVE) never gets the response from the MASTER for its request, so the CVM node join is never completed, leading to the cluster hang.
Resolution::The vxconfigd code is modified to handle the above scenario effectively. vxconfigd on the MASTER node processes connection requests coming from the joining node (SLAVE) only after the MASTER node has received the updated CVM membership information from the kernel.

* Incident no::2353425 Tracking ID ::2320917
Symptom::vxconfigd, the VxVM configuration daemon, dumps core and loses the disk group configuration while invoking the following VxVM reconfiguration steps:
1) Volumes which were created on thin reclaimable disks are deleted.
2) Before the space of the deleted volumes is reclaimed, the disks (whose volumes were deleted) are removed from the DG with the 'vxdg rmdisk' command using the '-k' option.
3) The disks are removed using the 'vxedit rm' command.
4) New disks are added to the disk group using the 'vxdg adddisk' command.
The stack trace of the core dump is:
[ 0006f40c rec_lock3 + 330
0006ea64 rec_lock2 + c
0006ec48 rec_lock2 + 1f0
0006e27c rec_lock + 28c
00068d78 client_trans_start + 6e8
00134d00 req_vol_trans + 1f8
00127018 request_loop + adc
000f4a7c main + fb0
0003fd40 _start + 108 ]
Description::When a volume is deleted from a disk group that uses thin reclaim LUNs, the subdisks are not removed immediately; instead they are marked with a special flag. The reclamation happens at a scheduled time every day. The "vxdefault" command can be invoked to list and modify the settings. After the disk is removed from the disk group using the 'vxdg -k rmdisk' and 'vxedit rm' commands, the subdisk records are still in the in-core database and they point to a disk media record that has been freed. When the next command is run to add another new disk to the disk group, vxconfigd dumps core while locking the disk media record that has already been freed. The subsequent disk group deport and import commands erase the entire disk group configuration because they detect an invalid association between the subdisks and the removed disk.
Resolution::1) The following message is printed when 'vxdg rmdisk' is used to remove a disk that has reclaim-pending subdisks:
VxVM vxdg ERROR V-5-1-0 Disk is used by one or more subdisks which are pending to be reclaimed. Use "vxdisk reclaim " to reclaim space used by these subdisks, and retry "vxdg rmdisk" command. Note: reclamation is irreversible.
2) A check is added when 'vxedit rm' is used to remove a disk. If the disk is in the removed state and has reclaim-pending subdisks, the following error message is printed:
VxVM vxedit ERROR V-5-1-10127 deleting : Record is associated

* Incident no::2353427 Tracking ID ::2337353
Symptom::The "vxdmpadm include" command includes all the excluded devices along with the device given in the command. Example:
# vxdmpadm exclude vxvm dmpnodename=emcpower25s2
# vxdmpadm exclude vxvm dmpnodename=emcpower24s2
# more /etc/vx/vxvm.exclude
exclude_all 0
paths
emcpower24c /dev/rdsk/emcpower24c emcpower25s2
emcpower10c /dev/rdsk/emcpower10c emcpower24s2
#
controllers
#
product
#
pathgroups
#
# vxdmpadm include vxvm dmpnodename=emcpower24s2
# more /etc/vx/vxvm.exclude
exclude_all 0
paths
#
controllers
#
product
#
pathgroups
#
Description::When a dmpnode is excluded, an entry is made in the /etc/vx/vxvm.exclude file. This entry has to be removed when the dmpnode is later included. Due to a bug in the comparison of dmpnode device names, all the excluded devices were included.
Resolution::The bug in the code which compares the dmpnode device names has been rectified.

* Incident no::2353464 Tracking ID ::2322752
Symptom::Duplicate device names are observed for NR (Not Ready) devices when vxconfigd is restarted (vxconfigd -k).
# vxdisk list
emc0_0052 auto - - error
emc0_0052 auto:cdsdisk - - error
emc0_0053 auto - - error
emc0_0053 auto:cdsdisk - - error
Description::During a vxconfigd restart, the disk access records are rebuilt in the vxconfigd database. As part of this process, I/Os are issued on all the devices to read the disk private regions. The failure of these I/Os on NR devices resulted in the creation of duplicate disk access records.
Resolution::The vxconfigd code is modified so that duplicate disk access records are not created.

* Incident no::2357579 Tracking ID ::2357507
Symptom::The machine can panic while detecting unstable paths, with the following stack trace:
#0 crash_nmi_callback
#1 do_nmi
#2 nmi
#3 schedule
#4 __down
#5 __wake_up
#6 .text.lock.kernel_lock
#7 thread_return
#8 printk
#9 dmp_notify_event
#10 dmp_restore_node
Description::After detecting unstable paths, the restore daemon allocates memory to report the event to user-land daemons such as vxconfigd. While requesting the memory allocation, the restore daemon did not drop the spin lock, resulting in the machine panic.
Resolution::The code is fixed so that spin locks are not held while requesting memory allocation in the restore daemon.
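Incident 2357579 illustrates a general rule: a spin lock must not be held across a memory allocation that can block. Below is a hedged, Linux-kernel-style sketch of the reworked pattern; the lock, structure and function names are illustrative and are not the VxVM restore daemon source:

    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct path_event { int path_id; };          /* illustrative payload */

    static DEFINE_SPINLOCK(restore_lock);

    static int report_unstable_path(int path_id, struct path_event **evp)
    {
        struct path_event *ev;

        /* Allocate before taking (or after dropping) the spin lock;
         * GFP_KERNEL allocations may sleep. */
        ev = kmalloc(sizeof(*ev), GFP_KERNEL);
        if (!ev)
            return -ENOMEM;
        ev->path_id = path_id;

        spin_lock(&restore_lock);
        /* ... queue the event for delivery to vxconfigd ... */
        spin_unlock(&restore_lock);

        *evp = ev;
        return 0;
    }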
* Incident no::2357820 Tracking ID ::2357798
Symptom::VVR leaks memory due to an unfreed vol_ru_update structure. The memory leak is very small, but it can accumulate to a large value if VVR runs for many days.
Description::VVR allocates an update structure for each write. If replication is up to date, the next incoming write also creates a multi-update and adds it to the VVR replication queue. While creating the multi-update, VVR wrongly marked the original update with a flag meaning that the update is in the replication queue, although it was never added (and did not need to be added) to that queue. When the update free routine is called, it checks whether the flag is set; if so, it does not free the update, assuming that the update is still in the replication queue and will be freed when it is removed from the queue. Since the update was never in the queue, it is never freed and the memory leaks. The leak happens only for the first write arriving after each time the rlink becomes up to date, which is why it takes many days to accumulate a large leak.
Resolution::The incorrect marking of the flag was causing this memory leak; the marking is not required because the update is not added to the replication queue. The fix removes the marking and checking of the flag.

* Incident no::2360415 Tracking ID ::2242268
Symptom::An agenode that had already been freed was accessed, which led to a panic. The panic stack looks like:
[0674CE30]voldrl_unlog+0001F0 (F100000070D40D08, F10001100A14B000, F1000815B002B8D0, 0000000000000000)
[06778490]vol_mv_write_done+000AD0 (F100000070D40D08, F1000815B002B8D0)
[065AC364]volkcontext_process+0000E4 (F1000815B002B8D0)
[066BD358]voldiskiodone+0009D8 (F10000062026C808)
[06594A00]voldmp_iodone+000040 (F10000062026C808)
Description::The panic happened because of accessing a memory location that had already been freed.
Resolution::The data structure is skipped for further processing when its memory has already been freed.

* Incident no::2360419 Tracking ID ::2237089
Symptom::vxrecover failed to recover data volumes with an associated cache volume.
Description::vxrecover does not wait for the recovery of the cache volumes to complete before triggering the recovery of the data volumes that are created on top of the cache volume. Because of this, the recovery may fail for the data volumes.
Resolution::Code changes are done to serialize the recovery for different volume types.

* Incident no::2360719 Tracking ID ::2359814
Symptom::1. The vxconfigbackup(1M) command fails with the following error:
ERROR V-5-2-3720 dgid mismatch
2. The "-f" option for vxconfigbackup(1M) is not documented in the man page.
Description::1. In some cases, a *.dginfo file has two lines starting with "dgid:", which causes vxconfigbackup to fail. The output of the preceding awk command returns 2 lines instead of one for the $bkdgid variable, so the comparison fails, resulting in the "dgid mismatch" error even when the dgids are the same.
This happens if the temporary dginfo file was not removed during the last run of vxconfigbackup (for example, because the script was interrupted), since the temporary dginfo file is updated in append mode in vxconfigbackup.sh:
echo "TIMESTAMP" >> $DGINFO_F_TEMP 2>/dev/null
Therefore two or more dginfo sections may be appended to the dginfo file, causing the configuration backup to fail with the dgid mismatch.
2. The "-f" option to force a backup is not documented in the vxconfigbackup(1M) man page.
Resolution::1. The append mode is changed to overwrite (truncate) mode.
2. The vxconfigbackup(1M) man page is updated with the "-f" option.

* Incident no::2364700 Tracking ID ::2364253
Symptom::In the case of space-optimized snapshots at the secondary site, VVR leaks kernel memory.
Description::In the case of space-optimized snapshots at the secondary site, VVR proactively starts the copy-on-write on the snapshot volume. The I/O buffer allocated for this proactive copy-on-write was not freed even after the I/Os completed, which led to the memory leak.
Resolution::After the proactive copy-on-write is complete, the memory allocated for the I/O buffers is released.

* Incident no::2367561 Tracking ID ::2365951
Symptom::Growing RAID-5 volumes beyond 5 TB fails with an "Unexpected kernel error in configuration update" error. Example:
# vxassist -g eqpwhkthor1 growby raid5_vol5 19324030976
VxVM vxassist ERROR V-5-1-10128 Unexpected kernel error in configuration update
Description::VxVM stores the size required to grow RAID-5 volumes in an integer variable, which overflowed for large volume sizes. This resulted in the failure to grow the volume.
Resolution::The VxVM code is modified to handle integer overflow conditions for RAID-5 volumes.

* Incident no::2367565 Tracking ID ::2367564
Symptom::A Linux setup configured with a large number of LUNs takes a long time to boot. Many "vxdmpadm enable path=<>" commands are triggered by the vxvm-udev.sh script during boot, which increases the system boot time.
Description::On a Linux setup configured with a large number of devices, when new LUNs are discovered at early boot time, UDEV events are generated for these new LUNs. As part of the UDEV processing, the VxVM vxvm-udev.sh script is triggered; it calls the "vxdmpadm enable path=<>" command for each new path when it fails to post the event directly to the vxesd daemon. As part of the "vxdmpadm enable path" command, vxconfigd performs a partial discovery, which slows down the system. In addition, the "udevadm trigger" command was invoked by the vxvm-startup RC script at boot time, which also affects the system boot time.
Resolution::The "udevadm trigger" command is removed from the vxvm-startup RC script, and the code that triggers the "vxdmpadm enable path=<>" command is removed from vxvm-udev.sh. Instead, an error message is logged in syslog to inform the user to run "vxdisk scandisks".

* Incident no::2377317 Tracking ID ::2408771
Symptom::VxVM does not show all the discovered devices. The number of devices shown by VxVM is lower than the number shown by the OS.
Description::For every LUN path device discovered, VxVM creates a data structure and stores it in a hash table. The hash value is computed based on the unique minor number of the LUN path. If the minor number exceeds 831231, an integer overflow occurs and the data structure for that path is stored at the wrong location. When the hash list is later traversed, the accesses are limited based on the total number of discovered paths, and because the devices with minor numbers greater than 831231 were hashed wrongly, DA records are not created for such devices.
Resolution::The integer overflow problem has been resolved by appropriately typecasting the minor number, so the correct hash value is computed.
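Incident 2377317 is a 32-bit overflow while deriving a hash value from a large minor number. The sketch below shows the general overflow-safe form in C; the table size and multiplier are illustrative and the real VxVM hash function is not shown here:

    #include <stdint.h>

    #define HASH_BUCKETS 2048u             /* illustrative table size */

    /* Widen the minor number to 64 bits before any arithmetic so that
     * large minor numbers cannot wrap a 32-bit intermediate result. */
    static unsigned int path_hash(uint32_t minor)
    {
        uint64_t m = (uint64_t)minor * 2654435761u;   /* Knuth multiplier */
        return (unsigned int)(m % HASH_BUCKETS);
    }

The same discipline (widen first, then multiply) also covers the RAID-5 grow-size overflow described in incident 2367561.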
* Incident no::2379034 Tracking ID ::2379029
Symptom::Changing the enclosure name did not work for all the devices in the enclosure. The affected devices were present in /etc/vx/darecs.
# cat /etc/vx/darecs
ibm_ds8x000_02eb auto online format=cdsdisk,privoffset=256,pubslice=2,privslice=2
ibm_ds8x000_02ec auto online format=cdsdisk,privoffset=256,pubslice=2,privslice=2
# vxdmpadm setattr enclosure ibm_ds8x000 name=new_ibm_ds8x000
# vxdisk -o alldgs list
DEVICE TYPE DISK GROUP STATUS
ibm_ds8x000_02eb auto:cdsdisk ibm_ds8x000_02eb mydg online
ibm_ds8x000_02ec auto:cdsdisk ibm_ds8x000_02ec mydg online
new_ibm_ds8x000_02eb auto - - error
new_ibm_ds8x000_02ec auto - - error
Description::/etc/vx/darecs stores only foreign devices and nopriv or simple devices; auto devices should NOT be written to this file. A DA record is flushed to /etc/vx/darecs at the end of a transaction if the R_NOSTORE flag is NOT set on the DA record. There was a bug in VM where, if a disk that does not exist in the da_list is initialized (for example after "vxdisk rm"), the R_NOSTORE flag was NOT set on the newly created DA record. Hence duplicate entries for these devices were created, and these DAs went into the error state.
Resolution::The source has been modified to add the R_NOSTORE flag for auto-type DA records created by auto_init() or auto_define().
# vxdmpadm setattr enclosure ibm_ds8x000 name=new_ibm_ds8x000
# vxdisk -o alldgs list
new_ibm_ds8x000_02eb auto:cdsdisk ibm_ds8x000_02eb mydg online
new_ibm_ds8x000_02ec auto:cdsdisk ibm_ds8x000_02ec mydg online

* Incident no::2382705 Tracking ID ::1675599
Symptom::vxconfigd leaks memory when a Third Party Driver controlled LUN is excluded and included in a loop. As part of this, vxconfigd loses its license information and the following error is seen in the system log:
"License has expired or is not available for operation"
Description::In the vxconfigd code, memory allocated for various data structures related to the device discovery layer was not freed, which led to the memory leak.
Resolution::The memory is released after its scope is over.

* Incident no::2382710 Tracking ID ::2139179
Symptom::DG import can fail with SSB (Serial Split Brain) reported even though no SSB condition exists.
Description::An association between DM and DA records is made while importing a DG if the SSB IDs of the DM and DA records match. On a system with stale cloned disks, the system attempted to associate the DM with the cloned DA, where an SSB ID mismatch was observed, resulting in an import failure with an SSB mismatch.
Resolution::The selection of the DA to associate with the DM has been rectified to resolve the issue.

* Incident no::2382714 Tracking ID ::2154287
Symptom::In the presence of "Not-Ready" devices, where the SCSI inquiry on the device succeeds but open or read/write operations fail, paths to such devices are continuously marked as ENABLED and DISABLED in every DMP restore task cycle.
Description::The DMP restore task finds these paths connected and hence enables them for I/O, but soon finds that they cannot be used for I/O and disables them again.
Resolution::A path is no longer enabled unless it is found to be connected and available for open and I/O.
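The change described for incident 2382714 means a path is re-enabled only after it proves usable for I/O, not merely present. Below is a hedged user-space sketch of such a probe in C (the device path and probe size are illustrative; the DMP restore task actually runs in the kernel):

    #include <fcntl.h>
    #include <unistd.h>

    /* Return 1 only if the device can be opened and a small read from
     * block 0 succeeds; SCSI-inquiry success alone is not enough. */
    static int path_usable(const char *devpath)
    {
        char    buf[512];
        ssize_t n;
        int     fd = open(devpath, O_RDONLY);

        if (fd < 0)
            return 0;
        n = pread(fd, buf, sizeof(buf), 0);
        close(fd);
        return n == (ssize_t)sizeof(buf);
    }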
* Incident no::2382717 Tracking ID ::2197254
Symptom::vxassist, the VxVM volume creation utility, does not function as expected when creating a volume with "logtype=none".
Description::While creating volumes on thinrclm disks, a Data Change Object (DCO) version 20 log is attached to every volume by default. If the user does not want this default behavior, the "logtype=none" option can be specified as a parameter to the vxassist command. With VxVM on HP-UX 11.31, however, this option did not work and a DCO version 20 log was created by default. The reason for this inconsistency is that when "logtype=none" is specified, the utility sets a flag to prevent creation of the log, but VxVM did not check whether the flag was set before creating the DCO log.
Resolution::This logical issue is addressed by a code fix: the flag corresponding to "logtype=none" is checked before a DCO version 20 log is created by default.

* Incident no::2383705 Tracking ID ::2204752
Symptom::The following message is observed after diskgroup creation:
"VxVM ERROR V-5-3-12240: GPT entries checksum mismatch"
Description::This message is observed with a disk that was initialized as cds_efi and later re-initialized as hpdisk. The harmless "checksum mismatch" message is emitted even though the diskgroup initialization is successful.
Resolution::The harmless "GPT entries checksum mismatch" message has been removed.

* Incident no::2384473 Tracking ID ::2064490
Symptom::The vxcdsconvert utility fails if the disk capacity is greater than or equal to 1 TB.
Description::VxVM cdsdisk uses the GPT layout if the disk capacity is greater than 1 TB and the VTOC layout if the disk capacity is less than 1 TB. The vxcdsconvert utility was not able to convert to the GPT layout when the disk capacity was greater than or equal to 1 TB.
Resolution::This issue has been resolved by converting to the proper cdsdisk layout depending on the disk capacity.

* Incident no::2384844 Tracking ID ::2356744
Symptom::When "vxvm-recover" is executed manually, duplicate instances of the Veritas Volume Manager (VxVM) daemons (vxattachd, vxcached, vxrelocd, vxvvrsecdgd and vxconfigbackupd) are invoked. When the user then kills any of the daemons manually, the other instances of the daemons are left on the system.
Description::The VxVM daemons (vxattachd, vxcached, vxrelocd, vxvvrsecdgd and vxconfigbackupd) did not have:
1. a check for a duplicate instance, and
2. a mechanism to clean up stale processes.
Because of this, when the user executes the startup script (vxvm-recover), all the daemons are invoked again, and if the user kills any of the daemons manually, the other instances of the daemons are left on the system.
Resolution::The VxVM daemons are modified to perform the "duplicate instance check" and "stale process cleanup" appropriately.
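A "duplicate instance check" of the kind added for incident 2384844 is conventionally done with a pid file: refuse to start if a previously recorded pid still refers to a live process. The following is a hedged, generic C sketch of that idea only; the VxVM daemons themselves are scripts, and the file name and logic here are illustrative:

    #include <sys/types.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Return 1 if another instance recorded in pidfile is still alive. */
    static int already_running(const char *pidfile)
    {
        FILE *fp = fopen(pidfile, "r");
        long  pid = 0;

        if (fp == NULL)
            return 0;                      /* no pid file, assume free */
        if (fscanf(fp, "%ld", &pid) != 1)
            pid = 0;
        fclose(fp);
        /* kill(pid, 0) sends no signal; it succeeds only if the
         * process exists and we are allowed to signal it. */
        return pid > 0 && kill((pid_t)pid, 0) == 0;
    }

A daemon that finds already_running() true exits immediately; otherwise it writes its own pid to the file and removes the file on exit, which also gives the "stale process cleanup" a well-defined place to look.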
* Incident no::2386763 Tracking ID ::2346470
Symptom::The Dynamic Multi-Pathing administration operations "vxdmpadm exclude vxvm dmpnodename=" and "vxdmpadm include vxvm dmpnodename=" trigger memory leaks in the heap segment of the VxVM configuration daemon (vxconfigd).
Description::vxconfigd allocates chunks of memory to store VxVM-specific information about the disk being included during a "vxdmpadm include vxvm dmpnodename=" operation. The allocated memory is not freed while excluding the same disk from VxVM control. Also, when excluding a disk from VxVM control, another chunk of memory is temporarily allocated by vxconfigd to store more details of the device being excluded, and this memory is not freed at the end of the exclude operation.
Resolution::Memory allocated during the include operation of a disk is freed during the corresponding exclude operation of the disk. Temporary memory allocated during the exclude operation of a disk is freed at the end of the exclude operation.

* Incident no::2389095 Tracking ID ::2387993
Symptom::In the presence of NR (Not-Ready) devices, vxconfigd (the VxVM configuration daemon) goes into disabled mode once restarted.
# vxconfigd -k -x syslog
# vxdctl mode
mode: disabled
If vxconfigd is restarted in debug mode at level 9, the following message can be seen:
# vxconfigd -k -x 9 -x syslog
VxVM vxconfigd DEBUG V-5-1-8856 DA_RECOVER() failed, thread 87: Kernel and on-disk configurations don't match
Description::When vxconfigd is restarted, all the VxVM devices are recovered. As part of the recovery, the capacity of the device is read, which can fail with EIO. This error was not handled properly, and as a result vxconfigd went into the DISABLED state.
Resolution::The EIO error code from the read-capacity ioctl is now handled specifically.

* Incident no::2390804 Tracking ID ::2249113
Symptom::VVR volume recovery hangs in the vol_ru_recover_primlog_done() function in a dead loop.
Description::During SRL recovery, the SRL is read to apply the updates to the data volume. There can be holes in the SRL where some writes did not complete properly. These holes must be skipped; such a region is read as a dummy update and sent to the secondary. If the dummy update size is larger than the maximum write size (> 256 KB), the code goes into a dead loop, reading the same dummy update forever.
Resolution::Large holes that are greater than the VVR maximum write size are now handled.

* Incident no::2390815 Tracking ID ::2383158
Symptom::A panic in vol_rv_mdship_srv_done() because the SIO has been freed and holds an invalid node pointer.
Description::vol_rv_mdship_srv_done() panics when referencing wrsio->wrsrv_node because wrsrv_node holds an invalid pointer; the wrsio has been freed or reallocated for a different SIO. On closer inspection, vol_rv_check_wrswaitq() is called at the completion of every SIO; it looks into the wait queue and releases every SIO that has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. In vol_rv_mdship_srv_done(), this flag is set and further operations are then performed on wrsrv. During this time another SIO that completes calls vol_rv_check_wrswaitq(), which deletes both its own SIO and any other SIO that has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. This deletes an SIO that is still in flight, causing the panic.
Resolution::The flag is now set just before calling vol_rv_mdship_srv_done() and at the end of the SIO done routine, to prevent other SIOs from racing and deleting the currently running one.

* Incident no::2390822 Tracking ID ::2369786
Symptom::On a VVR Secondary cluster, if the SRL disk goes bad, vxconfigd may hang in the transaction code path.
Description::In the case of any error seen in VVR shared disk group environments, error handling is done cluster-wide. On the VVR Secondary, if the SRL disk goes bad due to a temporary or actual disk failure, cluster-wide error handling starts. Error handling requires serialization; in some cases the serialization was not done, which caused the error handling to go into a dead loop, hence the hang.
Resolution::The I/O is now always serialized during error handling on the VVR Secondary, which resolves this issue.

* Incident no::2391622 Tracking ID ::2081357
Symptom::System panic with the ted assert (f:VOL_RU_UPDATE_VALIDATE_QUEUED:1a).
Description::A separate I/O queue is maintained for the I/O coming from the slave node.
The queue is managed in many places that add and delete I/Os. In some places the I/O was not removed from the queue before the I/O pointers were freed. The freed pointers left in the queue led to queue corruption.
Resolution::The I/O is now deleted from the queue at all the places where its processing completes.

* Incident no::2397663 Tracking ID ::2165394
Symptom::If a cloned copy of a diskgroup and a destroyed diskgroup exist on the system, an import operation imports the destroyed diskgroup instead of the cloned one. For example, consider a system with diskgroup dg containing disk disk01. Disk disk01 is cloned to disk02. When diskgroup dg containing disk01 is destroyed and diskgroup dg is imported, VxVM should import dg with the cloned disk, i.e. disk02. However, it imports the diskgroup dg with disk01.
Description::After destroying a diskgroup, if a cloned copy of the same diskgroup exists on the system, the subsequent disk group import operation wrongly identifies the disks to import, and hence the destroyed diskgroup gets imported.
Resolution::The diskgroup import code is modified to identify the correct diskgroup when a cloned copy of the destroyed diskgroup exists.

* Incident no::2405446 Tracking ID ::2253970
Symptom::Enhancement to customize the private region I/O size based on the maximum transfer size of the underlying disk.
Description::There are different types of array controllers which support data transfer sizes starting from 256 KB and beyond. The VxVM tunable volmax_specialio controls vxconfigd's configuration I/O size as well as the Atomic Copy I/O size. When volmax_specialio is tuned to a value greater than 1 MB to leverage the maximum transfer size of the underlying disks, the import operation fails for disks which cannot accept more than a 256 KB I/O size. If the tunable is instead set to 256 KB, the larger transfer size of the other disks is not leveraged.
Resolution::All of the scenarios mentioned in the Description are handled in this enhancement, so that large disk transfer sizes are leveraged while array controllers limited to 256 KB transfer sizes are still supported.

* Incident no::2408209 Tracking ID ::2291226
Symptom::Data corruption can be observed on a CDS (Cross-platform Data Sharing) disk whose capacity is more than 1 TB. The following pattern is found in the data region of the disk:
cyl alt 2 hd sec
Description::The CDS disk maintains a SUN VTOC in the zeroth block of the disk. This VTOC maintains the disk geometry information, such as the number of cylinders, tracks and sectors per track. These values are limited to a maximum of 65535 by the design of SUN's VTOC, which limits the disk capacity to 1 TB. As per SUN's requirement, a few backup VTOC labels have to be maintained on the last track of the disk. VxVM 5.0 MP3 RP3 allows a CDS disk to be set up on a disk with a capacity of more than 1 TB. The data region of the CDS disk spans more than 1 TB, utilizing all the accessible cylinders of the disk, while the VTOC labels are written at the zeroth block and on the last track computed as if the disk capacity were 1 TB. The backup labels therefore fall into the data region of the CDS disk, causing the data corruption.
Resolution::Writing of the backup labels is suppressed to prevent the data corruption.

* Incident no::2408864 Tracking ID ::2346376
Symptom::Some DMP I/O statistics records were lost from the per-CPU I/O statistics queue. Hence, the DMP I/O statistics reporting CLI displayed incorrect data.
Description::The DMP I/O statistics daemon has two buffers for maintaining I/O statistics records. One of the buffers is active and is updated on every I/O completion, while the other, shadow buffer is read by the I/O statistics daemon. The central I/O statistics table is updated every I/O statistics interval from the records in the active buffer. The problem occurs because swapping of the two buffers can happen from two contexts, I/O throttling and I/O statistics collection. I/O throttling swaps the buffers but does not update the central I/O statistics table, so all the I/O records in the active buffer are lost when the two buffers are swapped in the throttling context.
Resolution::As a resolution to the problem, the buffers are no longer swapped in the I/O throttling context.
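Incidents 2211971 and 2408864 both concern the two-buffer design for DMP I/O statistics: I/O completions append to an active buffer while the statistics daemon drains the other, and the two are swapped once per interval. The corruption in 2408864 came from swapping in a second context (throttling) without draining. The following is a hedged user-space sketch in C of the intended single-context swap; the names, record layout and locking granularity are illustrative and are not the DMP implementation:

    #include <pthread.h>
    #include <string.h>

    #define STATS_RECORDS 1024

    struct stats_rec { unsigned long long ios; };

    static struct stats_rec bufs[2][STATS_RECORDS];
    static int active;                       /* index written by I/O path */
    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

    /* I/O completion path: update the active buffer only. */
    void record_io(int slot)
    {
        pthread_mutex_lock(&stats_lock);
        bufs[active][slot].ios++;
        pthread_mutex_unlock(&stats_lock);
    }

    /* Statistics daemon, once per interval: swap, then drain the buffer
     * that is no longer active.  No other context performs the swap. */
    void collect_interval(void (*merge)(const struct stats_rec *, int))
    {
        int drained;

        pthread_mutex_lock(&stats_lock);
        drained = active;
        active ^= 1;
        pthread_mutex_unlock(&stats_lock);

        merge(bufs[drained], STATS_RECORDS);
        memset(bufs[drained], 0, sizeof(bufs[drained]));
    }

Keeping the swap in exactly one context is what guarantees that every record is merged before its buffer is reused.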
* Incident no::2409212 Tracking ID ::2316550
Symptom::While doing a cold/Ignite install of 11.31 + VxVM 5.1, the following warning messages are seen on a setup with an ALUA array:
"VxVM vxconfigd WARNING V-5-1-0 ddl_add_disk_instr: Turning off NMP Alua mode failed for dmpnode 0xffffffff with ret = 13 "
Description::The above warning messages are displayed by the vxconfigd started at early boot when it fails to turn off NMP ALUA mode for a given DMP device. These messages are harmless, because the vxconfigd later started in enabled mode turns off NMP ALUA mode for all the DMP devices.
Resolution::vxconfigd has been changed not to print these warning messages in boot mode.

* Incident no::2411052 Tracking ID ::2268408
Symptom::1) On suppressing the underlying path of a PowerPath-controlled device, the disk goes into the error state.
2) The "vxdmpadm exclude vxvm dmpnodename=" command does not suppress TPD devices.
Description::During discovery, the H/W path corresponding to the basename was not generated for PowerPath-controlled devices because the basename does not contain the slice portion. The device name with the s2 slice is expected when generating the H/W name.
Resolution::The whole disk name, i.e. the device name with the s2 slice, is used to generate the H/W path.

* Incident no::2411053 Tracking ID ::2410845
Symptom::If a DG (Disk Group) is imported with a reservation key, a lot of 'reservation conflict' messages are seen during the DG deport.
[DATE TIME] [HOSTNAME] multipathd: VxVM26000: add path (uevent)
[DATE TIME] [HOSTNAME] multipathd: VxVM26000: failed to store path info
[DATE TIME] [HOSTNAME] multipathd: uevent trigger error
[DATE TIME] [HOSTNAME] multipathd: VxVM26001: add path (uevent)
[DATE TIME] [HOSTNAME] multipathd: VxVM26001: failed to store path info
[DATE TIME] [HOSTNAME] multipathd: uevent trigger error
..
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:1: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:1: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
Description::When removing the PGR (Persistent Group Reservation) key during the DG deport, the key needs to be preempted, but the preempt operation failed with a reservation conflict error because the key passed for preemption was not correct.
Resolution::Code changes have been made to set the correct key value for the preemption operation.

* Incident no::2413077 Tracking ID ::2385680
Symptom::A vol_rv_async_childdone() panic occurred because of a corrupted pripendingq.
Description::The pripendingq is always corrupted in this panic. The head entry was always freed from the queue but not removed from it.
In the mdship_srv_done code, for the error condition, the update is removed from the pripendingq only if the next or prev pointers of the updateq are non-null. This leads to the head pointer not getting removed in the abort scenario, causing the entry to be freed without being deleted from the queue.
Resolution::The prev and next checks are removed in all the places. The abort cases are also handled carefully for the following conditions: 1) abort of the logendq due to a slave node panic, i.e., the update entry exists but the update is not removed from the pripendingq; 2) vol_kmsg_eagain type of failures, i.e., the update exists but it is removed from the pripendingq; 3) abort very early in mdship_sio_start(), i.e., the update is allocated but not in the pripendingq.

* Incident no::2413908 Tracking ID ::2413904
Symptom::Performing Dynamic LUN reconfiguration operations (adding and removing LUNs) can cause corruption in the DMP database. This in turn may lead to a vxconfigd core dump or a system panic.
Description::When a LUN is removed from VxVM using 'vxdisk rm' and at the same time a new LUN is added, and the newly added LUN reuses the devno of the removed LUN, the DMP database may get corrupted because this condition is not handled currently.
Resolution::Fixed the DMP code to handle the mentioned issue.

* Incident no::2415566 Tracking ID ::2369177
Symptom::When using > 2TB disks and a device responds to SCSI inquiry but fails to service I/O, data corruption can occur because the write I/O would be directed at an incorrect offset.
Description::Currently, when the failed I/O is retried, DMP assumes the offset to be a 32-bit value, so I/O offsets beyond 2TB can get truncated, leading to the retry I/O being issued at a wrong offset value.
Resolution::Changed the offset value to a 64-bit quantity to avoid truncation during I/O retries from DMP.

* Incident no::2415577 Tracking ID ::2193429
Symptom::Enclosure attributes like iopolicy, recoveryoption etc. do not persist across reboots in the case where, before vold startup, the DMP driver is already configured with a different array type (e.g. in case of root support) than the one stored in array.info.
Description::When the DMP driver is already configured before vold comes up (as happens with root support), the enclosure attributes do not take effect if the enclosure name in the kernel has changed from the previous boot cycle. This is because when vold comes up, da_attr_list is NULL. It then gets events from the DMP kernel for data structures already present in the kernel. On receiving this information, it tries to write da_attr_list into array.info, but since da_attr_list is NULL, array.info gets overwritten with no data. Hence, vold later cannot correlate the enclosure attributes present in dmppolicy.info with the enclosures present in array.info, so the persistent attributes cannot be applied.
Resolution::Do not overwrite array.info if da_attr_list is NULL.

* Incident no::2417184 Tracking ID ::2407192
Symptom::Application I/O hangs on RVG volumes when the RVG logowner is being set on the node which takes over the master role (either as part of "vxclustadm setmaster" or as part of the original master leaving).
Description::Whenever a node takes over the master role, RVGs are recovered on the new master. Because of a race between the RVG recovery thread (initiated as part of master takeover) and the thread which is changing the RVG logowner (run as part of "vxrvg set logowner=on"), RVG recovery does not get completed, which leads to the I/O hang.
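The resolution below closes this window by serializing the two contexts. As a generic illustration only (hypothetical code with invented names, not the actual vxio implementation), a flag guarded by a mutex and condition variable lets the logowner-change path wait until recovery has finished:

    #include <pthread.h>
    #include <stdbool.h>

    /* Hypothetical sketch only; these names do not exist in VxVM.
     * A flag protected by a mutex and condition variable serializes
     * RVG recovery against a logowner change. */
    static pthread_mutex_t rvg_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  rvg_cv   = PTHREAD_COND_INITIALIZER;
    static bool recovery_in_progress = false;

    void rvg_recover(void)
    {
        pthread_mutex_lock(&rvg_lock);
        recovery_in_progress = true;
        pthread_mutex_unlock(&rvg_lock);

        /* ... perform the recovery work ... */

        pthread_mutex_lock(&rvg_lock);
        recovery_in_progress = false;
        pthread_cond_broadcast(&rvg_cv);  /* wake any waiting logowner change */
        pthread_mutex_unlock(&rvg_lock);
    }

    void rvg_set_logowner(void)
    {
        pthread_mutex_lock(&rvg_lock);
        while (recovery_in_progress)      /* wait until recovery completes */
            pthread_cond_wait(&rvg_cv, &rvg_lock);
        /* ... change the logowner while holding the lock ... */
        pthread_mutex_unlock(&rvg_lock);
    }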
Resolution::The race condition is handled with appropriate locks and a conditional variable.

* Incident no::2417205 Tracking ID ::2407699
Symptom::The vxassist command dumps core if the file "/etc/default/vxassist" contains the line "wantmirror="
Description::vxassist, the Veritas Volume Manager client utility, can accept attributes from the system defaults file (/etc/default/vxassist), a user-specified alternate defaults file and the command line. vxassist automatically merges all the attributes by pre-defined priority. However, a NULL pointer check is missed while merging the "wantmirror" attribute, which leads to the core dump.
Resolution::Within vxassist, while merging attributes, a check for the NULL pointer is added. (A simplified illustration of such a check appears after incident 2421491 below.)

* Incident no::2421100 Tracking ID ::2419348
Symptom::System panic caused by a race between vxconfigd reconfiguring the DMP database and another process (vxdclid) executing a DMP pass-through ioctl.
Description::This panic is because of a race condition between vxconfigd doing a dmp_reconfigure_db() and another process (vxdclid) executing dmp_passthru_ioctl(). The stack of the vxdclid thread: 000002a107684d51 dmp_get_path_state+0xc(606a5b08140, 301937d9c20, 0, 0, 0, 0) 000002a107684e01 do_passthru_ioctl+0x76c(606a5b08140, 8, 0, 606a506c840, 606a506c848, 0) 000002a107684f61 dmp_passthru_ioctl+0x74(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc) 000002a107685031 dmpioctl+0x20(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc) 000002a1076850e1 fop_ioctl+0x20(60582fdfc00, 40b, 3ad4c0, 100081, 606a3d477b0, 1296a58) 000002a107685191 ioctl+0x184(a, 6065a188430, 3ad4c0, ff0bc910, ff1303d8, 40b) 000002a1076852e1 syscall_trap32+0xcc(a, 40b, 3ad4c0, ff0bc910, ff1303d8, ff13a5a0) And the stack of vxconfigd, which is doing the reconfiguration: vxdmp:dmp_get_iocount+0x68(0x7) vxdmp:dmp_check_ios_drained+0x40() vxdmp:dmp_check_ios_drained_in_dmpnode+0x40(0x60693cc0f00, 0x20000000) vxdmp:dmp_decode_destroy_dmpnode+0x11c(0x2a10536b698, 0x102003, 0x0, 0x19caa70) vxdmp:dmp_decipher_instructions+0x2e4(0x2a10536b758, 0x10, 0x102003, 0x0, 0x19caa70) vxdmp:dmp_process_instruction_buffer+0x150(0x11d0003ffff, 0x3df634, 0x102003, 0x0, 0x19caa70) vxdmp:dmp_reconfigure_db+0x48() vxdmp:gendmpioctl(0x11d0003ffff, , 0x3df634, 0x102003, 0x604a7017298, 0x2a10536badc) vxdmp:dmpioctl+0x20(, 0x444d5040, 0x3df634, 0x102003, 0x604a7017298) The vxdclid thread tries to get the dmpnode from the path_t structure, but at the same time the path_t has been freed as part of the reconfiguration, hence the panic.
Resolution::Get the dmpnode from the lvl1tab table instead of the path_t structure. Because an ioctl is in progress on this dmpnode, the dmpnode is guaranteed to be available at this time.

* Incident no::2421491 Tracking ID ::2396293
Symptom::On VxVM-rooted systems, during machine bootup, vxconfigd core dumps with the following assert and the machine does not boot up.
Assertion failed: (0), file auto_sys.c, line 1024
05/30 01:51:25: VxVM vxconfigd ERROR V-5-1-0 IOT trap - core dumped
Description::DMP deletes and regenerates device numbers dynamically on every boot. When static vxconfigd is started in boot mode, since the ROOT file system is read-only, new DSFs for DMP nodes are not created. However, DMP configures devices in userland and kernel, so there is a mismatch between the device numbers of the DSFs and those in the DMP kernel, as stale DSFs from the previous boot are present. This leads vxconfigd to send I/Os to wrong device numbers, resulting in claiming the disk with a wrong format.
Resolution::The issue is fixed by getting the device numbers from vxconfigd and not doing stat on the DMP DSFs.
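The following is a minimal, self-contained illustration of the NULL-value check referenced under incident 2417205 above, for a defaults-file line such as "wantmirror=" that carries no value. The structure and function names are invented for this sketch; the actual vxassist attribute-merging code differs.

    #include <stdio.h>

    /* Hypothetical sketch of a defaults-file merge that tolerates an
     * empty value such as "wantmirror=".  Types and names are invented
     * for illustration; the real vxassist structures differ. */
    struct attr {
        const char *name;
        const char *value;   /* may be NULL when the line carries no value */
    };

    static void merge_attr(struct attr *dst, const struct attr *src)
    {
        /* Guard against a missing value before dereferencing it. */
        if (src->value == NULL || *src->value == '\0') {
            fprintf(stderr, "ignoring empty value for attribute %s\n", src->name);
            return;
        }
        dst->name  = src->name;
        dst->value = src->value;
    }

    int main(void)
    {
        struct attr merged   = { "wantmirror", "yes" };
        struct attr from_def = { "wantmirror", NULL };   /* "wantmirror=" */

        merge_attr(&merged, &from_def);                  /* safely ignored */
        printf("wantmirror=%s\n", merged.value);
        return 0;
    }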
* Incident no::2423086 Tracking ID ::2033909
Symptom::Disabling a controller of an A/P-G type array could lead to an I/O hang even when there are available paths for I/O.
Description::DMP was not clearing a flag, in an internal DMP data structure, to enable I/O to all the LUNs during the group failover operation.
Resolution::The DMP code is modified to clear the appropriate flag for all the LUNs of the LUN group so that the failover can occur when a controller is disabled.

* Incident no::2424888 Tracking ID ::2388725
Symptom::System panics when trying to load the kdump boot image on RedHat 6.1 after installing Storage Foundation.
Description::The panic occurs in 'dmp_get_dmpsymbols' when the system attempts to load the APM kernel modules before starting DMP. Below is an example panic: Modules linked in: dmpjbod(P+)(U) llc uio sunrpc autofs4 vxglm(P)(U) ip_tables xt_CHECKSUM ipt_REJECT nf_conntrack nf_defrag_ipv4 ebtables ip6_tables Pid: 116, comm: insmod Tainted: P ---------------- 2.6.32-131.0.15.el6.x86_64 #1 Sun Fire X4200 M2 RIP: 0010:[] [] dmp_get_dmpsymbols+0x1c/0x1a0 [dmpjbod] RSP: 0018:ffff88000984bef8 EFLAGS: 00010296 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000010 RDX: 0000000000000000 RSI: ffff880002211300 RDI: 0000000000000282 RBP: ffff88000984bef8 R08: a038000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000100000 R12: ffffffffa00cdca0 When configuring 'kdump' with Storage Foundation installed, the Storage Foundation kernel modules are automatically added to the kdump boot image. When the kdump image is started, the DMP and APM kernel modules are loaded out-of-order, which results in the panic.
Resolution::The APM kernel modules have been modified to cleanly handle the case where they are loaded out-of-order. WORKAROUND: Configure 'kdump' before installing Storage Foundation or when Storage Foundation is not started.

* Incident no::2428179 Tracking ID ::2425722
Symptom::VxVM's subdisk operation - vxsd mv - fails on subdisk sizes greater than or equal to 2TB. E.g.:
# vxsd -g nbuapp mv disk_1-03 disk_2-03
VxVM vxsd ERROR V-5-1-740 New subdisks have different size than subdisk disk_1-03, use -o force
Description::The VxVM code uses a 32-bit unsigned integer variable to store the size of subdisks, which can only accommodate values less than 2TB. Thus, for larger subdisk sizes the integer overflows, resulting in the failure of the subdisk move operation.
Resolution::The code has been modified to accommodate larger subdisk sizes.

* Incident no::2435050 Tracking ID ::2421067
Symptom::With VVR configured, 'vxconfigd' hangs on the primary site when trying to recover the SRL log after a system or storage failure.
Description::At the start of each SRL log disk a config header is kept. Part of this header includes a flag which is used by VVR to serialize the flushing of the SRL configuration table, to ensure that only a single thread flushes the table at any one time. In this instance, the 'VOLRV_SRLHDR_CONFIG_FLUSHING' flag was set in the config header, and then the config header was written to disk. At this point the storage became inaccessible. During recovery the config header was read from disk, and when trying to initiate a new flush of the SRL table, the system hung because the flag was already set, indicating that a flush was in progress.
Resolution::When loading the SRL header from disk, the flag 'VOLRV_SRLHDR_CONFIG_FLUSHING' is now cleared.

* Incident no::2436283 Tracking ID ::2425551
Symptom::The CVM reconfiguration takes 1 minute for each RVG configuration.
Description::Every RVG is given 1 minute to drain its I/O; if the I/O is not drained, the code waits for 1 minute before aborting the I/Os waiting in the logendq. The logic is such that, for every RVG, it waits 1 minute for the I/Os to drain.
Resolution::It is enough to give 1 minute overall for all RVGs and to abort all the RVGs after that 1 minute, instead of waiting 1 minute per RVG. An alternate (long-term) solution is to abort the RVG immediately when objiocount(rv) == queue_count(logendq). This would reduce the 1-minute delay further, down to the actual required time. For this, the following things have to be taken care of: 1. rusio may be active, which needs to be accounted for in the iocount. 2. Every I/O goes into the logendq before getting serviced, so it has to be ensured that the I/Os are not in the process of being serviced.

* Incident no::2436287 Tracking ID ::2428875
Symptom::On a CVR configuration with I/O issued from both master and slave, rebooting the slave leads to a reconfiguration hang.
Description::The I/Os on both master and slave fill up the SRL and the volumes go into DCM mode. In DCM mode, the header flush that flushes the DCM and the SRL header happens for every 512 updates. Since most of the I/Os are from the slave node, the I/Os throttled due to the header flush are queued in the mdship_throttle_q. This queue is flushed at the end of the header flush. If the slave node is rebooted while SIOs are in the throttle queue, the reconfiguration code path does not flush the mdship_throttle_q and waits for those SIOs to drain. This leads to the reconfiguration hang due to a positive I/O count.
Resolution::Abort all the SIOs queued in the mdship_throttle_q when the node is aborted, and restart the SIOs for the nodes that did not leave.

* Incident no::2436288 Tracking ID ::2411698
Symptom::I/Os hang in a CVR (Clustered Volume Replicator) environment.
Description::In a CVR environment, when a CVM (Clustered Volume Manager) slave node sends a write request to the CVM master node, the following tasks occur. 1) The master grabs the *REGION LOCK* for the write and permits the slave to issue the write. 2) When new IOs occur on the same region (until the write that acquired the *REGION LOCK* is complete), they wait in a *REGION LOCK QUEUE*. 3) Once the IO that acquired the *REGION LOCK* is serviced by the slave node, it responds to the master, and the master processes the IOs queued in the *REGION LOCK QUEUE*. The problem occurs when the slave node dies before sending the response to the master about completion of the IO that held the *REGION LOCK*.
Resolution::Code changes have been made to accommodate the condition described above.

* Incident no::2440031 Tracking ID ::2426274
Symptom::In a Storage Foundation environment running Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace when IO hints are being used.
One such scenario is when Symantec Oracle Disk Manager (ODM) is in use: [] _volsio_mem_free+0x4c/0x270 [vxio] [] vol_subdisksio_done+0x59/0x220 [vxio] [] volkcontext_process+0x346/0x9a0 [vxio] [] voldiskiodone+0x764/0x850 [vxio] [] voldiskiodone_intr+0xfa/0x180 [vxio] [] volsp_iodone_common+0x234/0x3e0 [vxio] [] blk_update_request+0xbb/0x3e0 [] blk_update_bidi_request+0x1f/0x70 [] blk_end_bidi_request+0x27/0x80 [] scsi_end_request+0x3a/0xc0 [scsi_mod] [] scsi_io_completion+0x109/0x4e0 [scsi_mod] [] blk_done_softirq+0x6d/0x80 [] __do_softirq+0xbf/0x170 [] call_softirq+0x1c/0x30 [] do_softirq+0x4d/0x80 [] irq_exit+0x85/0x90 [] do_IRQ+0x6e/0xe0 [] ret_from_intr+0x0/0xa [] default_idle+0x32/0x40 [] cpu_idle+0x5a/0xb0 [] start_kernel+0x2ca/0x395 [] x86_64_start_kernel+0xe1/0xf2
Description::A single Volume Manager I/O (staged I/O), while doing 'done' processing, was trying to access the FS-VM private information data structure, which had already been freed. This free also resulted in an assert which indicated a mismatch in the size of the IO that was freed, thereby hitting the panic.
Resolution::The solution is to preserve the FS-VM private information data structure pertaining to the I/O until its last access. After that, it is freed to release that memory.

* Incident no::2440351 Tracking ID ::2440349
Symptom::The grow operation on a DCO volume may grow it into any 'site', not honoring the allocation requirements strictly.
Description::When a DCO volume is grown, it may not honor the allocation specification strictly to use only a particular site, even though the site is specified explicitly.
Resolution::The Data Change Object code of Volume Manager is modified such that it honors the allocation specification strictly when provided explicitly.

* Incident no::2442850 Tracking ID ::2317703
Symptom::When the vxesd daemon is invoked by device attach and removal operations in a loop, it leaves open file descriptors with the vxconfigd daemon.
Description::The issue is caused by multiple vxesd daemon threads trying to establish contact with the vxconfigd daemon at the same time and losing track of the file descriptor through which the communication channel was established.
Resolution::The fix is to maintain a single file descriptor with a thread-safe reference counter, so that multiple communication channels are not established between vxesd and vxconfigd by the various threads of vxesd.

* Incident no::2477291 Tracking ID ::2428631
Symptom::Shared DG import or node join fails with Hitachi Tagmastore storage.
Description::CVM uses a different fence key for every DG. The key format is of type 'NPGRSSSS', where N is the node id (A, B, C, ...) and 'SSSS' is the sequence number. Some arrays have a restriction on the total number of unique keys that can be registered (e.g. Hitachi Tagmastore), which causes issues for configurations involving a large number of DGs, or rather a large product of #DGs and #nodes in the cluster.
Resolution::Having a unique key for each DG is not essential, hence a tunable is added to control this behavior.
# vxdefault list
KEYWORD               CURRENT-VALUE   DEFAULT-VALUE
...
same_key_for_alldgs   off             off
...
The default value of the tunable is 'off' to preserve the current behavior. If a configuration hits the storage array limit on the total number of unique keys, the tunable value can be changed to 'on'.
# vxdefault set same_key_for_alldgs on
# vxdefault list
KEYWORD               CURRENT-VALUE   DEFAULT-VALUE
...
same_key_for_alldgs   on              off
...
This makes CVM generate the same key for all subsequent DG imports/creates.
Already imported DGs need to be deported and re-imported for the changed value of the tunable to take effect.

* Incident no::2479746 Tracking ID ::2406292
Symptom::In the case of I/Os on volumes having multiple subdisks (for example, striped volumes), the system panics with the following stack. unix:panicsys+0x48() unix:vpanic_common+0x78() unix:panic+0x1c() genunix:kmem_error+0x4b4() vxio:vol_subdisksio_delete() - frame recycled vxio:vol_plexsio_childdone+0x80() vxio:volsiodone() - frame recycled vxio:vol_subdisksio_done+0xe0() vxio:volkcontext_process+0x118() vxio:voldiskiodone+0x360() vxio:voldmp_iodone+0xc() genunix:biodone() - frame recycled vxdmp:gendmpiodone+0x1ec() ssd:ssd_return_command+0x240() ssd:ssdintr+0x294() fcp:ssfcp_cmd_callback() - frame recycled qlc:ql_fast_fcp_post+0x184() qlc:ql_status_entry+0x310() qlc:ql_response_pkt+0x2bc() qlc:ql_isr_aif+0x76c() pcisch:pci_intr_wrapper+0xb8() unix:intr_thread+0x168() unix:ktl0+0x48()
Description::On a striped volume, the IO is split into multiple parts, equal in number to the sub-disks in the stripe. Each part of the IO is processed in parallel by different threads. Thus any two such threads processing the IO completion can enter into a race condition. Due to such a race condition, one of the threads happens to access a stale address, causing the system panic.
Resolution::The critical section of code is modified to hold appropriate locks to avoid the race condition.

* Incident no::2480006 Tracking ID ::2400654
Symptom::The "vxdmpadm listenclosure" command hangs because of duplicate enclosure entries in the /etc/vx/array.info file. Example: Enclosure "emc_clariion0" has two entries.
# cat /etc/vx/array.info
DD4VM1S emc_clariion0 0 EMC_CLARiiON
DISKS disk 0 Disk
DD3VM2S emc_clariion0 0 EMC_CLARiiON
Description::When the "vxdmpadm listenclosure" command is run, vxconfigd reads its in-core enclosure list, which is populated from the /etc/vx/array.info file. Since the enclosure "emc_clariion0" (as shown in the example) is also the last entry within the file, the command expects vxconfigd to return the enclosure information at the last index of the enclosure list. However, because of the duplicate enclosure entries, vxconfigd returns different enclosure information, thereby leading to the hang.
Resolution::Code changes are made in vxconfigd to detect duplicate entries in the /etc/vx/array.info file and return the appropriate enclosure information as requested by the vxdmpadm command. (A simplified sketch of such duplicate detection appears after incident 2484466 below.)

* Incident no::2484466 Tracking ID ::2480600
Symptom::I/Os of large sizes like 512K and 1024K hang in CVR (Clustered Volume Replicator).
Description::When large IOs, say of sizes like 1MB, are performed on volumes under an RVG (Replicated Volume Group), only a limited number of IOs can be accommodated based on the RVIOMEM pool limit, so the pool remains full for the majority of the time. At this point, when the CVM (Clustered Volume Manager) slave gets rebooted or goes down, the pending IOs are aborted and the corresponding memory is freed. In one of the cases, the memory does not get freed, leading to the hang.
Resolution::Code changes have been made to free the memory under all scenarios.
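As referenced under incident 2480006 above, the following is a simplified, hypothetical sketch of detecting duplicate enclosure names (the second field of each record) in an array.info-style file; the real vxconfigd parser and in-core enclosure list are more involved.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical sketch only: flag duplicate enclosure names in an
     * array.info-style file, the condition behind incident 2480006. */
    #define MAX_ENCLS 256
    #define NAME_LEN  64

    int main(void)
    {
        char seen[MAX_ENCLS][NAME_LEN];
        int  nseen = 0;
        char line[256], tag[NAME_LEN], encl[NAME_LEN];
        FILE *fp = fopen("/etc/vx/array.info", "r");

        if (fp == NULL) {
            perror("/etc/vx/array.info");
            return 1;
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
            /* Each record looks like: <name> <enclosure> <num> <type> */
            if (sscanf(line, "%63s %63s", tag, encl) != 2)
                continue;                     /* skip malformed lines */
            int dup = 0;
            for (int i = 0; i < nseen; i++) {
                if (strcmp(seen[i], encl) == 0) {
                    printf("duplicate enclosure entry: %s\n", encl);
                    dup = 1;
                    break;
                }
            }
            if (!dup && nseen < MAX_ENCLS)
                strcpy(seen[nseen++], encl);  /* encl is bounded by %63s */
        }
        fclose(fp);
        return 0;
    }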
* Incident no::2484695 Tracking ID ::2484685
Symptom::In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace: 000002a10247a7a1 vpanic() 000002a10247a851 kmem_error+0x4b4() 000002a10247a921 vol_subdisksio_done+0xe0() 000002a10247a9d1 volkcontext_process+0x118() 000002a10247aaa1 voldiskiodone+0x360() 000002a10247abb1 voldmp_iodone+0xc() 000002a10247ac61 gendmpiodone+0x1ec() 000002a10247ad11 ssd_return_command+0x240() 000002a10247add1 ssdintr+0x294() 000002a10247ae81 ql_fast_fcp_post+0x184() 000002a10247af31 ql_24xx_status_entry+0x2c8() 000002a10247afe1 ql_response_pkt+0x29c() 000002a10247b091 ql_isr_aif+0x76c() 000002a10247b181 px_msiq_intr+0x200() 000002a10247b291 intr_thread+0x168() 000002a10240b131 cpu_halt+0x174() 000002a10240b1e1 idle+0xd4() 000002a10240b291 thread_start+4()
Description::A race condition exists between two IOs (specifically Volume Manager subdisk-level staged I/Os) while doing 'done' processing, which causes one thread to free the FS-VM private information data structure before the other thread accesses it. The propensity of the race increases with the number of CPUs.
Resolution::Avoid the race condition such that the slower thread does not access the freed FS-VM private information data structure.

* Incident no::2485278 Tracking ID ::2386120
Symptom::Error messages printed in the syslog in the event of a master takeover failure are, in some situations, not enough to find out the root cause of the failure.
Description::During master takeover, if the new master encounters some errors, the master takeover operation fails. There are messages in the code to log the reasons for the failure, but these log messages are not available on customer setups; they are generally enabled only in internal development/testing scenarios.
Resolution::Some of the relevant messages have been modified such that they are now available on customer setups as well, logging crucial information for root cause analysis of the issue.

* Incident no::2485288 Tracking ID ::2431470
Symptom::vxpfto sets the PFTO (Powerfail Timeout) value on a wrong VxVM device.
Description::vxpfto invokes the 'vxdisk set' command to set the PFTO value. vxdisk accepts both DA (Disk Access) and DM (Disk Media) names for device specification. DA and DM names can have conflicts such that, even within the same disk group, the same name can refer to different devices: one as a DA name and another as a DM name. The vxpfto command uses DM names when invoking the vxdisk command, but vxdisk will choose a matching DA name before a DM name. This causes the incorrect device to be acted upon.
Resolution::Fixed the argument check procedure in 'vxdisk set' based on the common rule of VxVM, i.e., if a disk group is specified with the '-g' option, then only a DM name is supported; otherwise it can be a DA name.

* Incident no::2488042 Tracking ID ::2431423
Symptom::Panic in vol_mv_commit_check() while accessing a Data Change Map (DCM) object. Stack trace of the panic: vol_mv_commit_check at ffffffffa0bef79e vol_ktrans_commit at ffffffffa0be9b93 volconfig_ioctl at ffffffffa0c4a957 volsioctl_real at ffffffffa0c5395c vols_ioctl at ffffffffa1161122 sys_ioctl at ffffffff801a2a0f compat_sys_ioctl at ffffffff801ba4fb sysenter_do_call at ffffffff80125039
Description::In case of a DCM failure, the object pointer is set to NULL as part of the transaction.
If the DCM is active, the DCM object is accessed in the transaction code path without checking it for NULL. The DCM object pointer could be NULL in the case of a failed DCM. Accessing the object pointer without a NULL check caused this panic.
Resolution::The fix is to add a NULL check for the DCM object in the transaction code path.

* Incident no::2491856 Tracking ID ::2424833
Symptom::The VVR primary node crashes while replicating over a lossy and high-latency network with multiple TCP connections. In a debug VxVM build, a TED assert is hit with the following stack: brkpoint+000004 () ted_call_demon+00003C (0000000007D98DB8) ted_assert+0000F0 (0000000007D98DB8, 0000000007D98B28, 0000000000000000) .hkey_legacy_gate+00004C () nmcom_send_msg_tcp+000C20 (F100010A83C4E000, 0000000200000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 000000DA000000DA, 0000000100000000) .nmcom_connect_tcp+0007D0 () vol_rp_connect+0012D0 (F100010B0408C000) vol_rp_connect_start+000130 (F1000006503F9308, 0FFFFFFFF420FC50) voliod_iohandle+0000AC (F1000006503F9308, 0000000100000001, 0FFFFFFFF420FC50) voliod_loop+000CFC (0000000000000000) vol_kernel_thread_init+00002C (0FFFFFFFF420FFF0) threadentry+000054 (??, ??, ??, ??)
Description::In a lossy and high-latency network, the connection between the VVR primary and secondary can get closed and re-established frequently because of heartbeat timeouts or DATA acknowledgement timeouts. In the TCP multi-connection scenario, the VVR primary sends its very first message (called NMCOM_HANDSHAKE) to the secondary on the zeroth socket connection and then sends an "NMCOM_SESSION" message for each of the next connections. If, for some reason, the sending of the NMCOM_HANDSHAKE message fails, the VVR primary tries to send it through another connection without checking whether that connection is valid or not.
Resolution::Code changes are made in VVR to use the other connections only after all the connections are established.

* Incident no::2492016 Tracking ID ::2232411
Symptom::Subsequent resize operations of raid5 or layered volumes may fail with "VxVM vxassist ERROR V-5-1-16092 Volume TCv7-13263: There are other recovery activities. Cannot grow volume"
Description::If a user tries to grow or shrink a raid5 volume or a layered volume more than once using the vxassist command, the command may fail with the above-mentioned error message.
Resolution::1. Skip setting the recover offset for RAID volumes. 2. For layered volumes: topvol: skip setting the recover offset; subvols: handled separately later (code exists).

Incidents from old Patches:
---------------------------
NONE