README VERSION : 1.1
README CREATION DATE : 2012-09-25
PATCH-ID : 142630-16
PATCH NAME : VRTSvxvm 5.1SP1RP3
BASE PACKAGE NAME : VRTSvxvm
BASE PACKAGE VERSION : 5.1
SUPERSEDED PATCHES : 142630-15
REQUIRED PATCHES : NONE
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV : sol10_x86 (P-PLATFORM , A-ARCHITECTURE , D-DISTRIBUTION , V-VERSION)
PATCH CATEGORY : CORE , CORRUPTION , HANG , MEMORYLEAK , PANIC , PERFORMANCE
PATCH CRITICALITY : OPTIONAL
HAS KERNEL COMPONENT : YES
ID : NONE
REBOOT REQUIRED : YES

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
NONE

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
NONE

SPECIAL INSTRUCTIONS:
---------------------
NONE

SUMMARY OF FIXED ISSUES:
-----------------------------------------
2070079 (1903700) Removing a mirror using vxassist does not work.
2205574 (1291519) After multiple VVR migrate operations, vrstat fails to output statistics.
2227908 (2227678) Second rlink goes into DETACHED STALE state in a multiple-secondaries environment when the SRL has overflowed for multiple rlinks.
2366232 (2024617) volpagemod_max_memsz setting in /kernel/drv/vxio.conf is not honoured after a system reboot.
2427560 (2425259) vxdg join operation fails with VE_DDL_PROPERTY: Property not found in the list.
2442751 (2104887) vxdg import error message needs improvement for cloned diskgroup import failure.
2442827 (2149922) Record the diskgroup import and deport events in syslog.
2492568 (2441937) vxconfigrestore precommit fails with awk errors.
2515137 (2513101) User data corrupted with disk label information.
2531224 (2526623) Memory leak detected in CVM code.
2560539 (2252680) vxtask abort does not clean up tasks properly.
2567623 (2567618) VRTSexplorer coredumps in checkhbaapi/print_target_map_entry.
2570988 (2560835) I/Os and vxconfigd hung on master node after slave is rebooted under heavy I/O load.
2576605 (2576602) 'vxdg listtag' should give an error message and display correct usage when executed with wrong syntax.
2613584 (2606695) Machine panics in CVR (Clustered Volume Replicator) environment while performing I/O operations.
2613596 (2606709) IO hang is seen when SRL overflows and one of the nodes reboots.
2616006 (2575172) I/Os are hung on master node after rebooting the slave node.
2622029 (2620556) IO hung after SRL overflow.
2622032 (2620555) IO hang due to SRL overflow & CVM reconfig.
2626742 (2626741) Using vxassist -o ordered and mediatype:hdd options together does not work as expected.
2626745 (2626199) "vxdmpadm list dmpnode" printing incorrect path-type.
2626989 (2533015) Powerpath/SVM controlled disks are not tagged as SVM in vxdisk list output.
2627000 (2578336) Failed to online the cdsdisk.
2627004 (2413763) Uninitialized memory read results in a vxconfigd coredump.
2627021 (2561012) The offsets of the private (and/or public) regions of disks are shown incorrectly in the vxdisk list output, which could lead to DG import problems as well as IO errors and system hangs reported by VxFS or other applications.
2641932 (2348199) vxconfigd dumps core while importing a Disk Group.
2646417 (2556781) In cluster environment, import attempt of imported disk group may return wrong error.
2652161 (2647975) Customer ran hastop -local and shared dg had splitbrain.
2663673 (2656803) Race between vxnetd start and stop operations causes panic.
2664682 (2653143) System panics while loading the vxdmp driver during VxVM 5.1SP1 installation.
2685343 (2684558) vxesd dumps core on startup in libc.
2690959 (2688308) When re-import of a disk group fails during master takeover, other shared disk groups should not be disabled.
2695226 (2648176) Performance difference on Master vs Slave during recovery via DCO.
2695231 (2689845) Data disk can go into error state when data at the end of the first sector of the disk is the same as the MBR signature.
2703035 (925653) Node join fails for higher CVMTimeout value.
2705101 (2216951) vxconfigd dumps core because chosen_rlist_delete() hits a NULL pointer in the linked list of clone disks.
2706024 (2664825) DiskGroup import fails when a disk contains no valid UDID tag on its config copy and the config copy is disabled.
2706027 (2657797) Starting a 32TB RAID5 volume fails with V-5-1-10128 Unexpected kernel error in configuration update.
2706038 (2516584) Startup scripts use 'exit' instead of 'quit', leaving empty directories in /tmp.
2730149 (2515369) vxconfigd(1M) can hang in the presence of EMC BCV devices.
2737373 (2556467) DMP-ASM: disabling all paths and rebooting the host causes loss of /etc/vx/.vxdmprawdev records.
2737374 (2735951) Uncorrectable write error is seen on a subdisk when a SCSI device/bus reset occurs.
2742708 (2742706) Panic due to a mutex not being released in vxlo_open.
2747340 (2739709) Disk group rebuild fails as the links between volume and vset were missing from 'vxprint -D -' output.
2750455 (2560843) In a VVR (Veritas Volume Replicator) setup, I/Os can hang on slave nodes after one of the slave nodes is rebooted.
2756069 (2756059) System may panic when a large cross-dg mirrored volume is started at boot.
2759895 (2753954) When a cable is disconnected from one port of a dual-port FC HBA, the paths via the other port are marked as SUSPECT PATH.
2763211 (2763206) "vxdisk rm" command dumps core when a disk name of very large length is given.
2768492 (2277558) vxassist outputs a misleading error message during snapshot related operations.
2800774 (2566174) Null pointer dereference in volcvm_msg_rel_gslock() results in panic.
2804911 (2637217) Document new storage allocation attribute support in the vradmin man page for resizevol/resizesrl.
2821027 (2807158) On the Solaris platform, sometimes the system can hang during VM upgrade or patch installation.
2821137 (2774406) System may panic while accessing the data change map volume.
2821143 (1431223) "vradmin syncvol" and "vradmin syncrvg" commands do not work if the remote diskgroup and vset names are specified when synchronizing vsets.
2821452 (2495332) vxcdsconvert fails if the private region of the disk to be converted is less than 1 MB.
2821678 (2389554) vxdg listssbinfo output is incorrect.
2821695 (2599526) IO hang seen when DCM is zero.
2826129 (2826125) VxVM script daemon is terminated abnormally on its invocation.
2826607 (1675482) "vxdg list " command shows configuration copy in new failed state.
2827791 (2760181) Panic hit on secondary slave during logowner operation.
2827794 (2775960) In secondary CVR case, IO hang seen on a DG during SRL disable activity on another DG.
2827939 (2088426) Re-onlining of disks in a DG during DG deport/destroy.
2836529 (2836528) Unable to grow a LUN dynamically on Solaris x86 using the "vxdisk resize" command.
2836910 (2818840) Enhance the vxdmpasm utility so that various permissions and "root:non-system" ownership can be set persistently.
2845984 (2739601) vradmin repstatus output occasionally reports an abnormal timestamp.
2852270 (2715129) Vxconfigd hangs during Master takeover in a CVM (Clustered Volume Manager) environment.
2858859 (2858853) After master switch, vxconfigd dumps core on the old master.
2859390 (2000585) vxrecover doesn't start the remaining volumes if one of the volumes is removed during the vxrecover command run.
2860281 (2838059) VVR Secondary panic in vol_rv_update_expected_pos.
2860445 (2627126) IO hang seen due to IOs stuck at DMP level.
2860449 (2836798) In VxVM, resizing a simple EFI disk fails and causes system panic/hang.
2860451 (2815517) vxdg adddisk allows mixing of clone & non-clone disks in a DiskGroup.
2860812 (2801962) Growing a volume takes a significantly long time when the volume has a version 20 DCO attached to it.
2862024 (2680343) Manual disable/enable of paths to an enclosure leads to system panic.
2867001 (2866997) VxVM disk initialization fails as an un-initialized variable gets an unexpected value after OS patch installation.
2876116 (2729911) IO errors seen during controller reboot or array port disable/enable.
2882488 (2754819) Diskgroup rebuild through 'vxmake -d' loops infinitely if the diskgroup configuration has multiple objects on a single cache object.
2886083 (2257850) vxdiskadm leaks memory while performing operations related to enclosures.
2911010 (2627056) 'vxmake -g DGNAME -d desc-file' fails with a very large configuration due to memory leaks.
2927307 (2930396) The vxdmpasm command (in the 5.1SP1 release) and the vxdmpraw command (in the 6.0 release) do not work on the Solaris platform.

SUMMARY OF KNOWN ISSUES:
-----------------------------------------
2223250 (2165829) Node is not able to join the cluster when recovery is in progress.

KNOWN ISSUES:
--------------
* INCIDENT NO:2223250 TRACKING ID:2165829
SYMPTOM: Node join fails if the recovery for the leaving node is not completed.
WORKAROUND: Retry the node join after the recovery is completed.

FIXED INCIDENTS:
----------------
PATCH ID:142630-16

* INCIDENT NO:2070079 TRACKING ID:1903700
SYMPTOM: vxassist remove mirror does not work if the nmirror and alloc attributes are specified, giving the error "Cannot remove enough mirrors".
DESCRIPTION: During the remove mirror operation, VxVM does not perform a correct analysis of the plexes. Hence the issue.
RESOLUTION: Necessary code changes have been done so that vxassist works properly.

* INCIDENT NO:2205574 TRACKING ID:1291519
SYMPTOM: After two VVR migrate operations, the vrstat command does not output any statistics.
DESCRIPTION: A migrate operation results in the RDS (Replicated Data Set) information getting updated in vradmind on both the primary and the secondary side. After multiple migrate operations, vrstat uses a stale handle to an older RDS to retrieve statistics, resulting in the failure.
RESOLUTION: Necessary code changes have been made to ensure that the correct and updated RDS handle is used by vrstat to retrieve statistics.

* INCIDENT NO:2227908 TRACKING ID:2227678
SYMPTOM: In the case of multiple secondaries, if one secondary has overflowed and is in resync mode and another secondary then overflows, the rlink corresponding to the latter secondary gets DETACHED and is not able to connect again. Even a complete resynchronization does not recover the detached rlink.
DESCRIPTION: When the latter rlink overflows, it is detached. At the time of the detach, the rlink goes into an incorrect and unrecoverable state, which leaves it unable to connect again.
RESOLUTION: Changes have been made to ensure that when a resync is ongoing for one of the rlinks and another rlink overflows, it gets detached and a valid state is maintained for that rlink. Hence, a full synchronization at a later time can recover the rlink completely.

* INCIDENT NO:2366232 TRACKING ID:2024617
SYMPTOM: The volpagemod_max_memsz setting in /kernel/drv/vxio.conf is not honoured after a system reboot.
/kernel/drv/vxio.conf
===================
myldom1# cat /kernel/drv/vxio.conf
name="vxio" parent="pseudo" ;
volpagemod_max_memsz=131072 ;
After reboot:
myldom1# echo 'volpagemod_max_memsz/E' | mdb -k
volpagemod_max_memsz:
volpagemod_max_memsz:           65536
DESCRIPTION: According to the vxtune(1M) man page, a persistent change to volpagemod_max_memsz should be set in the file /kernel/drv/vxio.conf:
Note: Using vxtune to adjust the value of volpagemod_max_memsz does not persist across system reboots unless you also adjust the value that is configured in the /kernel/drv/vxio.conf file.
However, the parameter is not persistent across a reboot; it gets reset to the default value after the reboot.
RESOLUTION: The code has been modified to use ddi_prop_get_int() to allow a persistent setting of volpagemod_max_memsz across reboots.

* INCIDENT NO:2427560 TRACKING ID:2425259
SYMPTOM: The vxdg join operation fails, throwing the error "join failed : Invalid attribute specification".
DESCRIPTION: For a disk name containing the "/" character, e.g. cciss/c0d1, the join operation fails to parse the disk name and hence returns an error.
RESOLUTION: Code changes are made to handle special characters in the disk name.

* INCIDENT NO:2442751 TRACKING ID:2104887
SYMPTOM: vxdg import fails with the following ERROR message for a cloned device import, when the original diskgroup is already imported with the same DGID.
# vxdg -Cfn clonedg -o useclonedev=on -o tag=tag1 import testdg
VxVM vxdg ERROR V-5-1-10978 Disk group testdg: import failed: Disk group exists and is imported
DESCRIPTION: In the case of a clone device import, vxdg import without the "-o updateid" option fails if the original DG is already imported. The error message returned may be interpreted to mean that a diskgroup with the same name is already imported, while actually it is the dgid that is duplicated, not the dgname.
RESOLUTION: The vxdg utility is modified to return a better error message for a cloned DG import. It directs you to get details from the system log. Details of the conflicting dgid and a suggestion to use "-o updateid" are added to the system log.

* INCIDENT NO:2442827 TRACKING ID:2149922
SYMPTOM: Record the diskgroup import and deport events in the /var/adm/messages file. The following type of message can be logged in syslog:
vxvm:vxconfigd: V-5-1-16254 Disk group import of succeeded.
DESCRIPTION: With a diskgroup import or deport, an appropriate success message, or a failure message with the cause of the failure, should be logged.
RESOLUTION: Code changes are made to log diskgroup import and deport events in syslog.

* INCIDENT NO:2492568 TRACKING ID:2441937
SYMPTOM: The vxconfigrestore(1M) command fails with the following error:
"The source line number is 1.
awk: Input line 22 | cannot be longer than 3,000 bytes."
DESCRIPTION: In the function where the disk attributes are read from the backup, the disk attributes are stored in the variable "$disk_attr". The value of this variable can be a line longer than 3,000 bytes. The variable "$disk_attr" is later parsed by the awk(1) command, which hits the awk(1) limitation of 3,000 bytes per input line.
RESOLUTION: The code is modified to replace the awk(1) command with the cut(1) command, which does not have this limitation.

* INCIDENT NO:2515137 TRACKING ID:2513101
SYMPTOM: When VxVM is upgraded from 4.1MP4RP2 to 5.1SP1RP1, the data on a CDS disk gets corrupted.
DESCRIPTION: When CDS disks are initialized with VxVM version 4.1MP4RP2, the number of cylinders is calculated based on the raw disk geometry. If the calculated number of cylinders exceeds the Solaris VTOC limit (65535), an unsigned integer overflow causes a truncated value of the number of cylinders to be written in the CDS label. After VxVM is upgraded to 5.1SP1RP1, the CDS label gets wrongly written in the public region, leading to the data corruption.
RESOLUTION: The code changes are made to suitably adjust the number of tracks and heads so that the calculated number of cylinders stays within the Solaris VTOC limit.

* INCIDENT NO:2531224 TRACKING ID:2526623
SYMPTOM: Memory leak detected in the CVM-DMP messaging phase. The following message is seen:
NOTICE: VxVM vxio V-5-3-3938 vol_unload(): not all memory has been freed (volkmem=424)
DESCRIPTION: During CVM-DMP messaging, memory was not getting freed in a specific scenario.
RESOLUTION: Necessary code changes have been done to take care of the memory deallocation.

* INCIDENT NO:2560539 TRACKING ID:2252680
SYMPTOM: When a paused VxVM (Veritas Volume Manager) task is aborted using the 'vxtask abort' command, it does not get aborted appropriately. It continues to show up in the output of the 'vxtask list' command and the corresponding process does not get killed.
DESCRIPTION: As the appropriate signal is not sent to the paused VxVM task that is being aborted, it fails to abort and continues to show up in the output of the 'vxtask list' command. Also, its corresponding process does not get killed.
RESOLUTION: Code changes are done to send an appropriate signal to paused tasks to abort them (see the example command sequence after incident 2570988 below).

* INCIDENT NO:2567623 TRACKING ID:2567618
SYMPTOM: VRTSexplorer dumps core in checkhbaapi/print_target_map_entry with a stack that looks like:
print_target_map_entry()
check_hbaapi()
main()
_start()
DESCRIPTION: The checkhbaapi utility uses the HBA_GetFcpTargetMapping() API, which returns the current set of mappings between operating system and fibre channel protocol (FCP) devices for a given HBA port. The maximum limit for mappings was set to 512 and only that much memory was allocated. When the number of mappings returned was greater than 512, the function that prints this information tried to access entries beyond that limit, which resulted in core dumps.
RESOLUTION: The code has been changed to allocate enough memory for all the mappings returned by HBA_GetFcpTargetMapping().

* INCIDENT NO:2570988 TRACKING ID:2560835
SYMPTOM: On the master, I/Os and vxconfigd get hung when a slave is rebooted under heavy I/O load.
DESCRIPTION: When a slave leaves the cluster without sending the DATA ack message to the master, the slave's I/Os get stuck on the master because their logend processing cannot be completed. At the same time, cluster reconfiguration takes place because the slave left the cluster. In the CVM (Cluster Volume Manager) reconfiguration code path these I/Os are aborted in order to proceed with the reconfiguration and recovery. But if local I/Os on the master go to the logend queue after the logendq is aborted, these local I/Os get stuck forever in the logend queue, leading to a permanent I/O hang.
RESOLUTION: During CVM reconfiguration and the subsequent RVG (Replicated Volume Group) recovery, no I/Os will be put into the logendq.
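The paused-task behaviour fixed in incident 2560539 above can be checked with the standard vxtask subcommands. The sequence below is only an illustrative sketch: the disk group, volume and task tag names ("mydg", "vol01", "mytask") are placeholders, and any long-running operation that registers a VxVM task can be used in place of vxrecover.

# vxrecover -g mydg -t mytask -b vol01     (start a recovery as a background task tagged "mytask")
# vxtask list                              (the task appears with its task id and state)
# vxtask pause mytask                      (the task now shows up as paused)
# vxtask abort mytask                      (with this patch, the paused task is aborted and its process exits)
# vxtask list                              (the aborted task should no longer be listed)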
* INCIDENT NO:2576605 TRACKING ID:2576602
SYMPTOM: The listtag option of the vxdg command gives results even when executed with the wrong syntax.
DESCRIPTION: The correct syntax, as per the vxdg help, is "vxdg listtag [diskgroup ...]". However, when executed with the wrong syntax, "vxdg [-g diskgroup] listtag", it still gives results.
RESOLUTION: Use the correct syntax as per the help for the vxdg command. The command has been modified from the 6.0 release onwards to display an error and usage message when the wrong syntax is used.

* INCIDENT NO:2613584 TRACKING ID:2606695
SYMPTOM: Panic in a CVR (Clustered Volume Replicator) environment while performing I/O operations. Panic stack traces might look like:
1) vol_rv_add_wrswaitq
   vol_get_timespec_latest
   vol_kmsg_obj_request
   vol_kmsg_request_receive
   vol_kmsg_receiver
   kernel_thread
2) vol_rv_mdship_callback
   vol_kmsg_receiver
   kernel_thread
DESCRIPTION: In CVR, the logclient requests METADATA information from the logowner node to perform write operations. The logowner node looks for any duplicate messages before adding the requests to the queue for processing. When a duplicate request arrives, the logowner tries to copy the data from the original I/O request and responds to the logclient with the METADATA information. During this process, a panic can occur i) while copying the data, as the code handling the "copy" is not properly locked, or ii) if the logclient receives inappropriate METADATA information because of an improper copy.
RESOLUTION: Code changes are done to apply appropriate conditions and locks while copying the data from the original I/O requests for the duplicates.

* INCIDENT NO:2613596 TRACKING ID:2606709
SYMPTOM: SRL overflow combined with a CVR reconfiguration leads to a reconfiguration hang.
DESCRIPTION: In the reported problem there are 6 RVGs, each with 16 data volumes; the problem can occur whenever more than one RVG is configured. Both master and slave nodes are performing I/O. The slave node is rebooted, which triggers a reconfiguration. All 6 RVGs are doing I/O, which fully utilizes the RVIOMEM pool (the memory pool used for RVG I/Os). Due to the node leave, the I/Os on all the RVGs come to a halt, waiting for the recovery flag to be set by the reconfiguration code path. Some pending I/Os in all the RVGs are still kept in the queue, due to holes in the SRL caused by the node leave. The RVIOMEM pool is completely used by 3 of the RVGs (600+ I/Os) which are still doing I/O. In the reconfiguration code, the first RVG is picked to abort all the pending I/Os in its queue and to wait for the active I/Os to complete. However, some I/Os are still waiting for memory from the RVIOMEM pool, and the other active RVGs are not releasing any memory; their I/Os are simply queued or waiting for memory. Until all the pending I/Os are serviced, the code does not move forward to abort the I/Os, so the reconfiguration never completes.
RESOLUTION: Instead of aborting and starting recovery one RVG at a time, the logic is changed to first abort the I/Os in all the RVGs, and then send the recovery message for all the RVGs after the I/O count drains to 0. This avoids a hang caused by some RVGs holding the memory.

* INCIDENT NO:2616006 TRACKING ID:2575172
SYMPTOM: The reconfigd thread is hung waiting for the I/O to drain.
DESCRIPTION: While doing a CVR (Clustered Volume Replicator) reconfiguration, RVG (Replicated Volume Group) recovery is started. The recovery can get stuck in a DCM (Data Change Map) read while flushing the SRL (Storage Replicator Log). The flush operation creates a large number (1000+) of threads, and when system memory is very low some of the associated memory allocations can fail. When an allocation fails, the child I/O count is not reduced, and this leads to the hang.
RESOLUTION: The number_of_children count is reset to 0 whenever the I/O creation fails due to a memory allocation failure.

* INCIDENT NO:2622029 TRACKING ID:2620556
SYMPTOM: I/O hangs on the primary after an SRL overflow, during the SRL flush and an rlink connect/disconnect.
DESCRIPTION: As part of an rlink connect or disconnect, the RVG is serialized to complete the connection or disconnection. I/O throttling is normal during the SRL flush, due to memory pool pressure or reaching the maximum throttle limit. During the serialization, I/O is throttled to complete the DCM flush, and the remote I/Os are kept in the throttleq while the throttling is in effect. Due to the I/O serialization, the throttled I/Os never get flushed, and as a result those I/Os never complete.
RESOLUTION: If the serialization is successful, the throttleq is flushed immediately. This makes sure that the remote I/Os get retried in the serialization code path.

* INCIDENT NO:2622032 TRACKING ID:2620555
SYMPTOM: During a CVM reconfiguration, the RVG waits for the I/O count to go to 0 before starting the RVG recovery and completing the reconfiguration, and this wait can last forever.
DESCRIPTION: In CVR, a node leave triggers a reconfiguration. The reconfiguration code path initiates RVG recovery for all the shared diskgroups. The recovery is needed to flush the SRL (shared by all the nodes) to the data volumes, to avoid missing any writes made to the data volumes by the leaving node. This recovery involves reading the data from the SRL and copying it to the data volumes. The flush may take considerable time, depending on the disk response time and the size of the SRL region that needs to be flushed. During the recovery, a flag is set on the RVG to prevent any new I/O. In this particular case, the recovery was taking 30 minutes. During this time another node leave happened, which triggered a second reconfiguration. Before triggering another recovery, the second reconfiguration waits for the I/O count to go to zero by setting the RECOVER flag on the RVG. The first RVG recovery cleared the RECOVER flag after 30 minutes, once the SRL flush completed. Since this is the same flag set by the second reconfiguration, and since clearing it allowed I/O to resume, the second reconfiguration kept waiting indefinitely for the I/O count to go to zero and was stuck forever.
RESOLUTION: If the RECOVER flag is already set, the reconfiguration code path does not keep waiting for the I/O count to become zero. There is no need for another recovery if the second reconfiguration starts before the first recovery completes.

* INCIDENT NO:2626742 TRACKING ID:2626741
SYMPTOM: vxassist, when used with the "-o ordered" and "mediatype:hdd" options during a striped volume make operation, does not maintain the disk order.
DESCRIPTION: When vxassist is invoked with the "-o ordered" and "mediatype:hdd" options while creating a striped volume, it does not maintain the disk order provided by the user. The first stripe of the volume should correspond to the first disk provided by the user.
RESOLUTION: The code is rectified to use the disks as per the user-specified disk order.

* INCIDENT NO:2626745 TRACKING ID:2626199
SYMPTOM: The "vxdmpadm list dmpnode" command shows the path-type value as "primary/secondary" for a LUN in an Active-Active array, as below, when it is supposed to be a NULL value.
dmpdev = c6t0d3
state = enabled
...
array-type = A/A ###path = name state type transport ctlr hwpath aportID aportWWN attr path = c23t0d3 enabled(a) secondary FC c30 2/0/0/2/0/0/0.0x50060e800 5c0bb00 - - - DESCRIPTION: For a LUN under Active-Active array the path-type value is supposed to be NULL. In this specific case other commands like "vxdmpadm getsubpaths dmpnode=<>" were showing correct (NULL) value for path-type. RESOLUTION: The "vxdmpadm list dmpnode" code path failed to initialize the path-type variable and by default set path-type to "primary or secondary" even for Active-Active array LUN's. This is fixed by initializing the path-type variable to NULL. * INCIDENT NO:2626989 TRACKING ID:2533015 SYMPTOM: EMC disks which are part of SVM (Solaris Volume Manager) and controlled by EMC-powerpath (Third Party multipathing Driver) are not shown as type 'auto:SVM' disk in 'vxdisk list' output. It can also cause other severe issues if "vxdisksetup" command wrongly picks up the EMC disks under SVM to configure for use with VxVM. Example: -------- bash-3.00# vxdisk list DEVICE TYPE DISK GROUP STATUS disk_0 auto:cdsdisk - - online disk_1 auto:SVM - - SVM emcpower0s2 auto:none - - online invalid bash-3.00# cat /etc/lvm/md.cf # metadevice configuration file # do not hand edit d10 1 1 c2t5006016830603AE5d0s2 bash-3.00# vxdmpadm getsubpaths tpdnodename=emcpower0c NAME TPDNODENAME PATH-TYPE[-] DMPNODENAME ENCLR-TYPE ENCLR-NAME =============================================================================== c2t5006016830603AE5d0s2 emcpower0c - emcpower0s2 PP_EMC_CLARiiON pp_emc_clariion0 c2t5006016030603AE5d0s2 emcpower0c - emcpower0s2 PP_EMC_CLARiiON pp_emc_clariion0 DESCRIPTION: Following are the three cases which should be taken care when VxVM determines the SVM disk format while device discovery. 1. DMP controls the disks and /etc/lvm/md.cf file contains the CTD device names for SVM controlled devices. 2. TPD (Third Party multipathing Driver) controls the disks and /etc/lvm/md.cf file contains the TPD metanode name for SVM controlled devices. 3. TPD controls the disks and /etc/lvm/md.cf file contains the CTD device name for SVM controlled devices. VxVM was not taking care of third case hence couldn't determine the SVM disk format correctly. RESOLUTION: Fix is included to handle the third case so that VxVM can determine the SVM disk format correctly. Example: -------- bash-3.00# vxdisk list DEVICE TYPE DISK GROUP STATUS disk_0 auto:cdsdisk - - online disk_1 auto:SVM - - SVM emcpower0s2 auto:SVM - - SVM <<<<<< * INCIDENT NO:2627000 TRACKING ID:2578336 SYMPTOM: I/O error is encountered while accessing the cdsdisk. DESCRIPTION: This issue is seen only on defective cdsdisks, where s2 partition size in sector 0 label is less than sum of public region offset and public region length. RESOLUTION: A solution has been implemented to rectify defective cdsdisk at the time the cdsdisk is onlined. * INCIDENT NO:2627004 TRACKING ID:2413763 SYMPTOM: vxconfigd, the VxVM daemon dumps core with the following stack: ddl_fill_dmp_info ddl_init_dmp_tree ddl_fetch_dmp_tree ddl_find_devices_in_system find_devices_in_system mode_set setup_mode startup main __libc_start_main _start DESCRIPTION: Dynamic Multi Pathing node buffer declared in the Device Discovery Layer was not initialized. Since the node buffer is local to the function, an explicit initialization is required before copying another buffer into it. RESOLUTION: The node buffer is appropriately initialized using memset() to address the coredump. 
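The DDL discovery code path shown in the stack above (find_devices_in_system and the ddl_* routines) runs whenever VxVM device discovery is triggered. As an illustrative note only, discovery can be re-run with the standard commands after the patch is applied:

# vxdctl enable        (requests vxconfigd to rescan devices and rebuild the DMP/DDL view)
# vxdisk scandisks     (scans for new or changed disks)
# vxdisk list          (confirms that vxconfigd is responsive and the devices are listed)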
* INCIDENT NO:2627021 TRACKING ID:2561012 SYMPTOM: In a cluster, if a disk is re-initialized and added to a DG on one node and DG is imported on another node the disk region offsets (and/or disk format type) is incorrect. This could result in disk listing errors or errors while doing IO over the disks or overlaying volumes. There could also be FS level hang or errors. DESCRIPTION: VxVM caches the disk related information for enhancing performance. In case of cluster if the disk layout is updated on disk for a disk which is not part of a shared DG, it is not reflected on all the nodes of cluster. When the DG is imported on another node, stale layout and format information about the disk is used which leads to incorrect offset of data on the disk. This could lead to incorrect display of disk attributes and type as well as IO error and subsequent issues in FS and other applications using the disk. RESOLUTION: Necessary code changes have been done to ensure that when importing a non-shared DG, the on-disk information about the disk is read rather than using the in memory information for the same. * INCIDENT NO:2641932 TRACKING ID:2348199 SYMPTOM: vxconfigd dumps core during Disk Group import with the following function call stack strcmp+0x60 () da_find_diskid+0x300 () dm_get_da+0x250 () ssb_check_disks+0x8c0 () dg_import_start+0x4e50 () dg_reimport+0x6c0 () dg_recover_all+0x710 () mode_set+0x1770 () setup_mode+0x50 () startup+0xca0 () main+0x3ca0 () DESCRIPTION: During Disk Group import, vxconfigd performs certain validations on the disks. During one such validation, it iterates through the list of available disk access records to find a match with a given disk media record. It does a string comparison of the disk IDs in the two records to find a match. Under certain conditions, the disk ID for a disk access record may have a NULL value. vxconfigd dumps core when it passes this to strcmp() function. RESOLUTION: Code was modified to check for disk access records with NULL value and skip them from disk ID comparison. * INCIDENT NO:2646417 TRACKING ID:2556781 SYMPTOM: In cluster environment, importing a disk group which is imported on another node will result in wrong error messages like given below: VxVM vxdg ERROR V-5-1-10978 Disk group : import failed: Disk is in use by another host DESCRIPTION: When VxVM is translating a given disk group name to disk group id during the disk group import process, an error return indicating that the disk group is in use by another host may be overwritten by a wrong error. RESOLUTION: The source code has been changed to handle the return value in a correct way. * INCIDENT NO:2652161 TRACKING ID:2647975 SYMPTOM: Serial Split Brain (SSB) condition caused Cluster Volume Manager(CVM) Master Takeover to fail. The below vxconfigd debug output was noticed when the issue was noticed: VxVM vxconfigd NOTICE V-5-1-7899 CVM_VOLD_CHANGE command received V-5-1-0 Preempting CM NID 1 VxVM vxconfigd NOTICE V-5-1-9576 Split Brain. da id is 0.5, while dm id is 0.4 for dm cvmdgA-01 VxVM vxconfigd WARNING V-5-1-8060 master: could not delete shared disk groups VxVM vxconfigd ERROR V-5-1-7934 Disk group cvmdgA: Disabled by errors VxVM vxconfigd ERROR V-5-1-7934 Disk group cvmdgB: Disabled by errors ... 
VxVM vxconfigd ERROR V-5-1-11467 kernel_fail_join() : Reconfiguration interrupted: Reason is transition to role failed (12, 1) VxVM vxconfigd NOTICE V-5-1-7901 CVM_VOLD_STOP command received DESCRIPTION: When Serial Split Brain (SSB) condition is detected by the new CVM master, on Veritas Volume Manager (VxVM) versions 5.0 and 5.1, the default CVM behaviour will cause the new CVM master to leave the cluster and causes cluster-wide downtime. RESOLUTION: Necessary code changes have been done to ensure that when SSB is detected in a diskgroup, CVM will only disable that particular diskgroup and keep the other diskgroups imported during the CVM Master Takeover, the new CVM master will not leave the cluster with the fix applied. * INCIDENT NO:2663673 TRACKING ID:2656803 SYMPTOM: VVR (Veritas Volume Replicator) panics when vxnetd start/stop operations are invoked in parallel. Panic stack trace might look like: panicsys vpanic_common panic mutex_enter() vol_nm_heartbeat_free() vol_sr_shutdown_netd() volnet_ioctl() volsioctl_real() spec_ioctl() DESCRIPTION: vxnetd start and stop operations are not serialized. Hence we hit race condition and panic if they are run in parallel, when they access the shared resources without locks. The panic stack varies depending on where the resource contention is seen. RESOLUTION: Incorporated synchronization primitive to allow only either the vxnetd start or stop process to run at a time. * INCIDENT NO:2664682 TRACKING ID:2653143 SYMPTOM: When VxVM 5.1SP1 is installed on Solaris 10 the system panics while loading vxdmp driver during the installation. Estimated time remaining: 2:15 4 of 35 Performing SFHA preinstall tasks .................................. Done Installing VRTSvlic package ....................................... Done Installing VRTSperl package ....................................... Done Installing VRTSspt package ........................................ Done Installing VRTSvxvm package . . . panic[cpu5]/thread=2a100d87ca0: BAD TRAP: type=31 rp=2a100d86f20 addr=280 mmu_fsr=0 occurred in module "unix" due to a NULL pointer dereference sched: trap type = 0x31 addr=0x280 pid=0, pc=0x104bde4, sp=0x2a100d867c1, tstate=0x4480001606, context=0x0 g1-g7: 119b8c8, 0, 1, 30006ef8de8, 0, 0, 2a100d87ca0 DESCRIPTION: The _init() routine doesn't initialize necessary data structures before mod_install(). This leads to a system panic. RESOLUTION: Necessary code changes have been done to ensure that all the data structures get initialized at the time of driver installation. * INCIDENT NO:2685343 TRACKING ID:2684558 SYMPTOM: vxesd dumps core with following stack: =>[1] libc.so.1:atoi() [2] vxesd:esd_fc_getctlr_lname() [3] vxesd:esd_write_fc() [4] vxesd:esd_create_configfile() [5] vxesd:es_update_ddlconfig() [6] vxesd:main() DESCRIPTION: While creating the DDL configuration file (ddlconfig.info) ESD fails with unhandled NULL pointers and dump the core file. RESOLUTION: Handle the NULL pointer properly to avoid exception in the ESD code path. * INCIDENT NO:2690959 TRACKING ID:2688308 SYMPTOM: When re-import of a disk group fails during master takeover, it makes all the shared disk groups to be disabled. It also results in the corresponding node (new master) leaving the cluster. DESCRIPTION: In cluster volume manager when master goes down, the upcoming master tries to re-import the disk group. If some error occurs while re-importing the disk group then it disables all the shared disk groups and the new master leaves the cluster. This may result in cluster outage. 
RESOLUTION: Code changes are made to disable the disk group on which error occurred while re-importing and continue import of the other shared disk groups. * INCIDENT NO:2695226 TRACKING ID:2648176 SYMPTOM: In a clustered volume manager environment, additional data synchronization is noticed during reattach of a detached plex on a mirrored volume even when there was no I/O on the volume after the mirror was detached. This behavior is seen only on mirrored volumes with version 20 DCO attached and is part of a shared diskgroup. DESCRIPTION: In a clustered volume manager environment, write I/Os issued on a mirrored volume from the CVM master node are tracked in a bitmap unnecessarily. The tracked bitmap is then used during detach to create the tracking map for detached plex. This results in additional delta between active plex and the detached plex. So, even when there are no I/Os after detach, the reattach will do additional synchronization between mirrors. RESOLUTION: The unnecessary bitmap tracking of write I/Os issued on a mirrored volume from the CVM master node is prevented. So, the tracking map that gets created during detach will always starts clean. * INCIDENT NO:2695231 TRACKING ID:2689845 SYMPTOM: Disks are seen in error state. hitachi_usp-vm0_11 auto - - error DESCRIPTION: When data at the end of the first sector of the disk is same as MBR signature, Volume Manager misinterprets the data disk as MBR disk. Accordingly, partitions are determined but format determination fails for these fake partitions and disk goes into error state. RESOLUTION: Code changes are made to check the status field of the disk along with the MBR signature. Valid status fields for MBR disk are 0x00 and 0x80. * INCIDENT NO:2703035 TRACKING ID:925653 SYMPTOM: Node join fails when CVMTimeout is set to value higher than 35 mins (approximately). DESCRIPTION: Node join fails due to integer overflow for higher CVMTimeout value. RESOLUTION: Code changes done to handle higher CVMTimeout value. * INCIDENT NO:2705101 TRACKING ID:2216951 SYMPTOM: The vxconfigd daemon core dumps in the chosen_rlist_delete() function and the following stack trace is displayed: chosen_rlist_delete() req_dg_import_disk_names() request_loop() main() DESCRIPTION: The vxconfigd daemon core dumps when it accesses a NULL pointer in the chosen_rlist_delete() function. RESOLUTION: The code is modified to handle the NULL pointer in the chosen_rlist_delete() function * INCIDENT NO:2706024 TRACKING ID:2664825 SYMPTOM: The following two issues are seen when a cloned disk group having a mixture of disks which are clones of disks initialized under VxVM version 4.x and 5.x is imported. (i) The following error will be seen without "-o useclonedev=on -o updateid" options on 5.x environment with the import failure. # vxdg -Cf import VxVM vxdg ERROR Disk group : import failed: Disk group has no valid configuration copies (ii) The following warning will be seen with "-o useclonedev=on -o updateid" options on 5.x environment with the import success. # vxdg -Cf -o useclonedev=on -o updateid import VxVM vxdg WARNING Disk : Not found, last known location: ... DESCRIPTION: The vxconfigd, a VxVM daemon, imports a disk group having disks where all the disks should be either cloned or standard(non-clone). If the disk group has a mixture of cloned and standard devices, and user attempts to import the disk group - (i) without "-o useclonedev=on" options, only standard disks are considered for import. 
The import would fail if none of the standard disks have a valid configuration copy. (ii) with "-o useclonedev=on" option, the import would succeed, but the standard disks go missing because only clone disks are considered for import. A disk which is initialized in the VxVM version earlier to 5.x has no concept of Unique Disk Identifier(UDID) which helps to identify the cloned disk. It could not be flagged as cloned disk even if it is indeed a cloned disk. This results in the issues (i) and (ii). RESOLUTION: The source code is modified to set the appropriate flags so that the disks initialized in VxVM 4.X will be recognized as "cloned", and both of the issues (i) and (ii) will be avoided. * INCIDENT NO:2706027 TRACKING ID:2657797 SYMPTOM: Starting a RAID5 volume fails, when one of the sub-disks in the RAID5 column starts at an offset greater than 1TB. Example: # vxvol -f -g dg1 -o delayrecover start vol1 VxVM vxvol ERROR V-5-1-10128 Unexpected kernel error in configuration update DESCRIPTION: VxVM uses an integer variable to store the starting block offset of a sub-disk in a RAID5 column. This overflows when a sub-disk is located at an offset greater than 2147483647 blocks (1TB) and results in failure to start the volume. Refer to "sdaj" in the following example. E.g. v RaidVol - DETACHED NEEDSYNC 64459747584 RAID - raid5 pl RaidVol-01 RaidVol ENABLED ACTIVE 64459747584 RAID 4/128 RW [..] SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE sd DiskGroup101-01 RaidVol-01 DiskGroup101 0 1953325744 0/0 sdaa ENA sd DiskGroup106-01 RaidVol-01 DiskGroup106 0 1953325744 0/1953325744 sdaf ENA sd DiskGroup110-01 RaidVol-01 DiskGroup110 0 1953325744 0/3906651488 sdaj ENA RESOLUTION: VxVM code is modified to handle integer overflow conditions for RAID5 volumes. * INCIDENT NO:2706038 TRACKING ID:2516584 SYMPTOM: There are many random directories not cleaned up in /tmp/, like vx.$RANDOM.$RANDOM.$RANDOM.$$ on system startup. DESCRIPTION: In general the startup scripts should call quit(), in which it call do the cleanup when errors detected. The scripts were calling exit() directly instead of quit() leaving some random-created directories uncleaned. RESOLUTION: These script should be restored to call quit() instead of exit() directly. * INCIDENT NO:2730149 TRACKING ID:2515369 SYMPTOM: vxconfigd(1M) can hang in the presence of EMC BCV devices in established (bcv-nr) state with a call stack similar to the following is observed: inline biowait_rp biowait dmp_indirect_io gendmpioctl dmpioctl spec_ioctl vno_ioctl ioctl syscall Also, a message similar to the following can be seen in the syslog: NOTICE: VxVM vxdmp V-5-3-0 gendmpstrategy: strategy call failed on bp
, path devno 255/ DESCRIPTION: The issue can happen during device discovery. While reading the device information, the device is expected to be opened in block mode, but the device was incorrectly being opened in character mode causing the hang. RESOLUTION: The code was changed to open the block device from DMP indirect IO code path. * INCIDENT NO:2737373 TRACKING ID:2556467 SYMPTOM: When dmp_native_support is enabled, ASM (Automatic Storage Management) disks are disconnected from host and host is rebooted, user defined user-group ownership of respective DMP (Dynamic Multipathing) devices is lost and ownership is set to default values. DESCRIPTION: The user-group ownership records of DMP devices in /etc/vx/.vxdmprawdev file are refreshed at the time of boot and only the records of currently available devices are retained. As part of refresh, records of all the disconnected ASM disks are removed from /etc/vx/.vxdmpraw and hence set to default value. RESOLUTION: Made code changes so that the file /etc/vx/.vxdmprawdev will not be refreshed at boot time. * INCIDENT NO:2737374 TRACKING ID:2735951 SYMPTOM: Following messages can be seen in syslog: SCSI error: return code = 0x00070000 I/O error, dev , sector VxVM vxdmp V-5-0-0 i/o error occurred (errno=0x0) on dmpnode / DESCRIPTION: When the SCSI resets happen, the I/O fails with PATH_OK or PATH_RETRY error. As time bound recovery is default recovery option, VxVM retries the I/O till timeout. Because of miscalculation of time taken by each I/O retry, total timeout value is reduced drastically. All retries fail with the same error in this small timeout value and uncorrectable error occurs. RESOLUTION: Code changes are made to calculate the timeout value properly. * INCIDENT NO:2742708 TRACKING ID:2742706 SYMPTOM: The system panic can happen with following stack, when the Oracle 10G Grid Agent Software invokes the command :- # nmhs get_solaris_disks unix:lock_try+0x0() genunix:turnstile_interlock+0x1c() genunix:turnstile_block+0x1b8() unix:mutex_vector_enter+0x428() unix:mutex_enter() - frame recycled vxlo:vxlo_open+0x2c() genunix:dev_open() - frame recycled specfs:spec_open+0x4f4() genunix:fop_open+0x78() genunix:vn_openat+0x500() genunix:copen+0x260() unix:syscall_trap32+0xcc() DESCRIPTION: The open system call code path of the vxlo (Veritas Loopback Driver) is not releasing the acquired global lock after the work is completed. The panic may occur when the next open system call tries to acquire the lock. RESOLUTION: Code changes have been made to release the global lock appropriately. * INCIDENT NO:2747340 TRACKING ID:2739709 SYMPTOM: While rebuilding disk group,maker file generated from "vxprint - dmvpshrcCx -D -" command does not have the links between volumes and vset. Hence,rebuild of disk group fails. DESCRIPTION: File generated by the "vxprint -dmvpshrcCx -D -" command does not have the link between volumes and vset(volume set) due to which diskgroup rebuilding fails. RESOLUTION: Code changes are done to maintain the link between volumes and vsets. * INCIDENT NO:2750455 TRACKING ID:2560843 SYMPTOM: In 3 or more node cluster, when one of the slaves is rebooted under heavy I/O load, the I/Os hang on the other slave. Example : Node A (master and logowner) Node B (slave 1) Node C (slave 2) If Node C is doing a heavy I/Os and Node B is rebooted, the I/Os on Node C gets hung. DESCRIPTION: When Node B leaves the cluster, its throttled I/Os are aborted and all the resources taken by these I/Os are freed. 
Along with these I/Os, throttled I/Os of Node C are also responded that resources are not available to let Node C resend those I/Os. But during this process, region locks hold by these I/Os on master are not freed. RESOLUTION: All the resources taken by the remote I/Os on master are freed properly. * INCIDENT NO:2756069 TRACKING ID:2756059 SYMPTOM: During boot process when vxvm starts large cross-dg mirrored volume (>1.5TB), system may panic with following stack: vxio:voldco_or_drl_to_pvm vxio:voldco_write_pervol_maps_20 vxio:volfmr_write_pervol_maps vxio:volfmr_copymaps_instant vxio:volfmr_copymaps vxio:vol_mv_precommit vxio:vol_commit_iolock_objects vxio:vol_ktrans_commit vxio:volconfig_ioctl vxio:volsioctl_real DESCRIPTION: During resync of the cross-dg mirrored volume DRL(dirtly region logging) log is changed to track map on the volume. While changing map pointer calculation is not done properly. Due to wrong moving forward step of the pointer, array out of bounds issue occurs for very large volume leading to panic. RESOLUTION: The code changes are done to fix the wrong pointer increment. * INCIDENT NO:2759895 TRACKING ID:2753954 SYMPTOM: When cable is disconnected from one port of a dual-port FC HBA, only paths going through the port should be marked as SUSPECT. But paths going through other port are also getting marked as SUSPECT. DESCRIPTION: Disconnection of a cable from a HBA port generates a FC event. When the event is generated, paths of all ports of the corresponding HBA are marked as SUSPECT. RESOLUTION: The code changes are done to mark the paths only going through the port on which FC event is generated. * INCIDENT NO:2763211 TRACKING ID:2763206 SYMPTOM: vxdisk rm dumps core with following stack trace vfprintf volumivpfmt volpfmt do_rm DESCRIPTION: While copying the disk name of very large length array bound checking is not done which causes buffer overflow. Segmentation fault occurs while accessing corrupted memory, terminating "vxdisk rm" process. RESOLUTION: Code changes are done to do array bound checking to avoid such buffer overflow issues. * INCIDENT NO:2768492 TRACKING ID:2277558 SYMPTOM: vxassist outputs an error message while doing snapshot related operations. The message looks like : "VxVM VVR vxassist ERROR V-5-1-10127 getting associations of rvg : Property not found in the list" DESCRIPTION: The error message is being displayed incorrectly. There is an error condition which is getting masked by a previously occurred error which vxassist chose to ignore, and went ahead with the operation. RESOLUTION: Fix has been added to reset previously occurred error which has been ignored, so that the real error is displayed by vxassist. * INCIDENT NO:2800774 TRACKING ID:2566174 SYMPTOM: In a Clustered Volume Manager environment, the node which is taking over as MASTER dumped core because of NULL pointer dereference while releasing the ilocks. The stack is given below: vxio:volcvm_msg_rel_gslock vxio:volkmsg_obj_sio_start vxio:voliod_iohandle vxio:voliod_loop DESCRIPTION: The issue is seen due to offloading glock messages to the io daemon threads. When VxVM io daemon threads are processing the glock release messages, the interlock release and free happens after invoking the kernel message complete routine. This has a side effect that the reference count on the control block becomes zero and if garbage collection is running at this stage, it will end up freeing the message from the garbage queue. 
So, if there is a resend of the same message, there will be two contexts processing the same interlock free request. The receiver thread for which interlock is NULL and freed from other context, panic occurs. RESOLUTION: Code changes are done to offload glock messages to VxVM io daemon threads after processing the control block. Also the kernel message response routine is invoked after checking whether interlock release is required and releasing it. * INCIDENT NO:2804911 TRACKING ID:2637217 SYMPTOM: The storage allocation attributes 'pridiskname' and 'secdiskname' are not documented in the vradmin man page for resizevol/resizesrl. DESCRIPTION: The 'pridiskname' and 'secdiskname' are optional arguments to the vradmin resizevol and vradmin resizesrl commands, which enable users to specify a comma- separated list of disk names for the resize operation on a VVR data volume and SRL, respectively. These arguments were introduced in 5.1SP1, but were not documented in the vradmin man page. RESOLUTION: The vradmin man page has been updated to document the storage allocation attributes 'pridiskname' and 'secdiskname' for the vradmin resizevol and vradmin resizesrl commands. * INCIDENT NO:2821027 TRACKING ID:2807158 SYMPTOM: During VM upgrade or patch installation on Solaris platform, sometimes the system can hang due to deadlock with following stack: genunix:cv_wait genunix:ndi_devi_enter genunix:devi_config_one genunix:ndi_devi_config_one genunix:resolve_pathname genunix:e_ddi_hold_devi_by_path vxspec:_init genunix:modinstall genunix:mod_hold_installed_mod genunix:modrload genunix:modload genunix:mod_hold_dev_by_major genunix:ndi_hold_driver genunix:probe_node genunix:i_ndi_config_node genunix:i_ddi_attachchild DESCRIPTION: During the upgrade or patch installation, the vxspec module is unloaded and reloaded. In the vxspec module initialization, it tries to lock root node during the pathname go-through while already holding the subnode, i.e, /pseudo. Meanwhile, if there is another process holding the lock of root node is acquiring the lock of the subnode /pseudo, the deadlock occurs since each process tries to get the lock already hold by peer. RESOLUTION: APIs which are introducing deadlock are replaced. * INCIDENT NO:2821137 TRACKING ID:2774406 SYMPTOM: The system may panic while referencing the DCM(Data Channge Map) object attached to the volume, with following stack: vol_flush_srl_to_dv_start voliod_iohandle voliod_loop DESCRIPTION: When volume tries to flush the DCM to track the I/O map, if disk attached to the DCM is not available, DCM state is set to aborting before marking inactive. Since the current state of volume is till ACTIVE, trying to access the DCM object causes panic. RESOLUTION: Code changes are done to check if DCM is not in aborting state before proceeding with the DCM flush. * INCIDENT NO:2821143 TRACKING ID:1431223 SYMPTOM: "vradmin syncvol" and "vradmin syncrvg" commands do not work if the remote diskgroup and vset names are specified when synchronizing vsets. DESCRIPTION: When command "vradmin syncvol" or "vradmin syncrvg" for vset is executed, vset is expanded to its component volumes and path is generated for each component volume. But when remote vset names are specified on command line, it fails to expand remote component volumes correctly. Synchronization will fail because of incorrect path for volumes. RESOLUTION: Code changes have been made to ensure remote vset is expanded correctly when specified on command line. 
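As a usage illustration for incident 2804911 above: the 'pridiskname' and 'secdiskname' attributes are passed as optional trailing arguments to the resize commands. The disk group, RVG, volume and disk names and the sizes below are placeholders, and the exact operand order should be taken from the updated vradmin(1M) man page.

# vradmin -g mydg resizevol my_rvg my_datavol 10g pridiskname=disk1,disk2 secdiskname=disk7,disk8
# vradmin -g mydg resizesrl my_rvg 2g pridiskname=disk3 secdiskname=disk9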
* INCIDENT NO:2821452 TRACKING ID:2495332
SYMPTOM: vxcdsconvert(1M) fails with the following error if the private region length of the disk is less than 1 MB and a single sub-disk spans the entire disk:
# vxcdsconvert -g alldisks evac_subdisks_ok=yes
VxVM vxprint ERROR V-5-1-924 Record not found
VxVM vxprint ERROR V-5-1-924 Record not found
VxVM vxcdsconvert ERROR V-5-2-3174 Internal Error
VxVM vxcdsconvert ERROR V-5-2-3120 Conversion process aborted
DESCRIPTION: If the private region length of a non-CDS disk is less than 1 MB, vxcdsconvert internally tries to relocate subdisks at the start of the disk to create room for a 1 MB private region. To make room for back-up labels, vxcdsconvert also tries to relocate subdisks at the end of the disk. Two entries, one for relocation at the start and one at the end, are created during the analysis phase. Once the first sub-disk is relocated, the next vxprint operation fails because the sub-disk has already been evacuated to another DM (disk media record).
RESOLUTION: This problem is fixed by allowing the generation of multiple relocation entries for the same subdisk. If the sub-disk is later found to be already evacuated to another DM, the relocation is skipped for the subdisk with the same name.

* INCIDENT NO:2821678 TRACKING ID:2389554
SYMPTOM: The vxdg command of VxVM (Veritas Volume Manager), located in the /usr/sbin directory, shows an incorrect message for the ssb (Serial Split Brain) information of a disk group. The ssb information uses "DISK PRIVATE PATH" as an item, but the content is the public path of the disk. The ssb information also prints unknown characters to represent the config copy id of a disk if the disk's config copies are all disabled. Moreover, there is some redundant information in the output messages.
The "DISK PRIVATE PATH" error looks like this:
$ vxdisk list
...
pubpaths: block=/dev/vx/dmp/s4 char=/dev/vx/rdmp/s4
privpaths: block=/dev/vx/dmp/s3 char=/dev/vx/rdmp/s3
...
$ vxsplitlines -v -g mydg
...
Pool 0
DEVICE DISK DISK ID DISK PRIVATE PATH
/dev/vx/rdmp/s4
The unknown character error message looks like this:
$ vxdg import
...
To import the diskgroup with config copy from the second pool issue the command /usr/sbin/vxdg [-s] -o selectcp= import
...
DESCRIPTION: VxVM uses an SSB data structure to maintain the ssb information displayed to the user. The SSB data structure contains members such as the pool ID, config copy id, etc. After memory allocation for an SSB data structure, the newly allocated memory area is not initialized. If all the config copies of a disk are disabled, the config copy id member contains unknown data; when the vxdg command prints this data, unknown characters are displayed on stdout. The SSB data structure has a "disk public path" member but no "disk private path" member, so the output message can only display the public path of a disk.
RESOLUTION: The ssb structure has been changed to use "disk private path" instead of "disk public path". Moreover, after memory allocation for an ssb structure, the newly allocated memory is properly initialized. (See also the usage note after incident 2821695 below.)

* INCIDENT NO:2821695 TRACKING ID:2599526
SYMPTOM: The SRL to DCM flush does not happen, resulting in an I/O hang.
DESCRIPTION: After an SRL overflow, before the RU state machine phase could be changed to VOLRP_PHASE_SRL_FLUSH, the rlink connection thread sneaked in and changed the phase to VOLRP_PHASE_START_UPDATE. Once the phase is changed to VOLRP_PHASE_START_UPDATE, the state machine misses flushing the SRL into the DCM, goes into VOLRP_PHASE_DCM_WAIT, and gets stuck there.
RESOLUTION: The RU state machine phases are handled correctly after an SRL overflow.
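Usage note for incident 2821678 above: when ssb is reported on a disk group, the corrected private path and config copy information can be reviewed, and a particular config copy can be selected at import time, using the commands already quoted in that incident (the disk group name "mydg" and the copy id are placeholders):

# vxsplitlines -v -g mydg                          (lists the pools of disks whose config copies agree)
# /usr/sbin/vxdg -o selectcp=<copyid> import mydg  (imports the disk group using the chosen config copy)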
* INCIDENT NO:2826129 TRACKING ID:2826125
SYMPTOM: VxVM script daemons are not up after being invoked via the vxvm-recover script.
DESCRIPTION: When the VxVM script daemon starts, it terminates any stale instance that may exist. When the script daemon happens to be invoked with exactly the same process id as the previous invocation, the daemon terminates itself through this false-positive stale-instance detection.
RESOLUTION: Code changes were made to handle the same process id situation correctly.

* INCIDENT NO:2826607 TRACKING ID:1675482
SYMPTOM: The vxdg list command shows a configuration copy in the new failed state.
# vxdg list dgname
config disk 3PARDATA0_75 copy 1 len=48144 state=new failed
config-tid=0.1550 pending-tid=0.1551
Error: error=Volume error 0
DESCRIPTION: When a configuration copy is initialized on a new disk in a diskgroup, an I/O error on the disk can prevent the on-disk update and make the configuration copy inconsistent.
RESOLUTION: In the case of a failed initialization, the configuration copy is disabled. If required in the future, this disabled copy will be reused for setting up a new configuration copy. If the current state of a configuration copy is "new failed", the next import of the diskgroup will disable it. (See the verification note after incident 2827939 below.)

* INCIDENT NO:2827791 TRACKING ID:2760181
SYMPTOM: The secondary slave node hit a panic in vol_rv_change_sio_start() for an already active logowner operation.
DESCRIPTION: The slave node panics during the logowner change. The logowner change and the reconfiguration recovery process happen at the same time, leading to a race in setting the ACTIVE flag. The reconfiguration recovery unsets the flag which was set by the logowner change operation. In the middle of the logowner change operation the ACTIVE flag is therefore missing, and this leads to the system panic.
RESOLUTION: The appropriate lock is now taken in the logowner change code, and more debug log entries were added to better track logowner issues.

* INCIDENT NO:2827794 TRACKING ID:2775960
SYMPTOM: On a CVR secondary, disabling the SRL on one DG triggered an I/O hang on another DG.
DESCRIPTION: The failure of the SRL LUNs caused the failure on both DGs; the I/O failure messages confirmed the LUN failure on DG4 as well. After every 1024 I/Os to the SRL, the header of the SRL is flushed. In the SRL flush code, in this error scenario, the flush I/O is queued but never started. If the flush I/O does not complete, the application I/O hangs forever.
RESOLUTION: The fix is to start the flush I/O that is queued in the error scenario.

* INCIDENT NO:2827939 TRACKING ID:2088426
SYMPTOM: All disks are re-onlined on the master and the slaves, irrespective of whether they belong to the shared dg being destroyed/deported.
DESCRIPTION: When a shared dg is destroyed/deported, all disks on the nodes in the cluster are re-onlined asynchronously when the next command is issued. This results in wasteful usage of resources and can delay the next command substantially, depending on how many disks/LUNs are present on each node.
RESOLUTION: Re-onlining of LUNs/disks is restricted to those that belong to the dg being destroyed/deported. By doing this, the resource usage and the delay in the next command are limited to the number of LUNs/disks that belong to the dg in question.
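Verification note for incident 2826607 above (the disk group name "mydg" is a placeholder): the per-disk configuration copy states can be re-checked after an import with:

# vxdg list mydg    (the "config disk ..." lines show each copy's state; a copy whose initialization failed is now disabled rather than left in the "new failed" state)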
* INCIDENT NO:2836529 TRACKING ID:2836528

SYMPTOM:
vxdisk resize fails with the error "New geometry makes partition unaligned".
bash# vxdisk -g testdg resize disk01 length=8g
VxVM vxdisk ERROR V-5-1-8643 Device disk01: resize failed: New geometry makes partition unaligned

DESCRIPTION:
On Solaris x86 systems, partition 8 does not need to be aligned with the cylinder size. However, VxVM required this partition to be cylinder aligned, which caused the failure.

RESOLUTION:
The issue is fixed by skipping the alignment check for partition 8 on the Solaris x86 platform.

* INCIDENT NO:2836910 TRACKING ID:2818840

SYMPTOM:
1. The file permissions set on ASM devices are not persistent across reboot.
2. The user is not able to set the desired permissions on ASM devices.
3. Files created with user id "root" and a group other than "system" are not persistent.

DESCRIPTION:
The vxdmpasm utility sets the permissions of the devices to "660", which is not persistent across reboot because these devices are not recorded in the /etc/vx/.vxdmpasmdev file. There is currently no option that lets the user set the desired permissions. Files created with user id root and a group other than "system" are changed back to "root:system" upon reboot.

RESOLUTION:
The code is modified to record the device entries in the file /etc/vx/.vxdmpasmdev so that the permissions persist across reboot. The code is also enhanced to provide an option to set the desired permissions and the desired user id/group.

* INCIDENT NO:2845984 TRACKING ID:2739601

SYMPTOM:
vradmin repstatus output occasionally reports abnormal timestamp information.

DESCRIPTION:
Sometimes vradmin repstatus shows an abnormal timestamp in the "Timestamp Information" section of its output; the value reported is very large, something like 100 hours. This condition occurs when no data has been replicated to the secondary for a long time. It does not necessarily mean that the Rlinks were disconnected for a long time: even with the Rlinks connected, it is possible that no new data was written to the primary during that period, so no data was replicated to the secondary. If at this point the Rlink is paused and some writes are done, vradmin repstatus shows the abnormal timestamp.

RESOLUTION:
Whenever new data is written to the data volume and the Rlink is up-to-date, the timestamp is recorded. This ensures that an abnormal timestamp is not reported.

* INCIDENT NO:2852270 TRACKING ID:2715129

SYMPTOM:
Vxconfigd hangs during master takeover in a CVM (Clustered Volume Manager) environment. This results in vx commands hanging.

DESCRIPTION:
During master takeover, the VxVM (Veritas Volume Manager) kernel signals vxconfigd with the information of the new master. Vxconfigd then proceeds with a vxconfigd-level handshake with the nodes across the cluster. The vxconfigd handshake mechanism got started before the kernel could signal vxconfigd, resulting in the hang.

RESOLUTION:
Code changes are done to ensure that the vxconfigd handshake starts only upon receipt of the signal from the kernel.

* INCIDENT NO:2858859 TRACKING ID:2858853

SYMPTOM:
In a CVM (Cluster Volume Manager) environment, after a master switch, vxconfigd dumps core on the slave node (the old master) when a disk is removed from the disk group.
dbf_fmt_tbl()
voldbf_fmt_tbl()
voldbsup_format_record()
voldb_format_record()
format_write()
ddb_update()
dg_set_copy_state()
dg_offline_copy()
dasup_dg_unjoin()
dapriv_apply()
auto_apply()
da_client_commit()
client_apply()
commit()
dg_trans_commit()
slave_trans_commit()
slave_response()
fillnextreq()
vold_getrequest()
request_loop()
main()

DESCRIPTION:
During a master switch, the disk group configuration copy related flags are not cleared on the old master. Hence, when a disk is removed from a disk group, vxconfigd dumps core.

RESOLUTION:
Necessary code changes have been made to clear the configuration copy related flags during a master switch.

* INCIDENT NO:2859390 TRACKING ID:2000585

SYMPTOM:
If 'vxrecover -sn' is run and one volume is removed at the same time, vxrecover exits with the error 'Cannot refetch volume'; the exit status code is zero but no volumes are started.

DESCRIPTION:
vxrecover assumes that the volume is missing because the diskgroup must have been deported while vxrecover was in progress, and hence exits without starting the remaining volumes. vxrecover should be able to start the other volumes if the DG is not deported.

RESOLUTION:
The code is modified to skip the missing volume and proceed with the remaining volumes.

* INCIDENT NO:2860281 TRACKING ID:2838059

SYMPTOM:
The VVR secondary machine crashes with the following panic stack:
crash_kexec
__die
do_page_fault
error_exit
[exception RIP: vol_rv_update_expected_pos+337]
vol_rv_service_update
vol_rv_service_message_start
voliod_iohandle
voliod_loop
kernel_thread at ffffffff8005dfb1

DESCRIPTION:
If the VVR primary machine crashes without completing a few of the write I/Os to the data volumes, it fills the incomplete write I/Os with "DUMMY" I/Os in order to maintain write-order fidelity at the secondary. While processing such dummy updates on the secondary, because of a logical error, the secondary VVR code tries to dereference a NULL pointer, leading to the panic.

RESOLUTION:
The code changes are made in the VVR secondary's "DUMMY" update processing code path to correct the logic.

* INCIDENT NO:2860445 TRACKING ID:2627126

SYMPTOM:
An I/O hang is observed on the system as many I/Os are stuck in the DMP global queue.

DESCRIPTION:
Many I/Os and paths are stuck in dmp_delayq and dmp_path_delayq respectively, and the DMP daemon cannot process them because of a race condition between "processing the dmp_delayq" and "waking up the DMP daemon". A lock is held while processing the dmp_delayq and is released only for a very short duration; if any path is busy during this duration, it gives an I/O error, leading to the I/O hang.

RESOLUTION:
The global delay queue pointers are copied to local variables and the lock is held only for this period; the I/Os in the queue are then processed using the local queue variables.

* INCIDENT NO:2860449 TRACKING ID:2836798

SYMPTOM:
'vxdisk resize' fails with the following error on a simple format EFI (Extensible Firmware Interface) disk expanded from the array side, and the system may panic/hang after a few minutes.
# vxdisk resize disk_10
VxVM vxdisk ERROR V-5-1-8643 Device disk_10: resize failed: Configuration daemon error -1

DESCRIPTION:
As VxVM doesn't support Dynamic LUN Expansion on simple/sliced EFI disks, the last usable LBA (Logical Block Address) in the EFI header is not updated while expanding the LUN. Since the header is not updated, the partition end entry is regarded as illegal and cleared as part of the partition range check. This inconsistent partition information between the kernel and the disk causes the system panic/hang.
RESOLUTION:
Checks are added in the VxVM code to prevent DLE on simple/sliced EFI disks.

* INCIDENT NO:2860451 TRACKING ID:2815517

SYMPTOM:
vxdg adddisk allows adding a clone disk to a non-clone diskgroup and a non-clone disk to a clone diskgroup, resulting in a mixed diskgroup.

DESCRIPTION:
vxdg import fails for a diskgroup which has a mix of clone and non-clone disks, so vxdg adddisk should not allow creation of a mixed diskgroup.

RESOLUTION:
The vxdg adddisk code is modified to return an error for an attempt to add a clone disk to a non-clone diskgroup or a non-clone disk to a clone diskgroup, thus preventing the addition of disks that would lead to a mixed diskgroup.

* INCIDENT NO:2860812 TRACKING ID:2801962

SYMPTOM:
Operations that grow a volume, including 'vxresize' and 'vxassist growby/growto', take significantly longer if the volume has a version 20 DCO (Data Change Object) attached to it, in comparison to a volume without a DCO.

DESCRIPTION:
When a volume with a DCO is grown, the existing map in the DCO needs to be copied and updated to track the grown regions. The algorithm was such that for each region in the map it would search for the page that contains that region in order to update the map. The number of regions and the number of pages containing them are proportional to the volume size, so the search complexity is amplified; this is observed primarily when the volume size is of the order of terabytes. In the reported instance, it took more than 12 minutes to grow a 2.7TB volume by 50G.

RESOLUTION:
The code has been enhanced to find the regions that are contained within a page and avoid looking up the page again for each of those regions.

* INCIDENT NO:2862024 TRACKING ID:2680343

SYMPTOM:
While manually disabling and enabling paths to an enclosure, the machine may panic with the following stack:
apauto_get_failover_path+0000CC()
gen_dmpnode_update_cur_pri+000828()
dmp_start_failover+000124()
gen_update_cur_pri+00012C()
dmp_update_cur_pri+000030()
dmp_reconfig_update_cur_pri+000010()
dmp_decipher_instructions+0006E8()
dmp_process_instruction_buffer+000308()
dmp_reconfigure_db+0000C4()
gendmpioctl+000ECC()
dmpioctl+00012C()

DESCRIPTION:
The Dynamic Multi-Pathing (DMP) driver keeps track of the number of active paths and failed paths internally. The computation may go wrong while exercising manual disable/enable of paths, which can lead to a machine panic.

RESOLUTION:
Code changes have been made to properly update the active path and failed path counts.

* INCIDENT NO:2867001 TRACKING ID:2866997

SYMPTOM:
After applying Solaris patch 147440-20, disk initialization using the vxdisksetup command fails with the following error:
VxVM vxdisksetup ERROR V-5-2-43 : Invalid disk device for vxdisksetup

DESCRIPTION:
An uninitialized variable gets a different value after the OS patch installation, thereby making the vxparms command output an incorrect result.

RESOLUTION:
The variable is initialized with the correct value.

* INCIDENT NO:2876116 TRACKING ID:2729911

SYMPTOM:
During a controller or port failure, UDEV removes the associated path information from DMP. While the paths are being removed, I/O to the disk can still be redirected to such a path after it has been deleted, leading to an I/O failure.

DESCRIPTION:
When a path is deleted from a DMP node, the data structures for that path need to be updated so that it is no longer available for I/O after deletion, which was not happening.

RESOLUTION:
The DMP code is modified so that a deleted path is not selected for future I/Os.
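The algorithmic change described for incident 2860812 amounts to replacing a per-region page lookup with one pass per page. Below is a minimal sketch of the idea in C; the page geometry and helper are made up for illustration and do not reflect the real DCO layout:

    #include <stdio.h>

    /* Illustrative size only; the real DCO region/page geometry differs. */
    #define REGIONS_PER_PAGE 1024

    /* Stand-in for the expensive "which page holds this region" lookup. */
    static long find_page(long region)
    {
        return region / REGIONS_PER_PAGE;
    }

    /* Old approach: one page lookup per region => O(regions) lookups. */
    static long mark_grown_regions_slow(long first, long count)
    {
        long lookups = 0;
        for (long r = first; r < first + count; r++) {
            (void)find_page(r);    /* locate the page, then update the map bit */
            lookups++;
        }
        return lookups;
    }

    /* New approach: locate each page once and update every region that
     * falls inside it => O(pages) lookups for the same amount of work. */
    static long mark_grown_regions_fast(long first, long count)
    {
        long lookups = 0;
        long r = first, end = first + count;
        while (r < end) {
            long page = find_page(r);
            long page_end = (page + 1) * REGIONS_PER_PAGE;
            lookups++;
            for (; r < end && r < page_end; r++)
                ;                  /* update the map bits covered by this page */
        }
        return lookups;
    }

    int main(void)
    {
        long first = 5L * 1024 * 1024, grow = 100000;   /* arbitrary example */
        printf("per-region lookups: %ld\n", mark_grown_regions_slow(first, grow));
        printf("per-page   lookups: %ld\n", mark_grown_regions_fast(first, grow));
        return 0;
    }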
* INCIDENT NO:2882488 TRACKING ID:2754819

SYMPTOM:
A diskgroup rebuild through 'vxmake -d' gets stuck with the following stack trace:
buildlocks()
MakeEverybody()
main()

DESCRIPTION:
During a diskgroup rebuild for configurations that have multiple objects on a single cache object, an internal list of cache objects gets incorrectly modified into a circular list, which causes infinite looping when it is accessed.

RESOLUTION:
Code changes are done to correctly populate the cache object list.

* INCIDENT NO:2886083 TRACKING ID:2257850

SYMPTOM:
A memory leak is observed when information about an enclosure is accessed by vxdiskadm.

DESCRIPTION:
The memory allocated locally for a data structure holding the array-specific attributes is not freed.

RESOLUTION:
Code changes are made to avoid such memory leaks.

* INCIDENT NO:2911010 TRACKING ID:2627056

SYMPTOM:
vxmake(1M) command when run with a very large

DESCRIPTION:
Due to a memory leak in the vxmake(1M) command, the data section limit for the process was reached. As a result, further memory allocations failed and the vxmake command failed with the above error.

RESOLUTION:
The memory leak is fixed by freeing the memory after it has been used.

* INCIDENT NO:2927307 TRACKING ID:2930396

SYMPTOM:
The vxdmpasm/vxdmpraw command does not work on Solaris. For example:
#vxdmpasm enable user1 group1 600 emc0_02c8
expr: syntax error
/etc/vx/bin/vxdmpasm: test: argument expected
#vxdmpraw enable user1 group1 600 emc0_02c8
expr: syntax error
/etc/vx/bin/vxdmpraw: test: argument expected

DESCRIPTION:
The "length" function of the expr command does not work on Solaris. This function was used in the script and caused the error.

RESOLUTION:
The expr command has been replaced by the awk command.

PATCH ID:142630-15

* INCIDENT NO:2280285 TRACKING ID:2365486

SYMPTOM:
In a two-node SFRAC configuration, when "vxdisk scandisks" is run after enabling ports, the system panics with the following stack:
PANIC STACK:
.unlock_enable_mem()
.unlock_enable_mem()
dmp_update_path()
dmp_decode_update_dmpnode()
dmp_decipher_instructions()
dmp_process_instruction_buffer()
dmp_reconfigure_db()
gendmpioctl()
vxdmpioctl()
rdevioctl()
spec_ioctl()
vnop_ioctl()
vno_ioctl()
common_ioctl()
ovlya_addr_sc_flih_main()

DESCRIPTION:
An improper order of acquiring and releasing locks during DMP reconfiguration, while I/O activity was running in parallel, led to the above panic.

RESOLUTION:
The locks are released in the same order in which they were acquired.

* INCIDENT NO:2405446 TRACKING ID:2253970

SYMPTOM:
Enhancement to customize the private region I/O size based on the maximum transfer size of the underlying disk.

DESCRIPTION:
There are different types of array controllers which support data transfer sizes starting from 256K and beyond. The VxVM tunable volmax_specialio controls vxconfigd's configuration I/O size as well as the Atomic Copy I/O size. When volmax_specialio is tuned to a value greater than 1MB to leverage the maximum transfer sizes of the underlying disks, the import operation fails for disks which cannot accept I/Os larger than 256K. If the tunable is set to 256K, the larger transfer sizes of the other disks are not leveraged.

RESOLUTION:
This enhancement leverages large disk transfer sizes while still supporting array controllers limited to 256K transfer sizes.

* INCIDENT NO:2532440 TRACKING ID:2495186

SYMPTOM:
With the TCP protocol used for replication, I/O throttling happens due to memory flow control.

DESCRIPTION:
In some slow network configurations, the I/O throughput is throttled back due to the replication I/O.
RESOLUTION:
The replication I/O is kept outside the normal I/O code path to improve its I/O throughput performance.

* INCIDENT NO:2563291 TRACKING ID:2527289

SYMPTOM:
In a Campus Cluster setup, a storage fault may lead to DETACH of all the configured sites. This also results in I/O failure on all the nodes in the Campus Cluster.

DESCRIPTION:
Site detaches are done on site-consistent dgs when any volume in the dg loses all the mirrors of a site. While processing the DETACH of the last mirror in a site, we identify that it is the last mirror and DETACH the site, which in turn detaches all the objects of that site. In a Campus Cluster setup, a dco volume is attached to any data volume created on a site-consistent dg. The general configuration is to have one DCO mirror on each site. Loss of a single mirror of the dco volume on any node will result in the detach of that site. In a two-site configuration this particular scenario results in both dco mirrors being lost simultaneously. While the site detach for the first mirror is being processed, we also signal for DETACH of the second mirror, which ends up DETACHING the second site too. This is not hit in other tests because there is already a check to make sure that we do not DETACH the last mirror of a volume; that check is subverted in this particular case because of the type of storage failure.

RESOLUTION:
Before triggering the site detach, an explicit check is made to see whether we are trying to DETACH the last ACTIVE site.

* INCIDENT NO:2589679 TRACKING ID:2589569

SYMPTOM:
vxdisksetup takes a long time (approximately 2-4 minutes) to initialize a sliced disk on an A/P array.

DESCRIPTION:
In VxVM (Veritas Volume Manager), the DKIOCGVTOC/DKIOCGGEOM IOCTL(s) are used to detect an EFI disk: if the IOCTL(s) return the error ENOTSUP, the disk is taken to have an EFI label. Upon an ENOTSUP error from the primary path, the DMP driver attempts to retry the IOCTL(s) on the secondary path, which consumes more time.

RESOLUTION:
The IOCTL service routine is modified to prevent the DMP driver from retrying the IOCTL(s) on the secondary path.

* INCIDENT NO:2603605 TRACKING ID:2419948

SYMPTOM:
A race between the SRL flush due to SRL overflow and the kernel logging code leads to a panic.

DESCRIPTION:
When the Rlink is disconnected, the RLINK state is moved to HALT. The primary RVG SRL overflowed because there was no replication, which initiated DCM logging and changed the state of the rlink to DCM (since the rlink is already disconnected, the final state remains HALT). During the SRL overflow, if the rlink connection is restored, the rlink goes through many state changes before completing the connection. If the SRL overflow and kernel logging code finishes in between these state transitions and does not find the rlink in VOLRP_PHASE_HALT, the system panics.

RESOLUTION:
The above state change is treated as valid, and the SRL overflow code no longer always expects the HALT state: it takes action for the other states or waits for the full state transition of the rlink connection to complete.

* INCIDENT NO:2612969 TRACKING ID:2612960

SYMPTOM:
Onlining a disk with a GPT (GUID Partition Table) and VxVM aixdisk layout may result in vxconfigd dumping core and printing the following message:
Assertion failed: (0), file , line .

DESCRIPTION:
This problem occurs only on disks with VxVM aixdisk layout which previously had a GPT layout prior to being initialized with the VxVM aixdisk layout.
The existence of the GPT label on a disk with VxVM aixdisk layout resulted in VxVM being unable to discover the disk layout properly.

RESOLUTION:
While discovering the layout of the disk, VxVM first checks whether the disk has the VxVM aixdisk layout, and clears out the GPT label on disks which have the VxVM aixdisk layout.

* INCIDENT NO:2621549 TRACKING ID:2621465

SYMPTOM:
When a failed disk belonging to a site has once again become accessible, it cannot be reattached to the disk group.

DESCRIPTION:
As the disk has a site tag name set, the 'vxdg adddisk' command invoked by the 'vxreattach' command needs the '-f' option to add the disk back to the disk group.

RESOLUTION:
The '-f' option is added to the 'vxdg adddisk' command when it is invoked from the 'vxreattach' command.

* INCIDENT NO:2626900 TRACKING ID:2608849

SYMPTOM:
1. Under a heavy I/O load on the logclient node, write I/Os on the VVR primary logowner take a very long time to complete.
2. I/Os on "master" and "slave" nodes hang when the "master" role is switched multiple times using the "vxclustadm setmaster" command.

DESCRIPTION:
1. VVR does not allow more than 2048 I/Os outstanding on the SRL volume; any I/Os beyond this threshold are throttled. The throttled I/Os are restarted after every SRL header flush operation. While restarting the throttled I/Os, I/Os that came from the logclient are given higher priority, causing the logowner I/Os to starve.
2. In the CVM reconfiguration code path, the RLINK ports are not cleanly deleted on the old logowner. This prevents the RLINKs from connecting, leading to both replication and I/O hang.

RESOLUTION:
The algorithm which restarts the throttled I/Os is modified to give a fair chance to both local and remote I/Os to proceed. Additionally, code changes are made in the CVM reconfiguration code path to delete the RLINK ports cleanly before switching the master role.

* INCIDENT NO:2626911 TRACKING ID:2605444

SYMPTOM:
Running vxdmpadm disable/enable on the primary path (EFI labelled) in an A/PF array results in all paths getting disabled.

DESCRIPTION:
Enabling an EFI-labelled primary path disables the secondary path. When the primary path is disabled, a failover occurs to the secondary path, and the name of the secondary path undergoes a change, dropping the slice s2 from the name (cxtxdxs2 becomes cxtxdx). This change in the name is not updated in the device property list, and the missing update causes the secondary path to be disabled when the primary path is enabled.

RESOLUTION:
The code path which changes the name of the secondary path is rectified to update the property list.

* INCIDENT NO:2626920 TRACKING ID:2061082

SYMPTOM:
The "vxddladm -c assign names" command does not work if the dmp_native_support tunable is enabled.

DESCRIPTION:
If the dmp_native_support tunable is set to "on", VxVM does not allow changing the names of dmpnodes. This holds true even for devices that do not have native support enabled, such as VxVM-labelled or Third Party Devices, so there is no way to selectively change the names of devices for which native support is not enabled.

RESOLUTION:
This enhancement is addressed by a code change to selectively change the names of devices for which native support is not enabled.

* INCIDENT NO:2633041 TRACKING ID:2509291

SYMPTOM:
The "vxconfigd" daemon hangs if disable/enable of host-side FC switch ports is exercised for several iterations, and consequently VxVM related commands don't return.
schedule
dmp_biowait
dmp_indirect_io
gendmpioctl
dmpioctl
dmp_ioctl
dmp_compat_ioctl
compat_blkdev_ioctl
compat_sys_ioctl
sysenter_do_call

DESCRIPTION:
When the fail-over thread corresponding to the LUN is scheduled, it frees the memory allocated for the fail-over request and returns from the Array Policy Module fail-over function call. When the thread is scheduled again, it still points to the fail-over request that was freed; when it tries to get the next value, NULL is returned. The fail-over thread waiting for other LUNs never gets invoked, which results in the vxconfigd daemon hang.

RESOLUTION:
Code changes have been made in the Array Policy Module to save the fail-over request pointer after marking the request state field as fail-over completed successfully.

* INCIDENT NO:2636094 TRACKING ID:2635476

SYMPTOM:
The DMP (Dynamic Multi Pathing) driver does not automatically enable the failed paths of Logical Units (LUNs) that are restored.

DESCRIPTION:
DMP's restore daemon probes each failed path at a default interval of 5 minutes (tunable) to detect whether that path can be enabled. As part of enabling the path, DMP issues an open() on the path's device number. Owing to a bug in the DMP code, the open() was issued on a wrong device partition, which resulted in failure for every probe. Thus, the path remained in failed status at the DMP layer even though it was enabled at the array side.

RESOLUTION:
The DMP restore daemon code path is modified to issue the open() on the appropriate device partitions.

* INCIDENT NO:2643651 TRACKING ID:2643634

SYMPTOM:
If standard (non-clone) disks and cloned disks of the same disk group are seen on a host, the dg import fails with the following error message when the standard (non-clone) disks have no enabled configuration copy of the disk group.
# vxdg import
VxVM vxdg ERROR V-5-1-10978 Disk group : import failed: Disk group has no valid configuration copies

DESCRIPTION:
When VxVM imports such a mixed configuration of standard (non-clone) disks and cloned disks, the standard (non-clone) disks are selected as the members of the disk group in 5.0MP3RP5HF1 and 5.1SP1RP2. This happens without administrators being aware that there is a mixed configuration and that the standard (non-clone) disks were selected for the import. It is hard to figure this out from the error message, and it takes time to investigate the issue.

RESOLUTION:
Syslog message enhancements are made so that administrators can figure out whether such a mixed configuration is seen on a host and which disks were selected for the import.

* INCIDENT NO:2651421 TRACKING ID:2649846

SYMPTOM:
In a Sun Cluster with VxVM (Veritas Volume Manager) and EMC PowerPath (Third Party Multi-Pathing Driver) environment, "cldg create -t vxvm ...", a Sun Cluster command, dumps core while creating a VxVM type disk group. The Sun Cluster command is:
cldg create -t vxvm -n -v
And the error message is:
umem allocator: redzone violation: write past end of buffer

DESCRIPTION:
The "cldg create -t vxvm -v ..." Sun Cluster command needs VxVM's assistance to get the sub-paths of the devices included in the disk group. However, VxVM does not allocate enough memory to hold the device names, which is why Sun Cluster reports the error message.

RESOLUTION:
The VxVM library has been modified to allocate adequate memory space to hold the device name.
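The redzone violation reported for incident 2651421 is the classic symptom of copying a string into a buffer that was not sized from the string itself. A minimal sketch of the corrected pattern in C; the path below is only an example, not taken from the incident:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return a copy of a device path in a buffer sized from the string.
     * The reported defect was the equivalent of copying into a buffer
     * shorter than the name, which the umem allocator flags as
     * "redzone violation: write past end of buffer". */
    static char *copy_device_name(const char *devname)
    {
        size_t len = strlen(devname) + 1;    /* +1 for the terminating NUL */
        char *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        memcpy(buf, devname, len);
        return buf;
    }

    int main(void)
    {
        char *name = copy_device_name("/dev/vx/rdmp/emcpower42s2");
        if (name == NULL)
            return 1;
        printf("%s\n", name);
        free(name);
        return 0;
    }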
* INCIDENT NO:2666175 TRACKING ID:2666163

SYMPTOM:
A small memory leak may be seen in vxconfigd, the VxVM configuration daemon, when a Serial Split Brain (SSB) error is detected during the import process.

DESCRIPTION:
The leak may occur when a Serial Split Brain (SSB) error is detected during the import process: when the SSB error is returned from a function, a dynamically allocated memory area in the same function is not freed. SSB detection is a VxVM feature whereby VxVM detects whether the configuration copy in the disk private region has become stale unexpectedly. A typical case of the SSB error is a disk group imported on different systems at the same time, where configuration copy updates on both systems result in an inconsistency between the copies. VxVM cannot identify which configuration copy is most up-to-date in this situation. As a result, VxVM may detect an SSB error on the next import and show the details through a CLI message.

RESOLUTION:
Code changes are made to avoid the memory leak, and a small message fix has also been done.

* INCIDENT NO:2676703 TRACKING ID:2553729

SYMPTOM:
The following is observed during an 'Upgrade' of VxVM (Veritas Volume Manager):
i) The 'clone_disk' flag is seen on non-clone disks in the STATUS field when 'vxdisk -e list' is executed after an upgrade to 5.1SP1 from lower versions of VxVM.
Eg:
DEVICE TYPE DISK GROUP STATUS
emc0_0054 auto:cdsdisk emc0_0054 50MP3dg online clone_disk
emc0_0055 auto:cdsdisk emc0_0055 50MP3dg online clone_disk
ii) Disk groups (dg) whose versions are less than 140 do not get imported after an upgrade to VxVM versions 5.0MP3RP5HF1 or 5.1SP1RP2.
Eg:
# vxdg -C import
VxVM vxdg ERROR V-5-1-10978 Disk group : import failed: Disk group version doesn't support feature; see the vxdg upgrade command

DESCRIPTION:
While upgrading VxVM:
i) After an upgrade to 5.1SP1 or higher versions: if a dg which was created on a lower version is deported and imported back on 5.1SP1 after the upgrade, the "clone_disk" flag gets set on non-cloned disks because of a design change in the UDID (unique disk identifier) of the disks.
ii) After an upgrade to 5.0MP3RP5HF1 or 5.1SP1RP2: import of a dg with a version less than 140 fails.

RESOLUTION:
Code changes are made to ensure that:
i) the clone_disk flag does not get set for non-clone disks after the upgrade.
ii) disk groups with versions less than 140 get imported after the upgrade.

* INCIDENT NO:2695225 TRACKING ID:2675538

SYMPTOM:
Data corruption can be observed on a CDS (Cross-platform Data Sharing) disk as part of LUN resize operations. The following pattern would be found in the data region of the disk:
cyl alt 2 hd sec

DESCRIPTION:
The CDS disk maintains a SUN VTOC in the zeroth block and a backup label at the end of the disk. The VTOC maintains the disk geometry information such as the number of cylinders, tracks and sectors per track. The backup label is a duplicate of the VTOC, and the backup label location is determined from the VTOC contents. As part of a resize, the VTOC is not updated to the new size, which results in a wrong calculation of the backup label location. If the wrongly calculated backup label location falls in the public data region rather than at the end of the disk as designed, data corruption occurs.

RESOLUTION:
The VTOC contents are updated appropriately for LUN resize operations to prevent the data corruption.

* INCIDENT NO:2695227 TRACKING ID:2674465

SYMPTOM:
Data corruption is observed when DMP node names are changed by the following commands for DMP devices that are controlled by a third party multi-pathing driver (e.g.
MPXIO and PowerPath):
# vxddladm [-c] assign names
# vxddladm assign names file=
# vxddladm set namingscheme=

DESCRIPTION:
The above commands, when executed, re-assign a name to each device. Accordingly, the in-core DMP database should be updated for each device to map the new device name to the appropriate device number. Due to a bug in the code, the mapping of names to device numbers was not done appropriately, so subsequent I/Os went to a wrong device, leading to data corruption.

RESOLUTION:
The DMP routines responsible for mapping the names to the right device numbers are modified to fix this corruption problem.

* INCIDENT NO:2695228 TRACKING ID:2688747

SYMPTOM:
Under a heavy I/O load on the logclient node, writes on the VVR primary logowner take a very long time to complete. Writes appear to be hung.

DESCRIPTION:
VVR does not allow more than a specific number of I/Os (4096) outstanding on the SRL volume; any I/Os beyond this threshold are throttled. The throttled I/Os are restarted periodically. While restarting, I/Os belonging to the logclient get high preference compared to logowner I/Os, which can eventually lead to starvation or an I/O hang situation on the logowner.

RESOLUTION:
Changes are made in the I/O scheduling algorithm for restarted I/Os to make sure that throttled local I/Os get a chance to proceed under all conditions.

* INCIDENT NO:2701152 TRACKING ID:2700486

SYMPTOM:
If the VVR primary and secondary nodes have the same host-name, and there is a loss of heartbeats between them, the vradmind daemon can core-dump if an active stats session already exists on the primary node. The following stack-trace is observed:
pthread_kill()
_p_raise()
raise.raise()
abort()
__assert_c99
StatsSession::sessionInitReq()
StatsSession::processOpReq()
StatsSession::processOpMsgs()
RDS::processStatsOpMsg()
DBMgr::processStatsOpMsg()
process_message()
main()

DESCRIPTION:
On loss of heartbeats between the primary and secondary nodes, and a subsequent reconnect, RVG information is sent to the primary by the secondary node. In this case, if a stats session already exists on the primary, a STATS_SESSION_INIT request is sent back to the secondary. However, the code was using the "hostname" (as returned by `uname -a`) to identify the secondary node. Since both nodes had the same hostname, the resulting STATS_SESSION_INIT request was received at the primary itself, causing vradmind to core dump.

RESOLUTION:
The code was modified to use the 'virtual host-name' information contained in the RLinks, rather than hostname(1m), to identify the secondary node. In a scenario where both primary and secondary have the same host-name, virtual host-names are used to configure VVR.

* INCIDENT NO:2702110 TRACKING ID:2700792

SYMPTOM:
vxconfigd, the VxVM volume configuration daemon, may dump core with the following stack during Cluster Volume Manager (CVM) startup with "hares -online cvm_clus -sys [node]".
dg_import_finish()
dg_auto_import_all()
master_init()
role_assume()
vold_set_new_role()
kernel_get_cvminfo()
cluster_check()
vold_check_signal()
request_loop()
main()

DESCRIPTION:
During CVM startup, vxconfigd accesses the disk group record pointer of a pending record while the transaction on the disk group is in progress. At times, vxconfigd incorrectly accesses the stale pointer while processing the current transaction, resulting in a core dump.

RESOLUTION:
Code changes are made to access the appropriate pointer of the disk group record which is active in the current transaction.
Also, the disk group record pointer is appropriately initialized to NULL.

* INCIDENT NO:2703370 TRACKING ID:2700086

SYMPTOM:
In the presence of "Not-Ready" EMC devices on the system, multiple DMP (path disabled/enabled) event messages are seen in the syslog.

DESCRIPTION:
vxconfigd enables the BCV devices which are in Not-Ready state for I/O because the SCSI inquiry succeeds, but soon finds that they cannot be used for I/O and disables those paths. This activity takes place whenever the "vxdctl enable" or "vxdisk scandisks" command is executed.

RESOLUTION:
The state of a BCV device which is in "Not-Ready" state is no longer changed, which prevents the I/O attempts and the DMP event messages.

* INCIDENT NO:2703373 TRACKING ID:2698860

SYMPTOM:
Mirroring a large VxVM volume built on THIN LUNs with a mounted VxFS filesystem on top fails with the following error:
Command error
# vxassist -b -g $disk_group_name mirror $volume_name
VxVM vxplex ERROR V-5-1-14671 Volume is configured on THIN luns and not mounted. Use 'force' option, to bypass smartmove. To take advantage of smartmove for supporting thin luns, retry this operation after mounting the volume.
VxVM vxplex ERROR V-5-1-407 Attempting to cleanup after failure ...
Truss output error:
statvfs("", 0xFFBFEB54) Err#79 EOVERFLOW

DESCRIPTION:
The statvfs system call is invoked internally during the mirroring operation to retrieve statistics of the VxFS file system hosted on the volume. Since the statvfs system call only supports a maximum of 4294967295 (4GB-1) blocks, an EOVERFLOW error occurs if the total number of filesystem blocks is greater than that. This also results in vxplex terminating with the errors.

RESOLUTION:
The 64-bit version of statvfs, i.e. the statvfs64 system call, is used to resolve the EOVERFLOW and vxplex errors.

* INCIDENT NO:2706036 TRACKING ID:2617336

SYMPTOM:
The system panics when a root disk with a swap partition is encapsulated on a Solaris 10 system with kernel patch 147440-04 installed.

DESCRIPTION:
Systems upgraded to Solaris 10 kernel patch 147440-04 that have the swap device encapsulated will recursively panic due to a NULL pointer passed to vxioioctl from a new kernel routine 'swapify()'.

RESOLUTION:
The vxio driver will not access the disk IOCTL return value pointer when it is set to NULL.

* INCIDENT NO:2711758 TRACKING ID:2710579

SYMPTOM:
Data corruption can be observed on a CDS (Cross-platform Data Sharing) disk as part of operations like LUN resize, disk FLUSH, disk ONLINE etc. The following pattern would be found in the data region of the disk:
cyl alt 2 hd sec

DESCRIPTION:
The CDS disk maintains a SUN VTOC in the zeroth block and a backup label at the end of the disk. The VTOC maintains the disk geometry information such as the number of cylinders, tracks and sectors per track. The backup label is a duplicate of the VTOC, and the backup label location is determined from the VTOC contents. If the contents of the SUN VTOC located in the zeroth sector are incorrect, this may result in a wrong calculation of the backup label location. If the wrongly calculated backup label location falls in the public data region rather than at the end of the disk as designed, data corruption occurs.

RESOLUTION:
Writing of the backup label is suppressed to prevent the data corruption.
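The statvfs overflow behind incident 2703373 is a standard large-file issue. Below is a minimal sketch of the statvfs64 usage on Solaris, assuming a 32-bit build compiled with the large-file transitional interfaces; the mount point is only an example:

    #define _LARGEFILE64_SOURCE    /* expose statvfs64() in a 32-bit build */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(int argc, char **argv)
    {
        const char *mnt = (argc > 1) ? argv[1] : "/";
        struct statvfs64 vfs;

        /* In a 32-bit process, statvfs() reports block counts in 32-bit
         * fields and fails with EOVERFLOW when a filesystem has more than
         * 4294967295 blocks; statvfs64() uses 64-bit counters instead. */
        if (statvfs64(mnt, &vfs) != 0) {
            perror("statvfs64");
            return 1;
        }
        printf("%s: %llu blocks of %lu bytes\n",
               mnt, (unsigned long long)vfs.f_blocks, (unsigned long)vfs.f_frsize);
        return 0;
    }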
* INCIDENT NO:2713862 TRACKING ID:2390998

SYMPTOM:
When running the 'vxdctl' or 'vxdisk scandisks' command after migrating SAN ports, the system panicked. The following is the stack trace:
.disable_lock()
dmp_close_path()
dmp_do_cleanup()
dmp_decipher_instructions()
dmp_process_instruction_buffer()
dmp_reconfigure_db()
gendmpioctl()
vxdmpioctl()

DESCRIPTION:
SAN port migration ends up with two path nodes for the same device number, one node marked as NODE_DEVT_USED, which means the same device number has been reused by another node. When the dmp device is opened, the actual open count is modified on the new node (the one not marked NODE_DEVT_USED). If the caller is referencing the old node (marked NODE_DEVT_USED), it then modifies the layered open count on the old node. This results in inconsistent open reference counts for the node and causes a panic when the open counts are checked while closing the dmp device.

RESOLUTION:
The code change makes the modification of the actual open count and the layered open count happen on the same node while performing a dmp device open/close.

* INCIDENT NO:2741105 TRACKING ID:2722850

SYMPTOM:
Disabling/enabling controllers while I/O is in progress results in a DMP (Dynamic Multi-Pathing) thread hang with the following stack:
dmp_handle_delay_open
gen_dmpnode_update_cur_pri
dmp_start_failover
gen_update_cur_pri
dmp_update_cur_pri
dmp_process_curpri
dmp_daemons_loop

DESCRIPTION:
DMP takes an exclusive lock to quiesce a node to be failed over, and releases the lock to do update operations. These update operations presume that the node remains in quiesced status. A small timing window exists between the lock release and the update operations in which other threads can break in and unquiesce the node, which leads to the hang while performing the update operations.

RESOLUTION:
The quiesce counter of a node is corrected so that other threads cannot unquiesce it while a thread is performing update operations.

* INCIDENT NO:2744219 TRACKING ID:2729501

SYMPTOM:
In a Dynamic Multi-Pathing environment, excluding a path also excludes other paths with matching substrings.

DESCRIPTION:
Excluding a path using "vxdmpadm exclude vxvm path=<>" excludes all the paths with a matching substring. This is due to strncmp() being used for the comparison. Also, the size of the h/w path defined in the structure is larger than what is actually fetched.

RESOLUTION:
The size of the h/w path in the structure is corrected, and strcmp() is used for the comparison in place of strncmp().

* INCIDENT NO:2750453 TRACKING ID:2439481

SYMPTOM:
After doing a live upgrade on an encapsulated disk with a mirror, the mirror disk entry does not get removed from the rootdg.

DESCRIPTION:
When the alternate disk (-d option) is specified as c#t#d# to the vxlustart script, the mirrored disk entries are not removed from the rootdg. The vxlustart script does not handle the case where the alternate disk is specified in c#t#d# format while the DA/DM names of the disks do not resemble the c#t#d# format.

RESOLUTION:
Changes are done to enable specifying alt_disk in c#t#d# format with vxlustart when the DA/DM names do not resemble the c#t#d# format.

* INCIDENT NO:2750454 TRACKING ID:2423701

SYMPTOM:
An upgrade of VxVM caused the permissions of /etc/vx/vxesd to change from drwx------ to d---r-x--- during live upgrade.

DESCRIPTION:
The '/etc/vx/vxesd' directory is shipped in VxVM with "drwx------" permissions. However, while starting the vxesd daemon, if this directory is not present, it gets created with "d---r-x---".
RESOLUTION:
Changes are made so that when the vxesd daemon starts, '/etc/vx/vxesd' is created with 'drwx------' permissions.

* INCIDENT NO:2750458 TRACKING ID:2370250

SYMPTOM:
When the vxlufinish script runs "fuser -k" on the list of filesystems obtained using the "lufslist" command, it fails with the following error:
$ ./vxlufinish -V -u 5.10
VERITAS Volume Manager is finishing Live-Upgrade of OS release 5.10
/dev/dsk/c1t0d0s0:
/dev/vx/dsk/datadg/datavol: 29079o
/dev/vx/dsk/ocrvotedg/ocrvotevol: 26857o 24239o
# luumount -n dest.9041
# luactivate dest.9041
# lumount -n dest.9041 /altroot.5.10
vxlufinish check is successful. Still you can expect errors in encapsulation or in luactivate because of incorrect installation. Now try running it with no -V option
$ Write failed: Broken pipe

DESCRIPTION:
The filesystems which are excluded during "lucreate" get mounted as loopback file systems (lofs) under the Alternate Boot Environment (ABE). These lofs filesystems point to the actual special device files under the Primary Boot Environment (PBE). The subroutine "unmount_all" runs "fuser -k" on the lofs mounts, which causes the issue.

RESOLUTION:
The solution involves the following steps:
1) generate a list (L1) of mount points from "/etc/mnttab" of the PBE.
2) generate a list (L2) of lofs mount points using "/etc/mnttab" of the ABE.
3) do not run "fuser -k" on the lofs mounts in L2.

* INCIDENT NO:2750462 TRACKING ID:2553942

SYMPTOM:
"vxlustart -k" fails for the auto registration option.

DESCRIPTION:
The .volume.inf file for sol10u10Build17 contains the string VI"SOL_10_811_SPARC", whereas for other updates of Solaris 10 the string is VI"SOL_10_910_SPARC". The subroutine "chech_auto_registration" parses the year and month from this string, and "auto_reg_required" is set on the basis of the year and month.

RESOLUTION:
The logic is changed so that "auto_reg_required" gets set correctly.

* INCIDENT NO:2752178 TRACKING ID:2741240

SYMPTOM:
In a VxVM environment, "vxdg join" executed during heavy I/O load fails with the message below.
VxVM vxdg ERROR V-5-1-4597 vxdg join [source_dg] [target_dg] failed
join failed : Commit aborted, restart transaction
join failed : Commit aborted, restart transaction
Half of the disks that were part of source_dg become part of target_dg, whereas the other half have no DG details.

DESCRIPTION:
VxVM implements the vxdg join operation as a two-phase transaction. If the transaction fails after the first phase and during the second phase, half of the disks belonging to source_dg become part of target_dg and the other half of the disks are left in a complex, irrecoverable state. Also, under heavy I/O, any retry limit (i.e. a limit on retrying transactions) can easily be exceeded.

RESOLUTION:
"vxdg join" is now designed as a one-phase atomic transaction, and the retry limit is eliminated.

* INCIDENT NO:2774907 TRACKING ID:2771452

SYMPTOM:
In a lossy and high-latency network, I/O hangs on the VVR primary. Just before the I/O hang, the Rlink frequently connects and disconnects.

DESCRIPTION:
In a lossy and high-latency network, the RLINK gets disconnected because of heartbeat timeouts. As a part of the Rlink disconnect, the communication port is deleted. During this process, the RVG is serialized and the I/Os are kept in a special queue, rv_restartq. The I/Os in rv_restartq are supposed to be restarted once the port deletion is successful. The port deletion involves termination of all the communication server processes.
Because of a bug in the port deletion logic, the global variable which keeps track of number of communication server processes got decremented twice. This caused port deletion process to be hung leading to I/Os in rv_restartq never being restarted. RESOLUTION:In port deletion logic, it's made sure that the global variable which keeps track of number of communication server processes will get decremented correctly. PATCH ID:142630-14 * INCIDENT NO:2583307 TRACKING ID:2185069 SYMPTOM:In a CVR setup, while the application IOs are going on all nodes of primary, bringing down a slave node results in panic on master node with following stack trace: #0 [ffff8800282a3680] machine_kexec at ffffffff8103695b #1 [ffff8800282a36e0] crash_kexec at ffffffff810b8f08 #2 [ffff8800282a37b0] oops_end at ffffffff814cbbd0 #3 [ffff8800282a37e0] no_context at ffffffff8104651b #4 [ffff8800282a3830] __bad_area_nosemaphore at ffffffff810467a5 #5 [ffff8800282a3880] bad_area_nosemaphore at ffffffff81046873 #6 [ffff8800282a3890] do_page_fault at ffffffff814cd658 #7 [ffff8800282a38e0] page_fault at ffffffff814caf45 [exception RIP: vol_rv_async_childdone+876] RIP: ffffffffa080b7ac RSP: ffff8800282a3990 RFLAGS: 00010006 RAX: ffff8801ee8a5200 RBX: ffff8801f6e17200 RCX: ffff8802324290c0 RDX: ffff8801f7c8fac8 RSI: 0000000000000009 RDI: ffff8801f7c8fac8 RBP: ffff8800282a3a00 R8: ffff8801f38d8000 R9: 0000000000000001 R10: 000000000000003f R11: 000000000000000c R12: ffff8801f2580000 R13: ffff88021bdfa7c0 R14: ffff8801f7c8fa00 R15: ffff8801ed46a200 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff8800282a3a08] volsiodone at ffffffffa0672c3e #9 [ffff8800282a3a88] vol_subdisksio_done at ffffffffa06764a7 #10 [ffff8800282a3ac8] volkcontext_process at ffffffffa0642a59 #11 [ffff8800282a3b18] voldiskiodone at ffffffffa062f1c1 #12 [ffff8800282a3bc8] voldiskiodone_intr at ffffffffa062f3a2 #13 [ffff8800282a3bf8] voldmp_iodone at ffffffffa05f7806 #14 [ffff8800282a3c08] bio_endio at ffffffff811a0d3d #15 [ffff8800282a3c18] gendmpiodone at ffffffffa059594a #16 [ffff8800282a3c68] dmpiodone at ffffffffa0596cf2 #17 [ffff8800282a3cb8] bio_endio at ffffffff811a0d3d #18 [ffff8800282a3cc8] req_bio_endio at ffffffff8123f7fb #19 [ffff8800282a3cf8] blk_update_request at ffffffff8124083f #20 [ffff8800282a3d58] blk_update_bidi_request at ffffffff81240ba7 #21 [ffff8800282a3d88] blk_end_bidi_request at ffffffff81241c7f #22 [ffff8800282a3db8] blk_end_request at ffffffff81241d20 #23 [ffff8800282a3dc8] scsi_io_completion at ffffffff8134a42f #24 [ffff8800282a3e48] scsi_finish_command at ffffffff81341812 #25 [ffff8800282a3e88] scsi_softirq_done at ffffffff8134aa3d #26 [ffff8800282a3eb8] blk_done_softirq at ffffffff81247275 #27 [ffff8800282a3ee8] __do_softirq at ffffffff81073bd7 #28 [ffff8800282a3f58] call_softirq at ffffffff810142cc #29 [ffff8800282a3f70] do_softirq at ffffffff81015f35 #30 [ffff8800282a3f90] irq_exit at ffffffff810739d5 #31 [ffff8800282a3fa0] smp_call_function_single_interrupt at ffffffff8102eab5 #32 [ffff8800282a3fb0] call_function_single_interrupt at ffffffff81013e33 --- --- #33 [ffff8801f3ca9af8] call_function_single_interrupt at ffffffff81013e33 [exception RIP: page_waitqueue+125] RIP: ffffffff8110b16d RSP: ffff8801f3ca9ba8 RFLAGS: 00000213 RAX: 0000000000000b9d RBX: ffff8801f3ca9ba8 RCX: 0000000000000034 RDX: ffff880000027d80 RSI: 0000000000000000 RDI: 00000000000003df RBP: ffffffff81013e2e R8: ea00000000000000 R9: 5000000000000000 R10: 0000000000000000 R11: ffff8801ecd0f268 R12: ffffea0006c13d40 R13: 0000000000001000 R14: 
ffffffff8119d881 R15: ffff8801f3ca9b18 ORIG_RAX: ffffffffffffff04 CS: 0010 SS: 0018 #34 [ffff8801f3ca9bb0] unlock_page at ffffffff8110c16a #35 [ffff8801f3ca9bd0] blkdev_write_end at ffffffff811a3cd0 #36 [ffff8801f3ca9c00] generic_file_buffered_write at ffffffff8110c944 #37 [ffff8801f3ca9cd0] __generic_file_aio_write at ffffffff8110e230 #38 [ffff8801f3ca9d90] blkdev_aio_write at ffffffff811a339c #39 [ffff8801f3ca9dc0] do_sync_write at ffffffff8116c51a #40 [ffff8801f3ca9ef0] vfs_write at ffffffff8116c818 #41 [ffff8801f3ca9f30] sys_write at ffffffff8116d251 #42 [ffff8801f3ca9f80] sysenter_dispatch at ffffffff8104ca7f

DESCRIPTION:
The panic occurs because access to an internal data structure is not properly serialized, resulting in corruption of that data structure.

RESOLUTION:
Access to the internal data structure is properly serialized so that its contents are not corrupted under any scenario.

PATCH ID:142630-13

* INCIDENT NO:2440015 TRACKING ID:2428170

SYMPTOM:
I/O hangs when reading or writing to a volume after a total storage failure in CVM environments with Active-Passive arrays.

DESCRIPTION:
In the event of a storage failure in active-passive environments, the CVM-DMP fail-over protocol is initiated. This protocol is responsible for coordinating the fail-over of primary paths to secondary paths on all nodes in the cluster. In the event of a total storage failure, where both the primary and the secondary paths fail, in some situations the protocol fails to clean up some internal structures, leaving the devices quiesced.

RESOLUTION:
After a total storage failure, all devices should be un-quiesced, allowing the I/Os to fail. The CVM-DMP protocol has been changed to clean up devices even if all paths to a device have been removed.

* INCIDENT NO:2477272 TRACKING ID:2169726

SYMPTOM:
If a combination of cloned and non-cloned disks of a diskgroup is available at the time of import, then the diskgroup imported through the vxdg import operation contains both cloned and non-cloned disks.

DESCRIPTION:
For a particular diskgroup, if some of the disks are not available during the diskgroup import operation and the corresponding cloned disks are present, then the diskgroup imported through the vxdg import operation contains a combination of cloned and non-cloned disks. Example: a diskgroup named dg1 with the disks disk1 and disk2 exists on some machine, and clones of the disks, named disk1_clone and disk2_clone, are also available. If disk2 goes offline and the import of dg1 is performed, the resulting diskgroup will contain the disks disk1 and disk2_clone.

RESOLUTION:
The diskgroup import operation will consider cloned disks only if no non-cloned disk is available.

* INCIDENT NO:2497637 TRACKING ID:2489350

SYMPTOM:
In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS), Cluster Volume Manager (CVM) and Veritas Volume Replicator (VVR), kernel memory is leaked under certain conditions.

DESCRIPTION:
In CVR (CVM + VVR), under certain conditions (for example when I/O throttling gets enabled or the kernel messaging subsystem is overloaded), the I/O resources allocated earlier are freed and the I/Os are restarted afresh. While freeing the I/O resources, the VVR primary node does not free the kernel memory allocated for the FS-VM private information data structure, causing a kernel memory leak of 32 bytes for each restarted I/O.
RESOLUTION:
Code changes are made in VVR to free the kernel memory allocated for the FS-VM private information data structure before the I/O is restarted afresh.

* INCIDENT NO:2497796 TRACKING ID:2235382

SYMPTOM:
I/Os can hang in the DMP driver when I/Os are in progress while carrying out path failover.

DESCRIPTION:
While restoring any failed path to a non-A/A LUN, the DMP driver checks whether any pending I/Os exist on the same dmpnode. If any are present, DMP marks the corresponding LUN with a special flag so that path failover/failback can be triggered by the pending I/Os. There is a window here: if all the pending I/Os return before the dmpnode is marked, any future I/Os on the dmpnode get stuck in wait queues.

RESOLUTION:
The flag is set on the LUN only when it has pending I/Os, so that failover can be triggered by the pending I/Os.

* INCIDENT NO:2507120 TRACKING ID:2438426

SYMPTOM:
The following messages are displayed after vxconfigd is started.
pp_claim_device: Could not get device number for /dev/rdsk/emcpower0
pp_claim_device: Could not get device number for /dev/rdsk/emcpower1

DESCRIPTION:
The Device Discovery Layer (DDL) has incorrectly marked a path under a dmp device with the EFI flag even though there is no corresponding Extensible Firmware Interface (EFI) device in /dev/[r]dsk/. As a result, the Array Support Library (ASL) issues a stat command on the non-existent EFI device and displays the above messages.

RESOLUTION:
The EFI flag is no longer set on Dynamic MultiPathing (DMP) paths which correspond to non-EFI devices.

* INCIDENT NO:2507124 TRACKING ID:2484334

SYMPTOM:
A system panic occurs with the following stack while collecting DMP stats.
dmp_stats_is_matching_group+0x314()
dmp_group_stats+0x3cc()
dmp_get_stats+0x194()
gendmpioctl()
dmpioctl+0x20()

DESCRIPTION:
Whenever new devices are added to the system, the stats table is adjusted to accommodate the new devices in DMP. A race exists between the stats collection thread and the thread which adjusts the stats table to accommodate the new devices. The race can cause the stats collection thread to access memory beyond the known size of the table, causing the system panic.

RESOLUTION:
The stats collection code in DMP is rectified to restrict access to the known size of the stats table.

* INCIDENT NO:2508294 TRACKING ID:2419486

SYMPTOM:
Data corruption is observed with a single path when the naming scheme is changed from enclosure based (EBN) to OS Native (OSN).

DESCRIPTION:
The data corruption can occur in the following configuration, when the naming scheme is changed while applications are online:
1. The DMP device is configured with a single path, or the devices are controlled by a Third Party Multipathing Driver (e.g. MPXIO, MPIO etc.).
2. The DMP device naming scheme is EBN (enclosure based naming) and persistence=yes.
3. The naming scheme is changed to OSN using the following command:
# vxddladm set namingscheme=osn
There is a possibility of a change in the names of the VxVM devices (DA records) while the naming scheme is changing. As a result, the device attribute list is updated with the new DMP device names. Due to a bug in the code which updates the attribute list, the VxVM device records are mapped to the wrong DMP devices.
Example: the following are the device names with the EBN naming scheme.
MAS-usp0_0 auto:cdsdisk hitachi_usp0_0 prod_SC32 online
MAS-usp0_1 auto:cdsdisk hitachi_usp0_4 prod_SC32 online
MAS-usp0_2 auto:cdsdisk hitachi_usp0_5 prod_SC32 online
MAS-usp0_3 auto:cdsdisk hitachi_usp0_6 prod_SC32 online
MAS-usp0_4 auto:cdsdisk hitachi_usp0_7 prod_SC32 online
MAS-usp0_5 auto:none - - online invalid
MAS-usp0_6 auto:cdsdisk hitachi_usp0_1 prod_SC32 online
MAS-usp0_7 auto:cdsdisk hitachi_usp0_2 prod_SC32 online
MAS-usp0_8 auto:cdsdisk hitachi_usp0_3 prod_SC32 online
MAS-usp0_9 auto:none - - online invalid
disk_0 auto:cdsdisk - - online
disk_1 auto:none - - online invalid
bash-3.00# vxddladm set namingscheme=osn
The following is the output after executing the above command. MAS-usp0_9 is changed to MAS-usp0_6, and the following devices are changed accordingly.
bash-3.00# vxdisk list
DEVICE TYPE DISK GROUP STATUS
MAS-usp0_0 auto:cdsdisk hitachi_usp0_0 prod_SC32 online
MAS-usp0_1 auto:cdsdisk hitachi_usp0_4 prod_SC32 online
MAS-usp0_2 auto:cdsdisk hitachi_usp0_5 prod_SC32 online
MAS-usp0_3 auto:cdsdisk hitachi_usp0_6 prod_SC32 online
MAS-usp0_4 auto:cdsdisk hitachi_usp0_7 prod_SC32 online
MAS-usp0_5 auto:none - - online invalid
MAS-usp0_6 auto:none - - online invalid
MAS-usp0_7 auto:cdsdisk hitachi_usp0_1 prod_SC32 online
MAS-usp0_8 auto:cdsdisk hitachi_usp0_2 prod_SC32 online
MAS-usp0_9 auto:cdsdisk hitachi_usp0_3 prod_SC32 online
c4t20000014C3D27C09d0s2 auto:none - - online invalid
c4t20000014C3D26475d0s2 auto:cdsdisk - - online

RESOLUTION:
Code changes are made to update the device attribute list correctly even if the names of the VxVM devices change while the naming scheme is changing.

* INCIDENT NO:2508418 TRACKING ID:2390431

SYMPTOM:
In a Disaster Recovery environment, when the DCM (Data Change Map) is active and during an SRL (Storage Replicator Log)/DCM flush, the system panics due to a missing parent on one of the DCMs in an RVG (Replicated Volume Group).

DESCRIPTION:
The DCM flush happens during every log update and its frequency depends on the I/O load. If the I/O load is high, the DCM flush happens very often, and with more volumes in the RVG the frequency is very high. Every DCM flush triggers the DCM flush on all the volumes in the RVG. If there are 50 volumes in an RVG, each DCM flush creates 50 children controlled by one parent SIO. Once all 50 children are done, the parent SIO releases itself for the next flush. Once the DCM flush of each child completes, the child detaches itself from the parent by setting the parent field to NULL. It can happen that after the 49th child is done, but before it detaches itself from the parent, the 50th child completes and releases the parent SIO for the next DCM flush. Before the 49th child detaches, the new DCM flush is started on the same 50th child. After the next flush is started, the 49th child of the previous flush detaches itself from the parent and, since it is a static SIO, it indirectly resets the new flush's parent field. Also, the lock is not obtained before modifying the sio state field in a few scenarios.

RESOLUTION:
Before reducing the children count, the parent is detached first; this makes sure the new flush does not race with the previous flush. The field is protected with the required lock in all the scenarios.

* INCIDENT NO:2511928 TRACKING ID:2420386

SYMPTOM:
Corrupted data is seen near the end of a sub-disk on thin-reclaimable disks with either CDS EFI or sliced disk formats.

DESCRIPTION:
In environments with thin-reclaim disks running with either CDS-EFI disks or sliced disks, misaligned reclaims can be initiated.
In some situations, when reclaiming a sub-disk, the reclaim does not take into account the correct public region start offset, which in rare instances can result in reclaiming data before the sub-disk being reclaimed.

RESOLUTION:
The public offset is taken into account when initiating all reclaim operations.

* INCIDENT NO:2515137 TRACKING ID:2513101

SYMPTOM:
When VxVM is upgraded from 4.1MP4RP2 to 5.1SP1RP1, the data on a CDS disk gets corrupted.

DESCRIPTION:
When CDS disks are initialized with VxVM version 4.1MP4RP2, the number of cylinders is calculated based on the disk raw geometry. If the calculated number of cylinders exceeds the Solaris VTOC limit (65535), then because of unsigned integer overflow, a truncated value of the number of cylinders gets written in the CDS label. After VxVM is upgraded to 5.1SP1RP1, the CDS label gets wrongly written in the public region, leading to the data corruption.

RESOLUTION:
The code changes are made to suitably adjust the number of tracks and heads so that the calculated number of cylinders is within the Solaris VTOC limit.

* INCIDENT NO:2525333 TRACKING ID:2148851

SYMPTOM:
The "vxdisk resize" operation fails on a disk with VxVM cdsdisk/simple/sliced layout on the Solaris/Linux platform with the following message:
VxVM vxdisk ERROR V-5-1-8643 Device emc_clariion0_30: resize failed: New geometry makes partition unaligned

DESCRIPTION:
The new cylinder size selected during the "vxdisk resize" operation is unaligned with the partitions that existed prior to the "vxdisk resize" operation.

RESOLUTION:
The algorithm to select the new geometry has been redesigned such that the new cylinder size is always aligned with the existing as well as new partitions.

* INCIDENT NO:2531983 TRACKING ID:2483053

SYMPTOM:
The VVR primary system consumes very high kernel heap memory and appears to be hung.

DESCRIPTION:
There is a race between the REGION LOCK deletion thread, which runs as part of the SLAVE leave reconfiguration, and the thread which processes the DATA_DONE message coming from the log client to the logowner. Because of this race, the flags which store the status information about the I/Os were not correctly updated. This caused a lot of SIOs to get stuck in a queue, consuming a large amount of kernel heap.

RESOLUTION:
The code changes are made to take the proper locks while updating the SIOs' fields.

* INCIDENT NO:2531987 TRACKING ID:2510523

SYMPTOM:
In a CVM-VVR configuration, I/Os on "master" and "slave" nodes hang when the "master" role is switched to the other node using the "vxclustadm setmaster" command.

DESCRIPTION:
Under heavy I/O load, the I/Os are sometimes throttled in VVR if the number of outstanding I/Os on the SRL reaches a certain limit (2048 I/Os). When the "master" role is switched to the other node by using the "vxclustadm setmaster" command, the throttled I/Os on the original master are never restarted. This causes the I/O hang.

RESOLUTION:
Code changes are made in VVR to make sure the throttled I/Os are restarted before the "master" switching is started.

* INCIDENT NO:2531993 TRACKING ID:2524936

SYMPTOM:
A disk group is disabled after rescanning disks with the "vxdctl enable" command, with the console output below:
pp_claim_device: 0 Could not get metanode from ODM database
pp_claim_device: 0 Could not get metanode from ODM database
The error messages below are also seen in the vxconfigd debug log output:
VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/: The process file table is full.
VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/: The process file table is full.
...
VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/: The process file table is full. AIX- DESCRIPTION:When the total physical memory in AIX machine is greater than or equal to 40GB & multiple of 40GB (like 80GB, 120GB), a limitation/bug in setulimit function causes an overflowed value set as the new limit/size of the data area, which results in memory allocation failures in vxconfigd. Creation of the shared memory segment also fails during this course. Error handling of this case is missing in vxconfigd code, hence resulting in error in claiming disks and offlining configuration copies which in-turn results in disabling disk group. AIX- RESOLUTION:Code changes are made to handle the failure case on shared memory segment creation. * INCIDENT NO:2552402 TRACKING ID:2432006 SYMPTOM:System intermittently hangs during boot if disk is encapsulated. When this problem occurs, OS boot process stops after outputing this: "VxVM sysboot INFO V-5-2-3409 starting in boot mode..." DESCRIPTION:The boot process hung due to a dead lock between two threads, one VxVM transaction thread and another thread attempting a read on root volume issued by dhcpagent. Read I/O is deferred till transaction is finished but read count incremented earlier is not properly adjusted. RESOLUTION:Proper care is taken to decrement pending read count if read I/O is deferred. * INCIDENT NO:2553391 TRACKING ID:2536667 SYMPTOM:[04DAD004]voldiodone+000C78 (F10000041116FA08) [04D9AC88]volsp_iodone_common+000208 (F10000041116FA08, 0000000000000000, 0000000000000000) [04B7A194]volsp_iodone+00001C (F10000041116FA08) [000F3FDC]internal_iodone_offl+0000B0 (??, ??) [000F3F04]iodone_offl+000068 () [000F20CC]i_softmod+0001F0 () [0017C570].finish_interrupt+000024 () DESCRIPTION:Panic happened due to accessing a stale DG pointer as DG got deleted before the I/O returned. It may happen on cluster configuration where commands generating private region i/os and "vxdg deport/delete" commands are executing simultaneously on two nodes of the cluster. RESOLUTION:Code changes are made to drain private region I/Os before deleting the DG. * INCIDENT NO:2562911 TRACKING ID:2375011 SYMPTOM:User is not able to change the "dmp_native_support" tunable to "on" or "off" in the presence of the root ZFS pool. SOL_ DESCRIPTION:DMP does not allow the dmp_native_support tunable to be changed if any of the ZFS pools is in use. Therefore in the presence of root ZFS pool, DMP reports the following error when the user tried to change the "dmp_native_support" tunable to "on" or "off" # vxdmpadm settune dmp_native_support=off VxVM vxdmpadm ERROR V-5-1-15690 Operation failed for one or more zpools VxVM vxdmpadm ERROR V-5-1-15686 The following zpool(s) could not be migrated as they are in use - rpool SOL_ RESOLUTION:DMP code has been changed to skip the root ZFS pool in its internal checks for active ZFS pools prior to changing the value of dmp_native_support tunable. * INCIDENT NO:2563291 TRACKING ID:2527289 SYMPTOM:In a Campus Cluster setup, storage fault may lead to DETACH of all the configured site. This also results in IOfailure on all the nodes in the Campus Cluster. DESCRIPTION:Site detaches are done on site consistent dgs when any volume in the dg looses all the mirrors of a Site. During the processing of the DETACH of last mirror in a site we identify that it is the last mirror and DETACH the site which in turn detaches all the objects of that site. In Campus Cluster setup we attach a dco volume for any data volume created on a site-consistent dg. 
The general configuration is to have one DCO mirror on each site. Loss of a single mirror of the DCO volume on any node will result in the detach of that site. In a two-site configuration this particular scenario results in both DCO mirrors being lost simultaneously. While the site detach for the first mirror is being processed, we also signal for DETACH of the second mirror, which ends up DETACHING the second site too. This is not hit in other tests as we already have a check to make sure that we do not DETACH the last mirror of a volume. This check is subverted in this particular case due to the type of storage failure. RESOLUTION:Before triggering the site detach, an explicit check is needed to see whether we are trying to DETACH the last ACTIVE site. * INCIDENT NO:2574840 TRACKING ID:2344186 SYMPTOM:In a master-slave configuration with FMR3/DCO volumes, a rebooted cluster node fails to join back the cluster again, with the following error messages on the console [..] Jul XX 18:44:09 vienna vxvm:vxconfigd: [ID 702911 daemon.error] V-5-1-11092 cleanup_client: (Volume recovery in progress) 230 Jul XX 18:44:09 vienna vxvm:vxconfigd: [ID 702911 daemon.error] V-5-1-11467 kernel_fail_join() : Reconfiguration interrupted: Reason is retry to add a node failed (13, 0) [..] DESCRIPTION:VxVM volumes with FMR3/DCO have an inbuilt DRL mechanism to track the disk blocks of in-flight I/Os in order to recover the data much quicker in case of a node crash. Thus, a joining node waits for the variable responsible for recovery to be unset before joining the cluster. However, due to a bug in the FMR3/DCO code, this variable remained set forever, leading to the node join failure. RESOLUTION:Modified the FMR3/DCO code to appropriately set and unset this recovery variable. PATCH ID:142630-12 * INCIDENT NO:2169348 TRACKING ID:2094672 SYMPTOM:Master node hangs with a lot of I/Os during node reconfiguration due to a node leave. DESCRIPTION:The reconfiguration is stuck because the I/O is not drained completely. The master node is responsible for handling the I/O for both the primary and the slave. When the slave node dies, the pending slave I/O on the master node is not cleaned up properly. This leads to some I/Os being left in the queue undeleted. RESOLUTION:Clean up the I/O during the node failure and reconfiguration scenario. * INCIDENT NO:2169372 TRACKING ID:2108152 SYMPTOM:vxconfigd, the VxVM volume configuration daemon, fails to get into enabled mode at startup and the "vxdctl enable" command displays the error "VxVM vxdctl ERROR V-5-1-1589 enable failed: Error in disk group configuration copies ". DESCRIPTION:vxconfigd issues an input/output control system call (ioctl) to read the disk capacity from disks. However, if it fails, the error number is not propagated back to vxconfigd. The subsequent disk operations to these failed devices were causing vxconfigd to get into disabled mode. RESOLUTION:The fix is made to propagate the actual "error number" returned by the ioctl failure back to vxconfigd. * INCIDENT NO:2198041 TRACKING ID:2196918 SYMPTOM:When creating a space-optimized snapshot by specifying the cache-object size either in percentage terms of the volume size or as an absolute size, the snapshot creation can fail with an error similar to the following: "VxVM vxassist ERROR V-5-1-10127 creating volume snap-dvol2-CV01: Volume or log length violates disk group alignment" DESCRIPTION:VxVM expects all virtual storage objects to have sizes aligned to a value which is set diskgroup-wide.
One can get this value with: # vxdg list testdg|grep alignment alignment: 8192 (bytes) When the cachesize is specified in percentage, the value might not align with dg alignment. If not aligned, the creation of the cache-volume could fail with specified error message RESOLUTION:After computing the cache-size from specified percentage value, it is aligned up to the diskgroup alignment value before trying to create the cache-volume. * INCIDENT NO:2204146 TRACKING ID:2200670 SYMPTOM:Some disks are left detached and not recovered by vxattachd. DESCRIPTION:If the shared disk group is not imported or node is not part of the cluster when storage connectivity to failed node is restored, the vxattachd daemon does not getting notified about storage connectivity restore and does not trigger a reattach. Even if the disk group is later imported or the node is joined to CVM cluster, the disks are not automatically reattached. RESOLUTION:i) Missing events for a deported diskgroup: The fix handles this by listening to the import event of the diksgroup and triggers the brute-force recovery for that specific diskgroup. ii) parallel recover of volumes from same disk: vxrecover automatically serializes the recovery of objects that are from the same disk to avoid the back and forth head movements. Also provided an option in vxattchd and vxrecover to control the number of parallel recovery that can happen for objects from the same disk. * INCIDENT NO:2205859 TRACKING ID:2196480 SYMPTOM:Initialization of VxVM cdsdisk layout fails on a disk of size less than 1 TB. DESCRIPTION:The disk geometry is derived to fabricate the cdsdisk label during the initialization of VxVM cdsdisk layout on a disk of size less than 1 TB. The disk geometry was violating one of the following requirements: (1) cylinder size is aligned with 8 KB. (2) Number of cylinders is less than 2^16 (3) The last sector in the device is not included in the last cylinder. (4) Number of heads is less than 2 ^16 (5) tracksize is less than 2^16 RESOLUTION:The issue has been resolved by making sure that the disk geometry used in fabricating the cdsdisk label satisfies all the five requirements described above. * INCIDENT NO:2211971 TRACKING ID:2190020 SYMPTOM:On heavy I/O system load dmp_deamon requests 1 mega byte continuous memory paging which inturn slows down the system due to continuous page swapping. LINUX- DESCRIPTION:dmp_deamon keeps calculating statistical information (every 1 second by default). When the I/O load is high the I/O statistics buffer allocation code path calculation dynamically allocates continuous ~1 mega byte per-cpu. LINUX- RESOLUTION:To avoid repeated memory allocation/free calls in every DMP I/O stats daemon interval, a two buffer strategy was implemented for storing DMP stats records. Two buffers of same size will be allocated at the beginning, one of the buffer will be used for writing active records while the other will be read by IO stats daemon. The two buffers will be swapped every stats daemon interval. * INCIDENT NO:2215263 TRACKING ID:2215262 SYMPTOM:Netapp iSCSI LUN goes into error state while initializing via VEA GUI. DESCRIPTION:VEA(vmprovider) calls fstyp command to check the file system type configured on the device before doing the initialization. The fstyp sends some unsupported pass through ioctl to dmp device which makes APM specific function is called to check path state of the device. 
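Stepping back to incident 2198041 above: the alignment fix amounts to rounding the computed cache size up to the next multiple of the diskgroup alignment. A minimal C sketch of that step; the 8192-byte alignment matches the example output above, while the volume size and the 20% cachesize are made-up values:

    #include <stdio.h>
    #include <stdint.h>

    /* Round a size up to the next multiple of the diskgroup alignment. */
    static uint64_t align_up(uint64_t size, uint64_t dg_alignment)
    {
        return ((size + dg_alignment - 1) / dg_alignment) * dg_alignment;
    }

    int main(void)
    {
        uint64_t dg_alignment = 8192;                 /* bytes, from "vxdg list" */
        uint64_t vol_size     = 2000000000ULL;        /* hypothetical volume size */
        uint64_t cache_size   = vol_size * 20 / 100;  /* cachesize given as 20% */

        printf("raw cache size    : %llu\n", (unsigned long long)cache_size);
        printf("aligned cache size: %llu\n",
               (unsigned long long)align_up(cache_size, dg_alignment));
        return 0;
    }

Without the final align_up() step the 400000000-byte value computed here is not a multiple of 8192, which is the condition that produced the "Volume or log length violates disk group alignment" error.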
The path state checking function sends the SCSI inquiry command to get the device state, but the inquiry returns an unexpected error because the memory allocated in the path state checking function is not aligned; therefore the path gets into the disabled state. RESOLUTION:Fix the memory allocation method in the NetApp APM path state checking function so that the start address of the memory is aligned. In addition, the error analyzing function in the NetApp APM uses the same memory allocation method, so it is fixed as well. * INCIDENT NO:2215376 TRACKING ID:2215256 SYMPTOM:Volume Manager is unable to recognize the devices connected through the F5100 HBA. DESCRIPTION:During device discovery, Volume Manager does not scan the LUNs that are connected through a SAS HBA (F5100 is a new SAS HBA). So commands like 'vxdisk list' do not even show the LUNs that are connected through the F5100 HBA. RESOLUTION:Modified the device discovery code in Volume Manager to include the paths/LUNs that are connected through a SAS HBA. * INCIDENT NO:2220064 TRACKING ID:2228531 SYMPTOM:Vradmind hangs in vol_klog_lock() on the VVR (Veritas Volume Replicator) Secondary site. The stack trace might look like: genunix:cv_wait+0x38() vxio:vol_klog_lock+0x5c() vxio:vol_mv_close+0xc0() vxio:vol_close_object+0x30() vxio:vol_object_ioctl+0x198() vxio:voliod_ioctl() vxio:volsioctl_real+0x2d4() specfs:spec_ioctl() genunix:fop_ioctl+0x20() genunix:ioctl+0x184() unix:syscall_trap32+0xcc() DESCRIPTION:In this scenario, a flag value should be set for vradmind to be signalled and woken up. As the flag value is not set here, it causes an enduring sleep. A race condition exists between the setting and resetting of the flag values, resulting in the hang. RESOLUTION:Code changes are made to hold a lock to avoid the race condition between the setting and resetting of the flag values. * INCIDENT NO:2227945 TRACKING ID:2226304 SYMPTOM:On the Solaris 9 platform, newfs(1M)/mkfs_ufs(1M) cannot create a UFS file system on a >1 terabyte (TB) VxVM volume, and it displays the following error: # newfs /dev/vx/rdsk// newfs: construct a new file system /dev/vx/rdsk//: (y/n)? y Can not determine partition size: Inappropriate ioctl for device # prtvtoc /dev/vx/rdsk// prtvtoc: /dev/vx/rdsk//: Unknown problem reading VTOC DESCRIPTION:newfs(1M)/mkfs_ufs(1M) invokes the DKIOCGETEFI ioctl. During the enhancement of EFI support on Solaris 10 on 5.0MP3RP3 or later, the DKIOCGETEFI ioctl functionality was not implemented on Solaris 9 because of the following limitations: 1. The EFI feature was not introduced in Solaris 9 FCS; it was introduced in Solaris 9 U3 (4/03), which includes 114127-03 (libefi) and 114129-02 (libuuid and efi/uuid headers). 2. During the enhancement of EFI support on Solaris 10, for Solaris 9 the DKIOCGVTOC ioctl was only supported on a volume <= 1TB since the VTOC specification was defined only for <= 1 TB LUNs/volumes. If the size of the volume is > 1 TB, the DKIOCGVTOC ioctl would return an inaccurate vtoc structure due to value overflow. RESOLUTION:The resolution is to enhance the VxVM code to handle the DKIOCGETEFI ioctl correctly on VxVM volumes on the Solaris 9 platform. When newfs(1M)/mkfs_ufs(1M) invokes the DKIOCGETEFI ioctl on a VxVM volume device, VxVM shall return the relevant EFI label information so that the UFS utilities can determine the volume size correctly. * INCIDENT NO:2232052 TRACKING ID:2230716 SYMPTOM:While trying to convert from SVM to VxVM, the user runs doconvert.
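Returning to incident 2215263 above: the fix hinges on the inquiry buffer's start address being properly aligned. A user-space sketch of the idea using posix_memalign(); the 512-byte alignment and 96-byte length are illustrative assumptions, not the values used by the NetApp APM:

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    #define INQ_BUF_LEN   96    /* illustrative inquiry response length */
    #define INQ_BUF_ALIGN 512   /* illustrative alignment requirement   */

    /* Allocate a buffer whose start address satisfies the alignment the
     * lower layers expect; malloc() alone gives no such guarantee, which
     * is the essence of the bug described above. */
    static void *alloc_inquiry_buffer(void)
    {
        void *buf = NULL;

        if (posix_memalign(&buf, INQ_BUF_ALIGN, INQ_BUF_LEN) != 0)
            return NULL;
        memset(buf, 0, INQ_BUF_LEN);
        return buf;
    }

    int main(void)
    {
        void *buf = alloc_inquiry_buffer();

        if (buf == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        printf("inquiry buffer at %p (aligned to %d bytes)\n", buf, INQ_BUF_ALIGN);
        free(buf);
        return 0;
    }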
The conversion process does not throw any error; however, after rebooting the host the conversion is not completed and no diskgroup is created. DESCRIPTION:After executing /opt/VRTSvxvm/vmcvt/bin/doconvert and rebooting, the conversion does not complete. This is because the /etc/lvm/md.cf file is not cleared of all metadevices, so upon reboot VxVM tries to initialize the disk and create the diskgroup, which fails with the error: "VxVM vxdisk ERROR V-5-1-15395 Device disk_1 is already in use by SVM. If you want to initialize this device for VxVM use, please clear SVM metadata by running 'metastat' and 'metaclear' commands." The above error is seen in the svc log file. RESOLUTION:We added a fix to clear the SVM metadevices using metaclear while we do the conversion process. * INCIDENT NO:2232829 TRACKING ID:2232789 SYMPTOM:With NetApp metro cluster disk arrays, takeover operations (toggling of LUN ownership within the NetApp filer) can lead to I/O failures on VxVM volumes. Example of an I/O error message from VxVM: VxVM vxio V-5-0-2 Subdisk disk_36-03 block 24928: Uncorrectable write error DESCRIPTION:During the takeover operation, the array fails the PGR and I/O SCSI commands on secondary paths with the following transient error codes - 0x02/0x04/0x0a (NOT READY/LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) or 0x02/0x04/0x01 (NOT READY/LOGICAL UNIT IS IN PROCESS OF BECOMING READY) - that are not handled properly within VxVM. RESOLUTION:Included the required code logic within the APM so that the SCSI commands with transient errors are retried for the duration of the NetApp filer reconfig time (60 secs) before failing the I/Os on VxVM volumes. * INCIDENT NO:2234292 TRACKING ID:2152830 SYMPTOM:Sometimes storage admins create multiple copies/clones of the same device. Diskgroup import fails with a non-descriptive error message when multiple copies (clones) of the same device exist and the original device(s) are either offline or not available. # vxdg import mydg VxVM vxdg ERROR V-5-1-10978 Disk group mydg: import failed: No valid disk found containing disk group DESCRIPTION:If the original devices are offline or unavailable, vxdg import picks up cloned disks for import. DG import fails by design unless the clones are tagged and the tag is specified during DG import. While the import failure is expected, the error message is non-descriptive and does not provide any corrective action to be taken by the user. RESOLUTION:A fix has been added to give a correct error message when duplicate clones exist during import. Also, details of the duplicate clones are reported in the syslog. Example: [At CLI level] # vxdg import testdg VxVM vxdg ERROR V-5-1-10978 Disk group testdg: import failed: DG import duplcate clone detected [In syslog] vxvm:vxconfigd: warning V-5-1-0 Disk Group import failed: Duplicate clone disks are detected, please follow the vxdg (1M) man page to import disk group with duplicate clone disks. Duplicate clone disks are: c2t20210002AC00065Bd0s2 : c2t50060E800563D204d1s2 c2t50060E800563D204d0s2 : c2t50060E800563D204d1s2 * INCIDENT NO:2241149 TRACKING ID:2240056 SYMPTOM:'vxdg move/split/join' may fail during high I/O load. DESCRIPTION:During heavy I/O load a 'dg move' transaction may fail because of an open/close assertion, and a retry is done. As the retry limit is set to 30, 'dg move' fails if the retry hits the limit.
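As a side note on incident 2232829 above: the resolution retries SCSI commands that fail with transient NOT READY sense codes until the filer reconfig window expires. A minimal sketch of that bounded-retry pattern; the 60-second window comes from the description, while issue_scsi_cmd() and is_transient_not_ready() are hypothetical stand-ins for the real APM/SCSI plumbing:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define RECONFIG_WINDOW_SECS 60   /* NetApp filer reconfig time used by the fix */

    /* Hypothetical stand-ins for the real APM/SCSI internals. */
    static int issue_scsi_cmd(void *cmd) { (void)cmd; return 0; }
    static int is_transient_not_ready(int status)
    {
        /* 02/04/0a or 02/04/01 sense data, encoded here as plain ints. */
        return status == 0x02040a || status == 0x020401;
    }

    /* Retry a command on transient NOT READY errors until it succeeds,
     * fails with a non-transient error, or the reconfig window expires. */
    static int issue_with_retry(void *cmd)
    {
        time_t deadline = time(NULL) + RECONFIG_WINDOW_SECS;
        int status;

        for (;;) {
            status = issue_scsi_cmd(cmd);
            if (status == 0)
                return 0;                        /* success */
            if (!is_transient_not_ready(status))
                return status;                   /* real failure: fail the I/O */
            if (time(NULL) >= deadline)
                return status;                   /* window expired: fail the I/O */
            sleep(1);                            /* back off before retrying */
        }
    }

    int main(void)
    {
        printf("status: %d\n", issue_with_retry(NULL));
        return 0;
    }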
RESOLUTION:Change the default transaction retry to unlimited, and introduce a new option to 'vxdg move/split/join' to set the transaction retry limit as follows: vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] move src_diskgroup dst_diskgroup objects ... vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] split src_diskgroup dst_diskgroup objects ... vxdg [-f] [-o verify|override] [-o transretry=retrylimit] join src_diskgroup dst_diskgroup * INCIDENT NO:2253269 TRACKING ID:2263317 SYMPTOM:The vxdg(1M) man page does not clearly describe diskgroup import and destroy operations for the case in which the original diskgroup is destroyed and cloned disks are present. DESCRIPTION:Diskgroup import with a dgid is considered a recovery operation. Therefore, while importing with a dgid, even though the original diskgroup is destroyed, both the original as well as the cloned disks are considered as available disks. Hence, the original diskgroup is imported in such a scenario. The existing vxdg(1M) man page does not clearly describe this scenario. RESOLUTION:Modified the vxdg(1M) man page to clearly describe the scenario. * INCIDENT NO:2256728 TRACKING ID:2248730 SYMPTOM:The command hangs if "vxdg import" is called from a script with STDERR redirected. DESCRIPTION:If a script runs "vxdg import" with STDERR redirected, the script does not finish until the DG import and recovery are finished. The pipe between the script and vxrecover is not closed properly, which keeps the calling script waiting for vxrecover to complete. RESOLUTION:Closed STDERR in vxrecover and redirected the output to /dev/console. * INCIDENT NO:2257706 TRACKING ID:2257678 SYMPTOM:When running the vxinstall command to install VxVM on a Linux/Solaris system with the root disk on an LVM volume, we get an error as follows: # vxinstall ... The system is encapsulated. Reinstalling the Volume Manager at this stage could leave your system unusable. Please un-encapsulate before continuing with the reinstallation. Cannot continue further. # DESCRIPTION:The vxinstall script checks whether the root device of the system is encapsulated (under VxVM control). The check for this was incorrectly coded. This led to LVM volumes also being detected as VxVM volumes. This error prevented vxinstall from proceeding, emitting the above error message. The error message was not true and is a false positive. RESOLUTION:The resolution was to modify the code so that LVM volumes with rootvol in their name are not detected as VxVM encapsulated volumes. * INCIDENT NO:2272956 TRACKING ID:2144775 SYMPTOM:The failover_policy attribute is not persistent across reboot. DESCRIPTION:The failover_policy attribute was not implemented to be persistent across reboot. Hence on every reboot failover_policy switched back to the default. RESOLUTION:Added code changes to make the failover_policy attribute settings persistent across reboot. * INCIDENT NO:2273573 TRACKING ID:2270880 SYMPTOM:On Solaris 10 (SPARC only), if the size of an EFI (Extensible Firmware Interface) labeled disk is greater than 2TB, the disk capacity will be truncated to 2TB when it is initialized with CDS (Cross-platform Data Sharing) under VxVM (Veritas Volume Manager). For example, the sizes shown as the sector count by prtvtoc(1M) and the public region size by vxdisk(1M) will be truncated to sizes of approximately 2TB.
# prtvtoc /dev/rdsk/c0t500601604BA07D17d13 * First Sector Last * Partition Tag Flags Sector Count Sector Mount Directory 2 15 00 48 4294967215 4294967262 # vxdisk list c0t500601604BA07D17d13 | grep public public: slice=2 offset=65744 len=4294901456 disk_offset=48 DESCRIPTION:From VxVM 5.1 SP1 and onwards, the CDS format is enhanced to support for disks of greater than 1TB. VxVM will use EFI layout to support CDS functionality for disks of greater than 1TB, however on Solaris 10 (SPARC only), a problem is seen that the disk capacity will be truncated to 2TB if the size of EFI labeled disk is greater than 2TB. This is because the library /usr/lib/libvxscsi.so in Solaris 10 (SPARC only) package does not contain the required enhancement on Solaris 10 to support CDS format for disks greater than 2TB. RESOLUTION:The VxVM package for Solaris has been changed to contain all the libvxscsi.so binaries which is built for Solaris platforms(versions) respectively, for example libvxscsi.so.SunOS_5.9 and libvxscsi.so.SunOS_5.10. From this fix and onwards, the appropriate platform's built of the binary will be installed as /usr/lib/libvxscsi.so during the installation of the VxVM package. * INCIDENT NO:2276958 TRACKING ID:2205108 SYMPTOM:On VxVM 5.1SP1 or later, device discovery operations such as vxdctl enable, vxdisk scandisks and vxconfigd -k failed to claim new disks correctly. For example, if user provisions five new disks, VxVM, instead of creating five different Dynamic Multi-Pathing (DMP) nodes, creates only one and includes the rest as its paths. Also, the following message is displayed at console during this problem. NOTICE: VxVM vxdmp V-5-0-34 added disk array , datype = Please note that the cabinet serial number following "disk array" and the value of "datype" is not printed in the above message. DESCRIPTION:VxVM's DDL (Device Discovery Layer) is responsible for appropriately claiming the newly provisioned disks. Due to a bug in one of the routines within this layer, though the disks are claimed, their LSN (Lun Serial Number, an unique identifier of disks) is ignored thereby every disk is wrongly categorized within a DMP node. RESOLUTION:Modified the problematic code within the DDL thereby new disks are claimed appropriately. WORKAROUND: If vxconfigd does not hang or dump a core with this issue, a reboot can be a workaround to recover this situation or to break up once and rebuild the DMP/DDL database on the devices as the following steps; # vxddladm excludearray all # mv /etc/vx/jbod.info /etc/vx/jbod.info.org # vxddladm disablescsi3 # devfsadm -Cv # vxconfigd -k # vxddladm includearray all # mv /etc/vx/jbod.info.org /etc/vx/jbod.info # vxddladm enablescsi3 # rm /etc/vx/disk.info /etc/vx/array.info # vxconfigd -k * INCIDENT NO:2291184 TRACKING ID:2291176 SYMPTOM:vxrootadm does not set dump device correctly with LANG=ja ( i.e. Japanese). DESCRIPTION:The vxrootadm script tries to get dump device from dumpadm command output, but as language is set to Japanese, it is not able to grep English words from the output. As a result it fails to set dump device properly. RESOLUTION:Set environment Language variable to C (English) before parsing the dumpadm command output. This fix has been made to vxrootadm and vxunroot scripts where they try to parse the dumpadm command output. * INCIDENT NO:2299691 TRACKING ID:2299670 SYMPTOM:VxVM disk groups created on EFI (Extensible Firmware Interface) LUNs do not get auto-imported during system boot in VxVM version 5.1SP1 and later. 
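As a side note on incident 2291184 above (dump device parsing failing under a Japanese locale), the essence of the fix is to force the C locale before parsing localized command output. A small, purely illustrative C sketch of the idea; the real change was made in the vxrootadm and vxunroot scripts themselves:

    #include <stdio.h>
    #include <stdlib.h>

    /* Force the C (English) locale before parsing localized command output.
     * "dumpadm" with no arguments prints the current dump configuration on
     * Solaris; a real script would then grep for the "Dump device" line. */
    int main(void)
    {
        char line[512];
        FILE *fp;

        setenv("LC_ALL", "C", 1);   /* parse English output regardless of LANG */
        setenv("LANG", "C", 1);

        fp = popen("dumpadm", "r");
        if (fp == NULL) {
            perror("popen");
            return 1;
        }
        while (fgets(line, sizeof(line), fp) != NULL)
            fputs(line, stdout);
        pclose(fp);
        return 0;
    }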
DESCRIPTION:While determining the disk format of EFI LUNs, stat() system call on the corresponding DMP devices fail with ENOENT ("No such file or directory") error because the DMP device nodes are not created in the root file system during system boot. This leads to failure in auto-import of disk groups created on EFI LUNs. RESOLUTION:VxVM code is modified to use OS raw device nodes if stat() fails on DMP device nodes. * INCIDENT NO:2316309 TRACKING ID:2316297 SYMPTOM:The following error messages are printed on the console every time system boots. VxVM vxdisk ERROR V-5-1-534 Device [DEVICE NAME]: Device is in use DESCRIPTION:During system boot up, while Volume Manager diskgroup imports, vxattachd daemon tries to online the disk. Since the disk may be already online sometimes, an attempt to re-online disk gives the below error message: VxVM vxdisk ERROR V-5-1-534 Device [DEVICE NAME]: Device is in use RESOLUTION:The solution is to check if the disk is already in "online" state. If so, avoid reonline. * INCIDENT NO:2323999 TRACKING ID:2323925 SYMPTOM:If the rootdisk is under VxVM control and /etc/vx/reconfig.d/state.d/install-db file exists, the following messages are observed on the console: UX:vxfs fsck: ERROR: V-3-25742: /dev/vx/dsk/rootdg/homevol:sanity check failed: cannot open /dev/vx/dsk/rootdg/homevol: No such device or address UX:vxfs fsck: ERROR: V-3-25742: /dev/vx/dsk/rootdg/optvol:sanity check failed: cannot open /dev/vx/dsk/rootdg/optvol: No such device or address DESCRIPTION:In the vxvm-startup script, there is check for the /etc/vx/reconfig.d/state.d/install-db file. If the install-db file exist on the system, the VxVM assumes that volume manager is not configured and does not start volume configuration daemon "vxconfigd". "install-db" file somehow existed on the system for a VxVM rootable system, this causes the failure. RESOLUTION:If install-db file exists on the system and the system is VxVM rootable, the following warning message is displayed on the console: "This is a VxVM rootable system. Volume configuration daemon could not be started due to the presence of /etc/vx/reconfig.d/state.d/install-db file. Remove the install-db file to proceed" * INCIDENT NO:2328219 TRACKING ID:2253552 SYMPTOM:vxconfigd leaks memory while reading the default tunables related to smartmove (a VxVM feature). DESCRIPTION:In Vxconfigd, memory allocated for default tunables related to smartmove feature is not freed causing a memory leak. RESOLUTION:The memory is released after its scope is over. * INCIDENT NO:2337091 TRACKING ID:2255182 SYMPTOM:If EMC CLARiiON arrays are configured with different failovermode for each host controllers ( e.g. one HBA has failovermode set as 1 while the other as 2 ), then VxVM's vxconfigd demon dumps core. DESCRIPTION:DDL (VxVM's Device Discovery Layer) determines the array type depending on the failovermode setting. DDL expects the same array type to be returned across all the paths going to that array. This fundamental assumption of DDL will be broken with different failovermode settings thus leading to vxconfigd core dump. RESOLUTION:Validation code is added in DDL to detect such configurations and emit appropriate warning messages to the user to take corrective actions and skips the later set of paths that are reporting different array type. * INCIDENT NO:2349653 TRACKING ID:2349352 SYMPTOM:Data corruption is observed on DMP device with single path during Storage reconfiguration (LUN addition/removal). 
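Returning to incident 2299691 above: the resolution falls back to the OS raw device node when stat() on the DMP node fails during early boot. A rough user-space illustration; both device paths below are hypothetical examples, not paths taken from the fix:

    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>

    /* Return the device path to use for reading the label: prefer the DMP
     * node, but fall back to the OS raw device if the DMP node does not
     * exist yet (e.g. early in boot, before the DMP nodes are created). */
    static const char *pick_device_path(const char *dmp_node, const char *os_raw_node)
    {
        struct stat st;

        if (stat(dmp_node, &st) == 0)
            return dmp_node;
        if (errno == ENOENT)           /* DMP node not created yet */
            return os_raw_node;
        return dmp_node;               /* other errors: report against DMP node */
    }

    int main(void)
    {
        /* Hypothetical example paths. */
        const char *path = pick_device_path("/dev/vx/rdmp/emc0_1234",
                                            "/dev/rdsk/c2t1d0s2");
        printf("using %s\n", path);
        return 0;
    }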
DESCRIPTION:Data corruption can occur in the following configuration, when new LUNs are provisioned or removed under VxVM, while applications are on-line. 1. The DMP device naming scheme is EBN (enclosure based naming) and persistence=no 2. The DMP device is configured with single path or the devices are controlled by Third Party Multipathing Driver (Ex: MPXIO, MPIO etc.,) There is a possibility of change in name of the VxVM devices (DA record), when LUNs are removed or added followed by the following commands, since the persistence naming is turned off. (a) vxdctl enable (b) vxdisk scandisks Execution of above commands discovers all the devices and rebuilds the device attribute list with new DMP device names. The VxVM device records are then updated with this new attributes. Due to a bug in the code, the VxVM device records are mapped to wrong DMP devices. Example: Following are the device before adding new LUNs. sun6130_0_16 auto - - nolabel sun6130_0_17 auto - - nolabel sun6130_0_18 auto:cdsdisk disk_0 prod_SC32 online nohotuse sun6130_0_19 auto:cdsdisk disk_1 prod_SC32 online nohotuse The following are after adding new LUNs sun6130_0_16 auto - - nolabel sun6130_0_17 auto - - nolabel sun6130_0_18 auto - - nolabel sun6130_0_19 auto - - nolabel sun6130_0_20 auto:cdsdisk disk_0 prod_SC32 online nohotuse sun6130_0_21 auto:cdsdisk disk_1 prod_SC32 online nohotuse The name of the VxVM device sun6130_0_18 is changed to sun6130_0_20. RESOLUTION:The code that updates the VxVM device records is rectified. * INCIDENT NO:2353325 TRACKING ID:1791397 SYMPTOM:Replication doesn't start if rlink detach and attach is done just after SRL overflow. DESCRIPTION:As SRL overflows, it starts flush writes from SRL to DCM(Data change map). If rlink is detached before complete SRL is flushed to DCM then it leaves the rlink in SRL flushing state. Due to flushing state of rlink, attaching the rlink again doesn't start the replication. Problem here is the way rlink flushing state is interpreted. RESOLUTION:To fix this issue, we changed the logic to correctly interpret rlink flushing state. * INCIDENT NO:2353327 TRACKING ID:2179259 SYMPTOM:When using disks of size > 2TB and the disk encounters a media error with offset > 2TB while the disk responds to SCSI inquiry, data corruption can occur incase of a write operation DESCRIPTION:The I/O rety logic in DMP assumes that the I/O offset is within 2TB limit and hence when using disks of size > 2TB and the disk encounters a media error with offset > 2TB while the disk responds to SCSI inquiry, the I/O would be issued on a wrong offset within the 2TB range causing data corruption incase of write I/Os. RESOLUTION:The fix for this issue to change the I/O retry mechanism to work for >2TB offsets as well so that no offset truncation happens that could lead to data corruption * INCIDENT NO:2353328 TRACKING ID:2194685 SYMPTOM:vxconfigd dumps core in scenario where array side ports are disabled/enabled in loop for some iterations. gdb) where #0 0x081ca70b in ddl_delete_node () #1 0x081cae67 in ddl_check_migration_of_devices () #2 0x081d0512 in ddl_reconfigure_all () #3 0x0819b6d5 in ddl_find_devices_in_system () #4 0x0813c570 in find_devices_in_system () #5 0x0813c7da in mode_set () #6 0x0807f0ca in setup_mode () #7 0x0807fa5d in startup () #8 0x08080da6 in main () DESCRIPTION:Due to disabling the array side ports, the secondary paths get removed. But the primary paths are reusing the devno of the removed secondary paths which is not correctly handled in current migration code. 
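As a side note on incident 2353327 above (and incident 2415566 later in this list): the root cause is carrying the byte offset of the retried I/O in a 32-bit variable. A minimal sketch of the truncation, assuming 512-byte sectors and a made-up failing sector number:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A media error reported roughly 2.5 TB into the disk. */
        uint64_t bad_sector = 5368709200ULL;          /* sector number           */
        uint64_t byte_off64 = bad_sector * 512ULL;    /* byte offset, ~2.5 TB    */
        uint32_t byte_off32 = (uint32_t)byte_off64;   /* the buggy 32-bit holder */

        printf("correct retry offset : %llu\n", (unsigned long long)byte_off64);
        printf("truncated offset     : %u\n", (unsigned)byte_off32);
        /* The truncated value lands inside the first 4 GB of the disk, so a
         * retried write would hit the wrong location; the fix is to carry the
         * offset in a 64-bit quantity end to end. */
        return 0;
    }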
Due to this, the DMP database gets corrupted and subsequent discoveries lead to configd core dump. RESOLUTION:The issue is due to incorrect setting of a DMP flag. The flag settting has been fixed to prevent the DMP database from corruption in the mentioned scenario. * INCIDENT NO:2353403 TRACKING ID:2337694 SYMPTOM:"vxdisk -o thin list" displays size as 0 for thin luns of capacity greater than 2 TB. DESCRIPTION:SCSI READ CAPACITY ioctl is invoked to get the disk capacity. SCSI READ CAPACITY returns data in extended data format if a disk capacity is 2 TB or greater. This extended data was parsed incorectly while calculating the disk capacity. RESOLUTION:This issue has been resolved by properly parsing the extended data returned by SCSI READ CAPACITY ioctl for disks of size greater than 2 TB or greater. * INCIDENT NO:2353410 TRACKING ID:2286559 SYMPTOM:System panics in DMP (Dynamic Multi Pathing) kernel module due to kernel heap corruption while DMP path failover is in progress. Panic stack may look like: vpanic kmem_error+0x4b4() gen_get_enabled_ctlrs+0xf4() dmp_get_enabled_ctlrs+0xf4() dmp_info_ioctl+0xc8() dmpioctl+0x20() dmp_get_enabled_cntrls+0xac() vx_dmp_config_ioctl+0xe8() quiescesio_start+0x3e0() voliod_iohandle+0x30() voliod_loop+0x24c() thread_start+4() DESCRIPTION:During path failover in DMP, the routine gen_get_enabled_ctlrs() allocates memory proportional to the number of enabled paths. However, while releasing the memory, the routine may end up freeing more memory because of the change in number of enabled paths. RESOLUTION:Code changes have been made in the routines to free allocated memory only. * INCIDENT NO:2353421 TRACKING ID:2334534 SYMPTOM:In CVM (Cluster Volume Manager) environment, a node (SLAVE) join to the cluster is getting stuck and leading to unending join hang unless join operation is stopped on joining node (SLAVE) using command '/opt/VRTS/bin/vxclustadm stopnode'. While CVM join is hung in user-land (also called as vxconfigd level join), on CVM MASTER node, vxconfigd (Volume Manager Configuration daemon) doesn't respond to any VxVM command, which communicates to vxconfigd process. When vxconfigd level CVM join is hung in user-land, "vxdctl -c mode" on joining node (SLAVE) displays an output such as: bash-3.00# vxdctl -c mode mode: enabled: cluster active - SLAVE master: mtvat1000-c1d state: joining reconfig: vxconfigd in join DESCRIPTION:As part of a CVM node join to the cluster, every node in the cluster updates the current CVM membership information (membership information which can be viewed by using command '/opt/VRTS/bin/vxclustadm nidmap') in kernel first and then sends a signal to vxconfigd in user land to use that membership in exchanging configuration records among each others. Since each node receives the signal (SIGIO) from kernel independently, the joining node's (SLAVE) vxconfigd is ahead of the MASTER in its execution. Thus any requests coming from the joining node (SLAVE) is denied by MASTER with the error "VE_CLUSTER_NOJOINERS" i.e. join operation is not currently allowed (error number: 234) since MASTER's vxconfigd has not got the updated membership from the kernel yet. While responding to joining node (SLAVE) with error "VE_CLUSTER_NOJOINERS", if there is any change in current membership (change in CVM node ID) as part of node join then MASTER node is wrongly updating the internal data structure of vxconfigd, which is being used to send response to joining (SLAVE) nodes. 
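Returning to incident 2353410 above: it is a size mismatch between allocation and free. Kernel allocators such as Solaris kmem_free() must be told the same size that was passed to kmem_alloc(), so recomputing that size from a path count that may have changed in between releases the wrong amount. A hedged user-space analogue (structure and names are hypothetical) that simply remembers the allocated size:

    #include <stdlib.h>
    #include <string.h>

    /* Illustrative record describing one enabled controller. */
    struct ctlr_info { char name[64]; };

    struct ctlr_buf {
        struct ctlr_info *ctlrs;
        size_t            alloc_size;   /* remember exactly what was allocated */
    };

    /* Allocate space for the controllers that are enabled right now. */
    static int ctlr_buf_alloc(struct ctlr_buf *b, size_t nenabled)
    {
        b->alloc_size = nenabled * sizeof(struct ctlr_info);
        b->ctlrs = malloc(b->alloc_size);
        return b->ctlrs ? 0 : -1;
    }

    /* Free using the remembered size.  In kernel code this would be
     * kmem_free(b->ctlrs, b->alloc_size); recomputing the size from the
     * current number of enabled paths is the mismatch described above. */
    static void ctlr_buf_free(struct ctlr_buf *b)
    {
        free(b->ctlrs);
        memset(b, 0, sizeof(*b));
    }

    int main(void)
    {
        struct ctlr_buf b;

        if (ctlr_buf_alloc(&b, 8) == 0)   /* 8 enabled paths at allocation time */
            ctlr_buf_free(&b);
        return 0;
    }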
Due to wrong update of internal data structure, later when the joining node retries its request, the response from master is sent to a wrong node, which doesn't exist in the cluster, and no response is sent to the joining node. Joining node (SLAVE) never gets the response from MASTER for its request and hence CVM node join is not completed and leading to cluster hang. RESOLUTION:vxconfigd code is modified to handle the above mentioned scenario effectively. vxconfid on MASTER node will process connection request coming from joining node (SLAVE) effectively only when MASTER node gets the updated CVM membership information from kernel. * INCIDENT NO:2353425 TRACKING ID:2320917 SYMPTOM:vxconfigd, the VxVM configuration daemon dumps core and loses disk group configuration while invoking the following VxVM reconfiguration steps: 1) Volumes which were created on thin reclaimable disks are deleted. 2) Before the space of the deleted volumes is reclaimed, the disks (whose volume is deleted) are removed from the DG with 'vxdg rmdisk' command using '- k' option. 3) The disks are removed using 'vxedit rm' command. 4) New disks are added to the disk group using 'vxdg addisk' command. The stack trace of the core dump is : [ 0006f40c rec_lock3 + 330 0006ea64 rec_lock2 + c 0006ec48 rec_lock2 + 1f0 0006e27c rec_lock + 28c 00068d78 client_trans_start + 6e8 00134d00 req_vol_trans + 1f8 00127018 request_loop + adc 000f4a7c main + fb0 0003fd40 _start + 108 ] DESCRIPTION:When a volume is deleted from a disk group that uses thin reclaim luns, subdisks are not removed immediately, rather it is marked with a special flag. The reclamation happens at a scheduled time every day. "vxdefault" command can be invoked to list and modify the settings. After the disk is removed from disk group using 'vxdg -k rmdisk' and 'vxedit rm' command, the subdisks records are still in core database and they are pointing to disk media record which has been freed. When the next command is run to add another new disk to the disk group, vxconfigd dumps core when locking the disk media record which has already been freed. The subsequent disk group deport and import commands erase all disk group configuration as it detects an invalid association between the subdisks and the removed disk. RESOLUTION:1) The following message will be printed when 'vxdg rmdisk' is used to remove disk that has reclaim pending subdisks: VxVM vxdg ERROR V-5-1-0 Disk is used by one or more subdisks which are pending to be reclaimed. Use "vxdisk reclaim " to reclaim space used by these subdisks, and retry "vxdg rmdisk" command. Note: reclamation is irreversible. 2) Add a check when using 'vxedit rm' to remove disk. If the disk is in removed state and has reclaim pending subdisks, following error message will be printed: VxVM vxedit ERROR V-5-1-10127 deleting : Record is associated * INCIDENT NO:2353427 TRACKING ID:2337353 SYMPTOM:The "vxdmpadm include" command is including all the excluded devices along with the device given in the command. Example: # vxdmpadm exclude vxvm dmpnodename=emcpower25s2 # vxdmpadm exclude vxvm dmpnodename=emcpower24s2 # more /etc/vx/vxvm.exclude exclude_all 0 paths emcpower24c /dev/rdsk/emcpower24c emcpower25s2 emcpower10c /dev/rdsk/emcpower10c emcpower24s2 # controllers # product # pathgroups # # vxdmpadm include vxvm dmpnodename=emcpower24s2 # more /etc/vx/vxvm.exclude exclude_all 0 paths # controllers # product # pathgroups # DESCRIPTION:When a dmpnode is excluded, an entry is made in /etc/vx/vxvm.exclude file. 
This entry has to be removed when the dmpnode is included later. Due to a bug in the comparison of dmpnode device names, all the excluded devices are included. RESOLUTION:The bug in the code which compares the dmpnode device names is rectified. * INCIDENT NO:2353428 TRACKING ID:2339251 SYMPTOM:On Solaris 10, newfs/mkfs_ufs(1M) fails to create a UFS file system on a "VxVM volume > 2 Tera Bytes" with the following error: # newfs /dev/vx/rdsk/[disk group]/[volume] newfs: construct a new file system /dev/vx/rdsk/[disk group]/[volume]: (y/n)? y Can not determine partition size: Inappropriate ioctl for device The truss output of newfs/mkfs_ufs(1M) shows that the ioctl() system calls used to identify the size of the disk or volume device fail with an ENOTTY error. ioctl(3, 0x042A, ...) Err#25 ENOTTY ... ioctl(3, 0x0412, ...) Err#25 ENOTTY DESCRIPTION:On Solaris 10, newfs/mkfs_ufs(1M) uses ioctl() system calls to identify the size of the disk or volume device when creating a UFS file system on disk or volume devices "> 2TB". If the Operating System (OS) version is less than Solaris 10 Update 8, the above ioctl system calls are invoked on "volumes > 1TB" as well. VxVM, Veritas Volume Manager, exports the ioctl interfaces for VxVM volumes. VxVM 5.1 SP1 RP1 P1 and VxVM 5.0 MP3 RP3 introduced the support for Extensible Firmware Interface (EFI) for VxVM volumes in Solaris 9 and Solaris 10 respectively. However, the corresponding EFI-specific build time definition in the Veritas Kernel IO driver (VXIO) was not updated in Solaris 10 in VxVM 5.1 SP1 RP1 P1 and onwards. RESOLUTION:The code changes to add the build time definition for EFI in VXIO enable newfs/mkfs_ufs(1M) to successfully create a UFS file system on VxVM volume devices "> 2TB" ("> 1TB" if the OS version is less than Solaris 10 Update 8). * INCIDENT NO:2353464 TRACKING ID:2322752 SYMPTOM:Duplicate device names are observed for NR (Not Ready) devices when vxconfigd is restarted (vxconfigd -k). # vxdisk list emc0_0052 auto - - error emc0_0052 auto:cdsdisk - - error emc0_0053 auto - - error emc0_0053 auto:cdsdisk - - error DESCRIPTION:During vxconfigd restart, disk access records are rebuilt in the vxconfigd database. As part of this process I/Os are issued on all the devices to read the disk private regions. The failure of these I/Os on NR devices resulted in creating duplicate disk access records. RESOLUTION:vxconfigd code is modified not to create duplicate disk access records. * INCIDENT NO:2357579 TRACKING ID:2357507 SYMPTOM:The machine can panic while detecting unstable paths with the following stack trace. #0 crash_nmi_callback #1 do_nmi #2 nmi #3 schedule #4 __down #5 __wake_up #6 .text.lock.kernel_lock #7 thread_return #8 printk #9 dmp_notify_event #10 dmp_restore_node DESCRIPTION:After detecting unstable paths, the restore daemon allocates memory to report the event to userland daemons like vxconfigd. While requesting the memory allocation the restore daemon did not drop the spin lock, resulting in the machine panic. RESOLUTION:Fixed the code so that spinlocks are not held while requesting memory allocation in the restore daemon. * INCIDENT NO:2357820 TRACKING ID:2357798 SYMPTOM:VVR leaks memory due to an unfreed vol_ru_update structure. The memory leak is very small, but it can accumulate to a big value if VVR is running for many days. DESCRIPTION:VVR allocates an update structure for each write; if replication is up-to-date then the next write coming in will also create a multi-update and add it to the VVR replication queue.
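As a side note on incident 2357579 above: it illustrates the general rule that memory must not be requested while a spinlock is held. A minimal user-space sketch of the corrected ordering using pthread spinlocks; the event structure and names are hypothetical and only stand in for the restore daemon's real bookkeeping:

    #include <pthread.h>
    #include <stdlib.h>

    struct path_event { int path_id; int new_state; };

    static pthread_spinlock_t restore_lock;

    /* Report an unstable path: allocate the event *before* taking the lock,
     * so no allocation (which may block) happens while the spinlock is held,
     * which is the ordering the fix above enforces. */
    static struct path_event *report_unstable_path(int path_id, int new_state)
    {
        struct path_event *ev = malloc(sizeof(*ev));   /* allocate lock-free */

        if (ev == NULL)
            return NULL;

        pthread_spin_lock(&restore_lock);
        ev->path_id = path_id;          /* only quick, non-blocking work here */
        ev->new_state = new_state;
        pthread_spin_unlock(&restore_lock);
        return ev;
    }

    int main(void)
    {
        pthread_spin_init(&restore_lock, PTHREAD_PROCESS_PRIVATE);
        free(report_unstable_path(1, 0));
        pthread_spin_destroy(&restore_lock);
        return 0;
    }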
While creating the multi-update, VVR wrongly marked the original update with a flag which indicates that the update is in the replication queue, although it was never added (and did not need to be added) to the replication queue. When the update free routine is called, it checks whether the update has this flag set; if so, it does not free it, assuming that the update is still in the replication queue and will be freed when it is removed from the queue. Since the update was never in the queue, it is never freed and the memory leaks. The memory leak happens only for the first write coming in each time the rlink becomes up-to-date, which is why it takes many days to leak a large amount of memory. RESOLUTION:Marking the flag for these updates was causing the memory leak; the flag marking is not required as we are not adding the update to the replication queue. The fix is to remove the marking and checking of the flag. * INCIDENT NO:2360404 TRACKING ID:2146833 SYMPTOM:The vxrootadm/vxmirror command may fail with the error: VxVM mirror INFO V-5-2-22 Mirror voume swapvol... VxVM ERROR V-5-2-673 Mirroring of disk rootdisk failed: Error: VxVM vxdisk ERRROR V-5-1-0 Device has UFS FS on it. DESCRIPTION:With VxVM 5.1SP1 we have restricted the use of the -f option with 'vxdisk init' to initialize a disk having a UFS FS on it. We have introduced a new option '-r' to be used if the user wants to forcefully initialize the disk. The root disk on Solaris does not have a foreign format but does have a UFS FS. While trying to encapsulate the root disk we try to init the disk, which fails if the -r option is not specified. RESOLUTION:The fix is to add the -r option to 'vxdisk init'/'vxdisksetup' at all the places within our encap scripts to ensure that we successfully initialize a root disk. We have also made the error message more informative, for example: bash-3.00# vxdisk -f init c0d40s2 VxVM vxdisk ERROR V-5-1-16114 The device is in use. This device may be a boot disk. Device has a UFS FS on it. If you still want to initialize this device for VxVM use, ensure that there is no root FS on it. Then remove the FS signature from each of the slice(s) as follows: dd if=/dev/zero of=/dev/vx/rdmp/c0d40s[n] oseek=18 bs=512 count=1 [n] is the slice number. Or alternatively you can rerun the same command with the -r option. * INCIDENT NO:2360415 TRACKING ID:2242268 SYMPTOM:An agenode which had already been freed was accessed, which led to the panic. The panic stack looks like [0674CE30]voldrl_unlog+0001F0 (F100000070D40D08, F10001100A14B000, F1000815B002B8D0, 0000000000000000) [06778490]vol_mv_write_done+000AD0 (F100000070D40D08, F1000815B002B8D0) [065AC364]volkcontext_process+0000E4 (F1000815B002B8D0) [066BD358]voldiskiodone+0009D8 (F10000062026C808) [06594A00]voldmp_iodone+000040 (F10000062026C808) DESCRIPTION:The panic happened because of accessing a memory location which had already been freed. RESOLUTION:Skip the data structure for further processing when the memory has already been freed. * INCIDENT NO:2360419 TRACKING ID:2237089 SYMPTOM:vxrecover failed to recover the data volumes with an associated cache volume. DESCRIPTION:vxrecover doesn't wait till the recovery of the cache volumes is complete before triggering the recovery of the data volumes that are created on top of the cache volume. Due to this the recovery might fail for the data volumes. RESOLUTION:Code changes are done to serialize the recovery for different volume types. * INCIDENT NO:2360719 TRACKING ID:2359814 SYMPTOM:1. The vxconfigbackup(1M) command fails with the following error: ERROR V-5-2-3720 dgid mismatch 2. The "-f" option for vxconfigbackup(1M) is not documented in the man page. DESCRIPTION:1.
In some cases, a *.dginfo file will have two lines starting with "dgid:". This causes vxconfigbackup to fail. The output from the previous awk command returns 2 lines instead of one for the $bkdgid variable, and the comparison fails, resulting in the "dgid mismatch" error even when the dgids are the same. This happens if the temp dginfo file is not removed during the last run of vxconfigbackup, for example when the script is interrupted, because the temp dginfo file is updated in append mode (vxconfigbackup.sh: echo "TIMESTAMP" >> $DGINFO_F_TEMP 2>/dev/null). Therefore 2 or more dginfo entries may be added to the dginfo file, which causes the config backup failure with a dgid mismatch. 2. The "-f" option to force a backup is not documented in the man page of vxconfigbackup(1M). RESOLUTION:1. The solution is to change the append mode to destroy mode. 2. Updated the vxconfigbackup(1M) man page with the "-f" option. * INCIDENT NO:2364700 TRACKING ID:2364253 SYMPTOM:In case of Space Optimized snapshots at the secondary site, VVR leaks kernel memory. DESCRIPTION:In case of Space Optimized snapshots at the secondary site, VVR proactively starts the copy-on-write on the snapshot volume. The I/O buffer allocated for this proactive copy-on-write was not freed even after the I/Os are completed, which led to the memory leak. RESOLUTION:After the proactive copy-on-write is complete, the memory allocated for the I/O buffers is released. * INCIDENT NO:2367561 TRACKING ID:2365951 SYMPTOM:Growing RAID5 volumes beyond 5TB fails with an "Unexpected kernel error in configuration update" error. Example: # vxassist -g eqpwhkthor1 growby raid5_vol5 19324030976 VxVM vxassist ERROR V-5-1-10128 Unexpected kernel error in configuration update DESCRIPTION:VxVM stores the size required to grow RAID5 volumes in an integer variable which overflowed for large volume sizes. This results in failure to grow the volume. RESOLUTION:VxVM code is modified to handle integer overflow conditions for RAID5 volumes. * INCIDENT NO:2377317 TRACKING ID:2408771 SYMPTOM:VxVM does not show all the discovered devices. The number of devices shown by VxVM is less than the number shown by the OS. DESCRIPTION:For every lunpath device discovered, VxVM creates a data structure which is stored in a hash table. The hash value is computed based on the unique minor number of the lunpath. If the minor number exceeds 831231, we encounter an integer overflow and store the data structure for this path at a wrong location. When we later traverse this hash list, we limit the accesses based on the total number of discovered paths, and as the devices with minor numbers greater than 831232 are hashed wrongly, we do not create DA records for such devices. RESOLUTION:The integer overflow problem has been resolved by appropriately typecasting the minor number so that the correct hash value is computed. * INCIDENT NO:2379034 TRACKING ID:2379029 SYMPTOM:Changing the enclosure name was not working for all devices in the enclosure. All these devices were present in /etc/vx/darecs.
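As a side note on incidents 2367561 and 2377317 above: both come down to a plain 32-bit overflow; the grow length of 19324030976 sectors from the vxassist example does not fit in a 32-bit integer. A minimal illustration of the truncation and the widened fix:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Grow length from the vxassist example above, in sectors. */
        uint64_t grow_len = 19324030976ULL;

        int32_t as_int32 = (int32_t)grow_len;   /* the overflowing variable */
        int64_t as_int64 = (int64_t)grow_len;   /* the fixed, wide variable */

        printf("stored in 32 bits: %d\n", (int)as_int32);
        printf("stored in 64 bits: %lld\n", (long long)as_int64);
        /* The silently truncated 32-bit value is the kind of bad length that
         * leads to the configuration-update failure reported above. */
        return 0;
    }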
# cat /etc/vx/darecs ibm_ds8x000_02eb auto online format=cdsdisk,privoffset=256,pubslice=2,privslice=2 ibm_ds8x000_02ec auto online format=cdsdisk,privoffset=256,pubslice=2,privslice=2 # vxdmpadm setattr enclosure ibm_ds8x000 name=new_ibm_ds8x000 # vxdisk -o alldgs list DEVICE TYPE DISK GROUP STATUS ibm_ds8x000_02eb auto:cdsdisk ibm_ds8x000_02eb mydg online ibm_ds8x000_02ec auto:cdsdisk ibm_ds8x000_02ec mydg online new_ibm_ds8x000_02eb auto - - error new_ibm_ds8x000_02ec auto - - error DESCRIPTION:/etc/vx/darecs only stores foreign devices and nopriv or simple devices, the auto device should NOT be written into this file. A DA record is flushed in the /etc/vx/darecs at the end of transaction, if R_NOSTORE flag is NOT set on a DA record. There was a bug in VM where if we initialize a disk that does not exist(e.g. using vxdisk rm) in da_list, the R_NOSTORE flag is NOT set for the new created DA record. Hence duplicate entries for these devices were created and resulted in these DAs going in error state. RESOLUTION:Source has been modified to add R_NOSTORE flag for auto type DA record created by auto_init() or auto_define(). # vxdmpadm setattr enclosure ibm_ds8x000 name=new_ibm_ds8x000 # vxdisk -o alldgs list new_ibm_ds8x000_02eb auto:cdsdisk ibm_ds8x000_02eb mydg online new_ibm_ds8x000_02ec auto:cdsdisk ibm_ds8x000_02ec mydg online * INCIDENT NO:2382705 TRACKING ID:1675599 SYMPTOM:Vxconfigd leaks memory while excluding and including a Third party Driver controlled LUN in a loop. As part of this vxconfigd loses its license information and following error is seen in system log: "License has expired or is not available for operation" DESCRIPTION:In vxconfigd code, memory allocated for various data structures related to device discovery layer is not freed which led to the memory leak. RESOLUTION:The memory is released after its scope is over. * INCIDENT NO:2382710 TRACKING ID:2139179 SYMPTOM:DG import can fail with SSB (Serial Split Brain) though the SSB does not exist. DESCRIPTION:An association between DM and DA records is done while importing any DG, if the SSB id of the DM and DA records match. On a system with stale cloned disks, the system is attempting to associate the DM with cloned DA, where the SSB id mismatch is observed and resulted in import failure with SSB mismatch. RESOLUTION:The selection of DA to associate with DM is rectified to resolve the issue. * INCIDENT NO:2382714 TRACKING ID:2154287 SYMPTOM:In the presence of Not-Ready" devices when the SCSI inquiry on the device succeeds but open or read/write operations fail, one sees that paths to such devices are continuously marked as ENABLED and DISABLED for every DMP restore task cycle. DESCRIPTION:The issue is that the DMP restore task finds these paths connected and hence enables them for I/O but soon finds that they cannot be used for I/O and disables them RESOLUTION:The fix is to not enable the path unless it is found to be connected and available to open and issue I/O. * INCIDENT NO:2382717 TRACKING ID:2197254 SYMPTOM:vxassist, the VxVM volume creation utility when creating volume with "logtype=none" doesn't function as expected. DESCRIPTION:While creating volumes on thinrclm disks, Data Change Object(DCO) version 20 log is attached to every volume by default. If the user do not want this default behavior then "logtype=none" option can be specified as a parameter to vxassist command. But with VxVM on HP 11.31 , this option does not work and DCO version 20 log is created by default. 
The reason for this inconsistency is that when "logtype=none" option is specified, the utility sets the flag to prevent creation of log. However, VxVM wasn't checking whether the flag is set before creating DCO log which led to this issue. RESOLUTION:This is a logical issue which is addressed by code fix. The solution is to check for this corresponding flag of "logtype=none" before creating DCO version 20 by default. * INCIDENT NO:2382720 TRACKING ID:2216515 SYMPTOM:System could not boot-up after vxunreloc. If original offsets are used while un-relocation, it will corrupt the boot disk. DESCRIPTION:When root disk is not having any free space and it is encapsulated then encap process will steal some space from swap. And also it will create a public slice starting from "0" sector and it will create a -B0 subdisk to protect the cylinder "0" information. Hence public region length is bigger than it should have been when non-full disk is initialized. When disk is yanked out the rootvol, swapvol is relocated to a disk which is already re-initialized, so there is no need to reserve space for -B0 subdisk. Hence it will allocate rootvol, swapvol space from the public region and relocates the data.But when it comes to un-reloc, unreloc will try to create a subdisk on the new target disk and will try to keep the same offsets as the original failed disk. Hence it exceeds the public region slice and will overlap other slice, causing data corruption. RESOLUTION:Source has been modified to not use the original offsets for case of unrelocation of encapsulated root disk. It will display an info message during unrelocation to indicate this. # /etc/vx/bin/vxunreloc -g rootdg rootdg01 VxVM INFO V-5-2-0 Forcefully unrelocating the root disk without preserving original offsets * INCIDENT NO:2383705 TRACKING ID:2204752 SYMPTOM:The following message is observed after the diskgroup creation: "VxVM ERROR V-5-3-12240: GPT entries checksum mismatch" DESCRIPTION:This message is observed with the disk which was initialized as cds_efi and later on this was initialized as hpdisk. A harmless message "checksum mismatch" is thrown out even when the diskgroup initialization is successful. RESOLUTION:Remove the harmless message "GPT entries checksum mismatch" * INCIDENT NO:2384473 TRACKING ID:2064490 SYMPTOM:vxcdsconvert utility fails if disk capacity is greater than or equal to 1 TB DESCRIPTION:VxVM cdsdisk uses GPT layout if the disk capacity is greater than 1 TB and uses VTOC layout if the disk capacity is less 1 TB. Thus, vxcdsconvert utility was not able to convert to the GPT layout if the disk capacity is greater than or equal to 1 TB. RESOLUTION:This issue has been resolved by converting to proper cdsdisk layout depending on the disk capacity * INCIDENT NO:2384844 TRACKING ID:2356744 SYMPTOM:When "vxvm-recover" are executed manually, the duplicate instances of the Veritas Volume Manager(VxVM) daemons (vxattachd, vxcached, vxrelocd, vxvvrsecdgd and vxconfigbackupd) are invoked. When user tries to kill any of the daemons manually, the other instances of the daemons are left on this system. DESCRIPTION:The Veritas Volume Manager(VxVM) daemons (vxattachd, vxcached, vxrelocd, vxvvrsecdgd and vxconfigbackupd) do not have : 1. A check for duplicate instance. and 2. Mechanism to clean up the stale processes. Because of this, when user executes the startup script(vxvm-recover), all daemons are invoked again and if user kills any of the daemons manually, the other instances of the daemons are left on this system. 
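Before the resolution below, a note on how such a "duplicate instance check" for a daemon is commonly implemented: an exclusive lock on a pid file. This C sketch is only an illustration of the idea under that assumption; the VxVM daemons themselves are scripts, and the pid file path here is hypothetical:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <string.h>

    /* Take an exclusive, non-blocking lock on a pid file.  If the lock is
     * already held, another instance of the daemon is running and this one
     * should exit instead of starting a duplicate. */
    static int single_instance(const char *pidfile)
    {
        char buf[32];
        int fd = open(pidfile, O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return -1;
        if (lockf(fd, F_TLOCK, 0) < 0) {
            close(fd);
            return -1;                      /* duplicate instance detected */
        }
        snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
        (void)ftruncate(fd, 0);
        (void)write(fd, buf, strlen(buf));
        return fd;                          /* keep fd open for daemon lifetime */
    }

    int main(void)
    {
        if (single_instance("/var/run/example_vxdaemon.pid") < 0) {
            fprintf(stderr, "another instance is already running\n");
            return 1;
        }
        /* ... daemon work ... */
        return 0;
    }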
RESOLUTION:The VxVM daemons are modified to do the "duplicate instance check" and "stale process cleanup" appropriately. * INCIDENT NO:2386763 TRACKING ID:2346470 SYMPTOM:Dynamic Multi-Pathing administration operations such as "vxdmpadm exclude vxvm dmpnodename=" and "vxdmpadm include vxvm dmpnodename=" trigger memory leaks in the heap segment of the VxVM Configuration Daemon (vxconfigd). DESCRIPTION:vxconfigd allocates chunks of memory to store VxVM-specific information of the disk being included during the "vxdmpadm include vxvm dmpnodename=" operation. The allocated memory is not freed while excluding the same disk from VxVM control. Also, when excluding a disk from VxVM control, another chunk of memory is temporarily allocated by vxconfigd to store more details of the device being excluded. However, this memory is not freed at the end of the exclude operation. RESOLUTION:The memory allocated during the include operation of a disk is freed during the corresponding exclude operation of the disk. Also, the temporary memory allocated during the exclude operation of a disk is freed at the end of the exclude operation. * INCIDENT NO:2389095 TRACKING ID:2387993 SYMPTOM:In the presence of NR (Not-Ready) devices, vxconfigd (the VxVM configuration daemon) goes into disabled mode once restarted. # vxconfigd -k -x syslog # vxdctl mode mode: disabled If vxconfigd is restarted in debug mode at level 9, the following message can be seen. # vxconfigd -k -x 9 -x syslog VxVM vxconfigd DEBUG V-5-1-8856 DA_RECOVER() failed, thread 87: Kernel and on-disk configurations don't match DESCRIPTION:When vxconfigd is restarted, all the VxVM devices are recovered. As part of recovery, the capacity of the device is read, which can fail with EIO. This error is not handled properly. As a result, vxconfigd goes into the DISABLED state. RESOLUTION:The EIO error code from the read capacity ioctl is handled specifically. * INCIDENT NO:2390804 TRACKING ID:2249113 SYMPTOM:VVR volume recovery hangs in the vol_ru_recover_primlog_done() function in a dead loop. DESCRIPTION:During the SRL recovery, the SRL is read to apply the updates to the data volume. There can be holes in the SRL because some writes did not complete properly. These holes have to be skipped; such a region is read as a dummy update and sent to the secondary. If the dummy update size is larger than max_write (>256k), the code logic goes into a dead loop, reading the same dummy update forever. RESOLUTION:Handle the large holes which are greater than the VVR MAX_WRITE. * INCIDENT NO:2390815 TRACKING ID:2383158 SYMPTOM:A panic in vol_rv_mdship_srv_done() because the SIO is freed and has an invalid node pointer. DESCRIPTION:vol_rv_mdship_srv_done() panics when referencing wrsio->wrsrv_node because wrsrv_node holds an invalid pointer. It is also observed that the wrsio is freed or allocated for a different SIO. Looking closely, vol_rv_check_wrswaitq() is called at every done of an SIO; it looks into the waitq and releases every SIO which has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set on it. In vol_rv_mdship_srv_done(), we set this flag and then do more operations on wrsrv. During this time another SIO, which has completed with its DONE, calls vol_rv_check_wrswaitq() and deletes its own SIO as well as any other SIO which has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. This leads to deleting an SIO which is still in flight, causing the panic.
RESOLUTION:The flag must be set just before calling the function vol_rv_mdship_srv_done(), and at the end of the SIOdone(), so that other SIOs do not race and delete the currently running one.

* INCIDENT NO:2390822 TRACKING ID:2369786
SYMPTOM:On a VVR Secondary cluster, if the SRL disk goes bad, then vxconfigd may hang in the transaction code path.
DESCRIPTION:In case of any error seen in VVR shared disk group environments, error handling is done cluster wide. On the VVR Secondary, if the SRL disk goes bad due to some temporary or actual disk failure, it starts cluster-wide error handling. Error handling requires serialization; in some cases serialization was not done, which caused the error handling to go into a dead loop and hence the hang.
RESOLUTION:Making sure that the I/O is always serialized during error handling on the VVR Secondary resolved this issue.

* INCIDENT NO:2397663 TRACKING ID:2165394
SYMPTOM:If a cloned copy of a diskgroup and a destroyed diskgroup exist on the system, an import operation imports the destroyed diskgroup instead of the cloned one. For example, consider a system with diskgroup dg containing disk disk01. Disk disk01 is cloned to disk02. When diskgroup dg containing disk01 is destroyed and diskgroup dg is imported, VxVM should import dg with the cloned disk, i.e. disk02. However, it imports the diskgroup dg with disk01.
DESCRIPTION:After destroying a diskgroup, if a cloned copy of the same diskgroup exists on the system, the subsequent disk group import operation wrongly identifies the disks to be imported and hence the destroyed diskgroup gets imported.
RESOLUTION:The diskgroup import code is modified to identify the correct diskgroup when a cloned copy of the destroyed diskgroup exists.

* INCIDENT NO:2405446 TRACKING ID:2253970
SYMPTOM:Enhancement to customize the private region I/O size based on the maximum transfer size of the underlying disk.
DESCRIPTION:There are different types of array controllers which support data transfer sizes starting from 256K and beyond. The VxVM tunable volmax_specialio controls vxconfigd's configuration I/O as well as the Atomic Copy I/O size. When volmax_specialio is tuned to a value greater than 1MB to leverage the maximum transfer sizes of the underlying disks, the import operation fails for disks which cannot accept more than a 256K I/O size. If the tunable is set to 256K, then the larger transfer sizes of other disks are not leveraged.
RESOLUTION:All the scenarios mentioned in the Description are handled in this enhancement to leverage large disk transfer sizes as well as to support array controllers with 256K transfer sizes.

* INCIDENT NO:2411052 TRACKING ID:2268408
SYMPTOM:1) On suppressing the underlying path of a PowerPath-controlled device, the disk goes into the error state. 2) The "vxdmpadm exclude vxvm dmpnodename=" command does not suppress TPD devices.
DESCRIPTION:During discovery, the H/W path corresponding to the basename is not generated for PowerPath-controlled devices because the basename does not contain the slice portion. A device name with the s2 slice is expected while generating the H/W name.
RESOLUTION:The whole disk name, i.e. the device name with the s2 slice, is used to generate the H/W path.

* INCIDENT NO:2411053 TRACKING ID:2410845
SYMPTOM:If a DG (Disk Group) is imported with a reservation key, then lots of 'reservation conflict' messages are seen during DG deport.
[DATE TIME] [HOSTNAME] multipathd: VxVM26000: add path (uevent)
[DATE TIME] [HOSTNAME] multipathd: VxVM26000: failed to store path info
[DATE TIME] [HOSTNAME] multipathd: uevent trigger error
[DATE TIME] [HOSTNAME] multipathd: VxVM26001: add path (uevent)
[DATE TIME] [HOSTNAME] multipathd: VxVM26001: failed to store path info
[DATE TIME] [HOSTNAME] multipathd: uevent trigger error
..
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:1: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:1: reservation conflict
[DATE TIME] [HOSTNAME] kernel: sd 2:0:0:2: reservation conflict
DESCRIPTION:When removing a PGR (Persistent Group Reservation) key during DG deport, the key needs to be preempted, but the preempt operation fails with a reservation conflict error because the key passed for preemption is not correct.
RESOLUTION:Code changes are made to set the correct key value for the preemption operation.

* INCIDENT NO:2413077 TRACKING ID:2385680
SYMPTOM:The vol_rv_async_childdone() panic occurred because of a corrupted pripendingq.
DESCRIPTION:The pripendingq is always corrupted in this panic. The head entry is freed from the queue but not removed from it. In the mdship_srv_done code, for the error condition, the update is removed from the pripendingq only if the next or prev pointers of the updateq are non-null. This leads to the head pointer not getting removed in the abort scenario and causes the free to happen without deleting the entry from the queue.
RESOLUTION:The prev and next checks are removed in all the places. The abort case is also handled carefully for the following conditions: 1) abort of the logendq due to a slave node panic, i.e. the update entry exists but the update is not removed from the pripendingq; 2) vol_kmsg_eagain type of failures, i.e. the update is there, but it is removed from the pripendingq; 3) abort very early in mdship_sio_start(), i.e. the update is allocated but not yet in the pripendingq.

* INCIDENT NO:2413908 TRACKING ID:2413904
SYMPTOM:Performing Dynamic LUN reconfiguration operations (adding and removing LUNs) can cause corruption in the DMP database. This in turn may lead to a vxconfigd core dump OR a system panic.
DESCRIPTION:When a LUN is removed from VM using 'vxdisk rm' and at the same time some new LUN is added, and the newly added LUN reuses the devno of the removed LUN, the DMP database may get corrupted as this condition is not handled currently.
RESOLUTION:Fixed the DMP code to handle the mentioned issue.

* INCIDENT NO:2415566 TRACKING ID:2369177
SYMPTOM:When using > 2TB disks and the device responds to SCSI inquiry but fails to service I/O, data corruption can occur as the write I/O would be directed at an incorrect offset.
DESCRIPTION:Currently when the failed I/O is retried, DMP assumes the offset to be a 32-bit value, and hence I/O offsets beyond 2TB can get truncated, leading to the retried I/O being issued at a wrong offset.
RESOLUTION:Change the offset value to a 64-bit quantity to avoid truncation during I/O retries from DMP.
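
The fix above amounts to widening the variable that carries the retry offset. A minimal user-space sketch of the arithmetic, purely illustrative and not the actual DMP code (the variable names are hypothetical): once a LUN is larger than 2TB, its 512-byte sector offsets no longer fit in 32 bits, so storing them in a 32-bit field silently drops the high bits.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A write 3TB into a LUN, expressed in 512-byte sectors. */
    uint64_t sector = (3ULL * 1024 * 1024 * 1024 * 1024) / 512;

    uint32_t retry_sector32 = (uint32_t)sector; /* buggy: high bits dropped   */
    uint64_t retry_sector64 = sector;           /* fixed: full width retained */

    printf("intended sector offset : %llu\n", (unsigned long long)sector);
    printf("32-bit retry offset    : %llu (truncated, wrong location)\n",
           (unsigned long long)retry_sector32);
    printf("64-bit retry offset    : %llu (correct)\n",
           (unsigned long long)retry_sector64);
    return 0;
}
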
* INCIDENT NO:2415577 TRACKING ID:2193429
SYMPTOM:Enclosure attributes like iopolicy, recoveryoption etc. do not persist across reboots when the DMP driver was already configured before vold startup (e.g. in the case of root support) with a different array type than the one stored in array.info.
DESCRIPTION:When the DMP driver is already configured before vold comes up (as happens in root support), the enclosure attributes do not take effect if the enclosure name in the kernel has changed from the previous boot cycle. This is because when vold comes up, da_attr_list will be NULL. It then gets events from the DMP kernel for data structures already present in the kernel. On receiving this information, it tries to write da_attr_list into array.info, but since da_attr_list is NULL, array.info gets overwritten with no data. Hence later vold cannot correlate the enclosure attributes present in dmppolicy.info with the enclosures present in array.info, so the persistent attributes cannot be applied.
RESOLUTION:Do not overwrite array.info if da_attr_list is NULL.

* INCIDENT NO:2417184 TRACKING ID:2407192
SYMPTOM:Application I/O hangs on RVG volumes when the RVG logowner is being set on the node which takes over the master role (either as part of "vxclustadm setmaster" OR as part of the original master leaving).
DESCRIPTION:Whenever a node takes over the master role, RVGs are recovered on the new master. Because of a race between the RVG recovery thread (initiated as part of master takeover) and the thread which is changing the RVG logowner (which is run as part of "vxrvg set logowner=on"), RVG recovery does not get completed, which leads to the I/O hang.
RESOLUTION:The race condition is handled with appropriate locks and a conditional variable.

* INCIDENT NO:2421100 TRACKING ID:2419348
SYMPTOM:System panic caused by a race between DMP reconfiguration and a DMP pass-through ioctl.
DESCRIPTION:This panic is because of a race condition between vxconfigd doing a dmp_reconfigure_db() and another process (vxdclid) executing dmp_passthru_ioctl(). The stack of the vxdclid thread:
000002a107684d51 dmp_get_path_state+0xc(606a5b08140, 301937d9c20, 0, 0, 0, 0)
000002a107684e01 do_passthru_ioctl+0x76c(606a5b08140, 8, 0, 606a506c840, 606a506c848, 0)
000002a107684f61 dmp_passthru_ioctl+0x74(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc)
000002a107685031 dmpioctl+0x20(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc)
000002a1076850e1 fop_ioctl+0x20(60582fdfc00, 40b, 3ad4c0, 100081, 606a3d477b0, 1296a58)
000002a107685191 ioctl+0x184(a, 6065a188430, 3ad4c0, ff0bc910, ff1303d8, 40b)
000002a1076852e1 syscall_trap32+0xcc(a, 40b, 3ad4c0, ff0bc910, ff1303d8, ff13a5a0)
And the stack of vxconfigd, which is doing the reconfiguration:
vxdmp:dmp_get_iocount+0x68(0x7)
vxdmp:dmp_check_ios_drained+0x40()
vxdmp:dmp_check_ios_drained_in_dmpnode+0x40(0x60693cc0f00, 0x20000000)
vxdmp:dmp_decode_destroy_dmpnode+0x11c(0x2a10536b698, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_decipher_instructions+0x2e4(0x2a10536b758, 0x10, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_process_instruction_buffer+0x150(0x11d0003ffff, 0x3df634, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_reconfigure_db+0x48()
vxdmp:gendmpioctl(0x11d0003ffff, , 0x3df634, 0x102003, 0x604a7017298, 0x2a10536badc)
vxdmp:dmpioctl+0x20(, 0x444d5040, 0x3df634, 0x102003, 0x604a7017298)
In the vxdclid thread we try to get the dmpnode from the path_t structure, but at the same time the path_t has been freed as part of the reconfiguration, hence the panic.
RESOLUTION:Get the dmpnode from the lvl1tab table instead of the path_t structure. Because an ioctl is in progress on this dmpnode, the dmpnode will still be available at this time.

* INCIDENT NO:2421491 TRACKING ID:2396293
SYMPTOM:On VxVM-rooted systems, during machine bootup, vxconfigd dumps core with the following assert and the machine does not boot up.
Assertion failed: (0), file auto_sys.c, line 1024
05/30 01:51:25: VxVM vxconfigd ERROR V-5-1-0 IOT trap - core dumped
DESCRIPTION:DMP deletes and regenerates device numbers dynamically on every boot. When the static vxconfigd is started in boot mode, since the root file system is read-only, new DSFs for DMP nodes are not created. But DMP configures devices in userland and in the kernel, so there is a mismatch between the device numbers of the DSFs and those in the DMP kernel, as there are stale DSFs present from the previous boot. This leads vxconfigd to send I/Os to wrong device numbers, resulting in claiming disks with the wrong format.
RESOLUTION:The issue is fixed by getting the device numbers from vxconfigd and not doing stat on the DMP DSFs.

* INCIDENT NO:2423086 TRACKING ID:2033909
SYMPTOM:Disabling a controller of an A/P-G type array could lead to an I/O hang even when there are available paths for I/O.
DESCRIPTION:DMP was not clearing a flag, in an internal DMP data structure, to enable I/O to all the LUNs during the group failover operation.
RESOLUTION:The DMP code is modified to clear the appropriate flag for all the LUNs of the LUN group so that the failover can occur when a controller is disabled.

* INCIDENT NO:2428179 TRACKING ID:2425722
SYMPTOM:VxVM's subdisk operation - vxsd mv - fails on subdisk sizes greater than or equal to 2TB. Eg:
# vxsd -g nbuapp mv disk_1-03 disk_2-03
VxVM vxsd ERROR V-5-1-740 New subdisks have different size than subdisk disk_1-03, use -o force
DESCRIPTION:The VxVM code uses a 32-bit unsigned integer variable to store the size of subdisks, which can only accommodate values less than 2TB. Thus, for larger subdisk sizes the integer overflows, resulting in the subdisk move operation failure.
RESOLUTION:The code has been modified to accommodate larger subdisk sizes.

* INCIDENT NO:2435050 TRACKING ID:2421067
SYMPTOM:With VVR configured, 'vxconfigd' hangs on the primary site when trying to recover the SRL log after a system or storage failure.
DESCRIPTION:At the start of each SRL log disk we keep a config header. Part of this header includes a flag which is used by VVR to serialize the flushing of the SRL configuration table, to ensure only a single thread flushes the table at any one time. In this instance, the 'VOLRV_SRLHDR_CONFIG_FLUSHING' flag was set in the config header, and then the config header was written to disk. At this point the storage became inaccessible. During recovery the config header was read from disk, and when trying to initiate a new flush of the SRL table, the system hung as the flag was already set, indicating that a flush was in progress.
RESOLUTION:When loading the SRL header from disk, the flag 'VOLRV_SRLHDR_CONFIG_FLUSHING' is now cleared.

* INCIDENT NO:2436283 TRACKING ID:2425551
SYMPTOM:CVM reconfiguration takes 1 minute for each RVG configuration.
DESCRIPTION:Every RVG is given 1 minute to drain its I/O; if it is not drained, the code waits the full minute before aborting the I/Os waiting in the logendq. The logic is such that, for every RVG, it waits 1 minute for the I/Os to drain.
RESOLUTION:It is enough to give 1 minute overall for all RVGs and abort all the RVGs after that minute, instead of waiting 1 minute per RVG. The alternate (long-term) solution is to abort the RVG immediately when objiocount(rv) == queue_count(logendq). This would reduce the 1-minute delay further, down to the actually required time. In this, the following things have to be taken care of: 1. rusio may be active, which needs to be accounted for in the iocount; 2.
every I/O goes into the logendq before getting serviced, so it has to be made sure that they are not in the process of being serviced.

* INCIDENT NO:2436287 TRACKING ID:2428875
SYMPTOM:On a CVR configuration, with I/O issued from both master and slave, a reboot of the slave leads to a reconfiguration hang.
DESCRIPTION:The I/Os on both master and slave fill up the SRL and it goes into DCM mode. In DCM mode, the header flush to flush the DCM and the SRL header happens every 512 updates. Since most of the I/Os are from the slave node, the I/Os throttled due to the header flush are queued in the mdship_throttle_q. This queue is flushed at the end of the header flush. If the slave node is rebooted while SIOs are in the throttle queue, the reconfiguration code path does not flush the mdship_throttle_q and waits for the SIOs to drain. This leads to the reconfiguration hang due to a positive I/O count.
RESOLUTION:Abort all the SIOs queued in the mdship_throttle_q when the node is aborted. Restart the SIOs for the nodes that did not leave.

* INCIDENT NO:2436288 TRACKING ID:2411698
SYMPTOM:I/Os hang in a CVR (Clustered Volume Replicator) environment.
DESCRIPTION:In a CVR environment, when a CVM (Clustered Volume Manager) Slave node sends a write request to the CVM Master node, the following tasks occur. 1) The Master grabs the *REGION LOCK* for the write and permits the slave to issue the write. 2) When new IOs occur on the same region (till the write that acquired the *REGION LOCK* is complete), they wait in a *REGION LOCK QUEUE*. 3) Once the IO that acquired the *REGION LOCK* is serviced by the slave node, it responds to the Master about the same, and the Master processes the IOs queued in the *REGION LOCK QUEUE*. The problem occurs when the slave node dies before sending the response to the Master about completion of the IO that held the *REGION LOCK*.
RESOLUTION:Code changes have been made to accommodate the condition as mentioned in the section "DESCRIPTION".

* INCIDENT NO:2440351 TRACKING ID:2440349
SYMPTOM:The grow operation on a DCO volume may grow it into any 'site', not honoring the allocation requirements strictly.
DESCRIPTION:When a DCO volume is grown, it may not honor the allocation specification strictly to use only a particular site, even though the sites are specified explicitly.
RESOLUTION:The Data Change Object code of Volume Manager is modified such that it honors the allocation specification strictly if provided explicitly.

* INCIDENT NO:2442850 TRACKING ID:2317703
SYMPTOM:When the vxesd daemon is invoked by device attach & removal operations in a loop, it leaves open file descriptors with the vxconfigd daemon.
DESCRIPTION:The issue is caused by multiple vxesd daemon threads trying to establish contact with the vxconfigd daemon at the same time and ending up losing track of the file descriptor through which the communication channel was established.
RESOLUTION:The fix for this issue is to maintain a single file descriptor that has a thread-safe reference counter, thereby not having multiple communication channels established between vxesd and vxconfigd by the various threads of vxesd.

* INCIDENT NO:2477291 TRACKING ID:2428631
SYMPTOM:Shared DG import or Node Join fails with Hitachi Tagmastore storage.
DESCRIPTION:CVM uses a different fence key for every DG. The key format is of type 'NPGRSSSS' where N is the node id (A,B,C..) and 'SSSS' is the sequence number.
Some arrays have a restriction on the total number of unique keys that can be registered (e.g. Hitachi Tagmastore), and hence this causes issues for configurations involving a large number of DGs, or rather a large product of #DGs and #nodes in the cluster.
RESOLUTION:Having a unique key for each DG is not essential, hence a tunable is added to control this behavior.
# vxdefault list
KEYWORD              CURRENT-VALUE   DEFAULT-VALUE
...
same_key_for_alldgs  off             off
...
The default value of the tunable is 'off' to preserve the current behavior. If a configuration hits the storage array limit on the total number of unique keys, the tunable value can be changed to 'on'.
# vxdefault set same_key_for_alldgs on
# vxdefault list
KEYWORD              CURRENT-VALUE   DEFAULT-VALUE
...
same_key_for_alldgs  on              off
...
This makes CVM generate the same key for all subsequent DG imports/creates. Already imported DGs need to be deported and re-imported for them to take the changed value of the tunable into consideration.

* INCIDENT NO:2479746 TRACKING ID:2406292
SYMPTOM:In case of I/Os on volumes having multiple subdisks (for example striped volumes), the system panics with the following stack.
unix:panicsys+0x48()
unix:vpanic_common+0x78()
unix:panic+0x1c()
genunix:kmem_error+0x4b4()
vxio:vol_subdisksio_delete() - frame recycled
vxio:vol_plexsio_childdone+0x80()
vxio:volsiodone() - frame recycled
vxio:vol_subdisksio_done+0xe0()
vxio:volkcontext_process+0x118()
vxio:voldiskiodone+0x360()
vxio:voldmp_iodone+0xc()
genunix:biodone() - frame recycled
vxdmp:gendmpiodone+0x1ec()
ssd:ssd_return_command+0x240()
ssd:ssdintr+0x294()
fcp:ssfcp_cmd_callback() - frame recycled
qlc:ql_fast_fcp_post+0x184()
qlc:ql_status_entry+0x310()
qlc:ql_response_pkt+0x2bc()
qlc:ql_isr_aif+0x76c()
pcisch:pci_intr_wrapper+0xb8()
unix:intr_thread+0x168()
unix:ktl0+0x48()
DESCRIPTION:On a striped volume, the I/O is split into multiple parts, one for each subdisk in the stripe. Each part of the I/O is processed in parallel by different threads, so any two such threads processing the I/O completion can enter into a race condition. Due to such a race condition, one of the threads can access a stale address, causing the system panic.
RESOLUTION:The critical section of code is modified to hold appropriate locks to avoid the race condition.

* INCIDENT NO:2480006 TRACKING ID:2400654
SYMPTOM:The "vxdmpadm listenclosure" command hangs because of duplicate enclosure entries in the /etc/vx/array.info file. Example: Enclosure "emc_clariion0" has two entries.
# cat /etc/vx/array.info
DD4VM1S emc_clariion0 0 EMC_CLARiiON
DISKS disk 0 Disk
DD3VM2S emc_clariion0 0 EMC_CLARiiON
DESCRIPTION:When the "vxdmpadm listenclosure" command is run, vxconfigd reads its in-core enclosure list, which is populated from the /etc/vx/array.info file. Since the enclosure "emc_clariion0" (as mentioned in the example) is also the last entry within the file, the command expects vxconfigd to return the enclosure information at the last index of the enclosure list. However, because of the duplicate enclosure entries, vxconfigd returns different enclosure information, thereby leading to the hang.
RESOLUTION:The code changes are made in vxconfigd to detect duplicate entries in the /etc/vx/array.info file and return the appropriate enclosure information as requested by the vxdmpadm command.

* INCIDENT NO:2484466 TRACKING ID:2480600
SYMPTOM:I/Os of large sizes like 512k and 1024k hang in CVR (Clustered Volume Replicator).
DESCRIPTION:When large IOs, say of sizes like 1MB, are performed on volumes under an RVG (Replicated Volume Group), only a limited number of IOs can be accommodated based on the RVIOMEM pool limit, so the pool remains full for the majority of the duration. At this time, when the CVM (Clustered Volume Manager) slave gets rebooted, or goes down, the pending IOs are aborted and the corresponding memory is freed. In one of the cases it does not get freed, leading to the hang.
RESOLUTION:Code changes have been made to free the memory under all scenarios.

* INCIDENT NO:2484695 TRACKING ID:2484685
SYMPTOM:In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace:
000002a10247a7a1 vpanic()
000002a10247a851 kmem_error+0x4b4()
000002a10247a921 vol_subdisksio_done+0xe0()
000002a10247a9d1 volkcontext_process+0x118()
000002a10247aaa1 voldiskiodone+0x360()
000002a10247abb1 voldmp_iodone+0xc()
000002a10247ac61 gendmpiodone+0x1ec()
000002a10247ad11 ssd_return_command+0x240()
000002a10247add1 ssdintr+0x294()
000002a10247ae81 ql_fast_fcp_post+0x184()
000002a10247af31 ql_24xx_status_entry+0x2c8()
000002a10247afe1 ql_response_pkt+0x29c()
000002a10247b091 ql_isr_aif+0x76c()
000002a10247b181 px_msiq_intr+0x200()
000002a10247b291 intr_thread+0x168()
000002a10240b131 cpu_halt+0x174()
000002a10240b1e1 idle+0xd4()
000002a10240b291 thread_start+4()
DESCRIPTION:A race condition exists between two IOs (specifically Volume Manager subdisk-level staged I/Os) while doing 'done' processing, which causes one thread to free the FS-VM private information data structure before the other thread accesses it. The propensity of the race increases with the number of CPUs.
RESOLUTION:Avoid the race condition such that the slower thread does not access the freed FS-VM private information data structure.

* INCIDENT NO:2485278 TRACKING ID:2386120
SYMPTOM:The error messages printed in the syslog in the event of a master takeover failure are, in some situations, not enough to find out the root cause of the failure.
DESCRIPTION:During master takeover, if the new master encounters some errors, the master takeover operation fails. There are messages in the code to log the reasons for the failure, but these log messages are not available on customer setups; they are generally enabled only in internal development/testing scenarios.
RESOLUTION:Some of the relevant messages have been modified such that they will now be available on customer setups as well, logging crucial information for root cause analysis of the issue.

* INCIDENT NO:2485288 TRACKING ID:2431470
SYMPTOM:vxpfto sets the PFTO (Powerfail Timeout) value on a wrong VxVM device.
DESCRIPTION:vxpfto invokes the 'vxdisk set' command to set the PFTO value. vxdisk accepts both DA (Disk Access) and DM (Disk Media) names for device specification. DA and DM names can have conflicts such that, even within the same disk group, the same name can refer to different devices - one as a DA name and another as a DM name. The vxpfto command uses DM names when invoking the vxdisk command, but vxdisk will choose a matching DA name before a DM name. This causes an incorrect device to be acted upon.
RESOLUTION:Fixed the argument check procedure in 'vxdisk set' based on the common rule of VxVM, i.e. if a disk group is specified with the '-g' option, then only a DM name is supported; otherwise it can be a DA name.
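
The rule stated in the resolution above can be captured in a small lookup sketch. This is illustrative only; resolve_disk(), dm_lookup() and da_lookup() are hypothetical stand-ins for vxconfigd's real name resolution, not VxVM interfaces.

#include <stdio.h>
#include <string.h>

/* Hypothetical records: DM (disk media) names inside a disk group, and a
 * DA (disk access) name that happens to collide with one of them. */
static const char *dm_names[] = { "mydg01", "mydg02" };
static const char *da_names[] = { "mydg01", "c2t1d0s2" };

static const char *dm_lookup(const char *dg, const char *name)
{
    (void)dg;
    for (size_t i = 0; i < sizeof(dm_names) / sizeof(dm_names[0]); i++)
        if (strcmp(dm_names[i], name) == 0)
            return "DM record";
    return NULL;
}

static const char *da_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(da_names) / sizeof(da_names[0]); i++)
        if (strcmp(da_names[i], name) == 0)
            return "DA record";
    return NULL;
}

/* Corrected rule: with a disk group given (-g), accept only a DM name and
 * never fall back to a DA match; without -g the name may resolve as DA. */
static const char *resolve_disk(const char *dg, const char *name)
{
    return dg ? dm_lookup(dg, name) : da_lookup(name);
}

int main(void)
{
    printf("with -g mydg : %s\n", resolve_disk("mydg", "mydg01")); /* DM record */
    printf("without -g   : %s\n", resolve_disk(NULL, "mydg01"));   /* DA record */
    return 0;
}
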
* INCIDENT NO:2488042 TRACKING ID:2431423
SYMPTOM:Panic in vol_mv_commit_check() while accessing the Data Change Map (DCM) object. Stack trace of the panic:
vol_mv_commit_check at ffffffffa0bef79e
vol_ktrans_commit at ffffffffa0be9b93
volconfig_ioctl at ffffffffa0c4a957
volsioctl_real at ffffffffa0c5395c
vols_ioctl at ffffffffa1161122
sys_ioctl at ffffffff801a2a0f
compat_sys_ioctl at ffffffff801ba4fb
sysenter_do_call at ffffffff80125039
DESCRIPTION:In case of a DCM failure, the object pointer is set to NULL as part of the transaction. If the DCM is active, the DCM object is accessed in the transaction code path without checking it for NULL. The DCM object pointer can be NULL in case of a failed DCM; accessing the object pointer without a NULL check caused this panic.
RESOLUTION:The fix is to add a NULL check for the DCM object in the transaction code path.

* INCIDENT NO:2491856 TRACKING ID:2424833
SYMPTOM:The VVR primary node crashes while replicating in a lossy and high-latency network with multiple TCP connections. In a debug VxVM build, a TED assert is hit with the following stack:
brkpoint+000004 ()
ted_call_demon+00003C (0000000007D98DB8)
ted_assert+0000F0 (0000000007D98DB8, 0000000007D98B28, 0000000000000000)
.hkey_legacy_gate+00004C ()
nmcom_send_msg_tcp+000C20 (F100010A83C4E000, 0000000200000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 000000DA000000DA, 0000000100000000)
.nmcom_connect_tcp+0007D0 ()
vol_rp_connect+0012D0 (F100010B0408C000)
vol_rp_connect_start+000130 (F1000006503F9308, 0FFFFFFFF420FC50)
voliod_iohandle+0000AC (F1000006503F9308, 0000000100000001, 0FFFFFFFF420FC50)
voliod_loop+000CFC (0000000000000000)
vol_kernel_thread_init+00002C (0FFFFFFFF420FFF0)
threadentry+000054 (??, ??, ??, ??)
DESCRIPTION:In a lossy and high-latency network, the connection between the VVR primary and secondary can get closed and re-established frequently because of heartbeat timeouts or DATA acknowledgement timeouts. In a TCP multi-connection scenario, the VVR primary sends its very first message (called NMCOM_HANDSHAKE) to the secondary on the zeroth socket connection and then sends an "NMCOM_SESSION" message for each of the next connections. If for some reason the sending of the NMCOM_HANDSHAKE message fails, the VVR primary tries to send it through another connection without checking whether that connection is valid or not.
RESOLUTION:Code changes are made in VVR to use the other connections only after all the connections are established.
PATCH ID:142630-11

* INCIDENT NO:2280640 TRACKING ID:2205108
SYMPTOM:On VxVM 5.1SP1 or later, device discovery operations such as vxdctl enable, vxdisk scandisks and vxconfigd -k fail to claim new disks correctly. For example, if a user provisions five new disks, VxVM, instead of creating five different Dynamic Multi-Pathing (DMP) nodes, creates only one and includes the rest as its paths. Also, the following message is displayed at the console during this problem.
NOTICE: VxVM vxdmp V-5-0-34 added disk array , datype =
Please note that the cabinet serial number following "disk array" and the value of "datype" are not printed in the above message.
DESCRIPTION:VxVM's DDL (Device Discovery Layer) is responsible for appropriately claiming newly provisioned disks. Due to a bug in one of the routines within this layer, though the disks are claimed, their LSN (LUN Serial Number, a unique identifier of disks) is ignored, thereby every disk is wrongly categorized under a single DMP node.
RESOLUTION:Modified the problematic code within the DDL so that new disks are claimed appropriately.
WORKAROUND: If vxconfigd does not hang or dump a core with this issue, a reboot can be used as a workaround to recover from this situation, or the DMP/DDL database on the devices can be torn down once and rebuilt with the following steps:
# vxddladm excludearray all
# mv /etc/vx/jbod.info /etc/vx/jbod.info.org
# vxddladm disablescsi3
# devfsadm -Cv
# vxconfigd -k
# vxddladm includearray all
# mv /etc/vx/jbod.info.org /etc/vx/jbod.info
# vxddladm enablescsi3
# rm /etc/vx/disk.info /etc/vx/array.info
# vxconfigd -k

* INCIDENT NO:2291967 TRACKING ID:2286559
SYMPTOM:System panics in the DMP (Dynamic Multi Pathing) kernel module due to kernel heap corruption while DMP path failover is in progress. The panic stack may look like:
vpanic
kmem_error+0x4b4()
gen_get_enabled_ctlrs+0xf4()
dmp_get_enabled_ctlrs+0xf4()
dmp_info_ioctl+0xc8()
dmpioctl+0x20()
dmp_get_enabled_cntrls+0xac()
vx_dmp_config_ioctl+0xe8()
quiescesio_start+0x3e0()
voliod_iohandle+0x30()
voliod_loop+0x24c()
thread_start+4()
DESCRIPTION:During path failover in DMP, the routine gen_get_enabled_ctlrs() allocates memory proportional to the number of enabled paths. However, while releasing the memory, the routine may end up freeing more memory because of a change in the number of enabled paths.
RESOLUTION:Code changes have been made in the routines to free only the allocated memory.

* INCIDENT NO:2299977 TRACKING ID:2299670
SYMPTOM:VxVM disk groups created on EFI (Extensible Firmware Interface) LUNs do not get auto-imported during system boot in VxVM version 5.1SP1 and later.
DESCRIPTION:While determining the disk format of EFI LUNs, the stat() system call on the corresponding DMP devices fails with the ENOENT ("No such file or directory") error because the DMP device nodes are not created in the root file system during system boot. This leads to failure in the auto-import of disk groups created on EFI LUNs.
RESOLUTION:The VxVM code is modified to use OS raw device nodes if stat() fails on the DMP device nodes.

* INCIDENT NO:2318820 TRACKING ID:2317540
SYMPTOM:System panic due to kernel heap corruption during DMP device driver unload. The panic stack on Solaris (when kmem_flags is set to either 0x100 or 0xf) is similar to the one below:
vpanic()
kmem_error+0x4b4()
dmp_free_stats_table+0x118()
dmp_free_modules+0x24()
vxdmp`_fini+0x178()
moduninstall+0x148()
modunrload+0x6c()
modctl+0x54()
syscall_trap+0xac()
DESCRIPTION:During DMP kernel device driver unload, it frees all the allocated kernel heap memory. As part of freeing the allocated memory, DMP tries to free more than the allocated buffer size for one of the allocated buffers, which leads to a system panic when kernel memory auditing is enabled.
RESOLUTION:The source code is modified to free the kernel buffer with a size aligned to the allocation size.

* INCIDENT NO:2320613 TRACKING ID:2313021
SYMPTOM:In a Sun Cluster environment, nodes fail to join the CVM cluster after their reboot, displaying the following messages on the console:
<> vxio: [ID 557667 kern.notice] NOTICE: VxVM vxio V-5-3-1251 joinsio_done: Overlapping reconfiguration, failing the join for node 1. The join will be retried.
<> vxio: [ID 976272 kern.notice] NOTICE: VxVM vxio V-5-3-672 abort_joinp: aborting joinp for node 1 with err 11
<> vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-12144 CVM_VOLD_JOINOVER command received with error
DESCRIPTION:A reboot of a node within a CVM cluster involves a "node leave" followed by a "node join" reconfiguration. During CVM reconfiguration, each node exchanges reconfiguration messages with the other nodes using the UDP protocol.
At the end of a CVM reconfiguration, the messages exchanged should be deleted from all the nodes in the cluster. However, due to a bug in CVM, the messages were not deleted as part of the "node leave" reconfiguration processing on some nodes, which resulted in failure of subsequent "node join" reconfigurations.
RESOLUTION:After every CVM reconfiguration, the processed reconfiguration messages on all the nodes in the CVM cluster are deleted properly.

* INCIDENT NO:2322742 TRACKING ID:2108152
SYMPTOM:vxconfigd, the VxVM volume configuration daemon, fails to get into enabled mode at startup and the "vxdctl enable" command displays the error "VxVM vxdctl ERROR V-5-1-1589 enable failed: Error in disk group configuration copies".
DESCRIPTION:vxconfigd issues an input/output control system call (ioctl) to read the disk capacity from disks. However, if the call fails, the error number is not propagated back to vxconfigd. The subsequent disk operations to these failed devices were causing vxconfigd to get into disabled mode.
RESOLUTION:The fix is made to propagate the actual "error number" returned by the ioctl failure back to vxconfigd.

* INCIDENT NO:2322757 TRACKING ID:2322752
SYMPTOM:Duplicate device names are observed for NR (Not Ready) devices when vxconfigd is restarted (vxconfigd -k).
# vxdisk list
emc0_0052 auto - - error
emc0_0052 auto:cdsdisk - - error
emc0_0053 auto - - error
emc0_0053 auto:cdsdisk - - error
DESCRIPTION:During the vxconfigd restart, disk access records are rebuilt in the vxconfigd database. As part of this process, IOs are issued on all the devices to read the disk private regions. The failure of these IOs on NR devices resulted in creating duplicate disk access records.
RESOLUTION:The vxconfigd code is modified not to create duplicate disk access records.

* INCIDENT NO:2333255 TRACKING ID:2253552
SYMPTOM:vxconfigd leaks memory while reading the default tunables related to smartmove (a VxVM feature).
DESCRIPTION:In vxconfigd, memory allocated for the default tunables related to the smartmove feature is not freed, causing a memory leak.
RESOLUTION:The memory is released after its scope is over.

* INCIDENT NO:2333257 TRACKING ID:1675599
SYMPTOM:vxconfigd leaks memory while excluding and including a Third Party Driver controlled LUN in a loop. As part of this, vxconfigd loses its license information and the following error is seen in the system log: "License has expired or is not available for operation"
DESCRIPTION:In the vxconfigd code, memory allocated for various data structures related to the device discovery layer is not freed, which led to the memory leak.
RESOLUTION:The memory is released after its scope is over.

* INCIDENT NO:2337237 TRACKING ID:2337233
SYMPTOM:Excluding a TPD device with the "vxdmpadm exclude" command does not work. The excluded device is still shown in the "vxdisk list" output. Example:
# vxdmpadm exclude vxvm dmpnodename=emcpower22s2
# cat /etc/vx/vxvm.exclude
exclude_all 0
paths
emcpower22c /pseudo/emcp@22 emcpower22s2
#
controllers
#
product
#
pathgroups
#
# vxdisk scandisks
# vxdisk list | grep emcpower22s2
emcpower22s2 auto:sliced - - online
DESCRIPTION:Because of a bug in the logic of path name comparison, DMP ends up including disks in device discovery which are part of the exclude list.
RESOLUTION:The code in DMP is corrected to handle path name comparison appropriately.

* INCIDENT NO:2337354 TRACKING ID:2337353
SYMPTOM:The "vxdmpadm include" command includes all the excluded devices along with the device given in the command.
Example:
# vxdmpadm exclude vxvm dmpnodename=emcpower25s2
# vxdmpadm exclude vxvm dmpnodename=emcpower24s2
# more /etc/vx/vxvm.exclude
exclude_all 0
paths
emcpower24c /dev/rdsk/emcpower24c emcpower25s2
emcpower10c /dev/rdsk/emcpower10c emcpower24s2
#
controllers
#
product
#
pathgroups
#
# vxdmpadm include vxvm dmpnodename=emcpower24s2
# more /etc/vx/vxvm.exclude
exclude_all 0
paths
#
controllers
#
product
#
pathgroups
#
DESCRIPTION:When a dmpnode is excluded, an entry is made in the /etc/vx/vxvm.exclude file. This entry has to be removed when the dmpnode is included later. Due to a bug in the comparison of dmpnode device names, all the excluded devices are included.
RESOLUTION:The bug in the code which compares the dmpnode device names is rectified.

* INCIDENT NO:2339254 TRACKING ID:2339251
SYMPTOM:On Solaris 10, newfs/mkfs_ufs(1M) fails to create a UFS file system on a "VxVM volume > 2 Tera Bytes" with the following error:
# newfs /dev/vx/rdsk/[disk group]/[volume]
newfs: construct a new file system /dev/vx/rdsk/[disk group]/[volume]: (y/n)? y
Can not determine partition size: Inappropriate ioctl for device
The truss output of newfs/mkfs_ufs(1M) shows that the ioctl() system calls, used to identify the size of the disk or volume device, fail with the ENOTTY error.
ioctl(3, 0x042A, ...) Err#25 ENOTTY
...
ioctl(3, 0x0412, ...) Err#25 ENOTTY
DESCRIPTION:On Solaris 10, newfs/mkfs_ufs(1M) uses ioctl() system calls to identify the size of the disk or volume device when creating a UFS file system on disk or volume devices "> 2TB". If the Operating System (OS) version is less than Solaris 10 Update 8, the above ioctl system calls are invoked on "volumes > 1TB" as well. VxVM (Veritas Volume Manager) exports the ioctl interfaces for VxVM volumes. VxVM 5.1 SP1 RP1 P1 and VxVM 5.0 MP3 RP3 introduced the support for Extensible Firmware Interface (EFI) for VxVM volumes in Solaris 9 and Solaris 10 respectively. However, the corresponding EFI-specific build-time definition in the Veritas Kernel IO driver (VXIO) was not updated for Solaris 10 in VxVM 5.1 SP1 RP1 P1 and onwards.
RESOLUTION:The code changes to add the build-time definition for EFI in VXIO enable newfs/mkfs_ufs(1M) to successfully create a UFS file system on VxVM volume devices "> 2TB" ("> 1TB" if the OS version is less than Solaris 10 Update 8).

* INCIDENT NO:2346469 TRACKING ID:2346470
SYMPTOM:The Dynamic Multi-Pathing administration operations such as "vxdmpadm exclude vxvm dmpnodename=" and "vxdmpadm include vxvm dmpnodename=" trigger memory leaks in the heap segment of the VxVM Configuration Daemon (vxconfigd).
DESCRIPTION:vxconfigd allocates chunks of memory to store VxVM-specific information of the disk being included during the "vxdmpadm include vxvm dmpnodename=" operation. The allocated memory is not freed while excluding the same disk from VxVM control. Also, when excluding a disk from VxVM control, another chunk of memory is temporarily allocated by vxconfigd to store more details of the device being excluded. However, this memory is not freed at the end of the exclude operation.
RESOLUTION:Memory allocated during the include operation of a disk is freed during the corresponding exclude operation of the disk. Also, temporary memory allocated during the exclude operation of a disk is freed at the end of the exclude operation.
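
The leak fixed by incidents 2346469/2386763 is a pairing problem: memory allocated on the include path had no matching free on the exclude path, and a temporary buffer used during exclude was never released either. A minimal user-space sketch of the corrected pairing; include_disk(), exclude_disk() and the disk_info structure are hypothetical, not vxconfigd code.

#include <stdlib.h>
#include <string.h>

/* Hypothetical per-disk record kept by the daemon while a disk is included. */
struct disk_info {
    char *dmpnodename;
    char *props;            /* VxVM-specific details gathered at include time */
};

static struct disk_info *include_disk(const char *dmpnode)
{
    struct disk_info *di = calloc(1, sizeof(*di));
    if (!di)
        return NULL;
    di->dmpnodename = strdup(dmpnode);
    di->props = strdup("...device properties...");
    return di;
}

static void exclude_disk(struct disk_info *di)
{
    char *tmp = strdup(di->dmpnodename);  /* temporary details of the device  */

    /* ... exclude processing would use tmp here ... */

    free(tmp);              /* fix 2: temporary buffer freed at end of exclude */
    free(di->props);        /* fix 1: include-time allocations freed here      */
    free(di->dmpnodename);
    free(di);
}

int main(void)
{
    struct disk_info *di = include_disk("emcpower10s2");
    if (di)
        exclude_disk(di);
    return 0;
}
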
* INCIDENT NO:2349497 TRACKING ID:2320917
SYMPTOM:vxconfigd, the VxVM configuration daemon, dumps core and loses the disk group configuration while invoking the following VxVM reconfiguration steps: 1) Volumes which were created on thin reclaimable disks are deleted. 2) Before the space of the deleted volumes is reclaimed, the disks (whose volumes were deleted) are removed from the DG with the 'vxdg rmdisk' command using the '-k' option. 3) The disks are removed using the 'vxedit rm' command. 4) New disks are added to the disk group using the 'vxdg adddisk' command. The stack trace of the core dump is:
[ 0006f40c rec_lock3 + 330
0006ea64 rec_lock2 + c
0006ec48 rec_lock2 + 1f0
0006e27c rec_lock + 28c
00068d78 client_trans_start + 6e8
00134d00 req_vol_trans + 1f8
00127018 request_loop + adc
000f4a7c main + fb0
0003fd40 _start + 108 ]
DESCRIPTION:When a volume is deleted from a disk group that uses thin reclaim LUNs, its subdisks are not removed immediately; rather they are marked with a special flag. The reclamation happens at a scheduled time every day. The "vxdefault" command can be invoked to list and modify the settings. After the disk is removed from the disk group using the 'vxdg -k rmdisk' and 'vxedit rm' commands, the subdisk records are still in the in-core database and point to the disk media record which has been freed. When the next command is run to add another new disk to the disk group, vxconfigd dumps core when locking the disk media record which has already been freed. The subsequent disk group deport and import commands erase the entire disk group configuration, as an invalid association between the subdisks and the removed disk is detected.
RESOLUTION:1) The following message will be printed when 'vxdg rmdisk' is used to remove a disk that has reclaim-pending subdisks:
VxVM vxdg ERROR V-5-1-0 Disk is used by one or more subdisks which are pending to be reclaimed. Use "vxdisk reclaim " to reclaim space used by these subdisks, and retry "vxdg rmdisk" command. Note: reclamation is irreversible.
2) A check is added when using 'vxedit rm' to remove a disk. If the disk is in the removed state and has reclaim-pending subdisks, the following error message will be printed:
VxVM vxedit ERROR V-5-1-10127 deleting : Record is associated

* INCIDENT NO:2349553 TRACKING ID:2353493
SYMPTOM:On Solaris 10, the "pkgchk" command on the VxVM package fails with the following error:
# pkgchk -a VRTSvxvm
ERROR: /usr/lib/libvxscsi.so.SunOS_5.10 pathname does not exist
DESCRIPTION:During installation of the VxVM package, the VxVM library libvxscsi.so did not get installed in the path /usr/lib/libvxscsi.so.SunOS_5.10, which is a prerequisite for successful execution of the 'pkgchk' command.
RESOLUTION:VxVM's installation scripts are modified to include the library at the correct location.

* INCIDENT NO:2353429 TRACKING ID:2334757
SYMPTOM:vxconfigd consumes a lot of memory when the DMP tunable dmp_probe_idle_lun is set to on. The "pmap" command on the vxconfigd process shows a continuously growing heap.
DESCRIPTION:The DMP path restoration daemon probes idle LUNs (idle LUNs are VxVM disks on which no I/O requests are scheduled) and generates notify events to vxconfigd. vxconfigd in turn sends the notification of these events to its clients. If for any reason vxconfigd cannot deliver these events (because a client is busy processing an earlier event), it keeps these events to itself. Because of this slow consumption of events by its clients, the memory consumption of vxconfigd grows.
RESOLUTION:dmp_probe_idle_lun is set to off by default.
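
The growth described above is an unbounded producer/consumer backlog: the idle-LUN probe keeps generating notifications while a slow client never drains them, so every undelivered event stays on the daemon's heap. A tiny illustrative sketch of that effect, not VxVM code (the structures and names are hypothetical); with dmp_probe_idle_lun set to off the producer simply stops firing.

#include <stdlib.h>

/* Hypothetical pending-event list kept per client by a daemon. */
struct event  { struct event *next; char payload[256]; };
struct client { struct event *pending; size_t npending; };

static void post_event(struct client *c)
{
    struct event *e = calloc(1, sizeof(*e));
    if (!e)
        return;
    e->next = c->pending;    /* undeliverable events are simply retained */
    c->pending = e;
    c->npending++;
}

int main(void)
{
    struct client c = { 0 };
    /* A probe firing periodically with no consumer drains nothing: after N
     * probes the daemon holds N * sizeof(struct event) bytes on its heap.
     * Stopping the probing (the producer) is what bounds the growth. */
    for (int i = 0; i < 100000; i++)
        post_event(&c);
    return c.npending > 0 ? 0 : 1;
}
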
* INCIDENT NO:2357935 TRACKING ID:2349352
SYMPTOM:Data corruption is observed on a DMP device with a single path during storage reconfiguration (LUN addition/removal).
DESCRIPTION:Data corruption can occur in the following configuration, when new LUNs are provisioned or removed under VxVM while applications are online. 1. The DMP device naming scheme is EBN (enclosure based naming) and persistence=no. 2. The DMP device is configured with a single path, or the devices are controlled by a Third Party Multipathing Driver (e.g. MPxIO, MPIO etc.). There is a possibility of a change in the name of the VxVM devices (DA records) when LUNs are removed or added followed by the following commands, since persistent naming is turned off:
(a) vxdctl enable
(b) vxdisk scandisks
Execution of the above commands discovers all the devices and rebuilds the device attribute list with new DMP device names. The VxVM device records are then updated with these new attributes. Due to a bug in the code, the VxVM device records are mapped to the wrong DMP devices. Example: The following are the devices before adding new LUNs.
sun6130_0_16 auto - - nolabel
sun6130_0_17 auto - - nolabel
sun6130_0_18 auto:cdsdisk disk_0 prod_SC32 online nohotuse
sun6130_0_19 auto:cdsdisk disk_1 prod_SC32 online nohotuse
The following are after adding new LUNs.
sun6130_0_16 auto - - nolabel
sun6130_0_17 auto - - nolabel
sun6130_0_18 auto - - nolabel
sun6130_0_19 auto - - nolabel
sun6130_0_20 auto:cdsdisk disk_0 prod_SC32 online nohotuse
sun6130_0_21 auto:cdsdisk disk_1 prod_SC32 online nohotuse
The name of the VxVM device sun6130_0_18 is changed to sun6130_0_20.
RESOLUTION:The code that updates the VxVM device records is rectified.

* INCIDENT NO:2364294 TRACKING ID:2364253
SYMPTOM:In case of Space Optimized snapshots at the secondary site, VVR leaks kernel memory.
DESCRIPTION:In case of Space Optimized snapshots at the secondary site, VVR proactively starts the copy-on-write on the snapshot volume. The I/O buffer allocated for this proactive copy-on-write was not freed even after the I/Os completed, which led to the memory leak.
RESOLUTION:After the proactive copy-on-write is complete, the memory allocated for the I/O buffers is released.

* INCIDENT NO:2366071 TRACKING ID:2366066
SYMPTOM:The VxVM (Veritas Volume Manager) vxstat command displays absurd statistics for READ & WRITE operations on VxVM objects. The absurd statistics are near the maximum value of a 32-bit unsigned integer. For example:
# vxstat -g -i
                 OPERATIONS              BLOCKS           AVG TIME(ms)
TYP NAME     READ       WRITE       READ       WRITE       READ  WRITE
vol          10         303         112        2045        6.15  14.43
+ 60 seconds
vol          2          67          32         476         6.00  14.28
+ 60*2 seconds
vol          4294967288 4294966980  4294967199 4294965129  0.00  0.00
DESCRIPTION:vxio, a VxVM driver, uses a 32-bit unsigned integer variable to keep track of the number of READ & WRITE blocks on VxVM objects. Whenever the 32-bit unsigned integer overflows, vxstat displays the absurd statistics shown in the SYMPTOM section above.
RESOLUTION:Both the vxio driver and the vxstat command have been modified to accommodate a larger number of READ & WRITE blocks on VxVM objects.
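
The "absurd" numbers in the vxstat output above are what a cumulative block counter looks like once it has wrapped past 2^32. A small illustrative sketch of the wrap and of the effect of widening the counter to 64 bits (the variables here are not the actual vxio counters):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Count 512-byte blocks for roughly 2.2 TB of I/O issued in 1 MB chunks. */
    const uint64_t chunks = 2200ULL * 1024;        /* number of 1 MB chunks   */
    const uint64_t blocks_per_chunk = 2048;        /* 1 MB / 512 bytes        */

    uint32_t total32 = 0;                          /* like the old counter    */
    uint64_t total64 = 0;                          /* like the widened one    */

    for (uint64_t i = 0; i < chunks; i++) {
        total32 += (uint32_t)blocks_per_chunk;     /* silently wraps at 2^32  */
        total64 += blocks_per_chunk;
    }

    printf("32-bit block counter: %" PRIu32 " (wrapped, misleading)\n", total32);
    printf("64-bit block counter: %" PRIu64 " (correct)\n", total64);
    return 0;
}
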
PATCH ID:142630-10

* INCIDENT NO:2256685 TRACKING ID:2080730
SYMPTOM:On Linux, exclusion of devices using the "vxdmpadm exclude" CLI is not persistent across reboots.
DESCRIPTION:On Linux, the names of OS devices (/dev/sd*) are not persistent. The "vxdmpadm exclude" CLI uses the OS device names to keep track of devices to be excluded by VxVM/DMP. As a result, on reboot, if the OS device names change, then the devices which are intended to be excluded will be included again.
RESOLUTION:The resolution is to use persistent physical path names to keep track of the devices that have been excluded.

* INCIDENT NO:2256686 TRACKING ID:2152830
SYMPTOM:Sometimes storage administrators create multiple copies/clones of the same device. Diskgroup import fails with a non-descriptive error message when multiple copies (clones) of the same device exist and the original device(s) are either offline or not available.
# vxdg import mydg
VxVM vxdg ERROR V-5-1-10978 Disk group mydg: import failed: No valid disk found containing disk group
DESCRIPTION:If the original devices are offline or unavailable, vxdg import picks up cloned disks for import. DG import fails by design unless the clones are tagged and the tag is specified during DG import. While the import failure is expected, the error message is non-descriptive and does not provide any corrective action to be taken by the user.
RESOLUTION:A fix has been added to give a correct error message when duplicate clones exist during import. Also, details of the duplicate clones are reported in the syslog. Example:
[At CLI level]
# vxdg import testdg
VxVM vxdg ERROR V-5-1-10978 Disk group testdg: import failed: DG import duplcate clone detected
[In syslog]
vxvm:vxconfigd: warning V-5-1-0 Disk Group import failed: Duplicate clone disks are detected, please follow the vxdg (1M) man page to import disk group with duplicate clone disks. Duplicate clone disks are: c2t20210002AC00065Bd0s2 : c2t50060E800563D204d1s2 c2t50060E800563D204d0s2 : c2t50060E800563D204d1s2

* INCIDENT NO:2256688 TRACKING ID:2202710
SYMPTOM:Transactions on an Rlink are not allowed during SRL to DCM flush.
DESCRIPTION:The present implementation does not allow an rlink transaction to go through if an SRL to DCM flush is in progress. When the SRL overflows, VVR starts reading from the SRL and marks the dirty regions in the corresponding DCMs of the data volumes; this is called SRL to DCM flush. During the SRL to DCM flush, transactions on the rlink are not allowed. The time to complete the SRL flush depends on the SRL size and can range from minutes to many hours. If the user initiates any transaction on the rlink, it will hang until the SRL flush completes.
RESOLUTION:The code behavior is changed to allow rlink transactions during the SRL flush. The fix stops the SRL flush to let the transaction go ahead, and restarts the flush after the transaction completes.

* INCIDENT NO:2256689 TRACKING ID:2233889
SYMPTOM:Volume recovery happens in a serial fashion when any of the volumes has a log volume attached to it.
DESCRIPTION:When recovery is initiated on a disk group, vxrecover creates lists of each type of volume, such as cache volumes, data volumes, log volumes etc. The log volumes are recovered in a serial fashion by design. Due to a bug, the data volumes were added to the log volume list if a log volume existed. Hence even the data volumes were recovered in a serial fashion if any of the volumes had a log volume attached.
RESOLUTION:The code was fixed such that the data volume list, cache volume list and log volume list are maintained separately and the data volumes are not added to the log volume list. The recovery for the volumes in each list is done in parallel.
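
The fix for incident 2256689 keeps log volumes on their own serially-recovered list while data (and cache) volumes are recovered from separate lists in parallel. A rough user-space sketch with POSIX threads; recover_volume() and the volume lists are hypothetical, not vxrecover internals.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical: one recovery work item per volume. */
static void recover_volume(const char *vol) { printf("recovering %s\n", vol); }

static void *recover_thread(void *arg)
{
    recover_volume((char *)arg);
    return NULL;
}

int main(void)
{
    static char *log_vols[]  = { "vol01_log", "vol02_log" };  /* serial list   */
    static char *data_vols[] = { "vol01", "vol02", "vol03" }; /* parallel list */
    pthread_t tid[3];

    /* Log volumes are recovered one after another, by design. */
    for (int i = 0; i < 2; i++)
        recover_volume(log_vols[i]);

    /* Data volumes sit on their own list and are recovered concurrently; the
     * bug was that they landed on the log list and went serial as well. */
    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, recover_thread, data_vols[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
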
--------------------------------------------------------------------------------
* INCIDENT NO:2256690 TRACKING ID:2226304
SYMPTOM:On the Solaris 9 platform, newfs(1M)/mkfs_ufs(1M) cannot create a UFS file system on a >1 Terabyte (TB) VxVM volume and displays the following error:
# newfs /dev/vx/rdsk//
newfs: construct a new file system /dev/vx/rdsk//: (y/n)? y
Can not determine partition size: Inappropriate ioctl for device
# prtvtoc /dev/vx/rdsk//
prtvtoc: /dev/vx/rdsk//: Unknown problem reading VTOC
DESCRIPTION:newfs(1M)/mkfs_ufs(1M) invokes the DKIOCGETEFI ioctl. During the enhancement of EFI support on Solaris 10 in 5.0MP3RP3 or later, the DKIOCGETEFI ioctl functionality was not implemented on Solaris 9 because of the following limitations: 1. The EFI feature was not present in Solaris 9 FCS; it was introduced in Solaris 9 U3 (4/03), which includes 114127-03 (libefi) and 114129-02 (libuuid and efi/uuid headers). 2. During the enhancement of EFI support on Solaris 10, for Solaris 9 the DKIOCGVTOC ioctl was only supported on a volume <= 1TB, since the VTOC specification was defined only for <= 1 TB LUNs/volumes. If the size of the volume is > 1 TB, the DKIOCGVTOC ioctl would return an inaccurate VTOC structure due to value overflow.
RESOLUTION:The resolution is to enhance the VxVM code to handle the DKIOCGETEFI ioctl correctly on VxVM volumes on the Solaris 9 platform. When newfs(1M)/mkfs_ufs(1M) invokes the DKIOCGETEFI ioctl on a VxVM volume device, VxVM returns the relevant EFI label information so that the UFS utilities can determine the volume size correctly.

* INCIDENT NO:2256691 TRACKING ID:2197254
SYMPTOM:vxassist, the VxVM volume creation utility, does not function as expected when creating a volume with 'logtype=none'.
DESCRIPTION:While creating volumes on thinrclm disks, a Data Change Object (DCO) version 20 log is attached to every volume by default. If the user does not want this default behavior, then the 'logtype=none' option can be specified as a parameter to the vxassist command. But with VxVM on HP-UX 11.31, this option does not work and the DCO version 20 log is created by default. The reason for this inconsistency is that when the 'logtype=none' option is specified, the utility sets a flag to prevent creation of the log. However, VxVM was not checking whether the flag is set before creating the DCO log, which led to this issue.
RESOLUTION:This is a logical issue which is addressed by a code fix. The solution is to check for the corresponding flag of 'logtype=none' before creating the DCO version 20 log by default.

* INCIDENT NO:2256692 TRACKING ID:2240056
SYMPTOM:'vxdg move/split/join' may fail during high I/O load.
DESCRIPTION:During heavy I/O load, a 'dg move' transaction may fail because of an open/close assertion, and a retry will be done. As the retry limit is set to 30, 'dg move' fails if the retry count hits the limit.
RESOLUTION:The default transaction retry is changed to unlimited, and a new option is introduced for 'vxdg move/split/join' to set the transaction retry limit, as follows:
vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] move src_diskgroup dst_diskgroup objects ...
vxdg [-f] [-o verify|override] [-o expand] [-o transretry=retrylimit] split src_diskgroup dst_diskgroup objects ...
vxdg [-f] [-o verify|override] [-o transretry=retrylimit] join src_diskgroup dst_diskgroup

* INCIDENT NO:2256722 TRACKING ID:2215256
SYMPTOM:Volume Manager is unable to recognize the devices connected through an F5100 HBA.
DESCRIPTION:During device discovery, Volume Manager does not scan the LUNs that are connected through a SAS HBA (F5100 is a new SAS HBA). So commands like 'vxdisk list' do not even show the LUNs that are connected through the F5100 HBA.
RESOLUTION:Modified the device discovery code in Volume Manager to include the paths/LUNs that are connected through a SAS HBA.

* INCIDENT NO:2257684 TRACKING ID:2245121
SYMPTOM:Rlinks do not connect for NAT (Network Address Translation) configurations.
DESCRIPTION:When VVR (Veritas Volume Replicator) is replicating over a Network Address Translation (NAT) based firewall, rlinks fail to connect, resulting in replication failure. The rlinks do not connect because there is a failure during the exchange of VVR heartbeats. For NAT-based firewalls, the conversion of a mapped IPv6 (Internet Protocol Version 6) address to an IPv4 (Internet Protocol Version 4) address is not handled, which caused the VVR heartbeat exchange to use an incorrect IP address, leading to VVR heartbeat failure.
RESOLUTION:Code fixes have been made to appropriately handle the exchange of VVR heartbeats under a NAT-based firewall.

* INCIDENT NO:2268733 TRACKING ID:2248730
SYMPTOM:The command hangs if "vxdg import" is called from a script with STDERR redirected.
DESCRIPTION:If a script invokes "vxdg import" with STDERR redirected, the script does not finish until the DG import and recovery are finished. The pipe between the script and vxrecover is not closed properly, which keeps the calling script waiting for vxrecover to complete.
RESOLUTION:STDERR is closed in vxrecover and the output is redirected to /dev/console.

* INCIDENT NO:2276324 TRACKING ID:2270880
SYMPTOM:On Solaris 10 (SPARC only), if the size of an EFI (Extensible Firmware Interface) labeled disk is greater than 2TB, the disk capacity will be truncated to 2TB when it is initialized with CDS (Cross-platform Data Sharing) under VxVM (Veritas Volume Manager). For example, the sizes shown as the sector count by prtvtoc(1M) and the public region size by vxdisk(1M) will be truncated to approximately 2TB.
# prtvtoc /dev/rdsk/c0t500601604BA07D17d13
*                     First      Sector     Last
* Partition Tag Flags Sector     Count      Sector     Mount Directory
  2         15  00    48         4294967215 4294967262
# vxdisk list c0t500601604BA07D17d13 | grep public
public: slice=2 offset=65744 len=4294901456 disk_offset=48
DESCRIPTION:From VxVM 5.1 SP1 and onwards, the CDS format is enhanced to support disks of greater than 1TB. VxVM uses the EFI layout to support CDS functionality for disks greater than 1TB; however on Solaris 10 (SPARC only) a problem is seen where the disk capacity is truncated to 2TB if the size of the EFI labeled disk is greater than 2TB. This is because the library /usr/lib/libvxscsi.so in the Solaris 10 (SPARC only) package does not contain the required enhancement on Solaris 10 to support the CDS format for disks greater than 2TB.
RESOLUTION:The VxVM package for Solaris has been changed to contain all the libvxscsi.so binaries built for the respective Solaris platforms (versions), for example libvxscsi.so.SunOS_5.9 and libvxscsi.so.SunOS_5.10. From this fix onwards, the appropriate platform's build of the binary will be installed as /usr/lib/libvxscsi.so during the installation of the VxVM package.

INCIDENTS FROM OLD PATCHES:
---------------------------
NONE