README VERSION       : 1.1
README CREATION DATE : 2012-03-06
PATCH-ID             : PVKL_03938
PATCH NAME           : VRTSvxvm 6.0RP1
BASE PACKAGE NAME    : VRTSvxvm
BASE PACKAGE VERSION : 6.0.0.0
OBSOLETE PATCHES     : NONE
SUPERSEDED PATCHES   : NONE
REQUIRED PATCHES     : PVCO_03937
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV       : hpux1131 (P-PLATFORM, A-ARCHITECTURE, D-DISTRIBUTION, V-VERSION)
PATCH CATEGORY       : CORRUPTION, HANG, MEMORYLEAK, PANIC
REBOOT REQUIRED      : YES

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Please refer to the release notes for installation instructions.

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
Please refer to the release notes for uninstallation instructions.

SPECIAL INSTRUCTIONS:
---------------------
NONE

SUMMARY OF FIXED ISSUES:
------------------------
2598525 CVR: memory leaks reported
2605706 Write fails on a volume on the slave node after a join, where the node earlier had disks in "lfailed" state
2615288 Site consistency: both sites become detached after a data/DCO plex failure at each site, leading to a cluster-wide I/O outage
2624574 VVR logowner: local I/O starved by heavy I/O load from the log client
2625762 Secondary master panics at volkiofree
2630074 Longevity/SFRAC: 'vxdg destroy' hangs (for a shared disk group), then all VxVM commands hang on the master
2637183 Intermittent data corruption after a vxassist move
2643134 Failure while validating the mirror name interface for a linked mirror volume
2643138 I/O hang due to SRL overflow & CVM reconfig
2643139 I/O hang after SRL overflow
2643155 VVR: Primary master panicked in rv_ibc_freeze_timeout
2643156 CVM: disk group activation can hang due to a bug in VxVM kernel code
2643159 UMI documentation request: V-5-0-1279
2682534 Starting a 32TB RAID5 volume fails with V-5-1-10128 Unexpected kernel error in configuration update

KNOWN ISSUES:
-------------
Please refer to the release notes.

FIXED INCIDENTS:
----------------

PATCH ID: PVKL_03938

* INCIDENT NO:2598525    TRACKING ID:2526498

SYMPTOM:
Memory leak after running the automated VVR test case.

DESCRIPTION:
After an IBC request is serviced, its update is placed on the free queue; the
update is queued there only when its reference count is not zero. In some cases
the IBC receive and IBC send paths race with each other, and during this window
the reference count may not be equal to 0, so the update is queued on the free
queue. The free queue is normally freed by the garbage collector, or cleaned up
when the RVG is removed. However, in one code path the free queue is set to
NULL without freeing the queued updates, which leaks the memory.

RESOLUTION:
The fix is to not RESET the free queue, so that either the garbage collector or
the RVG delete path frees the queued updates.

* INCIDENT NO:2605706    TRACKING ID:2590183

SYMPTOM:
I/Os on newly enabled paths can fail with a reservation conflict error.

DESCRIPTION:
PGR registration is not done while enabling a path, so I/Os on that path can
fail with a reservation conflict.

RESOLUTION:
Perform the PGR registration on newly enabled paths.

* INCIDENT NO:2615288    TRACKING ID:2527289

SYMPTOM:
In a Campus Cluster setup, a storage fault may lead to a DETACH of all the
configured sites. This also results in I/O failure on all the nodes in the
Campus Cluster.

DESCRIPTION:
A site detach is performed on a site-consistent disk group when any volume in
the disk group loses all the mirrors of a site. While processing the DETACH of
the last mirror in a site, we identify that it is the last mirror and DETACH
the site, which in turn detaches all the objects of that site. In a Campus
Cluster setup a DCO volume is attached to every data volume created on a
site-consistent disk group. The general configuration is to have one DCO mirror
on each site, so the loss of a single mirror of the DCO volume on any node
results in the detach of that site. In a two-site configuration this particular
scenario results in both DCO mirrors being lost simultaneously. While the site
detach for the first mirror is being processed, a DETACH of the second mirror
is also signalled, which ends up DETACHING the second site too. This is not hit
in other tests because a check already exists to make sure that the last mirror
of a volume is not DETACHED; that check is subverted in this particular case by
the type of storage failure.

RESOLUTION:
Before triggering the site detach, an explicit check is made to verify that we
are not trying to DETACH the last ACTIVE site.
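The following is a minimal sketch in C, not the actual VxVM kernel code, of the
kind of last-ACTIVE-site guard this resolution describes. All names here
(struct dg, struct site, SITE_ACTIVE, site_detach) are hypothetical.

/*
 * Illustrative sketch only -- not the actual VxVM kernel code.
 */
#include <stddef.h>

enum site_state { SITE_ACTIVE, SITE_DETACHED };

struct site {
    enum site_state state;
    struct site    *next;       /* next site configured in the disk group */
};

struct dg {
    struct site *sites;         /* list of configured sites */
};

/* Count the sites that are still ACTIVE in the disk group. */
static int dg_active_site_count(const struct dg *dg)
{
    int count = 0;
    for (const struct site *s = dg->sites; s != NULL; s = s->next)
        if (s->state == SITE_ACTIVE)
            count++;
    return count;
}

/*
 * The fix: refuse to detach a site when it is the last ACTIVE one, so
 * two racing detach requests cannot take down both sites.
 */
static int site_detach(struct dg *dg, struct site *site)
{
    if (site->state == SITE_ACTIVE && dg_active_site_count(dg) <= 1)
        return -1;              /* would detach the last ACTIVE site */
    site->state = SITE_DETACHED;
    return 0;
}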
* INCIDENT NO:2624574    TRACKING ID:2608849

SYMPTOM:
1. Under a heavy I/O load on the log client node, write I/Os on the VVR Primary
   logowner take a very long time to complete.
2. I/Os on the "master" and "slave" nodes hang when the "master" role is
   switched multiple times using the "vxclustadm setmaster" command.

DESCRIPTION:
1. VVR does not allow more than 2048 I/Os outstanding on the SRL volume; any
   I/Os beyond this threshold are throttled. The throttled I/Os are restarted
   after every SRL header flush operation. While restarting the throttled I/Os,
   I/Os coming from the log client are given higher priority, causing the
   logowner's local I/Os to starve.
2. In the CVM reconfiguration code path the RLINK ports are not cleanly deleted
   on the old logowner. This prevents the RLINKs from connecting, leading to
   both a replication hang and an I/O hang.

RESOLUTION:
The algorithm that restarts the throttled I/Os is modified to give both local
and remote I/Os a fair chance to proceed. Additionally, code changes are made
in the CVM reconfiguration code path to delete the RLINK ports cleanly before
switching the master role.
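The following is a minimal sketch in C, not the actual VVR code, of one way to
restart throttled I/Os fairly: alternate between the local (logowner) queue and
the remote (log client) queue instead of always draining the remote queue
first. All types and names here are hypothetical.

/*
 * Illustrative sketch only -- not the actual VVR code.
 */
#include <stddef.h>

struct vio { struct vio *next; };

struct vio_queue {
    struct vio *head;           /* singly linked queue of throttled I/Os */
};

static struct vio *vioq_pop(struct vio_queue *q)
{
    struct vio *io = q->head;
    if (io != NULL)
        q->head = io->next;
    return io;
}

extern void vio_restart(struct vio *io);   /* re-issue a throttled I/O */

/* Restart throttled I/Os, alternating between the two queues so that
 * neither local nor remote I/O can starve the other. */
static void restart_throttled_fair(struct vio_queue *local,
                                   struct vio_queue *remote, int budget)
{
    while (budget-- > 0) {
        struct vio *io = vioq_pop(local);    /* local gets a turn first */
        if (io != NULL)
            vio_restart(io);
        io = vioq_pop(remote);               /* then remote gets a turn */
        if (io != NULL)
            vio_restart(io);
        if (local->head == NULL && remote->head == NULL)
            break;                           /* both queues drained */
    }
}

Round-robin draining is one simple fairness policy; the point is only that the
restart loop no longer services the remote queue exclusively.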
* INCIDENT NO:2625762    TRACKING ID:2607519

SYMPTOM:
During the initial sync from the VVR primary site to the VVR secondary site, if
there is a cluster reconfiguration, the CVM master on the VVR secondary site
may panic with the following stack trace:

volkiofree+000018
vol_dcm_read_cleanup+000150
vol_rvdcm_read_done+000834
volkcontext_process+0000E4
voldiskiodone+000B50
volsp_iodone_common+0001B8
volsp_iodone+00002C
internal_iodone_offl+000170
iodone_offl+000078
i_softmod+00027C
flih_util+000250

DESCRIPTION:
The panic is caused by a cluster-reconfiguration condition on the secondary
site that is not correctly handled: a NULL pointer variable is de-referenced.

RESOLUTION:
The fix improves the handling of reconfiguration conditions during the initial
sync.

* INCIDENT NO:2630074    TRACKING ID:2530698

SYMPTOM:
The vxdg destroy command can hang. vxconfigd hangs with the following stack
trace:

Stack trace for process "vxconfigd" at 0xe000000161919300 (pid 392)
Thread at 0xe000000161960000 (tid 799)
 #0 slpq_swtch_core+0x530 ()
 #1 real_sleep+0x360 ()
 #2 sleep_one+0x90 ()
 #3 vol_kmsg_send_wait+0x4c0 ()
 #4 volcvmdg_delete_group+0x270 ()
 #5 vol_delete_group+0x220 ()
 #6 volconfig_ioctl+0x200 ()
 #7 volsioctl_real+0x7d0 ()
 #8 volsioctl+0x60 ()
 #9 vols_ioctl+0x80 ()
 #10 spec_ioctl+0xf0 ()
 #11 vno_ioctl+0x350 ()
 #12 ioctl+0x410 ()
 #13 syscall+0x590 ()

The kmsg receiver thread hangs with the following stack trace on the master
node (note that on HP-UX the process name is "vxiod" even for the kmsg receiver
thread):

Stack trace for process "vxiod" at 0xe0000001510a9300 (pid 42)
Thread at 0xe0000001af147380 (tid 19002)
 #0 slpq_swtch_core+0x530 ()
 #1 inline real_sleep+0x360 ()
 #2 sleep+0x90 ()
 #3 vxvm_delay+0xe0 ()
 #4 voldg_delete_finish+0x160 ()
 #5 volcvmdg_delete_msg_receive+0xcf0 ()
 #6 vol_kmsg_dg_request+0x2f0 ()
 #7 vol_kmsg_request_receive+0x8c0 ()
 #8 vol_kmsg_ring_broadcast_receive+0xcf0 ()
 #9 vol_kmsg_receiver+0x1130 ()

DESCRIPTION:
When the vxdg destroy command is issued while internal I/Os are in progress (a
plex attach, adding a mirror, and so on), the master node can hang. Due to a
bug in the CVM code, the master node keeps waiting for a glock to be granted by
a slave node that has already destroyed the disk group; in this case the slave
responds with an error saying the disk group no longer exists. This also causes
vxconfigd to hang in the kernel. Once this issue is hit, most VxVM commands
hang on the master node, and the only way to recover is to reboot the system.

RESOLUTION:
Code changes are made in CVM to handle error responses from slaves while
requesting glocks during internal I/Os. When the master receives this error
from a slave, the new code treats it as if the glock had been granted. The disk
group destroy processing is also moved from the kmsg receiver thread to the
vxiod threads, to avoid potential deadlocks between the destroy operation and
any glock grant operation on the master node.

* INCIDENT NO:2637183    TRACKING ID:2647795

SYMPTOM:
With the SmartMove feature enabled, data corruption is seen on the file system
while moving a subdisk, because the subdisk contents are not copied properly.

DESCRIPTION:
With the FS SmartMove feature enabled, the subdisk move operation queries VxFS
for the status of each region before deciding whether to synchronize it. When
fetching the information about multiple such regions in one ioctl to VxFS, if
the start offset is not aligned to the region size, one I/O can span two
regions. VxVM did not properly check the status of such regions and skipped the
synchronization of the region, causing data corruption.

RESOLUTION:
Code changes are done to properly check the region state even if the region
spans two bits in the FSMAP.

* INCIDENT NO:2643134    TRACKING ID:2348180

SYMPTOM:
The mirror name is truncated when retrieving the name of the mirror for a given
volume and mirror number.

DESCRIPTION:
VxVM supports volume names of up to 32 characters, but when building the name
of the mirror for a given volume and mirror number, a buffer-size
miscalculation causes the mirror name to be truncated.

RESOLUTION:
The proper, complete mirror name is now returned.
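The following is a minimal sketch in C, not the actual VxVM code, of the kind
of buffer-size miscalculation that truncates a derived name, and the fix of
sizing the buffer for the worst case. The function names, the "-NN" suffix
format, and the buffer sizes are hypothetical; only the 32-character volume
name limit comes from the incident text.

/*
 * Illustrative sketch only -- not the actual VxVM code.
 */
#include <stdio.h>

#define VOL_NAME_LEN 32                 /* maximum volume name length */

/* Buggy version: the intermediate buffer only fits the volume name itself,
 * so snprintf() silently drops the "-NN" mirror suffix for long names. */
void mirror_name_buggy(char *out, size_t outlen,
                       const char *vol, int mirror_no)
{
    char buf[VOL_NAME_LEN + 1];         /* no room for the suffix */
    snprintf(buf, sizeof(buf), "%s-%02d", vol, mirror_no);
    snprintf(out, outlen, "%s", buf);
}

/* Fixed version: size the intermediate buffer for the volume name plus the
 * longest possible "-NN" suffix and the terminating NUL. */
void mirror_name_fixed(char *out, size_t outlen,
                       const char *vol, int mirror_no)
{
    char buf[VOL_NAME_LEN + 16];        /* name + "-" + number + NUL */
    snprintf(buf, sizeof(buf), "%s-%02d", vol, mirror_no);
    snprintf(out, outlen, "%s", buf);
}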
* INCIDENT NO:2643138    TRACKING ID:2620555

SYMPTOM:
During a CVM reconfiguration, the RVG waits for the I/O count to go to 0 before
starting the RVG recovery and completing the reconfiguration.

DESCRIPTION:
In CVR, a node leave triggers a reconfiguration. The reconfiguration code path
initiates the RVG recovery of all the shared disk groups. The recovery is
needed to flush the SRL (shared by all the nodes) to the data volumes, to avoid
missing any writes made to the data volumes by the leaving node. This recovery
involves reading the data from the SRL and copying it to the data volumes. The
flush can take a long time, depending on the disk response time and the size of
the SRL region that must be flushed. During the recovery, a flag is set on the
RVG to block any new I/O. In this particular case, the recovery took 30
minutes. During that time another node left, which triggered a second
reconfiguration. Before triggering another recovery, the second reconfiguration
set the RECOVER flag on the RVG and waited for the I/O count to go to zero. The
first RVG recovery cleared the RECOVER flag once it completed the SRL flush
after 30 minutes. Since this is the same flag set by the second
reconfiguration, and I/O resumed as soon as the RECOVER flag was unset, the
second reconfiguration kept waiting indefinitely for the I/O count to go to
zero and was stuck forever.

RESOLUTION:
If the RECOVER flag is already set, the reconfiguration code path does not keep
waiting for the I/O count to become zero. There is no need for another recovery
when the second reconfiguration starts before the first recovery completes.

* INCIDENT NO:2643139    TRACKING ID:2620556

SYMPTOM:
I/O hangs on the primary after an SRL overflow, during the SRL flush and an
RLINK connect/disconnect.

DESCRIPTION:
As part of an RLINK connect or disconnect, the RVG is serialized to complete
the connection or disconnection. I/O throttling is normal during the SRL flush,
due to memory pool pressure or reaching the maximum throttle limit. During the
serialization, I/O is throttled to complete the DCM flush, and the remote I/Os
are kept on the throttle queue while throttling is in effect. Because of the
I/O serialization, the throttled I/Os never get flushed, and so those I/Os
never complete.

RESOLUTION:
If the serialization is successful, the throttle queue is flushed immediately.
This makes sure that the remote I/Os are retried in the serialization code
path.

* INCIDENT NO:2643155    TRACKING ID:2607293

SYMPTOM:
The VVR Primary panics while deleting an RVG, with the following stack trace:

panic_save_regs_switchstack+0x110
panic
bad_news
bubbleup+0x880
rv_ibc_freeze_timeout
invoke_callouts_for_self
soft_intr_handler
external_interrupt
bubbleup+0x880

DESCRIPTION:
The VVR Primary is frozen to send an IBC for a given timeout value. If the RVG
is deleted before the unfreeze is done or the timeout expires, a panic can
occur. Due to a bug in the code, the freeze timer is not cleared during RVG
deletion. When the freeze timer expires, its callback routine accesses the RVG
information; if the RVG has been deleted, accessing it causes the panic.

RESOLUTION:
To fix this issue, the IBC freeze timer is checked and unset while deleting the
RVG.

* INCIDENT NO:2643156    TRACKING ID:2610877

SYMPTOM:
"vxdg -g <dgname> set activation=<mode>" may hang, due to a bug in the
activation code path, when a memory allocation fails in the kernel.

DESCRIPTION:
The vxdg activation command is used to set read-write permissions at the disk
group level on each node. If a memory allocation fails in the VxVM kernel path
while this command is running, a bug in this code path can make the command
hang. If this command hangs, it also ends up blocking most VxVM commands.

RESOLUTION:
Code changes are made in the VxVM kernel code path to handle memory allocation
failure correctly, retrying the allocation until it succeeds.
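The following is a minimal sketch in C, not the actual VxVM kernel code, of the
retry-until-success allocation pattern this resolution describes. The
allocator alloc_nowait() and the delay routine short_delay() are hypothetical
stand-ins for whatever non-blocking allocation and back-off primitives the
kernel provides.

/*
 * Illustrative sketch only -- not the actual VxVM kernel code.
 */
#include <stddef.h>

extern void *alloc_nowait(size_t size);    /* returns NULL on failure */
extern void  short_delay(void);            /* back off before retrying */

/* Keep retrying the allocation instead of mishandling the failure (which
 * is what hung the activation request) when memory is temporarily short. */
static void *alloc_retry(size_t size)
{
    void *p;

    while ((p = alloc_nowait(size)) == NULL)
        short_delay();                     /* give reclaim a chance to run */
    return p;
}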
* INCIDENT NO:2643159    TRACKING ID:2633936

SYMPTOM:
When a read error occurs in a mirrored volume, the following message is
displayed on the console:

V-5-0-1279 rderr3_childdone: read error on object of mirror in volume
(start length ) corrected

DESCRIPTION:
When a read error occurs on one plex of a mirrored volume, data from another
plex is written to the failed plex. During this read-writeback operation, some
unwanted messages are displayed on the console.

RESOLUTION:
Code changes are done so that these unwanted messages are no longer displayed
on the user console.

* INCIDENT NO:2682534    TRACKING ID:2657797

SYMPTOM:
Starting a RAID5 volume fails when one of the subdisks in a RAID5 column starts
at an offset greater than 1TB. Example:

# vxvol -f -g dg1 -o delayrecover start vol1
VxVM vxvol ERROR V-5-1-10128 Unexpected kernel error in configuration update

DESCRIPTION:
VxVM uses an integer variable to store the starting block offset of a subdisk
in a RAID5 column. This overflows when a subdisk is located at an offset
greater than 2147483647 blocks (1TB) and results in a failure to start the
volume. Refer to "sdaj" in the following example:

v  RaidVol         -          DETACHED NEEDSYNC 64459747584 RAID  -     raid5
pl RaidVol-01      RaidVol    ENABLED  ACTIVE   64459747584 RAID  4/128 RW
[..]
SD NAME            PLEX       DISK         DISKOFFS LENGTH     [COL/]OFF    DEVICE MODE
sd DiskGroup101-01 RaidVol-01 DiskGroup101 0        1953325744 0/0          sdaa   ENA
sd DiskGroup106-01 RaidVol-01 DiskGroup106 0        1953325744 0/1953325744 sdaf   ENA
sd DiskGroup110-01 RaidVol-01 DiskGroup110 0        1953325744 0/3906651488 sdaj   ENA

RESOLUTION:
The VxVM code is modified to handle integer overflow conditions for RAID5
volumes (see the sketch after the next section).

INCIDENTS FROM OLD PATCHES:
---------------------------
NONE
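The following is a minimal sketch in C, not the actual VxVM fix, illustrating
the class of bug behind incident 2682534 above: a 32-bit signed integer cannot
hold a block offset above 2147483647 (2^31 - 1) blocks, so the column offset
3906651488 of subdisk "sdaj" overflows, while a 64-bit type holds it without
loss. The variable names are hypothetical.

/*
 * Illustrative sketch only -- not the actual VxVM fix.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Column offset of subdisk "sdaj" from the example above, in blocks. */
    uint64_t col_offset = 3906651488ULL;

    int32_t bad_offset  = (int32_t)col_offset;  /* overflows: wraps negative
                                                   on typical platforms */
    int64_t good_offset = (int64_t)col_offset;  /* fix: 64-bit offset field */

    printf("32-bit offset: %d\n",   (int)bad_offset);
    printf("64-bit offset: %lld\n", (long long)good_offset);
    return 0;
}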