README VERSION       : 1.1
README CREATION DATE : 2012-03-06
PATCH-ID             : PVKL_03938
PATCH NAME           : VRTSvxvm 6.0RP1
BASE PACKAGE NAME    : VRTSvxvm
BASE PACKAGE VERSION : 6.0.0.0
OBSOLETE PATCHES     : NONE
SUPERSEDED PATCHES   : NONE
REQUIRED PATCHES     : PVCO_03937
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV       : hpux1131 (P-PLATFORM, A-ARCHITECTURE, D-DISTRIBUTION, V-VERSION)
PATCH CATEGORY       : CORRUPTION, HANG, MEMORYLEAK, PANIC
REBOOT REQUIRED      : YES

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Please refer to the release notes for installation instructions.

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
Please refer to the release notes for uninstallation instructions.

SPECIAL INSTRUCTIONS:
---------------------
NONE

SUMMARY OF FIXED ISSUES:
------------------------
2598525 CVR: memory leaks reported
2605706 Write fails on a volume on the slave node after a join, where the node earlier had disks in "lfailed" state
2615288 Site consistency: both sites become detached after a data/DCO plex failure at each site, leading to a cluster-wide I/O outage
2624574 VVR logowner: local I/O starved by heavy I/O load from the log client
2625762 Secondary master panics at volkiofree
2630074 Longevity/SFRAC: 'vxdg destroy' hangs (for a shared disk group), then all VxVM commands hang on the master
2637183 Intermittent data corruption after a vxassist move
2643134 Failure while validating the mirror name interface for a linked mirror volume
2643138 I/O hang due to SRL overflow & CVM reconfig
2643139 I/O hang after SRL overflow
2643155 VVR: Primary master panicked in rv_ibc_freeze_timeout
2643156 CVM: disk group activation can hang due to a bug in VxVM kernel code
2643159 UMI documentation request: V-5-0-1279
2682534 Starting a 32TB RAID5 volume fails with V-5-1-10128 Unexpected kernel error in configuration update

KNOWN ISSUES:
-------------
Please refer to the release notes.

FIXED INCIDENTS:
----------------

PATCH ID: PVKL_03938

* INCIDENT NO:2598525    TRACKING ID:2526498

SYMPTOM:
Memory leak after running the automated VVR test case.

DESCRIPTION:
After an IBC request is serviced, its update is placed on the free queue; the
update is queued there only when its reference count is not zero. In some cases
the IBC receive and IBC send paths race with each other, and during this window
the reference count may not be equal to 0, so the update is queued on the free
queue. The free queue is normally freed by the garbage collector, or cleaned up
when the RVG is removed. However, in one code path the free queue is set to
NULL without freeing the queued updates, which leaks the memory.

RESOLUTION:
The fix is to not RESET the free queue, so that either the garbage collector or
the RVG delete path frees the queued updates.

* INCIDENT NO:2605706    TRACKING ID:2590183

SYMPTOM:
I/Os on newly enabled paths can fail with a reservation conflict error.

DESCRIPTION:
PGR registration is not done while enabling a path, so I/Os on that path can
fail with a reservation conflict.

RESOLUTION:
Perform the PGR registration on newly enabled paths.

* INCIDENT NO:2615288    TRACKING ID:2527289

SYMPTOM:
In a Campus Cluster setup, a storage fault may lead to a DETACH of all the
configured sites. This also results in I/O failure on all the nodes in the
Campus Cluster.

DESCRIPTION:
A site detach is performed on a site-consistent disk group when any volume in
the disk group loses all the mirrors of a site. While processing the DETACH of
the last mirror in a site, we identify that it is the last mirror and DETACH
the site, which in turn detaches all the objects of that site. In a Campus
Cluster setup a DCO volume is attached to every data volume created on a
site-consistent disk group. The general configuration is to have one DCO mirror
on each site, so the loss of a single mirror of the DCO volume on any node
results in the detach of that site. In a two-site configuration this particular
scenario results in both DCO mirrors being lost simultaneously. While the site
detach for the first mirror is being processed, a DETACH of the second mirror
is also signalled, which ends up DETACHING the second site too. This is not hit
in other tests because a check already exists to make sure that the last mirror
of a volume is not DETACHED; that check is subverted in this particular case by
the type of storage failure.

RESOLUTION:
Before triggering the site detach, an explicit check is made to verify that we
are not trying to DETACH the last ACTIVE site.
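The following is a minimal sketch in C, not the actual VxVM kernel code, of the
kind of last-ACTIVE-site guard this resolution describes. All names here
(struct dg, struct site, SITE_ACTIVE, site_detach) are hypothetical.

/*
 * Illustrative sketch only -- not the actual VxVM kernel code.
 */
#include <stddef.h>

enum site_state { SITE_ACTIVE, SITE_DETACHED };

struct site {
    enum site_state state;
    struct site    *next;       /* next site configured in the disk group */
};

struct dg {
    struct site *sites;         /* list of configured sites */
};

/* Count the sites that are still ACTIVE in the disk group. */
static int dg_active_site_count(const struct dg *dg)
{
    int count = 0;
    for (const struct site *s = dg->sites; s != NULL; s = s->next)
        if (s->state == SITE_ACTIVE)
            count++;
    return count;
}

/*
 * The fix: refuse to detach a site when it is the last ACTIVE one, so
 * two racing detach requests cannot take down both sites.
 */
static int site_detach(struct dg *dg, struct site *site)
{
    if (site->state == SITE_ACTIVE && dg_active_site_count(dg) <= 1)
        return -1;              /* would detach the last ACTIVE site */
    site->state = SITE_DETACHED;
    return 0;
}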
* INCIDENT NO:2624574    TRACKING ID:2608849

SYMPTOM:
1. Under a heavy I/O load on the log client node, write I/Os on the VVR Primary
   logowner take a very long time to complete.
2. I/Os on the "master" and "slave" nodes hang when the "master" role is
   switched multiple times using the "vxclustadm setmaster" command.

DESCRIPTION:
1. VVR does not allow more than 2048 I/Os outstanding on the SRL volume; any
   I/Os beyond this threshold are throttled. The throttled I/Os are restarted
   after every SRL header flush operation. While restarting the throttled I/Os,
   I/Os coming from the log client are given higher priority, causing the
   logowner's local I/Os to starve.
2. In the CVM reconfiguration code path the RLINK ports are not cleanly deleted
   on the old logowner. This prevents the RLINKs from connecting, leading to
   both a replication hang and an I/O hang.

RESOLUTION:
The algorithm that restarts the throttled I/Os is modified to give both local
and remote I/Os a fair chance to proceed. Additionally, code changes are made
in the CVM reconfiguration code path to delete the RLINK ports cleanly before
switching the master role.
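The following is a minimal sketch in C, not the actual VVR code, of one way to
restart throttled I/Os fairly: alternate between the local (logowner) queue and
the remote (log client) queue instead of always draining the remote queue
first. All types and names here are hypothetical.

/*
 * Illustrative sketch only -- not the actual VVR code.
 */
#include <stddef.h>

struct vio { struct vio *next; };

struct vio_queue {
    struct vio *head;           /* singly linked queue of throttled I/Os */
};

static struct vio *vioq_pop(struct vio_queue *q)
{
    struct vio *io = q->head;
    if (io != NULL)
        q->head = io->next;
    return io;
}

extern void vio_restart(struct vio *io);   /* re-issue a throttled I/O */

/* Restart throttled I/Os, alternating between the two queues so that
 * neither local nor remote I/O can starve the other. */
static void restart_throttled_fair(struct vio_queue *local,
                                   struct vio_queue *remote, int budget)
{
    while (budget-- > 0) {
        struct vio *io = vioq_pop(local);    /* local gets a turn first */
        if (io != NULL)
            vio_restart(io);
        io = vioq_pop(remote);               /* then remote gets a turn */
        if (io != NULL)
            vio_restart(io);
        if (local->head == NULL && remote->head == NULL)
            break;                           /* both queues drained */
    }
}

Round-robin draining is one simple fairness policy; the point is only that the
restart loop no longer services the remote queue exclusively.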
* INCIDENT NO:2625762    TRACKING ID:2607519

SYMPTOM:
During the initial sync from the VVR primary site to the VVR secondary site, if
there is a cluster reconfiguration, the CVM master on the VVR secondary site
may panic with the following stack trace:

volkiofree+000018
vol_dcm_read_cleanup+000150
vol_rvdcm_read_done+000834
volkcontext_process+0000E4
voldiskiodone+000B50
volsp_iodone_common+0001B8
volsp_iodone+00002C
internal_iodone_offl+000170
iodone_offl+000078
i_softmod+00027C
flih_util+000250

DESCRIPTION:
The panic is caused by a cluster-reconfiguration condition on the secondary
site that is not correctly handled: a NULL pointer variable is de-referenced.

RESOLUTION:
The fix improves the handling of reconfiguration conditions during the initial
sync.

* INCIDENT NO:2630074    TRACKING ID:2530698

SYMPTOM:
The vxdg destroy command can hang. vxconfigd hangs with the following stack
trace:

Stack trace for process "vxconfigd" at 0xe000000161919300 (pid 392)
Thread at 0xe000000161960000 (tid 799)
 #0 slpq_swtch_core+0x530 ()
 #1 real_sleep+0x360 ()
 #2 sleep_one+0x90 ()
 #3 vol_kmsg_send_wait+0x4c0 ()
 #4 volcvmdg_delete_group+0x270 ()
 #5 vol_delete_group+0x220 ()
 #6 volconfig_ioctl+0x200 ()
 #7 volsioctl_real+0x7d0 ()
 #8 volsioctl+0x60 ()
 #9 vols_ioctl+0x80 ()
 #10 spec_ioctl+0xf0 ()
 #11 vno_ioctl+0x350 ()
 #12 ioctl+0x410 ()
 #13 syscall+0x590 ()

The kmsg receiver thread hangs with the following stack trace on the master
node (note that on HP-UX the process name is "vxiod" even for the kmsg receiver
thread):

Stack trace for process "vxiod" at 0xe0000001510a9300 (pid 42)
Thread at 0xe0000001af147380 (tid 19002)
 #0 slpq_swtch_core+0x530 ()
 #1 inline real_sleep+0x360 ()
 #2 sleep+0x90 ()
 #3 vxvm_delay+0xe0 ()
 #4 voldg_delete_finish+0x160 ()
 #5 volcvmdg_delete_msg_receive+0xcf0 ()
 #6 vol_kmsg_dg_request+0x2f0 ()
 #7 vol_kmsg_request_receive+0x8c0 ()
 #8 vol_kmsg_ring_broadcast_receive+0xcf0 ()
 #9 vol_kmsg_receiver+0x1130 ()

DESCRIPTION:
When the vxdg destroy command is issued while internal I/Os are in progress (a
plex attach, adding a mirror, and so on), the master node can hang. Due to a
bug in the CVM code, the master node keeps waiting for a glock to be granted by
a slave node that has already destroyed the disk group; in this case the slave
responds with an error saying the disk group no longer exists. This also causes
vxconfigd to hang in the kernel. Once this issue is hit, most VxVM commands
hang on the master node, and the only way to recover is to reboot the system.

RESOLUTION:
Code changes are made in CVM to handle error responses from slaves while
requesting glocks during internal I/Os. When the master receives this error
from a slave, the new code treats it as if the glock had been granted. The disk
group destroy processing is also moved from the kmsg receiver thread to the
vxiod threads, to avoid potential deadlocks between the destroy operation and
any glock grant operation on the master node.

* INCIDENT NO:2637183    TRACKING ID:2647795

SYMPTOM:
With the SmartMove feature enabled, data corruption is seen on the file system
while moving a subdisk, because the subdisk contents are not copied properly.

DESCRIPTION:
With the FS SmartMove feature enabled, the subdisk move operation queries VxFS
for the status of each region before deciding whether to synchronize it. When
fetching the information about multiple such regions in one ioctl to VxFS, if
the start offset is not aligned to the region size, one I/O can span two
regions. VxVM did not properly check the status of such regions and skipped the
synchronization of the region, causing data corruption.

RESOLUTION:
Code changes are done to properly check the region state even if the region
spans two bits in the FSMAP.

* INCIDENT NO:2643134    TRACKING ID:2348180

SYMPTOM:
The mirror name is truncated when retrieving the name of the mirror for a given
volume and mirror number.

DESCRIPTION:
VxVM supports volume names of up to 32 characters, but when building the name
of the mirror for a given volume and mirror number, a buffer-size
miscalculation causes the mirror name to be truncated.

RESOLUTION:
The proper, complete mirror name is now returned.
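The following is a minimal sketch in C, not the actual VxVM code, of the kind
of buffer-size miscalculation that truncates a derived name, and the fix of
sizing the buffer for the worst case. The function names, the "-NN" suffix
format, and the buffer sizes are hypothetical; only the 32-character volume
name limit comes from the incident text.

/*
 * Illustrative sketch only -- not the actual VxVM code.
 */
#include <stdio.h>

#define VOL_NAME_LEN 32                 /* maximum volume name length */

/* Buggy version: the intermediate buffer only fits the volume name itself,
 * so snprintf() silently drops the "-NN" mirror suffix for long names. */
void mirror_name_buggy(char *out, size_t outlen,
                       const char *vol, int mirror_no)
{
    char buf[VOL_NAME_LEN + 1];         /* no room for the suffix */
    snprintf(buf, sizeof(buf), "%s-%02d", vol, mirror_no);
    snprintf(out, outlen, "%s", buf);
}

/* Fixed version: size the intermediate buffer for the volume name plus the
 * longest possible "-NN" suffix and the terminating NUL. */
void mirror_name_fixed(char *out, size_t outlen,
                       const char *vol, int mirror_no)
{
    char buf[VOL_NAME_LEN + 16];        /* name + "-" + number + NUL */
    snprintf(buf, sizeof(buf), "%s-%02d", vol, mirror_no);
    snprintf(out, outlen, "%s", buf);
}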
* INCIDENT NO:2643138    TRACKING ID:2620555

SYMPTOM:
During a CVM reconfiguration, the RVG waits for the I/O count to go to 0 before
starting the RVG recovery and completing the reconfiguration.

DESCRIPTION:
In CVR, a node leave triggers a reconfiguration. The reconfiguration code path
initiates the RVG recovery of all the shared disk groups. The recovery is
needed to flush the SRL (shared by all the nodes) to the data volumes, to avoid
missing any writes made to the data volumes by the leaving node. This recovery
involves reading the data from the SRL and copying it to the data volumes. The
flush can take a long time, depending on the disk response time and the size of
the SRL region that must be flushed. During the recovery, a flag is set on the
RVG to block any new I/O. In this particular case, the recovery took 30
minutes. During that time another node left, which triggered a second
reconfiguration. Before triggering another recovery, the second reconfiguration
set the RECOVER flag on the RVG and waited for the I/O count to go to zero. The
first RVG recovery cleared the RECOVER flag once it completed the SRL flush
after 30 minutes. Since this is the same flag set by the second
reconfiguration, and I/O resumed as soon as the RECOVER flag was unset, the
second reconfiguration kept waiting indefinitely for the I/O count to go to
zero and was stuck forever.

RESOLUTION:
If the RECOVER flag is already set, the reconfiguration code path does not keep
waiting for the I/O count to become zero. There is no need for another recovery
when the second reconfiguration starts before the first recovery completes.

* INCIDENT NO:2643139    TRACKING ID:2620556

SYMPTOM:
I/O hangs on the primary after an SRL overflow, during the SRL flush and an
RLINK connect/disconnect.

DESCRIPTION:
As part of an RLINK connect or disconnect, the RVG is serialized to complete
the connection or disconnection. I/O throttling is normal during the SRL flush,
due to memory pool pressure or reaching the maximum throttle limit. During the
serialization, I/O is throttled to complete the DCM flush, and the remote I/Os
are kept on the throttle queue while throttling is in effect. Because of the
I/O serialization, the throttled I/Os never get flushed, and so those I/Os
never complete.

RESOLUTION:
If the serialization is successful, the throttle queue is flushed immediately.
This makes sure that the remote I/Os are retried in the serialization code
path.

* INCIDENT NO:2643155    TRACKING ID:2607293

SYMPTOM:
The VVR Primary panics while deleting an RVG, with the following stack trace:

panic_save_regs_switchstack+0x110
panic
bad_news
bubbleup+0x880
rv_ibc_freeze_timeout
invoke_callouts_for_self
soft_intr_handler
external_interrupt
bubbleup+0x880

DESCRIPTION:
The VVR Primary is frozen to send an IBC for a given timeout value. If the RVG
is deleted before the unfreeze is done or the timeout expires, a panic can
occur. Due to a bug in the code, the freeze timer is not cleared during RVG
deletion. When the freeze timer expires, its callback routine accesses the RVG
information; if the RVG has been deleted, accessing it causes the panic.

RESOLUTION:
To fix this issue, the IBC freeze timer is checked and unset while deleting the
RVG.

* INCIDENT NO:2643156    TRACKING ID:2610877

SYMPTOM:
"vxdg -g <dgname> set activation=<mode>" may hang, due to a bug in the
activation code path, when a memory allocation fails in the kernel.

DESCRIPTION:
The vxdg activation command is used to set read-write permissions at the disk
group level on each node. If a memory allocation fails in the VxVM kernel path
while this command is running, a bug in this code path can make the command
hang. If this command hangs, it also ends up blocking most VxVM commands.

RESOLUTION:
Code changes are made in the VxVM kernel code path to handle memory allocation
failure correctly, retrying the allocation until it succeeds.
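The following is a minimal sketch in C, not the actual VxVM kernel code, of the
retry-until-success allocation pattern this resolution describes. The
allocator alloc_nowait() and the delay routine short_delay() are hypothetical
stand-ins for whatever non-blocking allocation and back-off primitives the
kernel provides.

/*
 * Illustrative sketch only -- not the actual VxVM kernel code.
 */
#include <stddef.h>

extern void *alloc_nowait(size_t size);    /* returns NULL on failure */
extern void  short_delay(void);            /* back off before retrying */

/* Keep retrying the allocation instead of mishandling the failure (which
 * is what hung the activation request) when memory is temporarily short. */
static void *alloc_retry(size_t size)
{
    void *p;

    while ((p = alloc_nowait(size)) == NULL)
        short_delay();                     /* give reclaim a chance to run */
    return p;
}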
* INCIDENT NO:2643159    TRACKING ID:2633936

SYMPTOM:
When a read error occurs in a mirrored volume, the following message is
displayed on the console:

V-5-0-1279 rderr3_childdone: read error on object of mirror in volume
(start length ) corrected

DESCRIPTION:
When a read error occurs on one plex of a mirrored volume, data from another
plex is written to the failed plex. During this read-writeback operation, some
unwanted messages are displayed on the console.

RESOLUTION:
Code changes are done so that these unwanted messages are no longer displayed
on the user console.

* INCIDENT NO:2682534    TRACKING ID:2657797

SYMPTOM:
Starting a RAID5 volume fails when one of the subdisks in a RAID5 column starts
at an offset greater than 1TB. Example:

# vxvol -f -g dg1 -o delayrecover start vol1
VxVM vxvol ERROR V-5-1-10128 Unexpected kernel error in configuration update

DESCRIPTION:
VxVM uses an integer variable to store the starting block offset of a subdisk
in a RAID5 column. This overflows when a subdisk is located at an offset
greater than 2147483647 blocks (1TB) and results in a failure to start the
volume. Refer to "sdaj" in the following example:

v  RaidVol         -          DETACHED NEEDSYNC 64459747584 RAID  -     raid5
pl RaidVol-01      RaidVol    ENABLED  ACTIVE   64459747584 RAID  4/128 RW
[..]
SD NAME            PLEX       DISK         DISKOFFS LENGTH     [COL/]OFF    DEVICE MODE
sd DiskGroup101-01 RaidVol-01 DiskGroup101 0        1953325744 0/0          sdaa   ENA
sd DiskGroup106-01 RaidVol-01 DiskGroup106 0        1953325744 0/1953325744 sdaf   ENA
sd DiskGroup110-01 RaidVol-01 DiskGroup110 0        1953325744 0/3906651488 sdaj   ENA

RESOLUTION:
The VxVM code is modified to handle integer overflow conditions for RAID5
volumes (see the sketch after the next section).

INCIDENTS FROM OLD PATCHES:
---------------------------
NONE
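The following is a minimal sketch in C, not the actual VxVM fix, illustrating
the class of bug behind incident 2682534 above: a 32-bit signed integer cannot
hold a block offset above 2147483647 (2^31 - 1) blocks, so the column offset
3906651488 of subdisk "sdaj" overflows, while a 64-bit type holds it without
loss. The variable names are hypothetical.

/*
 * Illustrative sketch only -- not the actual VxVM fix.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Column offset of subdisk "sdaj" from the example above, in blocks. */
    uint64_t col_offset = 3906651488ULL;

    int32_t bad_offset  = (int32_t)col_offset;  /* overflows: wraps negative
                                                   on typical platforms */
    int64_t good_offset = (int64_t)col_offset;  /* fix: 64-bit offset field */

    printf("32-bit offset: %d\n",   (int)bad_offset);
    printf("64-bit offset: %lld\n", (long long)good_offset);
    return 0;
}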