README VERSION : 1.1
README CREATION DATE : 2012-09-27
PATCH-ID : PHKL_43064
PATCH NAME : VRTSvxvm 5.1SP1RP2
BASE PACKAGE NAME : VRTSvxvm
BASE PACKAGE VERSION : 5.1.100.000/ 5.1.100.001
SUPERSEDED PATCHES : PHKL_42993
REQUIRED PATCHES : PHCO_43065
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV : hpux1131 (P-PLATFORM , A-ARCHITECTURE , D-DISTRIBUTION , V-VERSION)
PATCH CATEGORY : HANG , MEMORYLEAK , PANIC , PERFORMANCE
PATCH CRITICALITY : CRITICAL
HAS KERNEL COMPONENT : YES
ID : NONE
REBOOT REQUIRED : YES

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Please refer to the install/upgrade/rollback section in the Release Notes.

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
Please refer to the install/upgrade/rollback section in the Release Notes.

SPECIAL INSTRUCTIONS:
---------------------
NONE

SUMMARY OF FIXED ISSUES:
-----------------------------------------
2227908 (2227678) Second rlink goes into DETACHED STALE state in a multiple-secondaries environment when the SRL has overflowed for multiple rlinks.
2531224 (2526623) Memory leak detected in CVM code.
2570988 (2560835) I/Os and vxconfigd hung on master node after slave is rebooted under heavy I/O load.
2613584 (2606695) Machine panics in CVR (Clustered Volume Replicator) environment while performing I/O operations.
2613596 (2606709) I/O hang is seen when the SRL overflows and one of the nodes reboots.
2616006 (2575172) I/Os are hung on master node after rebooting the slave node.
2622029 (2620556) I/O hung after SRL overflow.
2622032 (2620555) I/O hang due to SRL overflow and CVM reconfiguration.
2663673 (2656803) Race between vxnetd start and stop operations causes panic.
2695226 (2648176) Performance difference on Master vs Slave during recovery via DCO.
2703035 (925653) Node join fails for higher CVMTimeout value.
2706027 (2657797) Starting a 32TB RAID5 volume fails with V-5-1-10128 Unexpected kernel error in configuration update.
2730149 (2515369) vxconfigd(1M) can hang in the presence of EMC BCV devices.
2737374 (2735951) Uncorrectable write error is seen on a subdisk when a SCSI device/bus reset occurs.
2750455 (2560843) In a VVR (Veritas Volume Replicator) setup, I/Os can hang on slave nodes after one of the slave nodes is rebooted.
2756069 (2756059) System may panic when a large cross-dg mirrored volume is started at boot.
2800774 (2566174) Null pointer dereference in volcvm_msg_rel_gslock() results in panic.
2817926 (2423608) Panic in vol_dev_strategy() following FC problems.
2821137 (2774406) System may panic while accessing the data change map volume.
2821695 (2599526) I/O hang seen when DCM is zero.
2827791 (2760181) Panic hit on secondary slave during logowner operation.
2827794 (2775960) In a secondary CVR case, I/O hang seen on a DG during SRL disable activity on another DG.
2845984 (2739601) vradmin repstatus output occasionally reports an abnormal timestamp.
2860281 (2838059) VVR Secondary panic in vol_rv_update_expected_pos.
2860445 (2627126) I/O hang seen due to I/Os stuck at the DMP level.
2860812 (2801962) Growing a volume takes significantly longer when the volume has a version 20 DCO attached to it.
2862024 (2680343) Manual disable/enable of paths to an enclosure leads to system panic.
2876116 (2729911) I/O errors seen during controller reboot or array port disable/enable.

SUMMARY OF KNOWN ISSUES:
-----------------------------------------
2223250 (2165829) Node is not able to join the cluster when recovery is in progress.
2900968 (2920800) Creation of a VxVM BOOT disk fails with the legacy device tree removed.
2922551 (2826905) vxautoconvert(1M) fails to convert an LVM VG to a VxVM DG with the new naming scheme.
2937442 (2942692) On HPIVM 6.1, VxVM cannot identify "thin disks".

KNOWN ISSUES:
--------------
* INCIDENT NO::2223250 TRACKING ID ::2165829
SYMPTOM:: Node join fails if the recovery for the leaving node is not completed.
WORKAROUND:: Retry node join after the recovery is completed.

* INCIDENT NO::2900968 TRACKING ID ::2920800
SYMPTOM:: If OS legacy device special files are not available, the following VxVM operation may fail: creating a VxVM BOOT disk by running vxdisksetup -B.
WORKAROUND:: There is no known workaround.

* INCIDENT NO::2922551 TRACKING ID ::2826905
SYMPTOM:: vxautoconvert(1M) fails to convert an LVM VG to a VxVM DG with the new naming scheme. The conversion fails and is not able to roll back to the LVM VG's original state.
WORKAROUND:: Change to a different naming scheme before the conversion procedure using vxautoconvert(1M), and once the conversion is finished, return to the original (new) naming scheme.

* INCIDENT NO::2937442 TRACKING ID ::2942692
SYMPTOM:: On HPIVM 6.1, VxVM cannot identify "thin disks". When the "vxdisk -o thin list" command is run, the following error message is displayed:
"VxVM vxdisk INFO V-5-1-14413 No Thin Provisioned disk are attached to the system."
WORKAROUND:: None

FIXED INCIDENTS:
----------------
PATCH ID:PHKL_43064

* INCIDENT NO:2227908 TRACKING ID:2227678
SYMPTOM: In case of multiple secondaries, if one secondary has overflowed and is in resync mode, and then another secondary overflows, the rlink corresponding to the latter secondary gets DETACHED and is not able to connect again. Even a complete resynchronization does not recover the detached rlink.
DESCRIPTION: When the latter rlink overflows, it is detached. At the time of detach, the rlink goes into an incorrect and unrecoverable state, causing it to never connect again.
RESOLUTION: Changes have been made to ensure that when a resync is ongoing for one of the rlinks and another rlink overflows, the latter is detached with a valid state maintained for it. Hence, a full synchronization at a later time can recover the rlink completely.

* INCIDENT NO:2531224 TRACKING ID:2526623
SYMPTOM: Memory leak detected in the CVM DMP messaging phase. A message like the following is seen:
NOTICE: VxVM vxio V-5-3-3938 vol_unload(): not all memory has been freed (volkmem=424)
DESCRIPTION: During CVM-DMP messaging, memory was not getting freed in a specific scenario.
RESOLUTION: Necessary code changes have been made to take care of the memory deallocation.

* INCIDENT NO:2570988 TRACKING ID:2560835
SYMPTOM: On the master, I/Os and vxconfigd get hung when a slave is rebooted under heavy I/O load.
DESCRIPTION: When a slave leaves the cluster without sending the DATA ack message to the master, the slave's I/Os get stuck on the master because their logend processing cannot be completed. At the same time, cluster reconfiguration takes place as the slave has left the cluster. In the CVM (Cluster Volume Manager) reconfiguration code path these I/Os are aborted in order to proceed with the reconfiguration and recovery. But if local I/Os on the master go to the logend queue after the logendq is aborted, these local I/Os get stuck forever in the logend queue, leading to a permanent I/O hang.
RESOLUTION: During CVM reconfiguration and the subsequent RVG (Replicated Volume Group) recovery, no I/Os will be put into the logendq.

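The following is a minimal, hypothetical C sketch of the kind of guard this resolution describes (RVG_RECONFIGURING, logend_park_io and the other names are illustrative, not the actual VxVM source): once reconfiguration or recovery is in progress, nothing is parked on the already-aborted logend queue.

    #include <stddef.h>

    /* Hypothetical names, not the actual VxVM source. */
    #define RVG_RECONFIGURING 0x1
    #define RVG_RECOVERING    0x2

    struct volsio { struct volsio *next; };
    struct rvg { int flags; struct volsio *logendq; };

    static void restart_io_after_recovery(struct volsio *sio) { (void)sio; }

    /* Park an I/O for logend processing -- unless reconfiguration or
     * recovery is in progress, in which case the logendq has already been
     * aborted and anything added now would be stranded forever. */
    static void logend_park_io(struct rvg *rvg, struct volsio *sio)
    {
        if (rvg->flags & (RVG_RECONFIGURING | RVG_RECOVERING)) {
            restart_io_after_recovery(sio); /* retried once recovery ends */
            return;
        }
        sio->next = rvg->logendq;           /* normal path: wait for ack */
        rvg->logendq = sio;
    }
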
* INCIDENT NO:2613584 TRACKING ID:2606695
SYMPTOM: Machine panics in a CVR (Clustered Volume Replicator) environment while performing I/O operations. Panic stack traces might look like:
1) vol_rv_add_wrswaitq vol_get_timespec_latest vol_kmsg_obj_request vol_kmsg_request_receive vol_kmsg_receiver kernel_thread
2) vol_rv_mdship_callback vol_kmsg_receiver kernel_thread
DESCRIPTION: In CVR, the logclient requests METADATA information from the logowner node to perform write operations. The logowner node looks for any duplicate messages before adding the requests to the queue for processing. When a duplicate request arrives, the logowner tries to copy the data from the original I/O request and responds to the logclient with the METADATA information. During this process, a panic can occur (i) while copying the data, as the code handling the copy is not properly locked, or (ii) if the logclient receives inappropriate METADATA information because of an improper copy.
RESOLUTION: Code changes are made to use appropriate conditions and locks while copying the data from the original I/O requests for the duplicates.

* INCIDENT NO:2613596 TRACKING ID:2606709
SYMPTOM: SRL overflow and CVR reconfiguration lead to a reconfiguration hang.
DESCRIPTION: In the reported problem there are 6 RVGs, each with 16 data volumes; the problem can happen with more than one RVG configured. Both master and slave nodes are performing I/O. The slave node is rebooted, which triggers a reconfiguration. All 6 RVGs are doing I/O, which fully utilizes the RVIOMEM pool (the memory pool used for RVG I/Os). Due to the node leave, the I/Os on all the RVGs come to a halt, waiting for the recovery flag set by the reconfiguration code path. Some pending I/Os in all the RVGs are still kept in the queue, due to holes in the SRL caused by the node leave. The RVIOMEM pool is completely used by 3 of the RVGs (600+ I/Os) which are still doing I/O. In the reconfiguration code, the first RVG (rvg1) is picked to abort all the pending I/Os in its queue, and the code then waits for the active I/Os to complete. There are still some I/Os waiting for memory from the RVIOMEM pool, but the other active RVGs do not release any memory; their I/Os are just queued or waiting for memory. Until all the pending I/Os are serviced, the code cannot move forward to abort the I/Os, so the reconfiguration never completes.
RESOLUTION: Instead of going RVG by RVG to abort and start the recovery, the logic is changed to abort the I/Os in all the RVGs first, and to send the recovery message for all the RVGs later, after the I/O count drains to 0. This way, a hang due to some RVGs holding the memory is avoided.

* INCIDENT NO:2616006 TRACKING ID:2575172
SYMPTOM: The reconfigd thread is hung waiting for the I/O to drain.
DESCRIPTION: While doing a CVR (Clustered Volume Replicator) reconfiguration, RVG (Replicated Volume Group) recovery is started. The recovery can get stuck in a DCM (Data Change Map) read while flushing the SRL (Storage Replicator Log). The flush operation creates a large number (1000+) of threads. In some cases, when system memory is very low and a memory allocation fails, failing to reduce the child count leads to the hang.
RESOLUTION: Reset the number_of_children to 0 whenever the I/O creation fails due to a memory allocation failure.

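A much-simplified, hypothetical C sketch of the accounting fix (parent_sio, alloc_child_sio and the rest are illustrative, not the actual VxVM source): if a child flush I/O cannot be created, the stale child count is cleared so the parent does not wait forever.

    #include <errno.h>
    #include <stdlib.h>

    /* Hypothetical names, not the actual VxVM source. */
    struct child_sio { int id; };
    struct parent_sio { int num_children; };

    static struct child_sio *alloc_child_sio(void)
    { return malloc(sizeof(struct child_sio)); }
    static void issue_child_io(struct parent_sio *p, struct child_sio *c)
    { (void)p; (void)c; }

    /* Start a flush by fanning out child I/Os. If allocation fails midway,
     * the child count must be reset; otherwise the parent waits forever
     * for completions that were never issued -- the hang described above. */
    static int srl_flush_start(struct parent_sio *parent, int nchildren)
    {
        parent->num_children = nchildren;
        for (int i = 0; i < nchildren; i++) {
            struct child_sio *child = alloc_child_sio();
            if (child == NULL) {
                /* the fix: clear the stale count (a real implementation
                 * must also wait out the i children already issued) */
                parent->num_children = 0;
                return ENOMEM;            /* the flush is retried later */
            }
            issue_child_io(parent, child);
        }
        return 0;
    }
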
* INCIDENT NO:2622029 TRACKING ID:2620556
SYMPTOM: I/O hung on the primary after SRL overflow, during SRL flush and rlink connect/disconnect.
DESCRIPTION: As part of an rlink connect or disconnect, the RVG is serialized to complete the connection or disconnection. I/O throttling is normal during the SRL flush, due to memory pool pressure or reaching the maximum throttle limit. During the serialization, I/O is throttled to complete the DCM flush, and remote I/Os are kept in the throttleq while throttling is in effect. Due to the I/O serialization, the throttled I/O never gets flushed, and because of that the I/O never completes.
RESOLUTION: If the serialization is successful, flush the throttleq immediately. This makes sure the remote I/Os are retried again in the serialization code path.

* INCIDENT NO:2622032 TRACKING ID:2620555
SYMPTOM: During CVM reconfiguration, the RVG waits for the I/O count to go to 0 in order to start the RVG recovery and complete the reconfiguration.
DESCRIPTION: In CVR, a node leave triggers a reconfiguration. The reconfiguration code path initiates the RVG recovery of all the shared diskgroups. The recovery is needed to flush the SRL (shared by all the nodes) to the data volumes, to avoid missing any writes to the data volumes by the leaving node. This recovery involves reading the data from the SRL and copying it to the data volumes. The flush may take its own time, depending on the disk response time and the size of the SRL region that must be flushed. During the recovery a flag is set on the RVG to hold off any new I/O. In this particular case, the recovery took 30 minutes. During this time another node leave happened, which triggered a second reconfiguration. Before triggering another recovery, the second reconfiguration waits for the I/O count to go to zero by setting the RECOVER flag on the RVG. The first RVG recovery cleared the RECOVER flag after 30 minutes, once the SRL flush completed. Since this is the same flag set by the second reconfiguration, and I/O resumed as soon as the flag was unset, the second reconfiguration kept waiting indefinitely for the I/O count to go to zero and was stuck forever.
RESOLUTION: If the RECOVER flag is already set, do not keep waiting for the I/O count to become zero in the reconfiguration code path. There is no need for another recovery if the second reconfiguration starts before the first recovery completes.

* INCIDENT NO:2663673 TRACKING ID:2656803
SYMPTOM: VVR (Veritas Volume Replicator) panics when vxnetd start/stop operations are invoked in parallel. The panic stack trace might look like:
panicsys vpanic_common panic mutex_enter() vol_nm_heartbeat_free() vol_sr_shutdown_netd() volnet_ioctl() volsioctl_real() spec_ioctl()
DESCRIPTION: vxnetd start and stop operations are not serialized, so a race condition and a panic are hit when they run in parallel and access shared resources without locks. The panic stack varies depending on where the resource contention is seen.
RESOLUTION: Incorporated a synchronization primitive so that only one of the vxnetd start and stop processes can run at a time.

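A minimal, hypothetical user-level sketch of that synchronization primitive (this is not the actual vxnetd source; a mutex stands in for whatever primitive the kernel code uses):

    #include <pthread.h>

    /* Hypothetical sketch, not the actual vxnetd source. */
    static pthread_mutex_t netd_lock = PTHREAD_MUTEX_INITIALIZER;
    static int netd_running;

    void vxnetd_start(void)
    {
        pthread_mutex_lock(&netd_lock);   /* start and stop serialized */
        if (!netd_running) {
            /* ... allocate and initialize shared heartbeat state ... */
            netd_running = 1;
        }
        pthread_mutex_unlock(&netd_lock);
    }

    void vxnetd_stop(void)
    {
        pthread_mutex_lock(&netd_lock);
        if (netd_running) {
            /* ... tear down shared heartbeat state ... */
            netd_running = 0;
        }
        pthread_mutex_unlock(&netd_lock);
    }
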
* INCIDENT NO:2695226 TRACKING ID:2648176
SYMPTOM: In a Cluster Volume Manager environment, additional data synchronization is noticed during the reattach of a detached plex on a mirrored volume, even when there was no I/O on the volume after the mirror was detached. This behavior is seen only on mirrored volumes that have a version 20 DCO attached and are part of a shared diskgroup.
DESCRIPTION: In a Cluster Volume Manager environment, write I/Os issued on a mirrored volume from the CVM master node are tracked in a bitmap unnecessarily. The tracked bitmap is then used during detach to create the tracking map for the detached plex. This results in an additional delta between the active plex and the detached plex. So, even when there are no I/Os after the detach, the reattach will do additional synchronization between the mirrors.
RESOLUTION: The unnecessary bitmap tracking of write I/Os issued on a mirrored volume from the CVM master node is prevented. So, the tracking map that gets created during detach will always start clean.

* INCIDENT NO:2703035 TRACKING ID:925653
SYMPTOM: Node join fails when CVMTimeout is set to a value higher than 35 minutes (approximately).
DESCRIPTION: Node join fails due to an integer overflow for higher CVMTimeout values.
RESOLUTION: Code changes done to handle higher CVMTimeout values.

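Assuming the timeout is converted from seconds to microseconds in 32-bit signed arithmetic (an assumption; the README does not state the units), the overflow threshold works out to 2147 seconds, about 35.8 minutes, which matches the observed ~35 minute limit. A hypothetical C sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch; units and names are assumptions, not the
     * actual VxVM source. Signed overflow is undefined behavior in C;
     * it is shown only to illustrate the failure mode. */
    static int32_t to_usec_buggy(int32_t sec)
    {
        return sec * 1000000;             /* overflows for sec > 2147 */
    }

    static int64_t to_usec_fixed(int32_t sec)
    {
        return (int64_t)sec * 1000000;    /* widen before multiplying */
    }

    int main(void)
    {
        int32_t cvm_timeout = 40 * 60;    /* 40 minutes, in seconds */
        printf("buggy: %d\n", to_usec_buggy(cvm_timeout));    /* negative */
        printf("fixed: %lld\n", (long long)to_usec_fixed(cvm_timeout));
        return 0;
    }
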
* INCIDENT NO:2706027 TRACKING ID:2657797
SYMPTOM: Starting a RAID5 volume fails when one of the sub-disks in a RAID5 column starts at an offset greater than 1TB. Example:
# vxvol -f -g dg1 -o delayrecover start vol1
VxVM vxvol ERROR V-5-1-10128 Unexpected kernel error in configuration update
DESCRIPTION: VxVM uses an integer variable to store the starting block offset of a sub-disk in a RAID5 column. This overflows when a sub-disk is located at an offset greater than 2147483647 blocks (1TB) and results in a failure to start the volume. Refer to "sdaj" in the following example:
v RaidVol - DETACHED NEEDSYNC 64459747584 RAID - raid5
pl RaidVol-01 RaidVol ENABLED ACTIVE 64459747584 RAID 4/128 RW
[..]
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
sd DiskGroup101-01 RaidVol-01 DiskGroup101 0 1953325744 0/0 sdaa ENA
sd DiskGroup106-01 RaidVol-01 DiskGroup106 0 1953325744 0/1953325744 sdaf ENA
sd DiskGroup110-01 RaidVol-01 DiskGroup110 0 1953325744 0/3906651488 sdaj ENA
RESOLUTION: VxVM code is modified to handle integer overflow conditions for RAID5 volumes.

* INCIDENT NO:2730149 TRACKING ID:2515369
SYMPTOM: vxconfigd(1M) can hang in the presence of EMC BCV devices in established (bcv-nr) state, with a call stack similar to the following:
inline biowait_rp biowait dmp_indirect_io gendmpioctl dmpioctl spec_ioctl vno_ioctl ioctl syscall
Also, a message similar to the following can be seen in the syslog:
NOTICE: VxVM vxdmp V-5-3-0 gendmpstrategy: strategy call failed on bp, path devno 255/
DESCRIPTION: The issue can happen during device discovery. While reading the device information, the device is expected to be opened in block mode, but the device was incorrectly being opened in character mode, causing the hang.
RESOLUTION: The code was changed to open the block device from the DMP indirect I/O code path.

* INCIDENT NO:2737374 TRACKING ID:2735951
SYMPTOM: The following messages can be seen in the syslog:
SCSI error: return code = 0x00070000 I/O error, dev , sector
VxVM vxdmp V-5-0-0 i/o error occurred (errno=0x0) on dmpnode /
DESCRIPTION: When SCSI resets happen, the I/O fails with a PATH_OK or PATH_RETRY error. As time-bound recovery is the default recovery option, VxVM retries the I/O until timeout. Because of a miscalculation of the time taken by each I/O retry, the total timeout value is reduced drastically. All retries fail with the same error within this small timeout value, and an uncorrectable error occurs.
RESOLUTION: Code changes are made to calculate the timeout value properly.

* INCIDENT NO:2750455 TRACKING ID:2560843
SYMPTOM: In a cluster of 3 or more nodes, when one of the slaves is rebooted under heavy I/O load, I/Os hang on the other slave. Example:
Node A (master and logowner), Node B (slave 1), Node C (slave 2). If Node C is doing heavy I/O and Node B is rebooted, the I/Os on Node C get hung.
DESCRIPTION: When Node B leaves the cluster, its throttled I/Os are aborted and all the resources taken by these I/Os are freed. Along with these I/Os, the throttled I/Os of Node C are also given the response that resources are not available, to let Node C resend those I/Os. But during this process, the region locks held by these I/Os on the master are not freed.
RESOLUTION: All the resources taken by the remote I/Os on the master are freed properly.

* INCIDENT NO:2756069 TRACKING ID:2756059
SYMPTOM: During the boot process, when VxVM starts a large cross-dg mirrored volume (>1.5TB), the system may panic with the following stack:
vxio:voldco_or_drl_to_pvm vxio:voldco_write_pervol_maps_20 vxio:volfmr_write_pervol_maps vxio:volfmr_copymaps_instant vxio:volfmr_copymaps vxio:vol_mv_precommit vxio:vol_commit_iolock_objects vxio:vol_ktrans_commit vxio:volconfig_ioctl vxio:volsioctl_real
DESCRIPTION: During resync of a cross-dg mirrored volume, the DRL (dirty region logging) log is changed to a tracking map on the volume. While changing the map, the pointer calculation is not done properly. Due to the wrong forward step of the pointer, an array out-of-bounds access occurs for very large volumes, leading to the panic.
RESOLUTION: Code changes are done to fix the wrong pointer increment.

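A hypothetical C sketch of this class of bug (the function and layout are illustrative, not the actual map-conversion code): a pointer advanced with the wrong stride stays in bounds on small maps but walks off the end of the array on terabyte-scale volumes.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch, not the actual VxVM source: merge DRL bits
     * into a per-volume map word array. */
    void or_drl_to_map(const uint32_t *drl, uint32_t *map, size_t nwords)
    {
        /* Buggy pattern: advancing a typed pointer by an already-scaled
         * stride, e.g. "map = (uint32_t *)((char *)map + i)", moves in
         * the wrong units; on large maps it eventually steps past the
         * end of the array and panics the machine. */
        for (size_t i = 0; i < nwords; i++)
            map[i] |= drl[i];             /* fixed: one word per step */
    }
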
* INCIDENT NO:2800774 TRACKING ID:2566174
SYMPTOM: In a Cluster Volume Manager environment, the node taking over as MASTER dumped core because of a NULL pointer dereference while releasing the ilocks. The stack is given below:
vxio:volcvm_msg_rel_gslock vxio:volkmsg_obj_sio_start vxio:voliod_iohandle vxio:voliod_loop
DESCRIPTION: The issue is seen due to offloading glock messages to the I/O daemon threads. When the VxVM I/O daemon threads process the glock release messages, the interlock release and free happen after invoking the kernel message complete routine. This has the side effect that the reference count on the control block becomes zero, and if garbage collection is running at this stage, it ends up freeing the message from the garbage queue. So, if there is a resend of the same message, two contexts process the same interlock free request. In the receiver thread, for which the interlock is NULL because it was freed from the other context, a panic occurs.
RESOLUTION: Code changes are done to offload glock messages to the VxVM I/O daemon threads after processing the control block. Also, the kernel message response routine is invoked after checking whether an interlock release is required and releasing it.

* INCIDENT NO:2817926 TRACKING ID:2423608
SYMPTOM: The system panicked after losing some paths to a disk following FC issues, with the following stack:
vol_dev_strategy+0x81 voldiosio_start+0xe70 volkcontext_process+0x8c0 volsiowait+0x130 voldio+0xad0 vol_voldio_read+0x30 volconfig_ioctl+0x940 volsioctl_real+0x7d0 volsioctl+0x60 vols_ioctl+0x80 spec_ioctl+0x100 vno_ioctl+0x390 ioctl+0x3e0 syscall+0x5a0
DESCRIPTION: The panic occurs in voldiosio_start()->vol_dev_strategy(). The related procedure in voldiosio_start() is: 1) find the disk by invoking find_disk_from_dio(); 2) set the disk policy; 3) call vol_dev_strategy() to fire the I/O. There is a race condition in traversing the dglist, and when there are FC issues, a wrong disk which was already freed is returned. As the disk has been freed, the disk's iopolicy points to freed memory, and a panic is caused when accessing it.
RESOLUTION: There is no need to traverse the dglist to get the disk information, as the same disk is set in the I/O structure. The I/O is issued to the correct disk and no panic occurs.

* INCIDENT NO:2821137 TRACKING ID:2774406
SYMPTOM: The system may panic while referencing the DCM (Data Change Map) object attached to the volume, with the following stack:
vol_flush_srl_to_dv_start voliod_iohandle voliod_loop
DESCRIPTION: When the volume tries to flush the DCM to track the I/O map and the disk attached to the DCM is not available, the DCM state is set to aborting before being marked inactive. Since the current state of the volume is still ACTIVE, trying to access the DCM object causes the panic.
RESOLUTION: Code changes are done to check that the DCM is not in the aborting state before proceeding with the DCM flush.

* INCIDENT NO:2821695 TRACKING ID:2599526
SYMPTOM: The SRL to DCM flush does not happen, resulting in an I/O hang.
DESCRIPTION: After an SRL overflow, before the RU state machine phase could be changed to VOLRP_PHASE_SRL_FLUSH, the rlink connection thread sneaked in and changed the phase to VOLRP_PHASE_START_UPDATE. Once the phase is changed to VOLRP_PHASE_START_UPDATE, the state machine misses flushing the SRL into the DCM, goes into VOLRP_PHASE_DCM_WAIT, and gets stuck there.
RESOLUTION: The RU state machine phases are handled correctly after the SRL overflows.

* INCIDENT NO:2827791 TRACKING ID:2760181
SYMPTOM: A secondary slave node hit a panic in vol_rv_change_sio_start() for an already active logowner operation.
DESCRIPTION: The slave node panics during the logowner change. The logowner change and the reconfiguration recovery process happen at the same time, leading to a race in setting the ACTIVE flag. The reconfiguration recovery unsets the flag which was set by the logowner change operation. In the middle of the logowner change operation the ACTIVE flag is missing, which leads to the system panic.
RESOLUTION: The appropriate lock is taken in the logowner change code, and more debug log entries were added to better track logowner issues.

* INCIDENT NO:2827794 TRACKING ID:2775960
SYMPTOM: On a CVR secondary, disabling the SRL on one DG triggered an I/O hang on another DG.
DESCRIPTION: The failure of the SRL LUNs caused the failure in both DGs; the I/O failures in the messages confirmed the LUN failure on DG4 also. After every 1024 I/Os to the SRL, the header of the SRL is flushed. In the SRL flush code, in the error scenario, the flush I/O is queued but not started. If the flush I/O does not complete, the application I/O hangs forever.
RESOLUTION: The fix is to start the flush I/O which is queued in the error scenario.

* INCIDENT NO:2845984 TRACKING ID:2739601
SYMPTOM: vradmin repstatus output occasionally reports abnormal timestamp information.
DESCRIPTION: Sometimes vradmin repstatus shows an abnormal timestamp, reported in the "Timestamp Information" section of the vradmin repstatus output. In this case the timestamp reported is a very high value, something like 100 hours. This condition occurs when no data has been replicated across to the secondary for a long time. This does not necessarily mean that the rlinks were disconnected for a long time: even if the rlinks are connected, it could be that no new data was written to the primary during that period, so no data got replicated across to the secondary. Now, if at this point the rlink is paused and some writes are done, vradmin repstatus shows the abnormal timestamp.
RESOLUTION: To solve this issue, whenever new data is written to the data volume and the rlink is up-to-date, this timestamp is marked. This makes sure that an abnormal timestamp is not reported.

* INCIDENT NO:2860281 TRACKING ID:2838059
SYMPTOM: The VVR secondary machine crashes with the following panic stack:
crash_kexec __die do_page_fault error_exit [exception RIP: vol_rv_update_expected_pos+337] vol_rv_service_update vol_rv_service_message_start voliod_iohandle voliod_loop kernel_thread at ffffffff8005dfb1
DESCRIPTION: If the VVR primary machine crashes without completing a few of the write I/Os to the data volumes, it fills the incomplete write I/Os with "DUMMY" I/Os. It has to do so to maintain write-order fidelity at the secondary. While processing such dummy updates on the secondary, because of a logical error, the secondary VVR code tries to dereference a NULL pointer, leading to the panic.
RESOLUTION: Code changes are made in the VVR secondary's "DUMMY" update processing code path to correct the logic.

* INCIDENT NO:2860445 TRACKING ID:2627126
SYMPTOM: An I/O hang is observed on the system, as lots of I/Os are stuck in the DMP global queue.
DESCRIPTION: Lots of I/Os and paths are stuck in dmp_delayq and dmp_path_delayq respectively, and the DMP daemon cannot process them, because of a race condition between "processing the dmp_delayq" and "waking up the DMP daemon". A lock is held while processing the dmp_delayq and is released only for a very short duration. If any path is busy in this duration, it gives an I/O error, leading to the I/O hang.
RESOLUTION: The global delay queue pointers are copied to local variables, and the lock is held only for this period; the I/Os in the queue are then processed using the local queue variables.

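A minimal, hypothetical C sketch of that queue-splice pattern (struct buf, process_delayq and the rest are illustrative, not the actual DMP source): the lock is held only long enough to detach the global queue into local variables, and the I/Os are processed without it.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical sketch, not the actual DMP source. */
    struct buf { struct buf *b_next; };

    static pthread_mutex_t dmp_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct buf *dmp_delayq;        /* global delayed-I/O queue  */

    static void restart_io(struct buf *bp) { (void)bp; }

    void process_delayq(void)
    {
        struct buf *local;

        pthread_mutex_lock(&dmp_lock);
        local = dmp_delayq;               /* copy queue head locally   */
        dmp_delayq = NULL;                /* global queue is now empty */
        pthread_mutex_unlock(&dmp_lock);  /* lock held only that long  */

        while (local != NULL) {           /* process without the lock  */
            struct buf *bp = local;
            local = bp->b_next;
            restart_io(bp);               /* busy paths can no longer
                                             stall queue processing    */
        }
    }
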
* INCIDENT NO:2860812 TRACKING ID:2801962
SYMPTOM: Operations that grow a volume, including 'vxresize' and 'vxassist growby/growto', take significantly longer if the volume has a version 20 DCO (Data Change Object) attached to it, in comparison to a volume which doesn't have a DCO attached.
DESCRIPTION: When a volume with a DCO is grown, the existing map in the DCO needs to be copied and updated to track the grown regions. The algorithm was such that for each region in the map it would search for the page that contains that region in order to update the map. The number of regions and the number of pages containing them are proportional to the volume size. So, the search complexity is amplified, and is observed primarily when the volume size is of the order of terabytes. In the reported instance, it took more than 12 minutes to grow a 2.7TB volume by 50G.
RESOLUTION: The code has been enhanced to find the regions that are contained within a page and then avoid looking up the page for each of those regions.

* INCIDENT NO:2862024 TRACKING ID:2680343
SYMPTOM: While manually disabling and enabling paths to an enclosure, the machine may panic with the following stack:
apauto_get_failover_path+0000CC() gen_dmpnode_update_cur_pri+000828() dmp_start_failover+000124() gen_update_cur_pri+00012C() dmp_update_cur_pri+000030() dmp_reconfig_update_cur_pri+000010() dmp_decipher_instructions+0006E8() dmp_process_instruction_buffer+000308() dmp_reconfigure_db+0000C4() gendmpioctl+000ECC() dmpioctl+00012C()
DESCRIPTION: The Dynamic Multi-Pathing (DMP) driver keeps track of the number of active paths and failed paths internally. The computation may go wrong while exercising manual disable/enable of paths, which can lead to a machine panic.
RESOLUTION: Code changes have been made to properly update the active path and failed path counts.

* INCIDENT NO:2876116 TRACKING ID:2729911
SYMPTOM: During a controller or port failure, UDEV removes the associated path information from DMP. While the paths are being removed, I/O occurring to this disk can still get redirected to such a path after it has been deleted, leading to an I/O failure.
DESCRIPTION: When a path is being deleted from a DMP node, the appropriate data structures for this path need to be updated so that the path is not available for I/O after deletion, which was not happening.
RESOLUTION: The DMP code is modified to not select the deleted path for future I/Os.

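A hypothetical C sketch of the path-selection check (PATH_DELETED, select_path and the rest are illustrative, not the actual DMP source): paths already removed by the reconfiguration are skipped when routing an I/O.

    #include <stddef.h>

    /* Hypothetical sketch, not the actual DMP path-selection code. */
    #define PATH_DELETED 0x1              /* set when UDEV removes the path */
    #define PATH_FAILED  0x2

    struct dmp_path { int flags; struct dmp_path *next; };

    struct dmp_path *select_path(struct dmp_path *plist)
    {
        for (struct dmp_path *p = plist; p != NULL; p = p->next) {
            if (p->flags & (PATH_DELETED | PATH_FAILED))
                continue;                 /* the fix: never pick a deleted path */
            return p;
        }
        return NULL;                      /* no usable path: fail the I/O */
    }
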
PATCH ID:PHKL_42993

* INCIDENT NO:2280285 TRACKING ID:2365486
SYMPTOM: In a two-node SFRAC configuration, after enabling ports, running "vxdisk scandisks" panics the system with the following stack:
.unlock_enable_mem() .unlock_enable_mem() dmp_update_path() dmp_decode_update_dmpnode() dmp_decipher_instructions() dmp_process_instruction_buffer() dmp_reconfigure_db() gendmpioctl() vxdmpioctl() rdevioctl() spec_ioctl() vnop_ioctl() vno_ioctl() common_ioctl() ovlya_addr_sc_flih_main()
DESCRIPTION: An improper order of acquiring and releasing locks during DMP reconfiguration, while I/O activity was running in parallel, led to the above panic.
RESOLUTION: Release the locks in the same order in which they were acquired.

* INCIDENT NO:2532440 TRACKING ID:2495186
SYMPTOM: With the TCP protocol used for replication, I/O throttling happens due to memory flow control.
DESCRIPTION: In some slow network configurations, the I/O throughput is throttled back due to the replication I/O.
RESOLUTION: It is better to keep the replication I/O outside the normal I/O code path to improve its I/O throughput performance.

* INCIDENT NO:2563291 TRACKING ID:2527289
SYMPTOM: In a Campus Cluster setup, a storage fault may lead to a DETACH of all the configured sites. This also results in I/O failure on all the nodes in the Campus Cluster.
DESCRIPTION: Site detaches are done on site-consistent diskgroups when any volume in the diskgroup loses all the mirrors of a site. During the processing of the DETACH of the last mirror in a site, we identify that it is the last mirror and DETACH the site, which in turn detaches all the objects of that site. In a Campus Cluster setup, a DCO volume is attached to any data volume created on a site-consistent diskgroup. The general configuration is to have one DCO mirror on each site. Loss of a single mirror of the DCO volume on any node will result in the detach of that site. In a two-site configuration this particular scenario results in both DCO mirrors being lost simultaneously. While the site detach for the first mirror is being processed, we also signal the DETACH of the second mirror, which ends up DETACHING the second site too. This is not hit in other tests, as there is already a check to make sure that we do not DETACH the last mirror of a volume. This check is subverted in this particular case due to the type of storage failure.
RESOLUTION: Before triggering the site detach, an explicit check is made to see if we are trying to DETACH the last ACTIVE site.

* INCIDENT NO:2626900 TRACKING ID:2608849
SYMPTOM: 1. Under a heavy I/O load on the logclient node, write I/Os on the VVR primary logowner take a very long time to complete. 2. I/Os on "master" and "slave" nodes hang when the "master" role is switched multiple times using the "vxclustadm setmaster" command.
DESCRIPTION: 1. VVR does not allow more than 2048 I/Os outstanding on the SRL volume. Any I/Os beyond this threshold are throttled. The throttled I/Os are restarted after every SRL header flush operation. While restarting the throttled I/Os, I/Os that came from the logclient are given higher priority, causing logowner I/Os to starve. 2. In the CVM reconfiguration code path the RLINK ports are not cleanly deleted on the old logowner. This prevents the RLINKs from connecting, leading to both a replication and an I/O hang.
RESOLUTION: The algorithm which restarts the throttled I/Os is modified to give a fair chance to both local and remote I/Os to proceed. Additionally, code changes are made in the CVM reconfiguration code path to delete the RLINK ports cleanly before switching the master role.

* INCIDENT NO:2636094 TRACKING ID:2635476
SYMPTOM: The DMP (Dynamic Multi-Pathing) driver does not automatically enable the failed paths of Logical Units (LUNs) that are restored.
DESCRIPTION: DMP's restore daemon probes each failed path at a default interval of 5 minutes (tunable) to detect if that path can be enabled. As part of enabling the path, DMP issues an open() on the path's device number. Owing to a bug in the DMP code, the open() was issued on a wrong device partition, which resulted in failure for every probe. Thus, the path remained in failed status at the DMP layer even though it was enabled at the array side.
RESOLUTION: Modified the DMP restore daemon code path to issue the open() on the appropriate device partition.

* INCIDENT NO:2695228 TRACKING ID:2688747
SYMPTOM: Under a heavy I/O load on the logclient node, writes on the VVR primary logowner take a very long time to complete. Writes appear to be hung.
DESCRIPTION: VVR does not allow more than a specific number of I/Os (4096) outstanding on the SRL volume. Any I/Os beyond this threshold are throttled. The throttled I/Os are restarted periodically. While restarting, I/Os belonging to the logclient get higher preference compared to logowner I/Os, which can eventually lead to starvation or an I/O hang on the logowner.
RESOLUTION: Changes are done in the I/O scheduling algorithm for restarted I/Os, to make sure that throttled local I/Os get a chance to proceed under all conditions.

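A hypothetical C sketch of one way such fairness can be achieved (the round-robin policy and all names are assumptions, not the actual VVR scheduler): the restart path alternates between the remote and local throttled queues instead of draining the remote queue first.

    #include <stddef.h>

    /* Hypothetical sketch, not the actual VVR source. */
    struct sio { struct sio *next; };
    struct queue { struct sio *head; };

    static struct sio *dequeue(struct queue *q)
    {
        struct sio *s = q->head;
        if (s != NULL)
            q->head = s->next;
        return s;
    }

    static void restart_io(struct sio *s) { (void)s; }

    /* Alternate between the remote (logclient) and local (logowner)
     * throttled queues so local I/Os can no longer be starved. */
    void restart_throttled(struct queue *remote, struct queue *local)
    {
        while (remote->head != NULL || local->head != NULL) {
            struct sio *s;
            if ((s = dequeue(remote)) != NULL)
                restart_io(s);
            if ((s = dequeue(local)) != NULL)
                restart_io(s);            /* local I/Os get an equal turn */
        }
    }
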
* INCIDENT NO:2713862 TRACKING ID:2390998
SYMPTOM: When running the 'vxdctl enable' or 'vxdisk scandisks' command after configuration changes in SAN ports, the system panicked with the following stack trace:
.disable_lock() dmp_close_path() dmp_do_cleanup() dmp_decipher_instructions() dmp_process_instruction_buffer() dmp_reconfigure_db() gendmpioctl() vxdmpioctl()
DESCRIPTION: After configuration changes in SAN ports, the configuration in VxVM also needs to be updated. In the reconfiguration process, VxVM may temporarily have both the old DMP path nodes and the new DMP path nodes, which have the same device number, in order to migrate the old ones to the new ones. VxVM maintains two types of open counts to avoid platform dependency. However, when opening/closing the old DMP path nodes while the migration is in progress, VxVM wrongly calculates the open counts in the DMP path nodes: it calculates one open count in the new node and the other open count in the old node. This results in inconsistent open counts on the node and causes a panic while checking the open counts.
RESOLUTION: The code change has been done to maintain the open counts on the same DMP path node database correctly while performing DMP device open/close.

* INCIDENT NO:2741105 TRACKING ID:2722850
SYMPTOM: Disabling/enabling controllers while I/O is in progress results in a DMP (Dynamic Multi-Pathing) thread hang with the following stack:
dmp_handle_delay_open gen_dmpnode_update_cur_pri dmp_start_failover gen_update_cur_pri dmp_update_cur_pri dmp_process_curpri dmp_daemons_loop
DESCRIPTION: DMP takes an exclusive lock to quiesce a node to be failed over, and releases the lock to do the update operations. These update operations presume that the node will remain in quiesced status. A small timing window exists between the lock release and the update operations, wherein other threads can break into this window and unquiesce the node, which leads to the hang while performing the update operations.
RESOLUTION: Corrected the quiesce counter of a node so that other threads cannot unquiesce it while a thread is performing update operations.

* INCIDENT NO:2774907 TRACKING ID:2771452
SYMPTOM: In a lossy and high-latency network, I/O gets hung on the VVR primary. Just before the I/O hang, the rlink frequently connects and disconnects.
DESCRIPTION: In a lossy and high-latency network, RLINK gets disconnected because of heartbeat timeouts. As a part of the rlink disconnect, the communication port is deleted. During this process, the RVG is serialized and the I/Os are kept in a special queue - rv_restartq. The I/Os in rv_restartq are supposed to be restarted once the port deletion is successful. The port deletion involves the termination of all the communication server processes. Because of a bug in the port deletion logic, the global variable which keeps track of the number of communication server processes got decremented twice. This caused the port deletion process to hang, leading to the I/Os in rv_restartq never being restarted.
RESOLUTION: In the port deletion logic, it is made sure that the global variable which keeps track of the number of communication server processes gets decremented correctly.

PATCH ID:PHKL_42808

* INCIDENT NO:2440015 TRACKING ID:2428170
SYMPTOM: I/O hangs when reading or writing to a volume after a total storage failure in CVM environments with Active-Passive arrays.
DESCRIPTION: In the event of a storage failure in active-passive environments, the CVM-DMP failover protocol is initiated.
This protocol is responsible for coordinating the failover of primary paths to secondary paths on all nodes in the cluster. In the event of a total storage failure, where both the primary paths and the secondary paths fail, in some situations the protocol fails to clean up some internal structures, leaving the devices quiesced.
RESOLUTION: After a total storage failure all devices should be unquiesced, allowing the I/Os to fail. The CVM-DMP protocol has been changed to clean up devices, even if all paths to a device have been removed.

* INCIDENT NO:2493635 TRACKING ID:2419803
SYMPTOM: The Secondary site panics in VVR (Veritas Volume Replicator). The stack trace might look like:
kmsg_sys_snd+0xa8() nmcom_send_tcp+0x800() nmcom_do_send+0x290() nmcom_throttle_send+0x178() nmcom_sender+0x350() thread_start+4()
DESCRIPTION: While the Secondary site is communicating with the Primary site, if it encounters an "EAGAIN" (try again) error, it tries to send data on the next connection. If all the session connections are not established by this time, it leads to a panic, as the connection is not initialized.
RESOLUTION: Code changes have been made to check for a valid connection before sending data.

* INCIDENT NO:2497637 TRACKING ID:2489350
SYMPTOM: In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS), Cluster Volume Manager (CVM) and Veritas Volume Replicator (VVR), kernel memory is leaked under certain conditions.
DESCRIPTION: In CVR (CVM + VVR), under certain conditions (for example, when I/O throttling gets enabled or the kernel messaging subsystem is overloaded), the previously allocated I/O resources are freed and the I/Os are restarted afresh. While freeing the I/O resources, the VVR primary node doesn't free the kernel memory allocated for the FS-VM private information data structure, causing a kernel memory leak of 32 bytes for each restarted I/O.
RESOLUTION: Code changes are made in VVR to free the kernel memory allocated for the FS-VM private information data structure before the I/O is restarted afresh.

* INCIDENT NO:2497796 TRACKING ID:2235382
SYMPTOM: I/Os can hang in the DMP driver when I/Os are in progress while carrying out a path failover.
DESCRIPTION: While restoring any failed path to a non-A/A LUN, the DMP driver checks whether any pending I/Os are there on the same dmpnode. If any are present, DMP marks the corresponding LUN with a special flag so that the path failover/failback can be triggered by the pending I/Os. There is a window here, and if by chance all the pending I/Os return before the dmpnode is marked, any future I/Os on the dmpnode get stuck in wait queues.
RESOLUTION: Make sure the flag is set on the LUN only when it has pending I/Os, so that failover can be triggered by the pending I/Os.

* INCIDENT NO:2507124 TRACKING ID:2484334
SYMPTOM: A system panic occurs with the following stack while collecting DMP stats:
dmp_stats_is_matching_group() dmp_group_stats() dmp_get_stats() gendmpioctl() dmpioctl()
DESCRIPTION: Whenever new devices are added to the system, the stats table is adjusted to accommodate the new devices in DMP. A race exists between the stats collection thread and the thread which adjusts the stats table to accommodate the new devices. The race can cause the stats collection thread to access memory beyond the known size of the table, causing the system panic.
RESOLUTION: The stats collection code in DMP is rectified to restrict its access to the known size of the stats table.

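A hypothetical C sketch of the bounds restriction (stats_table, collect_stats and the rest are illustrative, not the actual DMP source): the collector snapshots the table size once and never walks past it, so a concurrent grow cannot push it into unknown memory.

    #include <stddef.h>

    /* Hypothetical sketch, not the actual DMP stats code. */
    struct stats_entry { unsigned long reads, writes; };
    struct stats_table { size_t nentries; struct stats_entry *entry; };

    static void read_entry(const struct stats_entry *e) { (void)e; }

    void collect_stats(const struct stats_table *t)
    {
        size_t known = t->nentries;       /* size known at collection start */
        for (size_t i = 0; i < known; i++)
            read_entry(&t->entry[i]);     /* never chase a size that a
                                             concurrent grow may change  */
    }
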
* INCIDENT NO:2508418 TRACKING ID:2390431
SYMPTOM: In a Disaster Recovery environment, when the DCM (Data Change Map) is active and during an SRL (Storage Replicator Log)/DCM flush, the system panics due to a missing parent on one of the DCMs in an RVG (Replicated Volume Group).
DESCRIPTION: The DCM flush happens during every log update, and its frequency depends on the I/O load. If the I/O load is high, the DCM flush happens very often, and if there are more volumes in the RVG, the frequency is very high. Every DCM flush triggers the DCM flush on all the volumes in the RVG. If there are 50 volumes in an RVG, then each DCM flush creates 50 children and is controlled by one parent SIO. Once all the 50 children are done, the parent SIO releases itself for the next flush. Once the DCM flush of each child completes, it detaches itself from the parent by setting the parent field to NULL. It can happen that the 49th child is done but, before it detaches itself from the parent, the 50th child completes and releases the parent SIO for the next DCM flush. Before the 49th child detaches, the new DCM flush is started on the same 50th child. After the next flush is started, the 49th child of the previous flush detaches itself from the parent and, since it is a static SIO, it indirectly resets the new flush's parent field. Also, the lock is not obtained before modifying the SIO state field in a few scenarios.
RESOLUTION: Before reducing the children count, detach the parent first. This makes sure the new flush does not race with the previous flush. Protect the field with the required lock in all the scenarios.

* INCIDENT NO:2531983 TRACKING ID:2483053
SYMPTOM: The VVR Primary system consumes very high kernel heap memory and appears to be hung.
DESCRIPTION: There is a race between the REGION LOCK deletion thread, which runs as part of SLAVE leave reconfiguration, and the thread which processes the DATA_DONE message coming from the log client to the logowner. Because of this race, the flags which store the status information about the I/Os were not correctly updated. This caused a lot of SIOs to be stuck in a queue, consuming a large amount of kernel heap.
RESOLUTION: The code changes are made to take the proper locks while updating the SIOs' fields.

* INCIDENT NO:2531987 TRACKING ID:2510523
SYMPTOM: In a CVM-VVR configuration, I/Os on the "master" and "slave" nodes hang when the "master" role is switched to the other node using the "vxclustadm setmaster" command.
DESCRIPTION: Under heavy I/O load, I/Os are sometimes throttled in VVR if the number of outstanding I/Os on the SRL reaches a certain limit (2048 I/Os). When the "master" role is switched to the other node using the "vxclustadm setmaster" command, the throttled I/Os on the original master are never restarted. This causes the I/O hang.
RESOLUTION: Code changes are made in VVR to make sure the throttled I/Os are restarted before the "master" switching is started.

* INCIDENT NO:2552402 TRACKING ID:2432006
SYMPTOM: The system intermittently hangs during boot if the disk is encapsulated. When this problem occurs, the OS boot process stops after outputting this message:
"VxVM sysboot INFO V-5-2-3409 starting in boot mode..."
DESCRIPTION: The boot process hung due to a deadlock between two threads: one VxVM transaction thread and another thread attempting a read on the root volume issued by dhcpagent. The read I/O is deferred until the transaction is finished, but the read count incremented earlier is not properly adjusted.
RESOLUTION: Proper care is taken to decrement the pending read count if the read I/O is deferred.

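A hypothetical C sketch of that accounting fix (vol_read and the field names are illustrative, not the actual VxVM source): when the read is deferred, the count bumped on entry is undone, breaking the circular wait.

    /* Hypothetical sketch, not the actual VxVM source. */
    struct buf;
    struct volume { int pending_reads; int transaction_active; };

    static void defer_until_transaction_done(struct volume *v, struct buf *bp)
    { (void)v; (void)bp; }
    static void issue_read(struct volume *v, struct buf *bp)
    { (void)v; (void)bp; }

    void vol_read(struct volume *vol, struct buf *bp)
    {
        vol->pending_reads++;             /* counted on entry            */
        if (vol->transaction_active) {
            vol->pending_reads--;         /* the fix: undo the count, or
                                             the transaction waits on a
                                             read that waits on it      */
            defer_until_transaction_done(vol, bp);
            return;
        }
        issue_read(vol, bp);
    }
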
* INCIDENT NO:2553391 TRACKING ID:2536667
SYMPTOM: In a CVM (Cluster Volume Manager) environment, the slave node panics with the following stack:
e_block_thread() pse_block_thread() pse_sleep_thread() volsiowait() voldio() vol_voldio_read() volconfig_ioctl() volsioctl_real() volsioctl() vols_ioctl() rdevioctl() spec_ioctl() vnop_ioctl() vno_ioctl()
DESCRIPTION: The panic happened due to accessing a stale DG pointer, as the DG got deleted before the I/O returned. It may happen in cluster configurations where commands generating private region I/Os and "vxdg deport/delete" commands are executing simultaneously on two nodes of the cluster.
RESOLUTION: Code changes are made to drain private region I/Os before deleting the DG.

* INCIDENT NO:2568208 TRACKING ID:2431448
SYMPTOM: Panic in vol_rv_add_wrswaitq() while processing a duplicate message. Stack trace of the panic:
vxio:vol_rv_add_wrswaitq vxio:vol_rv_msg_metadata_req vxio:vol_get_timespec_latest vxio:vol_mv_kmsg_request vxio:vol_kmsg_obj_request vxio:kmsg_gab_poll vxio:vol_kmsg_request_receive vxio:kmsg_gab_poll vxio:vol_kmsg_receiver
DESCRIPTION: On receiving a message from a slave node, VVR looks for a duplicate message before adding it to the per-node queue. In case of a duplicate message, VVR tries to copy some data structures from the old message; if processing of the old message is already complete, we might end up accessing a freed pointer, which will cause a panic.
RESOLUTION: For a duplicate message, copying from the old message is not required, since the duplicate message is discarded. Removing the code that copies the data structures resolved this panic.

* INCIDENT NO:2583307 TRACKING ID:2185069
SYMPTOM: In a CVR setup, while application I/Os are in progress on all nodes of the primary site, bringing down a slave node results in a panic on the master node, and the following stack trace is displayed:
volsiodone vol_subdisksio_done volkcontext_process voldiskiodone voldiskiodone_intr voldmp_iodone bio_endio gendmpiodone dmpiodone bio_endio req_bio_endio blk_update_request blk_update_bidi_request blk_end_bidi_request blk_end_request scsi_io_completion scsi_finish_command scsi_softirq_done blk_done_softirq __do_softirq call_softirq do_softirq irq_exit smp_call_function_single_interrupt call_function_single_interrupt
DESCRIPTION: Access to an internal data structure is not serialized properly, resulting in corruption of that data structure. This triggers the panic.
RESOLUTION: The code is modified to properly serialize access to the internal data structure so that its contents are not corrupted under any conditions.

* INCIDENT NO:2603605 TRACKING ID:2419948
SYMPTOM: A race between the SRL flush due to SRL overflow and the kernel logging code leads to a panic.
DESCRIPTION: The rlink is disconnected and the RLINK state is moved to HALT. The primary RVG SRL overflows, since there is no replication, and this initiates DCM logging, which changes the state of the rlink to DCM (since the rlink is already disconnected, the final state stays HALT). During the SRL overflow, if the rlink connection is restored, the rlink goes through many state changes before completing the connection. If the SRL overflow and kernel logging code finishes in between these state transitions and does not find the rlink in VOLRP_PHASE_HALT, the system panics.
RESOLUTION: Consider the above state changes as valid, and make sure the SRL overflow code does not always expect the HALT state: take action for the other states, or wait for the full state transition to complete for the rlink connection.

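A hypothetical C sketch of the relaxed check (the VOLRP_PHASE_* names appear in this README; everything else is illustrative, not the actual VVR state machine):

    /* Hypothetical sketch, not the actual VVR source. */
    enum rlink_phase {
        VOLRP_PHASE_HALT,
        VOLRP_PHASE_START_UPDATE,
        VOLRP_PHASE_SRL_FLUSH,
        VOLRP_PHASE_DCM_WAIT
    };

    struct rlink { enum rlink_phase phase; };

    static void start_dcm_logging(struct rlink *rl)    { (void)rl; }
    static void wait_for_phase_change(struct rlink *rl) { (void)rl; }

    void srl_overflow_klog(struct rlink *rl)
    {
        switch (rl->phase) {
        case VOLRP_PHASE_HALT:
            start_dcm_logging(rl);        /* the only case the buggy code
                                             accepted; others panicked   */
            break;
        default:
            wait_for_phase_change(rl);    /* connection handshake still in
                                             progress: now a valid state  */
            break;
        }
    }
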
PATCH ID:PHKL_42246

* INCIDENT NO:2169348 TRACKING ID:2094672
SYMPTOM: The master node hangs with a lot of I/Os during a node reconfiguration due to a node leave.
DESCRIPTION: The reconfiguration is stuck because the I/O is not drained completely. The master node is responsible for handling the I/O for both the primary and the slave. When the slave node dies, the pending slave I/O on the master node is not cleaned up properly. This leaves some I/Os in the queue un-deleted.
RESOLUTION: Clean up the I/O during the node failure and reconfiguration scenario.

* INCIDENT NO:2211971 TRACKING ID:2190020
SYMPTOM: Under heavy I/O system load, dmp_daemon requests 1 megabyte of contiguous memory paging, which in turn slows down the system due to continuous page swapping.
DESCRIPTION: dmp_daemon keeps calculating statistical information (every 1 second by default). When the I/O load is high, the I/O statistics buffer allocation code path dynamically allocates ~1 megabyte of contiguous memory per CPU.
RESOLUTION: To avoid repeated memory allocation/free calls in every DMP I/O stats daemon interval, a two-buffer strategy was implemented for storing DMP stats records. Two buffers of the same size are allocated at the beginning; one buffer is used for writing active records while the other is read by the I/O stats daemon. The two buffers are swapped every stats daemon interval.

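A minimal, hypothetical C sketch of this two-buffer scheme (all identifiers are illustrative, not the actual DMP source):

    #include <stddef.h>

    /* Hypothetical sketch, not the actual DMP code. */
    struct dmp_stats_rec { unsigned long ios; };
    struct stats_buf { struct dmp_stats_rec *recs; size_t nrecs; };

    static struct stats_buf bufs[2];      /* both allocated once at start */
    static int active;                    /* index the I/O path writes to */

    static void drain_records(struct stats_buf *b) { b->nrecs = 0; }

    /* Called once per stats interval: swap buffers, then read the now
     * inactive one at leisure. No ~1MB allocation/free per pass, hence
     * no repeated hunt for contiguous memory under load. */
    void stats_daemon_interval(void)
    {
        int reading = active;
        active = 1 - active;              /* writers move to other buffer;
                                             a real implementation also
                                             needs a barrier or lock here */
        drain_records(&bufs[reading]);
    }
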
* INCIDENT NO:2214184 TRACKING ID:2202710
SYMPTOM: Transactions on the rlink are not allowed during an SRL to DCM flush.
DESCRIPTION: The present implementation doesn't allow an rlink transaction to go through if an SRL to DCM flush is in progress. As the SRL overflows, VVR starts reading from the SRL and marks the dirty regions in the corresponding DCMs of the data volumes; this is called the SRL to DCM flush. During the SRL to DCM flush, transactions on the rlink are not allowed. The time to complete the SRL flush depends on the SRL size; it could range from minutes to many hours. If the user initiates any transaction on the rlink, it will hang until the SRL flush completes.
RESOLUTION: Changed the code behavior to allow rlink transactions during the SRL flush. The fix stops the SRL flush to let the transaction go ahead, and restarts the flush after the transaction completes.

* INCIDENT NO:2220064 TRACKING ID:2228531
SYMPTOM: Vradmind hangs in vol_klog_lock() on the VVR (Veritas Volume Replicator) Secondary site. The stack trace might look like:
genunix:cv_wait+0x38() vxio:vol_klog_lock+0x5c() vxio:vol_mv_close+0xc0() vxio:vol_close_object+0x30() vxio:vol_object_ioctl+0x198() vxio:voliod_ioctl() vxio:volsioctl_real+0x2d4() specfs:spec_ioctl() genunix:fop_ioctl+0x20() genunix:ioctl+0x184() unix:syscall_trap32+0xcc()
DESCRIPTION: In this scenario, a flag value should be set for vradmind to be signalled and woken up. As the flag value is not set here, it causes an enduring sleep. A race condition exists between the setting and resetting of the flag values, resulting in the hang.
RESOLUTION: Code changes are made to hold a lock to avoid the race condition between the setting and resetting of the flag values.

* INCIDENT NO:2232829 TRACKING ID:2232789
SYMPTOM: With NetApp metro cluster disk arrays, takeover operations (toggling of LUN ownership within the NetApp filer) can lead to I/O failures on VxVM volumes. Example of an I/O error message at the VxVM level:
VxVM vxio V-5-0-2 Subdisk disk_36-03 block 24928: Uncorrectable write error
DESCRIPTION: During the takeover operation, the array fails the PGR and I/O SCSI commands on secondary paths with the following transient error codes - 0x02/0x04/0x0a (NOT READY/LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) or 0x02/0x04/0x01 (NOT READY/LOGICAL UNIT IS IN PROCESS OF BECOMING READY) - that are not handled properly within VxVM.
RESOLUTION: Included the required code logic within the APM so that the SCSI commands with transient errors are retried for the duration of the NetApp filer reconfiguration time (60 secs) before failing the I/Os on VxVM volumes.

* INCIDENT NO:2248354 TRACKING ID:2245121
SYMPTOM: Rlinks do not connect for NAT (Network Address Translation) configurations.
DESCRIPTION: When VVR (Veritas Volume Replicator) is replicating over a Network Address Translation (NAT) based firewall, rlinks fail to connect, resulting in replication failure. The rlinks do not connect because there is a failure during the exchange of VVR heartbeats. For NAT-based firewalls, the conversion of a mapped IPv6 (Internet Protocol version 6) address to an IPv4 (Internet Protocol version 4) address is not handled, which caused the VVR heartbeat exchange to use an incorrect IP address, leading to VVR heartbeat failure.
RESOLUTION: Code fixes have been made to appropriately handle the exchange of VVR heartbeats under NAT-based firewalls.

* INCIDENT NO:2328268 TRACKING ID:2285709
SYMPTOM: On a VxVM-rooted setup with boot devices connected through a Magellan interface card, the system hangs at early boot time due to transient I/O errors.
DESCRIPTION: On a VxVM-rooted setup with boot devices connected through a Magellan interface card, transient I/O errors are seen due to a fault in the Magellan interface card, and the system hangs as DMP doesn't do error handling at this early stage of boot.
RESOLUTION: Changes are done in the DMP code to spawn the dmp_daemon threads, which perform error handling, path restoration, etc., at early DMP module initialization time. With this change, if a transient error is seen, the dmp_daemon thread will try to probe the path and bring it online, so that the system doesn't hang in the early boot cycle.

* INCIDENT NO:2353325 TRACKING ID:1791397
SYMPTOM: Replication doesn't start if an rlink detach and attach is done just after an SRL overflow.
DESCRIPTION: As the SRL overflows, writes start being flushed from the SRL to the DCM (Data Change Map). If the rlink is detached before the complete SRL is flushed to the DCM, the rlink is left in the SRL flushing state. Due to the flushing state of the rlink, attaching the rlink again doesn't start the replication. The problem here is the way the rlink flushing state is interpreted.
RESOLUTION: To fix this issue, the logic is changed to correctly interpret the rlink flushing state.

* INCIDENT NO:2353327 TRACKING ID:2179259
SYMPTOM: When using disks of size > 2TB, if the disk encounters a media error with offset > 2TB while the disk responds to SCSI inquiry, data corruption can occur in case of a write operation.
DESCRIPTION: The I/O retry logic in DMP assumes that the I/O offset is within the 2TB limit. Hence, when using disks of size > 2TB, if the disk encounters a media error with offset > 2TB while the disk responds to SCSI inquiry, the I/O would be issued on a wrong offset within the 2TB range, causing data corruption in case of write I/Os.
RESOLUTION: The fix for this issue is to change the I/O retry mechanism to work for >2TB offsets as well, so that no offset truncation happens that could lead to data corruption.

* INCIDENT NO:2353404 TRACKING ID:2334757
SYMPTOM: Vxconfigd consumes a lot of memory when the DMP tunable dmp_probe_idle_lun is set to on. The "pmap" command on the vxconfigd process shows a continuously growing heap.
DESCRIPTION: The DMP path restoration daemon probes idle LUNs (idle LUNs are VxVM disks on which no I/O requests are scheduled) and generates notify events to vxconfigd. Vxconfigd in turn sends the notification of these events to its clients. If, for any reason, vxconfigd cannot deliver these events (because the client is busy processing an earlier event), it keeps the events with itself. Because of this slow consumption of events by its clients, the memory consumption of vxconfigd grows.
RESOLUTION: dmp_probe_idle_lun is set to off by default.

* INCIDENT NO:2353410 TRACKING ID:2286559
SYMPTOM: The system panics in the DMP (Dynamic Multi Pathing) kernel module due to kernel heap corruption while DMP path failover is in progress. The panic stack may look like:
vpanic kmem_error+0x4b4() gen_get_enabled_ctlrs+0xf4() dmp_get_enabled_ctlrs+0xf4() dmp_info_ioctl+0xc8() dmpioctl+0x20() dmp_get_enabled_cntrls+0xac() vx_dmp_config_ioctl+0xe8() quiescesio_start+0x3e0() voliod_iohandle+0x30() voliod_loop+0x24c() thread_start+4()
DESCRIPTION: During path failover in DMP, the routine gen_get_enabled_ctlrs() allocates memory proportional to the number of enabled paths. However, while releasing the memory, the routine may end up freeing more memory because of a change in the number of enabled paths.
RESOLUTION: Code changes have been made in the routines to free only the allocated memory.

* INCIDENT NO:2357579 TRACKING ID:2357507
SYMPTOM: The machine can panic while detecting unstable paths, with the following stack trace:
#0 crash_nmi_callback #1 do_nmi #2 nmi #3 schedule #4 __down #5 __wake_up #6 .text.lock.kernel_lock #7 thread_return #8 printk #9 dmp_notify_event #10 dmp_restore_node
DESCRIPTION: After detecting unstable paths, the restore daemon allocates memory to report the event to userland daemons like vxconfigd. While requesting the memory allocation, the restore daemon did not drop the spin lock, resulting in the machine panic.
RESOLUTION: Fixed the code so that spinlocks are not held while requesting memory allocation in the restore daemon.

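A hypothetical C sketch of that pattern (the spin_lock/alloc_event stubs and all names are illustrative, not the actual restore-daemon source): allocate outside the spinlock, then revalidate after reacquiring it.

    /* Hypothetical sketch, not the actual restore-daemon source. */
    struct dmp_event;
    struct dmpnode { int lock; int unstable; };

    static void spin_lock(int *l)   { (void)l; }
    static void spin_unlock(int *l) { (void)l; }
    static struct dmp_event *alloc_event(void) { return 0; } /* may sleep */
    static void queue_event(struct dmpnode *n, struct dmp_event *e)
    { (void)n; (void)e; }

    void report_unstable_path(struct dmpnode *node)
    {
        spin_lock(&node->lock);
        int report = node->unstable;
        spin_unlock(&node->lock);         /* the fix: drop the lock before
                                             a potentially sleeping call  */
        if (!report)
            return;

        struct dmp_event *ev = alloc_event(); /* safe to block here now */

        spin_lock(&node->lock);
        if (node->unstable)               /* revalidate: state may have
                                             changed while unlocked; else
                                             the event must be freed     */
            queue_event(node, ev);
        spin_unlock(&node->lock);
    }
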
* INCIDENT NO:2360415 TRACKING ID:2242268

SYMPTOM:
An agenode that had already been freed was accessed, which led to a panic. The panic stack looks like:
[0674CE30]voldrl_unlog+0001F0 (F100000070D40D08, F10001100A14B000, F1000815B002B8D0, 0000000000000000)
[06778490]vol_mv_write_done+000AD0 (F100000070D40D08, F1000815B002B8D0)
[065AC364]volkcontext_process+0000E4 (F1000815B002B8D0)
[066BD358]voldiskiodone+0009D8 (F10000062026C808)
[06594A00]voldmp_iodone+000040 (F10000062026C808)

DESCRIPTION:
The panic occurred because a memory location that had already been freed was accessed.

RESOLUTION:
Skip the data structure for further processing when its memory has already been freed.

* INCIDENT NO:2364700 TRACKING ID:2364253

SYMPTOM:
With Space Optimized snapshots at the secondary site, VVR leaks kernel memory.

DESCRIPTION:
With Space Optimized snapshots at the secondary site, VVR proactively starts copy-on-write on the snapshot volume. The I/O buffer allocated for this proactive copy-on-write was not freed even after the I/Os completed, which led to the memory leak.

RESOLUTION:
After the proactive copy-on-write is complete, the memory allocated for the I/O buffers is released.

* INCIDENT NO:2382714 TRACKING ID:2154287

SYMPTOM:
In the presence of "Not-Ready" devices, where the SCSI inquiry on the device succeeds but open or read/write operations fail, paths to such devices are continuously marked ENABLED and then DISABLED on every DMP restore task cycle.

DESCRIPTION:
The DMP restore task finds these paths connected and therefore enables them for I/O, but soon finds that they cannot be used for I/O and disables them again.

RESOLUTION:
The fix is to enable a path only if it is found to be connected and available to open and issue I/O.

* INCIDENT NO:2390804 TRACKING ID:2249113

SYMPTOM:
VVR volume recovery hangs in a dead loop in the vol_ru_recover_primlog_done() function.

DESCRIPTION:
During SRL recovery, the SRL is read to apply the updates to the data volume. The SRL can contain holes where some writes did not complete properly. These holes must be skipped; each such region is read as a dummy update and sent to the secondary. If the dummy update size is larger than the maximum write size (greater than 256k), the code logic goes into a dead loop, reading the same dummy update forever.

RESOLUTION:
Handle large holes that are greater than the VVR maximum write size.

* INCIDENT NO:2390815 TRACKING ID:2383158

SYMPTOM:
A panic occurs in vol_rv_mdship_srv_done() because the SIO has been freed and carries an invalid node pointer.

DESCRIPTION:
vol_rv_mdship_srv_done() panics while referencing wrsio->wrsrv_node because wrsrv_node holds an invalid pointer. It is also observed that the wrsio has been freed, or reallocated for a different SIO. Looking closely, vol_rv_check_wrswaitq() is called at every SIO's done processing; it scans the wait queue and releases every SIO that has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. In vol_rv_mdship_srv_done(), this flag was set and then further operations were performed on the wrsrv. During this window, another SIO that completes its done processing calls vol_rv_check_wrswaitq(), which deletes both its own SIO and any other SIO with the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. This deletes an SIO that is still in flight, causing the panic.

RESOLUTION:
The flag must be set only at the very end of the SIO's done processing, just before the wait-queue check, so that other SIOs cannot race in and delete the one that is still running.
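The ordering rule behind this fix can be sketched in C. This is a simplified illustration, not VxVM source; sio_t, SIO_LOGEND_DONE, and wrswaitq_scan() are invented stand-ins for the real SIO structures and vol_rv_check_wrswaitq().

    #include <stdlib.h>

    #define SIO_LOGEND_DONE 0x1        /* stand-in for the real flag */

    typedef struct sio {
        int         flags;
        struct sio *next;
    } sio_t;

    /* Stand-in for the wait-queue scan: frees every SIO at the head of
     * the queue whose done flag is set -- including flags set by other
     * threads. */
    static void wrswaitq_scan(sio_t **headp)
    {
        while (*headp != NULL && ((*headp)->flags & SIO_LOGEND_DONE)) {
            sio_t *done = *headp;
            *headp = done->next;
            free(done);
        }
    }

    static void wrsrv_done(sio_t *sio, sio_t **waitq)
    {
        /* Buggy order (the old code): set the flag first, then keep
         * working on the SIO. Any other thread running the scan in that
         * window frees this SIO underneath us:
         *
         *     sio->flags |= SIO_LOGEND_DONE;
         *     ...further work on sio...       <- use-after-free window
         *
         * Fixed order: finish every access to the SIO first, and set
         * the flag only as the very last step before the scan. */
        sio->flags |= SIO_LOGEND_DONE;
        wrswaitq_scan(waitq);
    }

    int main(void)
    {
        sio_t *sio   = calloc(1, sizeof(*sio));
        sio_t *waitq = sio;            /* one-element wait queue */
        wrsrv_done(sio, &waitq);       /* frees sio via the scan */
        return waitq != NULL;
    }
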
* INCIDENT NO:2390822 TRACKING ID:2369786

SYMPTOM:
On a VVR Secondary cluster, if the SRL disk goes bad, vxconfigd may hang in the transaction code path.

DESCRIPTION:
When any error is seen in a VVR shared disk group environment, error handling is done cluster-wide. On the VVR Secondary, if the SRL disk goes bad due to a temporary or actual disk failure, cluster-wide error handling starts. Error handling requires serialization; in some cases serialization was not done, which caused the error handling to go into a dead loop and hence the hang.

RESOLUTION:
Always serializing the I/O during error handling on the VVR Secondary resolves this issue.

* INCIDENT NO:2405446 TRACKING ID:2253970

SYMPTOM:
Enhancement to customize the private region I/O size based on the maximum transfer size of the underlying disk.

DESCRIPTION:
Different types of array controllers support data transfer sizes starting from 256K and beyond. The VxVM tunable volmax_specialio controls vxconfigd's configuration I/O size as well as the Atomic Copy I/O size. When volmax_specialio is tuned to a value greater than 1MB to leverage the maximum transfer size of the underlying disks, the import operation fails for disks that cannot accept I/Os larger than 256K. If the tunable is instead set to 256K, the larger transfer sizes of more capable disks are not leveraged.

RESOLUTION:
This enhancement leverages large disk transfer sizes while still supporting array controllers limited to 256K transfers.

* INCIDENT NO:2408864 TRACKING ID:2346376

SYMPTOM:
Some DMP IO statistics records were lost from the per-CPU IO statistics queue. Hence, the DMP IO statistics reporting CLI displayed incorrect data.

DESCRIPTION:
The DMP IO statistics daemon keeps two buffers for IO statistics records. One buffer is active and is updated on every I/O completion, while the other, shadow buffer is read by the IO statistics daemon. The central IO statistics table is updated from the records in the active buffer at every IO statistics interval. The problem occurs because the two buffers can be swapped from two contexts: IO throttling and IO statistics collection. IO throttling swaps the buffers but does not update the central IO statistics table, so all IO records in the active buffer are lost when the buffers are swapped in the throttling context.

RESOLUTION:
As a fix, the central IO statistics table is now also updated when the buffers are swapped in the throttling context, so that no records are lost.

* INCIDENT NO:2413077 TRACKING ID:2385680

SYMPTOM:
A vol_rv_async_childdone() panic occurred because of a corrupted pripendingq.

DESCRIPTION:
In this panic the pripendingq is always corrupted: the head entry has been freed but was never removed from the queue. In the mdship_srv_done code, on an error condition, the update is removed from the pripendingq only if the next or prev pointers of the updateq are non-NULL. As a result, the head entry is not removed in the abort scenario, and it is freed without being deleted from the queue.

RESOLUTION:
The prev and next pointer checks have been removed in all the places. The abort cases are also handled carefully for the following conditions:
1) abort of the logendq due to a slave node panic, i.e. the update entry exists but has not been removed from the pripendingq;
2) vol_kmsg_eagain type failures, i.e. the update exists but has already been removed from the pripendingq;
3) abort very early in mdship_sio_start(), i.e. the update has been allocated but is not yet on the pripendingq.
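The unlink bug can be illustrated with a short doubly-linked-list sketch in C. This is an invented example, not the VxVM pripendingq code: for the sole element of a NULL-terminated list, both prev and next are legitimately NULL, so gating removal on "next or prev non-NULL" skips exactly that entry.

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented example -- not the VxVM pripendingq code. */
    struct update {
        struct update *prev, *next;
    };

    static struct update *queue_head;

    /* Correct unlink: handles head, tail, and sole-element cases. */
    static void queue_remove(struct update *u)
    {
        if (u->prev != NULL)
            u->prev->next = u->next;
        else
            queue_head = u->next;     /* u was the head */
        if (u->next != NULL)
            u->next->prev = u->prev;
        u->prev = u->next = NULL;
    }

    static void update_abort(struct update *u)
    {
        /* Buggy guard (the old pattern): for the sole element of the
         * queue both pointers are NULL, so the entry was freed while
         * queue_head still pointed at it -- the corruption seen in the
         * panic:
         *
         *     if (u->next != NULL || u->prev != NULL)
         *         queue_remove(u);
         */
        queue_remove(u);              /* fixed: always unlink first */
        free(u);
    }

    int main(void)
    {
        struct update *u = calloc(1, sizeof(*u));
        queue_head = u;               /* sole entry: prev == next == NULL */
        update_abort(u);
        printf("head after abort: %p\n", (void *)queue_head);
        return 0;
    }
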
* INCIDENT NO:2417184 TRACKING ID:2407192

SYMPTOM:
Application I/O hangs on RVG volumes when the RVG logowner is being set on the node that takes over the master role (either via "vxclustadm setmaster" or as part of the original master leaving).

DESCRIPTION:
Whenever a node takes over the master role, RVGs are recovered on the new master. Because of a race between the RVG recovery thread (initiated as part of the master takeover) and the thread changing the RVG logowner (run as part of "vxrvg set logowner=on"), the RVG recovery does not complete, which leads to the I/O hang.

RESOLUTION:
The race condition is handled with appropriate locks and a condition variable.

* INCIDENT NO:2421100 TRACKING ID:2419348

SYMPTOM:
System panic with the following stack:
dmp_get_path_state()
do_passthru_ioctl()
dmp_passthru_ioctl()
dmpioctl()
fop_ioctl()
ioctl()
syscall_trap32()

DESCRIPTION:
This panic was caused by a race condition between the vxconfigd and vxdclid processes. vxconfigd performed a reconfiguration and freed a dmpnode, and that dmpnode was then accessed by the vxdclid thread, leading to the system panic.

RESOLUTION:
The vxdclid thread now gets the dmpnode from the global hash table, which always contains the dmpnode while the vxdclid thread is executing.

* INCIDENT NO:2423086 TRACKING ID:2033909

SYMPTOM:
Disabling a controller of an A/P-G type array could lead to an I/O hang even when paths are still available for I/O.

DESCRIPTION:
DMP was not clearing a flag in an internal DMP data structure that enables I/O to all the LUNs during a group failover operation.

RESOLUTION:
The DMP code has been modified to clear the appropriate flag for all the LUNs of the LUN group, so that failover can occur when a controller is disabled.

* INCIDENT NO:2435050 TRACKING ID:2421067

SYMPTOM:
With VVR configured, vxconfigd hangs on the primary site when trying to recover the SRL log after a system or storage failure.

DESCRIPTION:
At the start of each SRL log disk there is a config header. Part of this header is a flag that VVR uses to serialize the flushing of the SRL configuration table, ensuring that only a single thread flushes the table at any one time. In this instance, the VOLRV_SRLHDR_CONFIG_FLUSHING flag was set in the config header, the header was written to disk, and at that point the storage became inaccessible. During recovery the config header was read back from disk, and when a new flush of the SRL table was initiated, the system hung because the flag already indicated that a flush was in progress.

RESOLUTION:
When loading the SRL header from disk, the VOLRV_SRLHDR_CONFIG_FLUSHING flag is now cleared.

* INCIDENT NO:2436283 TRACKING ID:2425551

SYMPTOM:
CVM reconfiguration takes 1 minute for each RVG configuration.

DESCRIPTION:
Every RVG is given 1 minute to drain its I/O; if the I/O is not drained, the code waits the full minute before aborting the I/Os waiting in the logendq. The logic waits 1 minute for every RVG in turn.

RESOLUTION:
It is enough to give 1 minute overall for all RVGs, and to abort all RVGs after that 1 minute, instead of waiting 1 minute per RVG (see the sketch below). A longer-term alternative is to abort an RVG immediately when objiocount(rv) == queue_count(logendq), which would reduce the delay from 1 minute down to the time actually required. For that approach, the following must be taken care of:
1. the rusio may be active, which needs to be deducted from the iocount;
2. every I/O goes onto the logendq before being serviced, so it must be ensured that the queued I/Os are not in the process of being serviced.
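The timing change can be sketched as follows. This is not the actual reconfiguration code; NRVG, io_drained, and rvg_abort_logendq() are invented stand-ins used only to show a shared deadline replacing a per-object timeout.

    #include <time.h>
    #include <unistd.h>

    /* Invented stand-ins -- not VxVM source. */
    #define NRVG       8
    #define DRAIN_SECS 60

    static int io_drained[NRVG];      /* set when an RVG's I/O has drained */

    static void rvg_abort_logendq(int rvg)
    {
        io_drained[rvg] = 1;          /* stub: abort I/Os waiting in logendq */
    }

    static void drain_all_rvgs(void)
    {
        /* Old behaviour: each RVG waited up to DRAIN_SECS on its own,
         * so reconfiguration could stall for NRVG minutes in total.
         * Fixed behaviour: one shared deadline covers all RVGs. */
        time_t deadline = time(NULL) + DRAIN_SECS;
        int rvg;

        for (rvg = 0; rvg < NRVG; rvg++) {
            while (!io_drained[rvg] && time(NULL) < deadline)
                sleep(1);             /* poll until drained or timed out */
            if (!io_drained[rvg])
                rvg_abort_logendq(rvg);
        }
    }

    int main(void)
    {
        int i;
        for (i = 0; i < NRVG; i++)
            io_drained[i] = 1;        /* simulate already-drained RVGs */
        drain_all_rvgs();
        return 0;
    }
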
* INCIDENT NO:2436287 TRACKING ID:2428875

SYMPTOM:
In a CVR configuration with I/O issued from both master and slave, rebooting the slave leads to a reconfiguration hang.

DESCRIPTION:
The I/Os on both master and slave fill up the SRL and the configuration enters DCM mode. In DCM mode, a header flush to flush the DCM and the SRL header happens for every 512 updates. Since most of the I/Os are from the slave node, the I/Os throttled by the header flush are queued in the mdship_throttle_q, which is flushed at the end of the header flush. If the slave node is rebooted while SIOs are in the throttle queue, the reconfiguration code path does not flush the mdship_throttle_q and waits for the SIOs to drain. This leads to the reconfiguration hang due to a positive I/O count.

RESOLUTION:
Abort all SIOs queued in the mdship_throttle_q when a node aborts, and restart the SIOs for the nodes that did not leave.

* INCIDENT NO:2436288 TRACKING ID:2411698

SYMPTOM:
I/Os hang in a CVR (Clustered Volume Replicator) environment.

DESCRIPTION:
In a CVR environment, when the CVM (Clustered Volume Manager) slave node sends a write request to the CVM master node, the following takes place:
1) The master grabs the REGION LOCK for the write and permits the slave to issue it.
2) New I/Os on the same region wait in a REGION LOCK QUEUE until the write that acquired the REGION LOCK completes.
3) Once the I/O that acquired the REGION LOCK has been serviced by the slave node, the slave responds to the master, and the master processes the I/Os queued in the REGION LOCK QUEUE.
The problem occurs when the slave node dies before sending the master the response about the completion of the I/O that held the REGION LOCK.

RESOLUTION:
Code changes have been made to accommodate the condition described above.

* INCIDENT NO:2440031 TRACKING ID:2426274

SYMPTOM:
In a Storage Foundation environment running Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace when I/O hints are in use, for example with Symantec Oracle Disk Manager (ODM):
[] _volsio_mem_free+0x4c/0x270 [vxio]
[] vol_subdisksio_done+0x59/0x220 [vxio]
[] volkcontext_process+0x346/0x9a0 [vxio]
[] voldiskiodone+0x764/0x850 [vxio]
[] voldiskiodone_intr+0xfa/0x180 [vxio]
[] volsp_iodone_common+0x234/0x3e0 [vxio]
[] blk_update_request+0xbb/0x3e0
[] blk_update_bidi_request+0x1f/0x70
[] blk_end_bidi_request+0x27/0x80
[] scsi_end_request+0x3a/0xc0 [scsi_mod]
[] scsi_io_completion+0x109/0x4e0 [scsi_mod]
[] blk_done_softirq+0x6d/0x80
[] __do_softirq+0xbf/0x170
[] call_softirq+0x1c/0x30
[] do_softirq+0x4d/0x80
[] irq_exit+0x85/0x90
[] do_IRQ+0x6e/0xe0
[] ret_from_intr+0x0/0xa
[] default_idle+0x32/0x40
[] cpu_idle+0x5a/0xb0
[] start_kernel+0x2ca/0x395
[] x86_64_start_kernel+0xe1/0xf2

DESCRIPTION:
A single Volume Manager staged I/O, while doing 'done' processing, tried to access the FS-VM private information data structure, which had already been freed. The free also resulted in an assert indicating a mismatch in the size of the freed I/O, thereby hitting the panic.

RESOLUTION:
The FS-VM private information data structure pertaining to the I/O is now preserved until its last access, after which it is freed to release the memory.
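The lifetime rule in this fix can be shown with a short C sketch. It is an illustration only; struct fsvm_priv, struct vxvm_sio, and sio_done() are invented names, not the actual VxFS/VxVM hint interface.

    #include <stdlib.h>

    /* Invented stand-ins -- not the real FS-VM hint interface. */
    struct fsvm_priv {
        size_t io_size;               /* checked against the I/O on free */
    };

    struct vxvm_sio {
        struct fsvm_priv *priv;       /* FS-supplied hint data */
        size_t            size;
    };

    /* Fixed 'done' processing: every read of sio->priv happens before
     * the structure is released. The buggy pattern freed it mid-way and
     * then touched it again, tripping the size-mismatch assert. */
    static void sio_done(struct vxvm_sio *sio)
    {
        size_t hinted = sio->priv->io_size;   /* last access ...        */

        (void)hinted;                 /* ... complete all use of priv   */
        free(sio->priv);              /* ... and only then free it      */
        sio->priv = NULL;             /* guard against stale access     */
    }

    int main(void)
    {
        struct vxvm_sio sio = { malloc(sizeof(struct fsvm_priv)), 4096 };
        if (sio.priv == NULL)
            return 1;
        sio.priv->io_size = sio.size;
        sio_done(&sio);
        return 0;
    }
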
* INCIDENT NO:2479746 TRACKING ID:2406292

SYMPTOM:
For I/Os on volumes having multiple subdisks (for example striped volumes), the system panics with the following stack:
unix:panicsys+0x48()
unix:vpanic_common+0x78()
unix:panic+0x1c()
genunix:kmem_error+0x4b4()
vxio:vol_subdisksio_delete() - frame recycled
vxio:vol_plexsio_childdone+0x80()
vxio:volsiodone() - frame recycled
vxio:vol_subdisksio_done+0xe0()
vxio:volkcontext_process+0x118()
vxio:voldiskiodone+0x360()
vxio:voldmp_iodone+0xc()
genunix:biodone() - frame recycled
vxdmp:gendmpiodone+0x1ec()
ssd:ssd_return_command+0x240()
ssd:ssdintr+0x294()
fcp:ssfcp_cmd_callback() - frame recycled
qlc:ql_fast_fcp_post+0x184()
qlc:ql_status_entry+0x310()
qlc:ql_response_pkt+0x2bc()
qlc:ql_isr_aif+0x76c()
pcisch:pci_intr_wrapper+0xb8()
unix:intr_thread+0x168()
unix:ktl0+0x48()

DESCRIPTION:
On a striped volume, an I/O is split into multiple parts, one per subdisk in the stripe. Each part of the I/O is processed in parallel by different threads, so any two threads processing the I/O completion can enter into a race condition. Due to such a race, one of the threads can access a stale address, causing the system panic.

RESOLUTION:
The critical section of code has been modified to hold appropriate locks to avoid the race condition.

* INCIDENT NO:2484466 TRACKING ID:2480600

SYMPTOM:
I/Os of large sizes such as 512K and 1024K hang in CVR (Clustered Volume Replicator).

DESCRIPTION:
When large I/Os, of sizes such as 1MB, are performed on volumes under an RVG (Replicated Volume Group), only a limited number of I/Os can be accommodated based on the RVIOMEM pool limit, so the pool remains full for most of the duration. If the CVM (Clustered Volume Manager) slave is rebooted or goes down at this time, the pending I/Os are aborted and the corresponding memory is freed. In one case the memory was not freed, leading to the hang.

RESOLUTION:
Code changes have been made to free the memory in all scenarios.

* INCIDENT NO:2484695 TRACKING ID:2484685

SYMPTOM:
In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace:
000002a10247a7a1 vpanic()
000002a10247a851 kmem_error+0x4b4()
000002a10247a921 vol_subdisksio_done+0xe0()
000002a10247a9d1 volkcontext_process+0x118()
000002a10247aaa1 voldiskiodone+0x360()
000002a10247abb1 voldmp_iodone+0xc()
000002a10247ac61 gendmpiodone+0x1ec()
000002a10247ad11 ssd_return_command+0x240()
000002a10247add1 ssdintr+0x294()
000002a10247ae81 ql_fast_fcp_post+0x184()
000002a10247af31 ql_24xx_status_entry+0x2c8()
000002a10247afe1 ql_response_pkt+0x29c()
000002a10247b091 ql_isr_aif+0x76c()
000002a10247b181 px_msiq_intr+0x200()
000002a10247b291 intr_thread+0x168()
000002a10240b131 cpu_halt+0x174()
000002a10240b1e1 idle+0xd4()
000002a10240b291 thread_start+4()

DESCRIPTION:
A race condition exists between two I/Os (specifically Volume Manager subdisk-level staged I/Os) during 'done' processing, which causes one thread to free the FS-VM private information data structure before another thread accesses it. The likelihood of hitting the race increases with the number of CPUs.

RESOLUTION:
The race condition is avoided so that the slower thread does not access the freed FS-VM private information data structure.

* INCIDENT NO:2488042 TRACKING ID:2431423

SYMPTOM:
Panic in vol_mv_commit_check() while accessing the Data Change Map (DCM) object. Stack trace of the panic:
vol_mv_commit_check at ffffffffa0bef79e
vol_ktrans_commit at ffffffffa0be9b93
volconfig_ioctl at ffffffffa0c4a957
volsioctl_real at ffffffffa0c5395c
vols_ioctl at ffffffffa1161122
sys_ioctl at ffffffff801a2a0f
compat_sys_ioctl at ffffffff801ba4fb
sysenter_do_call at ffffffff80125039

DESCRIPTION:
In case of a DCM failure, the object pointer is set to NULL as part of the transaction. If the DCM is active, the transaction code path accesses the DCM object without checking it for NULL. The DCM object pointer can be NULL for a failed DCM, and accessing it without a NULL check caused this panic.

RESOLUTION:
The fix is to add a NULL check for the DCM object in the transaction code path.
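The added guard follows the common kernel defensive pattern below. This is a hedged C sketch, not the actual vol_mv_commit_check() code; struct dcm_object, commit_check(), and the EINVAL return are invented stand-ins.

    #include <errno.h>
    #include <stddef.h>

    /* Invented stand-ins -- not VxVM source. */
    struct dcm_object {
        int active;
    };

    /* Transaction commit check: a failed DCM leaves the object pointer
     * NULL, so it must be validated before every dereference. */
    static int commit_check(struct dcm_object *dcm_obj)
    {
        if (dcm_obj == NULL)          /* the fix: bail out on a failed DCM */
            return EINVAL;

        if (dcm_obj->active) {
            /* ... proceed with the DCM-specific commit work ... */
        }
        return 0;
    }

    int main(void)
    {
        /* Without the NULL guard, this call would dereference a NULL
         * pointer, mirroring the panic described above. */
        return commit_check(NULL) == EINVAL ? 0 : 1;
    }
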
* INCIDENT NO:2491856 TRACKING ID:2424833

SYMPTOM:
The VVR primary node crashes while replicating over a lossy, high-latency network with multiple TCP connections. In a debug VxVM build a TED assert is hit with the following stack:
brkpoint+000004 ()
ted_call_demon+00003C (0000000007D98DB8)
ted_assert+0000F0 (0000000007D98DB8, 0000000007D98B28, 0000000000000000)
.hkey_legacy_gate+00004C ()
nmcom_send_msg_tcp+000C20 (F100010A83C4E000, 0000000200000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 000000DA000000DA, 0000000100000000)
.nmcom_connect_tcp+0007D0 ()
vol_rp_connect+0012D0 (F100010B0408C000)
vol_rp_connect_start+000130 (F1000006503F9308, 0FFFFFFFF420FC50)
voliod_iohandle+0000AC (F1000006503F9308, 0000000100000001, 0FFFFFFFF420FC50)
voliod_loop+000CFC (0000000000000000)
vol_kernel_thread_init+00002C (0FFFFFFFF420FFF0)
threadentry+000054 (??, ??, ??, ??)

DESCRIPTION:
In a lossy, high-latency network, the connection between the VVR primary and secondary can be closed and re-established frequently because of heartbeat timeouts or DATA acknowledgement timeouts. In the TCP multi-connection scenario, the VVR primary sends its very first message (NMCOM_HANDSHAKE) to the secondary on the zeroth socket connection and then sends an NMCOM_SESSION message on each of the other connections. If sending the NMCOM_HANDSHAKE message fails for some reason, the VVR primary tries to send it through another connection without checking whether that connection is valid.

RESOLUTION:
Code changes have been made in VVR to use the other connections only after all the connections are established.

INCIDENTS FROM OLD PATCHES:
---------------------------
NONE