README VERSION : 1.0
README Creation Date : 2011-09-23
PATCH-ID : PHKL_42246
PATCH NAME : VRTSvxvm 5.1SP1RP1
BASE PACKAGE NAME : VRTSvxvm
BASE PACKAGE VERSION : 5.1.100.000 / 5.1.100.001
OBSOLETE PATCHES : NONE
SUPERSEDED PATCHES : NONE
REQUIRED PATCHES : PHCO_42245
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV : hpux1131 (P-Platform, A-Architecture, D-Distribution, V-Version)
PATCH CATEGORY : CORE, CORRUPTION, HANG, MEMORYLEAK, PANIC, PERFORMANCE
REBOOT REQUIRED : YES

FIXED INCIDENTS:
----------------

Patch Id::PHKL_42246

* Incident no::2169348 Tracking ID ::2094672

Symptom::The master node hangs with a lot of I/Os during a node reconfiguration triggered by a node leave.

Description::The reconfiguration is stuck because the I/O is not drained completely. The master node is responsible for handling the I/O for both the primary and the slave. When the slave node dies, the pending slave I/O on the master node is not cleaned up properly, which leaves some I/Os in the queue un-deleted.

Resolution::Clean up the I/O during the node failure and reconfiguration scenario.

* Incident no::2211971 Tracking ID ::2190020

Symptom::Under heavy I/O system load, dmp_daemon requests a 1 MB contiguous memory allocation, which in turn slows down the system due to continuous page swapping.

Description::dmp_daemon keeps calculating statistical information (every 1 second by default). When the I/O load is high, the I/O statistics buffer allocation code path dynamically allocates about 1 MB of contiguous memory per CPU.

Resolution::To avoid repeated memory allocation/free calls in every DMP I/O stats daemon interval, a two-buffer strategy was implemented for storing DMP stats records. Two buffers of the same size are allocated at the beginning; one buffer is used for writing active records while the other is read by the I/O stats daemon. The two buffers are swapped every stats daemon interval.

* Incident no::2214184 Tracking ID ::2202710

Symptom::Transactions on an rlink are not allowed while an SRL to DCM flush is in progress.

Description::The present implementation does not allow an rlink transaction to go through if an SRL to DCM flush is in progress. When the SRL overflows, VVR starts reading from the SRL and marks the dirty regions in the corresponding DCMs of the data volumes; this is called the SRL to DCM flush. During the SRL to DCM flush, transactions on the rlink are not allowed. The time to complete the SRL flush depends on the SRL size and can range from minutes to many hours. If the user initiates any transaction on the rlink, it hangs until the SRL flush completes.

Resolution::Changed the code behavior to allow rlink transactions during the SRL flush. The fix stops the SRL flush so that the transaction can go ahead, and restarts the flush after the transaction completes.

* Incident no::2220064 Tracking ID ::2228531

Symptom::Vradmind hangs in vol_klog_lock() on the VVR (Veritas Volume Replicator) Secondary site. The stack trace might look like:

genunix:cv_wait+0x38()
vxio:vol_klog_lock+0x5c()
vxio:vol_mv_close+0xc0()
vxio:vol_close_object+0x30()
vxio:vol_object_ioctl+0x198()
vxio:voliod_ioctl()
vxio:volsioctl_real+0x2d4()
specfs:spec_ioctl()
genunix:fop_ioctl+0x20()
genunix:ioctl+0x184()
unix:syscall_trap32+0xcc()

Description::In this scenario, a flag value should be set for vradmind to be signalled and woken up. As the flag value is not set here, vradmind goes into an enduring sleep. A race condition exists between the setting and resetting of the flag values, resulting in the hang.

Resolution::Code changes are made to hold a lock to avoid the race condition between the setting and resetting of the flag values.
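
The following is a minimal user-space sketch in C, illustrative only and using POSIX threads rather than the actual vxio kernel primitives, of the lost-wakeup pattern behind incident 2220064: the "done" flag is set and tested only while a lock is held, so the waiter cannot miss the wakeup and sleep forever.

/*
 * Minimal sketch (not the vxio code): a waiter sleeps until a flag is
 * set by a signaller.  If the flag were set or reset without holding
 * the lock, the wakeup could be missed and the waiter would sleep
 * forever; holding the mutex around every flag update and around the
 * predicate check removes the race.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done_flag;                    /* protected by "lock" */

static void *signaller(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    done_flag = 1;                       /* set the flag under the lock ... */
    pthread_cond_signal(&cond);          /* ... then wake the waiter        */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_mutex_lock(&lock);
    pthread_create(&tid, NULL, signaller, NULL);
    while (!done_flag)                   /* re-check the predicate on wakeup */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    printf("waiter woke up; no enduring sleep\n");
    return 0;
}
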
* Incident no::2232829 Tracking ID ::2232789

Symptom::With NetApp metro cluster disk arrays, takeover operations (toggling of LUN ownership within the NetApp filer) can lead to I/O failures on VxVM volumes. Example of an I/O error message from VxVM:

VxVM vxio V-5-0-2 Subdisk disk_36-03 block 24928: Uncorrectable write error

Description::During the takeover operation, the array fails the PGR and I/O SCSI commands on secondary paths with the following transient error codes - 0x02/0x04/0x0a (NOT READY/LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) or 0x02/0x04/0x01 (NOT READY/LOGICAL UNIT IS IN PROCESS OF BECOMING READY) - which are not handled properly within VxVM.

Resolution::Included the required code logic within the APM so that SCSI commands failing with these transient errors are retried for the duration of the NetApp filer reconfiguration time (60 seconds) before failing the I/Os on VxVM volumes.

* Incident no::2248354 Tracking ID ::2245121

Symptom::Rlinks do not connect for NAT (Network Address Translation) configurations.

Description::When VVR (Veritas Volume Replicator) is replicating over a Network Address Translation (NAT) based firewall, rlinks fail to connect, resulting in replication failure. The rlinks do not connect because the exchange of VVR heartbeats fails. For NAT-based firewalls, the conversion of a mapped IPv6 (Internet Protocol Version 6) address to an IPv4 (Internet Protocol Version 4) address was not handled, which caused VVR heartbeats to be exchanged with an incorrect IP address, leading to VVR heartbeat failure. (An illustrative sketch of this kind of address conversion appears after this group of incidents.)

Resolution::Code fixes have been made to appropriately handle the exchange of VVR heartbeats across a NAT-based firewall.

* Incident no::2328268 Tracking ID ::2285709

Symptom::On a VxVM rooted setup with boot devices connected through a Magellan interface card, the system hangs early in boot due to transient I/O errors.

Description::On a VxVM rooted setup with boot devices connected through a Magellan interface card, a fault in the Magellan interface card causes transient I/O errors, and the system hangs because DMP does not perform error handling this early in boot.

Resolution::Changes are made in the DMP code to spawn the dmp_daemon threads, which perform error handling, path restoration, and so on, at early DMP module initialization time. With this change, if a transient error is seen, a dmp_daemon thread probes the path and tries to bring it online, so that the system does not hang in the early boot cycle.

* Incident no::2353325 Tracking ID ::1791397

Symptom::Replication does not start if an rlink detach and attach is done just after an SRL overflow.

Description::When the SRL overflows, VVR starts flushing writes from the SRL to the DCM (Data Change Map). If the rlink is detached before the complete SRL is flushed to the DCM, the rlink is left in the SRL flushing state. Because of this flushing state, attaching the rlink again does not start replication. The problem is the way the rlink flushing state is interpreted.

Resolution::To fix this issue, the logic was changed to correctly interpret the rlink flushing state.
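
The following is an illustrative C sketch, using only standard socket APIs and not taken from the VVR source, of the IPv4-mapped IPv6 address conversion referred to in incident 2248354 above.

/*
 * Extract the IPv4 address embedded in an IPv4-mapped IPv6 address
 * (::ffff:a.b.c.d), the kind of conversion needed when peers behind a
 * NAT-based firewall present mapped addresses.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Return 1 and fill *out if in6 is an IPv4-mapped address, else 0. */
static int v4mapped_to_v4(const struct in6_addr *in6, struct in_addr *out)
{
    if (!IN6_IS_ADDR_V4MAPPED(in6))
        return 0;
    /* The IPv4 address occupies the last four bytes of the IPv6 address. */
    memcpy(out, &in6->s6_addr[12], sizeof(*out));
    return 1;
}

int main(void)
{
    struct in6_addr in6;
    struct in_addr  in4;
    char            buf[INET_ADDRSTRLEN];

    inet_pton(AF_INET6, "::ffff:192.0.2.10", &in6);
    if (v4mapped_to_v4(&in6, &in4))
        printf("mapped to IPv4 %s\n",
               inet_ntop(AF_INET, &in4, buf, sizeof(buf)));
    return 0;
}
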
* Incident no::2353327 Tracking ID ::2179259

Symptom::When using disks of size > 2TB, if the disk encounters a media error with an offset > 2TB while the disk still responds to SCSI inquiry, data corruption can occur in case of a write operation.

Description::The I/O retry logic in DMP assumes that the I/O offset is within the 2TB limit. Hence, when using disks of size > 2TB, if the disk encounters a media error with an offset > 2TB while the disk still responds to SCSI inquiry, the I/O is issued at a wrong offset within the 2TB range, causing data corruption in case of write I/Os.

Resolution::The fix for this issue is to change the I/O retry mechanism to work for offsets > 2TB as well, so that no offset truncation happens that could lead to data corruption.

* Incident no::2353404 Tracking ID ::2334757

Symptom::Vxconfigd consumes a lot of memory when the DMP tunable dmp_probe_idle_lun is set on. The "pmap" command on the vxconfigd process shows a continuously growing heap.

Description::The DMP path restoration daemon probes idle LUNs (idle LUNs are VxVM disks on which no I/O requests are scheduled) and generates notify events to vxconfigd. Vxconfigd in turn sends notification of these events to its clients. If, for any reason, vxconfigd cannot deliver these events (because a client is busy processing an earlier event), it keeps these events to itself. Because of this slow consumption of events by its clients, the memory consumption of vxconfigd grows.

Resolution::dmp_probe_idle_lun is set to off by default.

* Incident no::2353410 Tracking ID ::2286559

Symptom::System panics in the DMP (Dynamic Multi Pathing) kernel module due to kernel heap corruption while a DMP path failover is in progress. The panic stack may look like:

vpanic
kmem_error+0x4b4()
gen_get_enabled_ctlrs+0xf4()
dmp_get_enabled_ctlrs+0xf4()
dmp_info_ioctl+0xc8()
dmpioctl+0x20()
dmp_get_enabled_cntrls+0xac()
vx_dmp_config_ioctl+0xe8()
quiescesio_start+0x3e0()
voliod_iohandle+0x30()
voliod_loop+0x24c()
thread_start+4()

Description::During path failover in DMP, the routine gen_get_enabled_ctlrs() allocates memory proportional to the number of enabled paths. However, while releasing the memory, the routine may end up freeing more memory because of a change in the number of enabled paths.

Resolution::Code changes have been made in the routines to free only the allocated memory.

* Incident no::2357579 Tracking ID ::2357507

Symptom::The machine can panic while detecting unstable paths, with the following stack trace:

#0 crash_nmi_callback
#1 do_nmi
#2 nmi
#3 schedule
#4 __down
#5 __wake_up
#6 .text.lock.kernel_lock
#7 thread_return
#8 printk
#9 dmp_notify_event
#10 dmp_restore_node

Description::After detecting unstable paths, the restore daemon allocates memory to report the event to userland daemons such as vxconfigd. While requesting the memory allocation, the restore daemon did not drop the spin lock, resulting in the machine panic.

Resolution::Fixed the code so that spin locks are not held while requesting memory allocation in the restore daemon.

* Incident no::2357820 Tracking ID ::2357798

Symptom::VVR is leaking memory due to an unfreed vol_ru_update structure. The memory leak is very small, but it can accumulate to a large value if VVR runs for many days.

Description::VVR allocates an update structure for each write. If replication is up-to-date, then the next incoming write also creates a multi-update and adds it to the VVR replication queue.
While creating the multi-update, VVR wrongly marked the original update with a flag that means the update is in the replication queue, although it was never added (and did not need to be added) to the replication queue. When the update free routine is called, it checks whether the flag is set and, if so, does not free the update, assuming it is still in the replication queue and will be freed when it is removed from the queue. Since the update was never in the queue, it is never freed and the memory leaks. The leak happens only for the first write that arrives each time the rlink becomes up-to-date, which is why it takes many days to leak a large amount of memory.

Resolution::Marking the flag for these updates was causing this memory leak; the marking is not required, as the update is not added to the replication queue. The fix is to remove the marking and checking of the flag.

* Incident no::2360415 Tracking ID ::2242268

Symptom::An agenode that was already freed was accessed, which led to the panic. The panic stack looks like:

[0674CE30]voldrl_unlog+0001F0 (F100000070D40D08, F10001100A14B000, F1000815B002B8D0, 0000000000000000)
[06778490]vol_mv_write_done+000AD0 (F100000070D40D08, F1000815B002B8D0)
[065AC364]volkcontext_process+0000E4 (F1000815B002B8D0)
[066BD358]voldiskiodone+0009D8 (F10000062026C808)
[06594A00]voldmp_iodone+000040 (F10000062026C808)

Description::The panic happened because a memory location that had already been freed was accessed.

Resolution::Skip the data structure for further processing when the memory has already been freed.

* Incident no::2364700 Tracking ID ::2364253

Symptom::In case of Space Optimized snapshots at the secondary site, VVR leaks kernel memory.

Description::In case of Space Optimized snapshots at the secondary site, VVR proactively starts the copy-on-write on the snapshot volume. The I/O buffer allocated for this proactive copy-on-write was not freed even after the I/Os completed, which led to the memory leak.

Resolution::After the proactive copy-on-write is complete, the memory allocated for the I/O buffers is released.

* Incident no::2382714 Tracking ID ::2154287

Symptom::In the presence of "Not-Ready" devices, when the SCSI inquiry on the device succeeds but open or read/write operations fail, paths to such devices are continuously marked as ENABLED and DISABLED in every DMP restore task cycle.

Description::The issue is that the DMP restore task finds these paths connected and hence enables them for I/O, but soon finds that they cannot be used for I/O and disables them.

Resolution::The fix is to not enable a path unless it is found to be connected and available to open and issue I/O on.

* Incident no::2390804 Tracking ID ::2249113

Symptom::VVR volume recovery hangs in the vol_ru_recover_primlog_done() function in a dead loop.

Description::During SRL recovery, the SRL is read to apply the updates to the data volume. There can be holes in the SRL because some writes did not complete properly. These holes must be skipped; such a region is read as a dummy update and sent to the secondary. If the dummy update size is larger than the maximum write size (> 256k), then the code logic goes into a dead loop, reading the same dummy update forever.

Resolution::Handle large holes which are greater than the VVR maximum write size (MAX_WRITE).

* Incident no::2390815 Tracking ID ::2383158

Symptom::Panic in vol_rv_mdship_srv_done() because the SIO is freed and has an invalid node pointer.
Description::vol_rv_mdship_srv_done() panics when referencing wrsio->wrsrv_node because wrsrv_node holds an invalid pointer. It is also observed that the wrsio has been freed or allocated for a different SIO. Looking closely, vol_rv_check_wrswaitq() is called at every done of an SIO; it looks into the wait queue and releases every SIO which has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set on it. In vol_rv_mdship_srv_done(), we set this flag and then perform more operations on the wrsrv. During this time, another SIO that has completed with DONE calls vol_rv_check_wrswaitq() and deletes its own SIO as well as any other SIO which has the RV_WRSHIP_SRV_SIO_FLAG_LOGEND_DONE flag set. This deletes an SIO that is still in flight, causing the panic.

Resolution::The flag must be set just before calling the function vol_rv_mdship_srv_done(), and at the end of the SIOdone(), to prevent other SIOs from racing and deleting the currently running one.

* Incident no::2390822 Tracking ID ::2369786

Symptom::On a VVR Secondary cluster, if the SRL disk goes bad, vxconfigd may hang in the transaction code path.

Description::If any error is seen in a VVR shared disk group environment, error handling is done cluster-wide. On the VVR Secondary, if the SRL disk goes bad due to a temporary or actual disk failure, cluster-wide error handling starts. Error handling requires serialization; in some cases serialization was not done, which caused the error handling to go into a dead loop and hence the hang.

Resolution::Making sure that the I/O is always serialized during error handling on the VVR Secondary resolved this issue.

* Incident no::2405446 Tracking ID ::2253970

Symptom::Enhancement to customize the private region I/O size based on the maximum transfer size of the underlying disk.

Description::There are different types of array controllers which support data transfer sizes starting from 256K and beyond. The VxVM tunable volmax_specialio controls vxconfigd's configuration I/O size as well as the Atomic Copy I/O size. When volmax_specialio is tuned to a value greater than 1MB to leverage the maximum transfer sizes of the underlying disks, the import operation fails for disks which cannot accept more than a 256K I/O size. If the tunable is set to 256K, then the larger transfer sizes of capable disks are not leveraged.

Resolution::All the scenarios mentioned in the Description are handled in this enhancement, to leverage large disk transfer sizes as well as to support array controllers with 256K transfer sizes.

* Incident no::2408864 Tracking ID ::2346376

Symptom::Some DMP I/O statistics records were lost from the per-CPU I/O stats queue. Hence, the DMP I/O stats reporting CLI displayed incorrect data.

Description::The DMP I/O statistics daemon has two buffers for maintaining I/O statistics records. One buffer is active and is updated on every I/O completion, while the other, shadow buffer is read by the I/O statistics daemon. The central I/O statistics table is updated every I/O statistics interval from the records in the active buffer. The problem occurs because swapping of the two buffers can happen from two contexts: I/O throttling and I/O statistics collection. I/O throttling swaps the buffers but does not update the central I/O statistics table, so all I/O records in the active buffer are lost when the two buffers are swapped in the throttling context.

Resolution::As a resolution to the problem, the buffers are no longer swapped in the I/O throttling context.
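
The following is a minimal C sketch of the two-buffer statistics scheme described in incidents 2211971 and 2408864; the names (stats_record, stats_daemon_collect, and so on) are hypothetical and this is not the DMP implementation. The point it illustrates is that the I/O completion path only appends to the active buffer, and only the stats daemon swaps the buffers, so no records are lost to a swap performed from another context such as I/O throttling.

/*
 * Two pre-allocated buffers: an "active" one written by the I/O
 * completion path and a "shadow" one drained by the stats daemon.
 * The swap happens in exactly one place, the stats-daemon interval.
 */
#include <pthread.h>

#define STATS_BUFSZ 1024

struct stats_rec { unsigned long long bytes; int path_id; };

struct stats_buf {
    struct stats_rec rec[STATS_BUFSZ];
    int              used;
};

static struct stats_buf bufs[2];
static int active_idx;                       /* index of the active buffer */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* I/O completion path: record into the active buffer only. */
void stats_record(int path_id, unsigned long long bytes)
{
    pthread_mutex_lock(&stats_lock);
    struct stats_buf *b = &bufs[active_idx];
    if (b->used < STATS_BUFSZ) {
        b->rec[b->used].path_id = path_id;
        b->rec[b->used].bytes   = bytes;
        b->used++;
    }
    pthread_mutex_unlock(&stats_lock);
}

/* Stats-daemon interval: the only place where the buffers are swapped. */
void stats_daemon_collect(void (*update_central)(const struct stats_rec *, int))
{
    pthread_mutex_lock(&stats_lock);
    struct stats_buf *shadow = &bufs[active_idx];
    active_idx ^= 1;                         /* swap active and shadow */
    pthread_mutex_unlock(&stats_lock);

    update_central(shadow->rec, shadow->used);   /* drain outside the lock */
    shadow->used = 0;
}

static void fold_into_central_table(const struct stats_rec *r, int n)
{
    (void)r; (void)n;   /* a real implementation would update the central table */
}

int main(void)
{
    stats_record(1, 4096);
    stats_record(2, 8192);
    stats_daemon_collect(fold_into_central_table);
    return 0;
}
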
* Incident no::2413077 Tracking ID ::2385680

Symptom::The vol_rv_async_childdone() panic occurred because of a corrupted pripendingq.

Description::The pripendingq is always corrupted in this panic. The head entry is freed but not removed from the queue. In the mdship_srv_done code, in the error condition, we remove the update from the pripendingq only if the next or prev pointers of the update are non-NULL. This leads to the head pointer not being removed in the abort scenario, causing the update to be freed without being deleted from the queue. (A simplified unlink-before-free sketch follows this group of entries.)

Resolution::The prev and next checks are removed in all the places. The abort case is also handled carefully for the following conditions:
1) Abort of the logendq due to a slave node panic, i.e. the update entry exists but is not removed from the pripendingq.
2) vol_kmsg_eagain type of failures, i.e. the update exists but has been removed from the pripendingq.
3) Abort very early in mdship_sio_start(), i.e. the update is allocated but not yet in the pripendingq.

* Incident no::2417184 Tracking ID ::2407192

Symptom::Application I/O hangs on RVG volumes when the RVG logowner is being set on the node which takes over the master role (either as part of "vxclustadm setmaster" or as part of the original master leaving).

Description::Whenever a node takes over the master role, RVGs are recovered on the new master. Because of a race between the RVG recovery thread (initiated as part of master takeover) and the thread which is changing the RVG logowner (run as part of "vxrvg set logowner=on"), RVG recovery does not complete, which leads to the I/O hang.

Resolution::The race condition is handled with appropriate locks and a conditional variable.

* Incident no::2421100 Tracking ID ::2419348

Symptom::System panic due to a race condition between a DMP reconfiguration performed by vxconfigd and a DMP pass-through ioctl.

Description::This panic is because of a race condition between vxconfigd doing a dmp_reconfigure_db() and another process (vxdclid) executing dmp_passthru_ioctl().

The stack of the vxdclid thread:

000002a107684d51 dmp_get_path_state+0xc(606a5b08140, 301937d9c20, 0, 0, 0, 0)
000002a107684e01 do_passthru_ioctl+0x76c(606a5b08140, 8, 0, 606a506c840, 606a506c848, 0)
000002a107684f61 dmp_passthru_ioctl+0x74(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc)
000002a107685031 dmpioctl+0x20(11d000005ca, 40b, 3ad4c0, 100081, 606a3d477b0, 2a107685adc)
000002a1076850e1 fop_ioctl+0x20(60582fdfc00, 40b, 3ad4c0, 100081, 606a3d477b0, 1296a58)
000002a107685191 ioctl+0x184(a, 6065a188430, 3ad4c0, ff0bc910, ff1303d8, 40b)
000002a1076852e1 syscall_trap32+0xcc(a, 40b, 3ad4c0, ff0bc910, ff1303d8, ff13a5a0)

And the stack of vxconfigd, which is doing the reconfiguration:

vxdmp:dmp_get_iocount+0x68(0x7)
vxdmp:dmp_check_ios_drained+0x40()
vxdmp:dmp_check_ios_drained_in_dmpnode+0x40(0x60693cc0f00, 0x20000000)
vxdmp:dmp_decode_destroy_dmpnode+0x11c(0x2a10536b698, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_decipher_instructions+0x2e4(0x2a10536b758, 0x10, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_process_instruction_buffer+0x150(0x11d0003ffff, 0x3df634, 0x102003, 0x0, 0x19caa70)
vxdmp:dmp_reconfigure_db+0x48()
vxdmp:gendmpioctl(0x11d0003ffff, , 0x3df634, 0x102003, 0x604a7017298, 0x2a10536badc)
vxdmp:dmpioctl+0x20(, 0x444d5040, 0x3df634, 0x102003, 0x604a7017298)

In the vxdclid thread we try to get the dmpnode from the path_t structure. But at the same time the path_t has been freed as part of the reconfiguration, hence the panic.

Resolution::Get the dmpnode from the lvl1tab table instead of the path_t structure. Because an ioctl is in progress on this dmpnode, the dmpnode will be available at this time.
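
The following is a simplified C sketch, with generic queue code and hypothetical names rather than the actual pripendingq implementation, of the unlink-before-free pattern referred to in incident 2413077 above: an element is always unlinked from its queue before it is freed, because guarding the unlink with a check such as "if (u->prev && u->next)" skips the head (prev == NULL) and tail (next == NULL) and leaves a freed element reachable from the queue.

/*
 * A doubly linked queue whose abort path always unlinks before freeing.
 */
#include <stdlib.h>

struct update {
    struct update *prev, *next;
    /* payload omitted */
};

struct queue {
    struct update *head, *tail;
};

/* Unconditionally unlink the element, fixing up head/tail as needed. */
static void queue_unlink(struct queue *q, struct update *u)
{
    if (u->prev)
        u->prev->next = u->next;
    else
        q->head = u->next;          /* u was the head */

    if (u->next)
        u->next->prev = u->prev;
    else
        q->tail = u->prev;          /* u was the tail */

    u->prev = u->next = NULL;
}

/* Abort path: unlink first, then free; never free an element in place. */
static void abort_update(struct queue *q, struct update *u)
{
    queue_unlink(q, u);
    free(u);
}

int main(void)
{
    struct queue q = { NULL, NULL };
    struct update *u = calloc(1, sizeof(*u));

    q.head = q.tail = u;            /* single-element queue: u is the head */
    abort_update(&q, u);            /* must leave q.head == NULL */
    return q.head == NULL ? 0 : 1;
}
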
* Incident no::2423086 Tracking ID ::2033909

Symptom::Disabling a controller of an A/P-G type array could lead to an I/O hang even when there are available paths for I/O.

Description::DMP was not clearing a flag, in an internal DMP data structure, to enable I/O to all the LUNs during the group failover operation.

Resolution::The DMP code is modified to clear the appropriate flag for all the LUNs of the LUN group so that the failover can occur when a controller is disabled.

* Incident no::2435050 Tracking ID ::2421067

Symptom::With VVR configured, 'vxconfigd' hangs on the primary site when trying to recover the SRL log after a system or storage failure.

Description::At the start of each SRL log disk we keep a config header. Part of this header includes a flag which is used by VVR to serialize the flushing of the SRL configuration table, to ensure only a single thread flushes the table at any one time. In this instance, the 'VOLRV_SRLHDR_CONFIG_FLUSHING' flag was set in the config header, and then the config header was written to disk. At this point the storage became inaccessible. During recovery the config header was read from disk, and when trying to initiate a new flush of the SRL table, the system hung because the flag was already set, indicating that a flush was in progress.

Resolution::When loading the SRL header from disk, the flag 'VOLRV_SRLHDR_CONFIG_FLUSHING' is now cleared.

* Incident no::2436283 Tracking ID ::2425551

Symptom::The CVM reconfiguration takes 1 minute for each RVG configuration.

Description::Every RVG is given 1 minute to drain its I/O; if the I/O is not drained, the code waits 1 minute before aborting the I/Os waiting in the logendq. The logic is such that, for every RVG, it waits 1 minute for the I/Os to drain.

Resolution::It is enough to give 1 minute overall for all RVGs and to abort all the RVGs after that 1 minute, instead of waiting 1 minute for each RVG. The alternate (long-term) solution is to abort an RVG immediately when objiocount(rv) == queue_count(logendq). This would reduce the 1-minute delay further, down to the actually required time. For this, the following must be taken care of: 1. The rusio may be active, which needs to be deducted from the iocount. 2. Every I/O goes into the logendq before getting serviced, so it must be ensured that the I/Os are not in the process of being serviced.

* Incident no::2436287 Tracking ID ::2428875

Symptom::In a CVR configuration with I/O issued from both the master and the slave, rebooting the slave leads to a reconfiguration hang.

Description::The I/Os on both the master and the slave fill up the SRL, and replication goes into DCM mode. In DCM mode, a header flush to flush the DCM and the SRL header happens every 512 updates. Since most of the I/Os are from the slave node, the I/Os throttled due to the header flush are queued in the mdship_throttle_q. This queue is flushed at the end of the header flush. If the slave node is rebooted while SIOs are in the throttle queue, the reconfiguration code path does not flush the mdship_throttle_q and waits for those SIOs to drain. This leads to the reconfiguration hang due to a positive I/O count.

Resolution::Abort all the SIOs queued in the mdship_throttle_q when the node is aborted, and restart the SIOs for the nodes that did not leave.

* Incident no::2436288 Tracking ID ::2411698

Symptom::I/Os hang in a CVR (Clustered Volume Replicator) environment.

Description::In a CVR environment, when the CVM (Clustered Volume Manager) slave node sends a write request to the CVM master node, the following tasks occur.
1) The master grabs the *REGION LOCK* for the write and permits the slave to issue the write.
2) When new I/Os occur on the same region (until the write that acquired the *REGION LOCK* is complete), they wait in a *REGION LOCK QUEUE*.
3) Once the I/O that acquired the *REGION LOCK* is serviced by the slave node, the slave responds to the master about the same, and the master processes the I/Os queued in the *REGION LOCK QUEUE*.

The problem occurs when the slave node dies before sending the response to the master about completion of the I/O that held the *REGION LOCK*.

Resolution::Code changes have been made to accommodate the condition mentioned in the Description.

* Incident no::2440031 Tracking ID ::2426274

Symptom::In a Storage Foundation environment running Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace when I/O hints are being used. One such scenario is the use of Symantec Oracle Disk Manager (ODM).

[] _volsio_mem_free+0x4c/0x270 [vxio]
[] vol_subdisksio_done+0x59/0x220 [vxio]
[] volkcontext_process+0x346/0x9a0 [vxio]
[] voldiskiodone+0x764/0x850 [vxio]
[] voldiskiodone_intr+0xfa/0x180 [vxio]
[] volsp_iodone_common+0x234/0x3e0 [vxio]
[] blk_update_request+0xbb/0x3e0
[] blk_update_bidi_request+0x1f/0x70
[] blk_end_bidi_request+0x27/0x80
[] scsi_end_request+0x3a/0xc0 [scsi_mod]
[] scsi_io_completion+0x109/0x4e0 [scsi_mod]
[] blk_done_softirq+0x6d/0x80
[] __do_softirq+0xbf/0x170
[] call_softirq+0x1c/0x30
[] do_softirq+0x4d/0x80
[] irq_exit+0x85/0x90
[] do_IRQ+0x6e/0xe0
[] ret_from_intr+0x0/0xa
[] default_idle+0x32/0x40
[] cpu_idle+0x5a/0xb0
[] start_kernel+0x2ca/0x395
[] x86_64_start_kernel+0xe1/0xf2

Description::A single Volume Manager I/O (staged I/O), while doing 'done' processing, was trying to access the FS-VM private information data structure, which had already been freed. This free also resulted in an assert indicating a mismatch in the size of the freed I/O, thereby hitting a panic.

Resolution::The solution is to preserve the FS-VM private information data structure pertaining to the I/O until its last access. After that, it is freed to release that memory.

* Incident no::2479746 Tracking ID ::2406292

Symptom::In case of I/Os on volumes having multiple subdisks (for example, striped volumes), the system panics with the following stack:

unix:panicsys+0x48()
unix:vpanic_common+0x78()
unix:panic+0x1c()
genunix:kmem_error+0x4b4()
vxio:vol_subdisksio_delete() - frame recycled
vxio:vol_plexsio_childdone+0x80()
vxio:volsiodone() - frame recycled
vxio:vol_subdisksio_done+0xe0()
vxio:volkcontext_process+0x118()
vxio:voldiskiodone+0x360()
vxio:voldmp_iodone+0xc()
genunix:biodone() - frame recycled
vxdmp:gendmpiodone+0x1ec()
ssd:ssd_return_command+0x240()
ssd:ssdintr+0x294()
fcp:ssfcp_cmd_callback() - frame recycled
qlc:ql_fast_fcp_post+0x184()
qlc:ql_status_entry+0x310()
qlc:ql_response_pkt+0x2bc()
qlc:ql_isr_aif+0x76c()
pcisch:pci_intr_wrapper+0xb8()
unix:intr_thread+0x168()
unix:ktl0+0x48()

Description::On a striped volume, the I/O is split into multiple parts, equal to the number of subdisks in the stripe. Each part of the I/O is processed in parallel by different threads. Thus, any two such threads processing the I/O completion can enter a race condition. Due to such a race condition, one of the threads happens to access a stale address, causing the system panic.

Resolution::The critical section of code is modified to hold appropriate locks to avoid the race condition.
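
The following is a simplified user-space C sketch, using POSIX threads and hypothetical names rather than the vxio staged I/O code, of the completion pattern described in incident 2479746: each child I/O decrements the parent's pending count under a lock, and only the thread that drops the count to zero performs the final "done" processing and frees the parent, so no thread touches a stale address.

/*
 * A striped-volume style I/O split into one child per subdisk; the
 * children complete in parallel and the last one finishes the parent.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct parent_io {
    pthread_mutex_t lock;
    int             pending;         /* children still outstanding */
};

/* Child completion handler; safe to call concurrently from many threads. */
static void child_iodone(struct parent_io *pio)
{
    int last;

    pthread_mutex_lock(&pio->lock);
    last = (--pio->pending == 0);
    pthread_mutex_unlock(&pio->lock);

    if (last) {                      /* only the last child finishes the parent */
        pthread_mutex_destroy(&pio->lock);
        free(pio);
    }
    /* Non-last children must not touch *pio past this point either. */
}

static void *child_thread(void *arg)
{
    child_iodone(arg);
    return NULL;
}

int main(void)
{
    enum { NCHILDREN = 4 };          /* e.g. a 4-column stripe */
    pthread_t tid[NCHILDREN];
    struct parent_io *pio = malloc(sizeof(*pio));

    pthread_mutex_init(&pio->lock, NULL);
    pio->pending = NCHILDREN;

    for (int i = 0; i < NCHILDREN; i++)
        pthread_create(&tid[i], NULL, child_thread, pio);
    for (int i = 0; i < NCHILDREN; i++)
        pthread_join(tid[i], NULL);

    printf("all children done; parent freed exactly once\n");
    return 0;
}
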
* Incident no::2484466 Tracking ID ::2480600

Symptom::I/Os of large sizes, such as 512K and 1024K, hang in CVR (Clustered Volume Replicator).

Description::When large I/Os, say of sizes like 1MB, are performed on volumes under an RVG (Replicated Volume Group), only a limited number of I/Os can be accommodated based on the RVIOMEM pool limit, so the pool remains full for most of the time. At this time, if the CVM (Clustered Volume Manager) slave gets rebooted or goes down, the pending I/Os are aborted and the corresponding memory is freed. In one of the cases, the memory does not get freed, leading to the hang.

Resolution::Code changes have been made to free the memory under all scenarios.

* Incident no::2484695 Tracking ID ::2484685

Symptom::In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM), Veritas File System (VxFS) and Volume Manager (VxVM), a system panic may occur with the following stack trace:

000002a10247a7a1 vpanic()
000002a10247a851 kmem_error+0x4b4()
000002a10247a921 vol_subdisksio_done+0xe0()
000002a10247a9d1 volkcontext_process+0x118()
000002a10247aaa1 voldiskiodone+0x360()
000002a10247abb1 voldmp_iodone+0xc()
000002a10247ac61 gendmpiodone+0x1ec()
000002a10247ad11 ssd_return_command+0x240()
000002a10247add1 ssdintr+0x294()
000002a10247ae81 ql_fast_fcp_post+0x184()
000002a10247af31 ql_24xx_status_entry+0x2c8()
000002a10247afe1 ql_response_pkt+0x29c()
000002a10247b091 ql_isr_aif+0x76c()
000002a10247b181 px_msiq_intr+0x200()
000002a10247b291 intr_thread+0x168()
000002a10240b131 cpu_halt+0x174()
000002a10240b1e1 idle+0xd4()
000002a10240b291 thread_start+4()

Description::A race condition exists between two I/Os (specifically Volume Manager subdisk-level staged I/Os) while doing 'done' processing, which causes one thread to free the FS-VM private information data structure before the other thread accesses it. The propensity of the race increases with the number of CPUs.

Resolution::Avoid the race condition such that the slower thread does not access the freed FS-VM private information data structure.

* Incident no::2488042 Tracking ID ::2431423

Symptom::Panic in vol_mv_commit_check() while accessing the Data Change Map (DCM) object. Stack trace of the panic:

vol_mv_commit_check at ffffffffa0bef79e
vol_ktrans_commit at ffffffffa0be9b93
volconfig_ioctl at ffffffffa0c4a957
volsioctl_real at ffffffffa0c5395c
vols_ioctl at ffffffffa1161122
sys_ioctl at ffffffff801a2a0f
compat_sys_ioctl at ffffffff801ba4fb
sysenter_do_call at ffffffff80125039

Description::In case of a DCM failure, the object pointer is set to NULL as part of the transaction. If the DCM is active, we try to access the DCM object in the transaction code path without checking whether it is NULL. The DCM object pointer can be NULL in case of a failed DCM. Accessing the object pointer without a NULL check caused this panic.

Resolution::The fix is to add a NULL check for the DCM object in the transaction code path.

* Incident no::2491856 Tracking ID ::2424833

Symptom::A VVR primary node crashes while replicating in a lossy and high-latency network with multiple TCP connections.
In a debug VxVM build, a TED assert is hit with the following stack:

brkpoint+000004 ()
ted_call_demon+00003C (0000000007D98DB8)
ted_assert+0000F0 (0000000007D98DB8, 0000000007D98B28, 0000000000000000)
.hkey_legacy_gate+00004C ()
nmcom_send_msg_tcp+000C20 (F100010A83C4E000, 0000000200000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 000000DA000000DA, 0000000100000000)
.nmcom_connect_tcp+0007D0 ()
vol_rp_connect+0012D0 (F100010B0408C000)
vol_rp_connect_start+000130 (F1000006503F9308, 0FFFFFFFF420FC50)
voliod_iohandle+0000AC (F1000006503F9308, 0000000100000001, 0FFFFFFFF420FC50)
voliod_loop+000CFC (0000000000000000)
vol_kernel_thread_init+00002C (0FFFFFFFF420FFF0)
threadentry+000054 (??, ??, ??, ??)

Description::In a lossy and high-latency network, the connection between the VVR primary and secondary can get closed and re-established frequently because of heartbeat timeouts or DATA acknowledgement timeouts. In the TCP multi-connection scenario, the VVR primary sends its very first message (called NMCOM_HANDSHAKE) to the secondary on the zeroth socket connection and then sends an "NMCOM_SESSION" message on each of the other connections. If, for some reason, sending the NMCOM_HANDSHAKE message fails, the VVR primary tries to send it through another connection without checking whether that connection is valid or not.

Resolution::Code changes are made in VVR to use the other connections only after all the connections are established.
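
The following is an illustrative C sketch, with hypothetical helper names and not taken from the VVR nmcom code, of the rule described in incident 2491856: the handshake goes out on connection 0 only, and the remaining connections are used only after every connection in the set has actually been established; a handshake failure is surfaced instead of being retried on another, possibly invalid, connection.

/*
 * Multi-connection sender: verify the whole connection set before use.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#define NCONN 4

/* Return 1 if every fd in the set refers to an established connection. */
static int all_connections_ready(const int fd[], int n)
{
    for (int i = 0; i < n; i++) {
        struct sockaddr_storage peer;
        socklen_t len = sizeof(peer);

        if (fd[i] < 0 || getpeername(fd[i], (struct sockaddr *)&peer, &len) != 0)
            return 0;
    }
    return 1;
}

/* Send the handshake on connection 0 only; never fall back to another fd. */
static int send_handshake(const int fd[], int n)
{
    static const char handshake[] = "NMCOM_HANDSHAKE";

    if (!all_connections_ready(fd, n)) {
        fprintf(stderr, "not all connections established; retry later\n");
        return -1;
    }
    if (send(fd[0], handshake, sizeof(handshake), 0) < 0) {
        /* The old behaviour would retry on another connection here;
         * instead the failure is reported so the session is rebuilt. */
        fprintf(stderr, "handshake failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd[NCONN] = { -1, -1, -1, -1 };   /* placeholder: nothing connected */

    if (send_handshake(fd, NCONN) != 0)
        fprintf(stderr, "session setup aborted\n");
    return 0;
}
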