* * * READ ME * * *
* * * Veritas Volume Manager 5.1 SP1 RP2 * * *
* * * P-patch 1 * * *
Patch Date: 2011-10-18
This document provides the following information:
* PATCH NAME
* PACKAGES AFFECTED BY THE PATCH
* BASE PRODUCT VERSIONS FOR THE PATCH
* OPERATING SYSTEMS SUPPORTED BY THE PATCH
* INCIDENTS FIXED BY THE PATCH
* INSTALLING THE PATCH
* REMOVING THE PATCH
* SPECIAL INSTRUCTIONS
* OTHERS
PATCH NAME
----------
Veritas Volume Manager 5.1 SP1 RP2 P-patch 1
PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxvm
BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
* Veritas Volume Manager 5.1 SP1
* Veritas Storage Foundation for Oracle RAC 5.1 SP1
* Veritas Storage Foundation Cluster File System 5.1 SP1
* Veritas Storage Foundation 5.1 SP1
* Veritas Storage Foundation High Availability 5.1 SP1
* Veritas Storage Foundation Cluster File System for Oracle RAC 5.1 SP1
* Veritas Dynamic Multi-Pathing 5.1 SP1
* Symantec VirtualStore 5.1 SP1
* Symantec VirtualStore 5.1 SP1 PR3
OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
RHEL5 x86-64
RHEL6 x86-64
SLES10 x86-64
SLES11 x86-64
INCIDENTS FIXED BY THE PATCH
----------------------------
This patch fixes the following Symantec incidents:
Patch ID: 5.1.132.100
* 2440015 (Tracking ID: 2428170)
SYMPTOM:
I/O hangs when reading or writing to a volume after a total storage
failure in CVM environments with Active-Passive arrays.
DESCRIPTION:
When storage fails in an Active-Passive environment, the CVM-DMP failover
protocol is initiated. This protocol coordinates the failover from primary
paths to secondary paths on all nodes in the cluster.
In the event of a total storage failure, where both the primary paths and
the secondary paths fail, the protocol can in some situations fail to clean
up some internal structures, leaving the devices quiesced.
RESOLUTION:
After a total storage failure, all devices should be un-quiesced, allowing
the I/Os to fail. The CVM-DMP protocol has been changed to clean up devices
even if all paths to a device have been removed.
* 2477272 (Tracking ID: 2169726)
SYMPTOM:
If a combination of cloned and non-cloned disks for a diskgroup is available at
the time of import, the diskgroup imported through the vxdg import operation
contains both cloned and non-cloned disks.
DESCRIPTION:
If some of the disks of a diskgroup are not available during the import
operation but their cloned counterparts are present, the imported diskgroup
ends up containing a combination of cloned and non-cloned disks.
Example -
A diskgroup named dg1 with the disks disk1 and disk2 exists on some machine.
Clones of these disks, named disk1_clone and disk2_clone, are also available.
If disk2 goes offline and the import for dg1 is performed, the resulting
diskgroup will contain the disks disk1 and disk2_clone.
RESOLUTION:
The diskgroup import operation now considers cloned disks only if no non-cloned
disk is available.
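The corrected selection rule can be pictured with the following sketch
(illustrative C with hypothetical names, not the actual VxVM source):

#include <stddef.h>

typedef struct disk {
    int available;   /* disk is present at import time */
    int is_clone;    /* disk carries the clone flag    */
} disk_t;

/* Returns 1 if at least one non-cloned disk of the group is available. */
static int have_noncloned(const disk_t *disks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (disks[i].available && !disks[i].is_clone)
            return 1;
    return 0;
}

/* A disk participates in the import only if its clone status matches the
 * group-wide decision: cloned disks are considered only when no non-cloned
 * disk is available. */
static int eligible(const disk_t *d, const disk_t *disks, size_t n)
{
    int use_clones = !have_noncloned(disks, n);
    return d->available && (d->is_clone == use_clones);
}

In the dg1 example above, since disk1 (a non-cloned disk) is still available,
the cloned disks would no longer be considered for the import.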
* 2497637 (Tracking ID: 2489350)
SYMPTOM:
In a Storage Foundation environment running Symantec Oracle Disk Manager (ODM),
Veritas File System (VxFS), Cluster Volume Manager (CVM) and Veritas Volume
Replicator (VVR), kernel memory is leaked under certain conditions.
DESCRIPTION:
In CVR (CVM + VVR), under certain conditions (for example, when I/O throttling
is enabled or the kernel messaging subsystem is overloaded), previously
allocated I/O resources are freed and the I/Os are restarted afresh. While
freeing the I/O resources, the VVR primary node does not free the kernel memory
allocated for the FS-VM private information data structure, causing a kernel
memory leak of 32 bytes for each restarted I/O.
RESOLUTION:
Code changes are made in VVR to free the kernel memory allocated for the FS-VM
private information data structure before the I/O is restarted afresh.
* 2497796 (Tracking ID: 2235382)
SYMPTOM:
I/Os can hang in the DMP driver when I/Os are in progress while a path
failover is carried out.
DESCRIPTION:
While restoring a failed path to a non-A/A LUN, the DMP driver checks whether
any pending I/Os exist on the same dmpnode. If any are present, DMP marks the
corresponding LUN with a special flag so that the pending I/Os can trigger the
path failover/failback. There is a timing window here: if all the pending I/Os
happen to return before the dmpnode is marked, any future I/Os on the dmpnode
get stuck in wait queues.
RESOLUTION:
The flag is now set on the LUN only while it still has pending I/Os, so that
failover can be triggered by those pending I/Os.
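The fixed ordering can be sketched as follows (a minimal illustration with
hypothetical names and locking primitives, not the actual driver code):

#include <pthread.h>

struct dmpnode {
    pthread_mutex_t lock;  /* protects pending_ios and flags */
    int pending_ios;       /* I/Os currently outstanding     */
    unsigned flags;
};

#define DMP_TRIGGER_FAILOVER 0x1

/* Set the failover flag only while pending I/Os still exist, under the
 * same lock that protects the pending-I/O count, so the flag cannot be
 * left set after the last pending I/O has already returned. */
void mark_for_failover(struct dmpnode *dn)
{
    pthread_mutex_lock(&dn->lock);
    if (dn->pending_ios > 0)
        dn->flags |= DMP_TRIGGER_FAILOVER;
    pthread_mutex_unlock(&dn->lock);
}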
* 2507120 (Tracking ID: 2438426)
SYMPTOM:
The following messages are displayed after vxconfigd is started.
pp_claim_device: Could not get device number for /dev/rdsk/emcpower0
pp_claim_device: Could not get device number for /dev/rdsk/emcpower1
DESCRIPTION:
The Device Discovery Layer (DDL) incorrectly marks a path under a DMP device
with the EFI flag even though there is no corresponding Extensible Firmware
Interface (EFI) device in /dev/[r]dsk/. As a result, the Array Support Library
(ASL) issues a stat command on the non-existent EFI device and displays the
above messages.
RESOLUTION:
The EFI flag is no longer set on Dynamic Multi-Pathing (DMP) paths that
correspond to non-EFI devices.
* 2507124 (Tracking ID: 2484334)
SYMPTOM:
A system panic occurs with the following stack while the DMP stats are being
collected.
dmp_stats_is_matching_group+0x314()
dmp_group_stats+0x3cc()
dmp_get_stats+0x194()
gendmpioctl()
dmpioctl+0x20()
DESCRIPTION:
Whenever new devices are added to the system, the stats table is adjusted to
accommodate the new devices in the DMP. A race exists between the stats
collection thread and the thread which adjusts the stats table to accommodate
the new devices. The race can cause the stats collection thread to access
memory beyond the known size of the table, causing the system panic.
RESOLUTION:
The stats collection code in DMP is rectified to restrict access to the known
size of the stats table.
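The effect of the fix can be sketched as follows (illustrative C; the names
are hypothetical):

#include <stddef.h>

struct stats_entry { unsigned long reads, writes; };

struct stats_table {
    struct stats_entry *entries;
    size_t known_size;   /* size the collector was started against */
};

/* Walk only the entries that existed when collection began; a concurrent
 * device addition may grow the table, but the collector never reads past
 * its size snapshot. */
void collect_stats(const struct stats_table *t,
                   void (*emit)(const struct stats_entry *))
{
    size_t n = t->known_size;    /* snapshot taken once */
    for (size_t i = 0; i < n; i++)
        emit(&t->entries[i]);
}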
* 2508294 (Tracking ID: 2419486)
SYMPTOM:
Data corruption is observed with a single path when the naming scheme is
changed from enclosure-based (EBN) to OS Native (OSN).
DESCRIPTION:
The data corruption can occur in the following configuration,
when the naming scheme is changed while applications are online.
1. The DMP device is configured with a single path, or the devices are
controlled by a third-party multipathing driver (for example, MPxIO or MPIO).
2. The DMP device naming scheme is EBN (enclosure-based naming) and
persistence=yes.
3. The naming scheme is changed to OSN using the following command:
# vxddladm set namingscheme=osn
The name of a VxVM device (DA record) may change while the naming scheme is
being changed. As a result, the device attribute list is updated with the new
DMP device names. Due to a bug in the code which updates the attribute list,
the VxVM device records are mapped to the wrong DMP devices.
Example:
The following are the device names with the EBN naming scheme:
MAS-usp0_0 auto:cdsdisk hitachi_usp0_0 prod_SC32 online
MAS-usp0_1 auto:cdsdisk hitachi_usp0_4 prod_SC32 online
MAS-usp0_2 auto:cdsdisk hitachi_usp0_5 prod_SC32 online
MAS-usp0_3 auto:cdsdisk hitachi_usp0_6 prod_SC32 online
MAS-usp0_4 auto:cdsdisk hitachi_usp0_7 prod_SC32 online
MAS-usp0_5 auto:none - - online invalid
MAS-usp0_6 auto:cdsdisk hitachi_usp0_1 prod_SC32 online
MAS-usp0_7 auto:cdsdisk hitachi_usp0_2 prod_SC32 online
MAS-usp0_8 auto:cdsdisk hitachi_usp0_3 prod_SC32 online
MAS-usp0_9 auto:none - - online invalid
disk_0 auto:cdsdisk - - online
disk_1 auto:none - - online invalid
bash-3.00# vxddladm set namingscheme=osn
The following is the output after executing the above command.
MAS-usp0_9 has changed to MAS-usp0_6, and the following devices have
changed accordingly.
bash-3.00# vxdisk list
DEVICE TYPE DISK GROUP STATUS
MAS-usp0_0 auto:cdsdisk hitachi_usp0_0 prod_SC32 online
MAS-usp0_1 auto:cdsdisk hitachi_usp0_4 prod_SC32 online
MAS-usp0_2 auto:cdsdisk hitachi_usp0_5 prod_SC32 online
MAS-usp0_3 auto:cdsdisk hitachi_usp0_6 prod_SC32 online
MAS-usp0_4 auto:cdsdisk hitachi_usp0_7 prod_SC32 online
MAS-usp0_5 auto:none - - online invalid
MAS-usp0_6 auto:none - - online invalid
MAS-usp0_7 auto:cdsdisk hitachi_usp0_1 prod_SC32 online
MAS-usp0_8 auto:cdsdisk hitachi_usp0_2 prod_SC32 online
MAS-usp0_9 auto:cdsdisk hitachi_usp0_3 prod_SC32 online
c4t20000014C3D27C09d0s2 auto:none - - online invalid
c4t20000014C3D26475d0s2 auto:cdsdisk - - online
RESOLUTION:
Code changes are made to update the device attribute list correctly even if the
name of a VxVM device changes while the naming scheme is being changed.
* 2508418 (Tracking ID: 2390431)
SYMPTOM:
In a Disaster Recovery environment, when the DCM (Data Change Map) is active,
the system panics during an SRL (Storage Replicator Log)/DCM flush due to a
missing parent on one of the DCMs in an RVG (Replicated Volume Group).
DESCRIPTION:
The DCM flush happens during every log update, and its frequency depends on the
I/O load. If the I/O load is high, the DCM flush happens very often, and if
there are more volumes in the RVG, the frequency is very high. Every DCM flush
triggers the DCM flush on all the volumes in the RVG. If there are 50 volumes
in an RVG, each DCM flush creates 50 children and is controlled by one parent
SIO. Once all the 50 children are done, the parent SIO releases itself for the
next flush. When the DCM flush of a child completes, the child detaches itself
from the parent by setting its parent field to NULL. It can happen that the
49th child is done but has not yet detached itself from the parent when the
50th child completes and releases the parent SIO for the next DCM flush, and
the new DCM flush is started on the same 50th child. After the next flush has
started, the 49th child of the previous flush detaches itself from its parent,
and since it is a static SIO, it indirectly resets the parent field of the new
flush. In addition, the lock is not obtained before modifying the SIO state
field in a few scenarios.
RESOLUTION:
Each child is now detached from the parent before the children count is
reduced, which ensures that a new flush cannot race with the previous flush.
The field is also protected with the required lock in all the scenarios.
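The corrected ordering can be pictured as follows (a minimal sketch with
hypothetical names; the real SIO code is more involved):

#include <pthread.h>
#include <stddef.h>

struct sio {
    pthread_mutex_t lock;  /* protects parent and children */
    struct sio *parent;    /* NULL once detached           */
    int children;          /* outstanding child flushes    */
};

/* Detach the child from its parent before dropping the child count, under
 * the lock, so the last child cannot release the parent for the next flush
 * while a straggler from the previous flush still points at it. */
void child_done(struct sio *child)
{
    struct sio *parent = child->parent;

    pthread_mutex_lock(&parent->lock);
    child->parent = NULL;      /* detach first ...          */
    parent->children--;        /* ... then reduce the count */
    pthread_mutex_unlock(&parent->lock);
}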
* 2511928 (Tracking ID: 2420386)
SYMPTOM:
Corrupted data is seen near the end of a sub-disk, on thin-reclaimable
disks with either CDS EFI or sliced disk formats.
DESCRIPTION:
In environments with thin-reclaim disks using either CDS-EFI disks or sliced
disks, misaligned reclaims can be initiated. In some situations, when
reclaiming a sub-disk, the reclaim does not take into account the correct
public region start offset, which in rare instances can result in reclaiming
data located before the sub-disk being reclaimed.
RESOLUTION:
The public region start offset is now taken into account when initiating all
reclaim operations.
* 2515137 (Tracking ID: 2513101)
SYMPTOM:
When VxVM is upgraded from 4.1MP4RP2 to 5.1SP1RP1, the data on a CDS disk gets
corrupted.
DESCRIPTION:
When CDS disks are initialized with VxVM version 4.1MP4RP2, the number of
cylinders is calculated based on the disk's raw geometry. If the calculated
number of cylinders exceeds the Solaris VTOC limit (65535), an unsigned integer
overflow causes a truncated cylinder count to be written in the CDS label.
After VxVM is upgraded to 5.1SP1RP1, the CDS label gets wrongly written in the
public region, leading to the data corruption.
RESOLUTION:
Code changes are made to suitably adjust the number of tracks and heads so that
the calculated number of cylinders stays within the Solaris VTOC limit.
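The truncation is ordinary unsigned wrap-around, as the following worked
example shows (illustrative only):

#include <stdio.h>

int main(void)
{
    unsigned int ncyl = 70000;   /* e.g. computed from the raw geometry */
    unsigned short label_ncyl = (unsigned short)ncyl;

    /* 70000 mod 65536 = 4464: the CDS label would record far fewer
     * cylinders than the disk actually has. The fix scales the track
     * and head counts so that ncyl itself never exceeds 65535. */
    printf("%u -> %u\n", ncyl, (unsigned)label_ncyl);  /* 70000 -> 4464 */
    return 0;
}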
* 2525333 (Tracking ID: 2148851)
SYMPTOM:
"vxdisk resize" operation fails on a disk with VxVM cdsdisk/simple/sliced layout
on Solaris/Linux platform with the following message:
VxVM vxdisk ERROR V-5-1-8643 Device emc_clariion0_30: resize failed: New
geometry makes partition unaligned
DESCRIPTION:
The new cylinder size selected during the "vxdisk resize" operation is not
aligned with the partitions that existed prior to the operation.
RESOLUTION:
The algorithm to select the new geometry has been redesigned such that the new
cylinder size is always aligned with the existing as well as new partitions.
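The alignment requirement can be expressed with a simple check like the
following (an illustrative sketch; the names are hypothetical):

#include <stddef.h>

/* A candidate cylinder size (in sectors) is acceptable only if every
 * existing partition boundary falls exactly on a cylinder boundary. */
static int geometry_is_aligned(unsigned long long cyl_size,
                               const unsigned long long *part_offsets,
                               size_t nparts)
{
    for (size_t i = 0; i < nparts; i++)
        if (part_offsets[i] % cyl_size != 0)
            return 0;
    return 1;
}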
* 2531983 (Tracking ID: 2483053)
SYMPTOM:
The VVR primary system consumes very high kernel heap memory and appears to be
hung.
DESCRIPTION:
There is a race between the REGION LOCK deletion thread, which runs as part of
the SLAVE leave reconfiguration, and the thread which processes the DATA_DONE
message coming from the log client to the logowner. Because of this race, the
flags which store the status information about the I/Os were not correctly
updated. This caused a lot of SIOs to be stuck in a queue, consuming a large
amount of kernel heap.
RESOLUTION:
Code changes are made to take the proper locks while updating the SIOs' fields.
* 2531987 (Tracking ID: 2510523)
SYMPTOM:
In a CVM-VVR configuration, I/Os on the "master" and "slave" nodes hang when
the "master" role is switched to the other node using the "vxclustadm
setmaster" command.
DESCRIPTION:
Under heavy I/O load, I/Os are sometimes throttled in VVR if the number of
outstanding I/Os on the SRL reaches a certain limit (2048 I/Os).
When the "master" role is switched to the other node by using the "vxclustadm
setmaster" command, the throttled I/Os on the original master are never
restarted. This causes the I/O hang.
RESOLUTION:
Code changes are made in VVR to make sure the throttled I/Os are restarted
before "master" switching is started.
* 2531993 (Tracking ID: 2524936)
SYMPTOM:
A disk group is disabled after rescanning disks with the "vxdctl enable"
command, with the console output below:
<timestamp> pp_claim_device: 0
<timestamp> Could not get metanode from ODM database
<timestamp> pp_claim_device: 0
<timestamp> Could not get metanode from ODM database
The error messages below are also seen in the vxconfigd debug log output:
<timestamp> VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/<disk>: The
process file table is full.
<timestamp> VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/<disk>: The
process file table is full.
...
<timestamp> VxVM vxconfigd ERROR V-5-1-12223 Error in claiming /dev/<disk>: The
process file table is full.
DESCRIPTION:
On AIX, when the total physical memory of the machine is greater than or equal
to 40GB and a multiple of 40GB (such as 80GB or 120GB), a limitation/bug in the
setulimit function causes an overflowed value to be set as the new limit/size
of the data area, which results in memory allocation failures in vxconfigd.
Creation of the shared memory segment also fails during this course. Error
handling for this case is missing in the vxconfigd code, resulting in errors
while claiming disks and in the offlining of configuration copies, which in
turn results in the disk group being disabled.
RESOLUTION:
Code changes are made to handle the failure case on shared memory segment
creation.
* 2552402 (Tracking ID: 2432006)
SYMPTOM:
The system intermittently hangs during boot if a disk is encapsulated.
When this problem occurs, the OS boot process stops after printing this
message:
"VxVM sysboot INFO V-5-2-3409 starting in boot mode..."
DESCRIPTION:
The boot process hangs due to a deadlock between two threads: a VxVM
transaction thread and a thread attempting a read on the root volume issued by
dhcpagent. The read I/O is deferred until the transaction finishes, but the
read count incremented earlier is not properly adjusted.
RESOLUTION:
The pending read count is now properly decremented when a read I/O is deferred.
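A minimal sketch of the fix (hypothetical names; the actual boot-time code
path is more complex):

struct volume {
    int pending_reads;   /* reads counted against the transaction */
    int txn_active;      /* a VxVM transaction is in progress     */
};

/* If the read must wait for the transaction, drop the count that was
 * bumped on entry, so the transaction does not wait on a read that is
 * itself waiting on the transaction. */
int start_read(struct volume *v)
{
    v->pending_reads++;
    if (v->txn_active) {
        v->pending_reads--;   /* undo the increment before deferring */
        return -1;            /* caller queues the read for later    */
    }
    return 0;                 /* proceed with the read               */
}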
* 2553391 (Tracking ID: 2536667)
SYMPTOM:
The system panics with the following stack trace:
[04DAD004]voldiodone+000C78 (F10000041116FA08)
[04D9AC88]volsp_iodone_common+000208 (F10000041116FA08,
0000000000000000,
0000000000000000)
[04B7A194]volsp_iodone+00001C (F10000041116FA08)
[000F3FDC]internal_iodone_offl+0000B0 (??, ??)
[000F3F04]iodone_offl+000068 ()
[000F20CC]i_softmod+0001F0 ()
[0017C570].finish_interrupt+000024 ()
DESCRIPTION:
The panic happens due to accessing a stale DG pointer, as the DG got deleted
before the I/O returned. It may happen in a cluster configuration where
commands generating private region I/Os and "vxdg deport/delete" commands
execute simultaneously on two nodes of the cluster.
RESOLUTION:
Code changes are made to drain private region I/Os before deleting the DG.
* 2563291 (Tracking ID: 2527289)
SYMPTOM:
In a Campus Cluster setup, a storage fault may lead to a DETACH of all the
configured sites. This also results in I/O failure on all the nodes in the
Campus Cluster.
DESCRIPTION:
Site detaches are done on site-consistent diskgroups when any volume in the
diskgroup loses all the mirrors of a site. While the DETACH of the last mirror
in a site is processed, that mirror is identified as the last one and the site
is DETACHed, which in turn detaches all the objects of that site.
In a Campus Cluster setup, a DCO volume is attached to any data volume created
on a site-consistent diskgroup. The general configuration is to have one DCO
mirror on each site. Loss of a single mirror of the DCO volume on any node will
result in the detach of that site.
In a two-site configuration, this particular scenario results in both the DCO
mirrors being lost simultaneously. While the site detach for the first mirror
is being processed, a DETACH of the second mirror is also signalled, which ends
up DETACHing the second site too.
This is not hit in other tests because a check already ensures that the last
mirror of a volume is not DETACHed. That check is subverted in this particular
case due to the type of storage failure.
RESOLUTION:
Before triggering a site detach, an explicit check is now made to verify that
the last ACTIVE site is not about to be DETACHed.
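The added guard can be sketched as follows (illustrative C; the functions are
hypothetical):

struct dg;                                       /* disk group (opaque here) */

int  active_site_count(const struct dg *dg);     /* hypothetical helper */
void detach_site(struct dg *dg, int site_id);    /* hypothetical helper */

/* Refuse to detach a site when it is the last ACTIVE one, so the
 * simultaneous loss of both sites' DCO mirrors cannot detach the whole
 * Campus Cluster. */
void maybe_detach_site(struct dg *dg, int site_id)
{
    if (active_site_count(dg) <= 1)
        return;               /* never detach the last active site */
    detach_site(dg, site_id);
}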
* 2574840 (Tracking ID: 2344186)
SYMPTOM:
In a master-slave configuration with FMR3/DCO volumes, a rebooted cluster node
fails to join the cluster again, with the following error messages on the
console:
[..]
Jul XX 18:44:09 vienna vxvm:vxconfigd: [ID 702911 daemon.error] V-5-1-11092
cleanup_client: (Volume recovery in progress) 230
Jul XX 18:44:09 vienna vxvm:vxconfigd: [ID 702911 daemon.error] V-5-1-11467
kernel_fail_join() : Reconfiguration interrupted: Reason is
retry to add a node failed (13, 0)
[..]
DESCRIPTION:
VxVM volumes with FMR3/DCO have a built-in DRL mechanism to track the disk
blocks of in-flight I/Os in order to recover the data much more quickly after a
node crash. A joining node therefore waits for the variable responsible for
recovery to be unset before it joins the cluster. However, due to a bug in the
FMR3/DCO code, this variable remained set forever, leading to the node join
failure.
RESOLUTION:
Modified the FMR3/DCO code to appropriately set and unset this recovery variable.
INSTALLING THE PATCH
--------------------
(rhel5 x86_64)
# rpm -Uhv VRTSvxvm-5.1.132.100-SP1RP2P1_RHEL5.x86_64.rpm
(rhel6 x86_64)
# rpm -Uhv VRTSvxvm-5.1.132.100-SP1RP2P1_RHEL6.x86_64.rpm
(sles10 x86_64)
# rpm -Uhv VRTSvxvm-5.1.132.100-SP1RP2P1_SLES10.x86_64.rpm
(sles11 x86_64)
# rpm -Uhv VRTSvxvm-5.1.132.100-SP1RP2P1_SLES11.x86_64.rpm
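On any of the platforms, the installation can be verified with a standard rpm
query; the installed package should report version 5.1.132.100:
# rpm -q VRTSvxvm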
REMOVING THE PATCH
------------------
# rpm -e <rpm-name>
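The exact installed package name can be listed with a standard rpm query:
# rpm -qa | grep VRTSvxvm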
SPECIAL INSTRUCTIONS
--------------------
NONE
OTHERS
------
NONE