vcs-aix-GAB-4.0MP4RP2_LLT-4.0MP4RP3

 Basic information
Release type: Rolling Patch
Release date: 2009-05-15
OS update support: None
Technote: 325284
Documentation: None
Download size: 1.3 MB
Checksum: 2792781308

 Applies to one or more of the following products:
Cluster Server 4.0MP4 On AIX 5.2
Cluster Server 4.0MP4 On AIX 5.3
Storage Foundation HA 4.0MP4 On AIX 5.2
Storage Foundation HA 4.0MP4 On AIX 5.3

 Obsolete patches, incompatibilities, superseded patches, or other requirements:

This patch supersedes the following patches: Release date
vcs-aix-4.0MP4+e1274390_llt (obsolete) 2008-07-01

 Fixes the following incidents:
1274390, 1424932, 1446223, 1531665, 1638724, 1638780, 1639268

 Patch ID:
VRTSgab.rte-04.00.0004.0200
VRTSllt.rte-04.00.0004.0300

Readme file
OS: AIX
OS Version: 5.2, 5.3
* * * PATCH 4.0MP4RP2 for GAB & 4.0MP4RP3 for LLT on VCS 4.0MP4 AIX * * *

                  Patch Date: May 15, 2009

   This README provides information on:

   * BEFORE GETTING STARTED
   * CRC AND BYTE COUNT
   * FIXES AND ENHANCEMENTS INCLUDED IN THESE PATCHES
   * DESCRIPTION OF TUNABLES RELEVANT FOR THESE PATCHES
   * PACKAGES AFFECTED BY THESE PATCHES
   * INSTALLING THE PATCHES IN VCS ENVIRONMENT
   * UNINSTALLING THE PATCHES IN VCS ENVIRONMENT
   * INSTALLING THE PATCHES IN SFRAC ENVIRONMENT
   * UNINSTALLING THE PATCHES IN SFRAC ENVIRONMENT 


BEFORE GETTING STARTED
----------------------
This patch only applies to VRTSllt and VRTSgab 4.0MP4 running on 
AIX 5.2 and AIX 5.3. Ensure that you are running one of the supported 
configurations before installing this patch.

Testing for these patches was done on AIX 5.3 TL5.

CRC AND BYTE COUNT 
-------------------
Ensure that each file you have downloaded matches the following checksum
and byte count. Use the cksum command to verify this:
# cksum VRTSllt.rte.bff
2603599614      2048000 VRTSllt.rte.bff
# cksum VRTSgab.rte.bff
3218290488      2508800 VRTSgab.rte.bff
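
The comparison above can also be scripted. The following is a minimal
sketch, not part of the patch: verify_cksum is a hypothetical helper name,
and it assumes cksum prints the checksum, the byte count, and the file name
in that order, as in the output shown above.

```shell
# verify_cksum FILE EXPECTED_CRC EXPECTED_BYTES
# Hypothetical helper: succeeds only if both the checksum and the byte
# count reported by cksum match the expected values.
verify_cksum() {
    vc_file=$1
    vc_out=$(cksum "$vc_file") || return 2
    vc_crc=$(echo "$vc_out" | awk '{ print $1 }')
    vc_bytes=$(echo "$vc_out" | awk '{ print $2 }')
    if [ "$vc_crc" = "$2" ] && [ "$vc_bytes" = "$3" ]; then
        echo "$vc_file: OK"
    else
        echo "$vc_file: MISMATCH (got $vc_crc $vc_bytes, expected $2 $3)"
        return 1
    fi
}
```

For these patches the calls would be:
  verify_cksum VRTSllt.rte.bff 2603599614 2048000
  verify_cksum VRTSgab.rte.bff 3218290488 2508800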

FIXES AND ENHANCEMENTS INCLUDED IN THESE PATCHES
------------------------------------------------
Etrack Incidents: 1424932, 1639268, 1531665, 1274390, 1446223, 1638780, 1638724
The current patches for LLT and GAB supersede the previous rolling patches 
for these modules.

Fixes from previous RPs released after 4.0MP4 for GAB:
------------------------------------------------------
e862507:	GAB_F_SEQBUSY flag is being set even when the sequence
		request is not sent out.

Fixes from previous RPs released after 4.0MP4 for LLT:
------------------------------------------------------
e1031511:	LLT: add heuristic to deal with one-way link situations

e1233409:	In a cluster setup, the Veritas Low Latency Transport (LLT)
		driver is used for communication. LLT communicates with the
		AIX DLPI driver to send and receive network packets on the
		physical network. Upcalls from the DLPI driver to LLT used
		to be always in process context.
		With recent changes in the AIX DLPI driver, calls to LLT now
		come in interrupt context. This causes a panic or hang in the
		LLT driver or in clients of LLT such as GAB. The patch makes
		LLT interrupt safe and ensures that calls to clients of LLT
		are made in process context. The known AIX APARs that change
		the behaviour of the DLPI driver and cause a panic in LLT or
		GAB are:

		5200-10 - AIX APAR IZ19838
		5300-06 - AIX APAR IZ05430
		5300-07 - AIX APAR IZ11726
		5300-08 - AIX APAR IZ09036

		To find out whether your system has an APAR that changes the
		DLPI behaviour, run the instfix command with the APAR number,
		or grep for the string "BRING DLPI DRIVER "TO SPEC"". For
		example:
		# instfix  -iv | grep "BRING DLPI DRIVER \"TO SPEC\""
		IZ11726 Abstract: BRING DLPI DRIVER "TO SPEC"

e1274390:	Multiple LLT clients registering ports with LLT can result in
		a deadlock due to a race condition. The fix resolves
		simultaneous port registration by multiple clients.

e1294686:	LLT-DLPI changes can cause a hang on a single-CPU machine when
		the thread holding a lock is swapped off the CPU and another
		thread spins on the CPU waiting for the same lock. The locking
		mechanism in LLT has been changed.

New fixes introduced in this RP for GAB:
---------------------------------------
e1424932:
	Symptoms:
		One or more cluster nodes may panic due to 
		a stale GAB_CONNECTS message.
	Resolution:
		Reduced the number of GAB_CONNECTS messages; the extra
		messages were not required and could result in a race condition.
e1639268:
	Symptoms:
		When high priority processes are running in the cluster,
		the GAB timer function may be delayed in responding,
		which is not acceptable.
	Resolution:
		On AIX, the GAB timer function, with a priority of 60, may
		fail to get scheduled as quickly as required, especially
		if there are higher priority processes in the system.
		The priority of the GAB timer is therefore now a tunable,
		"gab_timer_pri". The tunable can have a value in the range
		2 to 60; the default value is 17.
e1531665:
	Symptoms:
		Clients of the GAB service may not get cluster membership.
	Resolution:
		Symantec recommends that GAB be configured to provide 
		membership only after a minimum quorum number of nodes join 
		the cluster. If a client of GAB comes up before GAB port a 
		has formed membership on that node, this client may not get 
		cluster membership until it starts up on at least the 
		(configured) quorum number of nodes, even if port a 
		or any other GAB port receives cluster membership.

New fixes introduced in this RP for LLT:
---------------------------------------
e1274390:
	Symptoms:
		After a reboot, when LLT is loaded, the system gets hung.
	Resolution:
		When a port registers with LLT, LLT creates a thread for 
		that port. In this incident, two ports were getting registered 
		with LLT in parallel in their own contexts. Both these contexts 
		tried to grab the same lock, while a third context was holding 
		that lock and waiting to get scheduled. This caused a deadlock.
		Added additional synchronization in the LLT code to fix this issue.
e1446223:
	Symptoms:
		LLT detects a duplicate node id in the cluster, even if there 
		is no duplicate node id used in the cluster.
	Resolution:
		LLT marks some broadcast packets it sends out with a unique 
		id so that when such a broadcast packet is received back at 
		the original sender on any link, the sender can recognize that 
		it sent this packet itself. If this unique id is not 
		set in the broadcast packet, LLT can get confused when 
		that packet arrives back at the original sender, and can then 
		falsely consider that packet to have come from another node 
		in the cluster. Seeing that the node id in this packet is the 
		same as the node id of the local node, LLT falsely considers 
		this proof of a duplicate node id in the cluster.
		The fix sets the unique id in the broadcast LLT_ARP_REQ packets 
		that LLT sends out.
e1638780:
	Symptoms:
		While running some test workload, LLT causes a panic with 
		the following stack trace:
			pvthread+007F00 STACK:
			[0001AD40]abend_trap+000000 ()
			[0007E5B8]tstart+000558 (??)
			[00014F50].kernel_add_gate_cstack+000030 ()
			[F1000000A02A6044].llt_aix_timeout+0000E4 ()
			[F1000000A02AFD64].llt_timer_handler+000470 ()
			[F1000000A02A62F0].llt_timer_procfunc+0000A0 ()
			[00014D70].hkey_legacy_gate+00004C ()
			[001A2DD0]procentry+000010 (??, ??, ??, ??)
	Resolution:
		AIX has a race condition in the function tstart(): 
		if tstop() is not called before tstart(), the 
		machine can panic.
		Added a call to tstop() before the call to tstart() in 
		the appropriate places.
e1638724:
	Symptoms:
		LLT logs many of the following messages 
		in the error logs:
		"LLT INFO V-14-1-10035 timer not called for 610 ticks"
		As a result, LLT is unable to heartbeat in time 
		to the other nodes in the cluster, which causes GAB 
		to evict some nodes from the cluster.
	Resolution:
		LLT registers a timeout with the AIX OS to be 
		fired every 0.1 seconds. This timeout function just 
		wakes up a thread, which actually does all the timer 
		related activities in LLT. This thread (call it 
		the LLT timer thread) runs at a much lower priority 
		than the timer interrupt itself. Thus, even though 
		the timeout is called in time and this thread is woken
		up in time, the actual timer thread might not get 
		scheduled in acceptable time.
		The fix calls the functionality of the timer thread from 
		the timer interrupt context itself. Thus, the timer 
		related activities in LLT happen in interrupt 
		context and are not unacceptably delayed.


DESCRIPTION OF TUNABLES RELEVANT FOR THESE PATCHES
--------------------------------------------------
gab_timer_pri:
	A new tunable, "gab_timer_pri", for GAB is introduced in this patch.
	This tunable sets the priority of the GAB timer thread. The default 
	value is set to 17, but it can be tuned to anything 
	between 2 and 60 if required.
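
	Before changing the tunable, it may help to check that a candidate
	value lies in the documented range. A minimal sketch follows;
	valid_gab_timer_pri is a hypothetical helper name, not a command
	shipped with GAB, and it only validates the number.

```shell
# valid_gab_timer_pri VALUE
# Hypothetical helper: succeeds only for an integer in the 2-60 range
# that this README documents for gab_timer_pri.
valid_gab_timer_pri() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge 2 ] && [ "$1" -le 60 ]
}
```

	For example, "valid_gab_timer_pri 17" succeeds, while
	"valid_gab_timer_pri 61" fails.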

"VCS_GAB_TIMEOUT" environment value update (Recommended):
	It is recommended to increase the "VCS_GAB_TIMEOUT" value 
	from 15s to 30s. To do so, add the following 2 lines 
	in /opt/VRTSvcs/bin/vcsenv after VCS is stopped and
	before it is restarted after the patch installation:
	##
	VCS_GAB_TIMEOUT=30000
	export VCS_GAB_TIMEOUT
	##
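
	The edit above could be scripted roughly as follows. This is a
	sketch under assumptions: set_vcs_gab_timeout is a hypothetical
	helper, it takes the file path as an argument so it can be tried on
	a copy first, and it refuses to touch a file that already sets
	VCS_GAB_TIMEOUT.

```shell
# set_vcs_gab_timeout FILE [MILLISECONDS]
# Hypothetical helper: appends the two lines shown above to FILE unless
# FILE already contains a VCS_GAB_TIMEOUT assignment.
set_vcs_gab_timeout() {
    sg_file=$1
    sg_ms=${2:-30000}
    if grep -q '^VCS_GAB_TIMEOUT=' "$sg_file" 2>/dev/null; then
        echo "VCS_GAB_TIMEOUT is already set in $sg_file; edit it by hand" >&2
        return 1
    fi
    printf 'VCS_GAB_TIMEOUT=%s\nexport VCS_GAB_TIMEOUT\n' "$sg_ms" >> "$sg_file"
}
```

	With VCS stopped, the call would be:
	  set_vcs_gab_timeout /opt/VRTSvcs/bin/vcsenv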

LLT peerinact time update (Recommended):
	It is recommended to increase the value of peerinact for LLT.
	To do so, add the following line in /etc/llttab after
	LLT is stopped and before it is restarted after 
	the patch installation:
	##
	set-timer peerinact:3000
	##
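
	To check whether /etc/llttab already carries a peerinact setting
	before adding one, something like the following sketch may be
	useful. llttab_peerinact is a hypothetical helper; it assumes the
	"set-timer peerinact:<value>" line format shown above.

```shell
# llttab_peerinact FILE
# Hypothetical helper: prints the peerinact value configured in FILE.
# Fails if FILE has no "set-timer peerinact:<value>" line. If several
# such lines exist, the last one is reported.
llttab_peerinact() {
    awk '
        $1 == "set-timer" && $2 ~ /^peerinact:/ {
            v = $2
            sub(/^peerinact:/, "", v)
        }
        END {
            if (v == "") exit 1
            print v
        }
    ' "$1"
}
```

	For example, "llttab_peerinact /etc/llttab" prints 3000 once the
	line above has been added.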


PACKAGES AFFECTED BY THE PATCHES
---------------------------------
This patch brings the 
	VRTSllt.rte fileset to 4.0.4.300 level
and 
	VRTSgab.rte fileset to 4.0.4.200 level


INSTALLING THE PATCHES IN VCS ENVIRONMENT
-----------------------------------------
The following steps should be run on all nodes in the VCS cluster:

Stopping the cluster:
-------------------
1. Offline all applications, which are configured on CVM/CFS 
   and are outside VCS control.

    After all applications using CFS and CVM have been taken down,
    run 'slibclean' to unload the libraries from memory.

2. Stop VCS on the current node.
	# /opt/VRTSvcs/bin/hastop -local 
   Verify that ports 'f' (CFS), 'v' and 'w' (CVM), 'h' (VCS) have been closed,
	# /sbin/gabconfig -a
   The display should not have port 'f', 'v', 'w' and 'h' listed	

3. If VxFen is not configured, go to step 5.

4. Unconfigure VxFen:
	#  /sbin/vxfenconfig -U
   Verify that port 'b' has been closed
	# /sbin/gabconfig -a
   The display should not have port 'b' listed	

5. Unconfigure GAB:
	# /sbin/gabconfig -U

6. Unconfigure LLT:
	# /sbin/lltconfig -Uo

7. Unload the GAB driver:
	# /etc/methods/gabkext -stop

   Unload the LLT driver:
	# /usr/sbin/strload -ud /usr/lib/drivers/pse/llt

8. Verify that the LLT driver has been unloaded
	# /usr/sbin/strload -qd /usr/lib/drivers/pse/llt
	/usr/lib/drivers/pse/llt: no
   If llt is still loaded "yes" will show up in the output above.

   Verify that the GAB driver has been unloaded:
	# /etc/methods/gabkext -status
	gab: unloaded
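
The two status checks in step 8 can be combined in a script. The sketch
below is illustrative only: drivers_still_loaded is a hypothetical helper
that reads the status lines on standard input and assumes the exact output
formats shown in step 8.

```shell
# drivers_still_loaded
# Hypothetical helper: reads the output of
#   strload -qd /usr/lib/drivers/pse/llt
#   /etc/methods/gabkext -status
# on stdin and prints the name of each driver that is still loaded.
drivers_still_loaded() {
    awk '
        $1 == "/usr/lib/drivers/pse/llt:" && $2 == "yes" { print "llt" }
        $1 == "gab:" && $2 == "loaded" { print "gab" }
    '
}
```

A possible invocation:
  { /usr/sbin/strload -qd /usr/lib/drivers/pse/llt
    /etc/methods/gabkext -status; } | drivers_still_loaded
Empty output means that both drivers have been unloaded.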

NOTE: If you are unable to unload either the GAB or the LLT driver,
the server must be rebooted AFTER the installation of the patches 
so that the new drivers get loaded in the AIX kernel.
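
The port checks in steps 2 and 4 above can be scripted along the following
lines. This is a sketch under an assumption: it expects 'gabconfig -a' to
print one line per open port of the form
"Port <letter> gen <id> membership <nodes>". gab_open_ports and
ports_closed are hypothetical helper names.

```shell
# gab_open_ports
# Hypothetical helper: reads `gabconfig -a` output on stdin and prints
# the open port letters, space separated, each letter once.
gab_open_ports() {
    awk '
        $1 == "Port" && $3 == "gen" && !seen[$2]++ { out = out sep $2; sep = " " }
        END { print out }
    '
}

# ports_closed "f v w h"
# Reads `gabconfig -a` output on stdin; succeeds only if none of the
# listed ports is open.
ports_closed() {
    pc_open=" $(gab_open_ports) "
    for pc_port in $1; do
        case $pc_open in
            *" $pc_port "*)
                echo "port $pc_port is still open" >&2
                return 1 ;;
        esac
    done
    return 0
}
```

For step 2 the check would be:
  /sbin/gabconfig -a | ports_closed "f v w h"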

Installing Patch on all nodes:
-----------------------------
1. Add a new entry to the ODM database for the new GAB tunable.
   a. Create the data to be entered:
	# cat > gab_tunable.txt << _EOF_
PdAt:
        uniquetype = "gab/node/gab"
        attribute = "gab_timer_pri"
        deflt = "17"
        values = "2-60, 1"
        width = ""
        type = "R"
        generic = "DU"
        rep = "nr"
        nls_index = 0
_EOF_

   b. Add the new GAB tunable to ODM:
	# odmadd ./gab_tunable.txt

   c. Verify that the tunable has been added properly:
	# odmget -q  "attribute=gab_timer_pri" PdAt
    The output should match the text in gab_tunable.txt above.

2. Change directory to the patch location and install the LLT 
    and GAB patch from the bff files from the same location:
	# installp -a -d ./VRTSllt.rte.bff VRTSllt.rte
	# installp -a -d ./VRTSgab.rte.bff VRTSgab.rte

3. Verify that the new filesets have been installed:
	# lslpp -l VRTSllt.rte
VRTSllt.rte              4.0.4.300  APPLIED    VERITAS Low Latency Transport
                                               4.0MP4RP3
	# lslpp -l VRTSgab.rte
VRTSgab.rte              4.0.4.200  APPLIED    VERITAS Group Membership and
                                               Atomic Broadcast 4.0MP4RP2
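
The level and state shown in step 3 can be extracted in a script. A
minimal sketch follows, assuming the column layout of the 'lslpp -l'
output shown above (fileset name, level, state); fileset_status is a
hypothetical helper name.

```shell
# fileset_status FILESET
# Hypothetical helper: reads `lslpp -l` output on stdin and prints
# "<level> <state>" for the first line whose first column is FILESET.
# Fails if no such line is found.
fileset_status() {
    awk -v name="$1" '
        $1 == name { print $2, $3; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

For example:
  lslpp -l VRTSllt.rte | fileset_status VRTSllt.rte
should report level 4.0.4.300 in the APPLIED state after step 2.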

Re-starting the cluster:
-----------------------
1. Verify that the new LLT driver has been loaded:
	# strload -qd /usr/lib/drivers/pse/llt
	/usr/lib/drivers/pse/llt: yes
   Verify that the new GAB driver has been loaded:
	# /etc/methods/gabkext -status
	gab: loaded

2. If not already loaded, load the newly installed LLT driver:
	# strload -d /usr/lib/drivers/pse/llt
   If not already loaded, load the newly installed GAB driver:
	# /etc/methods/gabkext -start

3. Configure LLT:
	# /sbin/lltconfig -c

4. Verify that LLT has been configured properly
	# /sbin/lltconfig
	LLT is running

5. Configure GAB:
	# sh /etc/gabtab

6. Verify that the GAB membership shows up correctly:
	# /sbin/gabconfig -a
   The display should have Port 'a' listed

7. Configure VxFen (if VxFEN was configured previously)
	# /sbin/vxfenconfig -c
   Verify that vxfen has been configured
	# /sbin/gabconfig -a
   The output should list port 'b'

8. Start VCS:
	# /opt/VRTSvcs/bin/hastart
   Verify that VCS is up and running:
	# /sbin/gabconfig -a
   The display should show port 'f', 'v', 'w' and 'h' listed.
   The 'f', 'v' and 'w' port will be listed if CVM and CFS are configured.

9.  Start applications (stopped earlier), which are outside VCS control.
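
The membership checks in steps 6 to 8 can be scripted in the same spirit.
The sketch below assumes that 'gabconfig -a' prints one line per open port
of the form "Port <letter> gen <id> membership <nodes>"; missing_ports is
a hypothetical helper name.

```shell
# missing_ports "a b h"
# Hypothetical helper: reads `gabconfig -a` output on stdin and prints
# the listed ports that are NOT open, space separated. Prints an empty
# line when every listed port is open.
missing_ports() {
    awk -v want="$1" '
        $1 == "Port" && $3 == "gen" { open[$2] = 1 }
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++)
                if (!(w[i] in open)) { out = out sep w[i]; sep = " " }
            print out
        }
    '
}
```

For a VCS-only cluster with VxFen configured, the final check would be:
  /sbin/gabconfig -a | missing_ports "a b h"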

Committing the Patch:
---------------------
1.  To commit the patch:
(Note: The patch cannot be backed out once it is committed.)
	# installp -c VRTSllt.rte
	# installp -c VRTSgab.rte

2.  Verify that the fileset is committed:
	# lslpp -l VRTSllt.rte
VRTSllt.rte              4.0.4.300  COMMITTED    VERITAS Low Latency Transport
                                                 4.0MP4RP3
	# lslpp -l VRTSgab.rte
VRTSgab.rte              4.0.4.200  COMMITTED    VERITAS Group Membership and
                                                 Atomic Broadcast 4.0MP4RP2


UNINSTALLING THE PATCHES IN VCS ENVIRONMENT
-------------------------------------------
The VRTSllt.rte.bff and VRTSgab.rte.bff patches can ONLY
be backed out if they have not been committed.

NOTE: Before uninstalling the patches, make sure that no APAR that 
changes the DLPI behaviour is installed on the system by running 
the following command:
  # instfix  -iv | grep "BRING DLPI DRIVER \"TO SPEC\""

If this command returns an APAR, backing out these patches will 
move LLT to an older version, which will cause a panic or hang.
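
This check can be turned into a guard for a backout script. The sketch
below is illustrative: dlpi_apars and backout_is_safe are hypothetical
helper names, and the code assumes that 'instfix -iv' prints the APAR
number as the first word of the "Abstract:" line, as in the example
output earlier in this README.

```shell
# dlpi_apars
# Hypothetical helper: reads `instfix -iv` output on stdin and prints
# the APAR number of each line mentioning the DLPI "TO SPEC" change.
dlpi_apars() {
    awk '/BRING DLPI DRIVER "TO SPEC"/ { print $1 }'
}

# backout_is_safe
# Reads `instfix -iv` output on stdin; succeeds only if no such APAR
# is listed.
backout_is_safe() {
    [ -z "$(dlpi_apars)" ]
}
```

For example:
  instfix -iv | backout_is_safe || echo "DLPI APAR present, do not back out"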

Steps to Backout the Patch:
--------------------------
1. Follow the steps provided under "Stopping the cluster" section above, 
   to stop the cluster & unload the drivers.

2. Backout the patches by the following command:
	# installp -r VRTSllt.rte 4.0.4.300
	# installp -r VRTSgab.rte 4.0.4.200

3. Verify that the patch has been backed out:
(Note: The previously installed fileset(s) will be in committed state again.
       It may differ from the mentioned, if a Hotfix was installed on top 
       of VCS 4.0MP4)
	# lslpp -l VRTSllt.rte
VRTSllt.rte                4.0.4.0  COMMITTED    VERITAS Low Latency Transport

	# lslpp -l VRTSgab.rte
VRTSgab.rte                4.0.4.0  COMMITTED    VERITAS Group Membership and
                                                 Atomic Broadcast

4. Restart the cluster following the steps under the 
   "Re-starting the cluster" section above.

 Note: The old llt & gab drivers will now be in use.


INSTALLING THE PATCHES IN SFRAC ENVIRONMENT
-------------------------------------------
The following steps should be run on all nodes in the cluster,
with SFRAC stack installed:

1. Offline all applications, which are configured on CVM/CFS 
   and are outside VCS control.

2. If the Oracle database is not configured in VCS, stop it using the following command:
	$ srvctl stop instance -d <database name> -i <instance name>

3(a). For Oracle 9iR2, stop 'gsd' using the following command as the Oracle user:
   	$ gsdctl stop
   To check the status of gsdctl, run the following command:
	$ gsdctl stat
   The gsdctl command is typically found in $ORACLE_HOME/bin.

3(b). For Oracle 10gR1 and 10gR2, Stop CRS manually, 
      if CRS is not under VCS control.
         # /etc/init.crs stop

4.  After all the oracle instances and other applications using 
    CFS and CVM have been stopped, run 'slibclean' to unload 
    the libraries from memory.

5. Stop VCS on the current node.
	# /opt/VRTSvcs/bin/hastop -local 

6. Verify that ports 'h', 'f', 'v' and 'w' have been closed
	# /sbin/gabconfig -a
   The display should not have ports 'h', 'f', 'v' and 'w' listed

7. Unconfigure VCSMM:
	# /sbin/vcsmmconfig -U
   Verify that port 'o' has been closed
	# /sbin/gabconfig -a
   The display should not have port 'o' listed.
   If it does, ensure that the Oracle instances are offline.

8. Unconfigure LMX:
	# /sbin/lmxconfig -U

9. Unconfigure VxFen:
	# /sbin/vxfenconfig -U
    Verify that port 'b' has been closed
	# /sbin/gabconfig -a
    The display should not have port 'b' listed

10. Unmount ODM:
	# umount /dev/odm
    Verify that port 'd' has been closed
	# /sbin/gabconfig -a
    The display should not have port 'd' listed

11. At this point all gab ports except port 'a' should have been closed
    Verify this as follows:
	# /sbin/gabconfig -a

12. Follow steps 5 to 8 of "Stopping the cluster" section from 
    "INSTALLING THE PATCHES IN VCS ENVIRONMENT" chapter above.

13. Follow all the instructions in the "Installing Patch on all nodes" 
    section from "INSTALLING THE PATCHES IN VCS ENVIRONMENT" chapter above.

14. Follow steps 1 to 7 of "Re-starting the cluster" section from 
    "INSTALLING THE PATCHES IN VCS ENVIRONMENT" chapter above.

15. Configure LMX:
	# /sbin/lmxconfig -c

16. Configure VCSMM:
	# /sbin/vcsmmconfig -c
    Verify that VCSMM has been configured
	# /sbin/gabconfig -a
    The output should list port 'o'

17. Mount ODM:
	# mount /dev/odm

18. Start VCS:
	# /opt/VRTSvcs/bin/hastart

19. Check if all ports are now open
	# /sbin/gabconfig -a
    The output should list ports 
    'a', 'b', 'd', 'f', 'h', 'o', 'v', and 'w'.

20(a). For Oracle 10gR1 and 10gR2, start CRS manually,
       if CRS is not under VCS control.
         # /etc/init.crs start

20(b). For Oracle 9iR2, start 'gsd' using the following command as the Oracle user:
   	$ gsdctl start
   To check the status of gsdctl, run the following command:
	$ gsdctl stat
   The gsdctl command is typically found in $ORACLE_HOME/bin.

21. If the Oracle database is not configured in VCS, 
    start it using the following command:
	$ srvctl start instance -d <database name> -i <instance name>

22. Online all applications, which are configured on CVM/CFS 
    and are outside VCS control (stopped earlier).

23. To commit the patches follow "Committing the Patch" section from 
    "INSTALLING THE PATCHES IN VCS ENVIRONMENT" chapter above.

UNINSTALLING THE PATCHES IN SFRAC ENVIRONMENT
---------------------------------------------
The VRTSllt.rte.bff and VRTSgab.rte.bff patches can ONLY be 
backed out if they have not been committed.

NOTE: Before uninstalling the patches, make sure that no APAR 
that changes the DLPI behaviour is installed on the system by 
running the following command:
   # instfix  -iv | grep "BRING DLPI DRIVER \"TO SPEC\""

If this command returns an APAR, backing out these patches will 
move LLT to an older version, which will cause a panic or hang.

Steps to Backout the Patch:
1.  Follow steps 1 through 12 of chapter 
    "INSTALLING THE PATCHES IN SFRAC ENVIRONMENT"
    to stop the cluster and unload the drivers.

2. Backout the patches:
	# installp -r VRTSllt.rte 4.0.4.300
	# installp -r VRTSgab.rte 4.0.4.200

3. Verify that the patch has been backed out:
Note: The previously installed fileset(s) will be in committed state again.
      It may differ from the mentioned, if a Hotfix was installed on top 
      of VCS 4.0MP4.
	# lslpp -l VRTSllt.rte
VRTSllt.rte                4.0.4.0  COMMITTED    VERITAS Low Latency Transport

	# lslpp -l VRTSgab.rte
VRTSgab.rte                4.0.4.0  COMMITTED    VERITAS Group Membership and
                                                 Atomic Broadcast

4. As before, go through the process of loading and configuring
   LLT and GAB and bringing up SFRAC (steps 14 through 23 above of 
   chapter "INSTALLING THE PATCHES IN SFRAC ENVIRONMENT").

 Note: The old llt & gab drivers will now be in use.