fs-rhel6_x86_64-5.1SP1RP4P1

 Basic information
Release type: P-patch
Release date: 2014-05-08
OS update support: None
Technote: None
Documentation: None
Popularity: 6621 viewed
Download size: 8.87 MB
Checksum: 1370666749

 Applies to one or more of the following products:
VirtualStore 5.1SP1PR2 On RHEL6 x86-64
Storage Foundation 5.1SP1PR2 On RHEL6 x86-64
Storage Foundation Cluster File System 5.1SP1PR2 On RHEL6 x86-64
Storage Foundation Cluster File System for Oracle RAC 5.1SP1PR2 On RHEL6 x86-64
Storage Foundation for Oracle RAC 5.1SP1PR2 On RHEL6 x86-64
Storage Foundation HA 5.1SP1PR2 On RHEL6 x86-64

 Obsolete patches, incompatibilities, superseded patches, or other requirements:

This patch supersedes the following patches: Release date
fs-rhel6_x86_64-5.1SP1RP3P1HF1 (obsolete) 2013-06-04
fs-rhel6_x86_64-5.1SP1RP3P1 (obsolete) 2012-12-11
fs-rhel6_x86_64-5.1SP1RP2P2 (obsolete) 2012-06-28

This patch requires: Release date
sfha-rhel6_x86_64-5.1SP1PR3RP4 2013-08-21
sfha-rhel6_x86_64-5.1SP1PR2RP4 2013-08-21

 Fixes the following incidents:
2169326, 2243061, 2243063, 2243064, 2244612, 2247299, 2249658, 2255786, 2257904, 2275543, 2280386, 2280552, 2296277, 2311490, 2320044, 2320049, 2329887, 2329893, 2338010, 2340741, 2340794, 2340799, 2340808, 2340817, 2340825, 2340831, 2340834, 2340836, 2340839, 2341007, 2360817, 2360819, 2360821, 2368738, 2373565, 2386483, 2402643, 2412029, 2412169, 2412177, 2412179, 2412181, 2413811, 2418819, 2420060, 2425429, 2425439, 2426039, 2427269, 2427281, 2430679, 2478237, 2478325, 2480949, 2482337, 2482344, 2484815, 2486597, 2494464, 2508164, 2508171, 2521456, 2521514, 2521672, 2529356, 2561355, 2563929, 2564431, 2574396, 2578625, 2578631, 2578637, 2578643, 2581351, 2586283, 2587025, 2587035, 2603008, 2603015, 2607637, 2616395, 2616398, 2619910, 2619930, 2624650, 2627346, 2631026, 2631315, 2631390, 2635583, 2642027, 2669195, 2715030, 2725995, 2726010, 2726015, 2726018, 2726025, 2726031, 2726042, 2726056, 2752607, 2765308, 2768534, 2801689, 2821163, 2857465, 2932216, 3011828, 3023951, 3023964, 3023978, 3024020, 3024022, 3024028, 3024042, 3024049, 3024052, 3024088, 3096650, 3131795, 3131824, 3131829, 3135342, 3135346, 3138651, 3138653, 3138662, 3138663, 3138664, 3138668, 3138675, 3138695, 3141428, 3141433, 3141440, 3141445, 3142476, 3142575, 3142580, 3142583, 3153908, 3153928, 3153932, 3153934, 3153947, 3154174, 3158544, 3159607, 3164821, 3178899, 3204299, 3206363, 3207096, 3226404, 3233717, 3235517, 3246793, 3247280, 3248982, 3249151, 3261782, 3261849, 3261886, 3261892, 3262025, 3285687, 3317119, 3471169, 3471359

 Patch ID:
VRTSvxfs-5.1.134.100-SP1RP4P1_RHEL6

Readme file
                          * * * READ ME * * *
             * * * Veritas File System 5.1 SP1 RP4 P1 * * *
                          * * * Hot Fix * * *
                         Patch Date: 2014-05-08


This document provides the following information:

   * PATCH NAME
   * OPERATING SYSTEMS SUPPORTED BY THE PATCH
   * PACKAGES AFFECTED BY THE PATCH
   * BASE PRODUCT VERSIONS FOR THE PATCH
   * SUMMARY OF INCIDENTS FIXED BY THE PATCH
   * DETAILS OF INCIDENTS FIXED BY THE PATCH
   * INSTALLATION PRE-REQUISITES
   * INSTALLING THE PATCH
   * REMOVING THE PATCH


PATCH NAME
----------
Veritas File System 5.1 SP1 RP4 P1 Hot Fix


OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
RHEL6 x86-64


PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxfs


BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
   * Symantec VirtualStore 5.1 SP1 PR2
   * Veritas Storage Foundation 5.1 SP1 PR2
   * Veritas Storage Foundation Cluster File System 5.1 SP1 PR2
   * Veritas Storage Foundation Cluster File System for Oracle RAC 5.1 SP1 PR2
   * Veritas Storage Foundation for Oracle RAC 5.1 SP1 PR2
   * Veritas Storage Foundation HA 5.1 SP1 PR2


SUMMARY OF INCIDENTS FIXED BY THE PATCH
---------------------------------------
Patch ID: 5.1.134.100
* 3261849 (3253210) The file system hangs when it reaches the space limitation.
* 3317119 (3317116) Internal command conformance test for the mount command on RHEL6 Update4 hits a debug assert inside the vx_get_sb_impl() function.
* 3471169 (3396959) RHEL 6.4 system panics with Stack Overflow errors due to
memory pressure.
* 3471359 (3349651) VxFS modules fail to load on RHEL6.5
Patch ID: 5.1.134.000
* 2340808 (2319348) The umount(1M) thread of a file system hangs on systems with
a large amount of memory.
* 2726042 (2667658) The 'fscdsconv endian' conversion operation fails because of a macro overflow.
* 2768534 (2641438) After a system is restarted, the modifications that are performed on the
"user" namespace extended attributes are lost.
* 2801689 (2695390) Accessing a vnode from cbdnlc cache hits an assert during internal testing.
* 2857465 (2735912) The performance of tier relocation using the fsppadm(1M)
enforce command degrades while migrating a large number of files.
* 2932216 (2594774) The "vx_msgprint" assert is observed several times in the internal Cluster File 
System (CFS) testing.
* 3011828 (2963763) When the thin_friendly_alloc and delicache_enable parameters are enabled,
VxFS may enter a deadlock.
* 3023951 (2773383) The read and write operations on memory-mapped files are
unresponsive.
* 3023964 (2750860) Performance of the write operation with small request size
may degrade on a large file system.
* 3023978 (2806466) A reclaim operation on a file system that is mounted on a
Logical Volume Manager (LVM) may panic the system.
* 3024020 (2858683) Reserve extent attributes are changed after vxrestore(1M), but only for
files greater than 8192 bytes.
* 3024022 (2893551) The file attribute value is replaced with question mark symbols when the 
Network File System (NFS) connections experience a high load.
* 3024028 (2899907) On CFS, some file-system operations like vxcompress utility and de-duplication 
fail to respond.
* 3024042 (2923105) Removal of the VxFS module from the kernel takes a longer time.
* 3024049 (2926684) In rare cases, the system may panic while performing a logged write.
* 3024052 (2906018) The vx_iread errors are displayed after successful log replay and mount of the 
file system.
* 3024088 (3008451) In a Cluster File System (CFS) environment, shutting down the cluster may panic 
one of the nodes with a null pointer dereference.
* 3096650 (3081479) The Veritas File System (VxFS) module fails to load in the
RHEL 6 Update 4 environment.
* 3131795 (2912089) The system becomes unresponsive while growing a file through
vx_growfile in a fragmented file system.
* 3131824 (2966277) Systems with high file system activity like read/write/open/lookup may panic
the system.
* 3131829 (2991880) In low memory conditions on a VxFS, certain file system activities may seem
to be non-responsive.
* 3135342 (2439261) When the vx_fiostats_tunable value is changed from zero to
non-zero, the system panics.
* 3135346 (3073372) On larger CPU/memory configurations with the partition directory
feature enabled, operations such as find and ls may seem slower.
* 3138651 (2834192) You are unable to mount the file system after the full fsck(1M) utility is run.
* 3138653 (2972299) The initial and subsequent reads on a directory with many symbolic links are
very slow.
* 3138662 (3135145) The fsadm(1M) command may partially reclaim space when the request size is
greater than 2GB.
* 3138663 (2732427) A Cluster mounted file-system may hang and become unresponsive.
* 3138664 (2956195) mmap in the CFS environment takes a long time to complete.
* 3138668 (3121933) The pwrite(2) function fails with the EOPNOTSUPP error.
* 3138675 (2756779) The code is modified to improve the fix for the read and write performance
concerns on Cluster File System (CFS) when it runs applications that rely on
the POSIX file-record using the fcntl lock.
* 3138695 (3092114) The information output displayed by the "df -i" command may be inaccurate for 
cluster mounted file systems.
* 3141428 (2972183) The fsppadm(1M) enforce command takes a long time on the secondary nodes
compared to the primary nodes.
* 3141433 (2895743) Accessing named attributes for some files stored in CFS seems to be slow.
* 3141440 (2908391) It takes a long time to remove checkpoints from the VxFS file system, when there
are a large number of files present.
* 3141445 (3003679) When running the fsppadm(1M) command and removing a file with the named
stream attributes (nattr) at the same time, the file system does not respond.
* 3142476 (3072036) Read operations from secondary node in CFS can sometimes fail with the ENXIO 
error code.
* 3142575 (3089210) The message V-2-17: vx_iread_1 filesystem file system inode
inode number marked bad incore is displayed in the system log.
* 3142580 (3011959) The system may panic because of the file system locking or
unlocking using the fsadm(1M) or the vxumount(1M) command.
* 3142583 (3007063) Policy enforcement using fsppadm(1M) takes a long time to
complete.
* 3153908 (2564442) On a cluster mounted file system, an internal test fails the
assertion "f:vx_prefault_uio_readable:1".
* 3153928 (2703747) The Cluster File System (CFS) failover may take over 20 minutes to complete.
* 3153932 (2977697) A core dump is generated while you are removing the clone.
* 3153934 (2649367) When you open a file, a race condition may occur, leading to a system crash in
vx_fopen with a NULL pointer dereference.
* 3153947 (2370627) fsck(1M) dumps core while running internal tests.
* 3154174 (2822984) The extendfs(1M) command fails when it attempts to extend a file system that is
greater than 2 terabytes.
* 3158544 (2191039) Large memory allocations on Linux lead to performance issues.
* 3159607 (2779427) The full fsck flag is set after a failed inode read operation.
* 3164821 (3014791) Internal Cluster tests fail with "f:vx_msgprint:ndebug"
assertion failure.
* 3178899 (3206266) During an internal noise test, the "f:vx_purge_nattr:1"
assertion fails.
* 3204299 (2606294) Internal noise test dumps fsck(1M) core.
* 3206363 (3212625) The fsadm(1M) command fails with the assert "ASSERT(sz <=
MAXBUFSZ)".
* 3207096 (3192985) Checkpoints quota usage on Cluster File System (CFS) can be negative.
* 3226404 (3214816) With the DELICACHE feature enabled, frequent creation and deletion of the inodes
of a user may result in corruption of the user quota file.
* 3233717 (3224101) After you enable the optimization for updating the i_size across the cluster
nodes lazily, the system panics.
* 3235517 (3240635) In a CFS environment, when a checkpoint is mounted using the mount(1M)
command, the system may panic.
* 3246793 (2414266) The fallocate(2) system call fails on Veritas File System (VxFS) file systems in
the Linux environment.
* 3247280 (3257314) On systems installed with the SFORA/SFRAC stacks, when DBED operations like
the dbdst_obj_move(1M) command are run, the operation may fail.
* 3248982 (3272896) Internal stress test on the local mount hits a deadlock.
* 3249151 (3270357) The fsck(1M) command fails to clean the corrupt file system during the
internal 'noise' test.
* 3261782 (3240403) The fidtovp() system call may cause a panic in the vx_itryhold_locked() function.
* 3261886 (3046983) An invalid CFS node number in ".__fsppadm_fclextract" causes the DST policy
enforcement to fail.
* 3261892 (3260563) An I/O error occurs because the underlying disk or hardware failure is marked at incorrect offsets when the offsets are greater than 4GB.
* 3262025 (3259634) A Cluster File System having more than 4G blocks gets corrupted because the
blocks containing some file system metadata get eliminated.
Patch ID: 5.1.133.000
* 3285687 (3285688) Re-packaged/Rolled Up 5.1SP1RP3 patch into 5.1SP1RP4
Patch ID: 5.1.132.200
* 2340794 (2086902) The system crashes when a spinlock is held too long.
* 2715030 (2715028) The 'fsadm -d' command hangs during vx_dircompact.
* 2725995 (2566875) A write(2) operation exceeding the quota limit fails with an EDQUOT error.
* 2726010 (2651922) Performance degradation of 'll' and high SYS% CPU in vx_ireuse()
* 2726015 (2666249) File system hangs during Filestore operations
* 2726018 (2670022) Duplicate file names can be seen in a directory.
* 2726025 (2674639) VxFS returning error 61493 (VX_EFCLNOSPC) on CFS.
* 2726031 (2684573) enhancement request for force option of cfsumount command
* 2726056 (2709869) System panic with redzone violation when vx_free() tried to free fiostat
* 2752607 (2745357) The fix ensures that piggyback data of type VX_IOT_ATTR is not ignored in vx_populate_bpdata, to improve performance.
* 2765308 (2753944) VxFS hang in vx_pd_create
* 2821163 (2821152) LM-stress test hit an assert "f:vx_dio_physio:4, 1" via "vx_dio_rdwri" on SLES11_SP1.
Patch ID: 5.1.132.100
Patch ID: 5.1.132.000


DETAILS OF INCIDENTS FIXED BY THE PATCH
---------------------------------------
This patch fixes the following Symantec incidents:

Patch ID: 5.1.134.100

* 3261849 (Tracking ID: 3253210)

SYMPTOM:
When the file system reaches the space limitation, it hangs with the following 
stack trace:
vx_svar_sleep_unlock()
default_wake_function()
wake_up()
vx_event_wait()
vx_extentalloc_handoff()
vx_te_bmap_alloc()
vx_bmap_alloc_typed()
vx_bmap_alloc()
vx_bmap()
vx_exh_allocblk()
vx_exh_splitbucket()
vx_exh_split()
vx_dopreamble()
vx_rename_tran()
vx_pd_rename()

DESCRIPTION:
When large directory hash is enabled through the vx_dexh_sz (5M) tunable, 
Veritas File System (VxFS) uses the large directory hash for directories. When 
you rename a file, a new directory entry is inserted to the hash table, which 
results in hash split. The hash split fails the current transaction. The 
transaction is retried after completing some housekeeping jobs. These jobs 
include allocating more space for the hash table. However, VxFS does not check 
the return value of the preamble job. As a result, when VxFS runs out of space, 
the rename transaction is re-entered permanently without knowing if more space 
is allocated by preamble jobs.

RESOLUTION:
The code is modified to enable VxFS to exit looping when the ENOSPC error is 
returned from the preamble job.
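
The following userspace sketch illustrates the retry pattern described
above. All names are invented stand-ins for the rename transaction and its
preamble job; the point is that the preamble's return value is now honoured,
so an out-of-space condition ends the loop:

    /* A minimal sketch, with invented names, of the retry loop: an
     * ENOSPC from the housekeeping job now ends the loop instead of
     * re-entering the transaction forever. */
    #include <errno.h>
    #include <stdio.h>

    static int space_left = 3;   /* pretend free space, arbitrary units */

    /* Stand-in for the preamble job (e.g. growing the directory hash). */
    static int preamble_grow_hash(void)
    {
        return space_left-- > 0 ? 0 : ENOSPC;
    }

    /* Stand-in for the rename transaction; fails while space is short. */
    static int rename_transaction(void)
    {
        return space_left >= 0 ? -1 : 0;
    }

    int main(void)
    {
        for (;;) {
            if (rename_transaction() == 0)
                return 0;                      /* transaction committed */
            if (preamble_grow_hash() == ENOSPC) {
                fprintf(stderr, "rename: no space left on device\n");
                return 1;                      /* the fix: stop looping */
            }
        }
    }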

* 3317119 (Tracking ID: 3317116)

SYMPTOM:
Internal command conformance test for the mount command on RHEL6 Update4 hits a debug assert inside the vx_get_sb_impl() function.

DESCRIPTION:
In the RHEL6 Update4 kernel security update 2.6.32-358.18.1, Red Hat changed the flag used to save the mount status of a dentry from d_flags to d_mounted. This resulted in a debug assert in the vx_get_sb_impl() function, as d_flags was used to check the mount status of a dentry in RHEL6.

RESOLUTION:
The code is modified to use d_flags if the OS is RHEL6 Update2, and d_mounted otherwise, to determine the mount status of a dentry.

* 3471169 (Tracking ID: 3396959)

SYMPTOM:
On RHEL 6.4, with low free memory available, creating a file may panic the
system with the following stack trace. 

shrink_mem_cgroup_zone()
shrink_zone()
do_try_to_free_pages()
try_to_free_pages()
__alloc_pages_nodemask()
alloc_pages_current()
__get_free_pages()
vx_getpages()
vx_alloc()
vx_bc_getfreebufs()
vx_bc_getblk()
vx_getblk_bp()
vx_getblk_cmn()
vx_getblk()
vx_iread()
vx_local_iread()
vx_iget()
vx_ialloc()
vx_dirmakeinode()
vx_dircreate()
vx_dircreate_tran()
vx_pd_create()
vx_create1_pd()
vx_do_create()
vx_create1()
vx_create_vp()
vx_create()
vfs_create()
do_filp_open()
do_sys_open() 
sys_open()
system_call_fastpath()

DESCRIPTION:
VxFS estimates the the stack required to perform various kernel operations and
created hand-off threads if the stack usage is estimated to go above 
the allowed kernel limit. However, this estimate may go wrong when the system is
under heavy memory pressure, as some Linux kernel changes in RHEL6.4 increases
the depth of stack a lot. Since there might be additional functions 
called in the context of getpage to alleviate the situation.Thus leading to
increased stack usage.

RESOLUTION:
The code is modified to adjust the stack depth calculations so that stack
usage is correctly estimated under memory pressure conditions.

* 3471359 (Tracking ID: 3349651)

SYMPTOM:
VxFS modules fail to load on RHEL6.5 and following error messages are
reported in system log.
kernel: vxfs: disagrees about version of symbol putname
kernel: vxfs: disagrees about version of symbol getname

DESCRIPTION:
In RHEL6.5 the kernel interfaces for getname and putname used by
VxFS have changed.

RESOLUTION:
The code is modified to use the latest definitions of the getname and
putname kernel interfaces.

Patch ID: 5.1.134.000

* 2340808 (Tracking ID: 2319348)

SYMPTOM:
The umount(1M) thread of a file system takes a longer time on systems with a
large amount of memory. A kernel stack trace of a thread contending for this
lock appears as follows:

vx_put_pagelist
vx_invalidate_pages
vx_pvn_range_dirty
vx_putpage_dirty_wbc
vx_do_putpage
vx_freeze_iflush_list
vx_workitem_process
vx_worklist_process
vx_worklist_thread 
vx_kthread_init

DESCRIPTION:
A umount(1M) of a file system with a large amount of file data held in
memory (in pages) results in a burst of memory page releases. Most of the
pages are already clean when the umount(1M) is invoked. This burst of memory
page freeing creates contention on the 'zone->lock'.

RESOLUTION:
The code is modified to limit the calls to the memory page freeing interface.
This is done by keeping a per-CPU page vector. Instead of releasing all pages
during the umount operation, some pages are added to the current CPU's
page-vector. These pages are subsequently released by the background garbage
collection worker thread. This relieves the contention on the lock.
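
As a rough illustration of the page-vector idea (all names below are
invented, and a userspace mutex and malloc/free stand in for the zone lock
and the page allocator), pages are parked in a small vector and released in
batches, so the contended lock is taken once per batch rather than once per
page:

    #include <pthread.h>
    #include <stdlib.h>

    #define PAGEVEC_SIZE 16

    static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

    struct pagevec {
        void *pages[PAGEVEC_SIZE];
        int   nr;
    };

    /* Batched release: one lock round-trip for the whole vector. */
    static void pagevec_flush(struct pagevec *pv)
    {
        pthread_mutex_lock(&zone_lock);
        for (int i = 0; i < pv->nr; i++)
            free(pv->pages[i]);
        pthread_mutex_unlock(&zone_lock);
        pv->nr = 0;
    }

    /* Callers park pages here; a background worker (not shown) could
     * drain partially filled vectors later, as the fix describes. */
    static void page_release(struct pagevec *pv, void *page)
    {
        pv->pages[pv->nr++] = page;
        if (pv->nr == PAGEVEC_SIZE)
            pagevec_flush(pv);
    }

    int main(void)
    {
        struct pagevec pv = { .nr = 0 };
        for (int i = 0; i < 1000; i++)
            page_release(&pv, malloc(64));
        pagevec_flush(&pv);        /* drain the remainder */
        return 0;
    }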

* 2726042 (Tracking ID: 2667658)

SYMPTOM:
An attempt to perform an fscdsconv endian conversion from the SPARC big-endian
byte order to the x86 little-endian byte order fails because of a macro overflow.

DESCRIPTION:
Using the fscdsconv(1M) command to perform endian conversion from the SPARC
big-endian (any SPARC architecture machine) byte order to the x86
little-endian (any x86 architecture machine) byte order fails. The write
operation for the recovery file results in an overflow of the control-data
offset (a macro hard coded to 500MB).

RESOLUTION:
The code is modified to take an estimate of the control-data offset explicitly 
and dynamically while creating and writing the recovery file.
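
The overflow class involved can be reproduced in a few lines. The macro name
and sizes below are invented, but they show how a hard-coded 32-bit offset
plus a few GB of recovery data wraps negative, while the same computation
kept in a 64-bit type stays exact:

    #include <stdio.h>

    #define CTRL_DATA_OFF (500 * 1024 * 1024)    /* hard-coded 500MB */

    int main(void)
    {
        long long blocks = 3072;                 /* recovery-file blocks */
        long long bsize  = 1024 * 1024;          /* 1MB per block        */

        /* Exact, 64-bit: 500MB + 3GB = 3745513472. */
        long long full = CTRL_DATA_OFF + blocks * bsize;

        /* What a 32-bit variable (or macro arithmetic) keeps of it. */
        int truncated = (int)full;

        printf("64-bit offset: %lld\n", full);
        printf("32-bit offset: %d\n", truncated); /* negative: -549453824 */
        return 0;
    }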

* 2768534 (Tracking ID: 2641438)

SYMPTOM:
When a system is unexpectedly shut down, on restart you may lose the
modifications that were performed on the "user" namespace extended
attributes.

DESCRIPTION:
The modification of a "user" namespace extended attribute leads to an
asynchronous-write operation. As a result, these modifications are lost
during an unexpected system shutdown.

RESOLUTION:
The code is modified such that the modifications that are performed on the
"user" namespace extended attributes before the shutdown are made synchronous.

* 2801689 (Tracking ID: 2695390)

SYMPTOM:
A TED (debug) build hits the "f:vx_cbdnlc_lookup:3" assert during internal test runs.

DESCRIPTION:
If a vnode having a NULL file system pointer is added to the cbdnlc cache
and, during some later lookup, that vnode is returned, an assert is hit
during validation of the vnode.

RESOLUTION:
The code is modified to identify the place where an invalid vnode (whose file 
system pointer is not set) is added to the cache and to prevent it from being 
added to the cbdnlc cache.

* 2857465 (Tracking ID: 2735912)

SYMPTOM:
The performance of tier relocation for moving a large number of files is poor 
when the `fsppadm enforce' command is used.  When looking at the fsppadm(1M) 
command in the kernel, the following stack trace is observed:

vx_cfs_inofindau 
vx_findino
vx_ialloc
vx_reorg_ialloc
vx_reorg_isetup
vx_extmap_reorg
vx_reorg
vx_allocpolicy_enforce
vx_aioctl_allocpolicy
vx_aioctl_common
vx_ioctl
vx_compat_ioctl

DESCRIPTION:
When each file located in Tier 1 is relocated to Tier 2, Veritas File System
(VxFS) allocates a new reorg inode and all its extents in Tier 2. VxFS then
swaps the contents of the two files and deletes the original file. This new
inode allocation involves a lot of processing and can result in poor
performance when a large number of files are moved.

RESOLUTION:
The code is modified to maintain a reorg inode pool or cache instead of
allocating a new inode each time.

* 2932216 (Tracking ID: 2594774)

SYMPTOM:
The f:vx_msgprint:ndebug assert is observed several times in the internal 
Cluster File System (CFS) testing.

DESCRIPTION:
In case of CFS, the "no space left on device" (ENOSPC) error is observed
when the File Change Log (FCL) is enabled during the reorganization
operation. The secondary node requests the primary node to delegate
allocation units (AUs). The primary node may delegate an AU which has an
exclusion zone set, which returns the ENOSPC error, requiring a retry to
get another AU. Currently, the retry count for getting an AU after an
allocation failure is set to 3. This retry count can be increased.

RESOLUTION:
Code is modified to increase the number of retries when allocation fails 
because the exclusion zones are set on the delegated AU and when the CFS is 
frozen.

* 3011828 (Tracking ID: 2963763)

SYMPTOM:
When the thin_friendly_alloc and delicache_enable parameters are enabled, Veritas
File System (VxFS) may hit a deadlock. The thread involved in the deadlock
can have the following stack trace:

vx_rwsleep_lock()
vx_tflush_inode()
vx_fsq_flush()
vx_tranflush()
vx_traninit()
vx_remove_tran()
vx_pd_remove()
vx_remove1_pd()
vx_do_remove()
vx_remove1()
vx_remove_vp()
vx_remove()
vfs_unlink()
do_unlinkat

The threads waiting in vx_traninit() for transaction space, displays following 
stack trace:

vx_delay2() 
vx_traninit()
vx_idelxwri_done()
vx_idelxwri_flush()
vx_common_inactive_tran()
vx_inactive_tran()
vx_local_inactive_list()
vx_inactive_list+0x530()
vx_worklist_process()
vx_worklist_thread()

DESCRIPTION:
In the extent allocation code paths, VxFS sets the IEXTALLOC flag on the
inode without taking the ILOCK. An overlapping transaction can pick the same
inode off the delicache list, which makes the transaction-done code paths
miss the IUNLOCK call.

RESOLUTION:
The code is modified to change the corresponding code paths to set the 
IEXTALLOC flag under proper protection.
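
A minimal sketch of the fix pattern, using a pthread mutex in place of the
ILOCK and invented structure names: the flag update is published only under
the lock, so a concurrent path cannot observe a half-updated state:

    #include <pthread.h>

    #define IEXTALLOC 0x1

    struct inode_sketch {
        pthread_mutex_t ilock;
        unsigned int    flags;
    };

    static void set_extalloc(struct inode_sketch *ip)
    {
        /* Before the fix the flag was set without taking the lock,
         * letting an overlapping transaction pick the inode off the
         * delicache list and skip its unlock call. */
        pthread_mutex_lock(&ip->ilock);
        ip->flags |= IEXTALLOC;
        pthread_mutex_unlock(&ip->ilock);
    }

    int main(void)
    {
        struct inode_sketch ip = { PTHREAD_MUTEX_INITIALIZER, 0 };
        set_extalloc(&ip);
        return (ip.flags & IEXTALLOC) ? 0 : 1;
    }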

* 3023951 (Tracking ID: 2773383)

SYMPTOM:
A deadlock involving two threads is observed: one holds the mmap semaphore
and waits for the 'irwlock', while the other holds the 'irwlock' and waits
for the mmap semaphore. The following stack trace is displayed:
vx_rwlock 
vx_naio_do_work
vx_naio_worker 
vx_kthread_init 
kernel_thread.

DESCRIPTION:
The hang in 'down_read' occurs while waiting for the mmap_sem. The thread
holding the mmap_sem is waiting for the RWLOCK, which is held by one of the
threads wanting the mmap_sem; hence the deadlock. An earlier enhancement
avoided taking the mmap_sem for CIO and mmap, but it was incomplete and did
not cover native asynchronous I/O calls when the nommapcio option is used.

RESOLUTION:
The code is modified to skip taking the mmap_sem in the case of native
asynchronous I/O if the file has the CIO advisory set.

* 3023964 (Tracking ID: 2750860)

SYMPTOM:
Performance of the write operation with small request size may degrade
on a large Veritas File System (VxFS) file system. Many threads may be found
sleeping with the following stack trace:

vx_sleep_lock
vx_lockmap
vx_getemap
vx_extfind
vx_searchau_downlevel
vx_searchau_downlevel
vx_searchau_downlevel
vx_searchau_downlevel
vx_searchau_uplevel
vx_searchau+0x600
vx_extentalloc_device
vx_extentalloc
vx_te_bmap_alloc
vx_bmap_alloc_typed
vx_bmap_alloc
vx_write_alloc3
vx_recv_prealloc
vx_recv_rpc
vx_msg_recvreq
vx_msg_process_thread
kthread_daemon_startup

DESCRIPTION:
A VxFS allocation unit (AU) is composed of 32768 disk blocks. An AU can be
expanded when it is partially allocated, or non-expanded when it is fully
occupied or completely unused. The extent map for a large file system with a
1KB block size is organized as a big tree; for example, a 4-TB file system
with a 1KB block size can have up to 128K AUs. To find an appropriate
extent, the VxFS extent allocation algorithm first searches the expanded AUs
by traversing the free extent map tree, to avoid fragmenting free space. If
that fails, it does the same with the non-expanded AUs. When there are many
requests for small extents (less than 32768 blocks), all the small free
extents are used up, and a large number of AU-size extents (32768 blocks)
remain available, the file system can run into this hang: with no small
extents available in the expanded AUs, VxFS encounters only AU-size extents,
which are not what it wants in this pass (an expanded AU is expected). As a
result, each request walks the big extent map tree past every AU-size extent
and finally fails. The requested extent is eventually obtained during the
second attempt over the non-expanded AUs, but the unnecessary work consumes
a lot of CPU resources.

RESOLUTION:
The code is modified to optimize the free-extent-search algorithm by
skipping AU-size extents during the first pass, reducing the overall search
time.
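
A toy version of the search optimization (names and sizes invented): while
scanning for a small extent, entries of exactly one AU are skipped outright
instead of being walked and rejected:

    #include <stddef.h>
    #include <stdio.h>

    #define AU_SIZE 32768u   /* blocks per allocation unit */

    static unsigned int free_extents[] = { 32768, 32768, 512, 32768, 1024 };

    static int find_small_extent(unsigned int want)
    {
        for (size_t i = 0;
             i < sizeof(free_extents) / sizeof(free_extents[0]); i++) {
            if (free_extents[i] == AU_SIZE)
                continue;                /* fix: don't walk these at all */
            if (free_extents[i] >= want)
                return (int)i;
        }
        return -1;                       /* fall back to the AU pass     */
    }

    int main(void)
    {
        printf("extent slot: %d\n", find_small_extent(400));  /* 2 */
        return 0;
    }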

* 3023978 (Tracking ID: 2806466)

SYMPTOM:
A reclaim operation on a file system that is mounted on an LVM volume using the
fsadm(1M) command with the -R option may panic the system. And the following
stack trace is displayed:
vx_dev_strategy+0xc0() 
vx_dummy_fsvm_strategy+0x30() 
vx_ts_reclaim+0x2c0() 
vx_aioctl_common+0xfd0() 
vx_aioctl+0x2d0() 
vx_ioctl+0x180()

DESCRIPTION:
Thin reclamation is supported only for file systems mounted on a VxVM volume.

RESOLUTION:
The code is modified to return errors without panicking the system if the
underlying volume is LVM.

* 3024020 (Tracking ID: 2858683)

SYMPTOM:
The reserve-extent attributes are changed after the vxrestore(1M) operation,
for files that are greater than 8192 bytes.

DESCRIPTION:
A local variable that holds the number of reserved bytes is reused during
the vxrestore(1M) operation, before the subsequent VX_SETEXT ioctl call for
files that are greater than 8K. As a result, the attribute information is
changed.

RESOLUTION:
The code is modified to preserve the original variable value till the end of 
the function.

* 3024022 (Tracking ID: 2893551)

SYMPTOM:
When the Network File System (NFS) connections experience a high load, the
file attribute value is replaced with question mark symbols. This issue
occurs because the ls -l command receives an EACCES error once the cached
entries on the NFS server are deleted.

DESCRIPTION:
The Veritas File System (VxFS) uses capabilities, such as, CAP_CHOWN to 
override the default inode permissions and allow users to search for 
directories. VxFS allows users to perform the search operation even when 
the 'r' or 'x' bits are not set as permissions. 
When the nfsd file system uses these capabilities to perform a dentry reconnect 
to connect to the dentry tree, some of the Linux file systems use the 
inode_permission() function to verify if a user is authorized to perform the 
operation.

When a dentry reconnect is performed on behalf of disconnected dentries, the
nfsd file system enables all capabilities without setting the on-wire user
ID to fsuid. Hence, VxFS fails to honor these capabilities and reports an
error stating that the user does not have permissions on the directory.

RESOLUTION:
The code is modified to enable the vx_iaccess() function to check whether
Linux is processing the capabilities before returning EACCES. This
modification adds minimal capability support for the nfsd file system.

* 3024028 (Tracking ID: 2899907)

SYMPTOM:
Some file-system operations on a Cluster File System (CFS) may hang with the 
following stack trace. 
vxg_svar_sleep_unlock
vxg_grant_sleep
vxg_cmn_lock
vxg_api_lock
vx_glm_lock
vx_mdele_hold
vx_extfree1
vx_exttrunc
vx_trunc_ext4
vx_trunc_tran2
vx_trunc_tran
vx_cfs_trunc
vx_trunc
vx_inactive_remove
vx_inactive_tran
vx_cinactive_list
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init
kernel_thread

DESCRIPTION:
In CFS, a node can lock the mdelelock of one extent map while holding the
mdelelock of a different extent map. This can result in a deadlock between
different nodes in the cluster.

RESOLUTION:
The code is modified to prevent the deadlock between different nodes in the 
cluster.

* 3024042 (Tracking ID: 2923105)

SYMPTOM:
Removing the Veritas File System (VxFS) module using rmmod(8) on a system 
having heavy buffer cache usage may hang.

DESCRIPTION:
When a large number of buffers are allocated from the buffer cache, at the time 
of removing VxFS module, the process of freeing the buffers takes a long time.

RESOLUTION:
The code is modified to use an improved algorithm that does not keep
traversing the free lists once a free chunk is found; instead, it breaks out
of the search and frees that buffer.

* 3024049 (Tracking ID: 2926684)

SYMPTOM:
On systems with heavy transaction workloads, such as creation and deletion
of files, the system may panic with the following stack trace:
a|..
vxfs:vx_traninit+0x10
vxfs:vx_dircreate_tran+0x420
vxfs:vx_pd_create+0x980
vxfs:vx_create1_pd+0x1d0
vxfs:vx_do_create+0x80
vxfs:vx_create1+0xd4
vxfs:vx_create+0x158
a|..

DESCRIPTION:
In case of a delayed log, a transaction commit can complete before the log
write completes. The memory for the transaction is freed before the
transaction is logged, which corrupts the transaction freelist and causes
the system to panic.

RESOLUTION:
The code is modified such that the transaction is not freed until the log is
written.

* 3024052 (Tracking ID: 2906018)

SYMPTOM:
In the event of a system crash, the fsck intent log is not replayed and the
file system is marked clean. Subsequently, when the file system is mounted,
the extended operations are not completed.

DESCRIPTION:
Only file systems that contain PNOLTs and are mounted locally (mounted
without using 'mount -o cluster') are potentially exposed to this issue.

The reason why fsck silently skips the intent-log replay is that each PNOLT
has a flag to identify whether the intent log is dirty or not; in the event
of a system crash, this flag signifies whether intent-log replay is
required. When the file system is mounted locally, the PNOLTs are not
utilized. In the event of a system crash, the fsck intent-log replay will
still check the flags in the PNOLTs; however, these are the wrong flags to
check if the file system was locally mounted. The fsck intent-log replay
therefore assumes that the intent logs are clean (because the PNOLTs are not
marked dirty) and skips the replay of the intent log altogether.

RESOLUTION:
The code is modified such that when PNOLTs exist in the file system, VxFS will 
set the dirty flag in the CFS primary PNOLT while mounting locally. With this 
change, in the event of system crash whilst a file system is locally mounted, 
the subsequent fsck intent-log replay will correctly utilize the PNOLT 
structures and successfully replay the intent log.

* 3024088 (Tracking ID: 3008451)

SYMPTOM:
On a cluster mounted file system, the hastop -all command may panic some of
the nodes with the following stack trace:

vxfs:vx_is_fs_disabled_impl
amf:is_fs_disabled
amf:amf_ev_fsoff_verify
amf:amf_event_reg
amf:amfioctl
amf:amf_ioctl
specfs:spec_ioctl
genunix:fop_ioctl
genunix:ioctl

DESCRIPTION:
The vx_is_fs_disabled_impl function, which is called during the umount
operation (triggered by hastop -all), traverses the vx_fsext_list entries
one by one and returns true if the file system is disabled. While traversing
this list, it also accesses file systems that have the fse_zombie flag set,
which denotes that the file system is in an unstable state: some pointers
may be NULL and, when accessed, panic the machine with the above stack
trace.

RESOLUTION:
The code is modified to skip any fsext with the fse_zombie flag set, since
that flag implies the fsext is in an unstable state.

* 3096650 (Tracking ID: 3081479)

SYMPTOM:
The Veritas File System (VxFS) module fails to load in RHEL 6 Update 4
environments.

DESCRIPTION:
The module fails to load because of two kABI incompatibilities with
the RHEL 6 Update 4 environment.

RESOLUTION:
The code is modified to ensure that the VxFS module is supported in
RHEL 6 Update 4 environments.

* 3131795 (Tracking ID: 2912089)

SYMPTOM:
On a highly fragmented cluster mounted file system, a grow-file operation
may hang with the following stack traces:

T1: 
vx_event_wait+0001A8 
vx_async_waitmsg+000174
vx_msg_send+0006B0005BC
vx_cfs_pagealloc+00023C
vx_alloc_getpage+0002DC
vx_do_getpage+001618 
vx_mm_getpage+0000B4
vx_internal_alloc+00029C 
vx_write_alloc+00051C 
vx_write1+0014D4
vx_write_common_slow+000EB0
vx_write_common+000C34
vx_rdwr_attr+0002C4

T2:
vx_glm_lock+000120
vx_genglm_lock+0000B0
vx_iglock3+0004B4
vx_iglock2+0005E4
vx_iglock+00004C
vx_write1+000E70
vx_write_common_slow+000EB0
vx_write_common+000C34
vx_rdwr_attr+0002C4

DESCRIPTION:
While growing a file, a transaction is performed to allocate extents. CFS
allows only up to a maximum number of sub-transactions within a transaction.
When that limit is reached, CFS retries the operation. If the file system is
badly fragmented, CFS goes into an infinite loop because every retry crosses
the maximum sub-transaction limit.

RESOLUTION:
Code is modified to specify a maximum retry limit and abort the operation with 
ENOSPC error after the retry limit is reached.

* 3131824 (Tracking ID: 2966277)

SYMPTOM:
Systems with high file-system activity like read/write/open/lookup may panic 
with the following stack trace due to a rare race condition:
spinlock+0x21 ( )
 ->  vx_rwsleep_unlock()
 vx_ipunlock+0x40()
 vx_inactive_remove+0x530()
 vx_inactive_tran+0x450()
 vx_local_inactive_list+0x30()
 vx_inactive_list+0x420()
 ->  vx_workitem_process()
 ->  vx_worklist_process()
 vx_worklist_thread+0x2f0()
 kthread_daemon_startup+0x90()

DESCRIPTION:
The ILOCK is released before doing an IPUNLOCK, which causes a race
condition. This results in a panic when an inode that has been freed is
accessed.

RESOLUTION:
The code is modified so that the ILOCK is used to protect the inode's memory
from being freed while the memory is being accessed.

* 3131829 (Tracking ID: 2991880)

SYMPTOM:
In low memory conditions on a Veritas File System (VxFS), certain file
system activities may seem to be non-responsive, with the following stack
traces seen in the crash dumps:

vx_diput
d_kill
prune_one_dentry
__shrink_dcache_sb
prune_dcache
shrink_dcache_memory 
shrink_slab 
do_try_to_free_pages 
try_to_free_pages 
__alloc_pages_slowpath 
__alloc_pages_nodemask 
kmem_getpages 
fallback_alloc 
kmem_cache_alloc 
do_tune_cpucache
enable_cpucache 
kmem_cache_create

DESCRIPTION:
In low memory conditions, a cache create operation from the VxFS kernel may
trigger a cache shrink operation via vx_diput, because there is not enough
free memory to allocate. Both operations require the cache_chain_mutex lock,
resulting in a deadlock.

RESOLUTION:
The code is modified to remove the cache shrink operation from the vx_diput
context, thus avoiding the deadlock.

* 3135342 (Tracking ID: 2439261)

SYMPTOM:
When the vx_fiostats_tunable is changed from zero to non-zero, the
system panics with the following stack trace:
vx_fiostats_do_update
vx_fiostats_update
vx_read1
vx_rdwr
vno_rw
rwuio
pread

DESCRIPTION:
When vx_fiostats_tunable is changed from zero to non-zero, all the
incore-inode fiostats attributes are set to NULL. When these attributes are
accessed, the system panics due to the NULL pointer dereference.

RESOLUTION:
The code has been modified to check that the file I/O statistics attributes
are present before dereferencing the pointers.

* 3135346 (Tracking ID: 3073372)

SYMPTOM:
On larger CPU/memory configurations with the partition directory feature
enabled, operations such as find and ls may seem slower.

DESCRIPTION:
The contention happens when the maximum number of the partition-directory level
is set to 3 and the default partition-directory threshold (directory size beyond
which partition directories come into effect) is 32000.

RESOLUTION:
An enhancement is made to change the default maximum number of the
partition-directory level to 2 and the default partition-directory threshold to
32768.

* 3138651 (Tracking ID: 2834192)

SYMPTOM:
The mount operation fails after the full fsck(1M) utility is run, and the
following error message is displayed on the console:
'UX:vxfs mount.vxfs: ERROR: V-3-26881 : Cannot be mounted until it has been 
cleaned by fsck. Please run "fsck -t vxfs -y MNTPNT" before mounting'.

DESCRIPTION:
When a CFS is mounted, VxFS validates the in-core per-node-cut entries
(PNCUT) against their counterparts on the disk. If this validation fails,
the mount is unsuccessful and a full fsck is required. Full fsck, in its
fourth pass, checks the free inode/extent maps, merges the dirty in-core
PNCUT files, and validates them against the corresponding on-disk values.
However, if any PNCUT entry is corrupted, the fsck(1M) utility simply
ignores it. This results in the mount failure.

RESOLUTION:
The code is modified to enhance the fsck(1M) utility to handle any delinquent 
PNCUT entries and rebuild them as required.

* 3138653 (Tracking ID: 2972299)

SYMPTOM:
An open(O_CREAT) operation can take up to 0.5 seconds to complete. A high
value of the vxi_bc_reuse counter is also seen in the vxfsstat data.

DESCRIPTION:
After the directory blocks are cached, they are expected to remain in the
cache until they are evicted. The buffer-cache reuse code uses the "lbolt"
value to determine the age of a buffer, and all buffers older than a
particular threshold are reused. Mixed signed-unsigned arithmetic introduces
errors into the buffer-reuse calculation, causing the buffers to be reused
every time. Hence, subsequent reads take longer than expected.

RESOLUTION:
The code is modified so that the variables which store time are correctly 
declared as signed int.
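
The defect class is easy to demonstrate. In this hypothetical fragment
(variable names invented, and the exact trigger in VxFS may differ), mixing
a signed clock value with an unsigned timestamp makes a brand-new buffer
appear older than any threshold, so it would be reused on every pass:

    #include <stdio.h>

    #define REUSE_AGE 100

    int main(void)
    {
        unsigned long birth = 1000;  /* buffer stamped at tick 1000     */
        long          lbolt = 900;   /* clock read slightly behind the
                                      * stamp (e.g. by another CPU)     */

        /* Mixed arithmetic: lbolt converts to unsigned, 900 - 1000
         * wraps to a huge value, and the age test always passes. */
        if (lbolt - birth > REUSE_AGE)
            printf("buffer reused (wrong: it is brand new)\n");

        /* Signed arithmetic keeps the difference negative, as intended. */
        if ((long)lbolt - (long)birth > REUSE_AGE)
            printf("never printed\n");
        return 0;
    }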

* 3138662 (Tracking ID: 3135145)

SYMPTOM:
The fsadm(1M) command may partially reclaim space when the request size is
greater than 2GB.

DESCRIPTION:
The reclamation request is passed to the underlying volume via the Veritas
Volume Manager (VxVM) layer. The default max buffer size for Volume Manager (VM)
request is 32-bit. Hence the reclaim request was truncated in the VM layer
resulting in a partial reclamation.

RESOLUTION:
The code is modified to pass requests which are greater than 2GB to the VM layer
via a special hint, to process such request appropriately.
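
A small demonstration of the truncation described (field names invented): a
5GB reclaim length squeezed through a 32-bit field arrives at the volume
layer as 1GB, which is exactly the partial-reclaim behaviour reported:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t request = 5ULL << 30;         /* 5GB reclaim request  */
        int32_t  vm_len  = (int32_t)request;   /* legacy 32-bit field  */

        printf("requested: %llu bytes\n", (unsigned long long)request);
        printf("passed on: %d bytes\n", (int)vm_len); /* 1GB: top bits
                                                       * are lost      */
        return 0;
    }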

* 3138663 (Tracking ID: 2732427)

SYMPTOM:
The system hangs with the following stacks:

T1:
_spin_lock_irqsave
vx_bc_do_brelse
vx_bc_biodone
vx_inode_iodone
vx_end_io_bp
vx_end_io
blk_update_request
blk_update_bidi_request
__blk_end_request_all
vxvm_end_request
volkiodone
volsiodone
vol_subdisksio_done
volkcontext_process
voldiskiodone
getnstimeofday
voldiskiodone_intr
gendmpiodone
blk_update_request
blk_update_bidi_request
blk_end_bidi_request
scsi_end_request
scsi_io_completion
blk_done_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
--- <IRQ stack> ---
ret_from_intr
    [exception RIP: vxg_api_deinitlock+147]
vx_glm_deinitlock
vx_cbuf_free_list
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init
kernel_thread

T2:
_spin_lock
vx_cbuf_rele
vx_bc_getblk
vx_getblk_bp
vx_getblk_clust
vx_getblk_cmn
find_busiest_group
vx_getblk
vx_iupdat_local
vx_cfs_iupdat
vx_iflush_list
vx_iflush
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init

T3:
_spin_lock_irqsave
vxvm_end_request
volkiodone
volsiodone
vol_subdisksio_done
volkcontext_process
voldiskiodone
getnstimeofday
voldiskiodone_intr
gendmpiodone
blk_update_request
blk_update_bidi_request
blk_end_bidi_request
scsi_end_request
scsi_io_completion
blk_done_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
--- <IRQ stack> ---
ret_from_intr
    [exception RIP: _spin_lock+9]
vx_cbuf_lookup
vx_getblk_clust
vx_getblk_cmn
find_busiest_group
vx_cfs_iupdat
vx_iflush_list
vx_iflush
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init
kernel_thread

DESCRIPTION:
There are three locks which constitutes the dead lock. They include a volume 
manager lock (L1), a buffer list lock (L2), and cluster buffer list lock (L3). 
T1, which tries to release a buffer for I/O completion, holds a volume manager 
spin lock (L1) and waits for a buffer free list lock (L2). T2 is the owner of 
L2. The T2 is chasing a cluster buffer lock (L3) to release its affiliated 
cluster buffer. When T3 tries to obtain the L3 an unexpected disk interrupt 
happens, which processes an iodone job. As a result, T3 in volume manager layer 
is stuck by the volume manager lock L1, which causes a deadlock.

RESOLUTION:
The code is modified so that in vx_bc_getblk, the buffer list lock is
dropped before acquiring the cluster buffer list lock.
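
Sketched with pthread mutexes and invented lock names, the ordering fix
looks like this: the buffer free-list lock (L2) is released before the
cluster-buffer list lock (L3) is requested, so no thread ever holds L2 while
waiting on L3:

    #include <pthread.h>

    static pthread_mutex_t buf_list_lock  =
        PTHREAD_MUTEX_INITIALIZER;                  /* L2 */
    static pthread_mutex_t cbuf_list_lock =
        PTHREAD_MUTEX_INITIALIZER;                  /* L3 */

    static void getblk_release_cbuf(void)
    {
        pthread_mutex_lock(&buf_list_lock);
        /* ... pick a victim buffer off the free list ... */
        pthread_mutex_unlock(&buf_list_lock);       /* fix: drop L2 first */

        pthread_mutex_lock(&cbuf_list_lock);
        /* ... release the affiliated cluster buffer ... */
        pthread_mutex_unlock(&cbuf_list_lock);
    }

    int main(void)
    {
        getblk_release_cbuf();
        return 0;
    }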

* 3138664 (Tracking ID: 2956195)

SYMPTOM:
mmap in CFS environment takes a long time to complete.

DESCRIPTION:
During an mmap operation, the read-write operations across the nodes in CFS
invalidate the cache if the cached file is accessed from a different node.
This cache invalidation slows down the mmap operation.

RESOLUTION:
The code is modified to add a new module parameter
"vx_cfs_mmap_perf_tune" for Linux kernel version beyond 2.6.27. If the value is
set to 1, the mmap operation over CFS runs by an optimized path.

* 3138668 (Tracking ID: 3121933)

SYMPTOM:
The pwrite()  function fails with EOPNOTSUPP when the write range is in two 
indirect extents.

DESCRIPTION:
When the range of pwrite() falls in two indirect extents (one ZFOD extent
belonging to DB2 pre-allocated files created with the setext( , VX_GROWFILE, )
ioctl, and another DATA extent belonging to an adjacent INDIR), the write
fails with EOPNOTSUPP. The reason is that VxFS tries to coalesce extents
belonging to different indirect address extents as part of this transaction;
such a metadata change consumes more transaction resources than the VxFS
transaction engine can support in the current implementation.

RESOLUTION:
The code is modified to retry the transaction without coalescing the
extents, as the latter is an optimization and should not cause the write to
fail.

* 3138675 (Tracking ID: 2756779)

SYMPTOM:
Write and read performance concerns on Cluster File System (CFS) when running 
applications that rely on POSIX file-record locking (fcntl).

DESCRIPTION:
The usage of fcntl on CFS leads to high messaging traffic across nodes thereby 
reducing the performance of readers and writers.

RESOLUTION:
The code is modified to cache the ranges that are being file-record locked on 
the node. This is tried whenever possible to avoid broadcasting of messages 
across the nodes in the cluster.

* 3138695 (Tracking ID: 3092114)

SYMPTOM:
The information output by the "df -i" command can often be inaccurate for 
cluster mounted file systems.

DESCRIPTION:
In the Cluster File System 5.0 release, the concept of delegating metadata
to nodes in the cluster was introduced. This delegation allows CFS secondary
nodes to update metadata without having to ask the CFS primary to do it,
which provides greater node scalability.
However, the "df -i" information is still collected by the CFS primary 
regardless of which node (primary or secondary) the "df -i" command is executed 
on.

For inodes the granularity of each delegation is an Inode Allocation Unit 
[IAU], thus IAUs can be delegated to nodes in the cluster.
When using a VxFS 1Kb file system block size each IAU will represent 8192 
inodes.
When using a VxFS 2Kb file system block size each IAU will represent 16384 
inodes.
When using a VxFS 4Kb file system block size each IAU will represent 32768 
inodes.
When using a VxFS 8Kb file system block size each IAU will represent 65536 
inodes.
Each IAU contains a bitmap that records whether each inode it represents is
allocated or free; the IAU also contains a summary count of the number of
inodes that are currently free in the IAU.
The ""df -i" information can be considered as a simple sum of all the IAU 
summary counts.
Using a 1Kb block size IAU-0 will represent inodes numbers      0 -  8191
Using a 1Kb block size IAU-1 will represent inodes numbers   8192 - 16383
Using a 1Kb block size IAU-2 will represent inodes numbers  16384 - 32768
etc.
The inaccurate "df -i" count occurs because the CFS primary has no visibility 
of the current IAU summary information for IAU that are delegated to Secondary 
nodes.
Therefore the number of allocated inodes within an IAU that is currently 
delegated to a CFS Secondary node is not known to the CFS Primary.  As a 
result, the "df -i" count information for the currently delegated IAUs is 
collected from the Primary's copy of the IAU summaries. Since the Primary's 
copy of the IAU is stale, therefore the "df -i" count is only accurate when no 
IAUs are currently delegated to CFS secondary nodes.
In other words - the IAUs currently delegated to CFS secondary nodes will cause 
the "df -i" count to be inaccurate.
Once an IAU is delegated to a node, it can "timeout" after 3 minutes of
inactivity. However, not all IAU delegations time out: one IAU always
remains delegated to each node for performance reasons, and an IAU whose
inodes are all allocated (so that no free inodes remain in the IAU) does not
time out either.
The issue can be best summarized as:
The more IAUs that remain delegated to CFS secondary nodes, the greater the 
inaccuracy of the "df -i" count.

RESOLUTION:
The code is modified to allow the delegations of IAUs whose inodes are all
allocated (so no free inodes remain in the IAU) to "timeout" after 3 minutes
of inactivity.
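
The block-size table above reduces to a single relationship (8192 inodes per
IAU at a 1KB block size, doubling with the block size, i.e. 8 inodes per
byte of block size); this small check reproduces the listed values:

    #include <stdio.h>

    int main(void)
    {
        /* 1Kb -> 8192, 2Kb -> 16384, 4Kb -> 32768, 8Kb -> 65536 */
        for (int bsize = 1024; bsize <= 8192; bsize *= 2)
            printf("%dKb block size: %d inodes per IAU\n",
                   bsize / 1024, bsize * 8);
        return 0;
    }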

* 3141428 (Tracking ID: 2972183)

SYMPTOM:
"fsppadm enforce"  takes longer than usual time force update the secondary 
nodes than it takes to force update the primary nodes.

DESCRIPTION:
The ilist is force updated on the secondary nodes. As a result, performance
on the secondary nodes is low.

RESOLUTION:
The code is modified to force update the ilist file on secondary nodes only
on an error condition.

* 3141433 (Tracking ID: 2895743)

SYMPTOM:
It takes longer than usual for many Windows 7 clients to log off in parallel
if the user profile is stored in a Cluster File System (CFS).

DESCRIPTION:
Veritas File System (VxFS) keeps file creation time and full ACL information
for Samba clients in extended attributes, which are implemented via named
streams. VxFS reads the named stream for each of the ACL objects. Reading a
named stream is a costly operation, as it results in an open, an opendir, a
lookup, and another open to get the file descriptor. The VxFS function
vx_nattr_open() holds the exclusive rwlock to read an ACL object stored as
an extended attribute. This may cause heavy lock contention when many
threads want the same lock: they are blocked until one of the nattr_open
calls releases the lock, which takes time since nattr_open is very slow.

RESOLUTION:
The code is modified so that it takes the rwlock in shared mode instead of
exclusive mode.

* 3141440 (Tracking ID: 2908391)

SYMPTOM:
Checkpoint removal takes too long if Veritas File System (VxFS) has a large 
number of files. The cfsumount(1M) command could hang if removal of multiple 
checkpoints is in progress for such a file system.

DESCRIPTION:
When removing a checkpoint, VxFS traverses every inode to determine whether
a pull/push is needed for the upstream/downstream checkpoint in its chain.
This is time consuming if the file system has a large number of files, and
results in slow checkpoint removal.

The command "cfsumount -c fsname" forces the umounts operation on a VxFS file 
system if there is any asynchronous checkpoint removal job in progress by 
checking if the value of vxfs stat "vxi_clonerm_jobs" is larger than zero. 
However, the stat does not count in the jobs in the checkpoint removal working 
queue and the jobs are entered into the working queue.  The "force umount" 
operation does not happen even if there are pending checkpoint removal jobs 
because of the incorrect value of "vxi_clonerm_jobs" (zero).

RESOLUTION:
For the slow checkpoint removal issue:
The code is modified to create multiple threads that work on different Inode
Allocation Units (IAUs) in parallel, and to reduce the inode push work by
sorting the checkpoint removal jobs by creation time in ascending order and
enlarging the checkpoint push size.

For the cfsumount(1M) command hang issue:
The code is modified to count the jobs in the work queue in the
"vxi_clonerm_jobs" stat.

* 3141445 (Tracking ID: 3003679)

SYMPTOM:
The file system hangs when the fsppadm(1M) command runs while a file with
named stream attributes (nattr) is being removed at the same time. The
following two typical threads are involved:

T1:
COMMAND: "fsppadm"
schedule at
 vxg_svar_sleep_unlock
vxg_grant_sleep
 vxg_cmn_lock
 vxg_api_lock
 vx_glm_lock
 vx_ihlock
 vx_cfs_iread
 vx_iget
 vx_traverse_tree
vx_dir_lookup
vx_rev_namelookup
vx_aioctl_common
vx_ioctl
vx_compat_ioctl
compat_sys_ioctl
T2:
COMMAND: "vx_worklist_thr"
 schedule
 vxg_svar_sleep_unlock
 vxg_grant_sleep
 vxg_cmn_lock
 vxg_api_lock
 vx_glm_lock
 vx_genglm_lock
 vx_dirlock
 vx_do_remove
 vx_purge_nattr
vx_nattr_dirremove
vx_inactive_tran
vx_cfs_inactive_list
vx_inactive_list
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init
kernel_thread

DESCRIPTION:
The file system hangs due to a deadlock between the threads. T1, initiated
by fsppadm, calls vx_traverse_tree to obtain the path name for a given inode
number. T2 removes the inode as well as its affiliated nattr inodes.
The reverse name lookup (T1) holds the global dirlock in vx_dir_lookup
during the lookup process. It traverses the entire path from bottom to top
to resolve the inode number inversely in vx_traverse_tree. During the
lookup, VxFS needs to hold the hlock of each inode to read it, and drops the
lock after reading.
The file removal (T2) is processed via vx_inactive_tran, which takes the
hlock of the inode being removed. After that, it removes all of the inode's
named attribute inodes in vx_do_remove, where sometimes the global dirlock
is needed. Eventually, each thread waits for the lock held by the other,
resulting in the deadlock.

RESOLUTION:
The code is modified so that the dirlock is not acquired during the reverse
name lookup.

* 3142476 (Tracking ID: 3072036)

SYMPTOM:
Reads from secondary node in CFS can sometimes fail with ENXIO (No such device 
or address).

DESCRIPTION:
The incore attribute ilist on secondary node is out of sync with that of the 
primary.

RESOLUTION:
The code is modified so that the incore attribute ilist on the secondary
node is kept in sync with that of the primary.

* 3142575 (Tracking ID: 3089210)

SYMPTOM:
After vx_maxlink (the number of sub-directories in a directory, or the hard
link count) is increased from 32K to 64K, you may receive the following
error message:

vxfs: msgcnt 2545 mesg 017: V-2-17: vx_iread_1 - <filesystem> file system inode
<inode number> marked bad incore

A full fsck operation on such a file system may move such directories to /lost+found.

DESCRIPTION:
When the vx_maxlink value is increased from 32K to 64K, the new limit is
saved in a .conf file. However, the upgrade and reinstall operations do not
read this file; instead, the module initialization scripts roll the limit
back to the default 32K, which leads to the mentioned message. The file
system check utility (fsck) also still verifies file system consistency
using the outdated maximum limit of 32K, and moves all directories that are
larger than 32K to the /lost+found directory.

RESOLUTION:
The code is modified to read the configuration file during the upgrade and
re-install operations. And fsck is modified to check against the new limit.

* 3142580 (Tracking ID: 3011959)

SYMPTOM:
The system may panic because of the file system locking or unlocking using the
fsadm(1M) or vxumount(1M) command with the following stack trace:
vx_show_options 
show_vfsmnt 
seq_read 
vfs_read
sys_read
system_call_fastpath

DESCRIPTION:
The file system locking and unlocking operations, and reads of the
mount-lock key, are not serialized with a lock. As a result, when one user
resets the file system lock, other users reading the locking or unlocking
state can access an entry that has already been freed. This conflict causes
a NULL pointer dereference, resulting in the panic mentioned above.

RESOLUTION:
The code is modified to serialize file system locking operations with the
CLONEOPS lock.

* 3142583 (Tracking ID: 3007063)

SYMPTOM:
Policy enforcement using fsppadm(1M) takes a long time to complete.

DESCRIPTION:
The fsppadm(1M) command moves data from one volume to another in a Dynamic
Storage Tiering (DST) configuration. The data move uses only an 8K buffer,
causing slow transfers.

RESOLUTION:
The code is modified to use a 64K buffer to move data during DST policy
enforcement.
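
The change amounts to raising the transfer size in the copy loop. In this
illustrative sketch (MOVE_BUFSZ and move_data are invented names, not the
fsppadm internals), going from 8KB to 64KB cuts the read/write system calls
per file by a factor of eight:

    #include <stdlib.h>
    #include <unistd.h>

    #define MOVE_BUFSZ (64 * 1024)    /* was 8 * 1024 before the fix */

    /* Copy between two open descriptors, MOVE_BUFSZ bytes at a time. */
    static int move_data(int src, int dst)
    {
        char   *buf = malloc(MOVE_BUFSZ);
        ssize_t n;

        if (buf == NULL)
            return -1;
        while ((n = read(src, buf, MOVE_BUFSZ)) > 0)
            if (write(dst, buf, (size_t)n) != n)
                break;                /* short or failed write */
        free(buf);
        return n == 0 ? 0 : -1;
    }

    int main(void)
    {
        /* Demo: copy stdin to stdout with the enlarged buffer. */
        return move_data(STDIN_FILENO, STDOUT_FILENO) ? 1 : 0;
    }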

* 3153908 (Tracking ID: 2564442)

SYMPTOM:
During an internal test on a cluster mounted file system, the assertion
"f:vx_prefault_uio_readable:1" fails.

DESCRIPTION:
While using the sendfile() interface, VxFS tries to read the data from the
user buffer; however, the data sent is read directly from the file, failing
the "f:vx_prefault_uio_readable:1" assertion.

RESOLUTION:
The code is modified such that the buffer read operation is skipped
in case the call is made from a sendfile context.

* 3153928 (Tracking ID: 2703747)

SYMPTOM:
Cluster File System (CFS) failover takes a long time because the fsck(1M)
command takes a long time to replay the intent log.

DESCRIPTION:
While replaying the intent log, the fsck(1M) command builds a list of logged
transactions in memory. Every time before adding a new element, it traverses
the complete list; it repeats the same traversal to find the end of a
particular transaction.

RESOLUTION:
The code is modified to maintain the tail of the logged transactions list. This 
prevents the complete traversal of the logged transactions list.
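
The data-structure change can be sketched as follows (structure names
invented): keeping a tail pointer makes each append O(1) instead of
requiring a full traversal from the head:

    #include <stdlib.h>

    struct tran {
        int          id;
        struct tran *next;
    };

    struct tran_list {
        struct tran *head;
        struct tran *tail;    /* the fix: remember the end of the list */
    };

    static void tran_append(struct tran_list *l, int id)
    {
        struct tran *t = calloc(1, sizeof(*t));

        if (t == NULL)
            return;               /* demo: ignore allocation failure */
        t->id = id;
        if (l->tail != NULL)
            l->tail->next = t;    /* O(1): no walk from the head     */
        else
            l->head = t;
        l->tail = t;
    }

    int main(void)
    {
        struct tran_list list = { NULL, NULL };

        for (int i = 0; i < 100000; i++)
            tran_append(&list, i);   /* linear overall, not quadratic */

        while (list.head != NULL) {  /* teardown */
            struct tran *t = list.head;
            list.head = t->next;
            free(t);
        }
        return 0;
    }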

* 3153932 (Tracking ID: 2977697)

SYMPTOM:
Deleting checkpoints of file systems that contain character special device
files (such as /dev/null) using fsckptadm may panic the machine with the
following stack trace:
vx_idetach
vx_inode_deinit
vx_idrop
vx_inull_list
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init

DESCRIPTION:
During the checkpoint removal operation, the type of the inodes is converted
to 'pass-through inode'. During this conversion, the code refers to the
device reference for the special file, which is invalid in the clone
context, leading to a panic.

RESOLUTION:
The code is modified to remove the device reference of special character
files during the clone removal operation, thus preventing the panic.

* 3153934 (Tracking ID: 2649367)

SYMPTOM:
The system crashes in vx_fopen because of NULL pointer dereference and the
following stack trace is observed:

machine_kexec 
crash_kexec 
__die 
do_page_fault 
error_exit 
vx_fopen
__dentry_open 
do_filp_open 
do_sys_open

DESCRIPTION:
The system crashes due to a race condition between force umount thread (on Linux
platform) and the vx_fopen call.

RESOLUTION:
The code is modified to set the appropriate active level which prevents this
race condition occurrence.

* 3153947 (Tracking ID: 2370627)

SYMPTOM:
The fsck(1M) command dumps core while running the fsqa conformance tests.
The following stack trace is observed:

xted_free ()
ag_olt_unload ()
ag_init ()
replay ()
process_device ()
main ()

DESCRIPTION:
In the fsck(1M) command code, fsnpnolt indicates the number of OLT entries
per node in the IFPNOLT file. It is derived from the nlink of the IFPNOLT
file. The post intent-log replay of the pending transactions changes the
value of fsnpnolt, and the code then passes a garbage pointer. The fsck(1M)
binary dumps core as it tries to free this garbage pointer, which was never
allocated via xted_malloc.

RESOLUTION:
The code is modified to detect whether fsnpnolt changed post replay and
whether there are any dirty pnolt logs that need to be replayed. If fsnpnolt
changed after log replay, the replay is restarted so that the dirty pnolt
logs are handled correctly.

* 3154174 (Tracking ID: 2822984)

SYMPTOM:
When the extendfs(1M) command attempts to extend a file system that is
greater than 2TB, the command fails and the following error message is
displayed:
"UX:vxfs fsck: ERROR: V-3-25315: could not seek to block offset"

DESCRIPTION:
This is a typecasting problem. When the extendfs(1M) command tries to extend
the file system, the bs_bseek() function is invoked. The bs_bseek()
function's return type is a 32-bit integer, so the returned value becomes
negative for offsets greater than 2TB, resulting in the failure.

RESOLUTION:
The code is modified to resolve the typecasting problem.
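
The failure mode can be illustrated with a small C example (illustrative only; 
the bs_bseek() internals are not shown in this document). A 64-bit block 
offset returned through a 32-bit integer goes negative once it exceeds 
INT32_MAX:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Assuming 1KB file system blocks, an offset past 2TB exceeds
     * INT32_MAX (2147483647) when expressed in blocks. */
    int64_t blkoff = 3LL * 1024 * 1024 * 1024;   /* ~3TB in 1KB blocks */
    int32_t truncated = (int32_t)blkoff;         /* the typecasting bug */

    printf("64-bit offset: %lld\n", (long long)blkoff);
    printf("32-bit offset: %d\n", truncated);    /* prints a negative value */
    return 0;
}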

* 3158544 (Tracking ID: 2191039)

SYMPTOM:
On a Veritas File System (VxFS) file system, various file system activities,
such as lookup, create, read, and write, can hamper the overall system
performance in low memory conditions.

DESCRIPTION:
Large memory allocations on Linux can cause the kernel swap daemon (kswapd) to
become aggressive as it tries to free enough physically contiguous pages to
satisfy the allocation. It continues these attempts even after the allocation
has failed, leading to reduced system performance.

RESOLUTION:
The code is modified to restrict the non-vmalloc allocation to a reasonable
limit (64K), thus solving this problem.
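
A kernel-style sketch of this approach (the function name is hypothetical; the
actual VxFS routine is not shown here): requests up to 64K use physically
contiguous memory, while anything larger falls back to vmalloc.

#include <linux/slab.h>
#include <linux/vmalloc.h>

#define VX_CONTIG_MAX (64 * 1024)   /* cap on physically contiguous requests */

static void *vx_alloc_sketch(size_t size)
{
    if (size <= VX_CONTIG_MAX)
        return kmalloc(size, GFP_KERNEL);   /* small: contiguous pages */
    return vmalloc(size);                   /* large: virtually contiguous */
}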

* 3159607 (Tracking ID: 2779427)

SYMPTOM:
On a cluster-mounted file system, a read/write/lookup operation may mark the
file system for a full fsck. The following messages are seen in the system log:

vxfs: msgcnt <count> mesg 096: V-2-96: vx_setfsflags - <vol_name>  file system 
fullfsck flag set - vx_cfs_iread

DESCRIPTION:
The in-core ilist (inode list) on the secondary node is not synchronized with 
the primary node.

RESOLUTION:
The code is modified to retry the operation a fixed number of times before 
returning the error.

* 3164821 (Tracking ID: 3014791)

SYMPTOM:
Internal Cluster tests fail with "f:vx_msgprint:ndebug" assertion
failure and the following stack trace is displayed:
vx_msgprint  
vx_fs_init  
vx_attach_fs 
vx_domount 
vx_fill_super
vx_get_sb_bdev 
vx_get_sb_impl  
vx_get_sb  
vfs_kern_mount

DESCRIPTION:
During file system check-point restoration, the file system
structures are initialized and if the OLT structures exist and the dirty flag is
set. The clearing of the dirty flag is missed. This triggers the file-system
clean operation during the next mount, leading to the above mentioned panic.

RESOLUTION:
The code is modified to correctly clear the dirty flag during
check-point restoration.

* 3178899 (Tracking ID: 3206266)

SYMPTOM:
During an internal noise test, the "f:vx_purge_nattr:1" assertion fails.

DESCRIPTION:
The assertion fails because the fsck(1M) command does not break the
linkage between the parent inode and its attribute directory inode after the
attribute inode is removed. This leaves the parent inode with a bad
attribute inode pointer.

RESOLUTION:
The fsck(1M) command is modified to clear the linkages between the
attribute inode and the parent inode when the attribute inode is removed.

* 3204299 (Tracking ID: 2606294)

SYMPTOM:
During an internal noise test, the fsck(1M) command dumps core with the
following stack trace:
pthread_kill
_p_raise()
raise.raise()
abort()
print.assert(cond = "fcl_super->fc_loff <= noff", file = "replay.c", line =
3645), line 982 in "print.c"
flush_fcloffs()
flush_fcloffs_switch()
run_replay()
replay()
process_device()
main()

DESCRIPTION:
A transaction for which the BUFDONE record exists is wrongly picked up
for replay. There is no side effect in the production code, as the File Change
Log (FCL) super block is not updated if the condition of the assert fails.

RESOLUTION:
The code is modified to ensure that the replay does not pick up such
transactions.

* 3206363 (Tracking ID: 3212625)

SYMPTOM:
The fsadm(1M) command fails with the assert "ASSERT(sz <= MAXBUFSZ)",
and the following stack trace is displayed: 
assert
fs_getblk 
ext_bcread 
extchk_typed 
extchk_typed 
inochk_typd 
fset_inochk 
reorg_inode
fset_ilist_process
do_reorg 
ext_reorg
do_fsadm 
main

DESCRIPTION:
In case of active File System (FS), fsadm(1M) can have extent's
stale data. The stale data can be interpreted as wrong length to search for
buffer in buffer-cache.

RESOLUTION:
The code is modified to validate the extent's data before using it.

* 3207096 (Tracking ID: 3192985)

SYMPTOM:
Checkpoint quota usage on CFS can be negative.
An example is as follows:
Filesystem     hardlimit     softlimit        usage         action_flag
/sofs1         51200         51200     18446744073709490176  << negative

DESCRIPTION:
In CFS, to manage the intent logs, and the other extra objects required for 
CFS, a holding object referred to as a per-node-object-location table (PNOLT) 
is created. In CFS, the quota usage is calculated by reading the per node cut 
(current usage table) files (member of PNOLT) and summing up the quota usage 
for each clone clain. However, when the quotaoff and quotaon operations are 
fired on a CFS checkpoint, the usage shows "0" after these two operations are 
executed. This happens because the quota usage calculation is skipped. 
Subsequently, if a delete operation is performed, the usage becomes negative 
since the blocks allocated for the deleted file are subtracted from zero.

RESOLUTION:
The code is modified such that when the quotaon operation is performed, the 
quota usage calculation is not skipped.
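
The negative-looking usage is ordinary 64-bit unsigned wraparound. A small C 
example reproduces the value from the symptom above (the 61440 blocks are 
chosen to match it; this is a worked check, not VxFS code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t usage = 0;     /* usage left at 0 after quotaoff/quotaon */
    usage -= 61440;         /* blocks of the deleted file are subtracted */

    /* Prints 18446744073709490176, i.e. 2^64 - 61440, matching the
     * "negative" usage shown in the symptom. */
    printf("%llu\n", (unsigned long long)usage);
    return 0;
}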

* 3226404 (Tracking ID: 3214816)

SYMPTOM:
When you create and delete the inodes of a user frequently with the DELICACHE 
feature enabled, the user quota file becomes corrupt.

DESCRIPTION:
The inode DELICACHE feature causes this issue. This feature optimizes the 
updates on the inode map during the file creation and deletion operations. It 
is enabled by default. You can disable this feature with the vxtunefs(1M) 
command.

When DELICACHE is enabled and the quota is set for Veritas File System (VxFS), 
VxFS updates the quota for the inodes before the inodes are on the DELICACHE 
list and after they are on the inactive list during the removal process. As a 
result, VxFS decrements the current number of user files twice. This causes the 
quota file corruption.

RESOLUTION:
The code is modified to identify the inodes moved to the inactive list from the 
DELICACHE list. This flag prevents the quota being decremented again during the 
removal process.

* 3233717 (Tracking ID: 3224101)

SYMPTOM:
On a file system that is mounted by a cluster, the system panics after you
enable the lazy optimization for updating the i_size across the cluster nodes.
The stack trace may look as follows:
vxg_free()
vxg_cache_free4()
vxg_cache_free()
vxg_free_rreq()
vxg_range_unlock_body()
vxg_api_range_unlock()
vx_get_inodedata()
vx_getattr()
vx_linux_getattr()

DESCRIPTION:
On a file system that is mounted by a cluster with the -o cluster option, read
operations or write operations take a range lock to synchronize updates across
the different nodes. The lazy optimization incorrectly enables a node to release
a range lock which it has not acquired, which panics the node.

RESOLUTION:
The code has been modified to release only those range locks which are acquired.

* 3235517 (Tracking ID: 3240635)

SYMPTOM:
In a CFS environment, when a checkpoint is mounted using the mount(1M) command, 
the system may panic. The following stack trace is observed:

vx_domount
vx_fill_super 
get_sb_nodev
vx_get_sb_nodev
vx_get_clone_impl
vx_get_clone_sb
do_kern_mount
do_mount
sys_mount

DESCRIPTION:
When a checkpoint is mounted cluster-wide, the protocol version is verified. 
However, if the primary fileset (999) is not mounted cluster-wide, some of the 
file system data structures remain uninitialized. This results in a panic.

RESOLUTION:
The code is modified to disable the cluster-wide mount of the checkpoint if the 
primary fileset is mounted locally.

* 3246793 (Tracking ID: 2414266)

SYMPTOM:
The fallocate(2) system call fails on VxFS file systems in the Linux environment.

DESCRIPTION:
The fallocate(2) system call, which is used for pre-allocating the file space on
Linux, is not supported on VxFS.

RESOLUTION:
The code is modified to support the fallocate(2) system call on VxFS in the
Linux environment.
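
A minimal usage example of the now-supported system call (the path is
illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt1/datafile", O_CREAT | O_RDWR, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Preallocate 1MB of file space at offset 0; on a VxFS file system
     * this call failed before the fix. */
    if (fallocate(fd, 0, 0, 1024 * 1024) != 0)
        perror("fallocate");
    close(fd);
    return 0;
}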

* 3247280 (Tracking ID: 3257314)

SYMPTOM:
On systems installed with the SFORA/SFRAC stacks, when DBED operations such as 
the dbdst_obj_move(1M) command are run, the operation may fail with the 
following error message:

dst_obj_adm : FSPPADM err : UX:vxfs fsppadm: WARNING: V-3-26543: File handling 
failure

DESCRIPTION:
The filename length calculation is done incorrectly, causing the malloc and 
free operations to fail. This results in the error message.

RESOLUTION:
The code is modified to calculate the length of the file name correctly.

* 3248982 (Tracking ID: 3272896)

SYMPTOM:
Internal stress test on the local mount hits a deadlock.

DESCRIPTION:
When the rename operation is performed, the directory lock of the file system 
is necessary to check whether the source directory is being renamed to a sub-
directory of itself (a directory loop). However, when the directory is 
partitioned, this check is not required. The unnecessary directory lock causes 
a deadlock when the directory is partitioned.

RESOLUTION:
The code is modified to not take the directory lock for the rename operation 
when the directory is partitioned.

* 3249151 (Tracking ID: 3270357)

SYMPTOM:
The fsck(1M) command fails to clean the corrupt file system during the 
internal 'noise' test. The following error message is displayed:
pass0 - checking structural files
pass1 - checking inode sanity and blocks
fileset 999 primary-ilist inode <inode number> mismatched reorg extent map
fileset 999 primary-ilist inode <inode number> bad blocks found clear? (ynq)n
fileset 999 primary-ilist inode <inode number> does not have corresponding 
matchino
clear? (ynq)n

DESCRIPTION:
The inode which is reorged is from the attribute ilist. When a reorg inode is 
allocated, the 2262161th inode from the delicache is referred to; however, 
this inode is from the primary ilist. There is no check in 'vx_ialloc' that 
forces an attribute ilist inode's corresponding reorg inode to be allocated 
from the same ilist, but the fsck(1M) code expects the source and the reorg 
inode to be from the same ilist, so that when the reorg inode from the primary 
ilist is examined, the corresponding source inode is checked. Also, in the 
primary ilist the VX_IEREORG flag is not set. Thereby, an error message is 
displayed.

RESOLUTION:
The code is modified to add a check in 'vx_ialloc' to ensure that the 
reorg inode is allocated from the same ilist.

* 3261782 (Tracking ID: 3240403)

SYMPTOM:
The fidtovp() system call can panic in the vx_itryhold_locked() function with 
the following stack trace:

vx_itryhold_locked
vx_iget
vx_common_vget
vx_do_vget
vx_vget_skey
vfs_vget
fidtovp
kernel_add_gate_cstack
nfs3_fhtovp
rfs3_write
rfs_dispatch
svc_getreq
threadentry
[kdb_read_mem]

DESCRIPTION:
Some VxFS operations like the vx_vget() function try to get a hold on an in-
core inode using the vx_itryhold_locked() function, without taking the lock on 
the corresponding directory inode. This might lead to a race condition when 
this inode is present on the delicache list and is inactivated. This results 
in a panic when the vx_itryhold_locked() function tries to remove it from a 
free list.

RESOLUTION:
The code is modified to take the delicache lock before unlocking the ilist 
lock inside the vx_inactive() function when the IDELICACHE flag is set. This 
prevents the race condition.

* 3261886 (Tracking ID: 3046983)

SYMPTOM:
There is an invalid CFS node number (<inode number>) 
in ".__fsppadm_fclextract". This causes the Dynamic Storage Tiering (DST) 
policy enforcement to fail.

DESCRIPTION:
DST policy enforcement sometimes depends on the extraction of the File Change 
Log (FCL). When the FCL change log is processed, the FCL records are read from 
the change log into a buffer. If the buffer is not big enough to hold the 
records, a rollback is done and the needed buffer size is passed out. However, 
the rollback is incomplete, which results in the problem.

RESOLUTION:
The code is modified to add the rollback of the content of "fh_bp1->fb_addr" 
and "fh_bp2->fb_addr".

* 3261892 (Tracking ID: 3260563)

SYMPTOM:
An I/O error from an underlying disk or hardware failure may be marked at
incorrect offsets if the offsets are greater than 4GB.

DESCRIPTION:
Due to a typecasting issue, the value of the block number is
truncated from 64 bits to 32 bits.

RESOLUTION:
The code has been modified to correct the typecast for the block number.

* 3262025 (Tracking ID: 3259634)

SYMPTOM:
In CFS, each node that has the file system cluster mounted has its own
intent-log in the file system. A cluster file system that has more than
4,294,967,296 file system blocks can zero out an incorrect location due to
incorrect typecasting. For example, 65536 file system blocks at a block offset
of 1,537,474,560 [fs blocks] can be incorrectly zeroed out using an 8Kb fs
block size and an intent-log of size 65536 fs blocks. This issue can only
occur if an intent-log is located above an offset of 4,294,967,296 file system
blocks. This situation can occur when adding a new node to the cluster and
mounting an additional CFS secondary for the first time, which needs to create
and zero a new intent-log. This situation can also be triggered if the file
system or intent log is resized and an intent-log needs to be cleared.

The problem occurs only with the following file system size and the FS block
size combinations:

1kb block size and FS size > 4TB
2kb block size and FS size > 8TB
4kb block size and FS size > 16TB
8kb block size and FS size > 32TB
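
Each combination above is the point at which the file system exceeds
2^32 (4,294,967,296) blocks, a worked check rather than text from the original
advisory: 2^32 blocks x 1kb = 4TB, x 2kb = 8TB, x 4kb = 16TB, x 8kb = 32TB.
Only on such file systems can an intent-log sit above the 32-bit block-offset
limit; in the example above, a block offset of 4,294,967,296 + 1,537,474,560
truncates to 1,537,474,560 when cast to 32 bits.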

The message log can contain the following messages:

<example>

full fsck flag is set on a file system with the following type of messages:

 

2013 Apr 17 14:52:22 sfsys kernel: vxfs: msgcnt 5 mesg 096: V-2-96:
vx_setfsflags - /dev/vx/dsk/sfsdg/vol1 file system fullfsck flag set - vx_ierror

2013 Apr 17 14:52:22 sfsys kernel: vxfs: msgcnt 6 mesg 017: V-2-17:
vx_attr_iget - /dev/vx/dsk/sfsdg/vol1 file system inode 13675215 marked bad incore

2013 Jul 17 07:41:22 sfsys kernel: vxfs: msgcnt 47 mesg 096:  V-2-96:
vx_setfsflags - /dev/vx/dsk/sfsdg/vol1 file system fullfsck  flag set - vx_ierror 

2013 Jul 17 07:41:22 sfsys kernel: vxfs: msgcnt 48 mesg 017:  V-2-17:
vx_dirbread - /dev/vx/dsk/sfsdg/vol1 file system inode 55010476  marked bad incore 

</example>

DESCRIPTION:
In CFS, each node that has the file system cluster mounted has its own
intent-log in the file system. An intent-log is created when an additional
node mounts the file system as a CFS secondary.
Note that intent-logs are never removed; they are reused.

Whilst clearing an intent log, an incorrect block number is passed to the log
clearing routine resulting in zeroing out an incorrect location. The incorrect
location might point to file data or file system metadata, or the incorrect
location might be part of the file system's available freespace. This is silent
corruption. If file system metadata is corrupted it will be detected when the
corrupt metadata is subsequently accessed and the file system will be marked for
full fsck.

RESOLUTION:
The code is modified so that the correct block number is passed to the log
clearing routine.

Patch ID: 5.1.133.000

* 3285687 (Tracking ID: 3285688)

SYMPTOM:
Re-packaged/Rolled Up 5.1SP1RP3 patch into 5.1SP1RP4

DESCRIPTION:
Re-packaged/Rolled Up 5.1SP1RP3 patch into 5.1SP1RP4

RESOLUTION:
Re-packaged/Rolled Up 5.1SP1RP3 patch into 5.1SP1RP4

Patch ID: 5.1.132.200

* 2340794 (Tracking ID: 2086902)

SYMPTOM:
The performance of a system with Veritas File System (VxFS) is affected due to 
high contention for a spinlock.

DESCRIPTION:
The contention occurs because there are a large number of work items in these 
systems, and currently these work items are enqueued to and dequeued from the 
global list individually.

RESOLUTION:
The code is modified to process the work items by bulk enqueue/dequeue to 
reduce the VxFS worklist lock contention.
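
A userspace C sketch of the bulk-dequeue idea (hypothetical names; the VxFS
internals differ): the whole chain of work items is taken in one critical
section instead of one lock round-trip per item.

#include <pthread.h>
#include <stddef.h>

struct workitem {
    struct workitem *next;
    /* ... work payload ... */
};

static pthread_mutex_t worklist_lock = PTHREAD_MUTEX_INITIALIZER;
static struct workitem *worklist;

/* Drain the whole list under a single lock acquisition; the caller
 * then processes the returned batch without holding the lock. */
static struct workitem *worklist_drain(void)
{
    struct workitem *batch;

    pthread_mutex_lock(&worklist_lock);
    batch = worklist;
    worklist = NULL;
    pthread_mutex_unlock(&worklist_lock);
    return batch;
}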

* 2715030 (Tracking ID: 2715028)

SYMPTOM:
The fsadm(1M) command with the '-d' option may hang when compacting a directory 
if it is run on the Cluster File System (CFS) secondary node while the find(1) 
command is running on any other node.

DESCRIPTION:
During the compacting of a directory, the CFS secondary node has ownership of 
the inode of the directory. To complete the compacting of the directory, the 
truncation message needs to be processed on the CFS primary node. For this to 
occur, the CFS primary node needs to have ownership of the inode of the 
directory, which the secondary node holds. This causes a deadlock.

RESOLUTION:
The code is modified to force the processing of the truncation message on the 
CFS secondary node which initiated the compacting of directory.

* 2725995 (Tracking ID: 2566875)

SYMPTOM:
The write(2) operation exceeding the quota limit fails with an EDQUOT error 
("Disc quota exceeded") before the user quota limit is reached.

DESCRIPTION:
When a write request exceeds a quota limit, the EDQUOT error is handled so that 
Veritas File System (VxFS) can allocate space up to the hard quota limit to 
proceed with a partial write. However, VxFS does not handle this error, and an 
error is returned without performing a partial write.

RESOLUTION:
The code is modified to handle the EDQUOT error from the extent allocation 
routine.
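
From an application's point of view, the corrected behavior follows the usual
POSIX partial-write pattern, sketched below (illustrative code, not part of
VxFS):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Keep writing until the buffer is consumed; once the hard quota limit
 * is reached, write(2) fails with errno set to EDQUOT, and whatever was
 * written before that point stands as a valid partial write. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EDQUOT)
                fprintf(stderr, "hard limit hit after %zu bytes\n", done);
            return done > 0 ? (ssize_t)done : -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}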

* 2726010 (Tracking ID: 2651922)

SYMPTOM:
On a local VxFS file system, the ls(1M) command with the '-l' option runs 
slowly and high CPU usage is observed.

DESCRIPTION:
Currently, Cluster File System (CFS) inodes are not allowed to be reused as 
local inodes, to avoid a Global Lock Manager (GLM) deadlock issue when Veritas 
File System (VxFS) reconfiguration is in process. Hence, if a VxFS local inode 
is needed, all the inode free lists need to be traversed to find a local inode 
if the free lists are almost filled up with CFS inodes.

RESOLUTION:
The code is modified to add a global variable, 'vxi_icache_cfsinodes' to count 
the CFS inodes in inode cache. The condition is relaxed for converting a 
cluster inode to a local inode when the number of in-core CFS inodes is greater 
than the 'vx_clreuse_threshold' threshold and reconfiguration is not in 
progress.

* 2726015 (Tracking ID: 2666249)

SYMPTOM:
The file system hangs with a backtrace like the following:
vx_svar_sleep_unlock
vx_event_wait
vx_async_waitmsg
vx_msg_broadcast
vx_cwfa_step
vx_cwfreeze_common
vx_cwfreeze_all
vx_freeze
vx_allocpolicy_define
vx_aioctl_allocpolicy
vx_aioctl_common
vx_ioctl
vx_compat_ioctl

DESCRIPTION:
The hang occurs when doing umount/remount and other operations which require a
file system freeze. An earlier fix to an fopen-related issue changed the fopen
function, but with a flaw that could drop active level 1 twice, which leads to
a hang when the freeze thread expects the active levels to be zero.

RESOLUTION:
The logic for taking and dropping the active level is corrected.

* 2726018 (Tracking ID: 2670022)

SYMPTOM:
Duplicate file names can be seen in a directory.

DESCRIPTION:
Veritas File System (VxFS) maintains an internal Directory Name Lookup Cache 
(DNLC) to improve the performance of directory lookups. A race condition occurs 
in the DNLC lists manipulation code during lookup/creation of file names that 
have more than 32 characters (which further affects other file creations). This 
causes the DNLC to have a stale entry for an existing file in the directory. A 
lookup of such a file through DNLC does not find the file and allows another 
duplicate file with the same name in the directory.

RESOLUTION:
The code is modified to fix the race condition by protecting the DNLC lists 
through proper locks.

* 2726025 (Tracking ID: 2674639)

SYMPTOM:
The cp(1) command with the '-p' option may fail on a file system whose File 
Change Log (FCL) feature is enabled. The following error messages are displayed:
"cp: setting permissions for 'file_name': Input/output error"
"cp: preserving permissions for 'file_name': No data available"

The fsetxattr() system call during the failed cp(1) command returns the error 
value, 61493.

DESCRIPTION:
During the execution of the cp(1) command with the '-p' option, the attributes 
of the source file are copied to the newly created file. If FCL is enabled on 
the file system, then the log of newly created attributes is added/logged in 
the FCL file. While logging this new entry, if the FCL file does not have free 
space, its size is extended up to the value of the Veritas File System 
(VxFS) 'fcl_maxalloc' tunable and then the entry is logged.

However, in this case an error is returned instead of extending the FCL file 
and retrying. Hence, the cp(1) command with the '-p' option fails with an 
error message.

RESOLUTION:
The code is modified to allocate the required space and extend the FCL file.

* 2726031 (Tracking ID: 2684573)

SYMPTOM:
The performance of the cfsumount(1M) command for the VRTScavf package is slow 
when some checkpoints are deleted.

DESCRIPTION:
When a checkpoint is removed asynchronously, a kernel thread is started to 
handle the job in the background. If an unmount command is issued before these 
checkpoint removal jobs are completed, the command waits for the completion of 
these jobs. A forced unmount can interrupt the process of checkpoint deletion 
and the remaining work is left to the next mount.

RESOLUTION:
The code is modified to add a counter in the vxfsstat(1M) command to determine 
the number of checkpoint removal threads in the kernel. The '-c' option is 
added to the cfsumount(1M) command to force unmount a mounted file system if 
the checkpoint jobs are running.

* 2726056 (Tracking ID: 2709869)

SYMPTOM:
The system panics with a redzone violation while releasing an inode's File 
Input/Output (FIO) statistics structure, and the following stack trace is 
displayed:

kmem_error()
kmem_cache_free_debug()
kmem_cache_free()
vx_fiostats_alloc()
fdd_common1()
fdd_odm_open()
odm_vx_open()
odm_ident_init()
odm_identify()
odmioctl()
fop_ioctl()
ioctl()

DESCRIPTION:
Different types of statistics are maintained when a file is accessed in Quick 
Input/Output (QIO) and non-QIO modes. Some common statistics are copied when 
the file access mode is changed from QIO to non-QIO or vice versa. While 
switching from QIO mode to non-QIO, the QIO statistics structure is freed and 
an FIO statistics structure is allocated to maintain FIO file-level 
statistics. There is a race between the thread freeing the QIO statistics 
(which also allocates the FIO statistics) and a thread updating the QIO 
statistics when the file is opened in QIO mode. Thus, the FIO statistics get 
corrupted as the other thread writes to the structure assuming that the QIO 
statistics are still allocated.

RESOLUTION:
The code is modified to protect the allocation/releasing of FIO/QIO statistics 
using the read-write lock/spin lock for file statistics structure.

* 2752607 (Tracking ID: 2745357)

SYMPTOM:
Performance enhancements are made for the read/write operation on Veritas File 
System (VxFS) structural files.

DESCRIPTION:
The read/write performance of VxFS structural files is affected when the 
piggyback data in the vx_populate_bpdata() function is ignored. This occurs if 
the buffer type is not specified properly, consequently requiring another disk 
I/O to get the same data.

RESOLUTION:
The code is modified so that the piggyback data is not ignored if it is of 
type VX_IOT_ATTR in the vx_populate_bpdata() function, thus leading to an 
improvement in the performance of reads/writes to the VxFS structural files.

* 2765308 (Tracking ID: 2753944)

SYMPTOM:
The file creation threads can hang. The following stack trace is displayed:
cv_wait+0x38
vx_rwsleep_rec_lock+0xa4
vx_recsmp_rangelock+0x14
vx_irwlock2
vx_irwlock+0x34
vx_fsetialloc+0x98
vx_noinode+0xe4
vx_dircreate_tran+0x7d4
vx_pd_create+0xbb8
vx_create1_pd+0x818
vx_do_create+0x80
vx_create1+0x2f8
vx_create+0x158
fop_create+0x34
lo_create+0x138
fop_create+0x34
vn_createat+0x590
vn_openat+0x138
copen+0x260()

DESCRIPTION:
The Veritas File System (VxFS) uses Inode Allocation Units (IAUs) to keep 
track of the allocated and free inodes. Two counters are maintained per IAU: 
one for the number of free regular inodes in that IAU, and the other for the 
number of free directory inodes. A global in-memory counter is also maintained 
to keep track of the total number of free inodes in all the IAUs in the file 
system. The creation threads refer to this global counter to quickly check the 
number of free inodes at any given time. Every time an inode is allocated, 
this global count is decremented. Similarly, it is incremented when an inode 
is freed. The hang is caused when the global counter unexpectedly becomes 
negative, which confuses the file creation threads. 
This global counter is calculated by adding the per-IAU counters at mount 
time. As the code is multi-threaded, any modification to the global counter 
must be guarded by a summary lock, which is missing in the multi-threaded 
code. Therefore, the calculation goes wrong, and the global counter and 
per-IAU counters are out of sync. This results in a negative value and causes 
the file creation threads to hang.

RESOLUTION:
The code is modified to update the global inode free count under the protection 
of the summary lock.
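
A minimal userspace analogue of the fix (hypothetical names): the global
free-inode counter is only ever modified under the same summary lock that
protects the per-IAU counters it is derived from.

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t summary_lock = PTHREAD_MUTEX_INITIALIZER;
static int64_t global_ifree;     /* total free inodes across all IAUs */

static void ifree_adjust(int64_t delta)
{
    pthread_mutex_lock(&summary_lock);
    global_ifree += delta;       /* unguarded updates could be lost and
                                  * drive the counter negative */
    pthread_mutex_unlock(&summary_lock);
}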

* 2821163 (Tracking ID: 2821152)

SYMPTOM:
An internal stress test hit the assert "f:vx_dio_physio:4, 1" on a locally
mounted file system.

DESCRIPTION:
Two pages were being allocated because the block size of the file system is
8K. On the direct I/O (dio) path, the allocated pages are tested for whether
they are good for I/O (VX_GOOD_IO_PAGE), that is, whether the page's _count is
greater than zero (page_count(pp) > 0), the page is compound
(PageCompound(pp)), or the page is reserved (PageReserved(pp)). The first page
passed the assert because its _count was greater than zero, while the second
page hit the assert because all three conditions failed for it. The second
page did not have a _count greater than 0 because a compound page was
allocated for it, and in the compound case only the head page maintains the
count.

RESOLUTION:
The code is changed so that a compound allocation is not done for buffers
larger than PAGESIZE; VX_BUF_KMALLOC() is used instead.
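
A kernel-style sketch of the change (illustrative; VX_BUF_KMALLOC() itself is
a VxFS macro not shown in this document): multi-page buffers are obtained with
kmalloc() rather than as compound pages, since only the head page of a
compound allocation maintains _count.

#include <linux/mm.h>
#include <linux/slab.h>

static void *vx_buf_alloc_sketch(size_t size)
{
    if (size <= PAGE_SIZE)
        return (void *)__get_free_page(GFP_KERNEL);
    /* For an 8K buffer on a 4K-page system, a compound allocation would
     * leave the second (tail) page with page_count() == 0 and trip the
     * VX_GOOD_IO_PAGE check; kmalloc avoids that. */
    return kmalloc(size, GFP_KERNEL);
}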

Patch ID: 5.1.132.100

* 2244612 (Tracking ID: 2196308)

SYMPTOM:
Write and read performance concerns on CFS when running applications that read 
off the end of a file which is being increased in size by another writer 
thread at the same time.

DESCRIPTION:
===============================================

Circumstances requiring investigation by Symantec

===============================================

 

Circumstance 1: Writer thread writing to inode with masterless locks, big impact 
on throughput if locks are then normalized.

 

Stage1:

-      a file is created on a cluster mounted file system using 8Kb fs-blocksize 
(touch xx)

-      we preallocate a single 32Gb extent (setext -r 4194304 -f contig ../xx)

-      we run 'Dtool -r 6386 -W -s 4096 -f ...../xx' (not all args listed)

-      what this does is perform 6836*4Kb writes per second, which is 26.70 Mb/Sec

-      volume has 2 stripes, write_nstream=2, write_pref_io=128K

-      vxstat shows ~54588 sectors-writes/sec and ~108 write ops, i.e. ~26.7 
Mb/Sec to disk.

-      we only access the new file from one node.

 

Stage2:

-      from another node, we read 4Kb of the file - "dd if=./xx of=/dev/null 
bs=4k count=1"

-      this 'normalizes' the locks for file 'xx'

-      no other operations are performed from this node.

 

Stage3:

-      the writer thread throughput drops immediately to 16Mb/Sec, and this 
persists

-      So despite only writing and being able to cache the GLM locks locally on 
this node the throughput drops by nearly half.

-      Do not know the GLM lock mastering, but at the time of testing there were 
still at least 6 nodes in the cluster.

 

 

Circumstance 2: Writer thread appending, reader[s] on different node[s] reading 
in a different range impacts writer throughput

 

Stage1:

-      a file is created on a cluster mounted file system using 8Kb fs-blocksize 
(touch yy)

-      we preallocate a single 32Gb extent (setext -r 4194304 -f contig ./yy)

-      we run 'Dtool -r 6386 -W -s 4096' -f./yy' (not all args listed)

-      what this does is perform 6836*4Kb writes per second

-      we access the file from another node to normalize the glm locks

-      the writer thread throughput drops to ~16Mb/Sec, and this persists

-      We allow this writer process to continuing running a while to allow the 
file to grow 

-      This writer process simulates a producer thread.

-      The writer is always appending to the file, so its range-lock is from EOF 
to maxfileoffset.

 

Stage2:

-      from a second node we start a reader process (or start two, each from 
different nodes)

-      we run 'Dtool -r 27 -R -s 524288 -f ./yy' (not all args listed)

-      what this does is perform 27*512Kb sequential reads of file 'yy' per second

-      setting discovered direct i/o to 256K or 1Mb made no notable difference 
(if I recall correctly)

-      This reader process simulates a DSS consumer process.

-      Logically the reader has to catch-up with the writer

-      so the reader is reading a different range from the writer

 

Stage3

-      However as soon as the Dtool reader process starts the writer process 
throughput drops to 10-11 Mb/sec (~from 16Mb/sec)

-      If the reader is reading from a different range to the writer, why is the 
writer throughput affected?

-      Need to clearly understand this.

 

 

Circumstance 3: Writer thread appending, reader[s] on different node reading in 
the same range impacts writer throughput a little more

 

Stage1:

-      a file is created on a cluster mounted file system using 8Kb fs-blocksize 
(touch zz)

-      we preallocate a single 32Gb extent (setext -r 4194304 -f contig ../zz)

 

Stage2:

-      from a second node we start a reader process (or start two, each from 
different nodes)

-      we run 'Dtool -r 27 -R -s 524288 -f ./zz' (not all args listed)

-      what this does is perform 27*512Kb sequential reads of file 'zz' per second

-      This reader process simulates a DSS consumer process.

-      however the reader's read(2) call returns 0, as no data is written into 
the file yet

 

Stage3:

-      we run 'Dtool -r 6386 -W -s 4096' -f./zz' (not all args listed)

-      what this does is perform 6836*4Kb writes per second

-      the writer thread throughput drops to ~9-10Mb/Sec, and this persists

-      are the reader processes now reading from/near the same range as the 
writer process..?

-      either way the throughput is lower than circumstance 2.

 

Circumstance 4: 

Writer thread appending, reader[s] on different node reading as in circumstance3, 
however one reader is using 10000 128byte i/o's per sec 

 

Stage1:

-      a file is created on a cluster mounted file system using 8Kb fs-blocksize 
(touch aa)

-      we preallocate a single 32Gb extent (setext -r 4194304 -f contig ./aa)

 

Stage2:

-      from a second node we start one reader process to simulate DSS consumer 
thread

-      we run 'Dtool -r 27 -R -s 524288 -f ./aa' (not all args listed)

-      what this does is perform 27*512Kb sequential reads of file 'aa' per second

 

Stage3:

-      from a third node we start one reader process to simulate Surveillance 
consumer thread

-      we run 'Dtool -r 10000 -R -s 128 -f ./aa' (not all args listed)

-      This performs 10000 read i/o per second, where each read i/o is 128 bytes

 

Stage4:

-      however these readers' read(2) calls return 0, as no data is written 
into the file yet

-      we run 'Dtool -r 6386 -W -s 4096' -f./aa' (not all args listed)

-      what this does is perform 6836*4Kb writes per second

-      the writer thread throughput drops to ~6-7Mb/Sec, and this persists

Circumstance 7, detail in brief:

o    writer node (CFS secondary) performing DTool and the usual 26.7 Mb/Sec

o    reader node (CFS secondary) is running 'dd if=./xx of=/dev/null bs=128'

o    dd is started a few seconds after the DTool writer, dd reader reads at 
~6.5Mb/sec - so it cannot keep up

o    stop the DTool writer, and the dd reader i/o rate remains unchanged at 
~6.5Mb/sec

o    dd eventually runs to EOF and reports 6.5Mb/sec average. (glmstat showed 
continued activity)

o    Now, simply run the dd again (umount/mount if wished) and the dd will read 
at 166Mb/sec !!

o    Why does the dd reader run so slowly, even when the writer has stopped!

RESOLUTION:
Three changes are involved:

1. Lock ahead the pglock if we are taking the RWlock off the end of the file. 
This means that the writer no longer fights with the readers for the pglock.

2. Allow the pdflush threads to do more of the work of flushing dirty pages; 
this eliminates some locking by VxFS.

3. Change the read codepath to only lock the range that it can actually read 
(due to EOF) rather than taking the RWlock over the end of the file.

* 2340836 (Tracking ID: 2314212)

SYMPTOM:
Poor DB2 performance when upgrading from DB2 9.1 to 9.7.

DESCRIPTION:
From version 9.5 onwards, DB2 changed to using POSIX threads. As DB2 uses 
Concurrent I/O (CIO), this led to some workloads having contention on the 
mmap_sem, which is taken to avoid a locking hierarchy violation with vx_mmap.

RESOLUTION:
A new mount option "nommapcio" is introduced, which can be used to fail 
mmap() requests for files opened with CIO. In this case the mmap_sem need not 
be taken, which eliminates the contention.
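
For example, on an illustrative device and mount point:

mount -t vxfs -o nommapcio /dev/vx/dsk/testdg/vol1 /mnt1

Subsequent mmap() calls on files opened with CIO on this mount then fail
instead of contending on mmap_sem.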

* 2413811 (Tracking ID: 1590963)

SYMPTOM:
The maximum number of subdirectories is limited to 32767.

DESCRIPTION:
Currently, there is a limit on the maximum number of subdirectories, fixed at 
32767. This value can be increased to 64K on the Linux, HP-UX, and Solaris 
platforms. There is also a need to provide the flexibility to change the 
maximum number of subdirectories.

RESOLUTION:
The code is modified to add the vx_maxlink tunable to control the maximum
number of subdirectories.

* 2508171 (Tracking ID: 2246127)

SYMPTOM:
The mount command may take more time when the IAU file is large.

DESCRIPTION:
At mount time, the IAU file is read one block at a time; each block is
processed before the next block is read. When there are a huge number of files
in the file system, the IAU file becomes large, and reading it one block at a
time slows the completion of the mount command.

RESOLUTION:
The code is changed to read the IAU file using multiple threads in parallel;
also, a complete extent is now read and then processed.

* 2521456 (Tracking ID: 2429281)

SYMPTOM:
"repquota" command is taking too long to complete when the user with large id is
set quota on the file system.

DESCRIPTION:
If there is a user whose uid is large and the file system uses quota, the quota
file becomes huge. The external quota file consists of user quota records (32
bytes per record) organized linearly, so if a quota is set on the file system
for a user with a large id, the file size increases to "(largest id)*32" bytes
accordingly.

To display the user quota information for a file system, "repquota" loops
through each record in the quota file by calling fread; this issues a huge
number of small I/Os even though fread has some buffering of its own. So when
any user id is very large, the external quota file is huge, and the main
problem is that too many small I/Os slow down the execution of "repquota".

RESOLUTION:
Internal buffers are allocated that are big enough to read many quota records
in one system call, instead of using fread to read the quota records from the
external quota file one by one.
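
A userspace C sketch of the approach (illustrative names; the record handling
is simplified to the 32-byte record size mentioned above): one large read(2)
replaces thousands of per-record fread() calls.

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define QREC_SIZE  32          /* bytes per user quota record */
#define QREC_BATCH 8192        /* records fetched per system call */

static void scan_quota_file(const char *path)
{
    char *buf = malloc((size_t)QREC_SIZE * QREC_BATCH);
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0 || buf == NULL)
        goto out;
    while ((n = read(fd, buf, (size_t)QREC_SIZE * QREC_BATCH)) > 0) {
        for (ssize_t off = 0; off + QREC_SIZE <= n; off += QREC_SIZE) {
            /* process the quota record at buf + off */
        }
    }
out:
    if (fd >= 0)
        close(fd);
    free(buf);
}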

* 2521672 (Tracking ID: 2515380)

SYMPTOM:
The ff command hangs and later exits after the program exceeds its memory
limit, with the following error:

# ff -F vxfs   /dev/vx/dsk/bernddg/testvol 
UX:vxfs ff: ERROR: V-3-24347: program limit of 30701385 exceeded for directory
data block list
UX:vxfs ff: ERROR: V-3-20177: /dev/vx/dsk/bernddg/testvol

DESCRIPTION:
The 'ff' command lists all files on a device of a VxFS file system. During the
directory lookup, one function saves the block addresses for a directory by
traversing all the directory blocks.
Another function keeps track of the buffer in which the directory blocks are
read and the extent up to which the directory blocks have been read. This
function is called with an offset and returns the offset up to which the
directory blocks have been read. The offset passed to this function has to be
the offset within the extent, but the logical offset, which can be greater
than the extent size, was wrongly passed. As a result, the returned offset
gets wrapped to 0, the caller thinks that nothing has been read, and hence the
loop.

RESOLUTION:
The call to the function that maintains buffer offsets for reading the data is
removed. That call was incorrect and redundant; the function is already called
correctly from one of the calling functions.

* 2561355 (Tracking ID: 2561334)

SYMPTOM:
The system log file may contain the following error message in a
multi-threaded environment with Dynamic Storage Tiering (DST).
<snip>
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index
creation failure for /vx/fsvm with message Argument list too long
</snip>

DESCRIPTION:
In DST, while enforcing a policy, SQL queries are generated and written to the
file .__fsppadm_enforcesql present in lost+found. In a multi-threaded
environment, 16 threads work in parallel on the ILIST, generate SQL queries,
and write them to the file. This may corrupt the file if multiple threads
write to it simultaneously.

RESOLUTION:
flockfile() is used, instead of adding new locking code, to take a lock on the 
.__fsppadm_enforcesql file descriptor before writing into it.

* 2563929 (Tracking ID: 2648078)

SYMPTOM:
Manual upgrade of VxFS will fail when ODM is running.

DESCRIPTION:
The ODM module depends on the VxFS module and should be stopped
before upgrading VxFS.

RESOLUTION:
A check is added to make sure that ODM is not running while upgrading
VxFS. The upgrade now fails with the following message:
  
    vxodm is still running, please stop it before upgrading.
    '/etc/init.d/vxodm stop' to stop vxodm.
    Please upgrade vxodm also before starting it again.

* 2564431 (Tracking ID: 2515459)

SYMPTOM:
A local mount hangs in vx_bc_binval_cookie with the following stack:
delay
vx_bc_binval_cookie
vx_blkinval_cookie
vx_freeze_flush_cookie
vx_freeze_all
vx_freeze
vx_set_tunefs1
vx_set_tunefs
vx_aioctl_full
vx_aioctl_common
vx_aioctl
vx_ioctl
genunix:ioctl
unix:syscall_trap32

DESCRIPTION:
The hanging process for the local mount is waiting for a buffer to be
unlocked. But that buffer can only be released if its associated cloned-map
writes get flushed, and a necessary flush is missed.

RESOLUTION:
Add code to synchronize cloned map writes so that all the cloned maps will be
cleared and the buffers associated with them will be released.

* 2574396 (Tracking ID: 2433934)

SYMPTOM:
Performance degradation is observed when CFS is used as a back-end NFS data
server, compared to standalone VxFS.

DESCRIPTION:
In CFS, if one thread holds the read-write lock on an inode in exclusive mode,
other threads are stuck on the same inode, even if they want to access the
inode in shared mode, resulting in performance degradation.

RESOLUTION:
The code is changed to avoid taking the read-write lock on the inode in
exclusive mode where it is not required.

* 2578625 (Tracking ID: 2275679)

SYMPTOM:
On a SLES10 machine with high I/O activity, some writes may appear to be
stalled.

DESCRIPTION:
The stall occurs because SLES10 only checks the overall state of dirty
memory, not the dirty memory associated with the file system that is being
written to. This causes flushing on file systems which are not heavily
dirtied, thus increasing the time for writes to complete.

RESOLUTION:
The code is modified to ensure that dedicated flushing threads are
invoked at regular intervals per file system, ensuring that a VxFS file system
never becomes heavily dirtied.

* 2578631 (Tracking ID: 2342067)

SYMPTOM:
While mounting a file system, if there is any error while reading the super
block from the disk, the machine may panic due to a null pointer dereference,
and the following stack trace is displayed:

vx_kill_sb
deactivate_super
get_sb_bdev
vx_get_sb_bdev 
vx_get_sb_impl
vx_get_sb

DESCRIPTION:
While mounting the file system, the super block is read from the disk; if
there is any error during this operation, a cleanup has to be done. During the
cleanup, uninitialized data structures are used, which leads to the panic.

RESOLUTION:
The code is modified to ensure that only properly initialized data
structures are processed during the error handling of the mount operation.

* 2578637 (Tracking ID: 2191031)

SYMPTOM:
VxFS performance may be impacted due to the frequent triggering of the
swapping daemon (kswapd).

DESCRIPTION:
The file system used to allocate physically contiguous memory in large
chunks at some places, triggering kswapd.

RESOLUTION:
The memory allocation request is relaxed to use memory which is not
physically contiguous.

* 2578643 (Tracking ID: 2212686)

SYMPTOM:
The read/write performance may degrade over a period of time due to
memory fragmentation.

DESCRIPTION:
The fragmentation detection mechanism used did not detect
fragmentation effectively.

RESOLUTION:
The code is modified to use an advanced fragmentation detection mechanism to
identify fragmentation.

* 2581351 (Tracking ID: 2588593)

SYMPTOM:
df(1M) shows wrong usage value for volume when large file is deleted.

DESCRIPTION:
The total size of freed extents is maintained in an in-core global variable
and in transaction-subroutine-specific data structures. After the deletion of
a large file, the update to this in-core global variable was missed. While
reporting usage data, df(1M) reads the freed-space information from this
global variable, which contains stale information.

RESOLUTION:
The code is modified to account the freed extent data in the global variable 
used by df(1M), so that the correct usage for the volume is reported by 
df(1M).

* 2586283 (Tracking ID: 2603511)

SYMPTOM:
On systems with Oracle 11.2.0.3 or higher installed, database operations can 
fail, and the following message is displayed in the system logs: "ODM ERROR 
V-41-4-1-105-22 Invalid argument"

DESCRIPTION:
Due to a change in an Oracle API in 11.2.0.3, Oracle Disk Manager (ODM) is 
unable to create a file because of an unrecognized f-mode.

RESOLUTION:
The code is modified to mask and take into consideration only the files which 
are known to ODM instead of failing the creation of the file.

* 2587025 (Tracking ID: 2528819)

SYMPTOM:
AIX can fail to create new worker threads for VxFS. The following message is
seen in the system log:
"WARNING: msgcnt 175 mesg 097: V-2-97: vxfs failed to create new thread"

DESCRIPTION:
AIX fails the thread creation because it cannot find a free slot in that
kproc, and returns ENOMEM.

RESOLUTION:
Limit the maximum number of VxFS worker threads.

* 2587035 (Tracking ID: 2576794)

SYMPTOM:
The access() system call fails with EPERM (Permission denied) on a cluster
node even if the file has executable permissions.

DESCRIPTION:
The cluster node on which the access() failed had stale permissions for the vxfs
inode. This was due to not holding the proper RW-lock while initializing the 
Linux OS inode (struct inode) for the corresponding vxfs inode.

RESOLUTION:
We now hold the appropriate RW-lock before initializing the Linux inode.

* 2603008 (Tracking ID: 2528888)

SYMPTOM:
CFS mount fails after recovery from I/O path failure. The system message
contains the following error:
vxfs: msgcnt 19 mesg 037: V-2-37: vx_metaioerr - vx_pnolt_stateupd_notran_2 -
/dev/vx/dsk/testdg/vol3 file system metadata read error in dev/block 0/196608

DESCRIPTION:
When unmounting the file system, updating the per-node metadata can fail with
an EIO error because the volume is already unavailable, for example disabled
by vxdmpadm. The EIO failure affects the VxFS unmount function since the error
cannot be returned through the Linux kernel interface, which ignores any
unmount errors from the file system. In that circumstance, the file system is
only disabled but not cleaned up, which can cause the next mount to fail.

RESOLUTION:
When EIO is hit, the error is converted to EBUSY and the file system is
disabled; the VxFS unmount function is then retried.

* 2603015 (Tracking ID: 2565400)

SYMPTOM:
Sequential buffered I/O reads are slow in performance.

DESCRIPTION:
Read-Aheads are not happening because the file-system's read-ahead size gets 
incorrectly calculated.

RESOLUTION:
Fixed the incorrect typecast.

* 2607637 (Tracking ID: 2607631)

SYMPTOM:
The vxportal module may fail to load.

DESCRIPTION:
Loading vxportal creates a device file under "/dev". The udevd daemon
creates all the device files under "/dev". At boot time, this daemon may be
busy creating other devices when there are a lot of LUNs to be initialized.
This delays the creation of the vxportal device under "/dev".

RESOLUTION:
The code is modified to retry the device creation if it fails
because the udev daemon is busy.

* 2616395 (Tracking ID: 2609010)

SYMPTOM:
The vxfs_fcl_seektime function seeks to the first record in the FCL file that
has a timestamp greater than or equal to the specified time. The FCL
vxfs_fcl_seektime() API can incorrectly return an EINVAL error (no records in
the FCL file newer than the specified time) even though records after the
specified time are present in the FCL log.

DESCRIPTION:
While doing the binary search, the case was hit where the last block only had
one "partial" record. In this case, the search continued even after the last
record in the block, and therefore the end offsets (where the search should
end) were not set correctly. This resulted in wrongly returning the EINVAL
error.

RESOLUTION:
The case where the last block only has one "partial" record is now handled
such that the end offsets are set correctly.

* 2616398 (Tracking ID: 2611279)

SYMPTOM:
A file system with shared extents may panic with the following stack trace:
  page_fault 
  vx_overlay_bmap
  vx_bmap_lookup
  vx_bmap
  vx_local_get_sharedblkcnt 
  vx_get_sharedblkcnt 
  vx_aioctl_get_sharedblkcnt 
  vx_aioctl_common 
  mntput_no_expire 
  vx_aioctl 
  vx_ioctl

DESCRIPTION:
The mechanism that manages shared extents uses a special file. A HOLE is never
expected in this special file; in the HOLE case, a panic may be seen while
working on this file.

RESOLUTION:
The code has been modified to check whether a HOLE is present in the special
file. If a HOLE is found, the processing is skipped, and thus the panic is
avoided.

* 2619910 (Tracking ID: 2591702)

SYMPTOM:
The following error messages are seen in the VCS engine.log:

Can't exec "/sbin/portmap": No such file or directory at
/opt/VRTSvcs/bin/ApplicationNone/CNFS.pm line 353.
mount: nfsd already mounted or /proc/fs/nfsd busy
mount: according to mtab, nfsd is already mounted on
/proc/fs/nfsdfs.nfs.nlm_grace_period = 90 Can't exec "rpc.lockd": No such file
or directory at /opt/VRTSvcs/bin/ApplicationNone/CNFS.pm line 268.

DESCRIPTION:
The reasons for the above three error messages are described below:
1] Can't exec "/sbin/portmap" error message: the portmap service is not
available in RHEL6 and is replaced by "rpcbind".
2] Can't exec "rpc.lockd" error message: this daemon was required for NFS file
locking. This daemon is not separately available in RHEL6;
its functionality is now moved into the RHEL6 kernel.
3] mount error messages: these were due to an incorrect remount of 'nfsd'
while starting a VCS agent.

RESOLUTION:
The 'rpcbind' service is now used instead of the 'portmap' service.
'rpc.lockd' is not used, as the NFS file locking functionality is available in
the kernel itself. The incorrect remounting of 'nfsd' has been corrected.

* 2619930 (Tracking ID: 2584531)

SYMPTOM:
The system hangs with the following stack:
schedule_timeout
vx_iget
vx_dirlook
vx_lookup
do_lookup
do_path_lookup

DESCRIPTION:
The hanging thread is attempting to get an inode which it finds in the reuse
state. The thread waits in the hope that the reuse flag will get reset. But
the flag never gets reset, as the inode is not actually being reused by any
other thread.

While doing the cleanup of inodes in a chunk for reuse, the flag is set on all
inodes in the chunk upfront, and all those inodes are removed from the
freelist. The inodes are then cleaned one by one, after which the reuse flag
is cleared.

During this process, an inotify watch can be set on an inode. The inode
cleanup is skipped for such an inode and the flag is not reset. All the inodes
in the chunk are added to the freelist again, and the reuse flag is reset on
all inodes except the one on which the inotify watch has come.

RESOLUTION:
During the inode cleanup, if a watch is found on the inode, the reuse flag is
reset first before returning. The inode now goes to the freelist with the
reuse flag reset.

* 2624650 (Tracking ID: 2624262)

SYMPTOM:
A panic is hit in the vx_bc_do_brelse() function while executing the dedup 
functionality, with the following backtrace:
vx_bc_do_brelse()
vx_mixread_compare()
vx_dedup_extents()
enqueue_entity()
__alloc_pages_slowpath()
__get_free_pages()
vx_getpages()
vx_do_dedup()
vx_aioctl_dedup()
vx_aioctl_common()
vx_rwunlock()
vx_aioctl()
vx_ioctl()
vfs_ioctl()
do_vfs_ioctl()
sys_ioctl()

DESCRIPTION:
While executing the vx_mixread_compare() function in the dedup code path, an
error is hit, due to which an allocated data structure remains uninitialized.
The panic occurs due to writing to this uninitialized data structure in the
vx_mixread_compare() function.

RESOLUTION:
The code is changed to free the memory allocated to the data structure when
exiting due to the error.

* 2627346 (Tracking ID: 1590324)

SYMPTOM:
Umount can hang if Linux is using inotify.

DESCRIPTION:
A hang may be seen while unmounting the file system when Linux is using the
inotify mechanism. Internally, inotify increments the reference count of an
inode, which causes the hang in VxFS.

RESOLUTION:
The code is modified to take care of the reference count increased by inotify
at the time of unmounting the file system.

* 2631026 (Tracking ID: 2332314)

SYMPTOM:
Internal noise.fullfsck test with ODM enabled hit an assert fdd_odm_aiodone:3

DESCRIPTION:
In case of a failed I/O in the fdd_write_clone_end() function, the error was
not set on the buffer, which causes the assert.

RESOLUTION:
The code is changed to set the error on the buffer in case of I/O failures in
the fdd_write_clone_end() function.

* 2631315 (Tracking ID: 2631276)

SYMPTOM:
Lookup fails for a file that is in a partitioned directory and is being
accessed using its VxFS namespace extension name.

DESCRIPTION:
If a file is present in a partitioned directory and is accessed using its VxFS
namespace extension name, its name is searched for in one of the hidden leaf
directories. This leaf directory usually does not contain an entry for the
file, so the lookup fails.

RESOLUTION:
The code has been modified to call the partitioned-directory lookup routine at
the upper level, so that the lookup does not fail even if the file is accessed
using its extended namespace name.

* 2631390 (Tracking ID: 2530747)

SYMPTOM:
Threads doing rename, create, and remove can wait indefinitely for the
exclusive cluster-wide dirlock of the file system.

The stack of the rename thread looks like the following:

#0 [ffff8132af5e5898] schedule 
#1 [ffff8132af5e5970] vxg_svar_sleep_unlock 
#2 [ffff8132af5e59c0] vxg_grant_sleep 
#3 [ffff8132af5e59f0] vxg_cmn_lock 
#4 [ffff8132af5e5a40] vxg_api_lock
#5 [ffff8132af5e5a80] vx_glm_lock 
#6 [ffff8132af5e5aa0] vx_pd_rename 
#7 [ffff8132af5e5c10] vx_rename1_pd
#8 [ffff8132af5e5cd0] vx_rename1 
#9 [ffff8132af5e5d20] vx_rename 
#10 [ffff8132af5e5dc0] vfs_rename
#11 [ffff8132af5e5e10] sys_renameat 
#12 [ffff8132af5e5f80] sysenter_do_call

DESCRIPTION:
A thread holding the exclusive cluster-wide dirlock of the file system does
not release it in certain error code paths.

RESOLUTION:
Added code to release the dirlock of the file-system.

* 2635583 (Tracking ID: 2271797)

SYMPTOM:
Internal Noise Testing with locally mounted VxFS filesystem hit an assert
"f:vx_getblk:1a"

DESCRIPTION:
The assert is hit because an overlay inode is marked with the flag indicating
a bad copy of the inode on disk.

RESOLUTION:
The code is changed to set the flag indicating a bad on-disk copy of the inode
only if the inode is not an overlay inode.

* 2642027 (Tracking ID: 2350956)

SYMPTOM:
An internal noise test on a locally mounted file system exited with the error
message "bin/testit : Failed to full fsck cleanly, exiting", and the logs
contain the userspace assert 
"bmaptops.c   369:       ASSERT(devid == 0 || (start == VX_HOLE && devid ==
VX_DEVID_HOLE)) failed".

DESCRIPTION:
The function bmap_data4_set() gets called while entering bmap allocation
information for typed extents of type VX_TYPE_DATA_4 or VX_TYPE_IADDR_4. The
assert expects that either devid should be zero, or, if the extent start is a
hole, devid should be VX_DEVID_HOLE. However, there are never extent
descriptors that represent holes in typed extents. The assertion is incorrect.

RESOLUTION:
The assert is corrected to check that the extent start is not a hole, and that
either devid is zero, or the extent start is VX_OVERLAY with devid being
VX_DEVID_HOLE.

* 2669195 (Tracking ID: 2326037)

SYMPTOM:
An internal stress test on a cluster file system with clones failed while
writing to a file, with the error ENOENT.

DESCRIPTION:
The VxFS file system tries to write to a clone which is in the process of
removal. As the clone removal process works asynchronously, the process starts
to push changes from the inodes of the primary fset to the inodes of the clone
fset. But when the actual write happens, the inode of the clone fset has been
removed; hence the error ENOENT is returned.

RESOLUTION:
Code is added to re-validate the inode being written.

Patch ID: 5.1.132.000

* 2169326 (Tracking ID: 2169324)

SYMPTOM:
On a local mount, when a clone is mounted for a file system and some quota is 
assigned to the clone, the clone is removed if the quota is exceeded. If files 
from the clone are being accessed at that time, an assert may be hit in the 
vx_idelxwri_off() function through vx_trunc_tran().

DESCRIPTION:
During clone removal, all the inodes of the clone(s) being removed are 
traversed, and the assert is hit because the on-disk and in-core sizes differ 
for a file that is being modified by the application.

RESOLUTION:
While truncating files, if the VX_IEPTTRUNC op is set, 
the in-core file size is set to the on-disk file size.

* 2243061 (Tracking ID: 1296491)

SYMPTOM:
Performing a nested mount on a CFS file system triggers a data page fault if a
forced unmount is also taking place on the CFS file system. The panic stack
trace involves the following kernel routines:

vx_glm_range_unlock
vx_mount
domount 
mount 
syscall

DESCRIPTION:
When the underlying cluster mounted file system is in the process of unmounting,
the nested mount dereferences a NULL vfs structure pointer, thereby causing a
system panic.

RESOLUTION:
The code has been modified to prevent the underlying cluster file system from a
forced unmount when a nested mount above the file system is in progress. The
ENXIO error is returned to the forced unmount attempt.

* 2243063 (Tracking ID: 1949445)

SYMPTOM:
A hang occurs when file creates are performed on a large directory. The stack
of the hung thread is similar to the following:

vxglm:vxg_grant_sleep+226                                             
vxglm:vxg_cmn_lock+563
vxglm:vxg_api_lock+412                                             
vxfs:vx_glm_lock+29
vxfs:vx_get_ownership+70                                                  
vxfs:vx_exh_coverblk+89  
vxfs:vx_exh_split+142                                                 
vxfs:vx_dexh_setup+1874 
vxfs:vx_dexh_create+385                                              
vxfs:vx_dexh_init+832 
vxfs:vx_do_create+713

DESCRIPTION:
For large directories, the Large Directory Hash (LDH) is enabled to improve
lookups on such directories. The hang was due to taking ownership of the LDH
inode twice in the same thread context, that is, while building the hash for
the directory.

RESOLUTION:
The code is modified to avoid taking ownership again if the thread already has
ownership of the LDH inode.

* 2243064 (Tracking ID: 2111921)

SYMPTOM:
On Linux platforms, readv()/writev() performance with DIO/CIO can be more than 
2x slower on VxFS than on raw volumes.

DESCRIPTION:
In the current implementation of DIO, if there are multiple iovecs passed 
during an I/O [e.g. readv()/writev()], a DIO is done for each iovec in a loop. 
The given iovecs cannot be coalesced directly, because there is no guarantee 
that the user addresses are contiguous.

RESOLUTION:
A concept of parallel DIO is introduced, with which iovecs in the same extent 
can be submitted together. By doing this, and by using the io_submit interface 
directly, Linux's scatter-gather enhancements can be used. This change brings 
the readv()/writev() performance on VxFS on par with raw volumes.
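
For illustration, a minimal libaio example of submitting one vectored direct
I/O via io_submit(2), the interface mentioned above (the path, sizes, and
error handling are simplified; link with -laio):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    struct iovec iov[2];
    int fd = open("/mnt1/datafile", O_RDONLY | O_DIRECT);

    if (fd < 0 || io_setup(1, &ctx) != 0)
        return 1;
    /* Two 4KB aligned buffers submitted as a single vectored read. */
    if (posix_memalign(&iov[0].iov_base, 4096, 4096) != 0 ||
        posix_memalign(&iov[1].iov_base, 4096, 4096) != 0)
        return 1;
    iov[0].iov_len = iov[1].iov_len = 4096;
    io_prep_preadv(&cb, fd, iov, 2, 0);
    if (io_submit(ctx, 1, cbs) != 1)
        return 1;
    io_getevents(ctx, 1, 1, &ev, NULL);   /* reap the completion */
    return 0;
}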

* 2247299 (Tracking ID: 2161379)

SYMPTOM:
In a CFS environment, various file system operations hang with the following
stack traces:
T1:
vx_event_wait+0x40
vx_async_waitmsg+0xc
vx_msg_send+0x19c
vx_iread_msg+0x27c
vx_rwlock_getdata+0x2e4
vx_glm_cbfunc+0x14c
vx_glmlist_thread+0x204

T2:
vx_ilock+0xc
vx_assume_iowner+0x100
vx_hlock_getdata+0x3c
vx_glm_cbfunc+0x104
vx_glmlist_thread+0x204

DESCRIPTION:
Due to improper handling of the ENOTOWNER error in the iread receive function,
the operation is continuously retried while the inode lock is held, blocking
all other threads and causing a deadlock.

RESOLUTION:
The code is modified to release the inode lock on the ENOTOWNER error and
acquire it again, thus resolving the deadlock.

There are four vx_msg_get_owner() callers with ilocked=1:
    vx_rwlock_getdata() : needs the fix
    vx_glock_getdata()  : needs the fix
    vx_cfs_doextop_iau(): does not use the owner for the message loop, no fix needed
    vx_iupdat_msg()     : already does 'unlock/delay/lock' on the ENOTOWNER
condition (see the sketch below)
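
The 'unlock/delay/lock' pattern now applied to the two fixed callers has
roughly the following shape. This is a sketch with invented helper names and
an illustrative ENOTOWNER value, not the actual VxFS source:

    struct vx_inode;
    extern int  vx_msg_get_owner(struct vx_inode *ip);  /* may fail ENOTOWNER */
    extern void vx_ilock(struct vx_inode *ip);
    extern void vx_iunlock(struct vx_inode *ip);
    extern void vx_delay(int ticks);
    #define ENOTOWNER 1024        /* VxFS-internal code; value illustrative */

    int iread_getdata_locked(struct vx_inode *ip)  /* called with ilock held */
    {
        int err;

        while ((err = vx_msg_get_owner(ip)) == ENOTOWNER) {
            vx_iunlock(ip);       /* let the owner-side thread (T2) run */
            vx_delay(1);          /* brief back-off                     */
            vx_ilock(ip);         /* re-acquire and retry               */
        }
        return err;
    }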

* 2249658 (Tracking ID: 2220300)

SYMPTOM:
'vx_sched' hogs CPU resources.

DESCRIPTION:
The vx_sched process calls vx_iflush_list() to perform the background flushing
processing. vx_iflush_list() calls vx_logwrite_flush() if the file has had
logged writes performed on it. vx_logwrite_flush() uses an old trick that is
ineffective when flushing in chunks: flush the file asynchronously, then flush
the file again synchronously. This flushes the entire file twice, which is
double the work when chunk flushing.

RESOLUTION:
vx_logwrite_flush() has been changed to flush the file once rather than twice;
the asynchronous flush has been removed.

* 2255786 (Tracking ID: 2253617)

SYMPTOM:
Fullfsck fails to run cleanly using "fsck -n".

DESCRIPTION:
In the case of duplicate file name entries in one directory, fsck compares each
directory entry with the previous entries. If the filename already exists,
further action is taken according to the user input [Yes/No]. Because strncmp
is used with a length of n, only the first n characters are compared; on a
match the entry is treated as a duplicate file name, so fsck fails to run
cleanly using "fsck -n".

RESOLUTION:
Checking the filename size and changing the length in strncmp to 
name_len + 1 solves the issue.
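
A minimal userspace illustration of the strncmp pitfall (hypothetical file
names, not fsck source): with a length of n, "file" wrongly matches "file1";
including the terminating NUL (name_len + 1) distinguishes them.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *existing  = "file1";   /* entry already seen   */
        const char *candidate = "file";    /* new directory entry  */
        size_t name_len = strlen(candidate);

        if (strncmp(existing, candidate, name_len) == 0)
            printf("buggy: reported as duplicate\n");  /* first 4 bytes match */

        if (strncmp(existing, candidate, name_len + 1) != 0)
            printf("fixed: treated as distinct\n");    /* NUL vs '1' differs  */
        return 0;
    }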

* 2257904 (Tracking ID: 2251223)

SYMPTOM:
The 'df -h' command can take 10 seconds to run to completion and yet still 
report an inaccurate free block count, shortly after removing a large number 
of files.

DESCRIPTION:
When removing files, some file data blocks are released and counted in the 
total free block count instantly. However blocks may not always be freed 
immediately as VxFS can sometimes delay the releasing of blocks. Therefore 
the displayed free block count, at any one time, is the summation of the 
free blocks and the 'delayed' free blocks. 

Once a file 'remove transaction' is done, its delayed free blocks are
eliminated and the free block count increased accordingly. However, some
functions which process transactions, for example a metadata update, can also
alter the free block count but ignore the current delayed free blocks. As a
result, if file 'remove transactions' have not finished updating their free
blocks and their delayed free blocks information, the free space count can
occasionally show more than the real disk space. Therefore, to obtain an
up-to-date and valid free block count for a file system, a delay-and-retry
loop delayed 1 second before each retry and looped 10 times before giving up.
Thus the 'df -h' command can sometimes take 10 seconds, and even if the file
system waits for 10 seconds there is no guarantee that the displayed output
will be accurate or valid.

RESOLUTION:
The delayed free block count is recalculated accurately when transactions 
are created and when metadata is flushed to disk.

* 2275543 (Tracking ID: 1475345)

SYMPTOM:
write() system call hangs for over 10 seconds

DESCRIPTION:
While performing transactions for logged writes, we used to asynchronously
flush one buffer at a time belonging to the transaction space. Such
asynchronous flushing was causing intermediate delays in the write operation
because of the reduced transaction space.

RESOLUTION:
Flush all the dirty buffers on the file in one attempt through a synchronous
flush, which frees up a large amount of transaction space. This reduces the
delay during the write system call.

* 2280386 (Tracking ID: 2061177)

SYMPTOM:
The 'fsadm -de' command fails with a 'bad file number' error on file system(s)
on 5.0MP3RP1.

DESCRIPTION:
<1> First, the kernel file system itself does not have a problem; there is no
corrupt layout on disk. The metasave obtained from the customer is the proof
(the problem cannot be reproduced from it, and it contains no corrupted
inodes).
<2> Second, fsadm is an application with two parts: the application part, which
reads the layout from the raw disk to form its strategy, and the kernel part,
which implements it. On a buffered-write file system the two can get out of
sync. When 'fsadm -de' is run concurrently with a huge amount of write
activity, the layout read by the application part can be stale. Many
checkpoints make this worse, because checkpoints imply copy-on-write, which
multiplies the write operations; the more checkpoints, the more likely the
problem.

RESOLUTION:
A sync operation is added in fsadm before it reads the layout from the raw
disk, so that the kernel and application views do not get out of sync.

* 2280552 (Tracking ID: 2246579)

SYMPTOM:
File system corruption and a system panic can occur when attempting to extend a
100%-full disk layout version 5 (DLV5) VxFS file system using fsadm(1M).

DESCRIPTION:
The behavior is caused by filesystem metadata that is relocated to the intent log
area inadvertently being destroyed when the intent log is cleared during the
resize operation.

RESOLUTION:
Refresh the incore intent log extent map by reading the bmap of the intent log
inode before clearing it.

* 2296277 (Tracking ID: 2296107)

SYMPTOM:
The fsppadm command (fsppadm query -a mountpoint) displays "Operation not
applicable" while querying the mount point.

DESCRIPTION:
During the fsppadm query process, fsppadm tries to open every file's named data
stream in the file system, but the VxFS internal FCL file "changelog" does not
support this operation, so "ENOSYS" is returned. fsppadm translates "ENOSYS"
into "Operation not applicable" and prints the bogus error message.

RESOLUTION:
Fix fsppadm's get_file_tags() to ignore the "ENOSYS" error.
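
The shape of the fix is a one-line special case in the error handling; the
sketch below paraphrases it (the open_named_stream helper is a stand-in, not
the real fsppadm internals):

    #include <errno.h>

    extern int open_named_stream(const char *path);  /* stand-in helper */

    int get_file_tags_for(const char *path)
    {
        int fd = open_named_stream(path);
        if (fd < 0) {
            if (errno == ENOSYS)
                return 0;  /* internal files such as the FCL don't support
                            * named streams: skip silently, not an error  */
            return -1;     /* every other failure is still reported      */
        }
        /* ... read the tags from fd ... */
        return 0;
    }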

* 2311490 (Tracking ID: 2074806)

SYMPTOM:
A DMAPI program using dm_punch_hole may result in corrupted data.

DESCRIPTION:
When the dm_punch_hole call is made on a file with allocated extents
immediately after a previous write, data can be written through stale pages.
This causes data to be written to the wrong location.

RESOLUTION:
dm_punch_hole will now invalidate all the pages within the hole it is
creating.

* 2320044 (Tracking ID: 2419989)

SYMPTOM:
The ncheck(1M) command with the '-i' option does not limit the output to the
specified inodes.

DESCRIPTION:
The ncheck(1M) command with the '-i' option currently shows free space
information and other inodes that are not in the list provided by the '-i'
option.

RESOLUTION:
The ncheck(1M) command is modified to print only those inodes that are
specified by the '-i' option.

* 2320049 (Tracking ID: 2419991)

SYMPTOM:
There is no way to specify an inode that is unique to the file system, since
inode numbers are reused across multiple filesets. One would therefore need to
specify a list of filesets similar to the '-i' option for inodes, or a new
'-o' option where fileset+inode pairs can be specified.

DESCRIPTION:
When the ncheck command is called with the '-i' option in conjunction with the
-oblock/device/sector option, it displays inodes having the same inode number
from all filesets. There is no command-line option that allows specifying a
unique inode and fileset combination.

RESOLUTION:
The code is modified to add an '-f' option to the ncheck command, with which
one can specify the fileset number on which to filter the results. Further,
when this option is used together with the '-i' option, the inode-fileset
pair(s) to display can be specified uniquely.

* 2329887 (Tracking ID: 2253938)

SYMPTOM:
In a Cluster File System (CFS) environment, file read performance can
gradually degrade to 10% of the original read performance, and 'fsadm -F vxfs
-D -E <mount point>' shows a large number (> 70%) of free blocks in extents
smaller than 64k.
For example,
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than  8 blks: 5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into
Allocation Units (AUs). The delegation for these AUs is
cached locally on the nodes.

When an extending write operation is performed on a file,
the file system tries to allocate the requested block from
an AU whose delegation is locally cached, rather than
finding the largest free extent available that matches the
requested size in the other AUs. This leads to a
fragmentation of the free space, thus leading to badly
fragmented files.

RESOLUTION:
The code is modified such that the time for which the
delegation of the AU is cached can be reduced using a
tunable, thus allowing allocations from other AUs with
larger free extents. Also, the fsadm(1M) command is
enhanced to defragment free space using the -C option.

* 2329893 (Tracking ID: 2316094)

SYMPTOM:
vxfsstat incorrectly reports "vxi_bcache_maxkbyte" greater than "vx_bc_bufhwm"
after reinitialization of the buffer cache globals. Reinitialization can happen
during dynamic reconfiguration operations.

vxfsstat's "vxi_bcache_maxkbyte" counter shows the maximum memory available for
buffer cache buffer allocation. The maximum memory available for buffer
allocation depends on the total memory available for the buffer cache (buffers
+ buffer headers), i.e. the "vx_bc_bufhwm" global. Therefore
vxi_bcache_maxkbyte should never be greater than vx_bc_bufhwm.

DESCRIPTION:
"vxi_bcache_maxkbyte" is per-CPU counter i.e. part of global per-CPU 'vx_info'
counter structure. vxfsstat does sum of all per-cpu counters and reports result
of sum. During re-intitialation of buffer cache, this counter was not set to
zero properly before new value is assigned to it. Therefore total sum of this
per-CPU counter can be more than 'vx_bc_bufhwm'.

RESOLUTION:
During buffer cache re-initialization, "vxi_bcache_maxkbyte" is now correctly
set to zero such that final sum of this per-CPU counter is correct.

* 2338010 (Tracking ID: 2337737)

SYMPTOM:
For Linux version 2.6.27 onwards, a write() may never complete due to the
write kernel thread looping forever, consuming most of the CPU and leading to
a system-hang-like situation. The kernel stack of such a looping write thread
looks like:
<>
bad_to_user
vx_uiomove
vx_write_default
vx_write1
vx_rwsleep_unlock
vx_do_putpage
vx_write_common_slow
handle_mm_fault
d_instantiate
do_page_fault
vx_write_common
vx_prefault_uio_readable
vx_write
vfs_write
sys_write
system_call_fastpath
<>

DESCRIPTION:
Some pages created during a write() may be partially initialized and are
therefore destroyed. However, due to a bug, the variable representing the
number of bytes copied was not updated correctly to reflect the destroyed
page. Thus the subsequent page fault for the destroyed page occurred at an
incorrect offset, leading to indefinite looping of the write kernel thread.

RESOLUTION:
Correctly update the number of bytes copied after destroying partially
initialized pages.

* 2340741 (Tracking ID: 2282201)

SYMPTOM:
On a VxFS file system, a vxdump(1M) operation running in parallel with other
file system operations such as create and delete can fail with signal SIGSEGV,
generating a core file.

DESCRIPTION:
vxdump caches the inodes to be dumped in a bitmap before starting the dump of
a directory. However, this bitmap can become stale if creates and deletes are
happening in the background, leading to an inconsistent bitmap and eventually
a core file.

RESOLUTION:
The code is updated to refresh the inode bitmap before actually starting the
dump operation, thus avoiding the core file generation.

* 2340799 (Tracking ID: 2059611)

SYMPTOM:
The system panics because of a NULL tranp in vx_unlockmap().

DESCRIPTION:
vx_unlockmap() unlocks a map structure of the file system. While the map is
being handled, the hold count is incremented. vx_unlockmap() checks whether
the mlink doubly linked list is empty, but the asynchronous vx_mapiodone()
routine can change the links at unpredictable times when the hold count is
zero.

RESOLUTION:
The evaluation order is changed inside vx_unlockmap() so that
further evaluation is skipped when the map hold count is zero.
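
In C, such a fix amounts to putting the hold-count test first so that &&
short-circuits past the racy list check. A sketch with invented field names,
not the VxFS source:

    struct vx_map {
        int           holdcnt;      /* incremented while the map is handled */
        struct mlink *mlink_head;   /* list vx_mapiodone() mutates async    */
    };

    int vx_unlockmap_more_work(struct vx_map *mp)
    {
        /* holdcnt is evaluated first: when it is zero, && short-circuits
         * and mlink_head is never examined, avoiding the race window     */
        return mp->holdcnt > 0 && mp->mlink_head != 0;
    }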

* 2340817 (Tracking ID: 2192895)

SYMPTOM:
The system panics when performing FCL commands, with the following stack trace:

unix:panicsys
unix:vpanic_common
unix:panic
genunix:vmem_xalloc
genunix:vmem_alloc
unix:segkmem_xalloc
unix:segkmem_alloc_vn
genunix:vmem_xalloc
genunix:vmem_alloc
genunix:kmem_alloc
vxfs:vx_getacl
vxfs:vx_getsecattr
genunix:fop_getsecattr
genunix:cacl
genunix:acl
unix:syscall_trap32

DESCRIPTION:
The acl count in the inode can be corrupted due to a race condition. For
example, setacl can change the acl count while getacl is processing the same
inode, which could cause an invalid use of the acl count.

RESOLUTION:
The code is modified to protect the vulnerable acl count and avoid the
corruption.

* 2340825 (Tracking ID: 2290800)

SYMPTOM:
When using fsdb to look at the map of the ILIST file (the "mapall" command),
fsdb can wrongly report a large hole at the end of the ILIST file.

DESCRIPTION:
While reading the bmap of the ILIST file, if a hole at the end of indirect
extents is found, fsdb may incorrectly mark that hole as the last extent in
the bmap, causing the mapall command to show a large hole up to the end of the
file.

RESOLUTION:
The code has been modified to read the ILIST file's bmap correctly when holes
at the end of indirect extents are found, instead of marking such a hole as
the last extent of the file.

* 2340831 (Tracking ID: 2272072)

SYMPTOM:
GAB panics the machine because the VCS engine "had" did not respond when lbolt
wrapped around.

DESCRIPTION:
The lbolt value wraps around after 498 days of machine uptime. VxFS flushes
its metadata buffers based on their age, and the age calculation takes lbolt
into account.

Due to the lbolt wrap, the buffers were not flushed, so a lot of metadata I/O
stopped, hence the panic.

RESOLUTION:
In the function handling the flushing of dirty buffers, also handle the
condition where lbolt has wrapped: if it has, assign the current lbolt time to
the last update time of the dirty list.
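
The wrap check is simple: lbolt increases monotonically between wraps, so an
lbolt smaller than the recorded last-update time can only mean a wrap. A
sketch with invented names, not the VxFS source:

    typedef unsigned long vx_ticks_t;

    void vx_dirtylist_agecheck(vx_ticks_t lbolt, vx_ticks_t *last_update,
                               vx_ticks_t max_age)
    {
        if (lbolt < *last_update)     /* lbolt wrapped around     */
            *last_update = lbolt;     /* restart aging from "now" */

        if (lbolt - *last_update >= max_age) {
            /* ... flush the dirty buffers on this list ... */
            *last_update = lbolt;
        }
    }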

* 2340834 (Tracking ID: 2302426)

SYMPTOM:
The system panics when multiple 'vxassist mirror' commands are running
concurrently, with the following stack trace:

0)  panic+0x410 
1)  unaligned_hndlr+0x190 
2)  bubbleup+0x880  ( ) 
+-------------  TRAP #1  ---------------------------- 
|  Unaligned Reference Fault in KERNEL mode 
|    IIP=0xe000000000b03ce0:0 
|    IFA=0xe0000005aa53c114    <--- 
|  p struct save_state 0x2c561031.0x9fffffff5ffc7400 
+-------------  TRAP #1  ---------------------------- 
LVL  FUNC  ( IN0, IN1, IN2, IN3, IN4, IN5, IN6, IN7 ) 
3)  vx_copy_getemap_structs+0x70 
4)  vx_send_getemapmsg+0x240 
5)  vx_cfs_getemap+0x240 
6)  vx_get_freeexts_ioctl+0x990
7)  vxportal_ioctl+0x4d0 
8)  spec_ioctl+0x100 
9)  vno_ioctl+0x390 
10) ioctl+0x3c0
11) syscall+0x5a0

DESCRIPTION:
The panic is caused by dereferencing an unaligned address in a CFS message
structure.

RESOLUTION:
bcopy is used to ensure proper alignment of the addresses.
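
The pattern of the fix, sketched with invented names: rather than casting an
arbitrary offset inside the received message to a structure pointer (which
traps on architectures such as Itanium, per the stack above), the bytes are
first copied into a naturally aligned local.

    #include <string.h>

    struct getemap_args { long long start; long long len; };

    void handle_getemap_msg(const char *msgbuf, unsigned int off)
    {
        struct getemap_args args;   /* local variable: naturally aligned   */

        /* buggy: p = (struct getemap_args *)(msgbuf + off); use p->start  */
        memcpy(&args, msgbuf + off, sizeof(args));  /* bcopy() in the fix  */
        /* ... args.start and args.len can now be used safely ...          */
    }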

* 2340839 (Tracking ID: 2316793)

SYMPTOM:
Shortly after removing files in a file system commands like 'df', which use
'statfs()', can take 10 seconds to complete.

DESCRIPTION:
To obtain an up-to-date and valid free block count in a file system, a
delay-and-retry loop delayed 1 second before each retry and looped 10 times
before giving up. This unnecessarily excessive retrying could cause a 10
second delay per file system when executing the df command.

RESOLUTION:
The original 10 retries with a 1 second delay each have been reduced to 1
retry after a 20 millisecond delay when waiting for an updated free block
count.

* 2341007 (Tracking ID: 2300682)

SYMPTOM:
When a file is newly created, issuing "fsppadm query -a /mount_point" could show
incorrect IOTemp information.

DESCRIPTION:
fsppadm query outputs incorrect data when a file reuses an inode number that
belonged to a removed file but the database still contains the obsolete record
for the removed one.

The fsppadm utility makes use of a database to save inodes' historical data.
It compares the nearest and the farthest records for an inode to compute
IOTemp over a time window, and it picks the inode generation from the farthest
record to check for the inode's existence. However, if the farthest record is
missing, zero is mistakenly used as the generation.

RESOLUTION:
If the nearest record for a given inode exists in the database, the generation
entry is extracted from it instead of from the farthest record.

* 2360817 (Tracking ID: 2332460)

SYMPTOM:
Executing the VxFS 'vxedquota -p user1 user2' command to copy quota information
of one user to other users takes a long time to run to completion.

DESCRIPTION:
VxFS maintains quota information in two files - external quota files and
internal quota files. Whilst copying quota information of one user to another,
the required quota information is read from both the external and internal
files. However the quota information should only need to be read from the
external file in a situation where the read from the internal file has failed.
Reading from both files is therefore causing an unnecessary delay in the command
execution time.

RESOLUTION:
The unnecessary duplication of reading both the external and internal Quota
files to retrieve the same information has been removed.

* 2360819 (Tracking ID: 2337470)

SYMPTOM:
Cluster File System can unexpectedly and prematurely report a 'file system 
out of inodes' error when attempting to create a new file. The error message 
reported will be similar to the following:

vxfs: msgcnt 1 mesg 011: V-2-11: vx_noinode - /dev/vx/dsk/dg/vol file system out
of inodes

DESCRIPTION:
When allocating new inodes in a cluster file system, vxfs will search for an 
available free inode in the 'Inode-Allocation-Units' [IAUs] that are currently
delegated to the local node. If none are available, it will then search the 
IAUs that are not currently delegated to any node, or revoke an IAU delegated 
to another node. It is also possible for gaps or HOLEs to be created in the IAU 
structures as a side effect of the CFS delegation processing. However when 
searching for an available free inode vxfs simply ignores any HOLEs it may find,
if the maximum size of the metadata structures has been reached (2^31) new IAUs
cannot be created, thus one of the HOLEs should then be populated and used for 
new inode allocation. The problem occurred as HOLEs were being ignored, 
consequently vxfs can prematurely report the "file system out of inodes" error 
message even though there is plenty of free space in the vxfs file system to
create new inodes.

RESOLUTION:
New inodes will now be allocated from the gaps, or HOLEs, in the IAU structures 
(created as a side effect of the CFS delegation processing). The HOLEs will be 
populated rather than returning a 'file system out of inodes' error.

* 2360821 (Tracking ID: 1956458)

SYMPTOM:
When attempting to check checkpoint information with
fsckptadm -C blockinfo <pathname> <ckpt-name> <mountpoint>,
the command fails with error 6 (ENXIO), the file system is disabled, and
errors like the following appear in the message file:
vxfs: msgcnt 4 mesg 012: V-2-12: vx_iget - /dev/vx/dsk/sfsdg/three file system
invalid inode number 4495
vxfs: msgcnt 5 mesg 096: V-2-96: vx_setfsflags - /dev/vx/dsk/sfsdg/three file
system fullfsck flag set - vx_cfs_iread

DESCRIPTION:
VxFS makes use of ilist files in the primary fileset and in checkpoints to
hold inode information. A hole in an ilist file indicates that the inodes in
the hole do not exist and have not yet been allocated in the corresponding
fileset or checkpoint.

fsckptadm checks every inode in the primary fileset and the downstream
checkpoints. If an inode falls into a hole in a prior checkpoint, i.e. the
associated file had not been created at the time of the checkpoint creation,
fsckptadm exits with an error.

RESOLUTION:
Skip inodes in the downstream checkpoints if these inodes are located in a
hole.

* 2368738 (Tracking ID: 2368737)

SYMPTOM:
If a file which has shared extents has corrupt indirect blocks, then in
certain cases the reference count tracking system can try to interpret this
block and panic the system. Since this is an asynchronous background
operation, the processing is retried repeatedly on every file system mount and
hence can result in a panic every time the file system is mounted.

DESCRIPTION:
The reference count tracking system for shared extents updates reference
counts in a lazy fashion, so in certain cases it has to asynchronously access
shared indirect blocks belonging to a file to account for reference count
updates. If such an indirect block has already been badly corrupted, this
tracking mechanism can panic the system repeatedly on every mount.

RESOLUTION:
The reference count tracking system now validates the indirect extent read
from disk; if it is found invalid, it sets the VX_FULLFSCK flag in the
superblock, marking the file system for a full fsck, and disables the file
system on the current node.

* 2373565 (Tracking ID: 2283315)

SYMPTOM:
System may panic when "fsadm -e" is run on a file system containing file level 
snapshots. The panic stack looks like:

crash_kexec()
__die at() 
do_page_fault()
error_exit()
[exception RIP: vx_bmap_lookup+36]
vx_bmap_lookup()
vx_bmap()
vx_reorg_emap()
vx_extmap_reorg()
vx_reorg()
vx_aioctl_full()
vx_aioctl_common()
vx_aioctl()
vx_ioctl()
do_ioctl()
vfs_ioctl()
sys_ioctl()
tracesys

DESCRIPTION:
The panic happened because a NULL inode pointer was passed to the
vx_bmap_lookup() function. While reorganizing the extents of a file, a block
map (bmap) lookup is done on the file to get information about its extents. If
this bmap lookup finds a hole at an offset in a file containing shared
extents, a local variable is not updated, which makes the inode pointer NULL
during the next bmap lookup operation.

RESOLUTION:
The local variable is initialized so that the inode pointer passed to
vx_bmap_lookup() is non-NULL.

* 2386483 (Tracking ID: 2374887)

SYMPTOM:
Access to a file system can hang when creating a named attribute, due to a
read/write lock being held exclusively and indefinitely, causing a thread to
loop in vx_tran_nattr_dircreate().

A typical stacktrace of a looping thread:

vx_itryhold_locked
vx_iget
vx_attr_iget
vx_attr_kgeti
vx_attr_getnonimmed
vx_acl_inherit
vx_aclop_creat
vx_attr_creatop
vx_new_attr
vx_attr_inheritbuf
vx_attr_inherit
vx_tran_nattr_dircreate
vx_nattr_copen
vx_nattr_open
vx_setea
vx_linux_setxattr
vfs_setxattr
link_path_walk
sys_setxattr
system_call

DESCRIPTION:
The initial creation of a named attribute for a regular file 
or directory will result in the automatic creation of a 
'named attribute directory'. Creations are initially attempted 
in a single transaction. Should the single transaction fail 
due to a read/write lock being held then a retry should split
the task into multiple transactions. An incorrect reset of a 
tracking structure meant that all retries were performed using 
a single transaction creating an endless retry loop.

RESOLUTION:
The tracking structure is no longer reset within the retry loop.

* 2402643 (Tracking ID: 2399178)

SYMPTOM:
Full fsck performs large directory index validation during pass2c. If the
number of large directories is high, this pass takes a lot of time. There is
large scope for improving full fsck performance during this pass.

DESCRIPTION:
Pass2c consists of the following basic operations:
[1] Read the entries in the large directory.
[2] Cross-check the hash values of those entries against the hash directory
inode contents residing on the attribute ilist.

This makes it another heavily IO-intensive pass.

RESOLUTION:
1. Use directory block read-ahead during step [1].
2. Wherever possible, access the file contents extent-wise rather than in
fs-block-size units (while reading entries in the directory) or
hash-block-size units (8k, during dexh_getblk).

With the above enhancements, the buffer cache is utilized more effectively.

* 2412029 (Tracking ID: 2384831)

SYMPTOM:
The system panics with the following stack trace. This happens in some cases
when named streams are used in VxFS.

machine_kexec()
crash_kexec()
__die
do_page_fault()
error_exit()
[exception RIP: iput+75]
vx_softcnt_flush()
vx_ireuse_clean()
vx_ilist_chunkclean()
vx_inode_free_list()
vx_ifree_scan_list()
vx_workitem_process()
vx_worklist_process()
vx_worklist_thread()
vx_kthread_init()
kernel_thread()

DESCRIPTION:
VxFS internally creates a directory to keep the named streams pertaining to a
file. In some scenarios, an error code path fails to release its hold on that
directory. Because of this, unmounting the file system will not clean the
inode belonging to that directory; later, when VxFS reuses such an inode, the
panic is seen.

RESOLUTION:
Release the hold on the named streams directory in case of an error.
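
The shape of such a fix, sketched with invented names (the real change is
inside the VxFS named-streams code): every exit path taken after the directory
hold is acquired must release it.

    struct vx_inode;
    extern struct vx_inode *vx_nattr_dirget(struct vx_inode *ip); /* takes a hold */
    extern void vx_irele(struct vx_inode *dip);                   /* drops it     */
    extern int  vx_nattr_work(struct vx_inode *dip);

    int vx_nattr_op(struct vx_inode *ip)
    {
        struct vx_inode *dip = vx_nattr_dirget(ip);
        int err;

        if (dip == 0)
            return -1;

        err = vx_nattr_work(dip);
        vx_irele(dip);      /* previously skipped when vx_nattr_work()
                             * failed, leaking the hold until inode reuse */
        return err;
    }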

* 2412169 (Tracking ID: 2371903)

SYMPTOM:
There is an extra empty line in "/proc/devices" when activating file system
checkpoints, as shown by 'tail -6 /proc/devices':
199 VxVM
201 VxDMP
252 vxclonefs-0
                   <<< There is an extra blank line here.
253 device-mapper
254 mdp

DESCRIPTION:
vxfs improperly appends a newline character, i.e. "\n", when composing the
clone's device name. As a result, when the function "register_blkdev" is
called with this device name to register the block device driver, an
additional blank line shows up in "/proc/devices".

RESOLUTION:
The newline is removed when composing the clone device name.
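
A userspace illustration of the one-character nature of the bug (the format
string is invented): the trailing "\n" in the composed name is what
register_blkdev() faithfully echoed into /proc/devices.

    #include <stdio.h>

    int main(void)
    {
        char name[32];
        int idx = 0;

        /* buggy: snprintf(name, sizeof(name), "vxclonefs-%d\n", idx);    */
        snprintf(name, sizeof(name), "vxclonefs-%d", idx); /* no newline  */
        printf("[%s]\n", name);   /* prints [vxclonefs-0], no stray line  */
        return 0;
    }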

* 2412177 (Tracking ID: 2371710)

SYMPTOM:
User quota file corruption occurs when the DELICACHE feature is enabled: the
current inode usage of a user becomes negative after frequent file creations
and deletions. Checking the quota info with the command "vxquota -vu
username", the number of files is "-1", like:

    # vxquota -vu testuser2
    Disk quotas for testuser2 (uid 500):
    Filesystem     usage  quota  limit  timeleft  files  quota  limit  timeleft
    /vol01       1127809 8239104 8239104                    -1      0      0

DESCRIPTION:
This issue was introduced by the inode DELICACHE feature in 5.1SP1, a
performance enhancement that optimizes the updates done to the inode map
during file creations and deletions. The feature is enabled by default and can
be changed with vxtunefs.

When DELICACHE is enabled and quota is set for vxfs, there is an extra quota
update for the inodes on the inactive list during the removal process. Since
these inodes' quota has already been updated before being put on the delicache
list, the current number of user files eventually gets decremented twice.

RESOLUTION:
A flag is added to identify the inodes moved to the inactive list from the
delicache list, so that the quota is not updated again during the removal
process.

* 2412179 (Tracking ID: 2387609)

SYMPTOM:
Quota usage gets set to zero when the file system is unmounted and mounted
again, even though files owned by users exist. This issue may occur after some
file creations and deletions. Checking the quota usage with the "vxrepquota"
command, the output looks like the following:

# vxrepquota -uv /vx/sofs1/
/dev/vx/dsk/sfsdg/sofs1 (/vx/sofs1):
                     Block limits                      File limits
User          used   soft   hard    timeleft    used       soft    hard    timeleft
testuser1 --      0 3670016 4194304                   0      0      0
testuser2 --      0 3670016 4194304                   0      0      0
testuser3 --      0 3670016 4194304                   0      0      0

Additionally, the quota usage may not be updated after the inode/block usage
reaches zero.

DESCRIPTION:
The issue occurs when VxFS merges the external per-node quota files with the
internal quota file. The block offset within the external quota file could be
calculated wrongly in some scenarios: when a hole is found in a per-node quota
file, the file offset is advanced so that it points to the next non-HOLE
offset, but the block offset, which points to the next available quota record
within a block, was not changed accordingly.

VxFS updates per-node quota records only when the global internal quota file
shows some bytes or inode usage; otherwise it does not copy the usage from the
global quota file to the per-node quota file. But in the case where the quota
usage in the external quota files has gone down to zero and both the bytes and
inode usage in the global file become zero, the per-node quota records were
not updated and were left with incorrect usage. The merge should also check
the bytes and inode usage in the per-node quota record, and skip copying
records only when the bytes and inode usage in both the global quota file and
the per-node quota file are zero.

RESOLUTION:
The calculation of the block offset when a hole is found in a per-node quota
file has been corrected.

Code has been added to also check the blocks and inodes usage in the per-node
quota record while updating user quota usage.

* 2412181 (Tracking ID: 2372093)

SYMPTOM:
New fsadm command options, to defragment a given percentage of the available
free space in a file system, have been introduced as part of an initiative to
help improve Cluster File System [CFS] performance. The new additional command
usage is as follows:

     fsadm -C -U <percentage-of-freespace> <mount_point>

We have since found that this new free space defragmentation operation can
sometimes hang (whilst continuing to consume some CPU) in specific
circumstances when executed on a cluster-mounted file system [CFS].

DESCRIPTION:
The hang can occur when file system metadata is being relocated. In our
example case the hang occurs whilst relocating inodes whose corresponding
files are being actively updated via a different node (from the one on which
the fsadm command is being executed) in the cluster. During the relocation, an
error code path is taken due to an unexpected mismatch between temporary
replica metadata, and that code path then results in a deadlock, or hang.

RESOLUTION:
As there is no overriding need to relocate structural metadata for 
the purposes of defragmenting the available freespace, we have
chosen to simply leave all structural metadata where it is when 
performing this operation thus avoiding its relocation. The changes
required for this solution are therefore very low risk.

* 2418819 (Tracking ID: 2283893)

SYMPTOM:
In a Cluster File System (CFS) environment, file read performance can
gradually degrade to 10% of the original read performance, and 'fsadm -F vxfs
-D -E <mount point>' shows a large number (> 70%) of free blocks in extents
smaller than 64k.
For example,
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than  8 blks: 5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into
Allocation Units (AUs). The delegation for these AUs is
cached locally on the nodes.

When an extending write operation is performed on a file,
the file system tries to allocate the requested block from
an AU whose delegation is locally cached, rather than
finding the largest free extent available that matches the
requested size in the other AUs. This leads to a
fragmentation of the free space, thus leading to badly
fragmented files.

RESOLUTION:
The code is modified such that the time for which the
delegation of the AU is cached can be reduced using a
tunable, thus allowing allocations from other AUs with
larger free extents. Also, the fsadm(1M) command is
enhanced to defragment free space using the -C option.

* 2420060 (Tracking ID: 2403126)

SYMPTOM:
A hang is seen in the cluster when one of the nodes leaves or is rebooted. One
of the nodes in the cluster will show the following stack trace:

e_sleep_thread()
vx_event_wait()
vx_async_waitmsg()
vx_msg_send()
vx_send_rbdele_resp()
vx_recv_rbdele+00029C ()
vx_recvdele+000100 ()
vx_msg_recvreq+000158 ()
vx_msg_process_thread+0001AC ()
vx_thread_base+00002C ()
threadentry+000014 (??, ??, ??, ??)

DESCRIPTION:
Whenever a node in the cluster leaves, reconfiguration happens and all the
resources held by the leaving nodes are consolidated. This is done on one node
of the cluster, called the primary node. Each node sends a message to the
primary node about the resources it is currently holding. During this
reconfiguration, in a corner case, VxFS incorrectly calculated a message
length larger than what the GAB (Veritas Group Membership and Atomic
Broadcast) layer can handle. As a result the message got lost: it was dropped
at the sender and never actually sent, while the sender assumed it had been
sent and waited for an acknowledgement. The primary node waiting for this
message would wait forever, and the reconfiguration never completed, leading
to the hang.

RESOLUTION:
The message length calculation is done properly now and GAB can handle the messages.

* 2425429 (Tracking ID: 2422574)

SYMPTOM:
On CFS, after turning the quota on, when any node is rebooted and rejoins the
cluster, it fails to mount the filesystem.

DESCRIPTION:
At the time of mounting the file system after rebooting the node, mntlock was
already set, which did not allow the remount of the file system when quota is
on.

RESOLUTION:
The code is changed so that the mntlock flag is masked in the quota operation,
as it is already set on the mount.

* 2425439 (Tracking ID: 2242630)

SYMPTOM:
Earlier distributions of Linux had a maximum size of memory that could be
allocated via vmalloc().  This throttled the maximum size of VxFS's hash tables,
and so limited the size of the inode and buffer caches.  RHEL5/6 and  SLES10/11 do
not have this limit.

DESCRIPTION:
Limitations in the Linux kernel used to limit the inode and buffer cache for VxFS.

RESOLUTION:
Code is changed to accommodate the change of limits in Linux kernel, hence
modified limits on inode and buffer cache for VxFS.

* 2426039 (Tracking ID: 2412604)

SYMPTOM:
Even after the time limit expires following a user exceeding the soft limit of
the user quota on a VxFS file system, writes over the soft limit are still
permitted.

DESCRIPTION:
The timer was not started when the soft limit was initially set up with the
usage already over the soft limit.

RESOLUTION:
Start the timer during the initial setting of quota limits if current usage has
already crossed the soft quota limits.

* 2427269 (Tracking ID: 2399228)

SYMPTOM:
Occasionally Oracle Archive logs can be created smaller than they should be, 
in the reported case the resultant Oracle Archive logs were incorrectly sized
as 512 bytes.

DESCRIPTION:
The fcntl [file control] command F_FREESP [free storage space] can be utilised
to change the size of a regular file. If the file size is reduced, we call it
a "truncate", and space allocated in the truncated area is returned to the
file system free space pool. If the file size is increased using F_FREESP, we
call it a "truncate-up"; although the file size changes, no space is allocated
in the extended area of the file.

Oracle archive logs utilize the F_FREESP fcntl command to perform a
truncate-up of a new file before a smaller write of 512 bytes [at the start of
the file] is then performed. A timing window was found with F_FREESP which
meant that the 'truncate-up' file size was lost, or rather overwritten, by the
subsequent write of the data, thus causing the file to appear with a size of
just 512 bytes.

RESOLUTION:
A timing window has been closed whereby the flush of the allocating [512-byte]
write was triggered after the new F_FREESP file size had been updated in the
inode.
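
For reference, a truncate-up via F_FREESP has the following shape on platforms
whose fcntl supports it (a hedged sketch; F_FREESP availability and semantics
vary by platform): l_start set beyond the current EOF grows the file size
without allocating space.

    #include <fcntl.h>
    #include <string.h>

    int truncate_up(int fd, long long newsize)
    {
        struct flock fl;

        memset(&fl, 0, sizeof(fl));
        fl.l_whence = 0;          /* l_start is an absolute offset       */
        fl.l_start  = newsize;    /* free storage from this offset ...   */
        fl.l_len    = 0;          /* ... through end of file             */
        return fcntl(fd, F_FREESP, &fl); /* extends size, allocates none */
    }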

* 2427281 (Tracking ID: 2413172)

SYMPTOM:
The vxfs_fcl_seektime() API seeks to the first record in the File Change Log
(FCL) file after the specified time. This API can incorrectly return an EINVAL
(FCL record not found) error while reading the first block of the FCL file.

DESCRIPTION:
To seek to the first record after the given time, a binary search is first
performed to find the largest block offset where the FCL record time is less
than the given time. Then a linear search from this offset finds the first
record whose time value is greater than the specified time.

FCL records are read in buffers. There are scenarios where the FCL records
read into one buffer amount to less than the buffer size, e.g. when reading
the first block of the FCL file. In such scenarios, the buffer read could
continue even when all the data in the current buffer had been read, because
of a wrong check deciding whether all records in the buffer had been read.
Reading the buffer beyond its boundary caused the search to terminate without
finding the record for the given time, hence the EINVAL error.

VxFS should instead detect that the buffer is only partially filled and
continue the search in the next buffer.

RESOLUTION:
The check which decides whether all records in a buffer have been read is
corrected so that the buffer is read within its boundaries.

* 2430679 (Tracking ID: 1892045)

SYMPTOM:
A large amount of slab-1024 memory is consumed, which can be checked in
/proc/slabinfo. For example,
# cat /proc/slabinfo | grep "size-1024 "
size-1024          51676  51700   1024    4    1 : tunables   54   27    8 :
slabdata  12925  12925      0
where 51700 slabs are allocated with a 1024-byte unit size.

DESCRIPTION:
VxFS creates some fake inodes for various background and bookkeeping tasks for
which the kernel wants VxFS to have an inode that doesn't strictly have to be
a real file. The number of fake inodes is computed from NR_CPUS. If the kernel
has a large NR_CPUS, slab-1024 consumption grows accordingly.

RESOLUTION:
The fake inodes are now allocated based on the kernel's num_possible_cpus()
routine, which helps reduce the slab-1024 count to the order of thousands.
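
A kernel-flavored sketch of the change (FAKE_INODES_PER_CPU is a made-up
constant; only num_possible_cpus() and NR_CPUS are real kernel symbols):

    #include <linux/cpumask.h>

    #define FAKE_INODES_PER_CPU 4   /* illustrative constant */

    static unsigned int vx_fake_inode_count(void)
    {
        /* before: NR_CPUS * FAKE_INODES_PER_CPU; NR_CPUS is the distro's
         * compile-time ceiling (often large), not the hardware's reality */
        return num_possible_cpus() * FAKE_INODES_PER_CPU;
    }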

* 2478237 (Tracking ID: 2384861)

SYMPTOM:
The following asserts are seen during internal stress and regression runs 
f:vx_do_filesnap:1b
f:vx_inactive:2a
f:xted_check_rwdata:31
f:vx_do_unshare:1

DESCRIPTION:
These asserts validate certain assumptions in various functions; there were
also some miscellaneous issues seen during internal testing.

RESOLUTION:
The code has been modified to fix the internally reported issues, along with
other miscellaneous changes.

* 2478325 (Tracking ID: 2251015)

SYMPTOM:
The fsck(1M) command can take a long time to complete.

DESCRIPTION:
In an extreme case, such as a 2TB file system with a 1KB block size, 130+
checkpoints, and 100-250 million inodes per fileset, the fsck(1M) command
takes 15+ hours to complete intent log replay, because it has to read a few GB
of IAU headers and summaries one synchronous block at a time.

RESOLUTION:
The fsck code is changed to perform read-ahead on the IAU file, which reduces
the fsck log-replay time.

* 2480949 (Tracking ID: 2480935)

SYMPTOM:
The system log file may contain the following error message in a
multi-threaded environment with Dynamic Storage Tiering (DST):
<snip>
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index
creation failure for /vx/fsvm with message Argument list too long
</snip>

DESCRIPTION:
In DST, while enforcing a policy, SQL queries are generated and written to the
file .__fsppadm_enforcesql present in lost+found. In a multi-threaded
environment, 16 threads work on the ILIST in parallel, generating SQL queries
and writing them to the file. This may corrupt the file if multiple threads
write to it simultaneously.

RESOLUTION:
A mutex is used to serialize the threads' writes to the SQL file.
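
A minimal pthread sketch of the idea (names invented, not the fsppadm source):
one mutex around each append keeps records from the 16 worker threads from
interleaving.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sql_lock = PTHREAD_MUTEX_INITIALIZER;
    static FILE *sqlfp;           /* shared .__fsppadm_enforcesql handle */

    void append_query(const char *query)
    {
        pthread_mutex_lock(&sql_lock);
        fputs(query, sqlfp);      /* record written whole, never        */
        fputc('\n', sqlfp);       /* interleaved with another thread's  */
        pthread_mutex_unlock(&sql_lock);
    }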

* 2482337 (Tracking ID: 2431674)

SYMPTOM:
Panic in vx_common_msgprint() via vx_inactive().

DESCRIPTION:
The problem is that a VX_CMN_ERR() call uses an "llx" format specifier, which
vx_common_msgprint() doesn't understand. It gives up trying to process that
specifier but continues on without consuming the corresponding parameter.
Everything else in the parameter list is effectively shifted by 8 bytes, and
by the time the string argument is processed, it's game over.

RESOLUTION:
Changed the format to "llu", which vx_common_msgprint() understands.
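
A toy userspace analogue of the failure mode (not the kernel printer itself):
a varargs consumer that skips an unrecognized specifier without a matching
va_arg() leaves every later conversion reading the wrong slot.

    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>

    /* Understands only %llu and %s, like a simplified vx_common_msgprint(). */
    static void msgprint(const char *fmt, ...)
    {
        va_list ap;
        const char *p;

        va_start(ap, fmt);
        for (p = fmt; *p; p++) {
            if (*p != '%') { putchar(*p); continue; }
            if (strncmp(p, "%llu", 4) == 0) {
                printf("%llu", va_arg(ap, unsigned long long)); p += 3;
            } else if (strncmp(p, "%s", 2) == 0) {
                printf("%s", va_arg(ap, char *)); p += 1;
            }
            /* an unknown specifier (e.g. %llx) MUST still consume its
             * argument here, or the next %s fetches garbage */
        }
        va_end(ap);
    }

    int main(void)
    {
        msgprint("inode %llu on %s\n", 42ULL, "/dev/vx/dsk/dg/vol");
        return 0;
    }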

* 2482344 (Tracking ID: 2424240)

SYMPTOM:
In the case of deduplication, when the FS block size is larger and there is a
partial block match, we end up sharing the block anyway, resulting in file
corruption.

DESCRIPTION:
This is because, to handle slivers, the matched length was rounded up to a
block boundary.

RESOLUTION:
The code is changed to round down the length to be deduplicated so that it
aligns with the fs block size.
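
The corrected rounding, sketched with invented names (assuming a power-of-two
block size):

    #include <stdio.h>

    unsigned long dedup_len(unsigned long matched, unsigned long blksz)
    {
        /* buggy:  (matched + blksz - 1) & ~(blksz - 1)  -- rounds up,
         *         sharing a block whose tail did not actually match */
        return matched & ~(blksz - 1);       /* fixed: round down    */
    }

    int main(void)
    {
        /* 10000 matched bytes on an 8K-block fs: share one block, not two */
        printf("%lu\n", dedup_len(10000, 8192));  /* prints 8192 */
        return 0;
    }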

* 2484815 (Tracking ID: 2440584)

SYMPTOM:
The system panics when force unmounting the file system. The backtrace looks
like:

machine_kexec
crash_kexec
__die
do_page_fault
error_exit
vx_sync
vx_sync_fs
sync_filesystems
do_sync
sys_sync
system_call

or

machine_kexec
crash_kexec
__die
do_page_fault
error_exit
sync_filesystems
do_sync
sys_sync
system_call

DESCRIPTION:
When we do a force unmount, VxFS frees some memory structures which can still
be referenced by the sync code path. Thus there is a window that allows a race
between the sync command and the force unmount, as shown in the first
backtrace.

For the second panic: prior to the completion of the force unmount,
sync_filesystems could invoke the sync_fs callback (one member of
super_operations) of VxFS, which could already have been set to NULL by the
force unmount.

RESOLUTION:
The vx_sync function no longer performs a real sync if it detects that a force
unmount is in progress. Dummy functions, instead of NULL pointers, are
installed in the vxfs super_operations during the force unmount.
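
A kernel-style sketch of the second half of the fix (not the actual VxFS
source; the 2.6-era sync_fs signature is assumed): harmless stubs replace the
NULL pointers so a racing sync_filesystems() call lands somewhere safe.

    #include <linux/fs.h>

    static int vx_dummy_sync_fs(struct super_block *sb, int wait)
    {
        return 0;       /* nothing to sync: the fs is being torn down */
    }

    static const struct super_operations vx_dummy_sop = {
        .sync_fs = vx_dummy_sync_fs,
        /* ... the other entries stubbed the same way ... */
    };

    /* during the forced unmount:  sb->s_op = &vx_dummy_sop;  */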

* 2486597 (Tracking ID: 2486589)

SYMPTOM:
Multiple threads may wait on a mutex owned by a thread that is in the function
vx_ireuse_steal(), with the following stack trace, on a machine under severe
inode pressure:
<snip>
vx_ireuse_steal()
vx_ireuse()
vx_iget()
</snip>

DESCRIPTION:
Several threads are waiting to get inodes from VxFS. The current number of
inodes has reached the maximum number of inodes (vxfs_ninode) that can be
created in memory, so no new allocations are possible and the threads wait.

RESOLUTION:
The code is modified so that in such a situation, threads return ENOINODE
instead of retrying to get inodes.

* 2494464 (Tracking ID: 2247387)

SYMPTOM:
The internal local mount test noise.fullfsck.N4 hit the assert
vx_ino_update:2, with a stack trace like the following:

<snip>
panic: f:vx_ino_update:2
Stack Trace:
  IP                  Function Name
  0xe0000000023d5780  ted_call_demon+0xc0
  0xe0000000023d6030  ted_assert+0x130
  0xe000000000d66f80  vx_ino_update+0x230
  0xe000000000d727e0  vx_iupdat_local+0x13b0
  0xe000000000d638b0  vx_iupdat+0x230
  0xe000000000f20880  vx_tflush_inode+0x210
  0xe000000000f1fc80  __vx_fsq_flush___vx_tran.c__4096000_0686__+0xed0
  0xe000000000f15160  vx_tranflush+0xe0
  0xe000000000d2e600  vx_tranflush_threaded+0xc0
  0xe000000000d16000  vx_workitem_process+0x240
  0xe000000000d15ca0  vx_worklist_thread+0x7f0
  0xe000000001471270  kthread_daemon_startup+0x90
End of Stack Trace
</snip>

DESCRIPTION:
The INOILPUSH flag was not set when the inode was being updated, which caused
the above assert. The problem was that creation and deletion of a clone resets
the INOILPUSH flag, and the function vx_write1_fast() did not set the flag
after updating the inode and file.

RESOLUTION:
The code is modified so that if the INOILPUSH flag is not set in
vx_write1_fast(), the function sets it.

* 2508164 (Tracking ID: 2481984)

SYMPTOM:
Access to the file system hangs.

DESCRIPTION:
The function 'vx_setqrec' calls 'vx_dqget'. When 'vx_dqget' returns an error,
'vx_setqrec' tries to unlock the DQ structure using 'VX_DQ_CLUSTER_UNLOCK'.
But in this situation the DQ structure does not hold the lock, hence the hang.

RESOLUTION:
'dq_inval' is now set in 'vx_dqget' if any error occurs there. Unlocking the
DQ structure is skipped in the error code path of 'vx_setqrec' if 'dq_inval'
is set.

* 2521514 (Tracking ID: 2177591)

SYMPTOM:
System panic in vx_softcnt_flush() with a stack like the following:

 #4 [ffff8104981a7d08] generic_drop_inode()
 #5 [ffff8104981a7d28] vx_softcnt_flush()
 #6 [ffff8104981a7d58] vx_ireuse_clean()
 #7 [ffff8104981a7d88] vx_ilist_chunkclean()
 #8 [ffff8104981a7df8] vx_inode_free_list()
 #9 [ffff8104981a7e38] vx_ifree_scan_list()
#10 [ffff8104981a7e48] vx_workitem_process()
#11 [ffff8104981a7e58] vx_worklist_process()
#12 [ffff8104981a7ed8] vx_worklist_thread()
#13 [ffff8104981a7ee8] vx_kthread_init()
#14 [ffff8104981a7f48] kernel_thread()

DESCRIPTION:
The panic occurred in iput() (called from vx_softcnt_flush()), which was
invoked to drop the softcount hold (i_count) held by VxFS. The panicking
thread was cleaning an inode on the freelist, and the inode being cleaned
belonged to a file system which had already been unmounted. The FS superblock
structure is freed after the vxfs unmount returns, irrespective of whether the
unmount was successful or failed, as Linux always expects umount to succeed.

The unmount of this FS was not clean, i.e. the detach of the FS was failing
with an EBUSY error. Detach of an FS can fail with EBUSY if there are busy
inodes, i.e. inodes with pending operations. During an unmount operation, all
the inodes belonging to that file system are cleaned, but as the unmount of
the FS was not successful, the inode processing for the FS did not complete.
When a background thread later picked up inodes of this unmounted FS, the
panic happened while accessing the superblock structure, which had been freed.

RESOLUTION:
The check for busy inodes during unmount happens before all the inodes in the
inode cache are processed, so inodes can be left on the freelist without being
cleaned up during the unmount. Now, after some failed attempts to detach the
fileset, detach is called with the force option. This is more aggressive in
tearing down the fileset and therefore helps to clear the fileset's inodes.

* 2529356 (Tracking ID: 2340953)

SYMPTOM:
During internal stress test, f:vx_iget:1a assert is seen.

DESCRIPTION:
While renaming a file, we check whether the target directory is in the path of
the source file to be renamed. While using the function vx_iget() to walk up
to the root inode, one of the parent directory inode numbers was 0, hence the
assert.

RESOLUTION:
The code is changed so that during renames, the parent directory is first
assigned the correct inode number before vx_iget() is used to retrieve the
root inode.



INSTALLING THE PATCH
--------------------
#rpm -Uvh VRTSvxfs-5.1.134.100-SP1RP4P1_RHEL6.x86_64.rpm

For new installations of SFHA 5.1SP1RP4 on RHEL 6.5, use the following
high-level steps:
1) Install SFHA 5.1SP1 using the CPI installer utility (accompanies the 5.1SP1
   installation s/w) with the "-install" option to install the product with no
   configuration. If requested to configure the s/w, answer no.
   ./installer -install
2) Install the RP4 patch using the CPI installer (accompanies the RP4 s/w)
   with the "-install" option:
   ./installrp -install
3) Install the two patches above:
   rpm -Uvh VRTSodm-5.1.134.100-SP1RP4P1_RHEL6.x86_64.rpm
   rpm -Uvh VRTSvxfs-5.1.134.100-SP1RP4P1_RHEL6.x86_64.rpm
4) Use the CPI product installer with the "-configure" option to complete the
   configuration of the product:
   /opt/VRTS/install/install<productname> -configure
   where <productname> is the SFHA product being installed, e.g. for CFS:
   installsfcfs


REMOVING THE PATCH
------------------
#rpm -e rpm_name


SPECIAL INSTRUCTIONS
--------------------
NONE


OTHERS
------
NONE