README VERSION       : 1.1
README CREATION DATE : 2012-09-26
PATCH-ID             : PHKL_43127
PATCH NAME           : VRTSvxfs 5.1SP1RP2
BASE PACKAGE NAME    : VRTSvxfs
BASE PACKAGE VERSION : 5.1
SUPERSEDED PATCHES   : PHKL_42970
REQUIRED PATCHES     : PHCO_43172
INCOMPATIBLE PATCHES : NONE
SUPPORTED PADV       : hpux1131 (P-PLATFORM , A-ARCHITECTURE , D-DISTRIBUTION , V-VERSION)
PATCH CATEGORY       : CORRUPTION , HANG , PANIC
PATCH CRITICALITY    : CRITICAL
HAS KERNEL COMPONENT : YES
ID                   : NONE
REBOOT REQUIRED      : YES

PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Please refer to the Release Notes for install instructions.

PATCH UNINSTALLATION INSTRUCTIONS:
----------------------------------
Please refer to the Release Notes for uninstall instructions.

SPECIAL INSTRUCTIONS:
---------------------
NONE

SUMMARY OF FIXED ISSUES:
-----------------------------------------
2726056 (2709869) System panic with redzone violation when vx_free() tried to free fiostat
2851320 (2839871) Process hung in vx_extentalloc_delicache
2852690 (2373266) vx_lookup from NFS on a regular file can cause system hang on 5.1SP1 on HP 11.31
2855862 (2694208) Added 64-bit quota support
2855898 (2845175) When the Access Control List (ACL) feature is enabled, the system may panic with "Data Key Miss Fault in KERNEL mode"
2900974 (2841059) Full fsck fails to clear the corruption in attribute inode 15
2925919 (2925918) Checkpoint promotion may lead to deadlock

SUMMARY OF KNOWN ISSUES:
-----------------------------------------

KNOWN ISSUES:
--------------

FIXED INCIDENTS:
----------------

PATCH ID:PHKL_43127

* INCIDENT NO:2726056 TRACKING ID:2709869

SYMPTOM:
The system panics with a redzone violation while releasing an inode's File
Input/Output (FIO) statistics structure, and the following stack trace is
displayed:

kmem_error()
kmem_cache_free_debug()
kmem_cache_free()
vx_fiostats_alloc()
fdd_common1()
fdd_odm_open()
odm_vx_open()
odm_ident_init()
odm_identify()
odmioctl()
fop_ioctl()
ioctl()

DESCRIPTION:
Different types of statistics are maintained when a file is accessed in Quick
Input/Output (QIO) and non-QIO mode. Some common statistics are copied when
the file access mode is changed from QIO to non-QIO or vice versa. While
switching from QIO to non-QIO mode, the QIO statistics structure is freed and
an FIO statistics structure is allocated to maintain file-level FIO
statistics. There is a race between the thread that frees the QIO statistics
and allocates the FIO statistics, and a thread that updates the QIO statistics
when the file is opened in QIO mode. The FIO statistics get corrupted because
the updating thread writes to the structure assuming that the QIO statistics
are still allocated.

RESOLUTION:
The code is modified to protect the allocation and release of the FIO/QIO
statistics using the read-write lock and spin lock for the file statistics
structure.
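The race in incident 2709869 is a classic use-after-free between a thread swapping a statistics buffer and a thread still updating it. Below is a minimal user-space sketch of the locking pattern the resolution describes, with a pthread mutex standing in for the kernel spin lock; all structure and function names are illustrative, not the actual VxFS symbols.

```c
#include <pthread.h>
#include <stdlib.h>
#include <assert.h>

/* Simplified model of the fix: the per-file statistics pointer is only
 * swapped between QIO and FIO mode while holding the file-statistics
 * lock, so a concurrent updater can never write through a pointer that
 * is in the middle of being freed. */
struct file_stats {
    pthread_mutex_t lock;   /* stands in for the file-statistics spin lock */
    int qio_mode;           /* nonzero while the file is open in QIO mode */
    long *stats;            /* current QIO or FIO statistics buffer */
};

void stats_init(struct file_stats *fs, int qio_mode)
{
    pthread_mutex_init(&fs->lock, NULL);
    fs->qio_mode = qio_mode;
    fs->stats = calloc(4, sizeof(long));
}

/* Switch QIO <-> FIO: free the old buffer and install the new one
 * atomically with respect to stats_update(). */
void stats_switch_mode(struct file_stats *fs, int qio_mode)
{
    pthread_mutex_lock(&fs->lock);
    if (fs->qio_mode != qio_mode) {
        free(fs->stats);
        fs->stats = calloc(4, sizeof(long));
        fs->qio_mode = qio_mode;
    }
    pthread_mutex_unlock(&fs->lock);
}

/* Updaters take the same lock, so they see either the old buffer before
 * the switch or the new one after it, never a freed one. */
void stats_update(struct file_stats *fs, int slot)
{
    pthread_mutex_lock(&fs->lock);
    fs->stats[slot]++;
    pthread_mutex_unlock(&fs->lock);
}
```

Because the mode switch and the update path serialize on the same lock, the corruption window described above cannot occur in this model.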
* INCIDENT NO:2851320 TRACKING ID:2839871

SYMPTOM:
On a system with DELICACHE enabled, several file system operations may hang
with the following stack trace:

vx_delicache_inactive
vx_delicache_inactive_wp
vx_workitem_process
vx_worklist_process
vx_worklist_thread
vx_kthread_init

DESCRIPTION:
The DELICACHE lock is used to synchronize access to the DELICACHE list and is
held only while updating this list. However, in some cases it is held longer
and is released only after the issued I/O completes, causing other threads to
hang.

RESOLUTION:
The code is modified to release the spinlock before issuing a blocking I/O
request.

* INCIDENT NO:2852690 TRACKING ID:2373266

SYMPTOM:
When the partition directory feature is enabled, a lookup operation from
Network File System (NFS) on a regular file can cause a system hang, and the
following stack trace is displayed:

vx_rwsleep_rec_lock_em()
vx_recsmp_rangelock()
vx_irwlock()
vx_lookup_pd()
vx_lookup()

DESCRIPTION:
During a lookup operation, when a regular file is passed in by the NFS layer,
the file is wrongly identified as a partition directory. The operation goes
into an endless loop trying to read the file.

RESOLUTION:
The code is modified to return an ENOTDIR error if the lookup operation is not
being performed on a directory inode.

* INCIDENT NO:2855862 TRACKING ID:2694208

SYMPTOM:
64-bit quotas (1M) were not supported by VxFS.

DESCRIPTION:
Support for 64-bit quotas was not provided in the kernel.

RESOLUTION:
The kernel is enhanced to support 64-bit quotas.

* INCIDENT NO:2855898 TRACKING ID:2845175

SYMPTOM:
When the Access Control List (ACL) feature is enabled, the system may panic
with a "Data Key Miss Fault in KERNEL mode" error message in the
vx_do_getacl() function, and the following stack trace is displayed:

vx_do_getacl+0x840 ()
vx_getacl+0x70 ()
acl+0x480 ()

DESCRIPTION:
In the vx_do_getacl() function, a local variable is accessed without being
initialized, leading to a panic.

RESOLUTION:
The code is modified to initialize the local variable to NULL before using it.

* INCIDENT NO:2900974 TRACKING ID:2841059

SYMPTOM:
The file system gets marked for a full fsck operation and the following
message is displayed in the system log:

V-2-96: vx_setfsflags file system fullfsck flag set - vx_ierror
vx_setfsflags+0xee/0x120
vx_ierror+0x64/0x1d0 [vxfs]
vx_iremove+0x14d/0xce0
vx_attr_iremove+0x11f/0x3e0
vx_fset_pnlct_merge+0x482/0x930
vx_lct_merge_fs+0xd1/0x120
vx_lct_merge_fs+0x0/0x120
vx_walk_fslist+0x11e/0x1d0
vx_lct_merge+0x24/0x30
vx_workitem_process+0x18/0x30
vx_worklist_process+0x125/0x290
vx_worklist_thread+0x0/0xc0
vx_worklist_thread+0x6d/0xc0
vx_kthread_init+0x9b/0xb0
V-2-17: vx_iremove_2 : file system inode 15 marked bad incore

DESCRIPTION:
Due to a race condition, a thread tries to remove an attribute inode that has
already been removed by another thread. Hence, the file system is marked for a
full fsck operation and the attribute inode is marked as 'bad ondisk'.

RESOLUTION:
The code is modified to check whether the attribute inode that a thread is
trying to remove has already been removed.

* INCIDENT NO:2925919 TRACKING ID:2925918

SYMPTOM:
Checkpoint promotion may lead to a deadlock, with a stack trace similar to:

vx_init_overlay()
vx_reread_inode()
vx_assume_iowner()
vx_hlock_getdata()
vx_glm_cbfunc()
vx_glmlist_thread()
thread_start()

DESCRIPTION:
During checkpoint promotion, the ilist on secondary nodes may remain stale,
resulting in infinite looping in the vx_init_overlay() function.

RESOLUTION:
The code is modified to refresh the in-core ilist on the secondary nodes
during checkpoint promotion.

PATCH ID:PHKL_42970

* INCIDENT NO:2340794 TRACKING ID:2086902

SYMPTOM:
The performance of a system with Veritas File System (VxFS) is affected due to
high contention for a spinlock.
DESCRIPTION:
The contention occurs because there are a large number of work items on these
systems, and the work items are enqueued and dequeued from the global list
individually.

RESOLUTION:
The code is modified to process the work items by bulk enqueue/dequeue to
reduce the VxFS worklist lock contention.

* INCIDENT NO:2693008 TRACKING ID:2693010

SYMPTOM:
The formatted uncompressed files created by the catman(1M) command for Veritas
File System (VxFS) manual pages remain available even after the removal of the
VRTSvxfs package.

DESCRIPTION:
Currently, the formatted uncompressed files created by the catman(1M) command
for VxFS manual pages are not removed during the removal/uninstall of patches.

RESOLUTION:
The code is modified in the postremove and preinstall/postinstall packaging
scripts to clean up any remaining formatted uncompressed files created by the
catman(1M) command.

* INCIDENT NO:2715030 TRACKING ID:2715028

SYMPTOM:
The fsadm(1M) command with the '-d' option may hang when compacting a
directory if it is run on the Cluster File System (CFS) secondary node while
the find(1) command is running on any other node.

DESCRIPTION:
During the compacting of a directory, the CFS secondary node has ownership of
the inode of the directory. To complete the compacting, a truncation message
needs to be processed on the CFS primary node, and for that to occur the CFS
primary node needs ownership of the inode of the directory. This causes a
deadlock.

RESOLUTION:
The code is modified to force the processing of the truncation message on the
CFS secondary node which initiated the compacting of the directory.

* INCIDENT NO:2725995 TRACKING ID:2566875

SYMPTOM:
The write(2) operation exceeding the quota limit fails with an EDQUOT error
("Disc quota exceeded") before the user quota limit is reached.

DESCRIPTION:
When a write request exceeds a quota limit, the EDQUOT error should be handled
so that Veritas File System (VxFS) can allocate space up to the hard quota
limit and proceed with a partial write. However, VxFS does not handle this
error, and an error is returned without performing a partial write.

RESOLUTION:
The code is modified to handle the EDQUOT error from the extent allocation
routine.

* INCIDENT NO:2726010 TRACKING ID:2651922

SYMPTOM:
On a local VxFS file system, the ls(1M) command with the '-l' option runs
slowly, and high CPU usage is observed.

DESCRIPTION:
Currently, Cluster File System (CFS) inodes are not allowed to be reused as
local inodes, to avoid a Global Lock Manager (GLM) deadlock issue when a
Veritas File System (VxFS) reconfiguration is in process. Hence, if a VxFS
local inode is needed and the free lists are almost filled up with CFS inodes,
all the inode free lists need to be traversed to find a local inode.

RESOLUTION:
The code is modified to add a global variable, 'vxi_icache_cfsinodes', to
count the CFS inodes in the inode cache. The condition for converting a
cluster inode to a local inode is relaxed when the number of in-core CFS
inodes is greater than the 'vx_clreuse_threshold' threshold and no
reconfiguration is in progress.

* INCIDENT NO:2726018 TRACKING ID:2670022

SYMPTOM:
Duplicate file names can be seen in a directory.

DESCRIPTION:
Veritas File System (VxFS) maintains an internal Directory Name Lookup Cache
(DNLC) to improve the performance of directory lookups. A race condition
occurs in the DNLC list-manipulation code during the lookup/creation of file
names longer than 32 characters (which further affects other file creations).
This causes the DNLC to hold a stale entry for an existing file in the
directory. A lookup of such a file through the DNLC does not find the file,
and allows another, duplicate file with the same name to be created in the
directory.
RESOLUTION:
The code is modified to fix the race condition by protecting the DNLC lists
with proper locks.

* INCIDENT NO:2726031 TRACKING ID:2684573

SYMPTOM:
The performance of the cfsumount(1M) command from the VRTScavf package is slow
when some checkpoints are deleted.

DESCRIPTION:
When a checkpoint is removed asynchronously, a kernel thread is started to
handle the job in the background. If an unmount command is issued before these
checkpoint removal jobs are completed, the command waits for the completion of
these jobs. A forced unmount can interrupt the process of checkpoint deletion,
and the remaining work is left to the next mount.

RESOLUTION:
The code is modified to add a counter to the vxfsstat(1M) command to determine
the number of checkpoint removal threads in the kernel. The '-c' option is
added to the cfsumount(1M) command to force unmount a mounted file system if
the checkpoint jobs are running.

* INCIDENT NO:2726047 TRACKING ID:2696067

SYMPTOM:
When the getaccess() command is issued on a file which inherits the default
Access Control List (ACL) entries from its parent, it shows incorrect group
object permissions.

DESCRIPTION:
If a newly created file leverages the ACL entries of its parent directory, the
vx_daccess() function does not fabricate a GROUP_OBJ entry, unlike the
vx_do_getacl() function.

RESOLUTION:
The code is modified to fabricate a GROUP_OBJ entry.

* INCIDENT NO:2726056 TRACKING ID:2709869

SYMPTOM:
The system panics with a redzone violation while releasing an inode's File
Input/Output (FIO) statistics structure, and the following stack trace is
displayed:

kmem_error()
kmem_cache_free_debug()
kmem_cache_free()
vx_fiostats_alloc()
fdd_common1()
fdd_odm_open()
odm_vx_open()
odm_ident_init()
odm_identify()
odmioctl()
fop_ioctl()
ioctl()

DESCRIPTION:
Different types of statistics are maintained when a file is accessed in Quick
Input/Output (QIO) and non-QIO mode. Some common statistics are copied when
the file access mode is changed from QIO to non-QIO or vice versa. While
switching from QIO to non-QIO mode, the QIO statistics structure is freed and
an FIO statistics structure is allocated to maintain file-level FIO
statistics. There is a race between the thread that frees the QIO statistics
and allocates the FIO statistics, and a thread that updates the QIO statistics
when the file is opened in QIO mode. The FIO statistics get corrupted because
the updating thread writes to the structure assuming that the QIO statistics
are still allocated.

RESOLUTION:
The code is modified to protect the allocation and release of the FIO/QIO
statistics using the read-write lock and spin lock for the file statistics
structure.

* INCIDENT NO:2726061 TRACKING ID:2715186

SYMPTOM:
The system panics with the following stack trace:

panic+0x130
spinlocks_held_leaving_processor+0x4c
slpq_swtch_core+0x3cc
sleep_spinunlock+0x214
vx_logflush+0x1a0
vx_iupdat_tran+0x1bc
vx_iunlock+0x248
vx_tranimdone+0xa4c
vx_trancommit2+0xc8
vx_trancommit+0xba4
vx_write_alloc3+0x5b4
vx_tran_write_alloc+0x64
vx_write_alloc2+0x60
vx_external_alloc+0xc0
vx_write_alloc+0x578
vx_write1+0xfb4
vx_rdwr+0x1890
vno_rw+0x64
rwuio+0x11c
write+0x68
syscall+0x424
$syscallrtn+0x0

DESCRIPTION:
The panic occurs because the DELICACHE lock and the inode lock are released in
the wrong order. The DELICACHE lock should not be held while trying to unlock
the inode lock.

RESOLUTION:
The code is modified to correct the order of unlock calls for the DELICACHE
lock and the inode lock. The DELICACHE lock is released before the inode lock
to avoid the panic.

* INCIDENT NO:2748461 TRACKING ID:2730894

SYMPTOM:
Currently, only the binaries that have the fixes for reported incidents are
shipped.

DESCRIPTION:
Earlier, only changed binaries were added. From this patch onwards, all the
binaries in VxFS 5.1V3 are shipped.

RESOLUTION:
The packaging is modified to ship all the binaries from the 5.1V3 SP1RP1P2
patch onwards.
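The ordering rule behind the fix for incident 2715186 above (drop the DELICACHE spinlock before releasing the inode lock, because releasing the inode lock may flush the log and sleep) can be sketched in a simplified user-space model. Pthread mutexes stand in for the kernel locks, and all names here are illustrative, not VxFS's own.

```c
#include <pthread.h>
#include <assert.h>

struct inode_model {
    pthread_mutex_t ilock;  /* stands in for the inode lock */
    int updates;
};

/* stands in for the DELICACHE spinlock protecting the DELICACHE list */
static pthread_mutex_t delicache_lock = PTHREAD_MUTEX_INITIALIZER;

/* The bug was releasing the inode lock while still holding the
 * DELICACHE spinlock; the unlock path can sleep, and sleeping while
 * holding a spinlock panics the kernel. The corrected order, shown
 * here, drops the DELICACHE lock first. Caller must hold ip->ilock. */
void finish_update(struct inode_model *ip)
{
    pthread_mutex_lock(&delicache_lock);
    /* ... update the DELICACHE list ... */
    pthread_mutex_unlock(&delicache_lock);  /* drop the spinlock first */

    ip->updates++;
    pthread_mutex_unlock(&ip->ilock);       /* may sleep; safe now */
}
```

In user space the wrong order merely risks a lock-ordering deadlock; in the kernel, as the stack trace shows, sleeping in `vx_logflush()` with a spinlock held is an immediate panic.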
* INCIDENT NO:2752607 TRACKING ID:2745357

SYMPTOM:
Performance enhancements are made for the read/write operation on Veritas File
System (VxFS) structural files.

DESCRIPTION:
The read/write performance of VxFS structural files is affected when the
piggyback data in the vx_populate_bpdata() function is ignored. This occurs if
the buffer type is not set properly, consequently requiring another disk I/O
to fetch the same data.

RESOLUTION:
The code is modified so that the piggyback data is not ignored if it is of
type VX_IOT_ATTR in the vx_populate_bpdata() function, leading to an
improvement in read/write performance for VxFS structural files.

* INCIDENT NO:2755784 TRACKING ID:2730759

SYMPTOM:
The sequential read performance is poor because of read-ahead issues.

DESCRIPTION:
Read-ahead on sequential reads is performed incorrectly because the wrong
read-advisory and read-ahead pattern offsets are used to detect and perform
the read-ahead. Also, more sync reads are performed, which can affect
performance.

RESOLUTION:
The code is modified so that the read-ahead pattern offsets are updated
correctly to detect and perform the read-ahead at the required offsets. The
read-ahead detection is also modified to reduce the sync reads.

* INCIDENT NO:2765308 TRACKING ID:2753944

SYMPTOM:
The file creation threads can hang. The following stack trace is displayed:

cv_wait+0x38
vx_rwsleep_rec_lock+0xa4
vx_recsmp_rangelock+0x14
vx_irwlock2
vx_irwlock+0x34
vx_fsetialloc+0x98
vx_noinode+0xe4
vx_dircreate_tran+0x7d4
vx_pd_create+0xbb8
vx_create1_pd+0x818
vx_do_create+0x80
vx_create1+0x2f8
vx_create+0x158
fop_create+0x34
lo_create+0x138
fop_create+0x34
vn_createat+0x590
vn_openat+0x138
copen+0x260()

DESCRIPTION:
The Veritas File System (VxFS) uses Inode Allocation Units (IAUs) to keep
track of allocated and free inodes. Two counters are maintained per IAU: one
for the number of free regular inodes in that IAU, and one for the number of
free directory inodes. A global in-memory counter is also maintained to track
the total number of free inodes across all the IAUs in the file system. The
creation threads refer to this global counter to quickly check the number of
free inodes at any given time. Every time an inode is allocated, this global
count is decremented; similarly, it is incremented when an inode is freed. The
hang is caused when the global counter unexpectedly becomes negative, which
confuses the file creation threads. The global counter is calculated by adding
the per-IAU counters at mount time. As the code is multi-threaded, any
modification of the global counter must be guarded by a summary lock, which is
missing in the multi-threaded code. Therefore, the calculation goes wrong, the
global counter and the per-IAU counters fall out of sync, and the resulting
negative value causes the file creation threads to hang.

RESOLUTION:
The code is modified to update the global inode free count under the
protection of the summary lock.

* INCIDENT NO:2779617 TRACKING ID:2779609

SYMPTOM:
The creation of directories hangs during an internal conformance test.

DESCRIPTION:
While creating directories in the case of partitioned directories, if the
vx_maxlink tunable value is changed, the creation of directories hangs due to
wrong checks for the locally mounted file system.

RESOLUTION:
The code is modified to avoid the usage of lease and delta in the partitioned
directories for local mounts.
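The counter discipline described for incident 2753944 above can be sketched as a minimal user-space model: every change to the global free-inode count goes through the summary lock, so it can never drift out of sync with the per-IAU counters and go negative. All names are illustrative, not VxFS's own.

```c
#include <pthread.h>
#include <assert.h>

/* stands in for the summary lock guarding the global free-inode count */
static pthread_mutex_t summary_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_free_inodes;

/* At mount time, each IAU's free count is folded into the global
 * counter; the bug was doing this sum without the summary lock. */
void iau_mounted(long per_iau_free)
{
    pthread_mutex_lock(&summary_lock);
    global_free_inodes += per_iau_free;
    pthread_mutex_unlock(&summary_lock);
}

void inode_alloc(void)
{
    pthread_mutex_lock(&summary_lock);
    global_free_inodes--;
    pthread_mutex_unlock(&summary_lock);
}

void inode_free(void)
{
    pthread_mutex_lock(&summary_lock);
    global_free_inodes++;
    pthread_mutex_unlock(&summary_lock);
}

long free_inodes(void)
{
    pthread_mutex_lock(&summary_lock);
    long n = global_free_inodes;
    pthread_mutex_unlock(&summary_lock);
    return n;
}
```

With every increment and decrement serialized by the same lock, concurrent mounts, allocations, and frees cannot lose updates the way the unguarded code did.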
* INCIDENT NO:2782982 TRACKING ID:2161660

SYMPTOM:
While performing inode update operations on a cluster-mounted file system, the
system may panic and the following stack trace is displayed:

vx_hlock_getpbdata_return+0xac
vxg_recv_pbdata+0x138
vxg_process_nmsg+0x21c
vxg_inmsg_work+0x6c
vxg_gen_worker+0x120

DESCRIPTION:
While performing inode update operations on a cluster-mounted file system,
under certain conditions the inode locks are released and re-acquired. The
identity of the inode can change while its locks are released. This may result
in the wrong inode getting updated, causing a panic.

RESOLUTION:
The code is modified to keep track of the updates made to the inode using a
counter. If this counter value is non-zero, the inode is not reused.

PATCH ID:PHKL_42718

* INCIDENT NO:2508171 TRACKING ID:2246127

SYMPTOM:
The mount command may take more time in the case of a large IAU file.

DESCRIPTION:
At mount time, the IAU file is read one block at a time; each block is
processed before the next block is read. If there are a huge number of files
in the file system, the IAU file becomes large, and reading such a large IAU
file one block at a time slows the completion of the mount command.

RESOLUTION:
The code is changed to read the IAU file using multiple threads in parallel;
a complete extent is now read and then processed.

* INCIDENT NO:2521674 TRACKING ID:2510903

SYMPTOM:
Writing to clones loops permanently on HP-UX 11.31; some threads show a
typical stack like the following:

vx_tranundo
vx_logged_cwrite
vx_write_clone
vx_write1
vx_rdwr
vno_rw
inline rwuio
write
syscall

DESCRIPTION:
A VxFS write with a small size can go through the logged-write path, which
stores the data in the intent log. A logged write can boost performance for
small writes, but requires the write size to be within the logged-write limit.
However, when data is written to checkpoints and the write length is greater
than the logged-write limit, VxFS cannot proceed with the logged write and
retries forever.

RESOLUTION:
The logged write is skipped if the write size exceeds the specified limit.

* INCIDENT NO:2551564 TRACKING ID:2428964

SYMPTOM:
The value of the kernel tunable max_thread_proc gets incremented by 1 after
every software-maintenance activity (install, remove, etc.) of the VRTSvxfs
package.

DESCRIPTION:
In the postinstall script of the VRTSvxfs package, the value of the kernel
tunable max_thread_proc is wrongly incremented by 1.

RESOLUTION:
The increment of the max_thread_proc tunable is removed from the postinstall
script.

* INCIDENT NO:2564431 TRACKING ID:2515459

SYMPTOM:
A local mount hangs in vx_bc_binval_cookie with a stack like the following:

delay
vx_bc_binval_cookie
vx_blkinval_cookie
vx_freeze_flush_cookie
vx_freeze_all
vx_freeze
vx_set_tunefs1
vx_set_tunefs
vx_aioctl_full
vx_aioctl_common
vx_aioctl
vx_ioctl
genunix:ioctl
unix:syscall_trap32

DESCRIPTION:
The hanging local-mount process is waiting for a buffer to be unlocked. That
buffer can only be released after its associated cloned-map writes are
flushed, but a necessary flush is missed.

RESOLUTION:
Code is added to synchronize the cloned-map writes so that all the cloned maps
are cleared and the buffers associated with them are released.

* INCIDENT NO:2567091 TRACKING ID:2527578

SYMPTOM:
The system crashes due to a NULL pointer dereference with the following stack:

simple_lock+000014 ()
vx_bhash_rele@AF161_63+00001C ()
vx_inode_deinit+0000D4 ()
vx_idrop+0002A4 ()
vx_detach_fset+000CC8 ()
vx_unmount+0001AC ()
vx_unmount_skey+000034 ()
vfs_unmount+000098 ()
kunmount+0000DC ()
uvmount+000208 ()
ovlya_addr_sc_flih_main+000130 ()

DESCRIPTION:
The crash happens as a result of accessing an address for which memory has not
been allocated. This address corresponds to a spinlock, hence the crash while
locking the spinlock.

RESOLUTION:
The spinlock is allocated and initialized before locking.

* INCIDENT NO:2574396 TRACKING ID:2433934

SYMPTOM:
Performance degradation is observed when CFS is used as a back-end NFS data
server, compared to standalone VxFS.
DESCRIPTION:
In CFS, if one thread holds the read-write lock on an inode in exclusive mode,
other threads are stuck for the same inode even if they want to access it in
shared mode, resulting in performance degradation.

RESOLUTION:
The code is changed to avoid taking the read-write lock for an inode in
exclusive mode where it is not required.

* INCIDENT NO:2581351 TRACKING ID:2588593

SYMPTOM:
df(1M) shows a wrong usage value for a volume when a large file is deleted.

DESCRIPTION:
The size of all freed extents is maintained in an in-core global variable and
in transaction-subroutine-specific data structures. After the deletion of a
large file, this in-core global variable was not updated. df(1M), while
reporting usage data, reads the freed-space information from this global
variable, which contains stale information.

RESOLUTION:
The code is modified to account the freed extent data in the global variable
used by df(1M), so that the correct usage for the volume is reported.

* INCIDENT NO:2587025 TRACKING ID:2528819

SYMPTOM:
AIX can fail to create new worker threads for VxFS. The following message is
seen in the system log:
"WARNING: msgcnt 175 mesg 097: V-2-97: vxfs failed to create new thread"

DESCRIPTION:
AIX fails the thread creation because it cannot find a free slot in the kproc,
and returns ENOMEM.

RESOLUTION:
The maximum number of VxFS worker threads is limited.

* INCIDENT NO:2587030 TRACKING ID:2561739

SYMPTOM:
When a file is created and its parent has a default ACL entry, that entry is
not taken into account when calculating the class entry of the file. When a
separate dummy entry is added, the default entry from the parent is taken
into account as well. For example:

$ getacl .
# file: .
# owner: root
# group: sys
user::rwx
group::rwx
class:rwx
other:rwx
default:user:user_one:r-x

$ touch file1
$ getacl file1
# file: file1
# owner: root
# group: sys
user::rw-
user:user_one:r-x
group::rw-
class:rw-      <------
other:rw-

The class entry here should be rwx.

DESCRIPTION:
The default entry of the parent was not taken into account. The attribute
inode is shared with the parent, and no new attribute inode is created for the
newly created file. But when an ACL entry is explicitly added, a separate
attribute inode is created, so the default entry also gets copied into the new
inode and is considered while returning the class entry of the file.

RESOLUTION:
Before returning the ACL entry buffer, the class entry is now recalculated,
considering all the entries.

* INCIDENT NO:2587033 TRACKING ID:2492304

SYMPTOM:
The "find" command displays duplicate directory entries.

DESCRIPTION:
Whenever the directory entries can fit in the inode's immediate area, VxFS
does not allocate new directory blocks. As new directory entries get added to
the directory, this immediate area fills up, and all the directory entries are
then moved to a newly allocated directory block. The directory blocks have
space reserved at the start of the block to hold the block hash information,
which is used for fast lookup of entries in that block. The offset of a
directory entry which was at, say, x bytes in the inode's immediate area will
be at (x + y) bytes when moved to the directory block, where "y" is the size
of the block hash. During this transition from the immediate area to directory
blocks, a readdir() can report a directory entry more than once.

RESOLUTION:
Directory entry offsets returned to the "readdir" call are adjusted so that
when the entries move to a new block, they remain at the same offsets.

* INCIDENT NO:2602982 TRACKING ID:2599590

SYMPTOM:
Expansion of a 100% full file system may panic the machine with the following
stack trace:

bad_kern_reference()
$cold_vfault()
vm_hndlr()
bubbledown()
vx_logflush()
vx_log_sync1()
vx_log_sync()
vx_worklist_thread()
kthread_daemon_startup()

DESCRIPTION:
When a 100% full file system is expanded, the intent log of the file system is
truncated, and the blocks freed up are used during the expansion.
Due to a bug, the block map of the replica intent log inode was not getting
updated, causing the block maps of the two inodes to differ. This caused some
of the in-core structures of the intent log to become NULL, and the machine
panics while dereferencing one of these structures.

RESOLUTION:
The block map of the replica intent log inode is now updated correctly. A 100%
full file system can now be expanded only if the last extent in the intent log
contains more than 32 blocks; otherwise fsadm will fail. To expand such a file
system, some files should be deleted manually and the resize retried.

* INCIDENT NO:2603015 TRACKING ID:2565400

SYMPTOM:
Sequential buffered I/O reads are slow in performance.

DESCRIPTION:
Read-aheads are not happening because the file system's read-ahead size gets
incorrectly calculated.

RESOLUTION:
The incorrect typecast is fixed.

* INCIDENT NO:2631026 TRACKING ID:2332314

SYMPTOM:
An internal noise.fullfsck test with ODM enabled hit the assert
"fdd_odm_aiodone:3".

DESCRIPTION:
In the case of a failed I/O in the fdd_write_clone_end() function, the error
was not set on the buffer, which causes the assert.

RESOLUTION:
The code is changed to set the error on the buffer in case of I/O failures in
the fdd_write_clone_end() function.

* INCIDENT NO:2631315 TRACKING ID:2631276

SYMPTOM:
A lookup fails for a file which is in a partitioned directory and is accessed
using its VxFS namespace extension name.

DESCRIPTION:
If a file present in a partitioned directory is accessed using its VxFS
namespace extension name, the name is searched in one of the hidden leaf
directories, which mostly does not contain an entry for this file. Due to
this, the lookup fails.

RESOLUTION:
The code is modified to call the partitioned-directory lookup routine at the
upper level, so that the lookup does not fail even if the file is accessed
using its extended namespace name.
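The resolution for incident 2565400 above amounts to fixing a single incorrect typecast in the read-ahead size calculation. The source does not show the actual expression, but the following hypothetical example illustrates how a narrowing cast of this kind can silently zero out a computed read-ahead size for large values:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative bug: an intermediate cast to a 32-bit type truncates
 * the product, so for large block counts the computed read-ahead size
 * collapses (here, to 0) and no read-ahead is issued. */
uint64_t readahead_size_buggy(uint64_t blksize, uint64_t nblocks)
{
    return (uint32_t)(blksize * nblocks);   /* the bad cast */
}

/* The fix: keep the full 64-bit width throughout the calculation. */
uint64_t readahead_size_fixed(uint64_t blksize, uint64_t nblocks)
{
    return blksize * nblocks;
}
```

With a 1 MB block size and 8192 blocks, the product is 2^33 bytes; truncated to 32 bits, that is exactly 0, which matches the symptom of read-ahead silently not happening.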
* INCIDENT NO:2635583 TRACKING ID:2271797

SYMPTOM:
Internal noise testing with a locally mounted VxFS file system hit the assert
"f:vx_getblk:1a".

DESCRIPTION:
The assert is hit because an overlay inode is being marked with the flag that
indicates a bad on-disk copy of the inode.

RESOLUTION:
The code is changed to set the flag indicating a bad on-disk copy of the inode
only if the inode is not an overlay inode.

* INCIDENT NO:2642027 TRACKING ID:2350956

SYMPTOM:
An internal noise test on a locally mounted file system exited with the error
message "bin/testit : Failed to full fsck cleanly, exiting", and the logs show
the userspace assert "bmaptops.c 369: ASSERT(devid == 0 || (start == VX_HOLE
&& devid == VX_DEVID_HOLE)) failed".

DESCRIPTION:
The function bmap_data4_set() gets called while entering bmap allocation
information for typed extents of type VX_TYPE_DATA_4 or VX_TYPE_IADDR_4. The
assert expects that either devid is zero, or, if the extent start is a hole,
that devid is VX_DEVID_HOLE. However, there are never extent descriptors
representing holes in typed extents, so the assertion is incorrect.

RESOLUTION:
The assert is corrected to check that the extent start is not a hole, and that
either devid is zero or the extent start is VX_OVERLAY with devid being
VX_DEVID_HOLE.

* INCIDENT NO:2669195 TRACKING ID:2326037

SYMPTOM:
An internal stress test on a cluster file system with clones failed while
writing to a file, with error ENOENT.

DESCRIPTION:
VxFS tries to write to a clone which is in the process of removal. As clone
removal works asynchronously, the process starts to push changes from the
inode of the primary fset to the inode of the clone fset. But by the time the
actual write happens, the inode of the clone fset has been removed, hence the
ENOENT error.

RESOLUTION:
Code is added to re-validate the inode being written.

PATCH ID:PHKL_42228

* INCIDENT NO:2169326 TRACKING ID:2169324

SYMPTOM:
On a local mount, when a clone is mounted for a file system and some quota is
assigned to the clone, and the quota is exceeded, the clone is removed. If
files from the clone are being accessed, an assert may be hit in the
vx_idelxwri_off() function through vx_trunc_tran().

DESCRIPTION:
During clone removal, all inodes of the clone(s) being removed are traversed,
and the assert is hit because there is a difference between the on-disk and
in-core sizes for a file which is being modified by the application.

RESOLUTION:
While truncating files, if the VX_IEPTTRUNC op is set, the in-core file size
is set to the on-disk file size.

* INCIDENT NO:2243061 TRACKING ID:1296491

SYMPTOM:
Performing a nested mount on a CFS file system triggers a data page fault if a
forced unmount is also taking place on the CFS file system. The panic stack
trace involves the following kernel routines:

vx_glm_range_unlock
vx_mount
domount
mount
syscall

DESCRIPTION:
When the underlying cluster-mounted file system is in the process of
unmounting, the nested mount dereferences a NULL vfs structure pointer,
thereby causing a system panic.

RESOLUTION:
The code is modified to prevent the underlying cluster file system from being
force-unmounted while a nested mount above the file system is in progress. The
ENXIO error is returned to the forced unmount attempt.

* INCIDENT NO:2243063 TRACKING ID:1949445

SYMPTOM:
A hang occurred when file creates were being performed on a large directory.
The stack of the hung thread is similar to:

vxglm:vxg_grant_sleep+226
vxglm:vxg_cmn_lock+563
vxglm:vxg_api_lock+412
vxfs:vx_glm_lock+29
vxfs:vx_get_ownership+70
vxfs:vx_exh_coverblk+89
vxfs:vx_exh_split+142
vxfs:vx_dexh_setup+1874
vxfs:vx_dexh_create+385
vxfs:vx_dexh_init+832
vxfs:vx_do_create+713

DESCRIPTION:
For large directories, the Large Directory Hash (LDH) is enabled to improve
lookups on such directories. The hang was due to taking ownership of the LDH
inode twice in the same thread context, i.e. while building the hash for the
directory.

RESOLUTION:
Ownership of the LDH inode is not taken again if the thread already holds it.
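The self-deadlock in incident 1949445 above, where the same thread requests ownership it already holds, is commonly avoided with an "already owner" guard. The following is a minimal single-threaded sketch of that idea; the structure and function names are illustrative, not VxFS's own, and a real implementation would block when another thread is the owner.

```c
#include <assert.h>

struct ldh_inode {
    int owner_tid;  /* thread id currently owning the inode, 0 = none */
};

/* Returns 1 if ownership was newly taken (the caller must drop it
 * later), 0 if the calling thread already owned the inode. Requesting
 * ownership unconditionally, as the buggy code did, would make the
 * second request from the same thread wait on itself forever. */
int ldh_get_ownership(struct ldh_inode *ip, int tid)
{
    if (ip->owner_tid == tid)
        return 0;           /* already owner: do not take it again */
    ip->owner_tid = tid;    /* simplified: assume the inode was free */
    return 1;
}

void ldh_drop_ownership(struct ldh_inode *ip)
{
    ip->owner_tid = 0;
}
```

The return value tells the caller whether it is responsible for dropping ownership, so a nested acquisition during hash building neither blocks nor releases a grant its outer caller still needs.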
* INCIDENT NO:2247299 TRACKING ID:2161379 SYMPTOM:In a CFS enviroment various filesytems operations hang with the following stack trace T1: vx_event_wait+0x40 vx_async_waitmsg+0xc vx_msg_send+0x19c vx_iread_msg+0x27c vx_rwlock_getdata+0x2e4 vx_glm_cbfunc+0x14c vx_glmlist_thread+0x204 T2: vx_ilock+0xc vx_assume_iowner+0x100 vx_hlock_getdata+0x3c vx_glm_cbfunc+0x104 vx_glmlist_thread+0x204 DESCRIPTION:Due to improper handling of the ENOTOWNER error in the iread receive function. We continously retry the operation while holding an Inode Lock blocking all other threads and causing a deadlock RESOLUTION:The code is modified to release the inode lock on ENOTOWNER error and acquire it again, thus resolving the deadlock There are totally 4 vx_msg_get_owner() caller with ilocked=1: vx_rwlock_getdata() : Need Fix vx_glock_getdata() : Need Fix vx_cfs_doextop_iau(): Not using the owner for message loop, no need to fix. vx_iupdat_msg() : Already has 'unlock/delay/lock' on ENOTOWNER condition! * INCIDENT NO:2257904 TRACKING ID:2251223 SYMPTOM:The 'df -h' command can take 10 seconds to run to completion and yet still report an inaccurate free block count, shortly after removing a large number of files. DESCRIPTION:When removing files, some file data blocks are released and counted in the total free block count instantly. However blocks may not always be freed immediately as VxFS can sometimes delay the releasing of blocks. Therefore the displayed free block count, at any one time, is the summation of the free blocks and the 'delayed' free blocks. Once a file 'remove transaction' is done, its delayed free blocks will be eliminated and the free block count increased accordingly. However, some functions which process transactions, for example a metadata update, can also alter the free block count, but ignore the current delayed free blocks. 
As a result, if file 'remove transactions' have not finished updating their free blocks and their delayed free blocks information, the free space count can occasionally show greater than the real disk space. Therefore, to obtain an up-to-date and valid free block count for a file system, a delay-and-retry loop was delaying 1 second before each retry and looping 10 times before giving up. Thus the 'df -h' command can sometimes take 10 seconds, but even if the file system waits for 10 seconds there is no guarantee that the output displayed will be accurate or valid.
RESOLUTION: The delayed free block count is recalculated accurately when transactions are created and when metadata is flushed to disk.

* INCIDENT NO:2275543 TRACKING ID:1475345
SYMPTOM: The write() system call hangs for over 10 seconds.
DESCRIPTION: While performing transactions for a logged write, one buffer at a time belonging to the transaction space was flushed asynchronously. Such asynchronous flushing caused intermittent delays in write operations because of the reduced transaction space.
RESOLUTION: All the dirty buffers on the file are flushed in one attempt through a synchronous flush, which frees up a large amount of transaction space. This reduces the delay during the write system call.

* INCIDENT NO:2289610 TRACKING ID:2073336
SYMPTOM: vxfsstat does not reflect the change of vx_ninode after changing it with kctune.
DESCRIPTION: The vxfs_ninode and vxi_icache_maxino values were not synchronized in the callback function.
RESOLUTION: Synchronization code is added to the callback function vx_ninode_callback().

* INCIDENT NO:2311490 TRACKING ID:2074806
SYMPTOM: A DMAPI program using dm_punch_hole may result in corrupted data.
DESCRIPTION: When the dm_punch_hole call is made on a file with allocated extents immediately after a previous write, data can be written through stale pages. This causes data to be written to the wrong location.
RESOLUTION: dm_punch_hole now invalidates all the pages within the hole it is creating.
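The dm_punch_hole invalidation boils down to computing which cached pages overlap the punched byte range, so that no stale page can satisfy a later read or absorb a later write. A minimal Python sketch of that range calculation (the page size and function name are illustrative, not the VxFS implementation):

```python
PAGE_SIZE = 4096  # illustrative page size

def pages_overlapping_hole(offset, length):
    """Return the indices of all cached pages that intersect the
    byte range [offset, offset + length) being punched out; every
    one of them must be invalidated so subsequent I/O cannot go
    through stale pages."""
    if length <= 0:
        return []
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return list(range(first, last + 1))
```

Note that a hole touching even one byte of a page invalidates the whole page, which is why the last page is computed from `offset + length - 1`.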
* INCIDENT NO:2329887 TRACKING ID:2253938
SYMPTOM: In a Cluster File System (CFS) environment, the file read performance gradually degrades to 10% of the original read performance, and "fsadm -F vxfs -D -E" shows a large number (> 70%) of free blocks in extents smaller than 64k. For example: % Free blocks in extents smaller than 64 blks: 73.04 % Free blocks in extents smaller than 8 blks: 5.33
DESCRIPTION: In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This fragments the free space, leading to badly fragmented files.
RESOLUTION: The code is modified such that the time for which the delegation of the AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to de-fragment free space using the -C option.

* INCIDENT NO:2329893 TRACKING ID:2316094
SYMPTOM: vxfsstat incorrectly reports "vxi_bcache_maxkbyte" greater than "vx_bc_bufhwm" after reinitialization of the buffer cache globals. Reinitialization can happen during dynamic reconfiguration operations. The "vxi_bcache_maxkbyte" counter in vxfsstat shows the maximum memory available for buffer cache buffer allocation. The maximum memory available for buffer allocation depends on the total memory available for the buffer cache (buffers + buffer headers), i.e. the "vx_bc_bufhwm" global. Therefore vxi_bcache_maxkbyte should never be greater than vx_bc_bufhwm.
DESCRIPTION: "vxi_bcache_maxkbyte" is a per-CPU counter, i.e. part of the global per-CPU 'vx_info' counter structure. vxfsstat sums all the per-CPU counters and reports the result.
During reinitialization of the buffer cache, this counter was not properly set to zero before the new value was assigned to it. Therefore the total sum of this per-CPU counter could be more than 'vx_bc_bufhwm'.
RESOLUTION: During buffer cache reinitialization, "vxi_bcache_maxkbyte" is now correctly set to zero so that the final sum of this per-CPU counter is correct.

* INCIDENT NO:2340755 TRACKING ID:2334061
SYMPTOM: When a file system is mounted with the tranflush option, operations requiring metadata updates take comparatively more time.
DESCRIPTION: When a VxFS file system is mounted with the tranflush option, transaction metadata is flushed to disk with a wait of 100 milliseconds before the next transaction is flushed. This delay severely affects the operation of various commands on the VxFS file system.
RESOLUTION: Since the flushing is synchronous and performed in a loop, a 100 millisecond delay is too long. To solve the problem, the delay is reduced from 100 milliseconds to a more reasonable 2 milliseconds.

* INCIDENT NO:2340799 TRACKING ID:2059611
SYMPTOM: The system panics because of a NULL tranp in vx_unlockmap().
DESCRIPTION: vx_unlockmap() unlocks a map structure of the file system. While the map is being handled, its hold count is incremented. vx_unlockmap() attempts to check whether the mlink doubly-linked list is empty, while the asynchronous vx_mapiodone routine can change the link at unpredictable times whenever the hold count is zero.
RESOLUTION: The evaluation order inside vx_unlockmap() is changed so that the further evaluation is skipped when the map hold count is zero.

* INCIDENT NO:2340802 TRACKING ID:2129455
SYMPTOM: Many VxFS threads are seen doing inactive processing.
DESCRIPTION: There were 2 issues that can cause many VxFS threads to do inactive processing: 1. One inactive processing thread was spawned per inactive list; on high-end machines, this could result in a large number of threads doing inactive processing.
2. vx_inactive_started was wrongly bumped in vx_icache_process() instead of vx_inactive_process(), which could cause many inactive processing threads in a corner case.
RESOLUTION: For the first issue, at most max(ncpu/2, 8) threads now perform inactive processing at one time. The second issue is fixed by bumping vx_inactive_started in vx_inactive_process().

* INCIDENT NO:2340813 TRACKING ID:2183320
SYMPTOM: VxFS mmap performance degradation on HP-UX 11.31.
DESCRIPTION: While filling the pages in vx_alloc_getpage, various synchronous I/Os are issued. These I/Os were performed sequentially, i.e. the next I/O was issued only after the previous one finished. This caused a performance problem in the case of large pages on HP.
RESOLUTION: The problem is fixed by building a single chain of buffers, issuing the I/Os in parallel on all the buffers, and then waiting for their completion, instead of waiting for the completion of one I/O before issuing the next as was done previously.

* INCIDENT NO:2340817 TRACKING ID:2192895
SYMPTOM: The system panics when performing fcl commands at: unix:panicsys unix:vpanic_common unix:panic genunix:vmem_xalloc genunix:vmem_alloc unix:segkmem_xalloc unix:segkmem_alloc_vn genunix:vmem_xalloc genunix:vmem_alloc genunix:kmem_alloc vxfs:vx_getacl vxfs:vx_getsecattr genunix:fop_getsecattr genunix:cacl genunix:acl unix:syscall_trap32
DESCRIPTION: The ACL count in the inode can be corrupted due to a race condition. For example, setacl can change the ACL count while getacl is processing the same inode, which could cause an invalid use of the ACL count.
RESOLUTION: The code is modified to add protection for the vulnerable ACL count to avoid corruption.

* INCIDENT NO:2340831 TRACKING ID:2272072
SYMPTOM: GAB panics the box because the VCS engine "had" did not respond when lbolt wraps around.
DESCRIPTION: lbolt wraps around after 498 days of machine uptime. VxFS flushes its metadata buffers based on their age, and the age calculation takes lbolt into account.
Due to the lbolt wrap, the buffers were not flushed, so a lot of metadata I/O stopped, hence the panic.
RESOLUTION: The function handling the flushing of dirty buffers now also handles the condition where lbolt has wrapped; if it has, the current lbolt time is assigned to the last update time of the dirty list.

* INCIDENT NO:2340834 TRACKING ID:2302426
SYMPTOM: The system panics when multiple 'vxassist mirror' commands are running concurrently, with the following stack trace: 0) panic+0x410 1) unaligned_hndlr+0x190 2) bubbleup+0x880 ( ) +------------- TRAP #1 ---------------------------- | Unaligned Reference Fault in KERNEL mode | IIP=0xe000000000b03ce0:0 | IFA=0xe0000005aa53c114 <--- | p struct save_state 0x2c561031.0x9fffffff5ffc7400 +------------- TRAP #1 ---------------------------- LVL FUNC ( IN0, IN1, IN2, IN3, IN4, IN5, IN6, IN7 ) 3) vx_copy_getemap_structs+0x70 4) vx_send_getemapmsg+0x240 5) vx_cfs_getemap+0x240 6) vx_get_freeexts_ioctl+0x990 7) vxportal_ioctl+0x4d0 8) spec_ioctl+0x100 9) vno_ioctl+0x390 10) ioctl+0x3c0 11) syscall+0x5a0
DESCRIPTION: The panic is caused by dereferencing an unaligned address in a CFS message structure.
RESOLUTION: bcopy is used to ensure proper alignment of the addresses.

* INCIDENT NO:2340839 TRACKING ID:2316793
SYMPTOM: Shortly after removing files in a file system, commands like 'df', which use 'statfs()', can take 10 seconds to complete.
DESCRIPTION: To obtain an up-to-date and valid free block count in a file system, a delay-and-retry loop was delaying 1 second before each retry and looping 10 times before giving up. This unnecessarily excessive retrying could cause a 10 second delay per file system when executing the df command.
RESOLUTION: The original 10 retries with a 1 second delay each have been reduced to 1 retry after a 20 millisecond delay, when waiting for an updated free block count.
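The resolution above is a plain retry-budget change: same wait loop, drastically smaller budget. A minimal Python sketch of such a loop (the callbacks and parameter names are illustrative, not the statfs() internals):

```python
import time

def wait_for_free_count(read_count, is_valid, retries=1, delay=0.020):
    """Read the free block count, retrying while pending 'remove
    transactions' may still be updating it. The pre-fix behaviour
    corresponds to retries=10, delay=1.0 (up to 10 seconds); the
    fix corresponds to a single 20 ms retry."""
    count = read_count()
    for _ in range(retries):
        if is_valid(count):
            break
        time.sleep(delay)        # give pending transactions a chance
        count = read_count()     # then re-read the count
    return count
```

Even with the old 10-second budget the result was not guaranteed valid, which is why the real fix (recalculating the delayed free count at transaction creation and metadata flush) mattered more than the loop tuning.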
* INCIDENT NO:2360819 TRACKING ID:2337470
SYMPTOM: A Cluster File System can unexpectedly and prematurely report a 'file system out of inodes' error when attempting to create a new file. The error message reported will be similar to the following: vxfs: msgcnt 1 mesg 011: V-2-11: vx_noinode - /dev/vx/dsk/dg/vol file system out of inodes
DESCRIPTION: When allocating new inodes in a cluster file system, VxFS searches for an available free inode in the 'Inode Allocation Units' [IAUs] that are currently delegated to the local node. If none are available, it then searches the IAUs that are not currently delegated to any node, or revokes an IAU delegated to another node. It is also possible for gaps, or HOLEs, to be created in the IAU structures as a side effect of the CFS delegation processing. However, when searching for an available free inode, VxFS simply ignored any HOLEs it found. Once the maximum size of the metadata structures has been reached (2^31), new IAUs cannot be created, so one of the HOLEs should then be populated and used for new inode allocation. Because the HOLEs were being ignored, VxFS could prematurely report the "file system out of inodes" error even though there was plenty of free space in the file system to create new inodes.
RESOLUTION: New inodes are now allocated from the gaps, or HOLEs, in the IAU structures (created as a side effect of the CFS delegation processing). The HOLEs are populated rather than returning a 'file system out of inodes' error.

* INCIDENT NO:2360820 TRACKING ID:2345626
SYMPTOM: File access may be denied on regular files that inherit the default group ACL from the parent directory. When this behavior occurs, commands that attempt to open an affected regular file fail with a "cannot open", or similar, message.
DESCRIPTION: Files that do not have ACLs explicitly set share ACLs with their parent directory.
When checking access for such a file, only the default ACL entries of the parent directory need to be read. The default entries are stored after the non-default entries in the parent directory's inode. The default entries were counted correctly, but the aclp pointer also needs to be advanced to the start of the default entries. Here is an example in which the file "file1", which inherits ACL entries from the directory "dir1", incorrectly denies access to a user in group "grp1": # umask 007 # getacl /dir1 # file: /dir1 # owner: root # group: sys user::rwx group::rwx group:grp1:r-x class:rwx other:--- default:group:grp1:r-x # touch /dir1/file1 # getacl /dir1/file1 # file: /dir1/file1 # owner: root # group: sys user::rw- group::rw- group:grp1:r-x #effective:r-- class:rw- other:--- # getaccess -u user1 -g grp1 /dir1/file1 --- /dir1/file1
RESOLUTION: For a file that shares its ACL with the parent directory, the ACL entries are now read correctly.

* INCIDENT NO:2360821 TRACKING ID:1956458
SYMPTOM: When attempting to check checkpoint information with "fsckptadm -C blockinfo", the command failed with error 6 (ENXIO), the file system was disabled, and errors like the following appeared in the message file: vxfs: msgcnt 4 mesg 012: V-2-12: vx_iget - /dev/vx/dsk/sfsdg/three file system invalid inode number 4495 vxfs: msgcnt 5 mesg 096: V-2-96: vx_setfsflags - /dev/vx/dsk/sfsdg/three file system fullfsck flag set - vx_cfs_iread
DESCRIPTION: VxFS uses ilist files in the primary fileset and in checkpoints to hold inode information. A hole in an ilist file indicates that the inodes in the hole do not exist and are not yet allocated in the corresponding fileset or checkpoint. fsckptadm checks every inode in the primary fileset and the downstream checkpoints. If an inode falls into a hole in a prior checkpoint, i.e. the associated file had not been created at the time of the checkpoint creation, fsckptadm exits with an error.
RESOLUTION: Inodes in the downstream checkpoints are skipped if they are located in a hole.
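The aclp fix in incident 2360820 is a pointer-arithmetic detail: the default entries live after the non-default entries, so the read must start at the right index, not merely use the right count. A minimal list-based Python sketch (purely illustrative, not the on-disk ACL layout):

```python
def default_acl_entries(entries, ndefault):
    """Given a parent directory's ACL entries, where the ndefault
    default entries are stored after the non-default entries,
    return just the default entries. The pre-fix code counted
    ndefault correctly but read from the start of the list,
    picking up non-default entries instead."""
    start = len(entries) - ndefault   # advance past the non-default entries
    return entries[start:]
```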
* INCIDENT NO:2368738 TRACKING ID:2368737
SYMPTOM: If a file which has shared extents has corrupt indirect blocks, then in certain cases the reference count tracking system can try to interpret such a block and panic the system. Since this is an asynchronous background operation, the processing is retried on every file system mount and can therefore result in a panic every time the file system is mounted.
DESCRIPTION: The reference count tracking system for shared extents updates reference counts in a lazy fashion, so in certain cases it asynchronously accesses shared indirect blocks belonging to a file to account for reference count updates. If such an indirect block was already badly corrupted, this tracking mechanism could panic the system repeatedly on every mount.
RESOLUTION: The reference count tracking system now validates the indirect extent read from the disk; if it is not found valid, the VX_FULLFSCK flag is set in the superblock, marking the file system for a full fsck, and the file system is disabled on the current node.

* INCIDENT NO:2368788 TRACKING ID:2343158
SYMPTOM: On a tuning failure of vx_ninode, when the new value is less than (250 * vx_nfreelists), the following message is displayed: "vmunix: ERROR: mesg 112: V-2-112: The new value requires changes to Inode table which can be made only after a reboot"
DESCRIPTION: The message for this tuning failure is reported as an "ERROR", which can confuse customers and raise questions such as whether the system is safe to run after this error. This tuning failure is not a serious failure, so the message can be reported as a WARNING, and the message text can be modified to state that the tuning request has failed.
RESOLUTION: The failure message is modified so that it is reported as a 'WARNING' instead of an 'ERROR', and the message clearly mentions the failure of the tuning operation.
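The indirect-block fix in incident 2368738 follows a validate-before-interpret pattern: check an on-disk structure before trusting it, and fail the file system safely instead of panicking. A minimal Python sketch under assumed names (the magic value, dictionary layout, and callbacks are all invented for illustration):

```python
INDIRECT_MAGIC = 0x1A2B3C4D  # illustrative on-disk magic value

def process_indirect_block(block, set_fullfsck, disable_fs):
    """Validate an indirect block read from disk before interpreting
    it. On corruption, mark the file system for a full fsck and
    disable it on this node rather than dereferencing garbage."""
    if block.get("magic") != INDIRECT_MAGIC:
        set_fullfsck()   # analogous to setting VX_FULLFSCK in the superblock
        disable_fs()     # stop the lazy background retry-panic cycle
        return None
    return block["extents"]
```

Disabling the file system here is what breaks the "panic on every mount" loop: the background operation no longer gets to reinterpret the same corrupt block.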
* INCIDENT NO:2371921 TRACKING ID:2371910
SYMPTOM: mkfs fails to create a VxFS file system with disk layout version 4.
DESCRIPTION: In 5.1SP1, support for VxFS disk layout version (DLV) 4 was deprecated, so mkfs for DLV 4 fails. However, DLV 4 can still be mounted so that the file system can be upgraded to a supported DLV using vxupgrade.
RESOLUTION: On special request from HP, DLV 4 has been un-deprecated for 5.1SP1RP2. It is therefore now possible to create a VxFS file system with DLV 4 using mkfs in 5.1SP1RP2.

* INCIDENT NO:2371923 TRACKING ID:2371909
SYMPTOM: In VxFS, delete performance is affected on the Postmark test.
DESCRIPTION: When the function vx_delxwri_flush() is executed, all inodes are flushed to disk, which results in extra load while deleting files, affecting delete performance.
RESOLUTION: The code is changed to mark the inodes that were last flushed, and to flush only the inodes that were not flushed in the previous run.

* INCIDENT NO:2373565 TRACKING ID:2283315
SYMPTOM: The system may panic when "fsadm -e" is run on a file system containing file-level snapshots. The panic stack looks like: crash_kexec() __die at() do_page_fault() error_exit() [exception RIP: vx_bmap_lookup+36] vx_bmap_lookup() vx_bmap() vx_reorg_emap() vx_extmap_reorg() vx_reorg() vx_aioctl_full() vx_aioctl_common() vx_aioctl() vx_ioctl() do_ioctl() vfs_ioctl() sys_ioctl() tracesys
DESCRIPTION: The panic happened because a NULL inode pointer was passed to the vx_bmap_lookup() function. While reorganizing the extents of a file, a block map (bmap) lookup operation is done on the file to get information about its extents. If this bmap lookup finds a hole at an offset in a file containing shared extents, a local variable is not updated, which leaves the inode pointer NULL for the next bmap lookup operation.
RESOLUTION: The local variable is initialized so that the inode pointer passed to vx_bmap_lookup() is non-NULL.
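Incident 2373565 is a classic stale-local bug: a loop variable that is only refreshed on the non-hole path. A minimal Python sketch of the fixed pattern (the walk and lookup interfaces are invented for illustration, not the vx_reorg_emap() code):

```python
HOLE = object()  # sentinel for "no extent mapped at this offset"

def walk_extents(inode, offsets, bmap_lookup):
    """Look up the extent at each offset of a file being
    reorganized. The pre-fix code let the inode reference used for
    the next lookup go stale (NULL) after encountering a hole;
    here the reference stays valid on every iteration."""
    extents = []
    for off in offsets:
        assert inode is not None   # the invariant the fix restores
        ext = bmap_lookup(inode, off)
        if ext is HOLE:
            continue               # a hole must not clobber 'inode'
        extents.append(ext)
    return extents
```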
* INCIDENT NO:2386483 TRACKING ID:2374887
SYMPTOM: Access to a file system can hang when creating a named attribute, due to a read/write lock being held exclusively and indefinitely, causing a thread to loop in vx_tran_nattr_dircreate(). A typical stack trace of a looping thread: vx_itryhold_locked vx_iget vx_attr_iget vx_attr_kgeti vx_attr_getnonimmed vx_acl_inherit vx_aclop_creat vx_attr_creatop vx_new_attr vx_attr_inheritbuf vx_attr_inherit vx_tran_nattr_dircreate vx_nattr_copen vx_nattr_open vx_setea vx_linux_setxattr vfs_setxattr link_path_walk sys_setxattr system_call
DESCRIPTION: The initial creation of a named attribute for a regular file or directory results in the automatic creation of a 'named attribute directory'. Creations are initially attempted in a single transaction. Should the single transaction fail because a read/write lock is held, a retry should split the task into multiple transactions. An incorrect reset of a tracking structure meant that all retries were performed using a single transaction, creating an endless retry loop.
RESOLUTION: The tracking structure is no longer reset within the retry loop.

* INCIDENT NO:2409792 TRACKING ID:2373239
SYMPTOM: Performance issue pointing to the read flush-behind algorithm.
DESCRIPTION: While the system is under memory pressure, the VxFS read flush-behind algorithm may invalidate pages that were read ahead before there is a chance to consume them. The invalidated pages must be re-read, which leads to bad application performance. A customer used adb to turn this feature off and saw very good improvements.
RESOLUTION: A gap is kept between the read flush offset and the current read offset; the gap length is fs_flush_size. Pages in this gap range are not flushed, which gives the user application a chance to consume them.

* INCIDENT NO:2412029 TRACKING ID:2384831
SYMPTOM: The system panics with the following stack trace. This happens in some cases when named streams are used in VxFS.
machine_kexec() crash_kexec() __die do_page_fault() error_exit() [exception RIP: iput+75] vx_softcnt_flush() vx_ireuse_clean() vx_ilist_chunkclean() vx_inode_free_list() vx_ifree_scan_list() vx_workitem_process() vx_worklist_process() vx_worklist_thread() vx_kthread_init() kernel_thread()
DESCRIPTION: VxFS internally creates a directory to keep the named streams pertaining to a file. In some scenarios, an error code path was missing the release of the hold on that directory. Because of this, unmounting the file system does not clean up the inode belonging to that directory, and a panic is seen later when VxFS reuses such an inode.
RESOLUTION: The hold on the named streams directory is released in case of an error.

* INCIDENT NO:2412173 TRACKING ID:2383225
SYMPTOM: The system panics during a user write with the following stack trace and the panic string "pfd_unlock: bad lock state!": (panic+0x128) (bad_kern_reference+0x64) (vfault+0x1ec) ($0000009B+0xac) ($thndlr_rtn+0x0) (vx_dio_rdwri+0xdc) (vx_write_direct+0x2ec) (vx_write1+0x13a8) (vx_rdwr+0xa88) (vno_rw+0x64) (rwuio+0x11c) (aio_rw_child_thread+0x178) (aio_exec_req_thread+0x258) (kthread_daemon_startup+0x24) (kthread_daemon_startup+0x0)
DESCRIPTION: When a write to a file is handled as direct I/O, user pages are pinned using the pas_pin() interface provided by the OS before the I/O is issued. The pas_pin() interface can return an ENOSPC error. The VxFS write code path misinterpreted the ENOSPC error and retried the write without resetting a variable in the uio structure. The system panics later while dereferencing that variable of the uio structure.
RESOLUTION: The write is not retried when pas_pin() returns an ENOSPC error.

* INCIDENT NO:2412177 TRACKING ID:2371710
SYMPTOM: User quota file corruption occurs when the DELICACHE feature is enabled; the current inode usage of a user becomes negative after frequent file creations and deletions.
Checking the quota info using the command "vxquota -vu username", the number of files is "-1", as in: # vxquota -vu testuser2 Disk quotas for testuser2 (uid 500): Filesystem usage quota limit timeleft files quota limit timeleft /vol01 1127809 8239104 8239104 -1 0 0
DESCRIPTION: This issue was introduced by the inode DELICACHE feature in 5.1SP1, a performance enhancement that optimizes the updates done to the inode map during file creations and deletions. The feature is enabled by default and can be changed with vxtunefs. When DELICACHE is enabled and quotas are set for VxFS, there is an extra quota update for the inodes on the inactive list during the removal process. Since the quota for these inodes has already been updated before they are put on the delicache list, the current number of user files eventually gets decremented twice.
RESOLUTION: A flag is added to identify the inodes moved to the inactive list from the delicache list, so that the flag can be used to prevent updating the quota again during the removal process.

* INCIDENT NO:2412179 TRACKING ID:2387609
SYMPTOM: Quota usage gets set to ZERO when the file system is unmounted and remounted, even though files owned by users exist. This issue may occur after some file creations and deletions. Checking the quota usage with the "vxrepquota" command gives output like the following: # vxrepquota -uv /vx/sofs1/ /dev/vx/dsk/sfsdg/sofs1 (/vx/sofs1): Block limits File limits User used soft hard timeleft used soft hard timeleft testuser1 -- 0 3670016 4194304 0 0 0 testuser2 -- 0 3670016 4194304 0 0 0 testuser3 -- 0 3670016 4194304 0 0 0 Additionally, the quota usage may not be updated after inode/block usage reaches ZERO.
DESCRIPTION: The issue occurs when VxFS merges the external per-node quota files with the internal quota file. The block offset within the external quota file could be calculated wrongly in some scenarios.
When a hole is found in a per-node quota file, the file offset is modified so that it points to the next non-HOLE offset, but the block offset, which points to the next available quota record within a block, was not changed accordingly. In addition, VxFS updates per-node quota records only when the global internal quota file shows some byte or inode usage; otherwise it does not copy the usage from the global quota file to the per-node quota file. In the case where the quota usage in the external quota files has gone down to zero and both the byte and inode usage in the global file become zero, the per-node quota records were not updated and were left with incorrect usage. The code should also check the byte and inode usage in the per-node quota record, and should skip copying records only when the byte and inode usage in both the global quota file and the per-node quota file is zero.
RESOLUTION: The calculation of the block offset when a hole is found in a per-node quota file is corrected. Code is added to also check the block and inode usage in the per-node quota record while updating user quota usage.

* INCIDENT NO:2413004 TRACKING ID:2413036
SYMPTOM: In a CFS environment with partitioned directories enabled (disk layout version 8), trying to create a huge number of subdirectories within a directory hangs with the following stack trace: vx_pd_lookup+0008A8 () vx_init_pdnlink+0003D4 () vx_get_pdnlink+0004FC () vx_mkdir1_pd+00017C () vx_do_mkdir@AF53_26+000080 () vx_do_mkdir+00002C () vx_mkdir1@AF54_27+00004C () vx_mkdir1+00002C () vx_mkdir+0000D4 ()
DESCRIPTION: With partitioned directory support, CFS can create MAXLINK - delta subdirectories within a directory; the value of delta depends on the number of nodes present in the cluster. The delta value was not considered when detecting whether the subdirectory limit had been reached, so the operation was retried endlessly.
RESOLUTION: The code is modified to consider delta while checking the link limit during mkdir.
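The first part of the fix in incident 2412179 keeps two cursors consistent: the byte offset into the quota file and the record offset within the current block. A minimal Python sketch with invented record and block sizes (the real on-disk quota record layout is not described here):

```python
RECORD_SIZE = 64        # illustrative size of one quota record
RECORDS_PER_BLOCK = 16  # illustrative records per block

def skip_hole(next_data_offset):
    """After seeking past a hole in a per-node quota file, return
    the new file offset together with the matching intra-block
    record index. The pre-fix code updated only the file offset
    and kept a stale record index, so subsequent merges read the
    wrong record within the block."""
    record_index = (next_data_offset // RECORD_SIZE) % RECORDS_PER_BLOCK
    return next_data_offset, record_index
```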
* INCIDENT NO:2413010 TRACKING ID:2413037
SYMPTOM: On a VxFS file system with partitioned directories enabled (disk layout version 8) and accessed through NFS, a directory listing lists fewer entries than expected.
DESCRIPTION: NFS uses a limited-size buffer to read directory entries and hence has to invoke the readdir call multiple times to read all the entries of a big directory. NFS uses the d_off field of the last directory entry in the buffer as the offset for the next readdir invocation. The d_off field was not set properly, causing the directory listing operation to terminate prematurely.
RESOLUTION: With partitioned directory support, the entries of a namespace-visible directory are distributed across multiple hidden subdirectories. The code is modified so that the d_off field reflects the offset of an entry from the beginning of the namespace-visible directory rather than from the beginning of the hidden subdirectory.

* INCIDENT NO:2413015 TRACKING ID:2413039
SYMPTOM: On a VxFS file system with partitioned directories enabled (disk layout version 8) and accessed through a read-only mount, in a particular scenario a directory listing lists fewer entries than expected.
DESCRIPTION: With partitioned directory support, all the entries of a namespace-visible directory are distributed across multiple hidden subdirectories. The distribution happens during partitioning; if the partitioning does not complete for any reason, such as a system crash, then some entries have moved to their respective hidden directories while some still remain under the namespace-visible directory. Those entries may not get reported as part of a directory listing if the file system is mounted read-only.
RESOLUTION: The readdir code is modified to ensure such entries get reported even if the file system is mounted read-only.
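The d_off fix in incident 2413010 converts a per-subdirectory offset into one global cursor over the whole namespace-visible directory, so a resumed readdir continues where the previous buffer ended. A minimal Python sketch using entry-index offsets (the layout and function names are invented for illustration):

```python
def global_d_off(hidden_dirs, dir_index, local_index):
    """Offset of an entry counted from the beginning of the
    namespace-visible directory: all entries in earlier hidden
    subdirectories plus the entry's index within its own
    subdirectory. The pre-fix code effectively returned only
    local_index, so NFS restarted inside the wrong subdirectory."""
    return sum(len(d) for d in hidden_dirs[:dir_index]) + local_index

def resume(hidden_dirs, d_off):
    """Map a global d_off back to (hidden dir index, local index)
    for the next readdir invocation."""
    for i, d in enumerate(hidden_dirs):
        if d_off < len(d):
            return i, d_off
        d_off -= len(d)
    return len(hidden_dirs), 0   # past the last entry
```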
* INCIDENT NO:2418819 TRACKING ID:2283893
SYMPTOM: In a Cluster File System (CFS) environment, the file read performance gradually degrades to 10% of the original read performance, and "fsadm -F vxfs -D -E" shows a large number (> 70%) of free blocks in extents smaller than 64k. For example: % Free blocks in extents smaller than 64 blks: 73.04 % Free blocks in extents smaller than 8 blks: 5.33
DESCRIPTION: In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This fragments the free space, leading to badly fragmented files.
RESOLUTION: The code is modified such that the time for which the delegation of the AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to de-fragment free space using the -C option.

* INCIDENT NO:2420060 TRACKING ID:2403126
SYMPTOM: A hang is seen in the cluster when one of the nodes leaves or is rebooted. One of the nodes in the cluster will contain the following stack trace: e_sleep_thread() vx_event_wait() vx_async_waitmsg() vx_msg_send() vx_send_rbdele_resp() vx_recv_rbdele+00029C () vx_recvdele+000100 () vx_msg_recvreq+000158 () vx_msg_process_thread+0001AC () vx_thread_base+00002C () threadentry+000014 (??, ??, ??, ??)
DESCRIPTION: Whenever a node in the cluster leaves, reconfiguration happens and all the resources held by the leaving node are consolidated. This is done on one node of the cluster, called the primary node. Each node sends a message to the primary node about the resources it is currently holding.
During this reconfiguration, in a corner case, VxFS incorrectly calculated a message length larger than what the GAB (Veritas Group Membership and Atomic Broadcast) layer can handle. As a result the message was lost: it was dropped at the sender and never actually transmitted, while the sender assumed it had been sent and waited for an acknowledgement. The primary node waiting for this message waits forever, so the reconfiguration never completes, leading to the hang.
RESOLUTION: The message length calculation is now done properly, so GAB can handle the messages.

* INCIDENT NO:2425429 TRACKING ID:2422574
SYMPTOM: On CFS, after turning quotas on, when any node is rebooted and rejoins the cluster, it fails to mount the file system.
DESCRIPTION: At the time of mounting the file system after rebooting the node, mntlock was already set, which did not allow the remount of the file system if quotas were on.
RESOLUTION: The code is changed so that the mntlock flag is masked in the quota operation, as it is already set on the mount.

* INCIDENT NO:2426039 TRACKING ID:2412604
SYMPTOM: Once the time limit expires after exceeding the soft limit of the user quota size on a VxFS file system, writes are still permitted over that soft limit.
DESCRIPTION: The timer was not started when the soft limit was initially set up, even if the usage had already exceeded it.
RESOLUTION: The timer is started during the initial setting of quota limits if the current usage has already crossed the soft quota limits.

* INCIDENT NO:2427269 TRACKING ID:2399228
SYMPTOM: Occasionally Oracle archive logs can be created smaller than they should be; in the reported case the resultant Oracle archive logs were incorrectly sized as 512 bytes.
DESCRIPTION: The fcntl [file control] command F_FREESP [free storage space] can be used to change the size of a regular file. If the file size is reduced we call it a "truncate", and space allocated in the truncated area is returned to the file system free space pool.
If the file size is increased using F_FREESP we call it a "truncate-up"; although the file size changes, no space is allocated in the extended area of the file. Oracle archive logs use the F_FREESP fcntl command to perform a truncate-up of a new file before a smaller write of 512 bytes [at the start of the file] is performed. A timing window was found with F_FREESP in which the 'truncate-up' file size was lost, or rather overwritten, by the subsequent write of the data, causing the file to appear with a size of just 512 bytes.
RESOLUTION: The timing window has been closed: the flush of the allocating [512 byte] write is now triggered after the new F_FREESP file size has been updated in the inode.

* INCIDENT NO:2478237 TRACKING ID:2384861
SYMPTOM: The following asserts are seen during internal stress and regression runs: f:vx_do_filesnap:1b f:vx_inactive:2a f:xted_check_rwdata:31 f:vx_do_unshare:1
DESCRIPTION: These asserts validate assumptions in various functions; there were also some miscellaneous issues seen during internal testing.
RESOLUTION: The code has been modified to fix the internally reported issues, along with other miscellaneous changes.

* INCIDENT NO:2482337 TRACKING ID:2431674
SYMPTOM: Panic in vx_common_msgprint() via vx_inactive().
DESCRIPTION: The problem is that the call to VX_CMN_ERR() uses an "llx" format character which vx_common_msgprint() does not understand. It gives up trying to process that format but continues on without consuming the corresponding parameter. Everything else in the parameter list is effectively shifted by 8 bytes, and by the time the string argument is processed, it's game over.
RESOLUTION: The format is changed to "llu", which vx_common_msgprint() understands.

* INCIDENT NO:2486597 TRACKING ID:2486589
SYMPTOM: Multiple threads may wait on a mutex owned by a thread that is in the function vx_ireuse_steal(), with the following stack trace, on a machine with severe inode pressure:
vx_ireuse_steal()
vx_ireuse()
vx_iget()

DESCRIPTION:
Several threads are waiting to get inodes from VxFS. The current number of inodes has reached the maximum number of inodes (vxfs_ninode) that can be created in memory, so no new allocations are possible and the threads wait.

RESOLUTION:
The code is modified so that in this situation threads return ENOINODE instead of retrying to get inodes.

* INCIDENT NO:2494464 TRACKING ID:2247387

SYMPTOM:
The internal local mount noise.fullfsck.N4 test hit the assert vx_ino_update:2, with the following stack trace:
panic: f:vx_ino_update:2
Stack Trace:
IP                  Function Name
0xe0000000023d5780  ted_call_demon+0xc0
0xe0000000023d6030  ted_assert+0x130
0xe000000000d66f80  vx_ino_update+0x230
0xe000000000d727e0  vx_iupdat_local+0x13b0
0xe000000000d638b0  vx_iupdat+0x230
0xe000000000f20880  vx_tflush_inode+0x210
0xe000000000f1fc80  __vx_fsq_flush___vx_tran.c__4096000_0686__+0xed0
0xe000000000f15160  vx_tranflush+0xe0
0xe000000000d2e600  vx_tranflush_threaded+0xc0
0xe000000000d16000  vx_workitem_process+0x240
0xe000000000d15ca0  vx_worklist_thread+0x7f0
0xe000000001471270  kthread_daemon_startup+0x90
End of Stack Trace

DESCRIPTION:
The INOILPUSH flag was not set when the inode was updated, which caused the assert. Creation and deletion of a clone reset the INOILPUSH flag, and the function vx_write1_fast() did not set the flag again after updating the inode and file.

RESOLUTION:
The code is modified so that if the INOILPUSH flag is not set in vx_write1_fast(), the function sets it.

* INCIDENT NO:2496959 TRACKING ID:2496954

SYMPTOM:
Using the vxtunefs(1M) command, the tunable pdir_enable can be set to invalid values.

DESCRIPTION:
The sanity check on the tunable pdir_enable is incorrect, so invalid values can be set.

RESOLUTION:
The code is changed to correct the sanity check of pdir_enable.

* INCIDENT NO:2508164 TRACKING ID:2481984

SYMPTOM:
Access to the file system hangs.
DESCRIPTION:
The function vx_setqrec() calls vx_dqget(). When vx_dqget() returns an error, vx_setqrec() tries to unlock the DQ structure using VX_DQ_CLUSTER_UNLOCK, but in this situation the DQ structure does not hold the lock; hence the hang.

RESOLUTION:
dq_inval is now set in vx_dqget() when any error occurs there, and unlocking the DQ structure is skipped in the error code path of vx_setqrec() if dq_inval is set.

* INCIDENT NO:2529356 TRACKING ID:2340953

SYMPTOM:
During an internal stress test, the assert f:vx_iget:1a is seen.

DESCRIPTION:
While renaming a file, we check whether the target directory is in the path of the source file to be renamed. While using the function vx_iget() to walk up to the root inode, one of the parent directory inode numbers was 0, hence the assert.

RESOLUTION:
The code is changed so that during renames, the parent directory is first assigned the correct inode number before vx_iget() is used to reach the root inode.

* INCIDENT NO:2559601 TRACKING ID:476179

SYMPTOM:
The most common symptom known so far involves corrupted IAU header(s), uncovered by a full fsck, as shown in the example below:
# fsck -F vxfs -o full /dev/vx/dsk/bigdg/bigvol
fileset 999, invalid magic number in primary IAU 196
log replay in progress
fileset 999, invalid magic number in primary IAU 196
fileset 999, invalid magic number in primary IAU 196
fileset 999, invalid magic number in primary IAU 196
pass0 - checking structural files
fileset 1 primary-ilist inode 64 (Primary IAU)
failed validation clear? (ynq) ...

DESCRIPTION:
The root cause of the issue is that the kernel routine vx_write_blk() incorrectly truncates a 64-bit block offset into a 32-bit block offset, ignoring the high-order bits of the offset. VxFS invokes this routine in several places to write metadata, including inode allocation units (IAUs), extended attributes and the file change log.
If this routine happens to write to a file system block mapping to an offset greater than 4TB, the data is incorrectly written to an offset 4TB lower.

RESOLUTION:
Code changes were made to remove the truncation of the 64-bit block offset to 32 bits.

* INCIDENT NO:2559801 TRACKING ID:2429566

SYMPTOM:
Memory used for the VxFS internal buffer cache may grow significantly after 497 days of uptime, when LBOLT (a global that gives the current system time in ticks) wraps around.

DESCRIPTION:
The age of a buffer is calculated from the LBOLT value: age = (current LBOLT - LBOLT when the buffer was added to the list). A buffer is reused when its age becomes greater than a threshold. When LBOLT wraps, the current LBOLT becomes a very small value and the age becomes negative. VxFS then treats the buffer as not old and never reuses it, so buffer cache memory usage grows because buffers are never reclaimed.

RESOLUTION:
We now check whether LBOLT has wrapped around. If it has, we reassign the buffer timestamp with the current LBOLT so that the buffer is reused after some time.

INCIDENTS FROM OLD PATCHES:
---------------------------
NONE