* * * READ ME * * *
* * * Veritas File System 5.1 SP1 RP2 * * *
* * * P-patch 3 * * *
Patch Date: 2012-08-06


This document provides the following information:

* PATCH NAME
* PACKAGES AFFECTED BY THE PATCH
* BASE PRODUCT VERSIONS FOR THE PATCH
* OPERATING SYSTEMS SUPPORTED BY THE PATCH
* INCIDENTS FIXED BY THE PATCH
* INSTALLATION PRE-REQUISITES
* INSTALLING THE PATCH
* REMOVING THE PATCH


PATCH NAME
----------
Veritas File System 5.1 SP1 RP2 P-patch 3


PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxfs


BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
* Veritas Storage Foundation for Oracle RAC 5.1 SP1
* Veritas Storage Foundation Cluster File System 5.1 SP1
* Veritas Storage Foundation 5.1 SP1
* Veritas Storage Foundation High Availability 5.1 SP1


OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
AIX 5.3
AIX 6.1
AIX 7.1


INCIDENTS FIXED BY THE PATCH
----------------------------
This patch fixes the following Symantec incidents:

Patch ID: 5.1.112.300

* 2169326 (Tracking ID: 2169324)

SYMPTOM:
On a local mount (LM), when a clone is mounted for a file system and a quota assigned to the clone is exceeded, removing the clone while files from the clone are still being accessed may hit an assert in the vx_idelxwri_off() function, reached through vx_trunc_tran().

DESCRIPTION:
During clone removal, all the inodes of the clone(s) being removed are traversed, and the assert is hit because the on-disk and in-core sizes differ for a file that is being modified by the application.

RESOLUTION:
While truncating files, if the VX_IEPTTRUNC op is set, the in-core file size is set to the on-disk file size.

* 2243061 (Tracking ID: 1296491)

SYMPTOM:
Performing a nested mount on a CFS file system triggers a data page fault if a forced unmount is also taking place on the CFS file system.
The panic stack trace involves the following kernel routines:
vx_glm_range_unlock
vx_mount
domount
mount
syscall

DESCRIPTION:
When the underlying cluster-mounted file system is in the process of unmounting, the nested mount dereferences a NULL vfs structure pointer, thereby causing a system panic.

RESOLUTION:
The code has been modified to prevent a forced unmount of the underlying cluster file system while a nested mount above it is in progress. The ENXIO error is returned to the forced unmount attempt.

* 2243063 (Tracking ID: 1949445)

SYMPTOM:
A hang can occur when file creates are performed on a large directory. The stack of the hung thread is similar to the following:
vxglm:vxg_grant_sleep+226
vxglm:vxg_cmn_lock+563
vxglm:vxg_api_lock+412
vxfs:vx_glm_lock+29
vxfs:vx_get_ownership+70
vxfs:vx_exh_coverblk+89
vxfs:vx_exh_split+142
vxfs:vx_dexh_setup+1874
vxfs:vx_dexh_create+385
vxfs:vx_dexh_init+832
vxfs:vx_do_create+713

DESCRIPTION:
For large directories, the Large Directory Hash (LDH) is enabled to improve lookups on such directories. The hang was caused by taking ownership of the LDH inode twice in the same thread context, that is, while building the hash for the directory.

RESOLUTION:
The code is modified to avoid taking ownership again if the thread already has ownership of the LDH inode.

* 2247299 (Tracking ID: 2161379)

SYMPTOM:
In a Cluster File System (CFS) environment, various file system operations hang with the following stack traces:
T1:
vx_event_wait()
vx_async_waitmsg()
vx_msg_send()
vx_iread_msg()
vx_rwlock_getdata()
vx_glm_cbfunc()
vx_glmlist_thread()

T2:
vx_ilock()
vx_assume_iowner()
vx_hlock_getdata()
vx_glm_cbfunc()
vx_glmlist_thread()

DESCRIPTION:
Due to improper handling of the ENOTOWNER error in the ireadreceive() function, the operation is retried repeatedly while holding an inode lock. All the other threads are blocked, thus causing a deadlock.

RESOLUTION:
The code is modified to release the inode lock on the ENOTOWNER error and acquire it again, thus resolving the deadlock.
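The release-and-retry pattern applied in the ENOTOWNER fix above can be sketched in miniature. This is an illustrative model only; Inode, ENOTOWNER, become_owner and read_with_retry are invented names, not VxFS identifiers:

```python
# Sketch: on a retryable error, drop the lock before retrying so that the
# thread responsible for making the operation succeed can acquire it.
import threading
import time

ENOTOWNER = object()   # sentinel standing in for the error code

class Inode:
    def __init__(self):
        self.lock = threading.Lock()
        self.owner_ready = threading.Event()

def become_owner(inode):
    # Models the ownership-transfer thread: it needs the inode lock, so a
    # retry loop that held the lock throughout would deadlock it.
    with inode.lock:
        inode.owner_ready.set()

def read_with_retry(inode):
    # Fixed pattern: take the lock, attempt the read, and on ENOTOWNER
    # release the lock and back off before re-acquiring and retrying.
    while True:
        with inode.lock:
            data = "data" if inode.owner_ready.is_set() else ENOTOWNER
        if data is not ENOTOWNER:
            return data
        time.sleep(0.001)   # back off outside the lock

ino = Inode()
threading.Thread(target=become_owner, args=(ino,)).start()
print(read_with_retry(ino))   # completes instead of deadlocking
```

Had read_with_retry kept the lock across retries, become_owner could never set the event and both threads would wait forever, which is the deadlock shape described above.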
* 2249658 (Tracking ID: 2220300)

SYMPTOM:
The 'vx_sched' daemon hogs CPU resources.

DESCRIPTION:
The vx_sched process calls vx_iflush_list() to perform the background flushing, and vx_iflush_list() calls vx_logwrite_flush() if the file has had logged writes performed upon it. vx_logwrite_flush() uses an old technique that is ineffective when flushing in chunks: it flushes the file asynchronously, then flushes the file again synchronously. The entire file is therefore flushed twice, which doubles the work when chunk flushing.

RESOLUTION:
vx_logwrite_flush() has been changed to flush the file once rather than twice; the asynchronous flush in vx_logwrite_flush() has been removed.

* 2255786 (Tracking ID: 2253617)

SYMPTOM:
Full fsck fails to run cleanly using "fsck -n".

DESCRIPTION:
In the case of duplicate file name entries in one directory, fsck compares the directory entry with the previous entries. If the file name already exists, further action is taken according to the user input [Yes/No]. Because strncmp is used with a length of n, only the first n characters are compared; if they match, the entry is treated as a duplicate file name entry, and fsck fails to run cleanly using "fsck -n".

RESOLUTION:
The code is modified to check the file name size and use a length of name_len + 1 in strncmp, which resolves the issue.

* 2257904 (Tracking ID: 2251223)

SYMPTOM:
The df(1M) command with the -h option takes 10 seconds to execute and reports an inaccurate free block count, shortly after a large number of files are removed.

DESCRIPTION:
When removing files, some of the file data blocks are released and counted in the total free block count instantly. However, all the blocks may not be freed immediately, as Veritas File System (VxFS) can sometimes delay the releasing of blocks. Therefore, the displayed free block count, at any given time, is the total of the free blocks and the 'delayed' free blocks.
Once a 'file remove' transaction is done, its 'delayed' free blocks are eliminated and the free block count increases accordingly. However, some functions which process certain transactions, for example a metadata update, can also alter the free block count, but ignore the current 'delayed' free blocks. As a result, if the 'file remove' transactions have not finished updating their free blocks and their 'delayed' free blocks information, the free space count can occasionally show more than the real disk space. Therefore, to obtain an up-to-date and valid free block count for a file system, a delay-and-retry loop delays 1 second before each retry and loops 10 times before giving up. Thus, the df(1M) command with the -h option sometimes takes 10 seconds to execute. But even if the file system waits for 10 seconds, there is no guarantee that the output displayed will be accurate or valid.

RESOLUTION:
The code is modified so that the delayed free block count is recalculated accurately when transactions are created and metadata is flushed to the disk.

* 2275543 (Tracking ID: 1475345)

SYMPTOM:
The write() system call hangs for over 10 seconds.

DESCRIPTION:
While performing a transaction in the case of a logged write, the buffers belonging to the transaction space were flushed asynchronously, one at a time. This asynchronous flushing caused intermediate delays in the write operation because of the reduced transaction space.

RESOLUTION:
All the dirty buffers on the file are now flushed in one attempt through a synchronous flush, which frees up a large amount of transaction space. This reduces the delay during the write system call.

* 2280386 (Tracking ID: 2061177)

SYMPTOM:
On systems running Veritas File System (VxFS) version 5.0MP3RP1, the fsadm(1M) command with the '-de' option displays a 'bad file number' error on file systems.

DESCRIPTION:
The fsadm(1M) command maintains an in-core copy of the inode information from the disk and may re-read it later for any other processing.
The command fails with an error because the in-core data and the on-disk data are out of synchronization.

RESOLUTION:
The code is modified to add a sync operation in the fsadm(1M) command before it reads the layout from the raw disk, to ensure synchronization of the in-core data and the on-disk data.

* 2280552 (Tracking ID: 2246579)

SYMPTOM:
File system corruption and a system panic can occur when attempting to extend a 100%-full disk layout version 5 (DLV5) VxFS file system using fsadm(1M).

DESCRIPTION:
The behavior is caused by file system metadata that was relocated to the intent log area being inadvertently destroyed when the intent log is cleared during the resize operation.

RESOLUTION:
The in-core intent log extent map is refreshed by reading the bmap of the intent log inode before clearing it.

* 2296277 (Tracking ID: 2296107)

SYMPTOM:
The fsppadm command (fsppadm query -a mountpoint) displays "Operation not applicable" while querying the mount point.

DESCRIPTION:
During the fsppadm query process, fsppadm tries to open the named data stream of every file in the file system, but the VxFS internal File Change Log file "changelog" does not support this operation, and ENOSYS is returned in this case. fsppadm translates ENOSYS into "Operation not applicable" and prints the bogus error message.

RESOLUTION:
fsppadm's get_file_tags() is fixed to ignore the ENOSYS error.

* 2311490 (Tracking ID: 2074806)

SYMPTOM:
A DMAPI program using dm_punch_hole may result in corrupted data.

DESCRIPTION:
When the dm_punch_hole call is made on a file with allocated extents immediately after a previous write, data can be written through stale pages. This causes data to be written to the wrong location.

RESOLUTION:
dm_punch_hole now invalidates all the pages within the hole it is creating.

* 2320044 (Tracking ID: 2419989)

SYMPTOM:
The ncheck(1M) command with the '-i' option does not limit the output to the specified inodes.
DESCRIPTION:
Currently, the ncheck(1M) command with the '-i' option shows free space information and other inodes that are not in the list provided by the '-i' option.

RESOLUTION:
The ncheck(1M) command is modified to print only those inodes that are specified by the '-i' option.

* 2320049 (Tracking ID: 2419991)

SYMPTOM:
There is no way to specify an inode that is unique to the file system, since inode numbers are reused in multiple filesets. A way is therefore needed to specify a list of filesets, similar to the '-i' option for inodes, or a new '-o' option where fileset+inode pairs can be specified.

DESCRIPTION:
When the ncheck command is called with the '-i' option in conjunction with the -oblock/device/sector option, it displays inodes having the same inode number from all filesets. There is no command line option that allows specifying a unique inode and fileset combination.

RESOLUTION:
The code is modified to add a '-f' option to the ncheck command, with which one can specify the fset number on which to filter the results. Further, if this option is used together with the '-i' option, the inode-fileset pair(s) to display can be specified uniquely.

* 2329887 (Tracking ID: 2253938)

SYMPTOM:
In a Cluster File System (CFS) environment, the file read performance gradually degrades to up to 10% of the original read performance, and 'fsadm -F vxfs -D -E' shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than 8 blks: 5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs.
This leads to a fragmentation of the free space, thus leading to badly fragmented files.

RESOLUTION:
The code is modified such that the time for which the delegation of the AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to de-fragment free space using the -C option.

* 2329893 (Tracking ID: 2316094)

SYMPTOM:
vxfsstat incorrectly reports a "vxi_bcache_maxkbyte" value greater than "vx_bc_bufhwm" after reinitialization of the buffer cache globals. Reinitialization can happen during dynamic reconfiguration operations. The "vxi_bcache_maxkbyte" counter of vxfsstat shows the maximum memory available for the allocation of buffer cache buffers. The maximum memory available for buffer allocation depends on the total memory available for the buffer cache (buffers + buffer headers), that is, the "vx_bc_bufhwm" global. Therefore, vxi_bcache_maxkbyte should never be greater than vx_bc_bufhwm.

DESCRIPTION:
"vxi_bcache_maxkbyte" is a per-CPU counter, that is, part of the global per-CPU 'vx_info' counter structure. vxfsstat sums all the per-CPU counters and reports the result. During reinitialization of the buffer cache, this counter was not set to zero properly before the new value was assigned to it. Therefore, the total sum of this per-CPU counter could be more than 'vx_bc_bufhwm'.

RESOLUTION:
During buffer cache reinitialization, "vxi_bcache_maxkbyte" is now correctly set to zero, such that the final sum of this per-CPU counter is correct.

* 2340741 (Tracking ID: 2282201)

SYMPTOM:
On a VxFS file system, a vxdump(1M) operation running in parallel with other file system operations, such as create and delete, can fail with signal SIGSEGV, generating a core file.

DESCRIPTION:
vxdump caches the inodes to be dumped in a bitmap before starting the dump of a directory; however, this information can change if creates and deletes are happening in the background, leading to an inconsistent bitmap and eventually generating a core file.
RESOLUTION:
The code is updated to refresh the inode bitmap before actually starting the dump operation, thus avoiding the core file generation.

* 2340794 (Tracking ID: 2086902)

SYMPTOM:
The performance of a system with Veritas File System (VxFS) is affected due to high contention for a spinlock.

DESCRIPTION:
The contention occurs because there are a large number of work items on these systems, and these work items are enqueued to and dequeued from the global list individually.

RESOLUTION:
The code is modified to process the work items by bulk enqueue/dequeue to reduce the VxFS worklist lock contention.

* 2340799 (Tracking ID: 2059611)

SYMPTOM:
The system panics because of a NULL tranp in vx_unlockmap().

DESCRIPTION:
vx_unlockmap() unlocks a map structure of the file system. If the map is being handled, the hold count is incremented. vx_unlockmap() attempts to check whether the mlink doubly linked list is empty, but the asynchronous vx_mapiodone routine can change the link at unpredictable times, even when the hold count is zero.

RESOLUTION:
The evaluation order inside vx_unlockmap() is changed, such that further evaluation is skipped when the map hold count is zero.

* 2340817 (Tracking ID: 2192895)

SYMPTOM:
A system panic occurs when executing File Change Log (FCL) commands, and the following stack trace is displayed:
panicsys()
panic_common()
panic()
vmem_xalloc()
vmem_alloc()
segkmem_xalloc()
segkmem_alloc_vn()
vmem_xalloc()
vmem_alloc()
kmem_alloc()
vx_getacl()
vx_getsecattr()
fop_getsecattr()
cacl()
acl()
syscall_trap32()

DESCRIPTION:
The Access Control List (ACL) count in the inode can be corrupted due to a race condition. For example, the setacl() function can change the ACL count while the getacl() function is processing the same inode. This results in an incorrect ACL count.

RESOLUTION:
The code is modified to add protection to the vulnerable ACL count to avoid corruption.
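The shape of the ACL-count fix above, a shared count that must stay consistent with the data it describes, can be sketched as follows. AclInode, set_acl and get_acl are illustrative names for this sketch, not VxFS code:

```python
# Sketch: guard a count and the entries it describes with one lock, so a
# reader never sizes its copy from a count a writer is mid-way through
# changing.
import threading

class AclInode:
    def __init__(self):
        self._lock = threading.Lock()
        self._acl_entries = []

    def set_acl(self, entries):
        # Writer (setacl path): replace the entries, and implicitly the
        # count, atomically with respect to concurrent readers.
        with self._lock:
            self._acl_entries = list(entries)

    def get_acl(self):
        # Reader (getacl path): snapshot count and entries under the same
        # lock, so the count always matches the copied entries.
        with self._lock:
            return len(self._acl_entries), list(self._acl_entries)
```

Without the shared lock, a getacl-style reader could read the count, be preempted by a setacl-style writer, and then allocate or copy based on a stale count, which is the corruption pattern described in the incident.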
* 2340825 (Tracking ID: 2290800)

SYMPTOM:
When the fsdb_vxfs(1M) command is used to look at the bmap of an ILIST file (the "mapall" command), a large hole at the end of the ILIST file is wrongly reported.

DESCRIPTION:
While reading the bmap of an ILIST file, if a hole is found at the end of the indirect extents, the fsdb_vxfs(1M) command may incorrectly mark the hole as the last extent in the bmap, causing the "mapall" command within the file system debugger to show a large hole up to the end of the file.

RESOLUTION:
The code has been modified to read an ILIST file's bmap correctly when holes are found at the end of the indirect extents.

* 2340831 (Tracking ID: 2272072)

SYMPTOM:
GAB panics the system because the VCS engine "had" did not respond when the lbolt wraps around.

DESCRIPTION:
The lbolt wraps around after 498 days of machine uptime. VxFS flushes its metadata buffers based on their age, and the age calculation takes lbolt into account. Because of the lbolt wrap, the buffers were not flushed, so a great deal of metadata I/O stopped, and hence the panic.

RESOLUTION:
The function that handles the flushing of dirty buffers now also handles the condition where lbolt has wrapped: if it has, the current lbolt time is assigned to the last update time of the dirty list.
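The wraparound handling described in the resolution above can be sketched as a small age check. The function name and the re-stamping policy are illustrative of the described fix, not actual VxFS code:

```python
# Sketch: age-based flush decision that tolerates tick-counter wraparound.
def should_flush(last_update, now, max_age):
    """Return (flush?, corrected_last_update) for a dirty-list timestamp."""
    if now < last_update:
        # The tick counter wrapped: a raw 'now - last_update' would be
        # negative (or huge, if unsigned) and flushing would stop.
        # Re-stamp the dirty list with the current time instead.
        last_update = now
    return (now - last_update) >= max_age, last_update
```

After a wrap, the buffers are simply treated as freshly stamped, so they age out and get flushed again on subsequent passes instead of never being flushed.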
* 2340834 (Tracking ID: 2302426)

SYMPTOM:
The system panics when multiple 'vxassist mirror' commands are running concurrently, with the following stack trace:
0) panic+0x410
1) unaligned_hndlr+0x190
2) bubbleup+0x880 ( )
+------------- TRAP #1 ----------------------------
| Unaligned Reference Fault in KERNEL mode
| IIP=0xe000000000b03ce0:0
| IFA=0xe0000005aa53c114 <---
| p struct save_state 0x2c561031.0x9fffffff5ffc7400
+------------- TRAP #1 ----------------------------
LVL FUNC ( IN0, IN1, IN2, IN3, IN4, IN5, IN6, IN7 )
3) vx_copy_getemap_structs+0x70
4) vx_send_getemapmsg+0x240
5) vx_cfs_getemap+0x240
6) vx_get_freeexts_ioctl+0x990
7) vxportal_ioctl+0x4d0
8) spec_ioctl+0x100
9) vno_ioctl+0x390
10) ioctl+0x3c0
11) syscall+0x5a0

DESCRIPTION:
The panic is caused by dereferencing an unaligned address in a CFS message structure.

RESOLUTION:
bcopy is used to ensure proper alignment of the addresses.

* 2340839 (Tracking ID: 2316793)

SYMPTOM:
After removing the files in a file system, the df(1M) command, which uses the statfs(2) function, may take 10 seconds to complete.

DESCRIPTION:
To obtain an up-to-date and valid free block count in a file system, a delay-and-retry loop delays for one second and retries 10 times. This excessive retrying causes a 10-second delay per file system while executing the df(1M) command.

RESOLUTION:
The code is modified to reduce the original 10 retries with a one-second delay each to one retry after a 20-millisecond delay.

* 2341007 (Tracking ID: 2300682)

SYMPTOM:
When a file is newly created, issuing "fsppadm query -a /mount_point" could show incorrect IOTemp information.

DESCRIPTION:
fsppadm query outputs incorrect data when the file re-uses an inode number that belonged to a removed file, but the database still contains the obsolete record for the removed one. The fsppadm utility makes use of a database to save inodes' historical data. It compares the nearest and the farthest records for an inode to compute IOTemp in a time window.
It also picks the generation of the inode in the farthest record to check the inode's existence. However, if the farthest record is missing, zero is mistakenly used as the generation.

RESOLUTION:
If the nearest record for a given inode exists in the database, the generation entry is extracted from it instead of from the farthest one.

* 2360817 (Tracking ID: 2332460)

SYMPTOM:
Executing the VxFS 'vxedquota -p user1 user2' command to copy the quota information of one user to other users takes a long time to run to completion.

DESCRIPTION:
VxFS maintains quota information in two files - external quota files and internal quota files. Whilst copying the quota information of one user to another, the required quota information is read from both the external and internal files. However, the quota information should only need to be read from the external file in a situation where the read from the internal file has failed. Reading from both files therefore causes an unnecessary delay in the command execution time.

RESOLUTION:
The unnecessary duplication of reading both the external and internal quota files to retrieve the same information has been removed.

* 2360819 (Tracking ID: 2337470)

SYMPTOM:
The Cluster File System (CFS) can unexpectedly and prematurely report a 'file system out of inodes' error when attempting to create a new file. The following error message is displayed:
vxfs: msgcnt 1 mesg 011: V-2-11: vx_noinode - /dev/vx/dsk/dg/vol file system out of inodes.

DESCRIPTION:
While allocating new index nodes (inodes) in a CFS, Veritas File System (VxFS) searches for an available free inode in the Inode Allocation Units (IAUs) that are delegated to the local node. If none are available, it searches the IAUs that are not delegated to any node, or revokes an IAU delegated to another node. Gaps may be created in the IAU structures as a side effect of the CFS delegation processing.
However, while searching for an available free inode, if VxFS ignores these gaps, no new IAUs can be created once the maximum size of the metadata structures (2^31) is reached; one of the gaps must then be populated and used for the allocation of the new inode. If the gaps are ignored, VxFS may prematurely report the "file system out of inodes" error message even though there is enough free space in the VxFS file system to create new inodes.

RESOLUTION:
The code is modified to allocate new inodes from the gaps in the IAU structures created as a part of the CFS delegation processing.

* 2360821 (Tracking ID: 1956458)

SYMPTOM:
When attempting to check the information of checkpoints with 'fsckptadm -C blockinfo', the command fails with error 6 (ENXIO), the file system is disabled, and errors such as the following appear in the message file:
vxfs: msgcnt 4 mesg 012: V-2-12: vx_iget - /dev/vx/dsk/sfsdg/three file system invalid inode number 4495
vxfs: msgcnt 5 mesg 096: V-2-96: vx_setfsflags - /dev/vx/dsk/sfsdg/three file system fullfsck flag set - vx_cfs_iread

DESCRIPTION:
VxFS makes use of ilist files in the primary fileset and in checkpoints to accommodate inode information. A hole in an ilist file indicates that the inodes in the hole do not exist and are not yet allocated in the corresponding fileset or checkpoint. fsckptadm checks every inode in the primary fileset and the downstream checkpoints. If an inode falls into a hole in a prior checkpoint, that is, the associated file was not generated at the time of the checkpoint creation, fsckptadm exits with an error.

RESOLUTION:
Inodes in the downstream checkpoints are now skipped if they are located in a hole.

* 2368738 (Tracking ID: 2368737)

SYMPTOM:
If a file which has shared extents has corrupt indirect blocks, then in certain cases the reference count tracking system can try to interpret this block and panic the system.
Since this is an asynchronous background operation, the processing is retried repeatedly on every file system mount, and hence a panic can result every time the file system is mounted.

DESCRIPTION:
The reference count tracking system for shared extents updates reference counts in a lazy fashion, so in certain cases it has to asynchronously access shared indirect blocks belonging to a file to account for reference count updates. If such an indirect block has already been badly corrupted, this tracking mechanism can panic the system repeatedly, on every mount.

RESOLUTION:
The reference count tracking system now validates the indirect extent read from the disk; if it is found invalid, the VX_FULLFSCK flag is set in the superblock, marking the file system for a full fsck, and the file system is disabled on the current node.

* 2373565 (Tracking ID: 2283315)

SYMPTOM:
The system may panic when "fsadm -e" is run on a file system containing file level snapshots. The panic stack looks like:
crash_kexec()
__die at()
do_page_fault()
error_exit()
[exception RIP: vx_bmap_lookup+36]
vx_bmap_lookup()
vx_bmap()
vx_reorg_emap()
vx_extmap_reorg()
vx_reorg()
vx_aioctl_full()
vx_aioctl_common()
vx_aioctl()
vx_ioctl()
do_ioctl()
vfs_ioctl()
sys_ioctl()
tracesys

DESCRIPTION:
The panic happens because of a NULL inode pointer passed to the vx_bmap_lookup() function. While reorganizing the extents of a file, a block map (bmap) lookup operation is done on the file to get information about its extents. If this bmap lookup finds a hole at an offset in a file containing shared extents, a local variable is not updated, which makes the inode pointer NULL during the next bmap lookup operation.

RESOLUTION:
The local variable is initialized such that the inode pointer passed to vx_bmap_lookup() is non-NULL.
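The validate-before-use approach taken for incident 2368738 above can be sketched as follows. The magic value, block layout, and flag handling are purely illustrative; they are not the real VxFS disk format:

```python
# Sketch: validate an (illustrative) indirect-extent block read from disk;
# on corruption, flag the file system for a full fsck and return an error
# instead of interpreting garbage (which is what caused the panic).
VX_FULLFSCK = "fullfsck"   # stands in for the superblock flag
INDIR_MAGIC = 0x1EAF       # invented magic number for this sketch
MAX_ENTRIES = 16

def read_indirect_block(raw, sb_flags):
    """Return the extent entries, or None (and flag full fsck) if corrupt."""
    if len(raw) < 2:
        sb_flags.add(VX_FULLFSCK)
        return None
    magic, count = raw[0], raw[1]
    if magic != INDIR_MAGIC or not 0 <= count <= min(MAX_ENTRIES, len(raw) - 2):
        # Corrupt block: mark for full fsck and bail out cleanly.
        sb_flags.add(VX_FULLFSCK)
        return None
    return raw[2:2 + count]
```

The design choice mirrors the resolution text: a corrupt on-disk structure is converted into a recoverable error (full-fsck flag plus local disable) rather than a repeated panic on every mount.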
* 2386483 (Tracking ID: 2374887)

SYMPTOM:
Access to a file system can hang when creating a named attribute, due to a read/write lock being held exclusively and indefinitely, causing a thread to loop in vx_tran_nattr_dircreate(). A typical stack trace of a looping thread:
vx_itryhold_locked
vx_iget
vx_attr_iget
vx_attr_kgeti
vx_attr_getnonimmed
vx_acl_inherit
vx_aclop_creat
vx_attr_creatop
vx_new_attr
vx_attr_inheritbuf
vx_attr_inherit
vx_tran_nattr_dircreate
vx_nattr_copen
vx_nattr_open
vx_setea
vx_linux_setxattr
vfs_setxattr
link_path_walk
sys_setxattr
system_call

DESCRIPTION:
The initial creation of a named attribute for a regular file or directory results in the automatic creation of a 'named attribute directory'. Creations are initially attempted in a single transaction. Should the single transaction fail due to a read/write lock being held, a retry should split the task into multiple transactions. An incorrect reset of a tracking structure meant that all retries were performed using a single transaction, creating an endless retry loop.

RESOLUTION:
The tracking structure is no longer reset within the retry loop.

* 2402643 (Tracking ID: 2399178)

SYMPTOM:
Full fsck performs large directory index validation during pass2c. If the number of large directories is high, this pass takes a long time; there is significant scope to improve full fsck performance during this pass.

DESCRIPTION:
Pass2c consists of the following basic operations:
[1] Read the entries in the large directory.
[2] Cross-check the hash values of those entries against the hash directory inode contents residing on the attribute ilist.
This makes it another heavily I/O-intensive pass.

RESOLUTION:
1. Directory block read-ahead is used during step [1].
2. Wherever possible, the file contents are accessed extent-wise rather than in fs block size (while reading entries in the directory) or in hash block size (8k, during dexh_getblk).
With the above enhancements, the buffer cache is utilized more effectively.

* 2405590 (Tracking ID: 2397976)

SYMPTOM:
A system can panic when performing buffered i/o using large i/o sizes. The following is an example stack trace showing where the system panics:
d_map_list_tce+000510
efc_mapdma_iocb+000498
efc_start+000518
efc_output+0005A4
efsc_start+0006A8
efsc_strategy+000D98
std_devstrat+000364
devstrat+000050
scsidisk_start+000D7C
scsidisk_iodone+00068C
internal_iodone_offl+000174
iodone_offl+000080
i_softmod+000274
flih_util+00024C

DESCRIPTION:
Prior to the VxFS 5.1 release, the maximum buffered i/o size was restricted to 128Kb on the AIX platform. To assist buffered i/o performance, an enhancement was introduced in release 5.1 to perform larger sized buffered i/o's. With this enhancement, VxFS can be asked to perform buffered i/o of sizes greater than 128K by tuning the vxfs tunables "read_pref_io" and "write_pref_io" to values greater than 128K. When performing buffered i/o using i/o sizes of multiple megabytes, the system can panic in the routine d_map_list_tce() whilst performing DMA mappings.

RESOLUTION:
A 128Kb buffered i/o size limitation is being reintroduced to avoid the possibility of a system panic. Regardless of the "read_pref_io" or "write_pref_io" tunable settings, the maximum i/o size for buffered i/o will be restricted to 128Kb.

* 2412029 (Tracking ID: 2384831)

SYMPTOM:
The system panics with the following stack trace. This happens in some cases when named streams are used in VxFS.
machine_kexec()
crash_kexec()
__die
do_page_fault()
error_exit()
[exception RIP: iput+75]
vx_softcnt_flush()
vx_ireuse_clean()
vx_ilist_chunkclean()
vx_inode_free_list()
vx_ifree_scan_list()
vx_workitem_process()
vx_worklist_process()
vx_worklist_thread()
vx_kthread_init()
kernel_thread()

DESCRIPTION:
VxFS internally creates a directory to keep the named streams pertaining to a file. In some scenarios, an error code path is missing the release of the hold on that directory. Because of this, unmounting the file system does not clean up the inode belonging to that directory, and a panic is seen later when VxFS reuses such an inode.

RESOLUTION:
The hold on the named streams directory is now released in case of an error.

* 2412177 (Tracking ID: 2371710)

SYMPTOM:
The user quota file gets corrupted when the DELICACHE feature is enabled, and the current inode usage of a user becomes negative after frequent file creations and deletions. If the quota information is checked using the vxquota command with the '-vu username' option, the number of files is "-1". For example:
# vxquota -vu testuser2
Disk quotas for testuser2 (uid 500):
Filesystem     usage   quota   limit  timeleft  files  quota  limit  timeleft
/vol01       1127809 8239104 8239104               -1      0      0

DESCRIPTION:
This issue is introduced by the inode DELICACHE feature, which is a performance enhancement to optimize the updates done to the inode map during file creation and deletion operations. The feature is enabled by default, and can be changed by the vxtunefs(1M) command. When DELICACHE is enabled and the quota is set for Veritas File System (VxFS), there is an extra quota update for the inodes on the inactive list during the removal process. Since this quota has already been updated before the inode was put on the DELICACHE list, the current number of user files gets decremented twice.

RESOLUTION:
The code is modified to add a flag to identify the inodes which have been moved to the inactive list from the DELICACHE list.
This flag is used to prevent decrementing the quota again during the removal process.

* 2412179 (Tracking ID: 2387609)

SYMPTOM:
The quota usage gets set to ZERO when the file system is unmounted and mounted again, even though files owned by users exist. This issue may occur after some file creations and deletions. Checking the quota usage using the "vxrepquota" command produces output like the following:
# vxrepquota -uv /vx/sofs1/
/dev/vx/dsk/sfsdg/sofs1 (/vx/sofs1):
                    Block limits                    File limits
User          used    soft    hard  timeleft  used  soft  hard  timeleft
testuser1 --     0 3670016 4194304               0     0     0
testuser2 --     0 3670016 4194304               0     0     0
testuser3 --     0 3670016 4194304               0     0     0
Additionally, the quota usage may not be updated after the inode/block usage reaches ZERO.

DESCRIPTION:
The issue occurs when VxFS merges the external per-node quota files with the internal quota file. The block offset within the external quota file could be calculated wrongly in some scenarios: when a hole is found in a per-node quota file, the file offset is modified to point to the next non-hole offset, but the block offset, which points to the next available quota record in a block, is not changed accordingly. In addition, VxFS updates per-node quota records only when the global internal quota file shows some byte or inode usage; otherwise, it does not copy the usage from the global quota file to the per-node quota file. In the case where the quota usage in the external quota files has gone down to zero and both the byte and inode usage in the global file become zero, the per-node quota records are not updated and are left with incorrect usage. The byte and inode usage in the per-node quota record should also be checked; records should be skipped only when the byte and inode usage in both the global quota file and the per-node quota file is zero.

RESOLUTION:
The calculation of the block offset when a hole is found in a per-node quota file has been corrected. Code has been added to also check the block and inode usage in the per-node quota record while updating the user quota usage.
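The double-decrement guard described for incident 2412177 above can be sketched with a per-inode flag. All names here (CachedInode, UserQuota, the function names) are invented for the sketch and are not VxFS identifiers:

```python
# Sketch: record that the quota was already charged when the inode went onto
# the delicache list, so the removal path does not charge it a second time.
class CachedInode:
    def __init__(self):
        self.from_delicache = False

class UserQuota:
    def __init__(self, nfiles):
        self.nfiles = nfiles

def move_to_delicache(ino, quota):
    # First (and only) decrement happens here; remember that it happened.
    quota.nfiles -= 1
    ino.from_delicache = True

def remove_inactive(ino, quota):
    # Removal path: skip the decrement if the delicache path already did it,
    # so the per-user file count cannot go negative (the "-1" symptom).
    if not ino.from_delicache:
        quota.nfiles -= 1
    ino.from_delicache = False
```

A user with one file whose inode passes through the delicache list thus ends at a file count of 0, not -1.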
* 2412181 (Tracking ID: 2372093)

SYMPTOM:
In a Cluster File System (CFS) environment, the file read performance gradually degrades to up to 10% of the original read performance, and 'fsadm -F vxfs -D -E' shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than 8 blks: 5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This leads to a fragmentation of the free space, thus leading to badly fragmented files.

RESOLUTION:
The code is modified such that the time for which the delegation of the AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to de-fragment free space using the -C option.

* 2412187 (Tracking ID: 2401196)

SYMPTOM:
System panic in vx_ireuse_clean during Dynamic Reconfiguration on VRTSvxfs 5.1SP1 onwards.

DESCRIPTION:
During Dynamic Reconfiguration (DR), inodes are moved to temporary DR inode lists so that DR can proceed and deinitialize the various inode lists. The code did not take care of the newly added delicache list (added in 5.1SP1) and was wrongly moving inodes from the delicache list to the freelist. As a result, an inode belonging to a local file system with VX_IREMOVE set could end up on the freelist, which breaks the assumption that such an inode should belong to CFS, and causes a panic due to dereferencing the NULL CFS-related fields of the local inode.
RESOLUTION:
A case is added so that inodes on the delicache list remain on their own list during DR and are not moved to the freelist.

* 2413811 (Tracking ID: 1590963)

SYMPTOM:
The maximum number of subdirectories is limited to 32767.

DESCRIPTION:
There is currently a fixed limit of 32767 on the maximum number of subdirectories. This value can be increased to 64K on the Linux, HP-UX and Solaris platforms. There is also a need for the flexibility to change the maximum number of subdirectories.

RESOLUTION:
The code is modified to add the vx_maxlink tunable, which controls the maximum number of subdirectories.

* 2418819 (Tracking ID: 2283893)

SYMPTOM:
In a Cluster File System (CFS) environment, file read performance gradually degrades to as little as 10% of the original read performance, and "fsadm -F vxfs -D -E" shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:

% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than 8 blks: 5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This fragments the free space, leading to badly fragmented files.

RESOLUTION:
The code is modified so that the time for which the delegation of an AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to defragment free space using the -C option.

* 2420060 (Tracking ID: 2403126)

SYMPTOM:
A hang is seen in the cluster when one of the nodes leaves or is rebooted.
One of the nodes in the cluster will contain the following stack trace:

e_sleep_thread()
vx_event_wait()
vx_async_waitmsg()
vx_msg_send()
vx_send_rbdele_resp()
vx_recv_rbdele+00029C ()
vx_recvdele+000100 ()
vx_msg_recvreq+000158 ()
vx_msg_process_thread+0001AC ()
vx_thread_base+00002C ()
threadentry+000014 (??, ??, ??, ??)

DESCRIPTION:
Whenever a node in the cluster leaves, reconfiguration happens and all the resources held by the leaving node are consolidated. This is done on one node of the cluster, called the primary node: each node sends a message to the primary node describing the resources it is currently holding. During this reconfiguration, in a corner case, VxFS incorrectly calculates a message length that is larger than what the GAB (Veritas Group Membership and Atomic Broadcast) layer can handle. As a result the message is lost: it is dropped at the sender and never sent, but the sender believes it was sent and waits for an acknowledgement. The primary node, waiting for this message, waits forever and the reconfiguration never completes, leading to the hang.

RESOLUTION:
The message length is now calculated correctly, so GAB can handle the messages.

* 2425429 (Tracking ID: 2422574)

SYMPTOM:
On CFS, after turning quotas on, when any node is rebooted and rejoins the cluster, it fails to mount the file system.

DESCRIPTION:
At the time of mounting the file system after rebooting the node, mntlock was already set, which prevented the remount of the file system when quotas are on.

RESOLUTION:
The code is changed so that the mntlock flag is masked in the quota operation, as it is already set on the mount.

* 2426039 (Tracking ID: 2412604)

SYMPTOM:
Once the time limit expires after the soft limit of the user quota is exceeded on a VxFS file system, writes are still permitted over the soft limit.

DESCRIPTION:
The timer was not started when the soft limit was initially set while the current usage had already exceeded it.
RESOLUTION:
Start the timer during the initial setting of quota limits if the current usage has already crossed the soft quota limit.

* 2427269 (Tracking ID: 2399228)

SYMPTOM:
Occasionally, Oracle archive logs can be created smaller than they should be. In the reported case the resulting Oracle archive logs were incorrectly sized at 512 bytes.

DESCRIPTION:
The fcntl [file control] command F_FREESP [free storage space] can be used to change the size of a regular file. If the file size is reduced, this is a "truncate", and space allocated in the truncated area is returned to the file system free-space pool. If the file size is increased using F_FREESP, this is a "truncate-up": the file size changes but no space is allocated in the extended area of the file. Oracle archive logs use the F_FREESP fcntl command to perform a truncate-up of a new file before a smaller write of 512 bytes is performed at the start of the file. A timing window was found with F_FREESP in which the truncate-up file size was lost, or rather overwritten, by the subsequent write of the data, causing the file to appear with a size of just 512 bytes.

RESOLUTION:
The timing window has been closed: the flush of the allocating [512-byte] write is now triggered after the new F_FREESP file size has been updated in the inode.

* 2427281 (Tracking ID: 2413172)

SYMPTOM:
The vxfs_fcl_seektime() API seeks to the first record in the File Change Log (FCL) file after the specified time. This API can incorrectly return an EINVAL (FCL record not found) error while reading the first block of the FCL file.

DESCRIPTION:
To seek to the first record after the given time, a binary search is first performed to find the largest block offset where the FCL record time is less than the given time. A linear search from this offset is then performed to find the first record whose time value is greater than the specified time. FCL records are read in buffers.
There are scenarios in which the FCL records read into one buffer are fewer than the buffer size, for example when reading the first block of the FCL file. In such scenarios the buffer read could continue even after all the data in the current buffer had been read, because the check that decides whether all records in a buffer have been read was wrong. Reading the buffer beyond its boundary caused the search to terminate without finding a record for the given time, and the EINVAL error was returned. Instead, VxFS should detect that the buffer is partially filled and continue the search with the next buffer.

RESOLUTION:
The check that decides whether all records in a buffer have been read is corrected so that the buffer is read within its boundaries.

* 2478237 (Tracking ID: 2384861)

SYMPTOM:
The following asserts are seen during internal stress and regression runs:
f:vx_do_filesnap:1b
f:vx_inactive:2a
f:xted_check_rwdata:31
f:vx_do_unshare:1

DESCRIPTION:
These asserts validate assumptions in various functions; some miscellaneous issues were also seen during internal testing.

RESOLUTION:
The code has been modified to fix the internally reported issues, along with other miscellaneous changes.

* 2478325 (Tracking ID: 2251015)

SYMPTOM:
The fsck(1M) command takes a long time to complete.

DESCRIPTION:
In an extreme case, such as a 2TB file system with a 1KB block size, 130+ checkpoints, and 100-250 million inodes per file set, the fsck(1M) command takes 15+ hours to complete intent log replay, because it has to read a few GB of IAU headers and summaries one synchronous block at a time.

RESOLUTION:
The fsck code is changed to do read-ahead on the IAU file, which reduces the fsck log-replay time.

* 2480949 (Tracking ID: 2480935)

SYMPTOM:
The system log file may contain the following error message in a multi-threaded environment with Dynamic Storage Tiering (DST):
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index creation failure for /vx/fsvm with message Argument list too long

DESCRIPTION:
In DST, while enforcing a policy, SQL queries are generated and written to the file .__fsppadm_enforcesql present in lost+found. In a multi-threaded environment, 16 threads work in parallel on the ILIST, generate SQL queries, and write them to the file. This may lead to corruption of the file if multiple threads write to it simultaneously.

RESOLUTION:
A mutex is used to serialize the threads' writes to the SQL file.

* 2482337 (Tracking ID: 2431674)

SYMPTOM:
Panic in vx_common_msgprint() via vx_inactive().

DESCRIPTION:
The call VX_CMN_ERR() uses an "llx" format character which vx_common_msgprint() does not understand. It gives up trying to process that format but continues on without consuming the corresponding parameter. Everything else in the parameter list is effectively shifted by 8 bytes, and the panic occurs when the string argument is processed.

RESOLUTION:
Changed the format to "llu", which vx_common_msgprint() understands.

* 2486597 (Tracking ID: 2486589)

SYMPTOM:
Multiple threads may wait on a mutex owned by a thread that is in the function vx_ireuse_steal(), with the following stack trace, on a machine under severe inode pressure:

vx_ireuse_steal()
vx_ireuse()
vx_iget()

DESCRIPTION:
Several threads are waiting to get inodes from VxFS. The current number of inodes has reached the maximum number of inodes (vxfs_ninode) that can be created in memory, so no new allocations are possible and the threads wait.

RESOLUTION:
The code is modified so that in this situation threads return ENOINODE instead of retrying to get inodes.

* 2487976 (Tracking ID: 2483514)

SYMPTOM:
System panic with the following stack:

pvthread+19E500 STACK:
[0001BF00]abend_trap+000000 ()
[000A29E4]tod_lock_write+000084 (??)
[000A34A0]tstop+0000C0 (??)
[04C2F04C].vx_untimeout+000078 ()
[04C29078].vx_timed_sleep+000090 ()
[04D2A72C].vx_open_modes+000574 ()
[04D2F80C].vx_open1+0001FC ()
[04D2FE80].vx_open+00007C ()
[04C39514].vx_open_skey+000044 ()
[0057D75C]vnop_open+0004BC (??, ??, ??, ??, ??)
[00615D24]openpnp+0005E4 (??, ??, ??, ??, ??, ??)
[006162E0]openpath+000100 (??, ??, ??, ??, ??, ??, ??)
[006167B4]copen+000294 (??, ??, ??, ??, ??)
[006156BC]kopen+00001C (??, ??, ??)

DESCRIPTION:
When a file is opened with the O_DELAY flag and some process already has that file open in a conflicting mode with the O_NSHARE flag, the first process waits in a loop until the file is available, instead of exiting. This is achieved by associating a timer with the waiting process: the process wakes up after the timer expires and checks whether it can open the file; if not, it starts a fresh timer and sleeps again. AIX provides the tstart/tstop kernel services to implement this timer. Associated with every timer is a callback function which is executed when the timer expires, after which the process sleeping on the timer is awakened. The callback routine adds the expired timer to a global freelist, and all the expired timers on this freelist are freed at the beginning of the next timeout operation, which happens with the next open retry or with each fresh open with O_DELAY. Because the callback function is executed just before the sleeping process is awakened, a timer can be on the freelist while the process waiting on it is still sleeping; the timer can therefore be freed while the process is still sleeping. If the process is aborted during this window, an attempt is made to stop a timer which may already have been freed, and this race results in the panic.

RESOLUTION:
Avoid freeing an expired timer on the freelist if the process associated with it is still sleeping. Once the process wakes up, it sets a flag on the timer structure to signal that the timer is safe to free.
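The freelist race and its fix can be sketched abstractly as follows. This is an illustrative model of the logic described above, not the actual AIX kernel services or VxFS structures; all names here are assumptions.

```python
class RetryTimer:
    """Models the expired-timer freelist race described above.

    The expiry callback can queue a timer on the freelist just before
    the sleeping process wakes, so the timer must not be reclaimed
    until the waiter marks it safe (the fix)."""
    def __init__(self):
        self.expired = False
        self.safe_to_free = False   # set by the waiter after it wakes up

    def on_expire(self, freelist):
        """Expiry callback: queue the timer, but do not reclaim it yet."""
        self.expired = True
        freelist.append(self)

    def on_wakeup(self):
        """The fix: the waiter signals that reclamation is now safe."""
        self.safe_to_free = True

def reclaim(freelist):
    """Free only timers whose waiter has already woken; leave the rest
    queued. Returns the list of timers actually freed."""
    freed = [t for t in freelist if t.safe_to_free]
    freelist[:] = [t for t in freelist if not t.safe_to_free]
    return freed
```

A timer whose waiter is still asleep survives a reclaim pass, so aborting that waiter can no longer touch freed memory.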
* 2494464 (Tracking ID: 2247387)

SYMPTOM:
The internal local mount noise.fullfsck.N4 test hit the assert vx_ino_update:2, with a stack trace like the following:

panic: f:vx_ino_update:2
Stack Trace:
IP                  Function Name
0xe0000000023d5780  ted_call_demon+0xc0
0xe0000000023d6030  ted_assert+0x130
0xe000000000d66f80  vx_ino_update+0x230
0xe000000000d727e0  vx_iupdat_local+0x13b0
0xe000000000d638b0  vx_iupdat+0x230
0xe000000000f20880  vx_tflush_inode+0x210
0xe000000000f1fc80  __vx_fsq_flush___vx_tran.c__4096000_0686__+0xed0
0xe000000000f15160  vx_tranflush+0xe0
0xe000000000d2e600  vx_tranflush_threaded+0xc0
0xe000000000d16000  vx_workitem_process+0x240
0xe000000000d15ca0  vx_worklist_thread+0x7f0
0xe000000001471270  kthread_daemon_startup+0x90
End of Stack Trace

DESCRIPTION:
The INOILPUSH flag was not set when the inode was being updated, which caused the assert. Creation and deletion of a clone resets the INOILPUSH flag, and the function vx_write1_fast() does not set the flag after updating the inode and file.

RESOLUTION:
The code is modified so that if the INOILPUSH flag is not set in vx_write1_fast(), the function sets it.

* 2508164 (Tracking ID: 2481984)

SYMPTOM:
Access to the file system hangs.

DESCRIPTION:
The function vx_setqrec() calls vx_dqget(). When vx_dqget() returns an error, vx_setqrec() tries to unlock the DQ structure using VX_DQ_CLUSTER_UNLOCK, but in this situation the DQ structure does not hold the lock; hence the hang.

RESOLUTION:
dq_inval is now set in vx_dqget() if any error occurs there, and unlocking of the DQ structure is skipped in the error code path of vx_setqrec() if dq_inval is set.

* 2508171 (Tracking ID: 2246127)

SYMPTOM:
The mount command may take more time when the IAU file is large.

DESCRIPTION:
At mount time the IAU file is read one block at a time; each block is processed before the next is read. When there are a huge number of files in the file system, the IAU file becomes large.
Reading such a large IAU file one block at a time slows the mount command.

RESOLUTION:
The code is changed to read the IAU file using multiple threads in parallel; a complete extent is now read and then processed.

* 2521672 (Tracking ID: 2515380)

SYMPTOM:
The ff(1M) command loops until the program exceeds its memory limit and exits with the following errors:

# ff -F vxfs
UX:vxfs ff: ERROR: V-3-24347: program limit of 30701385 exceeded for directory data block list
UX:vxfs ff: ERROR: V-3-20177:

DESCRIPTION:
The ff(1M) command performs a directory lookup and lists all the files on a Veritas File System (VxFS). All the directory blocks are traversed to save the block addresses for a directory in a function. The function keeps track of the buffer into which the directory blocks are read and the extent up to which the directory blocks have been read. The function is called with an offset and returns the offset up to which the directory blocks have been read. The offset passed to this function has to be within the extent, but the logical offset, which can be greater than the extent size, was being passed instead. Hence the returned offset wrapped to 0, the caller assumed that nothing had been read, and a loop occurred.

RESOLUTION:
The code is modified to remove the call to the function that maintains the buffer offsets for the read data.

* 2523084 (Tracking ID: 2515101)

SYMPTOM:
The "svmon -O vxfs=on" option can be used to collect VxFS file system details. With this enabled, subsequently executing the "svmon -S" command can generate a system panic in the svm_getvxinode_gnode routine when trying to collect information from the VxFS segment control blocks:

16)> f pvthread+838900 STACK:
[F100000090704A38]perfvmmstat:svm_getvxinode_gnode+000038

DESCRIPTION:
VxFS creates and deletes AIX Virtual Memory Management [VMM] structures called Segment Control Blocks [SCBs] via VMM interfaces.
VxFS was leaking SCBs via one specific code path. The "svmon -S" command parses a global list of SCB structures, including any SCBs leaked by VxFS. If svmon is also collecting information about VxFS file systems, the gnode element of the SCB is dereferenced; for a leaked SCB the gnode is stale and now contains unrelated content, and reading and dereferencing this content can generate the panic.

RESOLUTION:
A very simple, low-risk change now prevents Segment Control Blocks from being leaked by VxFS; the SCBs are now correctly removed.

* 2529356 (Tracking ID: 2340953)

SYMPTOM:
During an internal stress test, the f:vx_iget:1a assert was seen.

DESCRIPTION:
While renaming a file, VxFS checks whether the target directory is in the path of the source file being renamed. While using the function vx_iget() to walk up to the root inode, the inode number of one of the parent directories was 0; hence the assert.

RESOLUTION:
The code is changed so that during renames the parent directory is first assigned the correct inode number before vx_iget() is used to retrieve the root inode.

* 2551576 (Tracking ID: 2526174)

SYMPTOM:
The vxfs_fcl_seektime() function seeks to the first record in the FCL file that has a timestamp greater than or equal to the specified time. The vxfs_fcl_seektime() API can incorrectly return an EINVAL (no records in FCL file newer than specified time) error even though records after the specified time are present in the FCL log.

DESCRIPTION:
The EINVAL error was returned when partial records existed in the FCL log, because the search for the FCL record was not continued up to the correct "last offset", i.e. the offset at which the search should stop. The last offset is determined during the binary search, when a record newer than the specified time is found. While doing the binary search, partial FCL records are skipped; when a record newer than the specified time was found, the length of the skipped partial records was not included in the last-offset calculation.
The last offset was therefore incorrect, and the FCL record search terminated early, before the record at the specified time was found.

RESOLUTION:
If partial records are found during the binary search, their length is now added to the last offset, giving the correct last offset.

* 2561355 (Tracking ID: 2561334)

SYMPTOM:
The system log file may contain the following error message in a multi-threaded environment with Dynamic Storage Tiering (DST):

UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index creation failure for /vx/fsvm with message Argument list too long

DESCRIPTION:
In DST, while enforcing a policy, SQL queries are generated and written to the file .__fsppadm_enforcesql present in lost+found. In a multi-threaded environment, 16 threads work in parallel on the ILIST, generate SQL queries, and write them to the file. This may lead to corruption of the file if multiple threads write to it simultaneously.

RESOLUTION:
flockfile() is used to take a lock on the .__fsppadm_enforcesql file descriptor before writing to it, instead of adding new locking code.

* 2561791 (Tracking ID: 2527765)

SYMPTOM:
On AIX, a VxFS directory cannot have more than 32K subdirectories: creating a new directory in a directory that already has 32K subdirectories fails. There is therefore a need to support more than 32K subdirectories for VxFS on AIX.

DESCRIPTION:
On AIX there is an operating system limitation which restricts the number of links of a directory/file to 32K (MAXLINK). Therefore more than 32K subdirectories cannot be created.

RESOLUTION:
VxFS uses an internal 32-bit link count, so the VxFS internal structures are capable of handling more than 32K subdirectories. None of the VxFS commands use the MAXLINK count, so there are no other dependencies. A new tunable, 'maxlink_enable', has been added to allow more than 32K subdirectories, removing the 32K limitation.
Known limitations with this fix:

a] AIX base command limitations: Since the value of MAXLINK can only go up to 32K, certain AIX commands that depend on this value fail to produce correct results. IBM has completed the analysis of such commands, and the following are expected to produce incorrect results when run with more than 32K subdirectories: the /usr/bin/istat, /usr/bin/li, /usr/bin/ls and /usr/bin/find commands report an incorrect nlink count when there are more than 32K subdirectories.

b] VxFS limitations:
1) In the case of CFS, ALL nodes of the cluster must run with the same maxlink_enable value.
2) Once this tunable is set, it cannot be disabled. By default it is disabled.

* 2564431 (Tracking ID: 2515459)

SYMPTOM:
A local mount hangs in vx_bc_binval_cookie with a stack like the following:

delay
vx_bc_binval_cookie
vx_blkinval_cookie
vx_freeze_flush_cookie
vx_freeze_all
vx_freeze
vx_set_tunefs1
vx_set_tunefs
vx_aioctl_full
vx_aioctl_common
vx_aioctl
vx_ioctl
genunix:ioctl
unix:syscall_trap32

DESCRIPTION:
The hanging mount process is waiting for a buffer to be unlocked, but that buffer can only be released once its associated cloned-map writes are flushed, and a necessary flush was missed.

RESOLUTION:
Code is added to synchronize the cloned-map writes so that all the cloned maps are cleared and the buffers associated with them are released.

* 2567091 (Tracking ID: 2527578)

SYMPTOM:
The system crashes due to a NULL pointer dereference with the following stack:

simple_lock+000014 ()
vx_bhash_rele@AF161_63+00001C ()
vx_inode_deinit+0000D4 ()
vx_idrop+0002A4 ()
vx_detach_fset+000CC8 ()
vx_unmount+0001AC ()
vx_unmount_skey+000034 ()
vfs_unmount+000098 ()
kunmount+0000DC ()
uvmount+000208 ()
ovlya_addr_sc_flih_main+000130 ()

DESCRIPTION:
The crash is the result of accessing an address for which memory has not been allocated. The address corresponds to a spinlock, so the crash occurs while locking that spinlock.
RESOLUTION:
Allocate and initialize the spinlock before locking it.

* 2574396 (Tracking ID: 2433934)

SYMPTOM:
Performance degradation is observed when CFS is used as a back-end NFS data server, compared with standalone VxFS.

DESCRIPTION:
In CFS, if one thread holds the read-write lock on an inode in exclusive mode, other threads are stuck on the same inode even if they only want to access it in shared mode, resulting in performance degradation.

RESOLUTION:
The code is changed to avoid taking the read-write lock on an inode in exclusive mode where it is not required.

* 2581351 (Tracking ID: 2588593)

SYMPTOM:
df(1M) shows a wrong usage value for the volume after a large file is deleted.

DESCRIPTION:
All freed extent sizes are maintained in an in-core global variable and in transaction-subroutine-specific data structures. After the deletion of a large file, this in-core global variable was not updated. df(1M), while reporting usage data, reads the freed-space information from this global variable, which contained stale information.

RESOLUTION:
The code is modified to account the freed extent data in the global variable used by df(1M), so that df(1M) reports the correct usage for the volume.

* 2586283 (Tracking ID: 2603511)

SYMPTOM:
On systems with Oracle 11.2.0.3 or higher installed, database operations can fail and the following message is displayed in the system logs:
"ODM ERROR V-41-4-1-105-22 Invalid argument"

DESCRIPTION:
Due to a change in an Oracle API in 11.2.0.3, Oracle Disk Manager (ODM) is unable to create a file because of an unrecognized f-mode.

RESOLUTION:
The code is modified to mask and take into consideration only the file modes that are known to ODM, instead of failing the creation of the file.

* 2587025 (Tracking ID: 2528819)

SYMPTOM:
AIX can fail to create new worker threads for VxFS.
The following message is seen in the system log:
"WARNING: msgcnt 175 mesg 097: V-2-97: vxfs failed to create new thread"

DESCRIPTION:
AIX fails the thread creation because it cannot find a free slot in that kproc, and returns ENOMEM.

RESOLUTION:
Limit the maximum number of VxFS worker threads.

* 2587139 (Tracking ID: 2511432)

SYMPTOM:
Poor VxFS performance for applications writing to an mmapped file which was written to before being mmapped, and therefore already has all the associated pages in memory. Large page-ins are observed as soon as the writes begin, and large page-outs are observed periodically afterwards.

DESCRIPTION:
Buffered writes to a file bring the associated pages into memory, and such pages can be written to. However, mmapping the file then marks those pages read-only, so any mmapped write to the same page encounters a protection fault, leading to a page-in. The VxFS background flushing mechanism periodically flushes the entire file, which leads to large page-outs. During these page-outs, writes happening elsewhere in the file slow down owing to the unavailability of various inode locks held by the flushing thread.

RESOLUTION:
To prevent the large page-ins, the extra overhead of the protection fault is removed by marking pages read-write in the mmap range wherever possible. A new tunable, "vx_ctrl_flush", is introduced; it can be tuned to control the amount of flushing done by the background flushing thread, and therefore the page-outs.

* 2602982 (Tracking ID: 2599590)

SYMPTOM:
Expansion of a 100% full file system may panic the machine with the following stack trace:

bad_kern_reference()
$cold_vfault()
vm_hndlr()
bubbledown()
vx_logflush()
vx_log_sync1()
vx_log_sync()
vx_worklist_thread()
kthread_daemon_startup()

DESCRIPTION:
When a 100% full file system is expanded, the intent log of the file system is truncated and the freed blocks are used for the expansion.
Due to a bug, the block map of the replica intent log inode was not updated, causing the block maps of the two inodes to differ. This caused some of the in-core structures of the intent log to become NULL, and the machine panics while dereferencing one of these structures.

RESOLUTION:
The block map of the replica intent log inode is now updated correctly. A 100% full file system can now be expanded only if the last extent in the intent log contains more than 32 blocks; otherwise fsadm fails. To expand such a file system, some files should be deleted manually and the resize retried.

* 2603015 (Tracking ID: 2565400)

SYMPTOM:
Sequential buffered I/O reads are slow.

DESCRIPTION:
Read-aheads are not happening because the file system's read-ahead size is calculated incorrectly.

RESOLUTION:
Fixed the incorrect typecast.

* 2603121 (Tracking ID: 2577079)

SYMPTOM:
High pinned-memory usage by VxFS is suspected.

DESCRIPTION:
High pinned-memory usage by VxFS was suspected, but from the available dump the VxFS usage does not appear to be high. Separate counters are therefore added to help determine the pinned-memory usage of VxFS.

RESOLUTION:
vx_info counters are added to track the total, inode cache, buffer cache and DNLC pinned-memory usage.

* 2624650 (Tracking ID: 2624262)

SYMPTOM:
A panic is hit in the vx_bc_do_brelse() function while executing dedup functionality, with the following backtrace:

vx_bc_do_brelse()
vx_mixread_compare()
vx_dedup_extents()
enqueue_entity()
__alloc_pages_slowpath()
__get_free_pages()
vx_getpages()
vx_do_dedup()
vx_aioctl_dedup()
vx_aioctl_common()
vx_rwunlock()
vx_aioctl()
vx_ioctl()
vfs_ioctl()
do_vfs_ioctl()
sys_ioctl()

DESCRIPTION:
While executing the function vx_mixread_compare() in the dedup code path, an error was hit, due to which an allocated data structure remained uninitialized. The panic occurs when this uninitialized data structure is written to in vx_mixread_compare().
RESOLUTION:
The code is changed to free the memory allocated to the data structure when exiting on the error.

* 2626960 (Tracking ID: 2626390)

SYMPTOM:
Freeing a large number of pages at once can induce a small I/O latency.

DESCRIPTION:
Reading an entire file into memory and then removing the file frees all of its pages. Pages are freed in chunks; previously the smallest configurable chunk size for freeing pages was 64MB.

RESOLUTION:
The tuning of the chunk invalidation size is enhanced so that smaller chunks can be configured. The example below sets the chunk size for freeing pages to 8MB; the table shows the tuning values now available.

# vxtunefs -D chunk_inval_size=6

chunk_inval_size: 64  256  128  64  32  16   8   4   (unit is MB)
Tuning value    :  0    1    2   3   4   5   6   7

* 2630984 (Tracking ID: 2622899)

SYMPTOM:
The number of PDTs is not set correctly by default:

# /opt/VRTS/bin/vxtunefs -D print
Filesystem parameters:
drefund enable = 0
drefund supported = 0
number of pdts = 1
default number of pdts = 16

The vxtunefs output above shows that by default the number of PDTs should be 16, but it is incorrectly set to 1.

DESCRIPTION:
VxFS configuration tunables are read from the file /etc/vx/vxfssystem. If this file is not present, VxFS decides the tunable values based on memory and the number of CPUs. Values read from the configuration file are stored in a structure and sent to the kernel. Because of a wrong initialization of this structure, VxFS believes the values have come from the configuration file and assigns wrong values to the tunables even when the configuration file does not exist.

RESOLUTION:
The tunable structure is initialized properly, so that the tunables are assigned values based on memory and the number of CPUs.

* 2631026 (Tracking ID: 2332314)

SYMPTOM:
An internal noise.fullfsck test with ODM enabled hit the assert fdd_odm_aiodone:3.

DESCRIPTION:
In the case of a failed I/O in the fdd_write_clone_end() function, the error was not set on the buffer, which caused the assert.
RESOLUTION:
The code is changed to set the error on the buffer in the case of I/O failures in the fdd_write_clone_end() function.

* 2631221 (Tracking ID: 2349744)

SYMPTOM:
A system can panic when performing buffered I/O using large I/O sizes. The following is an example stack trace showing where the system panics:

abend_trap+000000 ()
xm_bad_free+000264 ()
xmfree+0004AC ()
vx_memfree_cpu+00006C ()
vx_alloc+000124 ()
vx_worklist_enqueue+000034 ()
vx_sched_thread+00024C ()
vx_sched_thread0+000038 ()
vx_thread_base+000048 ()
threadentry+000054 ()

DESCRIPTION:
Prior to the VxFS 5.1 release, the maximum buffered I/O size was restricted to 128KB on the AIX platform. To assist buffered I/O performance, an enhancement was introduced in release 5.1 to perform larger buffered I/Os: VxFS can be asked to perform buffered I/O of sizes greater than 128KB by tuning the VxFS tunables "read_pref_io" and "write_pref_io" to values greater than 128KB. When performing buffered I/O using I/O sizes of multiple megabytes, the system can panic in the routine xmfree() while freeing memory to a heap different from the one it was allocated from.

RESOLUTION:
The memory is now freed to the correct heap.

* 2631315 (Tracking ID: 2631276)

SYMPTOM:
A lookup fails for a file which is in a partitioned directory and is being accessed using its VxFS namespace extension name.

DESCRIPTION:
If a file in a partitioned directory is accessed using its VxFS namespace extension name, the name is searched for in one of the hidden leaf directories, which mostly does not contain an entry for the file; the lookup therefore fails.

RESOLUTION:
The code has been modified to call the partitioned-directory lookup routine at the upper level, so that the lookup does not fail even if the file is accessed using its extended namespace name.

* 2635583 (Tracking ID: 2271797)

SYMPTOM:
Internal noise testing with a locally mounted VxFS file system hit the assert "f:vx_getblk:1a".

DESCRIPTION:
The assert is hit because an overlay inode is marked with the flag indicating a bad on-disk copy of the inode.

RESOLUTION:
The code is changed to set the flag indicating a bad on-disk copy of the inode only if the inode is not an overlay inode.

* 2642027 (Tracking ID: 2350956)

SYMPTOM:
An internal noise test on a locally mounted file system exited with the error message "bin/testit : Failed to full fsck cleanly, exiting", and the logs contain the userspace assert "bmaptops.c 369: ASSERT(devid == 0 || (start == VX_HOLE && devid == VX_DEVID_HOLE)) failed".

DESCRIPTION:
The function bmap_data4_set() is called while entering bmap allocation information for typed extents of type VX_TYPE_DATA_4 or VX_TYPE_IADDR_4.
The assert expects that either devid is zero, or, if the extent start is a hole, that devid is VX_DEVID_HOLE. However, there are never extent descriptors that represent holes in typed extents, so the assertion is incorrect.

RESOLUTION:
The assert is corrected to check that the extent start is not a hole, and that either devid is zero or the extent start is VX_OVERLAY with devid being VX_DEVID_HOLE.

* 2646557 (Tracking ID: 2529201)

SYMPTOM:
The fscdsconv limits specified in the cdslimitstab file were wrong, causing some of the CDS-related commands to fail.

DESCRIPTION:
The cdslimitstab file is used by the CDS/PDC commands to gather information about the migration targets. The vx_maxlink limits for 32k and 64k are wrong. The limits mentioned in cdslimitstab are:
OS_MAX_NLINKS=32767
VXFS_MAX_NLINKS=65565
2^16 - 1 evaluates to 65535. If 32768 subdirectories are created in a single directory, fscdsconv throws an error for nlink.

RESOLUTION:
The limit for VXFS_MAX_NLINKS is set correctly to 65535.

* 2654644 (Tracking ID: 2630954)

SYMPTOM:
During internal CFS stress reconfiguration testing, the fsck(1M) command exits by hitting an assert.

DESCRIPTION:
When the dexh_getblk() function is executed, if the extent size is greater than 256 Kbytes, the extent is divided into chunks of 256 Kbytes each. When the extents of the hash inodes are read in the dexh_getblk() function, a maximum of 256 Kbytes (MAXBUFSZ) is read at a time. The chunk size was assigned as 256 Kbytes every time, which is a bug when the last chunk in the extent is less than 256 Kbytes: the length of the buffer is assigned incorrectly and an aliased buffer appears in the buffer cache. Instead, for the last chunk, the remaining size in the extent to be read should be assigned as the chunk size.

RESOLUTION:
The code is modified so that the buffer size is calculated correctly.
* 2669195 (Tracking ID: 2326037)

SYMPTOM:
An internal stress test on a cluster file system with clones failed while writing to a file, with the error ENOENT.

DESCRIPTION:
VxFS was writing to a clone that was in the process of being removed. Because the clone removal process works asynchronously, the process starts to push changes from the inode of the primary fset to the inode of the clone fset. By the time the actual write happens, the inode of the clone fset has been removed, so the error ENOENT is returned.

RESOLUTION:
Code is added to re-validate the inode being written.

* 2715030 (Tracking ID: 2715028)

SYMPTOM:
The fsadm(1M) command with the '-d' option may hang when compacting a directory if it is run on a Cluster File System (CFS) secondary node while the find(1) command is running on any other node.

DESCRIPTION:
During the compacting of a directory, the CFS secondary node has ownership of the inode of the directory. To complete the compacting, the truncation message needs to be processed on the CFS primary node, which in turn needs ownership of the inode of the directory. This causes a deadlock.

RESOLUTION:
The code is modified to force the processing of the truncation message on the CFS secondary node which initiated the compacting of the directory.

* 2725995 (Tracking ID: 2566875)

SYMPTOM:
A write(2) operation exceeding the quota limit fails with an EDQUOT error ("Disc quota exceeded") before the user quota limit is reached.

DESCRIPTION:
When a write request exceeds a quota limit, the EDQUOT error should be handled so that Veritas File System (VxFS) can allocate space up to the hard quota limit and proceed with a partial write. However, VxFS does not handle this error, and the error is returned without performing a partial write.

RESOLUTION:
The code is modified to handle the EDQUOT error from the extent allocation routine.

* 2726002 (Tracking ID: 2650330)

SYMPTOM:
Accessing a file in O_NSHARE mode from multiple processes concurrently on AIX can cause a file system hang.
DESCRIPTION:
There are two different hang scenarios. First, a deadlock can occur between open threads and a freeze operation. For example, a freeze T1 issued by commands such as umount or fsadm stops new threads from getting the active level, and meanwhile waits for an old thread T2, which holds the active level but expects the ilock. However, thread T3, which holds the ilock, cannot get the active level because of the freeze thread T1.
T1:
vx_async_waitmsg+00001C
vx_msg_broadcast+000118
vx_cwfa_step+0000A0
vx_cwfreeze_common+0000F8
vx_cwfreeze_all+0002E8
vx_freeze+000038
vx_detach_fset+000394
vx_unmount+0001AC
vx_unmount_skey+000034
T2:
simple_lock+000058
vx_ilock+000020
vx_close1+000720
vx_close+00006C
vx_close_skey+00003C
vnop_close+000094
vno_close+000050
closef+00005C
T3:
vx_delay+000010
vx_active_common_flush+000038
vx_open_modes+00058C
vx_open1+0001FC
vx_open+00007C
vx_open_skey+000044

RESOLUTION:
The code is modified to give up the ilock before attempting to get the active level, and the wakeup function now treats the ilock as a simple lock instead of a complex lock.

* 2726010 (Tracking ID: 2651922)

SYMPTOM:
On a local VxFS file system, the ls(1M) command with the '-l' option runs slowly and high CPU usage is observed.

DESCRIPTION:
Currently, Cluster File System (CFS) inodes are not allowed to be reused as local inodes, to avoid a Global Lock Manager (GLM) deadlock issue when a Veritas File System (VxFS) reconfiguration is in process. Hence, if a VxFS local inode is needed, all the inode free lists need to be traversed to find a local inode if the free lists are almost filled up with CFS inodes.

RESOLUTION:
The code is modified to add a global variable, 'vxi_icache_cfsinodes', to count the CFS inodes in the inode cache. The condition for converting a cluster inode to a local inode is relaxed when the number of in-core CFS inodes is greater than the 'vx_clreuse_threshold' threshold and reconfiguration is not in progress.
* 2726018 (Tracking ID: 2670022)

SYMPTOM:
Duplicate file names can be seen in a directory.

DESCRIPTION:
Veritas File System (VxFS) maintains an internal Directory Name Lookup Cache (DNLC) to improve the performance of directory lookups. A race condition occurs in the DNLC list-manipulation code during the lookup/creation of file names that have more than 32 characters (which further affects other file creations). This causes the DNLC to have a stale entry for an existing file in the directory. A lookup of such a file through the DNLC does not find the file, allowing another duplicate file with the same name in the directory.

RESOLUTION:
The code is modified to fix the race condition by protecting the DNLC lists with proper locks.

* 2726025 (Tracking ID: 2674639)

SYMPTOM:
The cp(1) command with the '-p' option may fail on a file system whose File Change Log (FCL) feature is enabled. The following error messages are displayed:
"cp: setting permissions for 'file_name': Input/output error"
"cp: preserving permissions for 'file_name': No data available"
The fsetxattr() system call during the failed cp(1) command returns the error value 61493.

DESCRIPTION:
During the execution of the cp(1) command with the '-p' option, the attributes of the source file are copied to the newly created file. If FCL is enabled on the file system, the newly created attributes are logged in the FCL file. While logging this new entry, if the FCL file does not have free space, its size should be extended by up to the value of the Veritas File System (VxFS) 'fcl_maxalloc' tunable and the entry then logged. Instead, when the FCL file does not have free space, an error is returned without extending the FCL file and retrying. Hence, the cp(1) command with the '-p' option fails with an error message.

RESOLUTION:
The code is modified to allocate the required space and extend the FCL file.
* 2726027 (Tracking ID: 2678375)

SYMPTOM:
Setting vmmbufs_resv_disable=1 gives an error message when drefund_enable is set.

DESCRIPTION:
Tuning vmmbufs_resv_disable to 1 was not allowed when drefund_supported=1. This check is invalid, because vmmbufs_resv_disable=1 can be set independently of the value of 'drefund_supported'; the setting should only be disallowed when drefund is actually enabled. While setting 'vmmbufs_resv_disable' to 1, the value of 'drefund_enable' should be checked instead of 'drefund_supported', and the setting should be allowed when 'drefund_enable' is not set to 1.

RESOLUTION:
The value of 'drefund_enable' is now checked while tuning 'vmmbufs_resv_disable'. Also, 'vmmbufs_resv_disable' is now honoured on the AIX 6.1 and 7.1 OS versions.

* 2726028 (Tracking ID: 2344085)

SYMPTOM:
The system panics with the following stack trace:
vx_aiofpcio_common_iodone()
vx_aiofpcio_iodone()
internal_iodone_offl()
iodone_offl()
i_softmod()
flih_util()
This can only happen on machines with AIX POWER7 processors.

DESCRIPTION:
The panic occurs because VxFS tries to access a vnode field without a storage key during an iodone routine.

RESOLUTION:
The storage key is now used when accessing the vnode field in this case.

* 2726031 (Tracking ID: 2684573)

SYMPTOM:
The performance of the cfsumount(1M) command for the VRTScavf package is slow when some checkpoints are deleted.

DESCRIPTION:
When a checkpoint is removed asynchronously, a kernel thread is started to handle the job in the background. If an unmount command is issued before these checkpoint removal jobs are completed, the command waits for the completion of these jobs. A forced unmount can interrupt the process of checkpoint deletion, with the remaining work left to the next mount.

RESOLUTION:
The code is modified to add a counter to the vxfsstat(1M) command to determine the number of checkpoint removal threads in the kernel.
The '-c' option is added to the cfsumount(1M) command to force unmount a mounted file system even if checkpoint jobs are running.

* 2726056 (Tracking ID: 2709869)

SYMPTOM:
The system panics with a redzone violation while releasing an inode's File Input/Output (FIO) statistics structure, and the following stack trace is displayed:
kmem_error()
kmem_cache_free_debug()
kmem_cache_free()
vx_fiostats_alloc()
fdd_common1()
fdd_odm_open()
odm_vx_open()
odm_ident_init()
odm_identify()
odmioctl()
fop_ioctl()
ioctl()

DESCRIPTION:
Different types of statistics are maintained when a file is accessed in Quick Input/Output (QIO) and non-QIO modes. Some common statistics are copied when the file access mode is changed from QIO to non-QIO or vice versa. While switching from QIO mode to non-QIO mode, the QIO statistics structure is freed and an FIO statistics structure is allocated to maintain FIO file-level statistics. There is a race between the thread freeing the QIO statistics (which also allocates the FIO statistics) and a thread updating the QIO statistics when the file is opened in QIO mode. The FIO statistics thus get corrupted, because the other thread writes to the structure assuming that the QIO statistics are still allocated.

RESOLUTION:
The code is modified to protect the allocation/release of the FIO/QIO statistics using the read-write lock/spin lock for the file statistics structure.

* 2736421 (Tracking ID: 2726255)

SYMPTOM:
The values of the VxFS tunables vmmbufs_resv_disable, chunk_inval_size and sync_time are not persistent after a reboot.

DESCRIPTION:
The VxFS tunables vmmbufs_resv_disable, chunk_inval_size and sync_time should be persistent across reboots. This is achieved by auto-tuning during reboot, reading the old values from a table. This was not being done for the vmmbufs_resv_disable and chunk_inval_size tunables, causing their values to be non-persistent.
In the case of the sync_time tunable, the value read from the table was overwritten by the default value, which made the sync_time value non-persistent after a reboot.

RESOLUTION:
The auto-tuning of these tunables during reboot is now handled correctly.

* 2752607 (Tracking ID: 2745357)

SYMPTOM:
Performance enhancements are made for the read/write operations on Veritas File System (VxFS) structural files.

DESCRIPTION:
The read/write performance of VxFS structural files is affected when the piggyback data in the vx_populate_bpdata() function is ignored. This occurs if the buffer type is not specified properly, consequently requiring another disk I/O to get the same data.

RESOLUTION:
The code is modified so that the piggyback data is not ignored if it is of type VX_IOT_ATTR in the vx_populate_bpdata() function, leading to an improvement in the read/write performance of VxFS structural files.

* 2765308 (Tracking ID: 2753944)

SYMPTOM:
File creation threads can hang. The following stack trace is displayed:
cv_wait+0x38
vx_rwsleep_rec_lock+0xa4
vx_recsmp_rangelock+0x14
vx_irwlock2
vx_irwlock+0x34
vx_fsetialloc+0x98
vx_noinode+0xe4
vx_dircreate_tran+0x7d4
vx_pd_create+0xbb8
vx_create1_pd+0x818
vx_do_create+0x80
vx_create1+0x2f8
vx_create+0x158
fop_create+0x34
lo_create+0x138
fop_create+0x34
vn_createat+0x590
vn_openat+0x138
copen+0x260

DESCRIPTION:
Veritas File System (VxFS) uses Inode Allocation Units (IAUs) to keep track of allocated and free inodes. Two counters are maintained per IAU: one for the number of free regular inodes in that IAU, and one for the number of free directory inodes. A global in-memory counter is also maintained to keep track of the total number of free inodes in all the IAUs in the file system. The creation threads refer to this global counter to quickly check the number of free inodes at any given time. Every time an inode is allocated, this global count is decremented. Similarly, it is incremented when an inode is freed.
The hang is caused when the global counter unexpectedly becomes negative, which confuses the file creation threads. This global counter is calculated by adding the per-IAU counters at mount time. As the code is multi-threaded, any modification of the global counter must be guarded by a summary lock, which was missing in the multi-threaded code. Therefore, the calculation goes wrong and the global counter and the per-IAU counters fall out of sync. This results in a negative value and causes the file creation threads to hang.

RESOLUTION:
The code is modified to update the global free inode count under the protection of the summary lock.

* 2768540 (Tracking ID: 2650354)

SYMPTOM:
Allow 8MB and 4MB values for the chunk_flush_size tunable on AIX.

DESCRIPTION:
Two more values, for 8MB and 4MB chunks, are allowed for the tunable chunk_flush_size. Previously the allowable values were 0 to 5. Values of 1, 2, 3, 4 or 5 set the chunk size to 256MB, 128MB, 64MB, 32MB or 16MB respectively. A value of 0 means chunk flushing is disabled. This patch adds two new values, 6 and 7, which set the chunk flushing size to 8MB and 4MB respectively.

RESOLUTION:
Added the two values 6 and 7, which set the chunk flushing size to 8MB and 4MB respectively. The chunk_flush_size tunable can now take values 0 to 7.

* 2788241 (Tracking ID: 2733968)

SYMPTOM:
Split-second i/o pauses can occur on a system when a thread that is processing an i/o is queued to the same CPU that is currently performing a page flushing or page freeing operation on a file.

DESCRIPTION:
This hotfix is called VxFS 5.1SP1RP2P1-HF1a. This is a special hotfix for one target customer. This hotfix, HF1a, is based upon HF1 but has additional changes over HF1. This VxFS hotfix has a pre-requisite on AIX 7.1 of ifix IV06140m03.120301.epkg.Z; this ifix is only available from IBM. The AIX ifix provides two new Virtual Memory Management [VMM] tunables; by default their values are set to zero (a value of 0 means that VMM chunking is disabled).
Their values are in units of 4Kb pages. After installing this VxFS hotfix the VMM tunable values will need to be set; we recommend they are set to a default value of 4Mb. The selected values will persist across subsequent system reboots using the following commands:
# vmo -p -o thrpgio_npages=1024
# vmo -p -o thrpgio_inval=1024
The vmo tunable "thrpgio_npages" determines the VMM chunk size for flushing pages; the vmo tunable "thrpgio_inval" determines the VMM chunk size for freeing pages.
In VxFS 5.1SP1RP2P1-HF1a we have introduced new tunables and changed the default values of tunables as follows:
1. The IBM ifix is applicable to AIX 7.1 only, therefore new information will be displayed via 'vxtunefs -D print'. dchunk_supported is not a tunable; it is purely for user information, displayed when executing 'vxtunefs -D print'.
dchunk_supported=0 will be seen when running AIX 6.1
dchunk_supported=1 will be seen when running AIX 7.1
2. We have introduced a new VxFS tunable called dchunk_enable, changed via "vxtunefs -D".
dchunk_enable=0 - enables VxFS chunking (this will be the default for AIX 6.1)
dchunk_enable=1 - enables VMM chunking (this will be the default for AIX 7.1)
dchunk_enable can only be changed on systems where dchunk_supported=1.
3. For systems with >16GB of physical memory we will now disable D_REFUND by default.
4. If D_REFUND is not enabled we will default to a 1000% increase in vmmbufs (this is also the maximum). This ensures that we will have enough vmmbufs available.
5. By default we will now disable the VxFS internal vmmbufs reservation and unreservation (vmmbufs_resv_disable=1). This saves some extra CPU cycles between each chunk, which helps reduce CPU usage regardless of whether VMM chunking or VxFS chunking is in use.
6. The maximum number of VxFS PDTs is 64; however, the maximum default value will be 32 PDTs. Using manual tuning the number of PDTs can still be tuned to the maximum value of 64 if wished.
On AIX 6.1, running VxFS 5.1SP1RP2P1-HF1a, we will continue to use the VxFS chunking mechanism. On AIX 7.1, running VxFS 5.1SP1RP2P1-HF1a, we will by default now utilize the VMM chunking mechanism.
A further four changes have been added for VxFS 5.1SP1RP2P1-HF1b:
- The re-auto-tuning of the pre-allocated vmmbufs count (only when drefund is disabled) was being miscalculated; this has been corrected in HF1b. The increase is now 1000% of the default, as intended.
- Quick truncation is skipped if the file has many associated pages; this prevents the "ls" command from waiting whilst files are being removed.
- The pagecache low memory threshold has been increased to 29/30ths of numclient; this allows the VMM lru [least recently used] page stealing processing to engage before VxFS.
- The pinned low memory threshold is now calculated correctly in the presence of the AIX "soft pinning" capability; this is specific to AIX 7.1.

RESOLUTION:
IBM has added an ability within Virtual Memory Management [VMM] for VxFS to flush and free pages in chunks, where the chunk size is also tunable using the AIX "vmo" command. We have therefore disabled by default on AIX 7.1 the existing capability within VxFS to flush and invalidate pages in chunks; we will instead utilise the new VMM chunking capability provided through the ifix available on AIX 7.1. The i/o pauses occur when a thread processing an i/o is scheduled on a CPU that is currently flushing or invalidating pages; the process of "chunking" therefore frees up the CPU for other threads to proceed unhindered. The VMM chunking implementation is a lower-level, finer-grained and more efficient "chunking" mechanism than the prior VxFS chunking implementation.
* 2798164 (Tracking ID: 2796364)

SYMPTOM:
Two nodes panicked while running the longevity test, with the following stack trace:
pvthread+023D00 STACK:
WARNING: bad IAR: 00000000, display stack from LR: 0057FC5C
[0057FC5C]vnop_open+0004BC (F1000A101166A5F0, 000000000400000A, 0000000000000000, F1000F1E50224F78, F1000A101166D47C [??])
[00618264]openpnp+0005E4 (??, ??, ??, ??, ??, ??)
[00618820]openpath+000100 (??, ??, ??, ??, ??, ??, ??)
[00618CF4]copen+000294 (??, ??, ??, ??, ??)
[00617BFC]kopen+00001C (??, ??, ??)

DESCRIPTION:
The value of vx_pdt_mask should be 'vx_num_pdts - 1'. When vx_num_pdts is decremented, the value of vx_pdt_mask is not updated properly. This allows a vnode to become associated with the wrong vx_pdttab structure, which causes the panic.

RESOLUTION:
Update the value of vx_pdt_mask whenever vx_num_pdts is updated.

Patch ID: 5.1.112.100

* 2413811 (Tracking ID: 1590963)

SYMPTOM:
The maximum number of subdirectories is limited to 32767.

DESCRIPTION:
Currently, there is a limit on the maximum number of subdirectories, fixed at 32767. This value can be increased to 64K on the Linux, HP-UX and Solaris platforms. There is also a need to provide the flexibility to change the maximum number of subdirectories.

RESOLUTION:
The code is modified to add the vx_maxlink tunable to control the maximum number of subdirectories.

* 2508171 (Tracking ID: 2246127)

SYMPTOM:
The mount command may take more time in the case of a large IAU file.

DESCRIPTION:
At mount time, the IAU file is read one block at a time. The block that is read is processed and then the next block is read. When there are a huge number of files in the file system, the IAU file for the file system becomes large, and reading such a large IAU file one block at a time slows the mount command.

RESOLUTION:
The code is changed to read the IAU file using multiple threads in parallel; also, a complete extent is now read and then processed.
* 2521672 (Tracking ID: 2515380)

SYMPTOM:
The ff command hangs, and later exits after the program exceeds a memory limit, with the following error:
# ff -F vxfs /dev/vx/dsk/bernddg/testvol
UX:vxfs ff: ERROR: V-3-24347: program limit of 30701385 exceeded for directory data block list
UX:vxfs ff: ERROR: V-3-20177: /dev/vx/dsk/bernddg/testvol

DESCRIPTION:
The 'ff' command lists all files on a device of a VxFS file system, performing directory lookups. One function saves the block addresses for a directory by traversing all the directory blocks. Another function keeps track of the buffer in which directory blocks are read and the extent up to which directory blocks have been read. This function is called with an offset and returns the offset up to which the directory blocks have been read. The offset passed to this function has to be the offset within the extent, but the logical offset, which can be greater than the extent size, was wrongly being passed. As a result the returned offset wraps to 0; the caller thinks that nothing has been read, hence the loop.

RESOLUTION:
The call to the function that maintains buffer offsets for reading data is removed. That call was incorrect and redundant; the function is already called correctly from one of the functions above.

* 2551576 (Tracking ID: 2526174)

SYMPTOM:
The vxfs_fcl_seektime() function seeks to the first record in the FCL file that has a timestamp greater than or equal to the specified time. The FCL vxfs_fcl_seektime() API can incorrectly return an EINVAL error (no records in the FCL file newer than the specified time) even though records after the specified time are present in the FCL log.

DESCRIPTION:
The error EINVAL was returned when partial records exist in the FCL log, because the search for the FCL record was not continued till the correct last offset. The 'last offset' is where the search should stop; it is determined during the binary search when a record newer than the specified time is found. During the binary search, FCL partial records are skipped.
When a record newer than the specified time is found, the length for which partial records were found was not considered in the last-offset calculation. Therefore the last-offset calculation was incorrect and the FCL record search terminated early, i.e. before the record at the specified time was found.

RESOLUTION:
If partial records are found during the binary search, the length of the partial records is now added to the last offset to get the correct last offset.

* 2561355 (Tracking ID: 2561334)

SYMPTOM:
The system log file may contain the following error message in a multi-threaded environment with Dynamic Storage Tiering (DST):
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index creation failure for /vx/fsvm with message Argument list too long

DESCRIPTION:
In DST, while enforcing a policy, SQL queries are generated and written to the file .__fsppadm_enforcesql present in lost+found. In a multi-threaded environment, 16 threads work in parallel on the ILIST, generate SQL queries and write them to the file. This may corrupt the file if multiple threads write to it simultaneously.

RESOLUTION:
The existing flockfile() interface is used, instead of adding new code, to take a lock on the .__fsppadm_enforcesql file descriptor before writing into it.

* 2561791 (Tracking ID: 2527765)

SYMPTOM:
On AIX, a VxFS directory cannot have more than 32K sub-directories. Creating a new directory in a directory that already has 32K sub-directories fails, so there is a need to support more than 32K sub-directories for VxFS on AIX.

DESCRIPTION:
On AIX there is an operating system limitation which restricts the number of links of a directory/file to 32K (MAXLINK). Therefore one cannot create more than 32K sub-directories.

RESOLUTION:
VxFS uses an internal 32-bit link count, so the VxFS internal structures are capable of handling more than 32K sub-directories. None of the VxFS commands use the MAXLINK count, so there are no other dependencies. A new tunable, i.e.
'maxlink_enable', has been added to allow more than 32K sub-directories, removing the 32K sub-directory limitation.
Known limitations with this fix:
a] AIX base command limitations: since the value of MAXLINK can only go up to 32K, certain AIX commands that depend on this value would fail to produce correct results. IBM has completed an analysis of such commands, and the following are expected to produce incorrect results when run with >32K directories: the /usr/bin/istat, /usr/bin/li, /usr/bin/ls and /usr/bin/find commands will report an incorrect nlink count when there are >32K subdirectories.
b] VxFS limitations:
1) In the case of CFS, ALL nodes of the cluster must run with the same maxlink_enable value.
2) Once this tunable is set, it cannot be disabled. By default it is disabled.

* 2564431 (Tracking ID: 2515459)

SYMPTOM:
A local mount hangs in vx_bc_binval_cookie, with a stack like the following:
delay
vx_bc_binval_cookie
vx_blkinval_cookie
vx_freeze_flush_cookie
vx_freeze_all
vx_freeze
vx_set_tunefs1
vx_set_tunefs
vx_aioctl_full
vx_aioctl_common
vx_aioctl
vx_ioctl
genunix:ioctl
unix:syscall_trap32

DESCRIPTION:
The hanging process for the local mount is waiting for a buffer to be unlocked, but that buffer can only be released once its associated cloned-map writes get flushed. A necessary flush is missed.

RESOLUTION:
Code is added to synchronize the cloned-map writes so that all the cloned maps will be cleared and the buffers associated with them released.

* 2567091 (Tracking ID: 2527578)

SYMPTOM:
The system crashes due to a NULL pointer dereference with the following stack:
simple_lock+000014 ()
vx_bhash_rele@AF161_63+00001C ()
vx_inode_deinit+0000D4 ()
vx_idrop+0002A4 ()
vx_detach_fset+000CC8 ()
vx_unmount+0001AC ()
vx_unmount_skey+000034 ()
vfs_unmount+000098 ()
kunmount+0000DC ()
uvmount+000208 ()
ovlya_addr_sc_flih_main+000130 ()

DESCRIPTION:
The crash happens as a result of accessing an address for which memory hasn't been allocated.
This address corresponds to a spinlock, hence the crash while locking the spinlock.

RESOLUTION:
Allocate and initialize the spinlock before locking it.

* 2574396 (Tracking ID: 2433934)

SYMPTOM:
Performance degradation is observed when CFS is used as a back-end NFS data server, compared to standalone VxFS.

DESCRIPTION:
In CFS, if one thread holds the read-write lock on an inode in exclusive mode, other threads are stuck for the same inode even if they only want to access the inode in shared mode, resulting in performance degradation.

RESOLUTION:
The code is changed to avoid taking the read-write lock on an inode in exclusive mode where it is not required.

* 2581351 (Tracking ID: 2588593)

SYMPTOM:
df(1M) shows a wrong usage value for a volume when a large file is deleted.

DESCRIPTION:
All freed extent sizes are maintained in an in-core global variable and in transaction-subroutine-specific data structures. After the deletion of a large file, this in-core global variable was not updated. df(1M), when reporting usage data, reads the freed-space information from this global variable, which therefore contains stale information.

RESOLUTION:
The code is modified to account the freed extent data into the global variable used by df(1M), so that the correct usage for the volume is reported by df(1M).

* 2586283 (Tracking ID: 2603511)

SYMPTOM:
On systems with Oracle 11.2.0.3 or higher installed, database operations can fail and the following message is displayed in the system logs:
"ODM ERROR V-41-4-1-105-22 Invalid argument"

DESCRIPTION:
Due to a change in an Oracle API in 11.2.0.3, Oracle Disk Manager (ODM) is unable to create a file because of an unrecognized f-mode.

RESOLUTION:
The code is modified to mask the f-mode and take into consideration only the modes which are known to ODM, instead of failing the creation of the file.

* 2587025 (Tracking ID: 2528819)

SYMPTOM:
AIX can fail to create new worker threads for VxFS.
The following message is seen in the system log:
"WARNING: msgcnt 175 mesg 097: V-2-97: vxfs failed to create new thread"

DESCRIPTION:
AIX fails the thread creation because it cannot find a free slot in the kproc, and returns ENOMEM.

RESOLUTION:
Limit the maximum number of VxFS worker threads.

* 2587139 (Tracking ID: 2511432)

SYMPTOM:
Poor VxFS performance is seen for an application doing writes on an mmapped file which has been written to before being mmapped, and which therefore already has all the associated pages brought into memory. Large page-ins are observed as soon as the writes begin, and large page-outs are observed periodically afterwards.

DESCRIPTION:
Buffered writes to a file bring the associated pages into memory, and such pages can be written to. However, mmapping this file marks the pages read-only. Any mmapped write to the same page now encounters a protection fault, leading to a page-in. The VxFS background flushing mechanism periodically flushes the entire file, which leads to large page-outs. During these page-outs, writes happening elsewhere on the file slow down owing to the unavailability of various inode locks currently held by the flushing thread.

RESOLUTION:
To prevent large page-ins, the extra overhead of the protection fault is removed by marking pages read-write in the mmap range wherever possible. A new tunable, "vx_ctrl_flush", is introduced, which can be tuned to control the amount of flushing by the background flushing thread and therefore the page-outs.

* 2602982 (Tracking ID: 2599590)

SYMPTOM:
Expansion of a 100% full file system may panic the machine with the following stack trace:
bad_kern_reference()
$cold_vfault()
vm_hndlr()
bubbledown()
vx_logflush()
vx_log_sync1()
vx_log_sync()
vx_worklist_thread()
kthread_daemon_startup()

DESCRIPTION:
When a 100% full file system is expanded, the intent log of the file system is truncated and the freed-up blocks are used for the expansion.
Due to a bug, the block map of the replica intent log inode was not getting updated, causing the block maps of the two inodes to differ. This caused some of the in-core structures of the intent log to go NULL, and the machine panics while dereferencing one of these structures.

RESOLUTION:
The block map of the replica intent log inode is now updated correctly. A 100% full file system can now be expanded only if the last extent in the intent log contains more than 32 blocks; otherwise fsadm will fail. To expand such a file system, some of the files should be deleted manually and the resize retried.

* 2603015 (Tracking ID: 2565400)

SYMPTOM:
Sequential buffered I/O reads are slow in performance.

DESCRIPTION:
Read-aheads are not happening because the file system's read-ahead size gets calculated incorrectly.

RESOLUTION:
Fixed the incorrect typecast.

* 2603121 (Tracking ID: 2577079)

SYMPTOM:
A customer suspects high pinned-memory usage by VxFS.

DESCRIPTION:
From the dump available, the VxFS usage does not appear to be high. Separate counters are therefore added to help in finding the pinned-memory usage by VxFS.

RESOLUTION:
vx_info counters are added to track the total, inode cache, buffer cache and DNLC pinned-memory usage.

* 2624650 (Tracking ID: 2624262)

SYMPTOM:
A panic is hit in the vx_bc_do_brelse() function while executing the dedup functionality, with the following backtrace:
vx_bc_do_brelse()
vx_mixread_compare()
vx_dedup_extents()
enqueue_entity()
__alloc_pages_slowpath()
__get_free_pages()
vx_getpages()
vx_do_dedup()
vx_aioctl_dedup()
vx_aioctl_common()
vx_rwunlock()
vx_aioctl()
vx_ioctl()
vfs_ioctl()
do_vfs_ioctl()
sys_ioctl()

DESCRIPTION:
While executing the function vx_mixread_compare() in the dedup code path, an error was hit, due to which an allocated data structure remained uninitialised. The panic occurs due to writing to this uninitialised allocated data structure in the function vx_mixread_compare().
RESOLUTION:
The code is changed to free the memory allocated to the data structure when exiting due to the error.

* 2626960 (Tracking ID: 2626390)

SYMPTOM:
Freeing a large number of pages at once can induce a small I/O latency.

DESCRIPTION:
Reading an entire file into memory and then removing the file frees all of its pages. Pages are freed in chunks, and currently the smallest configurable chunk size for freeing pages is 64 MB.

RESOLUTION:
The fine tuning of the chunk invalidation size is enhanced so that smaller chunks can now be tuned. The example below sets the chunk size for freeing pages to 8 MB; the table shows the tuning values now available.

# vxtunefs -D chunk_inval_size=6

Tuning value     :  0    1    2    3    4    5    6    7
chunk_inval_size : 64  256  128   64   32   16    8    4   (unit is MB)

* 2630984 (Tracking ID: 2622899)

SYMPTOM:
The number of PDTs is not set correctly by default.

# /opt/VRTS/bin/vxtunefs -D print
Filesystem parameters:
drefund enable = 0
drefund supported = 0
number of pdts = 1
default number of pdts = 16

The vxtunefs output above shows that the default number of PDTs should be 16, but it is incorrectly set to 1.

DESCRIPTION:
VxFS configuration tunables are read from the file /etc/vx/vxfssystem. If this file is not present, VxFS decides the tunable values based on the memory and the number of CPUs. Values read from the configuration file are stored in a structure and sent to the kernel. Because of a wrong initialization of this structure, VxFS believes these values have come from the configuration file and ends up assigning wrong values to the tunables even though the configuration file does not exist.

RESOLUTION:
The tunable structure is initialized properly, so that the tunables are set to values based on the memory and the number of CPUs.

* 2631026 (Tracking ID: 2332314)

SYMPTOM:
An internal noise.fullfsck test with ODM enabled hit the assert fdd_odm_aiodone:3.

DESCRIPTION:
In the case of a failed I/O in the fdd_write_clone_end() function, the error was not set on the buffer, which causes the assert.
RESOLUTION:
The code is changed to set the error on the buffer in the case of I/O failures in the fdd_write_clone_end() function.

* 2631221 (Tracking ID: 2349744)

SYMPTOM:
A system can panic when performing buffered I/O using large I/O sizes. The following example stack trace shows where the system panics:

abend_trap+000000 ()
xm_bad_free+000264 ()
xmfree+0004AC ()
vx_memfree_cpu+00006C ()
vx_alloc+000124 ()
vx_worklist_enqueue+000034 ()
vx_sched_thread+00024C ()
vx_sched_thread0+000038 ()
vx_thread_base+000048 ()
threadentry+000054 ()

DESCRIPTION:
Prior to the VxFS 5.1 release, the maximum buffered I/O size was restricted to 128Kb on the AIX platform. To assist buffered I/O performance, an enhancement was introduced in release 5.1 to perform larger buffered I/Os. With this enhancement, VxFS can be asked to perform buffered I/O of sizes greater than 128K by tuning the VxFS tunables "read_pref_io" and "write_pref_io" to values greater than 128K. When performing buffered I/O using I/O sizes of multiple megabytes, the system can panic in the routine xmfree() whilst freeing memory from a heap different from the one it was allocated from.

RESOLUTION:
The memory is now freed from the correct heap.

* 2631315 (Tracking ID: 2631276)

SYMPTOM:
Lookup fails for a file which is in a partitioned directory and is accessed using its VxFS namespace extension name.

DESCRIPTION:
If a file present in a partitioned directory is accessed using its VxFS namespace extension name, the name is searched in one of the hidden leaf directories. This leaf directory mostly does not contain an entry for the file, so the lookup fails.

RESOLUTION:
The code has been modified to call the partitioned-directory lookup routine at the upper level, so that the lookup does not fail even if the file is accessed using its extended namespace name.

* 2635583 (Tracking ID: 2271797)

SYMPTOM:
Internal noise testing with a locally mounted VxFS file system hit the assert "f:vx_getblk:1a".

DESCRIPTION:
The assert is hit because an overlay inode is being marked with the flag indicating a bad copy of the inode present on disk.

RESOLUTION:
The code is changed to set the flag indicating a bad copy of the inode present on disk only if the inode is not an overlay inode.

* 2642027 (Tracking ID: 2350956)

SYMPTOM:
An internal noise test on a locally mounted file system exited with the error message "bin/testit : Failed to full fsck cleanly, exiting", and the logs contain the userspace assert "bmaptops.c 369: ASSERT(devid == 0 || (start == VX_HOLE && devid == VX_DEVID_HOLE)) failed".

DESCRIPTION:
The function bmap_data4_set() is called while entering bmap allocation information for typed extents of type VX_TYPE_DATA_4 or VX_TYPE_IADDR_4.
The assert expects that either the devid is zero or, if the extent start is a hole, that the devid is VX_DEVID_HOLE. However, there are never extent descriptors representing holes in typed extents, so the assertion is incorrect.

RESOLUTION:
The assert is corrected to check that the extent start is not a hole and the devid is zero, or that the extent start is VX_OVERLAY with the devid being VX_DEVID_HOLE.

* 2646557 (Tracking ID: 2529201)

SYMPTOM:
The fscdsconv limits specified in the cdslimitstab file were wrong, so some CDS-related commands were failing.

DESCRIPTION:
The cdslimitstab file is used by the CDS/PDC commands to gather information about the migration targets. The vx_maxlink limits for 32k and 64k are wrong. The limits mentioned in the file are:

OS_MAX_NLINKS=32767
VXFS_MAX_NLINKS=65565

However, 2^16 - 1 evaluates to 65535, not 65565. If 32768 directories are created in a single directory, fscdsconv throws an nlink error.

RESOLUTION:
The limit for VXFS_MAX_NLINKS is set correctly to 65535.

* 2654644 (Tracking ID: 2630954)

SYMPTOM:
During internal CFS stress reconfiguration testing, the fsck(1M) command exits by hitting an assert.

DESCRIPTION:
When the dexh_getblk() function is executed, if the extent size is greater than 256 Kbytes, the extent is divided into chunks of 256 Kbytes each. When the extents of the hash inodes are read in dexh_getblk(), a maximum of 256 Kbytes (MAXBUFSZ) is read at a time. The chunk size was being assigned as 256 Kbytes every time, but when the last chunk in the extent is smaller than 256 Kbytes, the length of the buffer is assigned incorrectly and an aliased buffer appears in the buffer cache. For the last chunk, the remaining size in the extent to be read should be assigned as the chunk size instead.

RESOLUTION:
The code is modified so that the buffer size is calculated correctly.
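The last-chunk clamp described in incident 2654644 can be sketched in a few lines of C. This is an illustrative stand-in, not the actual dexh_getblk() code: MAXBUFSZ matches the 256 Kbyte limit mentioned above, while chunk_size() is a hypothetical helper name.

```c
#include <stddef.h>

#define MAXBUFSZ (256 * 1024)   /* maximum read size: 256 Kbytes */

/* Size of the next chunk to read from an extent at the given offset.
 * Every chunk is a full MAXBUFSZ except the last one, which must be
 * clamped to the bytes remaining in the extent.  (The bug was
 * effectively returning MAXBUFSZ unconditionally, over-sizing the
 * buffer for the final chunk.) */
static size_t chunk_size(size_t extent_size, size_t offset)
{
    size_t remaining = extent_size - offset;
    return remaining < MAXBUFSZ ? remaining : MAXBUFSZ;
}
```

With a 600 Kbyte extent, this yields two full 256 Kbyte chunks followed by one 88 Kbyte chunk, instead of three 256 Kbyte reads.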
* 2669195 (Tracking ID: 2326037)

SYMPTOM:
An internal stress test on a cluster file system with clones failed while writing to a file, with error ENOENT.

DESCRIPTION:
The VxFS file system was writing to a clone that was in the process of being removed. Since the clone removal process works asynchronously, the process starts to push changes from the inodes of the primary fileset to the inodes of the clone fileset, but by the time the actual write happens the inode of the clone fileset has been removed, so the error ENOENT is returned.

RESOLUTION:
Code is added to re-validate the inode being written.

Patch ID: 5.1.112.0

* 2169326 (Tracking ID: 2169324)

SYMPTOM:
On a local mount (LM), when a clone is mounted for a file system and a quota is assigned to the clone, if the quota is exceeded the clone is removed; if files from the clone are then being accessed, an assert may be hit in the function vx_idelxwri_off() through vx_trunc_tran().

DESCRIPTION:
During clone removal, all the inodes of the clone(s) being removed are traversed, and the assert is hit because there is a difference between the on-disk and in-core sizes of a file that is being modified by the application.

RESOLUTION:
While truncating files, if the VX_IEPTTRUNC op is set, the in-core file size is set to the on-disk file size.

* 2243061 (Tracking ID: 1296491)

SYMPTOM:
Performing a nested mount on a CFS file system triggers a data page fault if a forced unmount is also taking place on the CFS file system. The panic stack trace involves the following kernel routines:

vx_glm_range_unlock
vx_mount
domount
mount
syscall

DESCRIPTION:
When the underlying cluster-mounted file system is in the process of unmounting, the nested mount dereferences a NULL vfs structure pointer, thereby causing a system panic.

RESOLUTION:
The code has been modified to prevent the underlying cluster file system from being forcibly unmounted while a nested mount above it is in progress. The ENXIO error is returned to the forced unmount attempt.

* 2243063 (Tracking ID: 1949445)

SYMPTOM:
Hang when file creates were being performed on a large directory.
The stack of the hung thread is similar to the following:

vxglm:vxg_grant_sleep+226
vxglm:vxg_cmn_lock+563
vxglm:vxg_api_lock+412
vxfs:vx_glm_lock+29
vxfs:vx_get_ownership+70
vxfs:vx_exh_coverblk+89
vxfs:vx_exh_split+142
vxfs:vx_dexh_setup+1874
vxfs:vx_dexh_create+385
vxfs:vx_dexh_init+832
vxfs:vx_do_create+713

DESCRIPTION:
For large directories, the Large Directory Hash (LDH) is enabled to improve lookups on such directories. The hang was caused by taking ownership of the LDH inode twice in the same thread context, i.e. while building the hash for the directory.

RESOLUTION:
Avoid taking ownership again if the thread already has ownership of the LDH inode.

* 2247299 (Tracking ID: 2161379)

SYMPTOM:
In a CFS environment, various file system operations hang with the following stack traces:

T1:
vx_event_wait+0x40
vx_async_waitmsg+0xc
vx_msg_send+0x19c
vx_iread_msg+0x27c
vx_rwlock_getdata+0x2e4
vx_glm_cbfunc+0x14c
vx_glmlist_thread+0x204

T2:
vx_ilock+0xc
vx_assume_iowner+0x100
vx_hlock_getdata+0x3c
vx_glm_cbfunc+0x104
vx_glmlist_thread+0x204

DESCRIPTION:
Due to improper handling of the ENOTOWNER error in the iread receive function, the operation is retried continuously while holding an inode lock, blocking all other threads and causing a deadlock.

RESOLUTION:
The code is modified to release the inode lock on the ENOTOWNER error and acquire it again, thus resolving the deadlock. There are in total four vx_msg_get_owner() callers with ilocked=1:

vx_rwlock_getdata()  : needs the fix
vx_glock_getdata()   : needs the fix
vx_cfs_doextop_iau() : does not use the owner for the message loop; no fix needed
vx_iupdat_msg()      : already has 'unlock/delay/lock' on the ENOTOWNER condition

* 2249658 (Tracking ID: 2220300)

SYMPTOM:
'vx_sched' hogs CPU resources.

DESCRIPTION:
The vx_sched process calls vx_iflush_list() to perform the background flushing processing. vx_iflush_list() calls vx_logwrite_flush() if the file has had logged writes performed upon it. vx_logwrite_flush() performs an old trick that is ineffective when flushing in chunks.
The trick is to flush the file asynchronously, then flush the file again synchronously. This flushes the entire file twice, which is double the work when chunk flushing.

RESOLUTION:
vx_logwrite_flush() has been changed to flush the file once rather than twice: the asynchronous flush in vx_logwrite_flush() has been removed.

* 2255786 (Tracking ID: 2253617)

SYMPTOM:
Full fsck fails to run cleanly using "fsck -n".

DESCRIPTION:
In the case of duplicate file name entries in one directory, fsck compares the directory entry with the previous entries. If the file name already exists, further action is taken according to the user input [Yes/No]. Because strncmp is used, only the first n characters are compared; if they match, the comparison succeeds and the entry is treated as a duplicate file name, so fsck fails to run cleanly using "fsck -n".

RESOLUTION:
Checking the file name size and changing the length in strncmp to name_len + 1 solves the issue.

* 2257904 (Tracking ID: 2251223)

SYMPTOM:
The 'df -h' command can take 10 seconds to run to completion, and yet still report an inaccurate free block count, shortly after removing a large number of files.

DESCRIPTION:
When files are removed, some file data blocks are released and counted in the total free block count instantly. However, blocks are not always freed immediately, as VxFS can sometimes delay the release of blocks. The displayed free block count at any one time is therefore the sum of the free blocks and the 'delayed' free blocks. Once a file 'remove transaction' is done, its delayed free blocks are eliminated and the free block count is increased accordingly. However, some functions which process transactions, for example a metadata update, can also alter the free block count while ignoring the current delayed free blocks. As a result, if file 'remove transactions' have not finished updating their free blocks and their delayed free block information, the free space count can occasionally show more than the real disk space.
To obtain an up-to-date and valid free block count for a file system, a delay-and-retry loop therefore waited 1 second before each retry and looped 10 times before giving up. The 'df -h' command can thus sometimes take 10 seconds, and even after the file system waits for 10 seconds there is no guarantee that the output displayed will be accurate or valid.

RESOLUTION:
The delayed free block count is recalculated accurately when transactions are created and when metadata is flushed to disk.

* 2275543 (Tracking ID: 1475345)

SYMPTOM:
The write() system call hangs for over 10 seconds.

DESCRIPTION:
While performing a transaction in the case of a logged write, the buffers belonging to the transaction space were flushed asynchronously one at a time. This asynchronous flushing caused intermediate delays in the write operation because of the reduced transaction space.

RESOLUTION:
All the dirty buffers on the file are flushed in one attempt through a synchronous flush, which frees up a large amount of transaction space. This reduces the delay during the write system call.

* 2280386 (Tracking ID: 2061177)

SYMPTOM:
The 'fsadm -de' command errors with 'bad file number' on file systems on 5.0MP3RP1.

DESCRIPTION:
First, the kernel file system itself does not have a problem: there is no corrupt layout on the affected systems, as confirmed by the metasave obtained from the customer (the problem could not be reproduced, and there were no corrupted inodes in that metasave). Second, fsadm is an application with two parts: the application part, which reads the layout from the raw disk to decide its strategy, and the kernel part, which implements it. On a file system using buffered writes, this creates an unavoidable synchronisation problem between the two. On the customer's system, while 'fsadm -de' was running, a huge number of write operations were also in progress; they also have many checkpoints, and more checkpoints mean more copy-on-write, which multiplies the write operations. That is why more checkpoints mean more problems.

RESOLUTION:
The solution is to add a sync operation in fsadm before it reads the layout from the raw disk, to avoid the kernel and the application being out of sync.

* 2280552 (Tracking ID: 2246579)

SYMPTOM:
File system corruption and a system panic when attempting to extend a 100%-full disk layout version 5 (DLV5) VxFS file system using fsadm(1M).

DESCRIPTION:
The behavior is caused by file system metadata that was relocated to the intent log area being inadvertently destroyed when the intent log is cleared during the resize operation.

RESOLUTION:
The in-core intent log extent map is refreshed by reading the bmap of the intent log inode before clearing it.

* 2296277 (Tracking ID: 2296107)

SYMPTOM:
The fsppadm command (fsppadm query -a mountpoint) displays "Operation not applicable" while querying the mount point.

DESCRIPTION:
During the fsppadm query process, fsppadm tries to open every file's named data stream in the file system. However, the VxFS internal FCL file "changelog" does not support this operation, and ENOSYS is returned in this case. fsppadm translates ENOSYS into "Operation not applicable" and prints the bogus error message.

RESOLUTION:
fsppadm's get_file_tags() is fixed to ignore the ENOSYS error.

* 2311490 (Tracking ID: 2074806)

SYMPTOM:
A DMAPI program using dm_punch_hole may result in corrupted data.

DESCRIPTION:
When the dm_punch_hole call is made on a file with allocated extents immediately after a previous write, data can be written through stale pages. This causes data to be written to the wrong location.

RESOLUTION:
dm_punch_hole now invalidates all the pages within the hole it is creating.

* 2320044 (Tracking ID: 2419989)

SYMPTOM:
The ncheck(1M) command with the '-i' option does not limit the output to the specified inodes.

DESCRIPTION:
The ncheck(1M) command with the '-i' option currently shows free space information and other inodes that are not in the list provided by the '-i' option.
RESOLUTION:
The ncheck(1M) command is modified to print only those inodes that are specified by the '-i' option.

* 2320049 (Tracking ID: 2419991)

SYMPTOM:
There is no way to specify an inode that is unique to the file system, since inode numbers are reused in multiple filesets. A way is therefore needed to specify a list of filesets, similar to the '-i' option for inodes, or a new '-o' option with which fileset+inode pairs can be specified.

DESCRIPTION:
When the ncheck command is called with the '-i' option in conjunction with the -oblock/device/sector option, it displays inodes having the same inode number from all filesets. There is no command line option that allows a unique inode and fileset combination to be specified.

RESOLUTION:
The code is modified to add an '-f' option to the ncheck command, with which the fset number on which to filter the results can be specified. If this option is used together with the '-i' option, the inode-fileset pair(s) to display can be specified uniquely.

* 2329887 (Tracking ID: 2253938)

SYMPTOM:
In a Cluster File System (CFS) environment, the file read performance gradually degrades up to 10% of the original read performance, and 'fsadm -F vxfs -D -E' shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:

% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than  8 blks:  5.33

DESCRIPTION:
In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This fragments the free space, thus leading to badly fragmented files.
RESOLUTION:
The code is modified so that the time for which the delegation of an AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to de-fragment free space using the -C option.

* 2329893 (Tracking ID: 2316094)

SYMPTOM:
vxfsstat incorrectly reports "vxi_bcache_maxkbyte" greater than "vx_bc_bufhwm" after a reinitialization of the buffer cache globals; reinitialization can happen during dynamic reconfiguration operations. The vxfsstat "vxi_bcache_maxkbyte" counter shows the maximum memory available for the allocation of buffer cache buffers. The maximum memory available for buffer allocation depends on the total memory available for the buffer cache (buffers + buffer headers), i.e. the "vx_bc_bufhwm" global. Therefore, vxi_bcache_maxkbyte should never be greater than vx_bc_bufhwm.

DESCRIPTION:
"vxi_bcache_maxkbyte" is a per-CPU counter, i.e. part of the global per-CPU 'vx_info' counter structure. vxfsstat sums all the per-CPU counters and reports the result. During the reinitialization of the buffer cache, this counter was not set to zero before the new value was assigned to it. Therefore, the total sum of this per-CPU counter could be more than 'vx_bc_bufhwm'.

RESOLUTION:
During buffer cache reinitialization, "vxi_bcache_maxkbyte" is now correctly set to zero, so that the final sum of this per-CPU counter is correct.

* 2340741 (Tracking ID: 2282201)

SYMPTOM:
On a VxFS file system, a vxdump(1M) operation running in parallel with other file system operations such as create and delete can fail with signal SIGSEGV, generating a core file.

DESCRIPTION:
vxdump caches the inodes to be dumped in a bitmap before starting the dump of a directory; however, this can change if creates and deletes are happening in the background, leading to an inconsistent bitmap and eventually generating a core file.
RESOLUTION:
The code is updated to refresh the inode bitmap before actually starting the dump operation, thus avoiding the core file generation.

* 2340799 (Tracking ID: 2059611)

SYMPTOM:
The system panics because of a NULL tranp in vx_unlockmap().

DESCRIPTION:
vx_unlockmap() unlocks a map structure of the file system. If the map is being handled, the hold count is incremented. vx_unlockmap() attempts to check whether the mlink doubly linked list is empty, but the asynchronous vx_mapiodone routine can change the link at unpredictable times even when the hold count is zero.

RESOLUTION:
The evaluation order is changed inside vx_unlockmap(), so that further evaluation can be skipped when the map hold count is zero.

* 2340817 (Tracking ID: 2192895)

SYMPTOM:
The system panics when performing fcl commands, at:

unix:panicsys
unix:vpanic_common
unix:panic
genunix:vmem_xalloc
genunix:vmem_alloc
unix:segkmem_xalloc
unix:segkmem_alloc_vn
genunix:vmem_xalloc
genunix:vmem_alloc
genunix:kmem_alloc
vxfs:vx_getacl
vxfs:vx_getsecattr
genunix:fop_getsecattr
genunix:cacl
genunix:acl
unix:syscall_trap32

DESCRIPTION:
The ACL count in the inode can be corrupted due to a race condition. For example, setacl can change the ACL count while getacl is processing the same inode, which could cause an invalid use of the ACL count.

RESOLUTION:
The code is modified to protect the vulnerable ACL count, avoiding the corruption.

* 2340825 (Tracking ID: 2290800)

SYMPTOM:
When using fsdb to look at the map of the ILIST file (the "mapall" command), fsdb can wrongly report a large hole at the end of the ILIST file.

DESCRIPTION:
While reading the bmap of the ILIST file, if a hole at the end of indirect extents is found, fsdb may incorrectly end up marking the hole as the last extent in the bmap, causing the mapall command to show a large hole extending to the end of the file.
RESOLUTION:
The code has been modified to read the ILIST file's bmap correctly when holes at the end of indirect extents are found, instead of marking such a hole as the last extent of the file.

* 2340831 (Tracking ID: 2272072)

SYMPTOM:
GAB panics the box because the VCS engine "had" did not respond; the lbolt wraps around.

DESCRIPTION:
The lbolt wraps around after 498 days of machine uptime. VxFS flushes its metadata buffers based on their age, and the age calculation uses lbolt. Because lbolt wrapped, the buffers were not flushed, so a lot of metadata I/O stopped, hence the panic.

RESOLUTION:
The function handling the flushing of dirty buffers now also handles the condition where lbolt has wrapped: if it has, the current lbolt time is assigned to the last update time of the dirty list.

* 2340834 (Tracking ID: 2302426)

SYMPTOM:
The system panics when multiple 'vxassist mirror' commands are running concurrently, with the following stack trace:

0) panic+0x410
1) unaligned_hndlr+0x190
2) bubbleup+0x880 ( )
+------------- TRAP #1 ----------------------------
| Unaligned Reference Fault in KERNEL mode
| IIP=0xe000000000b03ce0:0
| IFA=0xe0000005aa53c114 <---
| p struct save_state 0x2c561031.0x9fffffff5ffc7400
+------------- TRAP #1 ----------------------------
LVL FUNC ( IN0, IN1, IN2, IN3, IN4, IN5, IN6, IN7 )
3) vx_copy_getemap_structs+0x70
4) vx_send_getemapmsg+0x240
5) vx_cfs_getemap+0x240
6) vx_get_freeexts_ioctl+0x990
7) vxportal_ioctl+0x4d0
8) spec_ioctl+0x100
9) vno_ioctl+0x390
10) ioctl+0x3c0
11) syscall+0x5a0

DESCRIPTION:
The panic is caused by dereferencing an unaligned address in a CFS message structure.

RESOLUTION:
bcopy is used to ensure proper alignment of the addresses.

* 2340839 (Tracking ID: 2316793)

SYMPTOM:
Shortly after removing files in a file system, commands which use 'statfs()', such as 'df', can take 10 seconds to complete.
DESCRIPTION:
To obtain an up-to-date and valid free block count in a file system, a delay-and-retry loop delayed 1 second before each retry and looped 10 times before giving up. This unnecessarily excessive retrying could cause a 10 second delay per file system when executing the df command.

RESOLUTION:
The original 10 retries with a 1-second delay each have been reduced to 1 retry after a 20-millisecond delay when waiting for an updated free block count.

* 2341007 (Tracking ID: 2300682)

SYMPTOM:
When a file is newly created, issuing "fsppadm query -a /mount_point" can show incorrect IOTemp information.

DESCRIPTION:
fsppadm query outputs incorrect data when a file reuses an inode number which belonged to a removed file, but the database still contains the obsolete record for the removed one. The fsppadm utility uses a database to save the inodes' historical data. It compares the nearest and the farthest records for an inode to compute IOTemp in a time window, and it picks the generation of the inode in the farthest record to check the inode's existence. However, if the farthest record is missing, zero is mistakenly used as the generation.

RESOLUTION:
If the nearest record for a given inode exists in the database, the generation entry is extracted from it instead of from the farthest one.

* 2360817 (Tracking ID: 2332460)

SYMPTOM:
Executing the VxFS 'vxedquota -p user1 user2' command to copy the quota information of one user to other users takes a long time to run to completion.

DESCRIPTION:
VxFS maintains quota information in two files: external quota files and internal quota files. While copying the quota information of one user to another, the required quota information is read from both the external and the internal files. However, the quota information should only need to be read from the external file in a situation where the read from the internal file has failed. Reading from both files therefore causes an unnecessary delay in the command execution time.
RESOLUTION:
The unnecessary duplication of reading both the external and the internal quota files to retrieve the same information has been removed.

* 2360819 (Tracking ID: 2337470)

SYMPTOM:
A Cluster File System can unexpectedly and prematurely report a 'file system out of inodes' error when attempting to create a new file. The error message reported is similar to the following:

vxfs: msgcnt 1 mesg 011: V-2-11: vx_noinode - /dev/vx/dsk/dg/vol file system out of inodes

DESCRIPTION:
When allocating new inodes in a cluster file system, VxFS searches for an available free inode in the Inode Allocation Units [IAUs] currently delegated to the local node. If none are available, it then searches the IAUs that are not currently delegated to any node, or revokes an IAU delegated to another node. It is also possible for gaps, or HOLEs, to be created in the IAU structures as a side effect of the CFS delegation processing. However, when searching for an available free inode, VxFS simply ignored any HOLEs it found. Once the maximum size of the metadata structures has been reached (2^31), new IAUs cannot be created, so one of the HOLEs should then be populated and used for new inode allocation. Because the HOLEs were being ignored, VxFS could prematurely report the "file system out of inodes" error even though there was plenty of free space in the file system to create new inodes.

RESOLUTION:
New inodes are now allocated from the gaps, or HOLEs, in the IAU structures (created as a side effect of the CFS delegation processing). The HOLEs are populated rather than returning a 'file system out of inodes' error.
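The order of preference in the fixed allocation path can be modelled with a toy C function, assuming a simple array of per-IAU free-inode counts. IAU_HOLE, iau_pick() and the representation are hypothetical illustrations, not VxFS internals; the point is only the fallback: an IAU with free inodes first, then a HOLE to populate, and only then a genuine out-of-inodes error.

```c
#define IAU_HOLE (-1)   /* gap left by CFS delegation processing */

/* freecnt[i] is the number of free inodes in IAU i, or IAU_HOLE if the
 * slot is an unpopulated gap.  Returns the index of the IAU to allocate
 * from, or -1 for a genuine out-of-inodes condition.  *populate_hole is
 * set when the chosen slot is a gap that must be populated first. */
static int iau_pick(const int *freecnt, int niau, int *populate_hole)
{
    int hole = -1;
    *populate_hole = 0;
    for (int i = 0; i < niau; i++) {
        if (freecnt[i] == IAU_HOLE) {
            if (hole < 0)
                hole = i;          /* remember the first gap seen */
        } else if (freecnt[i] > 0) {
            return i;              /* an existing IAU still has free inodes */
        }
    }
    if (hole >= 0) {               /* the fix: populate a gap instead of failing */
        *populate_hole = 1;
        return hole;
    }
    return -1;                     /* really out of inodes */
}
```

The pre-fix behaviour corresponds to skipping the `hole >= 0` branch, which is what produced the premature error.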
* 2360821 (Tracking ID: 1956458)

SYMPTOM:
When attempting to check checkpoint information with 'fsckptadm -C blockinfo', the command failed with error 6 (ENXIO), the file system was disabled, and errors similar to the following appeared in the message file:

vxfs: msgcnt 4 mesg 012: V-2-12: vx_iget - /dev/vx/dsk/sfsdg/three file system invalid inode number 4495
vxfs: msgcnt 5 mesg 096: V-2-96: vx_setfsflags - /dev/vx/dsk/sfsdg/three file system fullfsck flag set - vx_cfs_iread

DESCRIPTION:
VxFS uses ilist files in the primary fileset and in checkpoints to accommodate inode information. A hole in an ilist file indicates that the inodes in the hole do not exist and are not yet allocated in the corresponding fileset or checkpoint. fsckptadm checks every inode in the primary fileset and in the downstream checkpoints. If an inode falls into a hole in a prior checkpoint, i.e. the associated file had not been created at the time of the checkpoint creation, fsckptadm exits with an error.

RESOLUTION:
Inodes in the downstream checkpoints are skipped if they are located in a hole.

* 2368738 (Tracking ID: 2368737)

SYMPTOM:
If a file which has shared extents has corrupt indirect blocks, then in certain cases the reference count tracking system can try to interpret such a block and panic the system. Since this is an asynchronous background operation, the processing is retried repeatedly on every file system mount and can therefore result in a panic every time the file system is mounted.

DESCRIPTION:
The reference count tracking system for shared extents updates reference counts in a lazy fashion, so in certain cases it has to asynchronously access shared indirect blocks belonging to a file to account for reference count updates. If such an indirect block has already been badly corrupted, the tracking mechanism can panic the system repeatedly on every mount.
RESOLUTION:
The reference count tracking system now validates the indirect extent read from the disk; if it is not found valid, the VX_FULLFSCK flag is set in the superblock, marking the file system for a full fsck, and the file system is disabled on the current node.

* 2373565 (Tracking ID: 2283315)

SYMPTOM:
The system may panic when "fsadm -e" is run on a file system containing file level snapshots. The panic stack looks like:

crash_kexec()
__die()
do_page_fault()
error_exit()
[exception RIP: vx_bmap_lookup+36]
vx_bmap_lookup()
vx_bmap()
vx_reorg_emap()
vx_extmap_reorg()
vx_reorg()
vx_aioctl_full()
vx_aioctl_common()
vx_aioctl()
vx_ioctl()
do_ioctl()
vfs_ioctl()
sys_ioctl()
tracesys

DESCRIPTION:
The panic happened because of a NULL inode pointer passed to the vx_bmap_lookup() function. While reorganizing the extents of a file, a block map (bmap) lookup operation is done on the file to obtain information about its extents. If this bmap lookup finds a hole at an offset in a file containing shared extents, a local variable is not updated, which makes the inode pointer NULL during the next bmap lookup operation.

RESOLUTION:
The local variable is initialized so that the inode pointer passed to vx_bmap_lookup() is non-NULL.

* 2386483 (Tracking ID: 2374887)

SYMPTOM:
Access to a file system can hang when creating a named attribute, due to a read/write lock being held exclusively and indefinitely, causing a thread to loop in vx_tran_nattr_dircreate(). A typical stack trace of a looping thread:

vx_itryhold_locked
vx_iget
vx_attr_iget
vx_attr_kgeti
vx_attr_getnonimmed
vx_acl_inherit
vx_aclop_creat
vx_attr_creatop
vx_new_attr
vx_attr_inheritbuf
vx_attr_inherit
vx_tran_nattr_dircreate
vx_nattr_copen
vx_nattr_open
vx_setea
vx_linux_setxattr
vfs_setxattr
link_path_walk
sys_setxattr
system_call

DESCRIPTION:
The initial creation of a named attribute for a regular file or directory results in the automatic creation of a 'named attribute directory'.
Creations are initially attempted in a single transaction. Should the single transaction fail because a read/write lock is held, a retry should split the task into multiple transactions. An incorrect reset of a tracking structure meant that all retries were performed using a single transaction, creating an endless retry loop.

RESOLUTION: The tracking structure is no longer reset within the retry loop.

* 2402643 (Tracking ID: 2399178)

SYMPTOM: A full fsck performs large directory index validation during pass2c. If the number of large directories is high, this pass takes a long time; there is considerable scope to improve full fsck performance in this pass.

DESCRIPTION: Pass2c consists of the following basic operations:
[1] Read the entries in each large directory.
[2] Cross-check the hash values of those entries against the hash directory inode contents residing on the attribute ilist.
This makes pass2c another heavily I/O-intensive pass.

RESOLUTION:
1. Use directory block read-ahead during step [1].
2. Wherever possible, access the file contents extent-wise rather than in file system block size (while reading entries in the directory) or in hash block size (8k, during dexh_getblk).
With these enhancements, the buffer cache is utilized more effectively.

* 2405590 (Tracking ID: 2397976)

SYMPTOM: The system can panic when performing buffered I/O with large I/O sizes. The following example stack trace shows where the system panics:
d_map_list_tce+000510
efc_mapdma_iocb+000498
efc_start+000518
efc_output+0005A4
efsc_start+0006A8
efsc_strategy+000D98
std_devstrat+000364
devstrat+000050
scsidisk_start+000D7C
scsidisk_iodone+00068C
internal_iodone_offl+000174
iodone_offl+000080
i_softmod+000274
flih_util+00024C

DESCRIPTION: Prior to the VxFS 5.1 release, the maximum buffered I/O size was restricted to 128KB on the AIX platform. To improve buffered I/O performance, an enhancement was introduced in release 5.1 to perform larger buffered I/Os.
With this enhancement, VxFS can be asked to perform buffered I/O of sizes greater than 128KB by tuning the VxFS tunables "read_pref_io" and "write_pref_io" to values greater than 128KB. When performing buffered I/O with I/O sizes of multiple megabytes, the system can panic in the routine d_map_list_tce() while performing DMA mappings.

RESOLUTION: The 128KB buffered I/O size limitation is reintroduced to avoid the possibility of a system panic. Regardless of the "read_pref_io" and "write_pref_io" tunable settings, the maximum I/O size for buffered I/O is restricted to 128KB.

* 2412029 (Tracking ID: 2384831)

SYMPTOM: The system panics with the following stack trace. This happens in some cases when named streams are used in VxFS.
machine_kexec()
crash_kexec()
__die
do_page_fault()
error_exit()
[exception RIP: iput+75]
vx_softcnt_flush()
vx_ireuse_clean()
vx_ilist_chunkclean()
vx_inode_free_list()
vx_ifree_scan_list()
vx_workitem_process()
vx_worklist_process()
vx_worklist_thread()
vx_kthread_init()
kernel_thread()

DESCRIPTION: VxFS internally creates a directory to keep the named streams pertaining to a file. In some scenarios, an error code path misses releasing the hold on that directory. As a result, unmounting the file system does not clean up the inode belonging to that directory, and a panic is seen later when VxFS reuses such an inode.

RESOLUTION: The hold on the named streams directory is now released in case of an error.

* 2412177 (Tracking ID: 2371710)

SYMPTOM: User quota file corruption occurs when the DELICACHE feature is enabled; the current inode usage of a user becomes negative after frequent file creations and deletions.
Checking the quota information with the command "vxquota -vu username" shows the number of files as "-1":
# vxquota -vu testuser2
Disk quotas for testuser2 (uid 500):
Filesystem usage quota limit timeleft files quota limit timeleft
/vol01 1127809 8239104 8239104 -1 0 0

DESCRIPTION: This issue is introduced by the inode DELICACHE feature in 5.1SP1, a performance enhancement that optimizes the updates done to the inode map during file creations and deletions. The feature is enabled by default and can be changed with vxtunefs. When DELICACHE is enabled and quota is set for VxFS, there is an extra quota update for the inodes on the inactive list during the removal process. Because the quota for these inodes has already been updated before they were put on the delicache list, the current number of user files is eventually decremented twice.

RESOLUTION: A flag now identifies the inodes moved to the inactive list from the delicache list, so that the flag can be used to prevent updating the quota again during the removal process.

* 2412179 (Tracking ID: 2387609)

SYMPTOM: Quota usage gets set to ZERO when the file system is unmounted and remounted, even though files owned by users exist. This issue may occur after some file creations and deletions. Checking the quota usage with the "vxrepquota" command produces output like the following:
# vxrepquota -uv /vx/sofs1/
/dev/vx/dsk/sfsdg/sofs1 (/vx/sofs1):
Block limits File limits
User used soft hard timeleft used soft hard timeleft
testuser1 -- 0 3670016 4194304 0 0 0
testuser2 -- 0 3670016 4194304 0 0 0
testuser3 -- 0 3670016 4194304 0 0 0
Additionally, the quota usage may not be updated after inode/block usage reaches ZERO.

DESCRIPTION: The issue occurs when VxFS merges external per-node quota files with the internal quota file. The block offset within an external quota file could be calculated incorrectly in some scenarios.
When a hole is found in a per-node quota file, the file offset is advanced to point to the next non-hole offset, but the block offset, which points to the next available quota record within a block, is not adjusted accordingly. Additionally, VxFS updates per-node quota records only when the global internal quota file shows some byte or inode usage; otherwise it does not copy the usage from the global quota file to the per-node quota file. So in the case where quota usage in the external quota files has gone down to zero, and both byte and inode usage in the global file become zero, the per-node quota records are not updated and are left with incorrect usage. The merge should also check the byte and inode usage in the per-node quota record, and should skip copying a record only when the byte and inode usage in both the global quota file and the per-node quota file are zero.

RESOLUTION: The block offset is now calculated correctly when a hole is found in a per-node quota file. Code was also added to check the block and inode usage in the per-node quota record while updating user quota usage.

* 2412181 (Tracking ID: 2372093)

SYMPTOM: New fsadm command options, to defragment a given percentage of the available free space in a file system, have been introduced as part of an initiative to help improve Cluster File System (CFS) performance. The new additional command usage is as follows:
fsadm -C -U
We have since found that this new free space defragmentation operation can sometimes hang (while continuing to consume some CPU) in specific circumstances when executed on a cluster-mounted file system (CFS).

DESCRIPTION: The hang can occur when file system metadata is being relocated. In the reported case, the hang occurs while relocating inodes whose corresponding files are being actively updated from a different node in the cluster than the one from which the fsadm command is being executed.
During the relocation, an error code path is taken because of an unexpected mismatch between temporary replica metadata, and that code path results in a deadlock, or hang.

RESOLUTION: As there is no overriding need to relocate structural metadata for the purpose of defragmenting the available free space, all structural metadata is now simply left in place during this operation, thus avoiding its relocation. The changes required for this solution are therefore very low risk.

* 2412187 (Tracking ID: 2401196)

SYMPTOM: System panic in vx_ireuse_clean during Dynamic Reconfiguration on VRTSvxfs 5.1SP1 onwards.

DESCRIPTION: During Dynamic Reconfiguration (DR), inodes are moved to temporary DR inode lists so that DR can proceed to deinitialize the various inode lists. The code did not take care of the newly added delicache list (added in 5.1SP1) and wrongly moved inodes from the delicache list to the freelist. As a result, an inode belonging to a local file system could end up on the freelist with VX_IREMOVE set, which breaks the assumption that such an inode must belong to CFS, and causes a panic when NULL CFS-related fields of the local inode are dereferenced.

RESOLUTION: A case was added to handle inodes on the delicache list during DR, so that they remain on their own list and are not moved to the freelist.

* 2418819 (Tracking ID: 2283893)

SYMPTOM: In a Cluster File System (CFS) environment, file read performance gradually degrades by up to 10% of the original read performance, and "fsadm(1M) -F vxfs -D -E" shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than 8 blks: 5.33

DESCRIPTION: In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegations for these AUs are cached locally on the nodes.
When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest available free extent matching the requested size in the other AUs. This fragments the free space, thus leading to badly fragmented files.

RESOLUTION: The code is modified so that the time for which an AU delegation is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to defragment free space using the -C option.

* 2420060 (Tracking ID: 2403126)

SYMPTOM: A hang is seen in the cluster when one of the nodes leaves the cluster or is rebooted. One of the nodes in the cluster shows the following stack trace:
e_sleep_thread()
vx_event_wait()
vx_async_waitmsg()
vx_msg_send()
vx_send_rbdele_resp()
vx_recv_rbdele+00029C ()
vx_recvdele+000100 ()
vx_msg_recvreq+000158 ()
vx_msg_process_thread+0001AC ()
vx_thread_base+00002C ()
threadentry+000014 (??, ??, ??, ??)

DESCRIPTION: Whenever a node leaves the cluster, reconfiguration happens and all the resources held by the leaving node are consolidated. This is done on one node of the cluster, called the primary node. Each node sends a message to the primary node describing the resources it currently holds. During this reconfiguration, in a corner case, VxFS incorrectly calculates a message length that is larger than what the GAB (Veritas Group Membership and Atomic Broadcast) layer can handle. As a result, the message is dropped at the sender and never sent, while the sender assumes the message was sent and waits for an acknowledgement. The primary node waiting for this message waits forever, and the reconfiguration never completes, leading to the hang.

RESOLUTION: The message length is now calculated correctly, so that GAB can handle the messages.
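The fix above amounts to sizing each reconfiguration message against the transport limit before sending. A minimal sketch of that idea follows; the limit, header, and record sizes are purely illustrative (the actual GAB limit and VxFS message layout are not stated in this document), and the function name is invented:

```python
# Hypothetical sketch: split a node's resource list into messages that,
# including header overhead, never exceed the transport's maximum size.
# MAX_PAYLOAD, HDR, and REC are illustrative values, not real GAB limits.
MAX_PAYLOAD = 64 * 1024
HDR = 64      # assumed per-message header bytes
REC = 128     # assumed per-resource record bytes

def build_messages(resources):
    """Chunk 'resources' so every message fits within MAX_PAYLOAD."""
    per_msg = (MAX_PAYLOAD - HDR) // REC   # records that fit in one message
    return [resources[i:i + per_msg]
            for i in range(0, len(resources), per_msg)]
```

The key point mirrors the RESOLUTION text: the sender must account for the transport's limit up front, rather than handing the layer below a message it will silently drop.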
* 2425429 (Tracking ID: 2422574)

SYMPTOM: On CFS, after turning quotas on, when any node is rebooted and rejoins the cluster, it fails to mount the file system.

DESCRIPTION: When the file system was mounted after rebooting the node, mntlock was already set, which did not allow a remount of the file system when quota is on.

RESOLUTION: The code is changed so that the mntlock flag is masked in the quota operation, as it is already set on the mount.

* 2426039 (Tracking ID: 2412604)

SYMPTOM: Once the time limit expires after exceeding the soft limit of the user quota size on a VxFS file system, writes are still permitted over the soft limit.

DESCRIPTION: The timer was not started during the initial setup of the soft limit, even when usage had already exceeded it.

RESOLUTION: The timer is now started during the initial setting of quota limits if current usage has already crossed the soft quota limits.

* 2427269 (Tracking ID: 2399228)

SYMPTOM: Occasionally Oracle archive logs can be created smaller than they should be; in the reported case, the resulting Oracle archive logs were incorrectly sized at 512 bytes.

DESCRIPTION: The fcntl (file control) command F_FREESP (free storage space) can be used to change the size of a regular file. If the file size is reduced, we call it a "truncate", and space allocated in the truncated area is returned to the file system free space pool. If the file size is increased using F_FREESP, we call it a "truncate-up"; although the file size changes, no space is allocated in the extended area of the file. Oracle archive logs use the F_FREESP fcntl command to perform a truncate-up of a new file before a smaller write of 512 bytes (at the start of the file) is performed. A timing window was found with F_FREESP in which the truncate-up file size was lost, or rather overwritten, by the subsequent write of the data, causing the file to appear with a size of just 512 bytes.
RESOLUTION: The timing window has been closed so that the flush of the allocating (512-byte) write is triggered only after the new F_FREESP file size has been updated in the inode.

* 2427281 (Tracking ID: 2413172)

SYMPTOM: The vxfs_fcl_seektime() API seeks to the first record in the File Change Log (FCL) file after a specified time. This API can incorrectly return the EINVAL (FCL record not found) error while reading the first block of the FCL file.

DESCRIPTION: To seek to the first record after the given time, a binary search is first performed to find the largest block offset where the FCL record time is less than the given time. A linear search from this offset is then performed to find the first record with a time value greater than the specified time. FCL records are read in buffers. There are scenarios in which the FCL records read into one buffer occupy less than the buffer size, e.g. when reading the first block of the FCL file. In such scenarios, the buffer read could continue even after all the data in the current buffer had been read, because the check that decides whether all records in a buffer have been read was wrong. Reading the buffer beyond its boundary caused the search to terminate without finding the record for the given time, so the EINVAL error was returned. VxFS should instead detect that the buffer is partially filled and continue the search in the next buffer.

RESOLUTION: The check that decides whether all records in a buffer have been read is corrected so that the buffer is read within its boundaries.

* 2478237 (Tracking ID: 2384861)

SYMPTOM: The following asserts are seen during internal stress and regression runs:
f:vx_do_filesnap:1b
f:vx_inactive:2a
f:xted_check_rwdata:31
f:vx_do_unshare:1

DESCRIPTION: These asserts validate assumptions in various functions. Some miscellaneous issues were also seen during internal testing.

RESOLUTION: The code has been modified to fix the internally reported issues, along with other miscellaneous changes.
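The buffer-boundary bug described above for vxfs_fcl_seektime() can be illustrated with a small model: binary-search to the right buffer, then scan linearly while honoring each buffer's actual fill level rather than its nominal size. This is a sketch only; the record layout and names are invented, not the real FCL on-disk format:

```python
def seek_after(buffers, t):
    """Model of the FCL seek. 'buffers' is a list of lists of record
    times, ascending overall; an inner list may be shorter than the
    nominal buffer size (a partially filled buffer, as with the first
    FCL block). Return (buffer_idx, record_idx) of the first record
    with time > t, or None if no such record exists."""
    # Binary search for the last buffer whose first record time <= t.
    lo, hi = 0, len(buffers) - 1
    start = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if buffers[mid][0] <= t:
            start, lo = mid, mid + 1
        else:
            hi = mid - 1
    # Linear scan from that buffer onward. Iterating over len(buffers[b])
    # stops at each buffer's real fill level -- the corrected boundary
    # check; the original bug in effect scanned the full nominal buffer
    # size and walked past the valid records.
    for b in range(start, len(buffers)):
        for r, rec_time in enumerate(buffers[b]):
            if rec_time > t:
                return (b, r)
    return None
```

With a partially filled first buffer such as [[1, 2], [5, 6, 7, 9], [12]], seeking past time 3 correctly lands on the record with time 5 instead of failing with a not-found error.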
* 2478325 (Tracking ID: 2251015)

SYMPTOM: The fsck(1M) command takes a long time to complete.

DESCRIPTION: In an extreme case, such as a 2TB file system with a 1KB block size, 130+ checkpoints, and 100-250 million inodes per fileset, fsck(1M) takes 15+ hours to complete intent log replay, because it has to read a few GB of IAU headers and summaries one synchronous block at a time.

RESOLUTION: The fsck code now performs read-ahead on the IAU file, which reduces the fsck log-replay time.

* 2480949 (Tracking ID: 2480935)

SYMPTOM: The system log file may contain the following error message in a multi-threaded environment with Dynamic Storage Tiering (DST):
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index creation failure for /vx/fsvm with message Argument list too long

DESCRIPTION: In DST, while enforcing a policy, SQL queries are generated and written to the file .__fsppadm_enforcesql in lost+found. In a multi-threaded environment, 16 threads work in parallel on the ILIST, generate SQL queries, and write them to the file. This can corrupt the file if multiple threads write to it simultaneously.

RESOLUTION: A mutex is used to serialize the threads' writes to the SQL file.

* 2482337 (Tracking ID: 2431674)

SYMPTOM: Panic in vx_common_msgprint() via vx_inactive().

DESCRIPTION: The call to VX_CMN_ERR() uses an "llx" format character that vx_common_msgprint() does not understand. It gives up trying to process that format but continues on without consuming the corresponding parameter. Everything else in the parameter list is effectively shifted by 8 bytes, and by the time the string argument is processed, the arguments no longer line up, causing the panic.

RESOLUTION: The format is changed to "llu", which vx_common_msgprint() understands.

* 2486597 (Tracking ID: 2486589)

SYMPTOM: On a machine under severe inode pressure, multiple threads may wait on a mutex owned by a thread that is in the function vx_ireuse_steal(), with the following stack trace:
vx_ireuse_steal()
vx_ireuse()
vx_iget()

DESCRIPTION: Several threads are waiting to get inodes from VxFS. The current number of inodes has reached the maximum number of inodes (vxfs_ninode) that can be created in memory, so no new allocations are possible and the threads wait.

RESOLUTION: The code is modified so that in this situation, threads return ENOINODE instead of retrying to get inodes.

* 2487976 (Tracking ID: 2483514)

SYMPTOM: System panic with the following stack:
pvthread+19E500 STACK:
[0001BF00]abend_trap+000000 ()
[000A29E4]tod_lock_write+000084 (??)
[000A34A0]tstop+0000C0 (??)
[04C2F04C].vx_untimeout+000078 ()
[04C29078].vx_timed_sleep+000090 ()
[04D2A72C].vx_open_modes+000574 ()
[04D2F80C].vx_open1+0001FC ()
[04D2FE80].vx_open+00007C ()
[04C39514].vx_open_skey+000044 ()
[0057D75C]vnop_open+0004BC (??, ??, ??, ??, ??)
[00615D24]openpnp+0005E4 (??, ??, ??, ??, ??, ??)
[006162E0]openpath+000100 (??, ??, ??, ??, ??, ??, ??)
[006167B4]copen+000294 (??, ??, ??, ??, ??)
[006156BC]kopen+00001C (??, ??, ??)

DESCRIPTION: When a file is opened with the O_DELAY flag and some process already has the file open in a conflicting mode with the O_NSHARE flag, the first process waits in a loop until the file becomes available, instead of exiting. This is achieved by associating a timer with the waiting process: the process wakes up after the timer expires and checks whether it can open the file; if not, it starts a fresh timer and sleeps again. AIX provides the tstart/tstop kernel services to implement this timer. Every timer has an associated callback function that is executed when the timer expires, after which the process sleeping on the timer is awakened. The callback routine adds the expired timer to a global freelist. All the expired timers on this global freelist are freed at the beginning of the next timeout operation, which happens with the next open retry or with each fresh open with O_DELAY.
The callback function is executed just before the process sleeping on the timer is awakened, so a timer can be on the freelist while the process waiting for it is still sleeping; the timer can therefore be freed while the process is still asleep. If the process is aborted during this window, an attempt is made to stop a timer that may already have been freed, and this race results in the panic.

RESOLUTION: An expired timer on the freelist is no longer freed while the process associated with it is still sleeping. Once the process wakes up, it sets a flag on the timer structure to signal that the timer is safe to free.

* 2494464 (Tracking ID: 2247387)

SYMPTOM: The internal local mount noise.fullfsck.N4 test hit the assert vx_ino_update:2, with a stack trace like the following:
panic: f:vx_ino_update:2
Stack Trace:
IP Function Name
0xe0000000023d5780 ted_call_demon+0xc0
0xe0000000023d6030 ted_assert+0x130
0xe000000000d66f80 vx_ino_update+0x230
0xe000000000d727e0 vx_iupdat_local+0x13b0
0xe000000000d638b0 vx_iupdat+0x230
0xe000000000f20880 vx_tflush_inode+0x210
0xe000000000f1fc80 __vx_fsq_flush___vx_tran.c__4096000_0686__+0xed0
0xe000000000f15160 vx_tranflush+0xe0
0xe000000000d2e600 vx_tranflush_threaded+0xc0
0xe000000000d16000 vx_workitem_process+0x240
0xe000000000d15ca0 vx_worklist_thread+0x7f0
0xe000000001471270 kthread_daemon_startup+0x90
End of Stack Trace

DESCRIPTION: The INOILPUSH flag was not set when the inode was being updated, which caused the assert. Creation and deletion of a clone resets the INOILPUSH flag, and the function vx_write1_fast() did not set the flag after updating the inode and the file.

RESOLUTION: The code is modified so that if the INOILPUSH flag is not set during vx_write1_fast(), the flag is set in the function.

* 2508164 (Tracking ID: 2481984)

SYMPTOM: Access to the file system hangs.

DESCRIPTION: The function vx_setqrec() calls vx_dqget().
When vx_dqget() returns an error, vx_setqrec() tries to unlock the DQ structure using VX_DQ_CLUSTER_UNLOCK, but in this situation the DQ structure does not hold the lock, hence the hang.

RESOLUTION: 'dq_inval' is now set in vx_dqget() if any error occurs there, and unlocking the DQ structure is skipped in the error code path of vx_setqrec() if 'dq_inval' is set.

* 2523084 (Tracking ID: 2515101)

SYMPTOM: The "svmon -O vxfs=on" option can be used to collect VxFS file system details. With this enabled, subsequently executing the "svmon -S" command can generate a system panic in the svm_getvxinode_gnode routine while trying to collect information from the VxFS segment control blocks:
16)> f pvthread+838900 STACK:
[F100000090704A38]perfvmmstat:svm_getvxinode_gnode+000038

DESCRIPTION: VxFS creates and deletes AIX Virtual Memory Management (VMM) structures called Segment Control Blocks (SCBs) via VMM interfaces. VxFS was leaking SCBs via one specific code path. The "svmon -S" command parses a global list of SCB structures, including any SCB structures leaked by VxFS. If svmon is also collecting information about VxFS file systems, the gnode element of each SCB is dereferenced; for a leaked SCB the gnode is stale and contains unrelated content, and reading and dereferencing this content can generate the panic.

RESOLUTION: A very simple and low-risk change now prevents Segment Control Blocks from being leaked by VxFS; the SCBs are now correctly removed.

* 2529356 (Tracking ID: 2340953)

SYMPTOM: During an internal stress test, the f:vx_iget:1a assert is seen.

DESCRIPTION: While renaming a file, VxFS checks whether the target directory is in the path of the source file to be renamed. While using the function vx_iget() to walk up to the root inode, the inode number of one of the parent directories was 0, hence the assert.
RESOLUTION: The code is changed so that during renames, the parent directory is first assigned the correct inode number before vx_iget() is used to retrieve the root inode.


INSTALLING THE PATCH
--------------------
If the currently installed VRTSvxfs is below 5.1.112.0, you must upgrade VRTSvxfs to the 5.1.112.0 level before installing this patch. AIX maintenance levels and APARs can be downloaded from the IBM Web site:
http://techsupport.services.ibm.com
Install the VRTSvxfs.bff patch if VRTSvxfs is already installed at fileset level 5.1.112.0.
A system reboot is required after installing this patch.
To apply the patch, first unmount all VxFS file systems, then enter these commands:
# mount | grep vxfs
# cd <patch_location>
# installp -aXd VRTSvxfs.bff VRTSvxfs
# reboot


REMOVING THE PATCH
------------------
If you need to remove the patch, first unmount all VxFS file systems, then enter these commands:
# mount | grep vxfs
# installp -r VRTSvxfs
# reboot


SPECIAL INSTRUCTIONS
--------------------
NONE


OTHERS
------
NONE