README VERSION : 1
README Creation Date : 2011-09-21
Patch-ID : 5.1.132.000
Patch Name : VRTSvxfs 5.1SP1RP2
BASE PACKAGE NAME : VRTSvxfs
BASE PACKAGE VERSION : 5.1.000.000
Obsolete Patches : NONE
Superseded Patches : 5.1.101.000/5.1.120.000/5.1.101.100
Required Patches : 5.1.100.000
Incompatible Patches : NONE
Supported PADV : rhel5_x86_64 , rhel6_x86_64 , sles10_x86_64 , sles11_x86_64 (P-Platform , A-Architecture , D-Distribution , V-Version)
Patch Category : CORE , CORRUPTION , HANG , MEMORYLEAK , PANIC , PERFORMANCE
Reboot Required : NO

FIXED INCIDENTS:
----------------
Patch Id::5.1.132.000

* Incident no::2169326 Tracking ID ::2169324
Symptom::On a local mount (LM), when a clone is mounted for a file system and a quota assigned to the clone is exceeded, removing the clone while files from the clone are still being accessed may hit an assert in vx_idelxwri_off() through vx_trunc_tran().
Description::During clone removal, all inodes of the clone(s) being removed are traversed, and the assert is hit because the on-disk and in-core sizes differ for a file that is being modified by the application.
Resolution::While truncating files, if the VX_IEPTTRUNC op is set, set the in-core file size to the on-disk file size.

* Incident no::2243061 Tracking ID ::1296491
Symptom::Performing a nested mount on a CFS file system triggers a data page fault if a forced unmount is also taking place on the CFS file system. The panic stack trace involves the following kernel routines: vx_glm_range_unlock vx_mount domount mount syscall
Description::When the underlying cluster-mounted file system is in the process of unmounting, the nested mount dereferences a NULL vfs structure pointer, thereby causing a system panic.
Resolution::The code has been modified to prevent the underlying cluster file system from being forcibly unmounted while a nested mount above it is in progress. The ENXIO error is returned to the forced unmount attempt.
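The mount/forced-unmount interlock described above can be sketched as a toy model. This is illustrative only; `VfsState`, `begin_nested_mount`, and the other names are not actual VxFS symbols. The point is that the forced unmount is refused with ENXIO while a nested mount is in flight, so the nested mount can never dereference a torn-down vfs structure:

```python
import errno
import threading

class VfsState:
    """Toy model of the nested-mount vs. forced-unmount interlock.
    All names are illustrative, not real VxFS symbols."""
    def __init__(self):
        self.lock = threading.Lock()
        self.nested_mounts_in_progress = 0
        self.unmounted = False

    def begin_nested_mount(self):
        with self.lock:
            if self.unmounted:
                return errno.ENXIO        # underlying fs already gone
            self.nested_mounts_in_progress += 1
            return 0

    def end_nested_mount(self):
        with self.lock:
            self.nested_mounts_in_progress -= 1

    def forced_unmount(self):
        with self.lock:
            # The fix: refuse the forced unmount while a nested mount is
            # in progress, instead of tearing down the vfs underneath it.
            if self.nested_mounts_in_progress > 0:
                return errno.ENXIO
            self.unmounted = True
            return 0
```

Either operation fails cleanly with ENXIO instead of racing: the unmount while a nested mount is in progress, and a nested mount after the unmount has completed.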
* Incident no::2243063 Tracking ID ::1949445
Symptom::Hang when file creates were being performed on a large directory. The stack of the hung thread is similar to: vxglm:vxg_grant_sleep+226 vxglm:vxg_cmn_lock+563 vxglm:vxg_api_lock+412 vxfs:vx_glm_lock+29 vxfs:vx_get_ownership+70 vxfs:vx_exh_coverblk+89 vxfs:vx_exh_split+142 vxfs:vx_dexh_setup+1874 vxfs:vx_dexh_create+385 vxfs:vx_dexh_init+832 vxfs:vx_do_create+713
Description::For large directories, the Large Directory Hash (LDH) is enabled to improve lookups on such directories. The hang was caused by taking ownership of the LDH inode twice in the same thread context, i.e. while building the hash for the directory.
Resolution::Avoid taking ownership again if the thread already holds ownership of the LDH inode.

* Incident no::2243064 Tracking ID ::2111921
Symptom::On Linux platforms, readv()/writev() performance with DIO/CIO can be more than 2x slower on VxFS than on raw volumes.
Description::In the current DIO implementation, if multiple iovecs are passed during an I/O (e.g. readv()/writev()), a separate DIO is performed for each iovec in a loop. The given iovecs cannot be coalesced directly, because there is no guarantee that the user addresses are contiguous.
Resolution::Introduced Parallel-DIO, with which iovecs falling in the same extent can be submitted together. Doing this, and using the io_submit interface directly, makes use of Linux's scatter-gather enhancements and brings readv()/writev() performance on VxFS in line with raw volumes.
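The Parallel-DIO idea above, batching iovecs that fall within the same extent instead of issuing one DIO per iovec, can be sketched as follows. This is a simplified model: iovecs are represented as (file offset, length) pairs, and `extent_of` is a hypothetical stand-in for the real bmap lookup.

```python
def group_iovecs_by_extent(iovecs, extent_of):
    """Group consecutive iovecs that map to the same extent so each group
    can be submitted as one batch (e.g. via io_submit) rather than one
    DIO per iovec. 'extent_of' maps a file offset to an extent id."""
    groups = []
    current, current_ext = [], None
    for off, length in iovecs:
        ext = extent_of(off)
        if current and ext != current_ext:
            groups.append(current)   # extent changed: flush the batch
            current = []
        current.append((off, length))
        current_ext = ext
    if current:
        groups.append(current)
    return groups
```

With 1k extents, three iovecs in extent 0 and one in extent 1 become two submissions instead of four.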
* Incident no::2247299 Tracking ID ::2161379
Symptom::In a CFS environment, various file system operations hang with the following stack traces: T1: vx_event_wait+0x40 vx_async_waitmsg+0xc vx_msg_send+0x19c vx_iread_msg+0x27c vx_rwlock_getdata+0x2e4 vx_glm_cbfunc+0x14c vx_glmlist_thread+0x204 T2: vx_ilock+0xc vx_assume_iowner+0x100 vx_hlock_getdata+0x3c vx_glm_cbfunc+0x104 vx_glmlist_thread+0x204
Description::Due to improper handling of the ENOTOWNER error in the iread receive function, the operation was retried continuously while holding an inode lock, blocking all other threads and causing a deadlock.
Resolution::The code is modified to release the inode lock on an ENOTOWNER error and acquire it again, thus resolving the deadlock. There are in total four vx_msg_get_owner() callers with ilocked=1: vx_rwlock_getdata(): needs the fix; vx_glock_getdata(): needs the fix; vx_cfs_doextop_iau(): does not use the owner for the message loop, no fix needed; vx_iupdat_msg(): already performs 'unlock/delay/lock' on the ENOTOWNER condition.

* Incident no::2249658 Tracking ID ::2220300
Symptom::'vx_sched' hogs CPU resources.
Description::The vx_sched process calls vx_iflush_list() to perform background flushing. vx_iflush_list() calls vx_logwrite_flush() if the file has had logged writes performed upon it. vx_logwrite_flush() uses an old technique that is ineffective when flushing in chunks: it flushes the file asynchronously, then flushes the file again synchronously. The entire file is therefore flushed twice, which is double the work when chunk flushing.
Resolution::vx_logwrite_flush() has been changed to flush the file once rather than twice; the asynchronous flush has been removed.

* Incident no::2255786 Tracking ID ::2253617
Symptom::Full fsck fails to run cleanly using "fsck -n".
Description::In the case of duplicate file name entries in one directory, fsck compares each directory entry with the previous entries.
If the filename already exists, further action is taken according to the user input [Yes/No]. Because strncmp compares only the first n characters, a name whose first n characters match an existing entry is reported as a match and wrongly considered a duplicate file name entry, so fsck fails to run cleanly with "fsck -n".
Resolution::Checking the filename size and changing the length passed to strncmp to name_len + 1 (so that the terminating NUL is compared as well) solves the issue.

* Incident no::2257904 Tracking ID ::2251223
Symptom::The 'df -h' command can take 10 seconds to run to completion and yet still report an inaccurate free block count, shortly after removing a large number of files.
Description::When files are removed, some file data blocks are released and counted in the total free block count instantly. However, blocks may not always be freed immediately, as VxFS can sometimes delay their release. The displayed free block count at any one time is therefore the sum of the free blocks and the 'delayed' free blocks. Once a file's 'remove transaction' is done, its delayed free blocks are eliminated and the free block count is increased accordingly. However, some functions which process transactions, for example a metadata update, can also alter the free block count while ignoring the current delayed free blocks. As a result, if file 'remove transactions' have not finished updating their free block and delayed free block information, the free space count can occasionally show more space than is really available. To obtain an up-to-date and valid free block count for a file system, a delay-and-retry loop waited 1 second before each retry, looping 10 times before giving up. Thus the 'df -h' command can sometimes take 10 seconds, and even after waiting 10 seconds there is no guarantee that the displayed output will be accurate or valid.
Resolution::The delayed free block count is recalculated accurately when transactions are created and when metadata is flushed to disk.
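The free-block accounting just described can be modeled minimally (all names here are illustrative, not VxFS internals): the count reported to df must be the sum of the truly free blocks and the delayed free blocks of unfinished remove transactions, and both counters must be kept consistent as transactions complete.

```python
class FreeSpaceCounter:
    """Toy model of free-block accounting with delayed frees."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.delayed_free = 0

    def remove_file(self, nblocks, released_now):
        # Some blocks are freed instantly; the rest are delayed until
        # the file's remove transaction completes.
        self.free_blocks += released_now
        self.delayed_free += nblocks - released_now

    def finish_remove_transaction(self, nblocks):
        # On transaction completion, delayed blocks become truly free.
        self.delayed_free -= nblocks
        self.free_blocks += nblocks

    def visible_free(self):
        # df-style report: must include the delayed free blocks, or the
        # count is wrong until every remove transaction has completed.
        return self.free_blocks + self.delayed_free
```

The invariant is that `visible_free()` is unchanged by `finish_remove_transaction()`; the bug class described above arises when code paths update one counter while ignoring the other.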
* Incident no::2275543 Tracking ID ::1475345
Symptom::The write() system call hangs for over 10 seconds.
Description::While performing a transaction for a logged write, the dirty buffers belonging to the transaction space were flushed asynchronously one at a time. This asynchronous flushing caused intermittent delays in the write operation because of the reduced transaction space.
Resolution::Flush all the dirty buffers on the file in one attempt through a synchronous flush, which frees up a large amount of transaction space. This reduces the delay during the write() system call.

* Incident no::2280386 Tracking ID ::2061177
Symptom::The 'fsadm -de' command errors with 'bad file number' on file system(s) on 5.0MP3RP1.
Description::First, the kernel file system itself has no problem; there is no corrupt layout on the affected system. The metasave obtained from the customer is the proof (the problem could not be reproduced and there is no corrupted inode in that metasave). Second, fsadm is an application with two parts: the application part, which reads the layout from the raw disk to form its strategy, and the kernel part, which implements it. On a file system using buffered writes, the application's view of the layout can become out of sync with the kernel's. On the customer's system, 'fsadm -de' was run during huge write activity; they also have many checkpoints, and more checkpoints mean more copy-on-write, which multiplies the write operations. That is why more checkpoints mean more of a problem.
Resolution::Add a sync operation in fsadm before it reads the layout from the raw disk, to avoid the kernel and the application being out of sync.

* Incident no::2280552 Tracking ID ::2246579
Symptom::File system corruption and a system panic when attempting to extend a 100%-full disk layout version 5 (DLV5) VxFS file system using fsadm(1M).
Description::The behavior is caused by file system metadata that has been relocated to the intent log area inadvertently being destroyed when the intent log is cleared during the resize operation.
Resolution::Refresh the in-core intent log extent map by reading the bmap of the intent log inode before clearing it.

* Incident no::2296277 Tracking ID ::2296107
Symptom::The fsppadm command (fsppadm query -a mountpoint) displays "Operation not applicable" while querying the mount point.
Description::During the fsppadm query process, fsppadm tries to open every file's named data stream in the file system, but the VxFS internal FCL file "changelog" does not support this operation, so "ENOSYS" is returned. fsppadm translates "ENOSYS" into "Operation not applicable" and prints the bogus error message.
Resolution::Fix fsppadm's get_file_tags() to ignore the "ENOSYS" error.

* Incident no::2311490 Tracking ID ::2074806
Symptom::A DMAPI program using dm_punch_hole may result in corrupted data.
Description::When a dm_punch_hole call is made on a file with allocated extents immediately after a previous write, data can be written through stale pages, causing it to be written to the wrong location.
Resolution::dm_punch_hole now invalidates all the pages within the hole it is creating.

* Incident no::2320044 Tracking ID ::2419989
Symptom::The ncheck(1M) command with the '-i' option does not limit the output to the specified inodes.
Description::Currently, the ncheck(1M) command with the '-i' option shows free space information and inodes that are not in the list provided by '-i'.
Resolution::The ncheck(1M) command is modified to print only those inodes that are specified by the '-i' option.

* Incident no::2320049 Tracking ID ::2419991
Symptom::There is no way to specify an inode that is unique to the file system, since inode numbers are reused in multiple filesets.
We therefore need to be able to specify a list of filesets, similar to the '-i' option for inodes, or to add a new '-o' option with which fileset+inode pairs can be specified.
Description::When the ncheck command is called with the '-i' option in conjunction with the -oblock/device/sector option, it displays inodes having the same inode number from all filesets. There is no command line option that allows specifying a unique inode and fileset combination.
Resolution::Code is modified to add a '-f' option to the ncheck command, with which one can specify the fileset number on which to filter the results. Further, if this option is used together with '-i', the inode-fileset pair(s) to display can be specified uniquely.

* Incident no::2329887 Tracking ID ::2253938
Symptom::In a Cluster File System (CFS) environment, the file read performance gradually degrades up to 10% of the original read performance, and 'fsadm(1M) -F vxfs -D -E' shows a large number (> 70%) of free blocks in extents smaller than 64k. For example: % Free blocks in extents smaller than 64 blks: 73.04 % Free blocks in extents smaller than 8 blks: 5.33
Description::In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This fragments the free space, leading to badly fragmented files.
Resolution::The code is modified such that the time for which the delegation of the AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to defragment free space using the -C option.
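The delegation-timeout tunable described in the resolution above can be sketched roughly as follows. The names and data shapes are illustrative assumptions, not the actual VxFS allocator: each AU is modeled as a (largest free extent, delegation timestamp) pair, and once the local delegation is older than the tunable timeout the allocator is free to search all AUs for a better-fitting extent.

```python
def pick_allocation_unit(local_aus, all_aus, request_size, now,
                         delegation_timeout):
    """Sketch of the allocation-policy change (illustrative names).
    While a locally delegated AU is fresh and can satisfy the request,
    allocate from it; once its delegation has aged past the tunable
    timeout, fall back to the best fit across all AUs."""
    for largest_ext, delegated_at in local_aus:
        if now - delegated_at < delegation_timeout and largest_ext >= request_size:
            return ('local', largest_ext)
    # Delegation expired (or no local fit): search all AUs for the
    # largest free extent that satisfies the request.
    best = max((ext for ext, _ in all_aus if ext >= request_size),
               default=None)
    return ('global', best)
```

With a short timeout, writes stop being funneled into the locally cached AU's shrinking extents, which is the fragmentation pattern the incident describes.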
* Incident no::2329893 Tracking ID ::2316094
Symptom::vxfsstat incorrectly reports "vxi_bcache_maxkbyte" greater than "vx_bc_bufhwm" after reinitialization of the buffer cache globals; reinitialization can happen during dynamic reconfiguration operations. vxfsstat's "vxi_bcache_maxkbyte" counter shows the maximum memory available for buffer cache buffer allocation. The maximum memory available for buffer allocation depends on the total memory available for the buffer cache (buffers + buffer headers), i.e. the "vx_bc_bufhwm" global. Therefore vxi_bcache_maxkbyte should never be greater than vx_bc_bufhwm.
Description::"vxi_bcache_maxkbyte" is a per-CPU counter, i.e. part of the global per-CPU 'vx_info' counter structure. vxfsstat sums all the per-CPU counters and reports the result. During reinitialization of the buffer cache, this counter was not set to zero properly before the new value was assigned to it. Therefore the total sum of this per-CPU counter could be more than 'vx_bc_bufhwm'.
Resolution::During buffer cache reinitialization, "vxi_bcache_maxkbyte" is now correctly set to zero so that the final sum of this per-CPU counter is correct.

* Incident no::2338010 Tracking ID ::2337737
Symptom::For Linux version 2.6.27 and onwards, a write() may never complete due to the write kernel thread looping forever, consuming most of the CPU and leading to a system-hang-like situation. The kernel stack of such a looping write thread looks like: <> bad_to_user vx_uiomove vx_write_default vx_write1 vx_rwsleep_unlock vx_do_putpage vx_write_common_slow handle_mm_fault d_instantiate do_page_fault vx_write_common vx_prefault_uio_readable vx_write vfs_write sys_write system_call_fastpath <>
Description::Some pages created during a write() may be partially initialized and are therefore destroyed. However, due to a bug, the variable representing the number of bytes copied is not updated correctly to reflect this destruction of the page.
Thus a subsequent page fault for the destroyed page occurs at an incorrect offset, leading to indefinite looping of the write kernel thread.
Resolution::Correctly update the number of bytes copied after destroying partially initialized pages.

* Incident no::2340741 Tracking ID ::2282201
Symptom::On a VxFS file system, a vxdump(1M) operation running in parallel with other file system operations such as create and delete can fail with signal SIGSEGV, generating a core file.
Description::vxdump caches the inodes to be dumped in a bitmap before starting the dump of a directory; however, this set can change if creates and deletes are happening in the background, leading to an inconsistent bitmap and eventually generating a core file.
Resolution::The code is updated to refresh the inode bitmap immediately before starting the dump operation, thus avoiding the core file generation.

* Incident no::2340799 Tracking ID ::2059611
Symptom::The system panics because of a NULL tranp in vx_unlockmap().
Description::vx_unlockmap() unlocks a map structure of the file system. While the map is being handled, its hold count is incremented. vx_unlockmap() attempts to check whether the mlink doubly linked list is empty, but the asynchronous vx_mapiodone() routine can change the link at unpredictable times once the hold count is zero.
Resolution::The evaluation order inside vx_unlockmap() is changed so that further evaluation is skipped when the map hold count is zero.

* Incident no::2340817 Tracking ID ::2192895
Symptom::The system panics when performing fcl commands, at: unix:panicsys unix:vpanic_common unix:panic genunix:vmem_xalloc genunix:vmem_alloc unix:segkmem_xalloc unix:segkmem_alloc_vn genunix:vmem_xalloc genunix:vmem_alloc genunix:kmem_alloc vxfs:vx_getacl vxfs:vx_getsecattr genunix:fop_getsecattr genunix:cacl genunix:acl unix:syscall_trap32
Description::The ACL count in the inode can be corrupted due to a race condition.
For example, setacl can change the ACL count while getacl is processing the same inode, which could cause an invalid use of the ACL count.
Resolution::Code is modified to protect the vulnerable ACL count and avoid the corruption.

* Incident no::2340825 Tracking ID ::2290800
Symptom::When using fsdb to look at the map of the ILIST file (the "mapall" command), fsdb can wrongly report a large hole at the end of the ILIST file.
Description::While reading the bmap of the ILIST file, if a hole is found at the end of indirect extents, fsdb may incorrectly mark the hole as the last extent in the bmap, causing the mapall command to show a large hole extending to the end of the file.
Resolution::Code has been modified to read the ILIST file's bmap correctly when holes are found at the end of indirect extents, instead of marking such a hole as the last extent of the file.

* Incident no::2340831 Tracking ID ::2272072
Symptom::GAB panics the box because the VCS engine "had" did not respond; lbolt wraps around.
Description::lbolt wraps around after 498 days of machine uptime. VxFS flushes its metadata buffers based on their age, and the age calculation takes lbolt into account. Because lbolt wrapped, the buffers were not flushed, so a great deal of metadata I/O stopped and hence the panic.
Resolution::The function that handles flushing of dirty buffers now also handles the condition where lbolt has wrapped: if it has, the current lbolt time is assigned to the last-update time of the dirty list.
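The lbolt wrap-around fix can be illustrated with a small sketch. This assumes a tick counter that eventually wraps; the function name and tuple return are illustrative, not the VxFS code. On wrap, the buffer's last-update stamp is reset to the current tick (as in the resolution above) rather than computing a bogus age that never triggers a flush:

```python
def should_flush(lbolt_now, last_update, max_age):
    """Age-based flush check that tolerates tick-counter wrap-around.
    Returns (flush_now, new_last_update)."""
    if lbolt_now < last_update:
        # lbolt wrapped: the naive age (lbolt_now - last_update) would be
        # negative/huge.  The fix: restamp with the current lbolt so the
        # age calculation becomes sane again on the next pass.
        return False, lbolt_now
    return lbolt_now - last_update >= max_age, last_update
```

Without the wrap check, a buffer stamped just before the wrap would never appear old enough to flush, which is exactly the stalled-metadata-I/O condition the incident describes.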
* Incident no::2340834 Tracking ID ::2302426
Symptom::The system panics when multiple 'vxassist mirror' commands are running concurrently, with the following stack trace: 0) panic+0x410 1) unaligned_hndlr+0x190 2) bubbleup+0x880 ( ) +------------- TRAP #1 ---------------------------- | Unaligned Reference Fault in KERNEL mode | IIP=0xe000000000b03ce0:0 | IFA=0xe0000005aa53c114 <--- | p struct save_state 0x2c561031.0x9fffffff5ffc7400 +------------- TRAP #1 ---------------------------- LVL FUNC ( IN0, IN1, IN2, IN3, IN4, IN5, IN6, IN7 ) 3) vx_copy_getemap_structs+0x70 4) vx_send_getemapmsg+0x240 5) vx_cfs_getemap+0x240 6) vx_get_freeexts_ioctl+0x990 7) vxportal_ioctl+0x4d0 8) spec_ioctl+0x100 9) vno_ioctl+0x390 10) ioctl+0x3c0 11) syscall+0x5a0
Description::The panic is caused by dereferencing an unaligned address in a CFS message structure.
Resolution::bcopy is used to ensure proper alignment of the addresses.

* Incident no::2340839 Tracking ID ::2316793
Symptom::Shortly after removing files in a file system, commands which use 'statfs()', such as 'df', can take 10 seconds to complete.
Description::To obtain an up-to-date and valid free block count in a file system, a delay-and-retry loop waited 1 second before each retry, looping 10 times before giving up. This unnecessarily excessive retrying could cause a 10-second delay per file system when executing the df command.
Resolution::The original 10 retries with a 1-second delay each have been reduced to 1 retry after a 20-millisecond delay, when waiting for an updated free block count.

* Incident no::2341007 Tracking ID ::2300682
Symptom::When a file is newly created, issuing "fsppadm query -a /mount_point" can show incorrect IOTemp information.
Description::fsppadm query outputs incorrect data when a file reuses the inode number that belonged to a removed file but the database still contains the obsolete record for the removed one. The fsppadm utility uses a database to save inodes' historical data.
It compares the nearest and the farthest records for an inode to compute IOTemp in a time window, and it picks the inode generation from the farthest record to check the inode's existence. However, if the farthest record is missing, zero is mistakenly used as the generation.
Resolution::If the nearest record for a given inode exists in the database, the generation entry is extracted from it instead of from the farthest one.

* Incident no::2360817 Tracking ID ::2332460
Symptom::Executing the VxFS 'vxedquota -p user1 user2' command to copy the quota information of one user to other users takes a long time to run to completion.
Description::VxFS maintains quota information in two files: external quota files and internal quota files. Whilst copying the quota information of one user to another, the required quota information is read from both the external and internal files. However, the quota information should only need to be read from the external file when the read from the internal file has failed. Reading from both files therefore causes an unnecessary delay in the command execution time.
Resolution::The unnecessary duplication of reading both the external and internal quota files to retrieve the same information has been removed.

* Incident no::2360819 Tracking ID ::2337470
Symptom::A Cluster File System can unexpectedly and prematurely report a 'file system out of inodes' error when attempting to create a new file. The error message reported will be similar to the following: vxfs: msgcnt 1 mesg 011: V-2-11: vx_noinode - /dev/vx/dsk/dg/vol file system out of inodes
Description::When allocating new inodes in a cluster file system, VxFS searches for an available free inode in the Inode Allocation Units [IAUs] that are currently delegated to the local node. If none are available, it then searches the IAUs that are not currently delegated to any node, or revokes an IAU delegated to another node.
It is also possible for gaps, or HOLEs, to be created in the IAU structures as a side effect of the CFS delegation processing. When searching for an available free inode, however, VxFS simply ignored any HOLEs it found. Once the maximum size of the metadata structures has been reached (2^31), new IAUs cannot be created, so one of the HOLEs should then be populated and used for new inode allocation. Because the HOLEs were being ignored, VxFS could prematurely report the "file system out of inodes" error message even though there was plenty of free space in the file system to create new inodes.
Resolution::New inodes will now be allocated from the gaps, or HOLEs, in the IAU structures (created as a side effect of the CFS delegation processing). The HOLEs will be populated rather than returning a 'file system out of inodes' error.

* Incident no::2360821 Tracking ID ::1956458
Symptom::When attempting to check checkpoint information with 'fsckptadm -C blockinfo', the command failed with error 6 (ENXIO), the file system was disabled, and errors like the following appeared in the message file: vxfs: msgcnt 4 mesg 012: V-2-12: vx_iget - /dev/vx/dsk/sfsdg/three file system invalid inode number 4495 vxfs: msgcnt 5 mesg 096: V-2-96: vx_setfsflags - /dev/vx/dsk/sfsdg/three file system fullfsck flag set - vx_cfs_iread
Description::VxFS uses ilist files in the primary fileset and in checkpoints to accommodate inode information. A hole in an ilist file indicates that the inodes in the hole do not exist and are not yet allocated in the corresponding fileset or checkpoint. fsckptadm checks every inode in the primary fileset and the downstream checkpoints. If an inode falls into a hole in a prior checkpoint, i.e. the associated file had not been created at the time of that checkpoint's creation, fsckptadm exits with an error.
Resolution::Skip inodes in the downstream checkpoints if these inodes are located in a hole.
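The resolution above, skipping ilist holes rather than failing, can be sketched as a toy walk (not the real on-disk ilist format; the hole set is given explicitly for illustration):

```python
def check_checkpoint_inodes(inode_numbers, hole_inodes):
    """Walk a checkpoint's inode list as fsckptadm does, but skip inode
    numbers that fall into a hole (never allocated in this checkpoint)
    instead of aborting with an error, per the fix described above."""
    checked = []
    for ino in sorted(inode_numbers):
        if ino in hole_inodes:
            continue        # the fix: a hole is skipped, not an error
        checked.append(ino)
    return checked
```

Before the fix, hitting the first hole inode would be treated as an invalid inode number and abort the whole check with ENXIO.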
* Incident no::2368738 Tracking ID ::2368737
Symptom::If a file which has shared extents has corrupt indirect blocks, then in certain cases the reference count tracking system can try to interpret such a block and panic the system. Since this is an asynchronous background operation, the processing is retried on every file system mount and can therefore result in a panic every time the file system is mounted.
Description::The reference count tracking system for shared extents updates reference counts in a lazy fashion, so in certain cases it has to asynchronously access shared indirect blocks belonging to a file to account for reference count updates. If such an indirect block has already been badly corrupted, this tracking mechanism can panic the system repeatedly on every mount.
Resolution::The reference count tracking system now validates the indirect extent read from the disk; if it is not found valid, the VX_FULLFSCK flag is set in the superblock, marking the file system for full fsck, and the file system is disabled on the current node.

* Incident no::2373565 Tracking ID ::2283315
Symptom::The system may panic when "fsadm -e" is run on a file system containing file level snapshots. The panic stack looks like: crash_kexec() __die at() do_page_fault() error_exit() [exception RIP: vx_bmap_lookup+36] vx_bmap_lookup() vx_bmap() vx_reorg_emap() vx_extmap_reorg() vx_reorg() vx_aioctl_full() vx_aioctl_common() vx_aioctl() vx_ioctl() do_ioctl() vfs_ioctl() sys_ioctl() tracesys
Description::The panic happened because a NULL inode pointer was passed to the vx_bmap_lookup() function. While reorganizing the extents of a file, a block map (bmap) lookup operation is done on the file to get information about its extents. If this bmap lookup finds a hole at an offset in a file containing shared extents, a local variable is not updated, which makes the inode pointer NULL during the next bmap lookup operation.
Resolution::The local variable is initialized so that the inode pointer passed to vx_bmap_lookup() is non-NULL.

* Incident no::2386483 Tracking ID ::2374887
Symptom::Access to a file system can hang when creating a named attribute, due to a read/write lock being held exclusively and indefinitely, causing a thread to loop in vx_tran_nattr_dircreate(). A typical stack trace of a looping thread: vx_itryhold_locked vx_iget vx_attr_iget vx_attr_kgeti vx_attr_getnonimmed vx_acl_inherit vx_aclop_creat vx_attr_creatop vx_new_attr vx_attr_inheritbuf vx_attr_inherit vx_tran_nattr_dircreate vx_nattr_copen vx_nattr_open vx_setea vx_linux_setxattr vfs_setxattr link_path_walk sys_setxattr system_call
Description::The initial creation of a named attribute for a regular file or directory results in the automatic creation of a 'named attribute directory'. Creations are initially attempted in a single transaction. Should the single transaction fail due to a read/write lock being held, a retry should split the task into multiple transactions. An incorrect reset of a tracking structure meant that all retries were performed using a single transaction, creating an endless retry loop.
Resolution::The tracking structure is no longer reset within the retry loop.

* Incident no::2402643 Tracking ID ::2399178
Symptom::Full fsck performs large directory index validation during pass2c. If the number of large directories is high, this pass takes a long time; there is considerable scope to improve full fsck performance during this pass.
Description::Pass2c consists of the following basic operations: [1] Read the entries in the large directory. [2] Cross-check the hash values of those entries with the hash directory inode contents residing on the attribute ilist. This makes it another heavily I/O-intensive pass.
Resolution::1. Use directory block read-ahead during step [1].
2. Wherever possible, access the file contents extent-wise rather than in fs-block-size units (while reading entries in the directory) or in hash-block-size units (8k, during dexh_getblk). With these enhancements, the buffer cache is utilized more effectively.

* Incident no::2412029 Tracking ID ::2384831
Symptom::The system panics with the following stack trace. This happens in some cases when named streams are used in VxFS. machine_kexec() crash_kexec() __die do_page_fault() error_exit() [exception RIP: iput+75] vx_softcnt_flush() vx_ireuse_clean() vx_ilist_chunkclean() vx_inode_free_list() vx_ifree_scan_list() vx_workitem_process() vx_worklist_process() vx_worklist_thread() vx_kthread_init() kernel_thread()
Description::VxFS internally creates a directory to keep the named streams pertaining to a file. In some scenarios, an error code path misses releasing the hold on that directory. Because of this, unmounting the file system does not clean the inode belonging to that directory; later, when VxFS reuses such an inode, the panic is seen.
Resolution::Release the hold on the named streams directory in case of an error.

* Incident no::2412169 Tracking ID ::2371903
Symptom::There is an extra empty line in "/proc/devices" when file system checkpoints are activated, for example:
# tail -6 /proc/devices
199 VxVM
201 VxDMP
252 vxclonefs-0
                  <<< There is a blank line here.
253 device-mapper
254 mdp
Description::vxfs improperly appends a newline character ("\n") when composing the device name of a clone. As a result, when the function register_blkdev() is called with that device name to register the block device driver, an additional blank line appears in "/proc/devices".
Resolution::Remove the newline when creating the clone device.

* Incident no::2412177 Tracking ID ::2371710
Symptom::User quota file corruption occurs when the DELICACHE feature is enabled; the current inode usage of a user becomes negative after frequent file creations and deletions.
Checking the quota info using the command "vxquota -vu username", the number of files is "-1", like:
# vxquota -vu testuser2
Disk quotas for testuser2 (uid 500):
Filesystem      usage    quota    limit  timeleft    files  quota  limit  timeleft
/vol01        1127809  8239104  8239104                 -1      0      0
Description::This issue was introduced by the inode DELICACHE feature in 5.1SP1, a performance enhancement that optimizes the updates done to the inode map during file creations and deletions. The feature is enabled by default and can be changed via vxtunefs. When DELICACHE is enabled and quota is set for VxFS, there is an extra quota update for the inodes on the inactive list during the removal process. Since these inodes' quota has already been updated before they are put on the delicache list, the current number of user files is eventually decremented twice.
Resolution::Add a flag to identify the inodes moved to the inactive list from the delicache list, so that the flag can be used to prevent updating the quota again during the removal process.

* Incident no::2412179 Tracking ID ::2387609
Symptom::Quota usage gets set to ZERO when the file system is unmounted and remounted, even though files owned by users exist. This issue may occur after some file creations and deletions. Checking the quota usage with the "vxrepquota" command, the output looks like:
# vxrepquota -uv /vx/sofs1/
/dev/vx/dsk/sfsdg/sofs1 (/vx/sofs1):
                   Block limits                     File limits
User         used     soft     hard  timeleft    used  soft  hard  timeleft
testuser1 --    0  3670016  4194304                 0     0     0
testuser2 --    0  3670016  4194304                 0     0     0
testuser3 --    0  3670016  4194304                 0     0     0
Additionally, the quota usage may not be updated after inode/block usage reaches ZERO.
Description::The issue occurs when VxFS merges the external per-node quota files with the internal quota file. The block offset within the external quota file could be calculated wrongly in some scenarios.
When a hole is found in a per-node quota file, the file offset is modified to point to the next non-hole offset, but the block offset, which points to the next available quota record within a block, was not changed accordingly. Also, VxFS updates per-node quota records only when the global internal quota file shows some byte or inode usage; otherwise it does not copy the usage from the global quota file to the per-node quota file. In the case where the quota usage in the external quota files has gone down to zero and both the byte and inode usage in the global file become zero, the per-node quota records were not updated and were left with incorrect usage. The byte or inode usage in the per-node quota record should also be checked: records should be skipped only when the byte and inode usage in both the global quota file and the per-node quota file is zero.
Resolution::Corrected the calculation of the block offset when a hole is found in a per-node quota file. Added code to also check the block or inode usage in the per-node quota record while updating user quota usage.

* Incident no::2412181 Tracking ID ::2372093
Symptom::New fsadm command options, to defragment a given percentage of the available free space in a file system, have been introduced as part of an initiative to help improve Cluster File System [CFS] performance; the new additional command usage is as follows: fsadm -C -U. We have since found that this new free-space defragmentation operation can sometimes hang (whilst continuing to consume some CPU) in specific circumstances when executed on a cluster-mounted file system [CFS].
Description::The hang can occur when file system metadata is being relocated. In our example case, the hang occurs whilst relocating inodes whose corresponding files are being actively updated from a different node of the cluster than the one on which the fsadm command is being executed.
During the relocation, an error code path is taken because of an unexpected mismatch in temporary replica metadata; that code path then results in a deadlock, or hang.
Resolution::As there is no overriding need to relocate structural metadata for the purposes of defragmenting the available free space, we have chosen to simply leave all structural metadata where it is when performing this operation, thus avoiding its relocation. The changes required for this solution are therefore very low risk.
* Incident no::2418819
Tracking ID ::2283893
Symptom::In a Cluster File System (CFS) environment, the file read performance gradually degrades up to 10% of the original read performance, and "fsadm -F vxfs -D -E" shows a large number (> 70%) of free blocks in extents smaller than 64k. For example:
% Free blocks in extents smaller than 64 blks: 73.04
% Free blocks in extents smaller than  8 blks:  5.33
Description::In a CFS environment, the disk space is divided into Allocation Units (AUs). The delegation for these AUs is cached locally on the nodes. When an extending write operation is performed on a file, the file system tries to allocate the requested block from an AU whose delegation is locally cached, rather than finding the largest free extent available that matches the requested size in the other AUs. This leads to fragmentation of the free space, and thus to badly fragmented files.
Resolution::The code is modified such that the time for which the delegation of an AU is cached can be reduced using a tunable, thus allowing allocations from other AUs with larger free extents. Also, the fsadm(1M) command is enhanced to defragment free space using the -C option.
* Incident no::2420060
Tracking ID ::2403126
Symptom::A hang is seen in the cluster when one of the nodes leaves the cluster or is rebooted. One of the nodes in the cluster will contain the following stack trace:
e_sleep_thread()
vx_event_wait()
vx_async_waitmsg()
vx_msg_send()
vx_send_rbdele_resp()
vx_recv_rbdele+00029C ()
vx_recvdele+000100 ()
vx_msg_recvreq+000158 ()
vx_msg_process_thread+0001AC ()
vx_thread_base+00002C ()
threadentry+000014 (??, ??, ??, ??)
Description::Whenever a node in the cluster leaves, a reconfiguration happens and all the resources held by the leaving node are consolidated. This is done on one node of the cluster, called the primary node. Each node sends a message to the primary node about the resources it is currently holding. During this reconfiguration, in a corner case, VxFS incorrectly calculated a message length larger than what the GAB (Veritas Group Membership and Atomic Broadcast) layer can handle. As a result, the message was dropped at the sender and never sent, while the sender assumed the message had been sent and waited for an acknowledgement. The primary node, waiting for this message, waits forever, so the reconfiguration never completes, leading to the hang.
Resolution::The message length calculation is now done properly, so that GAB can handle the messages.
* Incident no::2425429
Tracking ID ::2422574
Symptom::On CFS, after turning quota on, when any node is rebooted and rejoins the cluster, it fails to mount the file system.
Description::At the time of mounting the file system after rebooting the node, mntlock was already set, which did not allow the remount of the file system if quota is on.
Resolution::The code is changed so that the mntlock flag is masked in the quota operation, as it is already set on the mount.
* Incident no::2425439
Tracking ID ::2242630
Symptom::Earlier distributions of Linux had a maximum size of memory that could be allocated via vmalloc(). This throttled the maximum size of VxFS's hash tables, and so limited the size of the inode and buffer caches. RHEL5/6 and SLES10/11 do not have this limit.
Description::Limitations in the Linux kernel used to restrict the size of the inode and buffer caches for VxFS.
Resolution::The code is changed to accommodate the change of limits in the Linux kernel, and the limits on the inode and buffer caches for VxFS have been modified accordingly.
* Incident no::2426039
Tracking ID ::2412604
Symptom::Once the time limit expires after exceeding the soft limit of the user quota size on a VxFS file system, writes are still permitted over that soft limit.
Description::When the soft limit was initially set with the current usage already above it, the timer was not started.
Resolution::Start the timer during the initial setting of the quota limits if the current usage has already crossed the soft quota limits.
* Incident no::2427269
Tracking ID ::2399228
Symptom::Occasionally, Oracle archive logs can be created smaller than they should be; in the reported case the resulting Oracle archive logs were incorrectly sized as 512 bytes.
Description::The fcntl [file control] command F_FREESP [free storage space] can be used to change the size of a regular file. If the file size is reduced, we call it a "truncate", and space allocated in the truncated area is returned to the file system free space pool. If the file size is increased using F_FREESP, we call it a "truncate-up"; although the file size changes, no space is allocated in the extended area of the file. Oracle archive logs use the F_FREESP fcntl command to perform a truncate-up of a new file before a smaller write of 512 bytes [at the start of the file] is then performed. A timing window was found with F_FREESP in which the truncate-up file size was lost, or rather overwritten, by the subsequent write of the data, thus causing the file to appear with a size of just 512 bytes.
Resolution::A timing window has been closed whereby the flush of the allocating [512-byte] write was triggered after the new F_FREESP file size had been updated in the inode.
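The truncate-up behaviour described above can be illustrated with a minimal sketch. F_FREESP is a VxFS/Solaris-style fcntl command; here Python's os.truncate() is used as a portable analogue (an assumption for illustration only): extending a file grows its size without allocating blocks, and a subsequent small write at offset 0 must not shrink the size back, which is exactly the invariant the fix restores.

```python
import os
import tempfile

def truncate_up_then_write(path, new_size, data):
    """Grow a file with a truncate-up, then write a small record at offset 0.

    os.truncate() extending a file stands in for the F_FREESP "truncate-up"
    described above; the extended area is sparse, no blocks are allocated.
    """
    os.truncate(path, new_size)      # truncate-up: size grows, no allocation
    with open(path, "r+b") as f:
        f.write(data)                # small allocating write at the start
        f.flush()
        os.fsync(f.fileno())
    # The file size must remain new_size, not len(data).
    return os.path.getsize(path)

fd, path = tempfile.mkstemp()
os.close(fd)
size = truncate_up_then_write(path, 1024 * 1024, b"\0" * 512)
os.remove(path)
```

With the fix in place, the 512-byte write no longer overwrites the truncate-up size, so the file keeps its full extended length.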
* Incident no::2427281
Tracking ID ::2413172
Symptom::The vxfs_fcl_seektime() API seeks to the first record in the File Change Log (FCL) file after a specified time. This API could incorrectly return an EINVAL (FCL record not found) error while reading the first block of the FCL file.
Description::To seek to the first record after the given time, a binary search is first performed to find the largest block offset where the FCL record time is less than the given time. Then a linear search from this offset is performed to find the first record whose time value is greater than the specified time. FCL records are read in buffers. There are scenarios in which the FCL records read into one buffer occupy less than the buffer size, e.g. when reading the first block of the FCL file. In such scenarios, the buffer read could continue even after all the data in the current buffer had been read, because the check that decides whether all records in a buffer have been read was wrong. Reading the buffer beyond its boundary caused the search to terminate without finding the record for the given time, and hence the EINVAL error was returned. VxFS should instead detect that the buffer is partially filled and continue the search by reading the next buffer.
Resolution::The check that decides whether all records in a buffer have been read is corrected so that the buffer is read within its boundaries.
* Incident no::2430679
Tracking ID ::1892045
Symptom::A multitude of slab-1024 memory is consumed, which can be checked in /proc/slabinfo. For example:
# cat /proc/slabinfo | grep "size-1024 "
size-1024 51676 51700 1024 4 1 : tunables 54 27 8 : slabdata 12925 12925 0
where 51700 slabs are allocated with a 1024-byte unit size.
Description::VxFS creates some fake inodes for various background and bookkeeping tasks for which the kernel wants VxFS to have an inode that does not strictly have to be a real file. The number of fake inodes is computed from NR_CPUS. If the kernel has a large NR_CPUS, the slab-1024 usage grows accordingly.
Resolution::Allocate the fake inodes based on the kernel's num_possible_cpus() routine, which helps reduce the slab-1024 count to the thousands.
* Incident no::2478237
Tracking ID ::2384861
Symptom::The following asserts are seen during internal stress and regression runs:
f:vx_do_filesnap:1b
f:vx_inactive:2a
f:xted_check_rwdata:31
f:vx_do_unshare:1
Description::These asserts validate assumptions in various functions; in addition, some miscellaneous issues were seen during internal testing.
Resolution::The code has been modified to fix the internally reported issues, along with other miscellaneous changes.
* Incident no::2478325
Tracking ID ::2251015
Symptom::The fsck(1M) command takes a long time to complete.
Description::In an extreme case, such as a 2TB file system with a 1KB block size, 130+ checkpoints, and 100-250 million inodes per file set, the fsck(1M) command takes 15+ hours to complete intent log replay, because it has to read a few GB of IAU headers and summaries one synchronous block at a time.
Resolution::The fsck code is changed to perform read-ahead on the IAU file, which reduces the fsck log replay time.
* Incident no::2480949
Tracking ID ::2480935
Symptom::The system log file may contain the following error message in a multi-threaded environment with Dynamic Storage Tiering (DST):
UX:vxfs fsppadm: ERROR: V-3-26626: File Change Log IOTEMP and ACCESSTEMP index creation failure for /vx/fsvm with message Argument list too long
Description::In DST, while enforcing a policy, SQL queries are generated and written to the file .__fsppadm_enforcesql present in lost+found. In a multi-threaded environment, 16 threads work in parallel on the ILIST, generating SQL queries and writing them to the file. This may lead to corruption of the file if multiple threads write to it simultaneously.
Resolution::A mutex is used to serialize the threads' writes to the SQL file.
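The fix described in the last resolution, serializing parallel writers to a shared file with a mutex, can be sketched as follows. The record format and helper name write_queries() are illustrative assumptions, not the actual fsppadm implementation; only the pattern (16 parallel workers, one lock around each append) mirrors the description.

```python
import os
import tempfile
import threading

lock = threading.Lock()  # plays the role of the mutex added by the fix

def write_queries(path, worker_id, queries):
    # Append SQL query records to the shared file; holding the lock across
    # each append ensures records from parallel workers cannot interleave.
    for q in queries:
        record = "worker%d: %s\n" % (worker_id, q)
        with lock:
            with open(path, "a") as f:
                f.write(record)

fd, path = tempfile.mkstemp()
os.close(fd)
threads = [threading.Thread(target=write_queries,
                            args=(path, i, ["SELECT %d" % n for n in range(50)]))
           for i in range(16)]  # 16 parallel workers, as in the description
for t in threads:
    t.start()
for t in threads:
    t.join()
with open(path) as f:
    lines = f.read().splitlines()
os.remove(path)
```

Every record arrives intact: without the lock, concurrent appends could corrupt the file, which is the failure mode the incident describes.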
* Incident no::2482337
Tracking ID ::2431674
Symptom::Panic in vx_common_msgprint() via vx_inactive().
Description::The VX_CMN_ERR() call uses an "llx" format character which vx_common_msgprint() does not understand. It gives up trying to process that format, but continues on without consuming the corresponding parameter. Everything else in the parameter list is effectively shifted by 8 bytes, and by the time the string argument is processed, the panic occurs.
Resolution::Changed the format to "llu", which vx_common_msgprint() understands.
* Incident no::2482344
Tracking ID ::2424240
Symptom::In the case of deduplication, when the file system block size is larger and there is a partial block match, the block is shared anyway, resulting in corruption of files.
Description::This is because, to handle slivers, the matched length was rounded up to the block boundary.
Resolution::The code is changed to round down the length that we intend to dedup, so that it aligns with the file system block size.
* Incident no::2484815
Tracking ID ::2440584
Symptom::System panic when force unmounting the file system. The backtrace looks like:
machine_kexec
crash_kexec
__die
do_page_fault
error_exit
vx_sync
vx_sync_fs
sync_filesystems
do_sync
sys_sync
system_call
or
machine_kexec
crash_kexec
__die
do_page_fault
error_exit
sync_filesystems
do_sync
sys_sync
system_call
Description::During a force unmount, VxFS frees some memory structures which can still be referenced by the sync code path. Thus there is a window that allows a race between the sync command and the force unmount, as shown in the first backtrace. For the second panic, prior to the completion of the force unmount, sync_filesystems could invoke the sync_fs callback (a member of super_operations) of VxFS, which may already have been set to NULL by the force unmount.
Resolution::The vx_sync function will not do a real sync if it detects that a force unmount is in progress.
Additionally, dummy functions are installed instead of NULL pointers in the VxFS super_operations during a force unmount.
* Incident no::2486597
Tracking ID ::2486589
Symptom::Multiple threads may wait on a mutex owned by a thread that is in the function vx_ireuse_steal(), with the following stack trace, on a machine with severe inode pressure:
vx_ireuse_steal()
vx_ireuse()
vx_iget()
Description::Several threads are waiting to get inodes from VxFS. The current number of inodes has reached the maximum number of inodes (vxfs_ninode) that can be created in memory, so no new allocations are possible and the threads wait.
Resolution::The code is modified so that in such a situation, threads return ENOINODE instead of retrying to get inodes.
* Incident no::2494464
Tracking ID ::2247387
Symptom::The internal local mount noise.fullfsck.N4 test hit the assert vx_ino_update:2, with a stack trace as below:
panic: f:vx_ino_update:2
Stack Trace:
IP                 Function Name
0xe0000000023d5780 ted_call_demon+0xc0
0xe0000000023d6030 ted_assert+0x130
0xe000000000d66f80 vx_ino_update+0x230
0xe000000000d727e0 vx_iupdat_local+0x13b0
0xe000000000d638b0 vx_iupdat+0x230
0xe000000000f20880 vx_tflush_inode+0x210
0xe000000000f1fc80 __vx_fsq_flush___vx_tran.c__4096000_0686__+0xed0
0xe000000000f15160 vx_tranflush+0xe0
0xe000000000d2e600 vx_tranflush_threaded+0xc0
0xe000000000d16000 vx_workitem_process+0x240
0xe000000000d15ca0 vx_worklist_thread+0x7f0
0xe000000001471270 kthread_daemon_startup+0x90
End of Stack Trace
Description::The INOILPUSH flag was not set when the inode was being updated, which caused the above assert. The problem was that the creation and deletion of a clone resets the INOILPUSH flag, and the function vx_write1_fast() does not set the flag after updating the inode and file.
Resolution::The code is modified so that if the INOILPUSH flag is not set in vx_write1_fast(), the flag is set in the function.
* Incident no::2508164
Tracking ID ::2481984
Symptom::Access to the file system hangs.
Description::The function 'vx_setqrec' calls 'vx_dqget'.
When 'vx_dqget' returns an error, it tries to unlock the DQ structure using 'VX_DQ_CLUSTER_UNLOCK'. But in this situation the DQ structure does not hold the lock, and hence the hang happens.
Resolution::'dq_inval' is set in 'vx_dqget' in case any error happens in 'vx_dqget'. Unlocking the DQ structure is skipped in the error code path of 'vx_setqrec' if 'dq_inval' is set.
* Incident no::2521514
Tracking ID ::2177591
Symptom::System panic in vx_softcnt_flush() with a stack as below:
#4 [ffff8104981a7d08] generic_drop_inode()
#5 [ffff8104981a7d28] vx_softcnt_flush()
#6 [ffff8104981a7d58] vx_ireuse_clean()
#7 [ffff8104981a7d88] vx_ilist_chunkclean()
#8 [ffff8104981a7df8] vx_inode_free_list()
#9 [ffff8104981a7e38] vx_ifree_scan_list()
#10 [ffff8104981a7e48] vx_workitem_process()
#11 [ffff8104981a7e58] vx_worklist_process()
#12 [ffff8104981a7ed8] vx_worklist_thread()
#13 [ffff8104981a7ee8] vx_kthread_init()
#14 [ffff8104981a7f48] kernel_thread()
Description::The panic occurred in iput() (vx_softcnt_flush()), which was called to drop the softcount hold (i_count) held by VxFS. The panicking thread was cleaning an inode on the freelist, and the inode being cleaned belonged to a file system that had already been unmounted. The file system superblock structure is freed after the VxFS unmount returns, irrespective of whether the unmount succeeded or failed, as Linux always expects umount to succeed. The unmount of this file system was not clean, i.e. the detach of the file system was failing with an EBUSY error. A detach can fail with EBUSY if there are any busy inodes, i.e. inodes with pending operations. During an unmount operation, all the inodes belonging to that file system are cleaned; but as the unmount was not successful, the inode processing for the file system did not complete. When a background thread picked up inodes of such an unmounted file system, the panic happened while accessing the freed superblock structure.
Resolution::The check for busy inodes during unmount happens before all inodes in the inode cache are processed, which can leave inodes on the freelist without their cleanup during unmount.
Now, after some failed attempts to detach the fset, detach fset is called with the force option. This is more aggressive in tearing down the fileset and therefore helps to clear the fset's inodes.
* Incident no::2529356
Tracking ID ::2340953
Symptom::During an internal stress test, the f:vx_iget:1a assert was seen.
Description::While renaming a certain file, we check whether the target directory is in the path of the source file to be renamed. While using the function vx_iget() to walk up to the root inode, the inode number of one of the parent directories was 0, hence the assert.
Resolution::The code is changed so that during renames, the parent directory is first assigned the correct inode number before using the vx_iget() function to retrieve the root inode.
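The rename-time ancestry check described in the last incident can be modelled with a short sketch. The map-based directory model, the helper name is_in_path(), and the inode numbers are all hypothetical, standing in for the vx_iget() walk from the target directory up to the root inode; the point illustrated is that an inode number of 0 is invalid and must never appear during the walk.

```python
# Illustrative model of the rename ancestry check: each directory knows its
# parent's inode number; walking toward the root must never see inode 0,
# which is an invalid inode number (the condition behind the f:vx_iget:1a
# assert described above). Names and numbers here are hypothetical.

ROOT_INO = 2      # conventional root inode number
INVALID_INO = 0

def is_in_path(inodes, target_ino, source_ino):
    """Return True if source_ino is an ancestor of target_ino.

    `inodes` maps inode number -> parent inode number. Raises ValueError if
    an invalid (0) inode number is encountered, mirroring the assert.
    """
    ino = target_ino
    while ino != ROOT_INO:
        if ino == INVALID_INO:
            raise ValueError("invalid inode number 0 in ancestry walk")
        if ino == source_ino:
            return True
        ino = inodes[ino]
    return ino == source_ino

# /a(10)/b(20)/c(30), with the root as inode 2
inodes = {10: ROOT_INO, 20: 10, 30: 20}
```

The fix ensures the parent directory carries a correct inode number before the walk begins, so the invalid-inode branch is never reached during a rename.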