                          * * * READ ME * * *
                 * * * Veritas Volume Manager 7.4 * * *
                        * * * Patch 1600 * * *
                         Patch Date: 2019-07-09


This document provides the following information:

   * PATCH NAME
   * OPERATING SYSTEMS SUPPORTED BY THE PATCH
   * PACKAGES AFFECTED BY THE PATCH
   * BASE PRODUCT VERSIONS FOR THE PATCH
   * SUMMARY OF INCIDENTS FIXED BY THE PATCH
   * DETAILS OF INCIDENTS FIXED BY THE PATCH
   * INSTALLATION PRE-REQUISITES
   * INSTALLING THE PATCH
   * REMOVING THE PATCH


PATCH NAME
----------
Veritas Volume Manager 7.4 Patch 1600


OPERATING SYSTEMS SUPPORTED BY THE PATCH
----------------------------------------
RHEL7 x86-64


PACKAGES AFFECTED BY THE PATCH
------------------------------
VRTSvxvm


BASE PRODUCT VERSIONS FOR THE PATCH
-----------------------------------
   * InfoScale Enterprise 7.4
   * InfoScale Foundation 7.4
   * InfoScale Storage 7.4


SUMMARY OF INCIDENTS FIXED BY THE PATCH
---------------------------------------
Patch ID: 7.4.0.1600
* 3974224 (3973364) I/O hang may occur when VVR replication is enabled in synchronous mode.
* 3974225 (3968279) vxconfigd dumps core on an NVMe disk setup.
* 3974226 (3948140) System panic can occur if the size of the RTPG (Report Target Port Groups) data returned by the underlying array is greater than 255 bytes.
* 3974227 (3934700) VxVM is not able to recognize an AIX LVM disk with a 4K sector size.
* 3974228 (3969860) The event source daemon (vxesd) takes a long time to start when a large number of LUNs (around 1700) are attached to the system.
* 3974230 (3915523) A local disk from another node belonging to a private DG (disk group) is exported to the current node when a private DG is imported on that node.
* 3974231 (3955979) I/O may hang in case of synchronous replication.
* 3975899 (3931048) VxVM (Veritas Volume Manager) creates certain log files with write permission for all users.
* 3978013 (3907800) VxVM package installation fails on SLES12 SP2.
* 3979382 (3950373) Allow a slave node to be configured as logowner in CVR configurations.
* 3979385 (3953711) Panic observed while switching the logowner to a slave node while I/Os are in progress.
* 3980907 (3978343) Negative I/O count while switching the logowner.
Patch ID: 7.4.0.1500
* 3970687 (3971046) Replication does not switch between synchronous and asynchronous mode automatically based on the network conditions.
Patch ID: 7.4.0.1400
* 3949320 (3947022) VVR: vxconfigd hang during /scripts/configuratio/assoc_datavol.tc#6
* 3958838 (3953681) Data corruption is seen when more than one plex of a volume is detached.
* 3958884 (3954787) Data corruption may occur in a GCO along with FSS environment on the RHEL 7.5 operating system.
* 3958887 (3953711) Panic observed while switching the logowner to a slave node while I/Os are in progress.
* 3958976 (3955101) Panic observed in a GCO environment (cluster-to-cluster replication) during replication.
* 3959204 (3949954) Dumpstack messages are printed when the vxio module is loaded for the first time and calls blk_register_queue.
* 3959433 (3956134) System panic might occur when I/O is in progress in a VVR (Veritas Volume Replicator) environment.
* 3967098 (3966239) I/O hang observed while copying data in cloud environments.
* 3967099 (3965715) vxconfigd may core dump when VIOM (Veritas InfoScale Operations Manager) is enabled.
Patch ID: 7.4.0.1200
* 3949322 (3944259) The vradmin verifydata and vradmin ibc commands fail on private disk groups with a "Lost connection" error.
* 3950578 (3953241) Messages in syslog are seen with the message string "0000" for the VxVM module.
* 3950760 (3946217) In a scenario where encryption over wire is enabled and secondary logging is disabled, vxconfigd hangs and replication does not progress.
* 3950799 (3950384) In a scenario where volume encryption at rest is enabled, data corruption may occur if the file system size exceeds 1TB.
* 3951488 (3950759) The application I/Os hang if volume-level I/O shipping is enabled and the volume layout is mirror-concat or mirror-stripe.
DETAILS OF INCIDENTS FIXED BY THE PATCH
---------------------------------------
This patch fixes the following incidents:

Patch ID: 7.4.0.1600

* 3974224 (Tracking ID: 3973364)

SYMPTOM:
In VVR (Veritas Volume Replicator) synchronous replication mode with the TCP protocol, I/Os may hang for up to 15-20 minutes if there are any network issues.

DESCRIPTION:
In VVR synchronous replication mode, if a node on the primary site does not receive the ACK (acknowledgement) message sent from the secondary within the TCP timeout period, I/O may hang until the TCP layer detects a timeout, which takes approximately 15-20 minutes. This issue can occur frequently in a lossy network where the ACKs cannot be delivered to the primary because of network issues.

RESOLUTION:
A hidden tunable, 'vol_vvr_tcp_keepalive', has been added to let users enable TCP keepalive on the VVR data ports if the TCP timeout happens frequently.
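For example, the tunable could be inspected and enabled as sketched below. This is an illustrative sketch only: whether the hidden 'vol_vvr_tcp_keepalive' tunable is exposed through the vxtune utility, and whether a value of 1 means "enabled", are assumptions not confirmed by this document, so verify the exact procedure with Veritas support before using it.

   # vxtune vol_vvr_tcp_keepalive      (display the current value; assumed to be listed by vxtune)
   # vxtune vol_vvr_tcp_keepalive 1    (enable TCP keepalive on VVR data ports; value semantics assumed)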
* 3974225 (Tracking ID: 3968279)

SYMPTOM:
vxconfigd dumps core with SIGSEGV/SIGABRT on boot on an NVMe setup.

DESCRIPTION:
On an NVMe setup, vxconfigd dumps core during device discovery because a data structure is accessed by multiple threads and can hit a race condition. For sector sizes other than 512 bytes, a partition size mismatch is seen because the comparison is done with the partition size from devintf_getpart(), which is expressed in the sector size of the disk. This can then trigger NVMe device discovery.

RESOLUTION:
A mutex lock has been added around access to the data structure to prevent the core dump, and the calculations are now done in terms of the sector size of the disk to prevent the partition size mismatch.

* 3974226 (Tracking ID: 3948140)

SYMPTOM:
The system may panic if the RTPG data returned by the array is greater than 255 bytes, with the following stack:
dmp_alua_get_owner_state()
dmp_alua_get_path_state()
dmp_get_path_state()
dmp_check_path_state()
dmp_restore_callback()
dmp_process_scsireq()
dmp_daemons_loop()

DESCRIPTION:
The buffer given to the RTPG (Report Target Port Groups) SCSI command is currently 255 bytes, but the data returned by the underlying array for RTPG can be greater than 255 bytes. As a result, incomplete data is retrieved (only the first 255 bytes), and reading the RTPG data causes an invalid memory access, which results in an error while claiming the devices. This invalid memory access may lead to a system panic.

RESOLUTION:
The RTPG buffer size has been increased to 1024 bytes to handle this.

* 3974227 (Tracking ID: 3934700)

SYMPTOM:
VxVM is not able to recognize an AIX LVM disk with a 4K sector size.

DESCRIPTION:
On a 4K-sector disk, the AIX LVM ID is located at a different offset (4096 rather than 3584 bytes) in the disk header. VxVM tries to read the LVM ID at the original offset, so it is not able to recognize the disk.

RESOLUTION:
The code has been changed to be compatible with the new offset on 4K-sector disks.

* 3974228 (Tracking ID: 3969860)

SYMPTOM:
The event source daemon (vxesd) takes a long time to start when a large number of LUNs (around 1700) are attached to the system.

DESCRIPTION:
The event source daemon creates a configuration file, ddlconfig.info, with the help of the HBA API libraries. The configuration file is created by a child process while the parent process waits for the child to finish. If the number of LUNs is large, creating the configuration file takes more time, so the parent process keeps waiting for the child process to complete the configuration and exit.

RESOLUTION:
Changes have been made to create the ddlconfig.info file in the background and let the parent exit immediately.

* 3974230 (Tracking ID: 3915523)

SYMPTOM:
A local disk from another node belonging to a private DG is exported to the current node when a private DG is imported on that node.

DESCRIPTION:
When a DG is imported, all the disks belonging to the DG are automatically exported to the current node to make sure that the DG gets imported. This is done so that local disks behave the same way as SAN disks. Because all disks in the DG are exported, disks that belong to a different private DG with the same DG name on another node also get exported to the current node. This leads to the wrong disk being selected while the DG is imported.

RESOLUTION:
The DGID (disk group ID) is now used instead of the DG name to decide whether a disk needs to be exported.

* 3974231 (Tracking ID: 3955979)

SYMPTOM:
In the synchronous mode of replication with TCP, I/Os may hang for up to 15-30 minutes if there are any network-related issues.

DESCRIPTION:
When synchronous replication is used and the secondary is unable to send network acknowledgements to the primary because of network issues, I/O hangs on the primary waiting for those acknowledgements. In TCP mode, VVR depends on TCP to time out before the I/Os are drained; because there is no handling on the VVR side, I/Os hang until TCP triggers its timeout, which normally happens within 15-30 minutes.

RESOLUTION:
Code changes have been made to allow the user to set the time within which the TCP timeout should be triggered.

* 3975899 (Tracking ID: 3931048)

SYMPTOM:
The VxVM log files listed below are created with write permission for all users, which might lead to security issues.
/etc/vx/log/vxloggerd.log
/var/adm/vx/logger.txt
/var/adm/vx/kmsg.log

DESCRIPTION:
The log files are created with write permission for all users, which is a security hole. The files are created with the default rw-rw-rw- (666) permission because the umask is set to 0 while creating them.

RESOLUTION:
The umask has been changed to 022 while creating these files, and an incorrect open system call has been fixed. The log files now have rw-r--r-- (644) permissions.
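After applying the patch, the tightened permissions can be confirmed by listing the affected log files; each should show mode rw-r--r-- (644). This is only an illustrative check, and the files are present only once the corresponding components have run and created them.

   # ls -l /etc/vx/log/vxloggerd.log /var/adm/vx/logger.txt /var/adm/vx/kmsg.log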
* 3978013 (Tracking ID: 3907800)

SYMPTOM:
VxVM package installation fails on SLES12 SP2.

DESCRIPTION:
SLES12 SP2 includes a large number of kernel changes, because of which the package installation fails.

RESOLUTION:
Code changes have been added to provide SLES12 SP2 platform support for VxVM.

* 3979382 (Tracking ID: 3950373)

SYMPTOM:
In a CVR environment, the master node acts as the default logowner, and a slave node cannot be assigned the logowner role.

DESCRIPTION:
In a CVR environment, the master node acts as the default logowner. Under heavy I/O load, the master node becomes a bottleneck, impacting overall cluster and replication performance. When scaling CVR configurations with multiple nodes and multiple RVGs, the master node acts as the logowner for all RVGs, which further degrades performance and restricts scaling. In an FSS-CVR environment, a logowner node that has only remote connectivity to the SRL/data volumes adds I/O shipping overhead.

RESOLUTION:
Changes have been made to allow any slave node in the cluster to be configured as the logowner per RVG in a CVR environment.

* 3979385 (Tracking ID: 3953711)

SYMPTOM:
The system might panic while switching the logowner to a slave node while I/Os are in progress, with the following stack:
vol_rv_service_message_free()
vol_rv_replica_reconfigure()
sched_clock_cpu()
vol_rv_error_handle()
vol_rv_errorhandler_callback()
vol_klog_start()
voliod_iohandle()
voliod_loop()
voliod_kiohandle()
kthread()
insert_kthread_work()
ret_from_fork_nospec_begin()
insert_kthread_work()
vol_rv_service_message_free()

DESCRIPTION:
While processing a transaction, the I/O count is left on the RV object to let the transaction proceed, and the RV object in the SIO is set to NULL. However, while freeing the message, the object is dereferenced without considering that it can be NULL. This can lead to a panic because of a NULL pointer dereference in the code.

RESOLUTION:
Code changes have been made to handle a NULL value of the RV object.

* 3980907 (Tracking ID: 3978343)

SYMPTOM:
A logowner change in CVR could lead to a hang or panic due to access to freed memory.

DESCRIPTION:
In the logowner change path, one of the error cases was not handled properly. I/O initiated from a slave node needs to be retried from that slave node only; in this case it was retried from the master node, causing an inconsistency.

RESOLUTION:
I/O from a slave node is now retried from the slave node in the error case.

Patch ID: 7.4.0.1500

* 3970687 (Tracking ID: 3971046)

SYMPTOM:
Replication does not switch between synchronous and asynchronous mode automatically based on the network conditions.

DESCRIPTION:
Network conditions may impact replication performance. However, the current VVR replication does not switch between synchronous and asynchronous mode automatically based on the network conditions.

RESOLUTION:
This patch provides the adaptive synchronous mode for VVR, which is an enhancement to the existing synchronous override mode. In the adaptive synchronous mode, the replication mode switches from synchronous to asynchronous based on the cross-site network latency. Thus replication happens in the synchronous mode when the network conditions are good, and it automatically switches to the asynchronous mode when there is an increase in the cross-site network latency. You can also set alerts that notify you when the system undergoes network deterioration. For more details, see
https://www.veritas.com/bin/support/docRepoServlet?bookId=136858821-137189101-1&requestType=pdf

Patch ID: 7.4.0.1400

* 3949320 (Tracking ID: 3947022)

SYMPTOM:
vxconfigd hang.

DESCRIPTION:
There was a window between NIOs being added to rp_port->pt_waitq and the rlink being disconnected in which NIOs were left in pt_waitq, so their parent (the ack SIO) was never marked done. The ack SIO held an I/O count, which led to the vxconfigd hang.

RESOLUTION:
Do not add an NIO to rp_port->pt_waitq if rp_port->pt_closing is set; instead, call done on the NIO with the error ENC_CLOSING. Before deleting the port, call done on the NIOs in pt_waitq with the error ENC_CLOSING.

* 3958838 (Tracking ID: 3953681)

SYMPTOM:
Data corruption is seen when more than one plex of a volume is detached.

DESCRIPTION:
When a plex of a volume gets detached, the DETACH map gets enabled in the DCO (Data Change Object). The incoming I/Os are tracked in the DRL (Dirty Region Log) and then asynchronously copied to the DETACH map for tracking. If one more plex gets detached, some of the new incoming regions may be missed in the DETACH map of the previously detached plex. This leads to corruption when the disk comes back and the plex resync happens using the corrupted DETACH map.

RESOLUTION:
Code changes have been made to correctly track the I/Os in the DETACH map of the previously detached plex and avoid corruption.
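Incident 3958838 involves volumes that have more than one detached plex. As an illustrative way to review volume and plex states in a disk group before and after recovery, the standard vxprint listing can be used; the disk group name mydg below is a placeholder for your own disk group.

   # vxprint -htg mydg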
* 3958884 (Tracking ID: 3954787)

SYMPTOM:
In an RHEL 7.5 FSS environment with GCO configured, NVMe devices, and an InfiniBand network, data corruption might occur when I/O is sent from the master node to the slave node.

DESCRIPTION:
In the RHEL 7.5 release, Linux stopped allowing I/O on an underlying NVMe device when there are gaps between the BIO vectors. In case of VVR, the SRL header of 3 blocks is added to the BIO. When the BIO is sent through LLT to the other node, the LLT limitation of 32 fragments can lead to misalignment of the BIO vectors. When this unaligned BIO is sent to the underlying NVMe device, the last 3 blocks of the BIO are skipped and not written to the disk on the slave node. This results in incomplete data being written on the slave node, which leads to data corruption.

RESOLUTION:
Code changes have been made to handle this case and send the BIO aligned to the underlying NVMe device.

* 3958887 (Tracking ID: 3953711)

SYMPTOM:
The system might panic while switching the logowner to a slave node while I/Os are in progress, with the following stack:
vol_rv_service_message_free()
vol_rv_replica_reconfigure()
sched_clock_cpu()
vol_rv_error_handle()
vol_rv_errorhandler_callback()
vol_klog_start()
voliod_iohandle()
voliod_loop()
voliod_kiohandle()
kthread()
insert_kthread_work()
ret_from_fork_nospec_begin()
insert_kthread_work()
vol_rv_service_message_free()

DESCRIPTION:
While processing a transaction, the I/O count is left on the RV object to let the transaction proceed, and the RV object in the SIO is set to NULL. However, while freeing the message, the object is dereferenced without considering that it can be NULL. This can lead to a panic because of a NULL pointer dereference in the code.

RESOLUTION:
Code changes have been made to handle a NULL value of the RV object.

* 3958976 (Tracking ID: 3955101)

SYMPTOM:
The server might panic in a GCO environment with the following stack:
nmcom_server_main_tcp()
ttwu_do_wakeup()
ttwu_do_activate.constprop.90()
try_to_wake_up()
update_curr()
update_curr()
account_entity_dequeue()
__schedule()
nmcom_server_proc_tcp()
kthread()
kthread_create_on_node()
ret_from_fork()
kthread_create_on_node()

DESCRIPTION:
Recent code changes handle dynamic port changes, that is, deletion and addition of ports can now happen dynamically. It might happen that while a port is being accessed, it is deleted in the background by another thread. This leads to a panic, because the port being accessed has already been deleted.

RESOLUTION:
Code changes have been made to handle this situation and check whether the port is available before accessing it.

* 3959204 (Tracking ID: 3949954)

SYMPTOM:
Dumpstack messages are printed when the vxio module is loaded for the first time and calls blk_register_queue.

DESCRIPTION:
In RHEL 7.5, a new check was added to the kernel code in blk_register_queue: if QUEUE_FLAG_REGISTERED is already set on the queue, a dumpstack warning message is printed. In VxVM the flag was already set, because it was copied from the device queue that was earlier registered by the OS.

RESOLUTION:
Changes have been made in the VxVM code to avoid copying QUEUE_FLAG_REGISTERED and thereby fix the dumpstack warnings.
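As a simple follow-up check for incident 3959204, you can confirm that the vxio module is loaded and scan the kernel log for the dumpstack warnings after the module loads for the first time. This is an illustrative check only; the grep pattern is an assumption about how the warning appears in the kernel log.

   # lsmod | grep vxio
   # dmesg | grep -i blk_register_queue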
* 3959433 (Tracking ID: 3956134)

SYMPTOM:
A system panic might occur when I/O is in progress in a VVR (Veritas Volume Replicator) environment, with the following stack:
page_fault()
voliomem_grab_special()
volrv_seclog_wsio_start()
voliod_iohandle()
voliod_loop()
kthread()
ret_from_fork()

DESCRIPTION:
In a memory-crunch scenario, the memory reservation for an SIO (staged I/O) in a VVR configuration might fail in some cases. The SIO is then retried later when memory becomes available again, but while doing so some fields of the SIO are passed NULL values, which leads to a panic in the VVR code.

RESOLUTION:
Code changes have been made to pass proper values to the I/O when it is retried in a VVR environment.

* 3967098 (Tracking ID: 3966239)

SYMPTOM:
An I/O hang is observed while copying data to VxVM (Veritas Volume Manager) volumes in cloud environments.

DESCRIPTION:
In a cloud environment with a mirrored volume that has a DCO (Data Change Object) attached, I/Os issued on the volume have to be processed through the DRL (Dirty Region Log), which is used for faster recovery on reboot. During DRL processing, a condition in the code can prevent I/Os from being driven through the DRL, resulting in a hang. Further I/Os keep queuing up waiting for the I/Os on the DRL to complete, leading to a hang situation.

RESOLUTION:
Code changes have been made to resolve the condition that prevents I/Os from being driven through the DRL.

* 3967099 (Tracking ID: 3965715)

SYMPTOM:
vxconfigd may core dump when VIOM (Veritas InfoScale Operations Manager) is enabled, with the following stack trace:
#0 0x00007ffff7309d22 in ____strtoll_l_internal () from /lib64/libc.so.6
#1 0x000000000059c61f in ddl_set_vom_discovered_attr ()
#2 0x00000000005a4230 in ddl_find_devices_in_system ()
#3 0x0000000000535231 in find_devices_in_system ()
#4 0x0000000000535530 in mode_set ()
#5 0x0000000000477a73 in setup_mode ()
#6 0x0000000000479485 in main ()

DESCRIPTION:
The vxconfigd daemon reads JSON data generated by VIOM to dynamically update some of the VxVM disk (LUN) attributes. While accessing this data, it wrongly parsed the LUN size attribute as a string, which returns NULL instead of the LUN size. Accessing this NULL value causes the vxconfigd daemon to core dump with a segmentation fault.

RESOLUTION:
Appropriate changes have been made to handle the LUN size attribute correctly.

Patch ID: 7.4.0.1200

* 3949322 (Tracking ID: 3944259)

SYMPTOM:
The vradmin verifydata and vradmin ibc commands fail on private disk groups with a "Lost connection" error.

DESCRIPTION:
This issue occurs because of a deadlock between the IBC mechanism and the ongoing I/Os on the secondary RVG. The IBC mechanism expects I/Os to be transferred to the secondary in sequential order; however, to improve performance, I/Os are now written in parallel. The mismatch in IBC behavior causes a deadlock, and the vradmin verifydata and vradmin ibc commands fail due to a timeout error.

RESOLUTION:
As a part of this fix, the IBC behavior is improved so that it now accounts for parallel and possibly out-of-sequence I/O writes to the secondary.

* 3950578 (Tracking ID: 3953241)

SYMPTOM:
Generic messages or warnings may appear in syslog with the string "vxvm:0000:" instead of a uniquely numbered message ID for the VxVM module.

DESCRIPTION:
A few syslog messages introduced in the InfoScale 7.4 release were not given a unique message number that identifies the place in the product where they originate. Instead, they are marked with the common message identification number "0000".

RESOLUTION:
This patch fixes the syslog messages generated by the VxVM module that contain "0000" as the message string, and gives them unique numbers.
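To see whether any of these un-numbered VxVM messages still appear after the patch is applied, the system log can be searched for the "vxvm:0000" string. This is an illustrative check only; /var/log/messages is the default syslog location on RHEL 7 and may differ if syslog has been reconfigured.

   # grep "vxvm:0000" /var/log/messages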
* 3950760 (Tracking ID: 3946217)

SYMPTOM:
In a scenario where encryption over wire is enabled and secondary logging is disabled, vxconfigd hangs and replication does not progress.

DESCRIPTION:
In a scenario where encryption over wire is enabled and secondary logging is disabled, the application I/Os are encrypted in sequence, but are not written to the secondary in the same order. The out-of-sequence and in-sequence I/Os are stuck in a loop, waiting for each other to complete. Due to this, I/Os are left incomplete and eventually hang. As a result, vxconfigd hangs and the replication does not progress.

RESOLUTION:
As a part of this fix, the I/O encryption and write sequence is improved such that all I/Os are first encrypted and then sequentially written to the secondary.

* 3950799 (Tracking ID: 3950384)

SYMPTOM:
In a scenario where volume data encryption at rest is enabled, data corruption may occur if the file system size exceeds 1TB and the data is located in a file extent that has an extent size bigger than 256KB.

DESCRIPTION:
In a scenario where data encryption at rest is enabled, data corruption may occur when both of the following conditions are met:
- The file system size is over 1TB.
- The data is located in a file extent that has an extent size bigger than 256KB.
This issue occurs due to a bug that causes an integer overflow for the offset.

RESOLUTION:
As a part of this fix, appropriate code changes have been made to improve the data encryption behavior such that the data corruption does not occur.

* 3951488 (Tracking ID: 3950759)

SYMPTOM:
The application I/Os hang if volume-level I/O shipping is enabled and the volume layout is mirror-concat or mirror-stripe.

DESCRIPTION:
In a scenario where an application I/O is issued over a volume that has volume-level I/O shipping enabled, the I/O is shipped to all target nodes. Typically, on the target nodes, the I/O must be sent only to the local disk. However, in case of mirror-concat or mirror-stripe volumes, I/Os are sent to remote disks as well, which at times leads to an I/O hang.

RESOLUTION:
As a part of this fix, an I/O shipped to a target node is restricted to locally connected disks only, and remote disks are skipped.


INSTALLING THE PATCH
--------------------
Run the Installer script to automatically install the patch:
-----------------------------------------------------------
Please note that the installation of this P-Patch will cause downtime.

To install the patch, perform the following steps on at least one node in the cluster:
1. Copy the patch vm-rhel7_x86_64-Patch-7.4.0.1600.tar.gz to /tmp.
2. Untar vm-rhel7_x86_64-Patch-7.4.0.1600.tar.gz to /tmp/hf:
    # mkdir /tmp/hf
    # cd /tmp/hf
    # gunzip /tmp/vm-rhel7_x86_64-Patch-7.4.0.1600.tar.gz
    # tar xf /tmp/vm-rhel7_x86_64-Patch-7.4.0.1600.tar
3. Install the hotfix (note again that the installation of this P-Patch will cause downtime):
    # pwd
    /tmp/hf
    # ./installVRTSvxvm740P1600 [ ...]

You can also install this patch together with the 7.4 base release using Install Bundles:
1. Download this patch and extract it to a directory.
2. Change to the Veritas InfoScale 7.4 directory and invoke the installer script with the -patch_path option, where -patch_path points to the patch directory:
    # ./installer -patch_path [] [ ...]
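After the installer finishes, you can optionally confirm that the patched package is active before proceeding. This is an illustrative check; the expected version string is taken from the rpm name used in the manual procedure that follows, and the exact output format may vary.

   # rpm -q VRTSvxvm
   VRTSvxvm-7.4.0.1600-RHEL7.x86_64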
Install the patch manually:
--------------------------
1. Before the upgrade:
   (a) Stop I/Os to all the VxVM volumes.
   (b) Unmount any file systems residing on VxVM volumes.
   (c) Stop applications using any VxVM volumes.
2. Check whether root support or DMP native support is enabled:
   # vxdmpadm gettune dmp_native_support
   If the current value is "on", DMP native support is enabled on this machine.
   If it is disabled, go to step 4. If it is enabled, go to step 3.
3. If DMP native support is enabled, it must be disabled first:
   a. Run the following command to disable DMP native support:
      # vxdmpadm settune dmp_native_support=off
   b. Reboot the system:
      # reboot
4. Select the appropriate RPMs for your system, and upgrade to the new patch:
   # rpm -Uvh VRTSvxvm-7.4.0.1600-RHEL7.x86_64.rpm
5. Run vxinstall to get VxVM configured:
   # vxinstall
6. If DMP native support was enabled before the patch upgrade, enable it again:
   a. Run the following command to enable DMP native support:
      # vxdmpadm settune dmp_native_support=on
   b. Reboot the system:
      # reboot


REMOVING THE PATCH
------------------
# rpm -e VRTSvxvm


SPECIAL INSTRUCTIONS
--------------------
Added support to assign a slave node as a logowner.

In a disaster recovery environment, VVR maintains write-order fidelity for the application I/Os it receives. When replicating in a shared disk group environment, VVR designates one cluster node as the logowner to maintain the order of writes. By default, VVR designates the master node as the logowner. To optimize the master node's workload, VVR now enables you to assign any cluster node (slave node) as the logowner.

Note: In the following cases, the change in the logowner role is not preserved, and the master node takes over as the logowner:
- Product upgrade
- Cluster upgrade or reboot
- Logowner slave node failure


OTHERS
------
NONE