owner of vm paused/unpaused operation

Hello, I'm doing some tests related to storage latency, or problems created manually, to debug and manage the reactions of hosts and VMs. Which subsystem/process/daemon is responsible for pausing a VM when problems arise on the storage of the host where the VM is running? How is the timeout used to put the VM into pause mode determined? Sometimes I see that after clearing the problems the VM is automatically un-paused, sometimes not: how is this managed? Are there any counters so that, if the VM has been paused and the problems are not solved within a certain timeframe, the unpause can be done only manually by the sysadmin?
Thanks in advance,
Gianluca

On Tue, Oct 8, 2019 at 4:06 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
> Hello, I'm doing some tests related to storage latency, or problems created manually, to debug and manage the reactions of hosts and VMs. [...] Sometimes I see that after clearing the problems the VM is automatically un-paused, sometimes not: how is this managed? [...]
I have noticed that when the virtual disk is virtio, the VM cannot be unpaused when storage is unreachable for many seconds, while if I use virtio-scsi and set a high virtual disk timeout (like vSphere does on VMs where the VMware Tools have been installed), then the VM can be resumed. The udev rule I have put into a CentOS 7 VM in /etc/udev/rules.d/99-ovirt.rules is this one:

# Set timeout of virtio-SCSI disks to 180 seconds like vSphere vmware tools
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="QEMU*", ATTRS{model}=="QEMU HARDDISK*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c 'echo 180 > /sys$DEVPATH/device/timeout'"

What I have not understood is whether it is possible to prevent vdsm (is it the responsible component?) from suddenly putting the VM into the paused state at all. E.g. for the experiment I have iSCSI-based storage domains and put this in multipath.conf:

devices {
    device {
        all_devs yes
        # Set timeout of queuing to 5*28 = 140 seconds,
        # similar to the vSphere APD timeout
        # no_path_retry fail
        no_path_retry 28
        polling_interval 5
    }
}

Then I create an iptables rule that for 100 seconds prevents the host from reaching the storage (see the sketch at the end of this message), plus a dd task that writes to a disk inside the VM. The effect is that the VM is paused and after about 100 seconds it recovers. The engine events show (newest first):

VM mydbsrv has recovered from paused back to up.    10/9/19 1:59:02 PM
VM mydbsrv has been paused due to storage I/O problem.    10/9/19 1:57:32 PM
VM mydbsrv has been paused.    10/9/19 1:57:32 PM

Any hint on how to prevent the pausing of the VM?
Thanks,
Gianluca
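(A sketch of the kind of blocking rule used for the test above; the portal address 10.10.10.10 and port 3260 are assumed placeholders, not the real target:)

# hypothetical: make the iSCSI portal unreachable for 100 seconds, then restore it
iptables -I OUTPUT -d 10.10.10.10 -p tcp --dport 3260 -j DROP
sleep 100
iptables -D OUTPUT -d 10.10.10.10 -p tcp --dport 3260 -j DROP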

On 10/8/19 4:06 PM, Gianluca Cecchi wrote:
Hi Gianluca,
> Hello, I'm doing some tests related to storage latency, or problems created manually, to debug and manage the reactions of hosts and VMs. Which subsystem/process/daemon is responsible for pausing a VM when problems arise on the storage of the host where the VM is running?
It's Vdsm itself.
> How is the timeout used to put the VM into pause mode determined?
The VM is paused immediately, as soon as libvirt, through QEMU, reports an I/O error, to avoid data corruption. Now, when libvirt reports this error depends largely on the timeout set in the storage configuration, which is done at host level using system tools (i.e. it is not a Vdsm tunable).
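(For reference, the pause reason that Vdsm reacts to is also visible from libvirt directly; a sketch, with the VM name taken from later in this thread:)

# query the domain state and the reason for it (read-only connection)
virsh -r domstate mydbsrv --reason
paused (I/O error)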
> Sometimes I see that after clearing the problems the VM is automatically un-paused, sometimes not: how is this managed?
It depends on the error condition that occurs. Vdsm tries to recover automatically when it is safe to do so. When in doubt, Vdsm always plays it safe with respect to user data.
> Are there any counters so that, if the VM has been paused and the problems are not solved within a certain timeframe, the unpause can be done only manually by the sysadmin?
AFAIR no, because if Vdsm can't be sure, the only real option is to let the sysadmin check and decide.
Bests,
--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh

On Wed, Oct 9, 2019 at 2:55 PM Francesco Romani <fromani@redhat.com> wrote:
On 10/8/19 4:06 PM, Gianluca Cecchi wrote:
> Hi Gianluca,
> Hello, I'm doing some tests related to storage latency, or problems created manually, to debug and manage the reactions of hosts and VMs. Which subsystem/process/daemon is responsible for pausing a VM when problems arise on the storage of the host where the VM is running?
> It's Vdsm itself.
ok
> How is the timeout used to put the VM into pause mode determined?
> The VM is paused immediately, as soon as libvirt, through QEMU, reports an I/O error, to avoid data corruption. Now, when libvirt reports this error
> depends largely on the timeout set in the storage configuration, which is done at host level using system tools (i.e. it is not a Vdsm tunable).
For the test I have set this in the multipath.conf of the host:

devices {
    device {
        all_devs yes
        # Set timeout of queuing to 5*28 = 140 seconds,
        # similar to the vSphere APD timeout
        # no_path_retry fail
        no_path_retry 28
        polling_interval 5
    }
}

So it should wait at least 140 seconds before passing the error to the upper layer, correct?
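(The values the running multipathd actually applied can be cross-checked on the host; a sketch:)

# dump the effective multipath configuration and filter the relevant knobs
multipathd show config | grep -E 'no_path_retry|polling_interval'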
> Sometimes I see that after clearing the problems the VM is automatically un-paused, sometimes not: how is this managed?
I noticed that if I set the disk as virtio-scsi (it seems virtio has no definable timeout and passes the error to the upper layer immediately) and set the disk timeout of the VM disk (through a udev rule) to 180 seconds (a verification sketch follows below), I can block access to the storage for, say, 100 seconds and the host is able to reinstate the paths; the VM is then always unpaused. But I would like to prevent the VM from pausing at all. What else is there to tweak?
Thanks,
Gianluca
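(A quick check that the udev rule took effect inside the guest; the device name sdc is an assumption:)

# the per-device SCSI command timeout, in seconds
cat /sys/block/sdc/device/timeout
180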

On 10/10/19 9:07 AM, Gianluca Cecchi wrote:
> How is the timeout used to put the VM into pause mode determined?
> The VM is paused immediately, as soon as libvirt, through QEMU, reports an I/O error, to avoid data corruption. Now, when libvirt reports this error
> depends largely on the timeout set in the storage configuration, which is done at host level using system tools (i.e. it is not a Vdsm tunable).
> For the test I have set this in the multipath.conf of the host:
> devices {
>     device {
>         all_devs yes
>         # Set timeout of queuing to 5*28 = 140 seconds,
>         # similar to the vSphere APD timeout
>         # no_path_retry fail
>         no_path_retry 28
>         polling_interval 5
>     }
> }
> So it should wait at least 140 seconds before passing the error to the upper layer, correct?
AFAICT yes
> Sometimes I see that after clearing the problems the VM is automatically un-paused, sometimes not: how is this managed?
> I noticed that if I set the disk as virtio-scsi (it seems virtio has no definable timeout and passes the error to the upper layer immediately) and set the disk timeout of the VM disk (through a udev rule) to 180 seconds, I can block access to the storage for, say, 100 seconds and the host is able to reinstate the paths; the VM is then always unpaused. But I would like to prevent the VM from pausing at all. What else is there to tweak?
The only way Vdsm will not pause the VM is if libvirt+qemu never reports any I/O error, which is something I'm not sure is possible and that I'd never recommend anyway. Vdsm always tries hard to be super-careful with respect to possible data corruption.
Bests,
--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh

On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani <fromani@redhat.com> wrote:
> The only way Vdsm will not pause the VM is if libvirt+qemu never reports any I/O error, which is something I'm not sure is possible and that I'd never recommend anyway.
> Vdsm always tries hard to be super-careful with respect to possible data corruption.
OK.
In the case of storage not being accessible for a bunch of seconds, it is more a matter of blocked I/O than of data corruption. If no other host powers on the VM I think there is no risk of data corruption itself, or at least no more than when you have a physical server and for some reason the I/O operations to its physical disks (local or on a SAN) are blocked for some tens of seconds. The host could even power off the VM itself, instead of leaving control to the underlying libvirt+qemu.

I see that by default the qemu-kvm process in my oVirt 4.3.6 is spawned for every disk with the options:

...,werror=stop,rerror=stop,...

Only for the IDE channel of the CD device I have:

...,werror=report,rerror=report,readonly=on

and the manual page for qemu-kvm says:

werror=action,rerror=action
Specify which action to take on write and read errors. Valid actions are: "ignore" (ignore the error and try to continue), "stop" (pause QEMU), "report" (report the error to the guest), "enospc" (pause QEMU only if the host disk is full; report the error to the guest otherwise). The default setting is werror=enospc and rerror=report.

So I think that if I want to modify the behavior in any way, I have to change the options so that I keep "report" for both write and read errors on the virtual disks (a minimal illustration follows at the end of this message).

I'm only experimenting to see the possible options to manage "temporary" problems at storage level, which often resolve without manual action in tens of seconds, sometimes being due to incorrect operations at levels managed by other teams (network, storage, etc.). In these circumstances experience has told me it is better to "do nothing and wait", instead of trying to take any action that will fail anyway until the "external" problem has been solved (automatically, thanks to logic outside oVirt's control, or manually). It would be nice to mimic the behavior of vSphere in this sense, and I'm investigating possible ways to reach it... I hope I clarified a bit the origin of my actions...
Thanks,
Gianluca
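(For illustration only, a minimal invocation showing those options; the disk path and the use of qemu-system-x86_64 rather than oVirt's exact qemu-kvm command line are assumptions:)

# hypothetical: report both write and read errors straight to the guest
qemu-system-x86_64 -m 1024 \
    -drive file=/var/lib/libvirt/images/disk0.img,format=raw,if=virtio,werror=report,rerror=report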

On 10/10/19 10:44 AM, Gianluca Cecchi wrote:
On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani <fromani@redhat.com> wrote:
> The only way Vdsm will not pause the VM is if libvirt+qemu never reports any I/O error, which is something I'm not sure is possible and that I'd never recommend anyway.
> Vdsm always tries hard to be super-careful with respect to possible data corruption.
> OK. In the case of storage not being accessible for a bunch of seconds, it is more a matter of blocked I/O than of data corruption.
True, but we can know only ex post that the storage was just temporarily unavailable, can't we?
> If no other host powers on the VM I think there is no risk of data corruption itself, or at least no more than when you have a physical server and for some reason the I/O operations to its physical disks (local or on a SAN) are blocked for some tens of seconds.
IMO, storage unresponsive for tens of seconds is something which should be uncommon and very alarming in every circumstance, especially for physical servers. What I'm trying to say is that yes, there probably are ways to sidestep this behaviour, but I think this is the wrong direction and adds fragility rather than convenience to the system.
> The host could even power off the VM itself, instead of leaving control to the underlying libvirt+qemu.
> I see that by default the qemu-kvm process in my oVirt 4.3.6 is spawned for every disk with the options: ...,werror=stop,rerror=stop,...
> Only for the IDE channel of the CD device I have: ...,werror=report,rerror=report,readonly=on
> and the manual page for qemu-kvm says:
> werror=action,rerror=action
> Specify which action to take on write and read errors. Valid actions are: "ignore" (ignore the error and try to continue), "stop" (pause QEMU), "report" (report the error to the guest), "enospc" (pause QEMU only if the host disk is full; report the error to the guest otherwise). The default setting is werror=enospc and rerror=report.
> So I think that if I want to modify the behavior in any way, I have to change the options so that I keep "report" for both write and read errors on the virtual disks.
Yep. I don't remember what Engine allows. Worst case you can use a hook, but once again this is making things a bit more fragile.
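(For reference, libvirt exposes these QEMU options as attributes of the disk driver element, which is what such a hook would rewrite; a sketch, not the exact XML oVirt generates:)

<disk type='file' device='disk'>
  <!-- error_policy maps to werror, rerror_policy to rerror -->
  <driver name='qemu' type='raw' error_policy='report' rerror_policy='report'/>
  <source file='/var/lib/libvirt/images/disk0.img'/>
  <target dev='vda' bus='virtio'/>
</disk>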
> I'm only experimenting to see the possible options to manage "temporary" problems at storage level, which often resolve without manual action in tens of seconds, sometimes being due to incorrect operations at levels managed by other teams (network, storage, etc.).
I think the best option is improve the current behaviour: learn why Vdsm fails to unpause the VM and improve here.
--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh

On Thu, Oct 10, 2019 at 1:10 PM Francesco Romani <fromani@redhat.com> wrote:
On 10/10/19 10:44 AM, Gianluca Cecchi wrote:
On Thu, Oct 10, 2019 at 9:56 AM Francesco Romani <fromani@redhat.com> wrote:
> The only way Vdsm will not pause the VM is if libvirt+qemu never reports any I/O error, which is something I'm not sure is possible and that I'd never recommend anyway.
> Vdsm always tries hard to be super-careful with respect to possible data corruption.
> OK. In the case of storage not being accessible for a bunch of seconds, it is more a matter of blocked I/O than of data corruption.
> True, but we can know only ex post that the storage was just temporarily unavailable, can't we?
Yes, but I would like to have an option to say: don't do anything for X seconds, both at host level and at guest level. X could be 5 seconds, or 10 seconds, or 20 seconds... according to the various needs.
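(One host-level knob in that direction is to let multipath queue I/O indefinitely instead of failing after a fixed number of retries; a sketch, with the caveat that unlimited queuing can also block the host's own monitoring and is generally discouraged for oVirt hosts:)

device {
    all_devs yes
    # queue I/O forever while no path is available,
    # instead of failing after no_path_retry * polling_interval seconds
    no_path_retry queue
}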
> If no other host powers on the VM I think there is no risk of data corruption itself, or at least no more than when you have a physical server and for some reason the I/O operations to its physical disks (local or on a SAN) are blocked for some tens of seconds.
> IMO, storage unresponsive for tens of seconds is something which should be uncommon and very alarming in every circumstance, especially for physical servers.
> What I'm trying to say is that yes, there probably are ways to sidestep this behaviour, but I think this is the wrong direction and adds fragility rather than convenience to the system.
In general I agree with you on this
> So I think that if I want to modify the behavior in any way, I have to change the options so that I keep "report" for both write and read errors on the virtual disks.
> Yep. I don't remember what Engine allows. Worst case you can use a hook, but once again this is making things a bit more fragile.
> I'm only experimenting to see the possible options to manage "temporary" problems at storage level, which often resolve without manual action in tens of seconds, sometimes being due to incorrect operations at levels managed by other teams (network, storage, etc.).
> I think the best option is improve the current behaviour: learn why Vdsm fails to unpause the VM and improve here.
Yes, I'm just experimenting on the possible options and their pros & cons.

I see that on my 4.3.6 environment with plain CentOS 7.7 hosts the qemu-kvm process is spawned with "werror=stop,rerror=stop" for all virtual disks; I didn't find any related option in the VM edit page. On my Fedora 30, when I start a VM (with virt-manager or "virsh start"), the options are not present on the command line and, based on the qemu-kvm manual page, "The default setting is werror=enospc and rerror=report".

In the meantime I created a wrapper script for qemu-kvm that changes the command line (a sketch of such a wrapper follows at the end of this case).

1) From werror=stop to werror=report and from rerror=stop to rerror=report

This seems worse, in the sense that the VM is not paused at all, as expected, but there is strange behavior inside it. From the host's point of view:

[root@ov300 ~]# virsh -r list
 Id    Name                           State
----------------------------------------------------
 7     mydbsrv                        running

I suddenly get in the VM's /var/log/messages something like:

Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Sense Key : Aborted Command [current]
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Add. Sense: I/O process terminated
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] CDB: Write(10) 2a 00 03 07 a8 78 00 00 08 00
Oct 10 12:42:55 mydbsrv kernel: blk_update_request: I/O error, dev sdc, sector 50833528
Oct 10 12:42:55 mydbsrv kernel: EXT4-fs warning (device dm-3): ext4_end_bio:322: I/O error -5 writing to inode 1573304 (offset 0 size 0 starting block 6353935)
Oct 10 12:42:55 mydbsrv kernel: Buffer I/O error on device dm-3, logical block 6353935
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Sense Key : Aborted Command [current]
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] Add. Sense: I/O process terminated
Oct 10 12:42:55 mydbsrv kernel: sd 2:0:0:1: [sdc] CDB: Write(10) 2a 00 03 07 a8 98 00 00 08 00
Oct 10 12:42:55 mydbsrv kernel: blk_update_request: I/O error, dev sdc, sector 50833560
Oct 10 12:42:55 mydbsrv kernel: EXT4-fs warning (device dm-3): ext4_end_bio:322: I/O error -5 writing to inode 1573308 (offset 0 size 0 starting block 6353939)
...

and only shell builtins apparently keep working inside the VM, making a power off (from the engine) and power on necessary anyway:

[root@mydbsrv ~]# uptime
-bash: uptime: command not found
[root@mydbsrv ~]# df -h
-bash: df: command not found
[root@mydbsrv ~]# id
-bash: id: command not found
[root@mydbsrv ~]# ll
-bash: ls: command not found
[root@mydbsrv ~]#
[root@mydbsrv ~]# jobs
[root@mydbsrv ~]# ps
-bash: ps: command not found
[root@mydbsrv ~]# sync
-bash: sync: command not found
[root@mydbsrv ~]# pwd
/root
[root@mydbsrv ~]# ls
-bash: ls: command not found
[root@mydbsrv ~]# /bin/ls
-bash: /bin/ls: Input/output error
[root@mydbsrv ~]# type mount
-bash: type: mount: not found
[root@mydbsrv ~]# /bin/mount -o remount,rw /myfs
-bash: /bin/mount: Input/output error
[root@mydbsrv ~]# tail /var/log/messages
-bash: tail: command not found
[root@mydbsrv ~]# cat /var/log/messages
-bash: cat: command not found
[root@mydbsrv ~]# echo some_word
some_word

Even after the storage is accessible again, the wrong behavior continues.
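(A minimal sketch of such a wrapper, assuming the real binary has been moved aside to /usr/libexec/qemu-kvm.real; the path and approach are illustrative, not necessarily what was actually used:)

#!/bin/bash
# Hypothetical qemu-kvm wrapper: rewrite the error policies on the fly,
# then exec the real binary with the modified arguments.
args=()
for a in "$@"; do
    a=${a//werror=stop/werror=report}
    a=${a//rerror=stop/rerror=report}
    args+=("$a")
done
exec /usr/libexec/qemu-kvm.real "${args[@]}"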
2) From werror=stop to werror=ignore and from rerror=stop to rerror=ignore

This is the least intrusive approach with respect to the VM, in my opinion, letting it discover by itself that there is a problem. In this case I get inside the VM's /var/log/messages:

Oct 10 12:54:00 mydbsrv chronyd[4133]: Selected source 131.175.12.3
Oct 10 12:54:00 mydbsrv chronyd[4133]: System clock wrong by 1.065994 seconds, adjustment started
Oct 10 12:54:00 mydbsrv systemd: Time has been changed
Oct 10 12:54:00 mydbsrv chronyd[4133]: System clock was stepped by 1.065994 seconds
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 289, block bitmap and bg descriptor inconsistent: 30748 vs 28302 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 290, block bitmap and bg descriptor inconsistent: 31999 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 291, block bitmap and bg descriptor inconsistent: 28880 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 292, block bitmap and bg descriptor inconsistent: 32478 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 293, block bitmap and bg descriptor inconsistent: 32698 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 294, block bitmap and bg descriptor inconsistent: 32151 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 295, block bitmap and bg descriptor inconsistent: 31925 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 296, block bitmap and bg descriptor inconsistent: 32468 vs 32768 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 0, block bitmap and bg descriptor inconsistent: 32768 vs 1496 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 1, block bitmap and bg descriptor inconsistent: 32768 vs 279 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_mb_generate_buddy:757: group 112, block bitmap and bg descriptor inconsistent: 32767 vs 23733 free clusters
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_m000_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_m002_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:23 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_m005_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:27 mydbsrv kernel: JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Oct 10 12:54:28 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_cjq0_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:31 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_pmon_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:49 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_reco_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:49 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_tmon_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:56 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_dia0_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:54:58 mydbsrv kernel: EXT4-fs error: 49 callbacks suppressed
Oct 10 12:54:58 mydbsrv kernel: EXT4-fs error (device dm-3): ext4_mb_generate_buddy:757: group 192, block bitmap and bg descriptor inconsistent: 32379 vs 24527 free clusters
Oct 10 12:54:58 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_tt00_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:07 mydbsrv chronyd[4133]: Selected source 131.175.12.6
Oct 10 12:55:21 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_gen1_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_mman_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_psp0_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_dbrm_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_pxmn_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_smco_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:22 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_mmnl_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:26 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_lgwr_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:26 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_gen0_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
Oct 10 12:55:26 mydbsrv kernel: EXT4-fs error (device dm-2): ext4_find_dest_de:1829: inode #928680: block 3679007: comm ora_mmon_test: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0

and then, when the 100 seconds of artificial storage access inhibition end, the VM is accessible again:

[root@mydbsrv ~]# id
uid=0(root) gid=0(root) groups=0(root)
[root@mydbsrv ~]# uptime
 12:56:36 up 3 min, 1 user, load average: 0.28, 0.38, 0.17
[root@mydbsrv ~]#
[root@mydbsrv log]# time dd if=/dev/zero bs=1024k count=10240 of=/myfs/testfile
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 60.785 s, 177 MB/s

real    1m1.771s
user    0m0.016s
sys     0m10.459s
[root@mydbsrv log]#

Obviously, as mentioned in the messages, this behavior could potentially lead to fs/journal corruption...
Gianluca