[ovirt-users] Some VMs in status "not responding" in oVirt interface

Christian Hailer christian at hailer.eu
Tue Sep 15 16:37:30 UTC 2015


Hello Daniel,

this is exactly what I experienced in the past weeks. 
I switched the NIC and the HDD from e1000 and IDE to VirtIO NIC and VirtIO disk for all Windows Server 2012 R2 VMs; they have been running for two days now without problems.
Additionally, two of my CentOS VMs stopped responding today, and this was a bit scary: the VM itself was running and I could connect to the console (I had intentionally logged in as root yesterday and not logged out, so I could take a look today at what the problem was). The network was down (pinging anything was unsuccessful), and every action involving reading from or writing to the hard disk hung immediately. So I reset the VM (oVirt wasn't able to shut it down, so I had to kill the process) and had a look at /var/log/messages after booting up.
The last entry was from that night at 03:01, cron.daily. Nothing else until I rebooted this morning at 08:30.
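That silent gap in /var/log/messages can also be pinned down mechanically. Here is a small illustrative sketch (not oVirt tooling; it assumes the classic syslog timestamp style "Sep 15 03:01:01", with the year supplied by hand) that reports the largest gap between consecutive log entries:

```python
from datetime import datetime

def largest_gap(lines, year=2015):
    """Return (gap_seconds, line_before, line_after) for the biggest
    silence between consecutive syslog-style lines ('Sep 15 03:01:01 ...')."""
    stamps = []
    for line in lines:
        try:
            # The first 15 characters of a syslog line are the timestamp.
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-formatted line, skip it
        stamps.append((ts, line.rstrip()))
    best = (0.0, None, None)
    for (t1, l1), (t2, l2) in zip(stamps, stamps[1:]):
        gap = (t2 - t1).total_seconds()
        if gap > best[0]:
            best = (gap, l1, l2)
    return best

# Hypothetical sample mirroring the timestamps from this report:
sample = [
    "Sep 15 03:01:01 vm CROND[123]: (root) CMD (run-parts /etc/cron.daily)",
    "Sep 15 08:30:12 vm kernel: Linux version 3.10.0 ...",
    "Sep 15 08:30:13 vm systemd: Starting Network Manager...",
]
gap, before, after = largest_gap(sample)
print(gap / 3600)  # ~5.5 hours of silence
```

Running it over the real file (`largest_gap(open('/var/log/messages'))`) would show exactly where the guest went quiet.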

The VM is and always was configured with both VirtIO NIC and HDD.
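For what it's worth, one can double-check from inside a Linux guest that the disk really is attached via VirtIO. This is a hypothetical helper, not oVirt tooling; it relies on the fact that VirtIO block devices show up as vdX under /sys/block, while IDE/SATA emulation gives sdX or hdX:

```python
import os

def virtio_block_devices():
    """Return block devices attached via the virtio bus.

    VirtIO disks appear as vda, vdb, ... under /sys/block;
    IDE/SATA emulation shows up as sdX/hdX instead.
    """
    try:
        return [d for d in os.listdir('/sys/block') if d.startswith('vd')]
    except FileNotFoundError:  # not on Linux / no sysfs mounted
        return []

print(virtio_block_devices())  # e.g. ['vda'] inside a VirtIO guest
```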

So what could be the reason that the VM couldn't access the hard disk anymore?

Best regards, Christian 

-----Original Message-----
From: Daniel Helgenberger [mailto:daniel.helgenberger at m-box.de] 
Sent: Tuesday, 15 September 2015 16:15
To: Christian Hailer <christian at hailer.eu>; users at ovirt.org; ydary at redhat.com
Subject: Re: [ovirt-users] Some VMs in status "not responding" in oVirt interface

Hello,

I do not want to hijack the thread but maybe my issue is related?

It might have started with oVirt 3.5.3, but I cannot tell for sure.

For me, one VM (foreman) is affected, for the second time in 14 days. I can confirm this, as I also lose any network connection to the VM and the ability to connect to a console.
Also, the only thing which 'fixes' the issue right now is 'kill -9 <pid of qemu-kvm process>'.
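For the record, a sketch of how that PID might be located programmatically rather than by eyeballing ps output. This assumes Linux /proc and the fact that libvirt starts guests roughly as "qemu-kvm ... -name foreman ..." (newer versions use "-name guest=foreman"); the VM name is just this thread's example:

```python
import os

def find_qemu_pid(vm_name):
    """Scan /proc for a qemu process whose -name argument mentions vm_name.

    Returns the PID as an int, or None if no matching process is found.
    """
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            # /proc/<pid>/cmdline is the argv, NUL-separated.
            with open(f'/proc/{pid}/cmdline', 'rb') as f:
                argv = f.read().split(b'\0')
        except OSError:
            continue  # process exited, or we lack permission
        if not argv or b'qemu' not in os.path.basename(argv[0]):
            continue
        # Match only on the value following the -name flag, to avoid
        # killing an unrelated process that merely mentions the VM name.
        for flag, value in zip(argv, argv[1:]):
            if flag == b'-name' and vm_name.encode() in value:
                return int(pid)
    return None

# Last resort, as in this thread:
#   pid = find_qemu_pid('foreman')
#   if pid: os.kill(pid, signal.SIGKILL)
```

From a shell, `pgrep -f 'qemu.*foreman'` gets close to the same result, but matching the exact -name argument is a little safer.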

As far as I can tell, the VM became unresponsive at around Sep 15 12:30:01; the engine logged this at 12:34. Nothing obvious in the VDSM logs (see attached).

Below the engine.log part.

Versions:
ovirt-engine-3.5.4.2-1.el7.centos.noarch

vdsm-4.16.26-0.el7.centos
libvirt-1.2.8-16.el7_1.3

engine.log (12:00 - 13:00):
2015-09-15 12:03:47,949 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-56) [264d502a] HA reservation status for cluster Default is OK
2015-09-15 12:08:02,708 INFO  [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-89) [2e7bf56e] Attempting to update VMs/Templates Ovf.
2015-09-15 12:08:02,709 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 00000002-0002-0002-0002-000000000088 Type: l
2015-09-15 12:08:02,780 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Lock freed to object EngineLock [exclusiveLocks= key: 00000002-0002-0002-0002-000000000088 value: OVF_UPDATE
2015-09-15 12:08:47,997 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-21) [3fc854a2] HA reservation status for cluster Default is OK
2015-09-15 12:13:06,998 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] START, GetFileStatsVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000088, ignoreFailoverLimit = false), log id: 1503968
2015-09-15 12:13:07,137 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] FINISH, GetFileStatsVDSCommand, return: {pfSense-2.0-RELEASE-i386.iso={status=0, ctime=1432286887.0, size=115709952}, Fedora-15-i686-Live8
2015-09-15 12:13:07,178 INFO  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (org.ovirt.thread.pool-8-thread-48) [50221cdc] Finished automatic refresh process for ISO file type with success, for storage domain id 84dcb2fc-fb63-442f-aa77-3e84dc7d5a72.
2015-09-15 12:13:48,043 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-87) [4fa1bb16] HA reservation status for cluster Default is OK
2015-09-15 12:18:48,088 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-44) [6345e698] HA reservation status for cluster Default is OK
2015-09-15 12:23:48,137 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-13) HA reservation status for cluster Default is OK
2015-09-15 12:28:48,183 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-76) [154c91d5] HA reservation status for cluster Default is OK
2015-09-15 12:33:48,229 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-36) [27c73ac6] HA reservation status for cluster Default is OK
2015-09-15 12:34:49,432 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] VM foreman 8b57ff1d-2800-48ad-b267-fd8e9e2f6fb2 moved from Up --> NotResponding
2015-09-15 12:34:49,578 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM foreman is not responding.
2015-09-15 12:38:48,273 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-10) [7a800766] HA reservation status for cluster Default is OK
2015-09-15 12:43:48,320 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-42) [440f1c40] HA reservation status for cluster Default is OK
2015-09-15 12:48:48,366 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-70) HA reservation status for cluster Default is OK
2015-09-15 12:53:48,412 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-12) [50221cdc] HA reservation status for cluster Default is OK
2015-09-15 12:58:48,459 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-3) HA reservation status for cluster Default is OK



On 29.08.2015 22:48, Christian Hailer wrote:
> Hello,
> 
> last Wednesday I wanted to update my oVirt 3.5 hypervisor. It is a 
> single Centos
> 7 server, so I started by suspending the VMs in order to set the oVirt 
> engine host to maintenance mode. During the process of suspending the 
> VMs the server crashed, kernel panic…
> 
> After restarting the server I installed the updates via yum and
> restarted the server again. Afterwards, all the VMs could be started 
> again. Some hours later my monitoring system registered some 
> unresponsive hosts, I had a look in the oVirt interface, 3 of the VMs 
> were in the state “not responding”, marked by a question mark.
> 
> I tried to shut down the VMs, but oVirt wasn’t able to do so. I tried 
> to reset the status in the database with the sql statement
> 
> update vm_dynamic set status = 0 where vm_guid = (select vm_guid from 
> vm_static where vm_name = 'MYVMNAME');
> 
> but that didn’t help, either. Only rebooting the whole hypervisor 
> helped… afterwards everything worked again. But only for a few hours, 
> then one of the VMs entered the “not responding” state again… again only a reboot helped.
> Yesterday it happened again:
> 
> 2015-08-28 17:44:22,664 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-60) [4ef90b12] VM DC 0f3d1f06-e516-48ce-aa6f-7273c33d3491 moved from Up --> NotResponding
> 
> 2015-08-28 17:44:22,692 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4ef90b12] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM DC is not responding.
> 
> Does anybody know what I can do? Where should I have a look? Hints are 
> greatly appreciated!
> 
> Thanks,
> 
> Christian
> 

--
Daniel Helgenberger
m box bewegtbild GmbH

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19
D-10115 BERLIN


www.m-box.de  www.monkeymen.tv

Managing directors: Martin Retschitzegger / Michaela Göllner
Commercial register: Amtsgericht Charlottenburg / HRB 112767


