[ovirt-users] Some VMs in status "not responding" in oVirt interface

Daniel Helgenberger daniel.helgenberger at m-box.de
Mon Sep 21 09:09:24 UTC 2015



On 17.09.2015 07:39, Christian Hailer wrote:
> Hi,
> 
> just to get it straight: most of my VMs had one or more existing snapshots. Do you think this could currently be a problem? If I understand it correctly, Markus' BZ only concerns a short period of time while a snapshot is being removed, but my VMs stopped responding in the middle of the night without any interaction...
> I deleted all the snapshots, just in case :) My system has been running fine for nearly three days now. I'm not quite sure, but I think it helped that I changed the HDD and NIC of the Windows 2012 VMs to VirtIO devices...

In my case these are Linux guests. So far I only had one VM with a live snapshot, and it is the one showing the issue. This points in the direction of qemu-rhev; the problem might be independent of the guest OS.
For testing I created a new guest just for this purpose; it has a live snapshot. Maybe you could do the same?
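If the test guest hangs as well, I plan to capture a backtrace along the lines Markus suggested below. A rough sketch (the debuginfo package name is my assumption for CentOS 7 with qemu-kvm-ev; on Fedora it would be qemu-debuginfo):

  debuginfo-install qemu-kvm-ev     # needs yum-utils; pulls matching debug symbols
  gdb -p <qemu-pid>                 # attach to the hung qemu-kvm process
  (gdb) thread apply all bt         # dump backtraces of all threads
  (gdb) detach                      # let the process continue, if it still can
  (gdb) quit

The output of 'thread apply all bt' is what I would attach to the BZ.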

To dig deeper into the issue: my storage is NFSv3-backed.
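In case it helps to compare setups, this is how I double-check what the host actually negotiated for the NFS mounts (nfsstat ships with nfs-utils on CentOS 7):

  nfsstat -m    # prints each NFS mount with its vers=, proto= and timeo= options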
> 
> Best regards, Christian
> 
> -----Original Message-----
> From: Daniel Helgenberger [mailto:daniel.helgenberger at m-box.de] 
> Sent: Tuesday, September 15, 2015 22:24
> To: Markus Stockhausen <stockhausen at collogia.de>; Christian Hailer <christian at hailer.eu>
> Cc: ydary at redhat.com; users at ovirt.org
> Subject: Re: RE: [ovirt-users] Some VMs in status "not responding" in oVirt interface
> 
> 
> 
> On 15.09.2015 21:31, Markus Stockhausen wrote:
>> Hi Christian,
>>
>> I am thinking of a package similar to this:
>>
>> qemu-debuginfo.x86_64       2:2.1.3-10.fc21
>>
>> That allows gdb to show information about backtrace symbols. See 
>> comment 12 of https://bugzilla.redhat.com/show_bug.cgi?id=1262251
>> It makes the error search much simpler - especially if qemu hangs.
> 
> Markus, thanks for the BZ. I think I do see the same issue. Actually my VM (puppetmaster) is currently the only one with a live snapshot, and it does a lot of I/O.
> 
> Christian, maybe BZ 1262251 is also applicable in your case?
> 
> I'll go ahead and delete the live snapshot. If I see this issue again I will submit the trace to your BZ.
> 
> 
>>
>> Markus
>>
>> **********************************
>>
>> From: Christian Hailer [christian at hailer.eu]
>> Sent: Tuesday, September 15, 2015 21:24
>> To: Markus Stockhausen; 'Daniel Helgenberger'
>> Cc: ydary at redhat.com; users at ovirt.org
>> Subject: RE: [ovirt-users] Some VMs in status "not responding" in oVirt interface
>>
>> Hi Markus,
>>  
>> gdb is available on CentOS 7, but what do you mean by qemu-debug? I installed qemu-kvm-tools, maybe that is the CentOS equivalent?
>>  
>> qemu-kvm-tools.x86_64 : KVM debugging and diagnostics tools
>> qemu-kvm-tools-ev.x86_64 : KVM debugging and diagnostics tools
>> qemu-kvm-tools-rhev.x86_64 : KVM debugging and diagnostics tools
>>  
>> Regards, Christian
>>  
>>
>> From: Markus Stockhausen [mailto:stockhausen at collogia.de]
>> Sent: Tuesday, September 15, 2015 20:40
>> To: Daniel Helgenberger <daniel.helgenberger at m-box.de>
>> Cc: Christian Hailer <christian at hailer.eu>; ydary at redhat.com; users at ovirt.org
>> Subject: Re: [ovirt-users] Some VMs in status "not responding" in oVirt interface
>>
>> Do you have a chance to install qemu-debug? If yes, I would try a backtrace:
>>
>> gdb -p <qemu-pid>
>> (gdb) bt
>>
>> Markus
>>
>>
>> On 15.09.2015 4:15 PM, Daniel Helgenberger <daniel.helgenberger at m-box.de> wrote:
>>
>> Hello,
>>
>> I do not want to hijack the thread, but maybe my issue is related?
>>
>> It might have started with oVirt 3.5.3, but I cannot tell for sure.
>>
>> For me, one VM (foreman) is affected, now for the second time in 14 days. I can confirm this as I also lose any network connection to the VM and the ability to connect a console.
>> Also, the only thing which 'fixes' the issue right now is 'kill -9 <pid of qemu-kvm process>'.
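>>
>> (To find the right pid I use something along the lines of
>>
>>   pgrep -a -f qemu-kvm | grep foreman
>>
>> on the host, since the guest name shows up in the '-name' argument of the qemu-kvm command line.)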
>>
>> As far as I can tell the VM became unresponsive at around Sep 15 12:30:01; engine logged this at 12:34. Nothing obvious in VDSM logs (see attached).
>>
>> Below the engine.log part.
>>
>> Versions:
>> ovirt-engine-3.5.4.2-1.el7.centos.noarch
>> vdsm-4.16.26-0.el7.centos
>> libvirt-1.2.8-16.el7_1.3
>>
>> engine.log (12:00 - 13:00):
>>
>> 2015-09-15 12:03:47,949 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-56) [264d502a] HA reservation status for cluster Default is OK
>> 2015-09-15 12:08:02,708 INFO  [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-89) [2e7bf56e] Attempting to update VMs/Templates Ovf.
>> 2015-09-15 12:08:02,709 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 00000002-0002-0002-0002-000000000088 Type: l
>> 2015-09-15 12:08:02,780 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Lock freed to object EngineLock [exclusiveLocks= key: 00000002-0002-0002-0002-000000000088 value: OVF_UPDATE
>> 2015-09-15 12:08:47,997 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-21) [3fc854a2] HA reservation status for cluster Default is OK
>> 2015-09-15 12:13:06,998 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] START, GetFileStatsVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000088, ignoreFailoverLimit = false), log id: 1503968
>> 2015-09-15 12:13:07,137 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] FINISH, GetFileStatsVDSCommand, return: {pfSense-2.0-RELEASE-i386.iso={status=0, ctime=1432286887.0, size=115709952}, Fedora-15-i686-Live8
>> 2015-09-15 12:13:07,178 INFO  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (org.ovirt.thread.pool-8-thread-48) [50221cdc] Finished automatic refresh process for ISO file type with success, for storage domain id 84dcb2fc-fb63-442f-aa77-3e84dc7d5a72.
>> 2015-09-15 12:13:48,043 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-87) [4fa1bb16] HA reservation status for cluster Default is OK
>> 2015-09-15 12:18:48,088 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-44) [6345e698] HA reservation status for cluster Default is OK
>> 2015-09-15 12:23:48,137 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-13) HA reservation status for cluster Default is OK
>> 2015-09-15 12:28:48,183 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-76) [154c91d5] HA reservation status for cluster Default is OK
>> 2015-09-15 12:33:48,229 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-36) [27c73ac6] HA reservation status for cluster Default is OK
>> 2015-09-15 12:34:49,432 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] VM foreman 8b57ff1d-2800-48ad-b267-fd8e9e2f6fb2 moved from Up --> NotResponding
>> 2015-09-15 12:34:49,578 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM foreman is not responding.
>> 2015-09-15 12:38:48,273 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-10) [7a800766] HA reservation status for cluster Default is OK
>> 2015-09-15 12:43:48,320 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-42) [440f1c40] HA reservation status for cluster Default is OK
>> 2015-09-15 12:48:48,366 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-70) HA reservation status for cluster Default is OK
>> 2015-09-15 12:53:48,412 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-12) [50221cdc] HA reservation status for cluster Default is OK
>> 2015-09-15 12:58:48,459 INFO  [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-3) HA reservation status for cluster Default is OK
>>
>> On 29.08.2015 22:48, Christian Hailer wrote:
>>
>>> Hello,
>>>
>>> last Wednesday I wanted to update my oVirt 3.5 hypervisor. It is a single CentOS 7 server, so I started by suspending the VMs in order to set the oVirt engine host to maintenance mode. During the process of suspending the VMs the server crashed, kernel panic…
>>>
>>> After restarting the server I installed the updates via yum and restarted the server again. Afterwards, all the VMs could be started again. Some hours later my monitoring system registered some unresponsive hosts; I had a look in the oVirt interface, 3 of the VMs were in the state “not responding”, marked by a question mark.
>>>
>>> I tried to shut down the VMs, but oVirt wasn’t able to do so. I tried to reset the status in the database with the SQL statement
>>>
>>> update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'MYVMNAME');
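>>>
>>> (For reference: the statement goes against the engine's PostgreSQL database, which is called "engine" in a default setup, e.g. on the engine host via
>>>
>>>   su - postgres -c "psql engine"
>>>
>>> - the database name is the default and may differ on other installations.)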
>>>
>>> but that didn’t help, either. Only rebooting the whole hypervisor helped… afterwards everything worked again. But only for a few hours, then one of the VMs entered the “not responding” state again… again only a reboot helped.
>>> Yesterday it happened again:
>>>
>>> 2015-08-28 17:44:22,664 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-60) [4ef90b12] VM DC 0f3d1f06-e516-48ce-aa6f-7273c33d3491 moved from Up --> NotResponding
>>>
>>> 2015-08-28 17:44:22,692 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4ef90b12] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM DC is not responding.
>>>
>>> Does anybody know what I can do? Where should I have a look? Hints are greatly appreciated!
>>>
>>> Thanks,
>>>
>>> Christian
>>
>>
> 

-- 
Daniel Helgenberger
m box bewegtbild GmbH

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19
D-10115 BERLIN


www.m-box.de  www.monkeymen.tv

Managing directors: Martin Retschitzegger / Michaela Göllner
Commercial register: Amtsgericht Charlottenburg / HRB 112767


