[ovirt-users] Some VMs in status "not responding" in oVirt interface

Christian Hailer christian at hailer.eu
Tue Sep 29 17:12:12 UTC 2015


Hi, it stopped for me as well. Very strange...

Regards, Christian


From: Daniel Helgenberger <daniel.helgenberger at m-box.de>
Sent: 29.09.2015 19:07
To: stockhausen at collogia.de; christian at hailer.eu
Cc: ydary at redhat.com; users at ovirt.org
Subject: Re: [ovirt-users] Some VMs in status "not responding"
in oVirt interface

> Hello Christian,
>
> just a quick round-up:
>
> Do you still see the issue? It stopped for me after removing the live
> snapshots.
>
>
> On 17.09.2015 07:39, Christian Hailer wrote:
>> Hi,
>>
>> just to get it straight: most of my VMs had one or more existing
>> snapshots. Do you think this is currently a problem? If I understand it
>> correctly, Markus's BZ only concerns the short window while a snapshot
>> is being removed, but my VMs stopped responding in the middle of the
>> night without any interaction...
>> I deleted all the snapshots, just in case :) My system has been running
>> fine for nearly three days now. I'm not quite sure, but I think it
>> helped that I changed the HDD and NIC of the Windows 2012 VMs to VirtIO
>> devices...
>>
>> Best regards, Christian
>>
>> -----Original Message-----
>> From: Daniel Helgenberger [mailto:daniel.helgenberger at m-box.de]
>> Sent: Tuesday, 15 September 2015 22:24
>> To: Markus Stockhausen <stockhausen at collogia.de>; Christian Hailer
>> <christian at hailer.eu>
>> Cc: ydary at redhat.com; users at ovirt.org
>> Subject: Re: [ovirt-users] Some VMs in status "not responding"
>> in oVirt interface
>>
>>
>>
>> On 15.09.2015 21:31, Markus Stockhausen wrote:
>>> Hi Christian,
>>>
>>> I am thinking of a package like this one:
>>>
>>> qemu-debuginfo.x86_64       2:2.1.3-10.fc21
>>>
>>> It allows gdb to resolve the symbols in a backtrace. See
>>> comment 12 of https://bugzilla.redhat.com/show_bug.cgi?id=1262251
>>> That makes error hunting much simpler, especially when qemu hangs.
>>
>> Markus, thanks for the BZ. I think I do see the same issue.
>> Actually, my VM (puppetmaster) is currently the only one with a live
>> snapshot, and it does a lot of I/O.
>>
>> Christian, maybe BZ 1262251 is also applicable in your case?
>>
>> I'll go ahead and delete the live snapshot. If I see this issue again,
>> I will submit the trace to your BZ.
>>
>>
>>>
>>> Markus
>>>
>>> **********************************
>>>
>>> From: Christian Hailer [christian at hailer.eu]
>>> Sent: Tuesday, 15 September 2015 21:24
>>> To: Markus Stockhausen; 'Daniel Helgenberger'
>>> Cc: ydary at redhat.com; users at ovirt.org
>>> Subject: Re: [ovirt-users] Some VMs in status "not responding" in
>>> oVirt interface
>>>
>>> Hi Markus,
>>>
>>> gdb is available on CentOS 7, but what do you mean by qemu-debug?
>>> I installed qemu-kvm-tools; maybe this is the CentOS equivalent?
>>>
>>> qemu-kvm-tools.x86_64 : KVM debugging and diagnostics tools
>>> qemu-kvm-tools-ev.x86_64 : KVM debugging and diagnostics tools
>>> qemu-kvm-tools-rhev.x86_64 : KVM debugging and diagnostics tools
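>>>
>>> (If qemu-kvm-tools turns out not to ship the debug symbols, a sketch of
>>> pulling the actual debuginfo on CentOS 7, assuming the debuginfo
>>> repositories are enabled:)
>>>
>>> yum install -y yum-utils gdb
>>> debuginfo-install -y qemu-kvm    # or qemu-kvm-ev / qemu-kvm-rhev, matching the installed package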
>>>
>>> Regards, Christian
>>>
>>>
>>> From: Markus Stockhausen [mailto:stockhausen at collogia.de]
>>> Sent: Tuesday, 15 September 2015 20:40
>>> To: Daniel Helgenberger <daniel.helgenberger at m-box.de>
>>> Cc: Christian Hailer <christian at hailer.eu>; ydary at redhat.com;
>>> users at ovirt.org
>>> Subject: Re: [ovirt-users] Some VMs in status "not responding" in
>>> oVirt interface
>>>
>>> Do you have a chance to install qemu-debug? If yes, I would try a backtrace:
>>>
>>> gdb -p <qemu-pid>
>>> (gdb) bt
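>>>
>>> (A non-interactive variant of the same idea, as a sketch; <qemu-pid> is
>>> the PID of the hung qemu-kvm process, and the qemu debuginfo package
>>> should be installed so the symbols resolve:)
>>>
>>> gdb -p <qemu-pid> -batch -ex "thread apply all bt" > /tmp/qemu-bt.txt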
>>> Markus
>>>
>>>
>>> On 15.09.2015, 4:15 PM, Daniel Helgenberger
>>> <daniel.helgenberger at m-box.de> wrote:
>>>
>>> Hello,
>>>
>>> I do not want to hijack the thread, but maybe my issue is related?
>>> It might have started with oVirt 3.5.3, but I cannot tell for sure.
>>>
>>> For me, one VM (foreman) is affected, for the second time in 14 days.
>>> I can confirm this as I also lose any network connection to the VM and
>>> the ability to connect a console. Also, the only thing which 'fixes'
>>> the issue right now is 'kill -9 <pid of qemu-kvm process>'.
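>>>
>>> (A minimal sketch of that last resort, assuming the qemu-kvm command
>>> line contains the VM name, here 'foreman':)
>>>
>>> pgrep -af qemu-kvm | grep foreman   # list qemu-kvm processes and pick the VM
>>> kill -9 <pid>                       # the engine should then report the VM as Down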
>>>
>>> As far as I can tell the VM became unresponsive at around Sep 15
>>> 12:30:01; the engine logged it at 12:34. Nothing obvious in the VDSM
>>> logs (see attached).
>>>
>>> Below is the engine.log part.
>>>
>>> Versions:
>>> ovirt-engine-3.5.4.2-1.el7.centos.noarch
>>> vdsm-4.16.26-0.el7.centos
>>> libvirt-1.2.8-16.el7_1.3
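>>>
>>> (A quick way to spot the relevant transitions in a log of this size, as
>>> a sketch using the default engine log path:)
>>>
>>> grep -E "moved from Up --> NotResponding|is not responding" /var/log/ovirt-engine/engine.log
>>>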
>>> engine.log (12:00 - 13:00):
>>>
>>> 2015-09-15 12:03:47,949 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-56) [264d502a] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:08:02,708 INFO [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-89) [2e7bf56e] Attempting to update VMs/Templates Ovf.
>>> 2015-09-15 12:08:02,709 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 00000002-0002-0002-0002-000000000088 Type: l
>>> 2015-09-15 12:08:02,780 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Lock freed to object EngineLock [exclusiveLocks= key: 00000002-0002-0002-0002-000000000088 value: OVF_UPDATE
>>> 2015-09-15 12:08:47,997 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-21) [3fc854a2] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:13:06,998 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] START, GetFileStatsVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000088, ignoreFailoverLimit = false), log id: 1503968
>>> 2015-09-15 12:13:07,137 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] FINISH, GetFileStatsVDSCommand, return: {pfSense-2.0-RELEASE-i386.iso={status=0, ctime=1432286887.0, size=115709952}, Fedora-15-i686-Live8
>>> 2015-09-15 12:13:07,178 INFO [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (org.ovirt.thread.pool-8-thread-48) [50221cdc] Finished automatic refresh process for ISO file type with success, for storage domain id 84dcb2fc-fb63-442f-aa77-3e84dc7d5a72.
>>> 2015-09-15 12:13:48,043 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-87) [4fa1bb16] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:18:48,088 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-44) [6345e698] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:23:48,137 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-13) HA reservation status for cluster Default is OK
>>> 2015-09-15 12:28:48,183 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-76) [154c91d5] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:33:48,229 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-36) [27c73ac6] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:34:49,432 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] VM foreman 8b57ff1d-2800-48ad-b267-fd8e9e2f6fb2 moved from Up --> NotResponding
>>> 2015-09-15 12:34:49,578 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM foreman is not responding.
>>> 2015-09-15 12:38:48,273 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-10) [7a800766] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:43:48,320 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-42) [440f1c40] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:48:48,366 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-70) HA reservation status for cluster Default is OK
>>> 2015-09-15 12:53:48,412 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-12) [50221cdc] HA reservation status for cluster Default is OK
>>> 2015-09-15 12:58:48,459 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-3) HA reservation status for cluster Default is OK
>>>
>>> On 29.08.2015 22:48, Christian Hailer wrote:
>>>
>>>> Hello,
>>>>
>>>> last Wednesday I wanted to update my oVirt 3.5 hypervisor. It is a
>>>> single CentOS 7 server, so I started by suspending the VMs in order to
>>>> set the oVirt engine host to maintenance mode. During the process of
>>>> suspending the VMs the server crashed, kernel panic…
>>>>
>>>> After restarting the server I installed the updates via yum and
>>>> restarted the server again. Afterwards, all the VMs could be started
>>>> again. Some hours later my monitoring system registered some
>>>> unresponsive hosts; I had a look in the oVirt interface, 3 of the VMs
>>>> were in the state “not responding”, marked by a question mark.
>>>>
>>>> I tried to shut down the VMs, but oVirt wasn’t able to do so. I tried
>>>> to reset the status in the database with the SQL statement
>>>>
>>>> update vm_dynamic set status = 0 where vm_guid = (select vm_guid from
>>>> vm_static where vm_name = 'MYVMNAME');
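>>>>
>>>> (For reference, a sketch of how such a statement can be run against the
>>>> engine database, assuming a local PostgreSQL instance and the default
>>>> database name 'engine', typically with ovirt-engine stopped first:)
>>>>
>>>> sudo -u postgres psql engine -c \
>>>>     "update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'MYVMNAME');"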
>>>>
>>>> but that didn’t help, either. Only rebooting the whole hypervisor
>>>> helped… afterwards everything worked again. But only for a few hours,
>>>> then one of the VMs entered the “not responding” state again… again
>>>> only a reboot helped. Yesterday it happened again:
>>>>
>>>> 2015-08-28 17:44:22,664 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-60) [4ef90b12] VM DC 0f3d1f06-e516-48ce-aa6f-7273c33d3491 moved from Up --> NotResponding
>>>>
>>>> 2015-08-28 17:44:22,692 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4ef90b12] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM DC is not responding.
>>>>
>>>> Does anybody know what I can do? Where should I have a look? Hints are
>>>> greatly appreciated!
>>>>
>>>> Thanks,
>>>>
>>>> Christian
>>>
>>
>> --
>> Daniel Helgenberger
>> m box bewegtbild GmbH
>>
>> P: +49/30/2408781-22
>> F: +49/30/2408781-10
>>
>> ACKERSTR. 19
>> D-10115 BERLIN
>>
>>
>> www.m-box.de  www.monkeymen.tv
>>
>> Managing Directors: Martin Retschitzegger / Michaela Göllner
>> Commercial Register: Amtsgericht Charlottenburg / HRB 112767
>>
>
> --
> Daniel Helgenberger
> m box bewegtbild GmbH
>
> P: +49/30/2408781-22
> F: +49/30/2408781-10
>
> ACKERSTR. 19
> D-10115 BERLIN
>
>
> www.m-box.de  www.monkeymen.tv
>
> Managing Directors: Martin Retschitzegger / Michaela Göllner
> Commercial Register: Amtsgericht Charlottenburg / HRB 112767
>



