
Hi, just to get it straight: most of my VMs had one or more existing snapshots. Do you think this is a problem currently? If I understand it correctly, the BZ of Markus concerns only a short period of time while removing a snapshot, but my VMs stopped responding in the middle of the night without any interaction... I deleted all the snapshots, just in case :) My system has been running fine for nearly three days now. I'm not quite sure, but I think it helped that I changed the HDD and NIC of the Windows 2012 VMs to VirtIO devices...

Best regards,
Christian

-----Original Message-----
From: Daniel Helgenberger [mailto:daniel.helgenberger@m-box.de]
Sent: Tuesday, September 15, 2015 10:24 PM
To: Markus Stockhausen <stockhausen@collogia.de>; Christian Hailer <christian@hailer.eu>
Cc: ydary@redhat.com; users@ovirt.org
Subject: Re: RE: [ovirt-users] Some VMs in status "not responding" in oVirt interface

On 15.09.2015 21:31, Markus Stockhausen wrote:
Hi Christian,
I am thinking of a package like this:
qemu-debuginfo.x86_64 2:2.1.3-10.fc21
That allows gdb to show information about backtrace symbols; see comment 12 of https://bugzilla.redhat.com/show_bug.cgi?id=1262251. It makes error hunting much simpler, especially if qemu hangs.
Markus

Markus, thanks for the BZ. I think I do see the same issue. Actually, my VM (puppetmaster) is currently the only one with a live snapshot, and it does a lot of I/O. Christian, maybe BZ 1262251 is also applicable in your case? I'll go ahead and delete the live snapshot. If I see this issue again I will submit the trace to your BZ.
**********************************
From: Christian Hailer [christian@hailer.eu]
Sent: Tuesday, September 15, 2015 9:24 PM
To: Markus Stockhausen; 'Daniel Helgenberger'
Cc: ydary@redhat.com; users@ovirt.org
Subject: RE: [ovirt-users] Some VMs in status "not responding" in oVirt interface
Hi Markus,
gdb is available on CentOS 7, but what do you mean by qemu-debug? I installed qemu-kvm-tools; maybe this is the CentOS counterpart?
qemu-kvm-tools.x86_64 : KVM debugging and diagnostics tools
qemu-kvm-tools-ev.x86_64 : KVM debugging and diagnostics tools
qemu-kvm-tools-rhev.x86_64 : KVM debugging and diagnostics tools
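For reference, a minimal sketch of how the actual debug symbols are usually pulled in on CentOS 7 (assuming yum-utils is installed and a matching debuginfo repository is enabled; whether the plain, -ev or -rhev build of qemu-kvm needs its symbols depends on what is actually running):

# debuginfo-install is part of the yum-utils package
yum install yum-utils
# fetch the debug symbols matching the installed qemu-kvm build
debuginfo-install qemu-kvm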
Regards, Christian
From: Markus Stockhausen [mailto:stockhausen@collogia.de]
Sent: Tuesday, September 15, 2015 8:40 PM
To: Daniel Helgenberger <daniel.helgenberger@m-box.de>
Cc: Christian Hailer <christian@hailer.eu>; ydary@redhat.com; users@ovirt.org
Subject: Re: [ovirt-users] Some VMs in status "not responding" in oVirt interface
Do you have a chance to install qemu-debug? If yes, I would try a backtrace:

# gdb -p <qemu-pid>
(gdb) bt

Markus
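Since the process is typically hung, a non-interactive capture may be easier to attach to a bug report; a sketch (the pgrep-based PID lookup is illustrative and assumes a single qemu-kvm process on the host):

# dump a backtrace of all qemu threads in one shot and save it
gdb -p $(pgrep -f qemu-kvm | head -n1) -batch -ex 'thread apply all bt' > qemu-bt.txt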
On 15.09.2015 4:15 PM, Daniel Helgenberger <daniel.helgenberger@m-box.de> wrote:
Hello,
I do not want to hijack the thread but maybe my issue is related?
It might have started with ovirt 3.5.3; but I cannot tell for sure.
For me, one VM (foreman) is affected, for the second time in 14 days. I can confirm this as I also lose any network connection to the VM and the ability to connect a console.
Also, the only thing which 'fixes' the issue right now is 'kill -9 <pid of qemu-kvm process>'.
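A sketch of how the right PID can be located (this assumes the VM name appears on the qemu-kvm command line, as it does when libvirt starts the guest with its -name option; the pattern is illustrative):

# list qemu-kvm processes with their full command line and pick the hung guest
pgrep -af qemu-kvm | grep foreman
# last resort: kill the process hard
kill -9 <pid of qemu-kvm process>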
As far as I can tell the VM became unresponsive at around Sep 15 12:30:01; engine logged this at 12:34. Nothing obvious in VDSM logs (see attached).
Below the engine.log part.
Versions:
ovirt-engine-3.5.4.2-1.el7.centos.noarch
vdsm-4.16.26-0.el7.centos
libvirt-1.2.8-16.el7_1.3
engine.log (12:00 - 13:00):
2015-09-15 12:03:47,949 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-56) [264d502a] HA reservation status for cluster Default is OK
2015-09-15 12:08:02,708 INFO [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-89) [2e7bf56e] Attempting to update VMs/Templates Ovf.
2015-09-15 12:08:02,709 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected : ID: 00000002-0002-0002-0002-000000000088 Type: l
2015-09-15 12:08:02,780 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-89) [5e9f4ba6] Lock freed to object EngineLock [exclusiveLocks= key: 00000002-0002-0002-0002-000000000088 value: OVF_UPDATE
2015-09-15 12:08:47,997 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-21) [3fc854a2] HA reservation status for cluster Default is OK
2015-09-15 12:13:06,998 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] START, GetFileStatsVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000088, ignoreFailoverLimit = false), log id: 1503968
2015-09-15 12:13:07,137 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (org.ovirt.thread.pool-8-thread-48) [50221cdc] FINISH, GetFileStatsVDSCommand, return: {pfSense-2.0-RELEASE-i386.iso={status=0, ctime=1432286887.0, size=115709952}, Fedora-15-i686-Live8
2015-09-15 12:13:07,178 INFO [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (org.ovirt.thread.pool-8-thread-48) [50221cdc] Finished automatic refresh process for ISO file type with success, for storage domain id 84dcb2fc-fb63-442f-aa77-3e84dc7d5a72.
2015-09-15 12:13:48,043 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-87) [4fa1bb16] HA reservation status for cluster Default is OK
2015-09-15 12:18:48,088 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-44) [6345e698] HA reservation status for cluster Default is OK
2015-09-15 12:23:48,137 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-13) HA reservation status for cluster Default is OK
2015-09-15 12:28:48,183 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-76) [154c91d5] HA reservation status for cluster Default is OK
2015-09-15 12:33:48,229 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-36) [27c73ac6] HA reservation status for cluster Default is OK
2015-09-15 12:34:49,432 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] VM foreman 8b57ff1d-2800-48ad-b267-fd8e9e2f6fb2 moved from Up --> NotResponding
2015-09-15 12:34:49,578 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [5f2a4b68] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM foreman is not responding.
2015-09-15 12:38:48,273 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-10) [7a800766] HA reservation status for cluster Default is OK
2015-09-15 12:43:48,320 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-42) [440f1c40] HA reservation status for cluster Default is OK
2015-09-15 12:48:48,366 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-70) HA reservation status for cluster Default is OK
2015-09-15 12:53:48,412 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-12) [50221cdc] HA reservation status for cluster Default is OK
2015-09-15 12:58:48,459 INFO [org.ovirt.engine.core.bll.scheduling.HaReservationHandling] (DefaultQuartzScheduler_Worker-3) HA reservation status for cluster Default is OK
On 29.08.2015 22:48, Christian Hailer wrote:
Hello,
last Wednesday I wanted to update my oVirt 3.5 hypervisor. It is a single CentOS 7 server, so I started by suspending the VMs in order to set the oVirt engine host to maintenance mode. During the process of suspending the VMs the server crashed, kernel panic…

After restarting the server I installed the updates via yum and restarted the server again. Afterwards, all the VMs could be started again. Some hours later my monitoring system registered some unresponsive hosts; I had a look in the oVirt interface, and 3 of the VMs were in the state “not responding”, marked by a question mark.
I tried to shut down the VMs, but oVirt wasn’t able to do so. I tried to reset the status in the database with the SQL statement

update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'MYVMNAME');
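(For reference, a sketch of how such a statement can be run on the engine host; the database name "engine" is the oVirt default and may differ in other setups:)

# execute the reset statement against the engine database as the postgres user
sudo -u postgres psql engine -c "update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'MYVMNAME');"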
But that didn’t help, either. Only rebooting the whole hypervisor helped… afterwards everything worked again. But only for a few hours; then one of the VMs entered the “not responding” state again… again, only a reboot helped.
Yesterday it happened again:
2015-08-28 17:44:22,664 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-60) [4ef90b12] VM DC 0f3d1f06-e516-48ce-aa6f-7273c33d3491 moved from Up --> NotResponding
2015-08-28 17:44:22,692 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [4ef90b12] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM DC is not responding.
Does anybody know what I can do? Where should I have a look? Hints are greatly
appreciated!
Thanks,
Christian
--
Daniel Helgenberger
m box bewegtbild GmbH

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19
D-10115 BERLIN

www.m-box.de
www.monkeymen.tv

Managing Directors: Martin Retschitzegger / Michaela Göllner
Commercial Register: Amtsgericht Charlottenburg / HRB 112767