On Wed, Oct 30, 2013 at 08:41:43PM +0100, Alessandro Bianchi wrote:
On 30/10/2013 18:04, Dan Kenigsberg wrote:
>On Wed, Oct 30, 2013 at 02:40:02PM +0100, Alessandro Bianchi wrote:
>> On 30/10/2013 13:58, Dan Kenigsberg wrote:
>>
>> On Wed, Oct 30, 2013 at 11:34:21AM +0100, Alessandro Bianchi wrote:
>>
>> Hi everyone
>>
>> I've set up a gluster storage with two replicated bricks
>>
>> DC is up and I created a VM to test gluster storage
>>
>> If I start the VM WITHOUT any disk attached (only one virtual DVD) it
>> starts fine.
>>
>> If I attach a 30 GB thin-provisioned disk from the gluster domain, the VM gets
>> stuck in the "waiting for launch" state.
>>
>> I see no special activity on the gluster servers (they serve several other
>> shares with no trouble at all, and even the ISO domain is an NFS export on
>> locally mounted gluster and works fine).
>>
>> I've double-checked all the prerequisites and they look fine (Fedora 19,
>> insecure access enabled in both glusterd.vol and the volume options, and the
>> uid/gid/insecure options set).
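>>
>> Roughly what I applied is the following (a sketch from memory; <volname> is a
>> placeholder for the real volume name, and 36 is the vdsm/kvm uid/gid on the
>> hosts):
>>
>> # in /etc/glusterfs/glusterd.vol on every gluster node, then restart glusterd
>> option rpc-auth-allow-insecure on
>>
>> # volume options
>> gluster volume set <volname> server.allow-insecure on
>> gluster volume set <volname> storage.owner-uid 36
>> gluster volume set <volname> storage.owner-gid 36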
>>
>> Am I doing something wrong?
>>
>> I'm even unable to stop the VM from the engine GUI
>>
>> Any advice?
>>
>> Which version of ovirt are you using? Hopefully ovirt-3.3.0.1.
>> For how long is the VM stuck in its "wait for launch" state?
>> What does `virsh -r list` have to say while startup stalls?
>> Would you provide more content of your vdsm.log and possibly
>> libvirtd.log so we can understand what blocks the VM start-up? Please
>> use an attachment or pastebin, as your mail agent wreaks havoc on the log
>> lines.
>>
>>
>> Thank you for your answer.
>>
>> Here are the "facts"
>>
>> In the GUI I see
>>
>> "waiting for launch 3 h"
>>
>> virsh -r list
>> Id Nome Stato
>> ----------------------------------------------------
>> 3 CentOS_30 terminato
>>
>> vdsClient -s 0 list table
>> 200dfb05-461e-49d9-95a2-c0a7c7ced669 0 CentOS_30
>> WaitForLaunch
>>
>> Packages:
>>
>> ovirt-engine-userportal-3.3.0.1-1.fc19.noarch
>> ovirt-log-collector-3.3.1-1.fc19.noarch
>> ovirt-engine-restapi-3.3.0.1-1.fc19.noarch
>> ovirt-engine-setup-3.3.0.1-1.fc19.noarch
>> ovirt-engine-backend-3.3.0.1-1.fc19.noarch
>> ovirt-host-deploy-java-1.1.1-1.fc19.noarch
>> ovirt-release-fedora-8-1.noarch
>> ovirt-engine-setup-plugin-allinone-3.3.0.1-1.fc19.noarch
>> ovirt-engine-webadmin-portal-3.3.0.1-1.fc19.noarch
>> ovirt-engine-sdk-python-3.3.0.7-1.fc19.noarch
>> ovirt-iso-uploader-3.3.1-1.fc19.noarch
>> ovirt-engine-websocket-proxy-3.3.0.1-1.fc19.noarch
>> ovirt-engine-dbscripts-3.3.0.1-1.fc19.noarch
>> ovirt-host-deploy-offline-1.1.1-1.fc19.noarch
>> ovirt-engine-cli-3.3.0.5-1.fc19.noarch
>> ovirt-engine-tools-3.3.0.1-1.fc19.noarch
>> ovirt-engine-lib-3.3.0.1-1.fc19.noarch
>> ovirt-image-uploader-3.3.1-1.fc19.noarch
>> ovirt-engine-3.3.0.1-1.fc19.noarch
>> ovirt-host-deploy-1.1.1-1.fc19.noarch
>>
>> I attach the full vdsm log
>>
>> Look around 30-10 10:30 to see everything that happens.
>>
>> Despite the "terminated" label in output from virsh I still see the VM
>> "waiting for launch" in the GUI, so I suspect the answer to "how long" may
>> be "forever".
>>
>> Since this is a test VM I can run whatever test you may need to track down the
>> problem, including destroying and rebuilding it.
>>
>> It would be great to have gluster support stable in ovirt!
>>
>> Thank you for your efforts
>The log has an ominous failed attempt to start the VM, followed by an
>immediate vdsm crash. Is it reproducible?
>
>We have plenty of issues lurking here:
>1. Why has libvirt failed to create the VM? For this, please find clues
> in the complete non-line-broken CentOS_30.log and libvirtd.log.
Attached to this message.
>
>2. Why was vdsm killed? Does /var/log/messages have a clue from systemd?
The output of `cat /var/log/messages | grep vdsm` is attached.
I do not see an explicit attempt to take vdsmd down. Do you see any other
incriminating message correlated with the following?

Oct 30 08:51:15 hypervisor respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave
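
If it helps, something along these lines (just a sketch; adjust the time window)
may surface related messages from systemd or the kernel around the crash:

grep -iE 'vdsm|respawn|segfault|oom' /var/log/messages
journalctl --since "2013-10-30 08:45" --until "2013-10-30 08:55"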
>
>3. We may have a nasty race: if Vdsm crashes just before it has
> registered that the VM is down.
Actually, this is not the issue: vdsm tries (and fails, due to a qemu/libvirt
bug) to destroy the VM.
>
>4. We used to force Vdsm to run with LC_ALL=C. It seems that the grand
>   service rewrite by Zhou (http://gerrit.ovirt.org/15578) has changed
>   that. This may have adverse effects, since AFAIR we sometimes parse
>   application output and assume that it's in C. Having a non-English
>   log file is problematic on its own for support personnel, who are used to
>   grepping for keywords. ybronhei, was it intentional? Can it be reverted
>   or at least scrutinized?
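(If we do want to force the C locale back, one straightforward way would be a
systemd drop-in for the vdsmd unit; just a sketch, the unit name and path are
assumed rather than verified:

# /etc/systemd/system/vdsmd.service.d/locale.conf
[Service]
Environment=LC_ALL=C

followed by `systemctl daemon-reload` and a vdsmd restart.)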
Currently it still says "waiting for launch 9 h".
I haven't aborted it, so if you need any other info I can provide it.
libvirtd fails to connect to qemu's monitor (the Italian "Connessione interrotta
dal corrispondente" in the log below is "Connection reset by peer"). This smells
like a qemu bug that is beyond my over-the-mailing-list debugging abilities :-(
You may want to strace or gdb the running qemu process, or try to re-attach
to it by restarting libvirtd.
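
For instance, something like this (only a sketch; first find the pid of the
stuck qemu process, the <pid> below is a placeholder):

pgrep -f CentOS_30                               # pid of the qemu process backing the VM, if any
strace -f -p <pid> -o /tmp/qemu.strace           # see which syscall it is blocked on
gdb -p <pid> -batch -ex 'thread apply all bt'    # dump backtraces of all qemu threads
systemctl restart libvirtd                       # libvirtd will try to re-attach to running domains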
2013-10-30 07:51:10.045+0000: 8304: debug : qemuProcessStart:3804 : Waiting for monitor to show up
2013-10-30 07:51:10.045+0000: 8304: debug : qemuProcessWaitForMonitor:1707 : Connect monitor to 0x7fc1640eab80 'CentOS_30'
2013-10-30 07:51:10.246+0000: 8304: debug : qemuMonitorOpenInternal:751 : QEMU_MONITOR_NEW: mon=0x7fc1640ef6b0 refs=2 fd=27
2013-10-30 07:51:10.246+0000: 8304: debug : qemuMonitorSetCapabilities:1145 : mon=0x7fc1640ef6b0
2013-10-30 07:51:10.246+0000: 8304: debug : qemuMonitorSend:887 : QEMU_MONITOR_SEND_MSG: mon=0x7fc1640ef6b0 msg={"execute":"qmp_capabilities","id":"libvirt-1"} fd=-1
2013-10-30 07:51:13.097+0000: 8296: error : qemuMonitorIORead:505 : Unable to read from monitor: Connessione interrotta dal corrispondente
2013-10-30 07:51:13.097+0000: 8296: debug : qemuMonitorIO:638 : Error on monitor Unable to read from monitor: Connessione interrotta dal corrispondente
2013-10-30 07:51:13.097+0000: 8296: debug : qemuMonitorIO:672 : Triggering error callback
2013-10-30 07:51:13.097+0000: 8296: debug : qemuProcessHandleMonitorError:351 : Received error on 0x7fc1640eab80 'CentOS_30'
2013-10-30 07:51:13.097+0000: 8304: debug : qemuMonitorSend:899 : Send command resulted in error Unable to read from monitor: Connessione interrotta dal corrispondente
2013-10-30 07:51:13.097+0000: 8304: debug : qemuProcessStop:3992 : Shutting down VM 'CentOS_30' pid=7655 flags=0
2013-10-30 07:51:13.097+0000: 8296: debug : qemuMonitorIO:638 : Error on monitor Unable to read from monitor: Connessione interrotta dal corrispondente
2013-10-30 07:51:13.097+0000: 8296: debug : qemuMonitorIO:661 : Triggering EOF callback