<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><br>
Since the VM's run_on_vds was empty, the "confirm host..." action didn't
clear its status, because it's not selected from the DB as one of
the host's VMs. <br>
I'll try to dig in and see at what point this value was cleared -
probably around the failed migration.<br>
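<br>
For reference, a quick way to check that on a live setup (just a sketch,
assuming the 3.1 schema where run_on_vds lives in vm_dynamic) is
something like:<br>
<br>
<font face="courier new, monospace">engine=> select status, run_on_vds
from vm_dynamic where vm_guid = (select vm_guid from vm_static where
vm_name = 'myvm');</font><br>
<br>
An empty run_on_vds there would explain why "confirm host..." skips the
VM.<br>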
<br>
<br>
On 10/10/2012 07:25 PM, Alan Johnson wrote:<br>
</div>
<blockquote
cite="mid:CAAhwQij0zfoxwO0hzywG_bWTNs+VbCw7KTQyrk+Vbqy2UNNRmQ@mail.gmail.com"
type="cite">On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:jhernand@redhat.com" target="_blank">jhernand@redhat.com</a>></span>
wrote:<br>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">On 10/09/2012 11:36 PM, Itamar Heim wrote:<br>
> well, hacking the db will work, but reproducing this
and logs for us to<br>
> fix the actually bug would also help.<br>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Unfortunately, it would be very difficult to reproduce
since I have replaced the boot drive that I believe was
causing the failures and I have no idea what state the host
was in when it went down (details below). Plus, our testing
folks will become very angry if I keep crashing their VMs. =)
Still, I can probably come up with some spare hardware
eventually and try to reproduce if needed, but let's see what
we see with the logs, etc., first, yeah?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">
<br>
</div>
</div>
We have seen this before, and we thought it was fixed.<br>
</blockquote>
<div><br>
</div>
<div>Was that very recently? Perhaps I was not running the
fixed version; I might still not be. I'm running on CentOS
6.3 using this repo: <a moz-do-not-send="true"
href="http://www.dreyou.org/ovirt/ovirt-dre.repo">http://www.dreyou.org/ovirt/ovirt-dre.repo</a></div>
<div><br>
</div>
<div>I used these <a moz-do-not-send="true"
href="http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-node-install">instructions
to setup the nodes</a> and these <a moz-do-not-send="true"
href="http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-engine">instructions
to setup the engine</a>. (I have not configured any
glusterfs volumes, but I will likely play with that soon.)
When setting up the engine, I had to fix a bunch of broken
symlinks to jar files that were included with the various
packages. Both the symlinks and the jar files were there, but
many of the symlinks to the jars were broken. I'll be
reporting that to the package maintainers soon, but mention it
here just in case it turns out to be relevant.</div>
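<div><br>
</div>
<div>(In case it helps anyone hitting the same packaging issue, a quick
way to list the broken symlinks is something like the find call below;
the path is only a guess, so point it at wherever the engine jars
actually live on your install.)</div>
<div><br>
</div>
<div><font face="courier new, monospace">[root@admin ~]# find
/usr/share/ovirt-engine -xtype l</font></div>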
<div><br>
</div>
<div>Here are my ovirt package versions:</div>
<div>
<div><font face="courier new, monospace">[root@admin ~]# rpm
-qa | fgrep -i ovirt</font></div>
<div><font face="courier new, monospace">ovirt-log-collector-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-image-uploader-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-userportal-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-setup-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-restapi-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-config-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-notification-service-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-backend-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-sdk-3.1.0.5-1.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-iso-uploader-3.1.0-16.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-jbossas711-1-0.x86_64</font></div>
<div><font face="courier new, monospace">ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-genericapi-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-tools-common-3.1.0-3.19.el6.noarch</font></div>
<div><font face="courier new, monospace">ovirt-engine-3.1.0-3.19.el6.noarch</font></div>
</div>
<div><font face="courier new, monospace"><br>
</font></div>
<div>
<div><font face="courier new, monospace">[root@cloudhost03 ~]#
rpm -qa | fgrep -i vdsm</font></div>
<div><font face="courier new, monospace">vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch</font></div>
<div><font face="courier new, monospace">vdsm-gluster-4.10.0-0.42.13.el6.noarch</font></div>
<div><font face="courier new, monospace">vdsm-python-4.10.0-0.42.13.el6.x86_64</font></div>
<div><font face="courier new, monospace">vdsm-4.10.0-0.42.13.el6.x86_64</font></div>
<div><font face="courier new, monospace">vdsm-cli-4.10.0-0.42.13.el6.noarch</font></div>
</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Alan, can you describe exactly the sequence of events that
led to<br>
this problem? When you say that the host died while going to
maintenance</blockquote>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
what do you mean exactly? It crashed, rebooted, hung, was
fenced? </blockquote>
<div><br>
</div>
<div>
<div>I'll do my best. =) First, some background. I have 3
hosts in oVirt currently: cloudhost0{2,3,4}. (There is one more
out there but not in oVirt just yet, so it is of no consequence.
It is currently running the VM hosting the oVirt engine, but I
doubt that is relevant either.) I originally built these using
USB sticks as the boot drives, hoping to leave the drive bays
dedicated to VM storage. I bought what seemed to be the best
ones for the job, but the root file system recently kept going
into read-only due to device errors. It only happened once
before I installed oVirt a few weeks ago, but I brought two new
hosts online with oVirt, so maybe I was just lucky with the
first 2 hosts.</div>
<div><br>
</div>
<div>After converting to oVirt, the USB drives started going
read-only more often, and I think they might not have been
fast enough to keep up with the write requests from a host
OS running a full load of VMs (~10), causing performance
problems with some of our VMs. So, I started replacing the
USB drives, putting one host into maintenance at a time.</div>
</div>
<div><br>
</div>
<div>I started with cloudhost04 not because it had gone
read-only, but in hopes of improving performance. While it
was migrating VMs for maintenance mode, the engine marked it
in an unknown state. vdsmd had crashed and would not start
because it could not write logs or pid files. The / mount had
gone ro. </div>
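<div><br>
</div>
<div>(For what it's worth, the quickest check I know of for that state
is to look at the mount options for /, e.g. <font face="courier new, monospace">grep
' / ' /proc/mounts</font>, and see whether the options field says ro
or rw.)</div>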
<div><br>
</div>
<div>I'm not 100% sure now, but I think a couple of VMs were
successfully migrated, while several were not. libvirtd was up
and the VMs were none the wiser, but I could not figure out the
virsh password to tell it to migrate the VMs, despite finding a
thread on this list with some pointers. So, we logged into
each VM and shut them down manually. Once virsh listed no
running VMs, I shut down clh04, confirmed 'Host has been
Rebooted' in the GUI, and the engine let me start the VMs on
another host. (I.e., they went from status "unknown" to
status "down".)</div>
<div><br>
</div>
<div>I yanked the USB drive, removed the host in the GUI,
and then rebuilt the node on a SATA SSD using the instructions
linked above. All has been fine with that node since. I
followed the same procedure on cloudhost03 with no trouble at
all. It went into maintenance mode without a hitch.</div>
<div><br>
</div>
<div>Now, for the case in question. I put cloudhost02 into
maintenance mode and went to lunch. After lunch and some
other distractions, I found it had successfully migrated all but
2 of the VMs to other nodes before becoming unresponsive.
Unlike cloudhost04, which I could ssh into and manipulate, 02
had become completely unresponsive. Even the console was just
a black screen with no response to mouse or keyboard. The
rest is covered previously in this thread. In short, no
amount of node rebooting, confirming host reboots, engine
restarting, or removing and adding hosts back in convinced the
engine to take the VMs out of unknown status. Also, the host
column was blank for the entire time they were in unknown
status, as far as I know.</div>
<div><br>
</div>
<div>(In case anyone is curious, I have had no confirmation on
performance improvements since converting all nodes to SSD
boot drives, but I have had no complaints since either.)</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
It would be very helpful if you have the engine logs at the
time the host<br>
went down.<br>
</blockquote>
<div><br>
</div>
<div>Attached. Also, I have the USB stick from cloudhost02, so
if there are logs on there that might help, just point me to
them. The migration started around 2012-10-09 12:36:28. I
started fighting with it again after 2PM.</div>
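<div><br>
</div>
<div>(My plan for pulling anything useful off that stick is just to
mount it read-only on another box and copy the vdsm logs out, roughly
as below; the device name is whatever it shows up as, and I'm assuming
the logs are under the usual /var/log/vdsm/ location.)</div>
<div><br>
</div>
<div><font face="courier new, monospace">[root@admin ~]# mount -o ro
/dev/sdb1 /mnt</font></div>
<div><font face="courier new, monospace">[root@admin ~]# cp -a
/mnt/var/log/vdsm /tmp/cloudhost02-vdsm-logs</font></div>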
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Once you have those logs in a safe place, the only way to get
the VM out<br>
of that status is to update the database manually. First make
completely<br>
sure that the VM is not running in any host, then do the
following:<br>
<br>
# psql -U engine<br>
psql (9.1.6)<br>
Type "help" for help.<br>
<br>
engine=> update vm_dynamic set status = 0 where vm_guid =
(select<br>
vm_gui from vm_static where vm_name = 'myvm');<br>
UPDATE 1<br>
<br>
(Assuming that the name of your VM is "myvm").<br>
</blockquote>
<div><br>
</div>
<div>That did the trick! Thank you so much, Juan! I had to fix
a typo by changing 'vm_gui' to 'vm_guid' after 'select', but
the error from the first try made that clear. Regardless, it
was all I needed. Here is the corrected syntax all on one
line for others that may be looking for the same fix:</div>
<div><br>
</div>
<div>engine=> update vm_dynamic set status = 0 where vm_guid
= (select vm_guid from vm_static where vm_name = 'myvm');</div>
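<div><br>
</div>
<div>And a quick sanity check afterwards (same caveat that this assumes
the 3.1 schema shown above) to confirm the VM really is back to status
0, i.e. down:</div>
<div><br>
</div>
<div>engine=> select s.vm_name, d.status from vm_dynamic d join
vm_static s on s.vm_guid = d.vm_guid where s.vm_name = 'myvm';</div>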
<div><br>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Users@ovirt.org">Users@ovirt.org</a>
<a class="moz-txt-link-freetext" href="http://lists.ovirt.org/mailman/listinfo/users">http://lists.ovirt.org/mailman/listinfo/users</a></pre>
</blockquote>
<br>
</body>
</html>