Since the VM's run_on_vds was empty, the "Confirm host..." action didn't clear
its status: the VM isn't selected from the DB as one of that host's VMs.
I'll try to dig in and see at what point this value was cleared -
probably around the failed migration.
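
For anyone following along, here is a minimal sketch of the selection I mean
(the table and column names vm_dynamic, run_on_vds and status come from the
engine schema discussed below; the engine's actual query may differ, and
<host-guid> is just a placeholder):

engine=> -- VMs that "Confirm host has been rebooted" would consider for a given host
engine=> select vm_guid, status from vm_dynamic where run_on_vds = '<host-guid>';

If run_on_vds is NULL for the stuck VM, it never appears in that result set,
so its status is never reset.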
On 10/10/2012 07:25 PM, Alan Johnson wrote:
On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <jhernand@redhat.com> wrote:
On 10/09/2012 11:36 PM, Itamar Heim wrote:
> well, hacking the db will work, but reproducing this and logs for us to
> fix the actual bug would also help.
Unfortunately, it would be very difficult to reproduce since I
have replaced the boot drive that I believe was causing the failures,
and I have no idea what state the host was in when it went down
(details below). Plus, our testing folks will become very angry if I
keep crashing their VMs. =) Still, I can probably come up with some
spare hardware eventually and try to reproduce if needed, but let's
see what we see with the logs, etc., first, yeah?
We have seen this before, and we thought it was fixed.
Was that very recently? Perhaps I was not running the fixed version.
I might still not be. I'm running on CentOS 6.3 using this repo:
http://www.dreyou.org/ovirt/ovirt-dre.repo
I used these instructions to set up the nodes
<http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-e...>
and these instructions to set up the engine
<http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-e...>.
(I have not configured any glusterfs volumes, but I will likely play
with that soon.) When setting up the engine, I had to fix a bunch of
broken symlinks to jar files that were included with the various
packages. Both the symlinks and the jar files were there, but
many of the symlinks to the jars were broken. I'll be reporting that
to the package maintainers soon, but mention it here just in case it
turns out to be relevant.
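(For what it's worth, a quick way to list dangling symlinks is find's -xtype
test; the path here is only my guess at where the engine packages drop their
jars, so adjust it to your install:

find /usr/share/ovirt-engine -xtype l

Every path it prints is a symlink whose target no longer exists.)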
Here are my ovirt package versions:
[root@admin ~]# rpm -qa | fgrep -i ovirt
ovirt-log-collector-3.1.0-16.el6.noarch
ovirt-image-uploader-3.1.0-16.el6.noarch
ovirt-engine-userportal-3.1.0-3.19.el6.noarch
ovirt-engine-setup-3.1.0-3.19.el6.noarch
ovirt-engine-restapi-3.1.0-3.19.el6.noarch
ovirt-engine-config-3.1.0-3.19.el6.noarch
ovirt-engine-notification-service-3.1.0-3.19.el6.noarch
ovirt-engine-backend-3.1.0-3.19.el6.noarch
ovirt-engine-sdk-3.1.0.5-1.el6.noarch
ovirt-iso-uploader-3.1.0-16.el6.noarch
ovirt-engine-jbossas711-1-0.x86_64
ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch
ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch
ovirt-engine-genericapi-3.1.0-3.19.el6.noarch
ovirt-engine-tools-common-3.1.0-3.19.el6.noarch
ovirt-engine-3.1.0-3.19.el6.noarch
[root@cloudhost03 ~]# rpm -qa | fgrep -i vdsm
vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch
vdsm-gluster-4.10.0-0.42.13.el6.noarch
vdsm-python-4.10.0-0.42.13.el6.x86_64
vdsm-4.10.0-0.42.13.el6.x86_64
vdsm-cli-4.10.0-0.42.13.el6.noarch
Alan, can you describe exactly the sequence of events that led to
this problem? When you say that the host died while going to
maintenance, what do you mean exactly? It crashed, rebooted, hung, was fenced?
I'll do my best. =) First, some background. I have 3 hosts in oVirt
currently: cloudhost0{2,3,4}. (There is one more out there but not in oVirt
just yet, so it's of no consequence. It is currently running the VM hosting
the oVirt engine, but I doubt that is relevant either.) I originally built
these using USB sticks as the boot drives, hoping to leave the drive
bays dedicated to VM storage. I bought what seemed to be the best
ones for the job, but the root file system kept going read-only
due to device errors recently. It only happened once before I
installed oVirt a few weeks ago, but I brought two new hosts online
with oVirt, so maybe I was just lucky with the first 2 hosts.
After converting to oVirt, the USB drives started going read-only
more often, and I think they might not have been fast enough to keep up
with the write requests from a host OS running a full load of VMs
(~10), causing performance problems with some of our VMs. So, I
started replacing the USB drives, putting one host into maintenance at
a time.
I started with cloudhost04 not because it had gone read-only, but in
hopes of improving performance. While it was migrating VMs for
maintenance mode, the engine marked it in an unknown state. vdsmd had
crashed and would not start because it could not write logs or pid
files. The / mount had gone ro.
I'm not 100% sure now, but I think a couple of VMs were
successfully migrated, but several were not. libvirtd was up and the
VMs were none the wiser, but I could not figure out the virsh password
to tell it to migrate the VMs, despite finding a thread on this list with
some pointers. So, we logged into each VM and shut them down
manually. Once virsh listed no running VMs, I shut down clh04,
clicked 'Confirm Host has been Rebooted' in the GUI, and the engine let me
start the VMs on another host. (I.e., they went from status "unknown"
to status "down".)
I yanked the USB drive, removed the host in the GUI, and
then rebuilt the node on a SATA SSD using the instructions linked
above. All has been fine with that node since. I followed the same
procedure on cloudhost03 with no trouble at all. It went into
maintenance mode without a hitch.
Now, for the case in question. I put cloudhost02 into maintenance
mode and went to lunch. After lunch and some other distractions, I
found it had successfully migrated all but 2 of the VMs to other nodes
before becoming unresponsive. Unlike cloudhost04, which I could ssh
into and manipulate, 02 had become completely unresponsive. Even the
console was just a black screen with no response to mouse or
keyboard. The rest is covered previously in this thread. In short,
no amount of node rebooting, confirming host reboots, engine
restarting, or removing and adding hosts back in convinced the engine
to take the VMs out of unknown status. Also, the host column was
blank for the entire time they were in unknown status, as far as I know.
(In case anyone is curious, I have had no confirmation on performance
improvements since converting all nodes to SSD boot drives, but I have
had no complaints since either.)
It would be very helpful if you have the engine logs at the time the host
went down.
Attached. Also, I have the USB stick from cloudhost02, so if there
are logs on there that might help, just point me to them. The
migration started around 2012-10-09 12:36:28. I started fighting with
it again after 2PM.
Once you have those logs in a safe place, the only way to get the VM out
of that status is to update the database manually. First make completely
sure that the VM is not running on any host, then do the following:

# psql -U engine
psql (9.1.6)
Type "help" for help.

engine=> update vm_dynamic set status = 0 where vm_guid = (select
vm_gui from vm_static where vm_name = 'myvm');
UPDATE 1

(Assuming that the name of your VM is "myvm").
That did the trick! Thank you so much, Juan! I had to fix a typo by
changing 'vm_gui' to 'vm_guid' after 'select', but the error from the
first try made that clear. Regardless, it was all I needed. Here is
the corrected syntax, all on one line, for others who may be looking
for the same fix:
engine=> update vm_dynamic set status = 0 where vm_guid =
(select vm_guid from vm_static where vm_name = 'myvm');
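
One more note in case it helps anyone who finds this later: status 0 appears
to be the engine's "Down" state (that is my reading of the 3.1 code, so
double-check on your version), and you can sanity-check the row before and
after the update with something like:

engine=> select vm_guid, status, run_on_vds from vm_dynamic
where vm_guid = (select vm_guid from vm_static where vm_name = 'myvm');

Same caveat as Juan's: be completely sure the VM is not actually running
anywhere before touching the row.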
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users