On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <jhernand(a)redhat.com> wrote:
> On 10/09/2012 11:36 PM, Itamar Heim wrote:
> > well, hacking the db will work, but reproducing this and logs for us to
> > fix the actual bug would also help.
Unfortunately, it would be very difficult to reproduce since I have
replaced the boot drive that I believe was causing the failures and I have
no idea what state the host was in when it went down (details below).
Plus, our testing folks will become very angry if I keep crashing their
VMs. =) Still, I can probably come up with some spare hardware eventually
and try to reproduce if needed, but let's see what we see with the logs,
etc., first, yeah?
> We have seen this before, and we thought it was fixed.
Was that very recently? Perhaps I was not running the fixed version. I
still might not be. I'm running on CentOS 6.3 using this repo:
http://www.dreyou.org/ovirt/ovirt-dre.repo
I used these instructions to set up the
nodes<http://middleswarth.net/content/installing-ovirt-31-and-glusterf...
and
these instructions to set up the
engine<http://middleswarth.net/content/installing-ovirt-31-and-gluster...;.
(I have not configured any glusterfs volumes, but I will likely play with
that soon.) When setting up the engine, I had to fix a bunch of broken
symlinks to jar files that were included with the various packages. Both
the symlinks and the jar files were there, but many of the symlinks to
the jars were broken. I'll be reporting that to the package maintainers
soon, but mention it here just in case it turns out to be relevant.
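For anyone hitting the same packaging problem, here is a minimal sketch of one way to list dangling symlinks under a directory. It is not what I actually ran; the function name is made up for illustration, and the directory to scan is whatever location holds the jar symlinks on your engine host (the thread does not name it).

```python
import os
import sys
import tempfile


def find_broken_symlinks(root):
    """Return sorted paths of symlinks under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories under dirnames, so check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    # Usage: python find_broken_symlinks.py <directory to scan>
    for link in find_broken_symlinks(sys.argv[1]):
        # readlink() shows where the dangling link was supposed to point.
        print(link, "->", os.readlink(link))
```

Printing the intended target next to each dangling link makes it easier to see whether the jar was renamed, moved, or never installed.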
Here are my ovirt package versions:
[root@admin ~]# rpm -qa | fgrep -i ovirt
ovirt-log-collector-3.1.0-16.el6.noarch
ovirt-image-uploader-3.1.0-16.el6.noarch
ovirt-engine-userportal-3.1.0-3.19.el6.noarch
ovirt-engine-setup-3.1.0-3.19.el6.noarch
ovirt-engine-restapi-3.1.0-3.19.el6.noarch
ovirt-engine-config-3.1.0-3.19.el6.noarch
ovirt-engine-notification-service-3.1.0-3.19.el6.noarch
ovirt-engine-backend-3.1.0-3.19.el6.noarch
ovirt-engine-sdk-3.1.0.5-1.el6.noarch
ovirt-iso-uploader-3.1.0-16.el6.noarch
ovirt-engine-jbossas711-1-0.x86_64
ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch
ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch
ovirt-engine-genericapi-3.1.0-3.19.el6.noarch
ovirt-engine-tools-common-3.1.0-3.19.el6.noarch
ovirt-engine-3.1.0-3.19.el6.noarch
[root@cloudhost03 ~]# rpm -qa | fgrep -i vdsm
vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch
vdsm-gluster-4.10.0-0.42.13.el6.noarch
vdsm-python-4.10.0-0.42.13.el6.x86_64
vdsm-4.10.0-0.42.13.el6.x86_64
vdsm-cli-4.10.0-0.42.13.el6.noarch
> Alan, can you describe exactly the sequence of events that led to
> this problem? When you say that the host died while going to maintenance,
> what do you mean exactly? Did it crash, reboot, hang, or was it fenced?
I'll do my best. =) First, some background. I have 3 hosts in oVirt
currently: cloudhost0{2,3,4}. (cloudhost01 is out there but not in oVirt
just yet, so of no consequence. It is currently running the VM hosting
oVirt engine, but I doubt that is relevant either.) I originally built
these using USB sticks as the boot drives, hoping to leave the drive bays
dedicated to VM storage. I bought what seemed to be the best ones for the
job, but the root file system recently started going read-only due to
device errors. It only happened once before I installed oVirt a few weeks
ago, but I brought two new hosts online with oVirt, so maybe I was just
lucky with the first 2 hosts.
After converting to oVirt, the USB drives started going read-only more
often, and I think they might not have been fast enough to keep up with the
write requests from a host OS running a full load of VMs (~10), causing
performance problems with some of our VMs. So, I started replacing the USB
drives, putting one host into maintenance at a time.
I started with cloudhost04 not because it had gone read-only, but in hopes
of improving performance. While it was migrating VMs for maintenance mode,
the engine marked it in an unknown state. vdsmd had crashed and would not
start because it could not write logs or pid files. The / mount had gone
ro.
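In hindsight, a periodic check for a read-only root file system would have flagged this earlier. Below is a minimal sketch that parses text in /proc/mounts format. The function name and the device names in the sample are hypothetical, and it assumes the last entry listed for "/" is the one currently in effect.

```python
def root_is_readonly(mounts_text):
    """Return True if the last "/" entry in /proc/mounts-style text has the "ro" option."""
    readonly = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point == "/":
            # Assumption: later entries override earlier ones such as rootfs.
            readonly = "ro" in options
    return readonly


if __name__ == "__main__":
    # Hypothetical sample; on a real host, read the text from /proc/mounts instead.
    sample = (
        "rootfs / rootfs rw 0 0\n"
        "/dev/sdb1 / ext4 ro,relatime 0 0\n"
        "proc /proc proc rw 0 0\n"
    )
    print("root is read-only:", root_is_readonly(sample))
```

Splitting the option field on commas avoids false matches on options that merely contain the letters "ro", such as "errors=remount-ro".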
I'm not 100% sure now, but I think a couple of VMs were
successfully migrated, but several were not. libvirtd was up and the VMs
were none the wiser, but I could not figure out the virsh password to tell
it to migrate VMs despite finding a thread in this list with some pointers.
So, we logged into each VM and shut it down manually. Once virsh listed
no running VMs, I shut down cloudhost04, confirmed 'Host has been Rebooted'
in the GUI, and the engine let me start the VMs on another host. (I.e.,
they went from status "unknown" to status "down".)
I yanked the USB drive, removed the host in the GUI, and then rebuilt
the node on a SATA SSD using the instructions linked above. All has been
fine with that node since. I followed the same procedure on cloudhost03
with no trouble at all. It went into maintenance mode without a hitch.
Now, for the case in question. I put cloudhost02 into maintenance mode and
went to lunch. After lunch and some other distractions, I found it
successfully migrated all but 2 of the VMs to other nodes before becoming
unresponsive. Unlike cloudhost04, which I could ssh into and manipulate,
02 had become completely unresponsive. Even the console was just a black
screen with no response to mouse or keyboard. The rest is covered
previously in this thread. In short, no amount of node rebooting,
confirming host reboots, engine restarting, or removing and adding hosts
back in convinced the engine to take the VMs out of unknown status. Also,
the host column was blank for the entire time they were in unknown status,
as far as I know.
(In case anyone is curious, I have had no confirmation on performance
improvements since converting all nodes to SSD boot drives, but I have had
no complaints since either.)
> It would be very helpful if you have the engine logs at the time the
> host went down.
Attached. Also, I have the USB stick from cloudhost02, so if there are
logs on there that might help, just point me to them. The migration
started around 2012-10-09 12:36:28. I started fighting with it again after
2PM.
> Once you have those logs in a safe place, the only way to get the VM out
> of that status is to update the database manually. First make completely
> sure that the VM is not running in any host, then do the following:
> # psql -U engine
> psql (9.1.6)
> Type "help" for help.
> engine=> update vm_dynamic set status = 0 where vm_guid = (select
> vm_gui from vm_static where vm_name = 'myvm');
> UPDATE 1
> (Assuming that the name of your VM is "myvm").
That did the trick! Thank you so much, Juan! I had to fix a typo by
changing 'vm_gui' to 'vm_guid' after 'select', but the error from the
first try made that clear. Regardless, it was all I needed. Here is the
corrected syntax all on one line for others that may be looking for the
same fix:
engine=> update vm_dynamic set status = 0 where vm_guid = (select vm_guid from vm_static where vm_name = 'myvm');
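To illustrate what the subquery does, here is a minimal sketch that runs the same statement against a toy in-memory SQLite database. This is not the engine database, which is PostgreSQL. The two toy tables carry only the columns named in this thread, the guid values are made up, and the starting status of 7 is an assumed placeholder for a stuck VM.

```python
import sqlite3


def make_toy_db():
    """Build a throwaway database with only the columns mentioned in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vm_static (vm_guid TEXT PRIMARY KEY, vm_name TEXT)")
    conn.execute("CREATE TABLE vm_dynamic (vm_guid TEXT PRIMARY KEY, status INTEGER)")
    conn.executemany(
        "INSERT INTO vm_static VALUES (?, ?)",
        [("guid-1", "myvm"), ("guid-2", "othervm")],  # made-up guids
    )
    conn.executemany(
        "INSERT INTO vm_dynamic VALUES (?, ?)",
        [("guid-1", 7), ("guid-2", 7)],  # 7 is an assumed placeholder status
    )
    conn.commit()
    return conn


def reset_vm_status(conn, vm_name):
    """Set status = 0 for the named VM and return the number of rows changed."""
    cur = conn.execute(
        "UPDATE vm_dynamic SET status = 0 "
        "WHERE vm_guid = (SELECT vm_guid FROM vm_static WHERE vm_name = ?)",
        (vm_name,),  # bound parameter instead of pasting the name into the SQL
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_toy_db()
    print("rows updated:", reset_vm_status(conn, "myvm"))
    print(conn.execute("SELECT vm_guid, status FROM vm_dynamic ORDER BY vm_guid").fetchall())
```

The row count plays the same role as the "UPDATE 1" line in the psql session: a count of 0 means the name did not match any VM, so nothing was changed.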