[Users] how to convince oVirt a VM is down

Roy Golan rgolan at redhat.com
Thu Oct 11 07:20:40 UTC 2012


Since the VM's run_on_vds was empty, the "Confirm 'Host has been Rebooted'" 
action didn't clear its status, because the VM is not selected from the DB 
as one of that host's VMs.  I'll try to dig in and see at what point this 
value was cleared - probably around the failed migration.
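
For anyone else who hits this before the bug is fixed, a quick way to see 
whether a VM is stuck in that state is to look at run_on_vds directly in 
psql (just a sketch, assuming run_on_vds sits in vm_dynamic next to the 
status column that Juan's update below touches):

  engine=> select s.vm_name, d.status, d.run_on_vds
             from vm_static s
             join vm_dynamic d on s.vm_guid = d.vm_guid
            where s.vm_name = 'myvm';

If run_on_vds comes back NULL for a VM the engine still reports as unknown, 
"Confirm 'Host has been Rebooted'" will skip it, which matches what Alan saw.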


On 10/10/2012 07:25 PM, Alan Johnson wrote:
> On Wed, Oct 10, 2012 at 4:35 AM, Juan Hernandez <jhernand at redhat.com> wrote:
>
>     On 10/09/2012 11:36 PM, Itamar Heim wrote:
>     > well, hacking the db will work, but reproducing this and logs 
>     > for us to fix the actual bug would also help.
>
>
> Unfortunately, it would be very difficult to reproduce, since I 
> have replaced the boot drive that I believe was causing the failures 
> and I have no idea what state the host was in when it went down 
> (details below).  Plus, our testing folks will become very angry if I 
> keep crashing their VMs. =)  Still, I can probably come up with some 
> spare hardware eventually and try to reproduce if needed, but let's 
> see what we see with the logs, etc., first, yeah?
>
>
>     We have seen this before, and we thought it was fixed.
>
>
> Was that very recently?  Perhaps I was not running the fixed version. 
>  I might still not be.  I'm running on CentOS 6.3 using this repo: 
> http://www.dreyou.org/ovirt/ovirt-dre.repo
>
> I used these instructions to setup the nodes 
> <http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-node-install> and 
> these instructions to setup the engine 
> <http://middleswarth.net/content/installing-ovirt-31-and-glusterfs-using-either-nfs-or-posix-native-file-system-engine>. 
>  (I have not configured any glusterfs volumes, but I will likely play 
> with that soon.)  When setting up the engine, I had to fix a bunch of 
> broken symlinks to jar files that were included with the various 
> packages.  Both the symlinks and the jar files were there, but 
> many of the symlinks to the jars were broken.  I'll be reporting that 
> to the package maintainers soon, but mention it here just in case it 
> turns out to be relevant.
>
> Here are my ovirt package versions:
> [root@admin ~]# rpm -qa | fgrep -i ovirt
> ovirt-log-collector-3.1.0-16.el6.noarch
> ovirt-image-uploader-3.1.0-16.el6.noarch
> ovirt-engine-userportal-3.1.0-3.19.el6.noarch
> ovirt-engine-setup-3.1.0-3.19.el6.noarch
> ovirt-engine-restapi-3.1.0-3.19.el6.noarch
> ovirt-engine-config-3.1.0-3.19.el6.noarch
> ovirt-engine-notification-service-3.1.0-3.19.el6.noarch
> ovirt-engine-backend-3.1.0-3.19.el6.noarch
> ovirt-engine-sdk-3.1.0.5-1.el6.noarch
> ovirt-iso-uploader-3.1.0-16.el6.noarch
> ovirt-engine-jbossas711-1-0.x86_64
> ovirt-engine-webadmin-portal-3.1.0-3.19.el6.noarch
> ovirt-engine-dbscripts-3.1.0-3.19.el6.noarch
> ovirt-engine-genericapi-3.1.0-3.19.el6.noarch
> ovirt-engine-tools-common-3.1.0-3.19.el6.noarch
> ovirt-engine-3.1.0-3.19.el6.noarch
>
> [root@cloudhost03 ~]# rpm -qa | fgrep -i vdsm
> vdsm-xmlrpc-4.10.0-0.42.13.el6.noarch
> vdsm-gluster-4.10.0-0.42.13.el6.noarch
> vdsm-python-4.10.0-0.42.13.el6.x86_64
> vdsm-4.10.0-0.42.13.el6.x86_64
> vdsm-cli-4.10.0-0.42.13.el6.noarch
>
>
>     Alan, can you describe exactly the sequence of events that led to 
>     this problem? When you say that the host died while going to 
>     maintenance, what do you mean exactly? It crashed, rebooted, hung, 
>     was fenced?
>
>
> I'll do my best. =)  First, some background.  I have 3 hosts in oVirt 
> currently: cloudhost0{2,3,4}.  (There is one more host out there, but it 
> is not in oVirt just yet, so it is of no consequence.  It is currently 
> running the VM hosting the oVirt engine, but I doubt that is relevant 
> either.)  I originally built these using USB sticks as the boot drives, 
> hoping to leave the drive bays dedicated to VM storage.  I bought what 
> seemed to be the best sticks for the job, but recently the root file 
> system kept going read-only due to device errors.  That had only happened 
> once before I installed oVirt a few weeks ago, but I brought two new hosts 
> online with oVirt, so maybe I was just lucky with the first 2 hosts.
>
> After converting to oVirt, the USB drives started going read-only 
> more often and I think they might not have been fast enough to keep up 
> with the write requests from a host OS running a full load of VMs 
> (~10), causing performance problems with some of our VMs.  So, I 
> started replacing the USB drives putting one host into maintenance at 
> a time.
>
> I started with cloudhost04 not because it had gone read-only, but in 
> hopes of improving performance.  While it was migrating VMs for 
> maintenance mode, the engine marked it as being in an unknown state.  vdsmd 
> had crashed and would not start because it could not write logs or pid 
> files.  The / mount had gone read-only.
>
> I'm not 100% sure now, but I think a couple of VMs were 
> successfully migrated, while several were not.  libvirtd was up and the 
> VMs were none the wiser, but I could not figure out the virsh password 
> to tell it to migrate VMs despite finding a thread in this list with 
> some pointers.  So, we logged into each VM and shut them down 
> manually.  Once virsh listed no running VMs, I shut down clh04, 
> clicked Confirm 'Host has been Rebooted' in the GUI, and the engine let me 
> start the VMs on another host.  (I.e., they went from status "unknown" 
> to status "down".)
>
> I yanked the USB drive, removed the host in the GUI, and 
> then rebuilt the node on a SATA SSD using the instructions linked 
> above.  All has been fine with that node since.  I followed the same 
> procedure on cloudhost03 with no trouble at all.  It went into 
> maintenance mode without a hitch.
>
> Now, for the case in question.  I put cloudhost02 into maintenance 
> mode and went to lunch.  After lunch and some other distractions, I 
> found it had successfully migrated all but 2 of the VMs to other nodes 
> before becoming unresponsive.  Unlike cloudhost04, which I could ssh 
> into and manipulate, 02 had become completely unresponsive.  Even the 
> console was just a black screen with no response to mouse or 
> keyboard.  The rest is covered previously in this thread.  In short, 
> no amount of node rebooting, confirming host reboots, engine 
> restarting, or removing and adding hosts back in convinced the engine 
> to take the VMs out of unknown status.  Also, the host column was 
> blank for the entire time they were in unknown status, as far as I know.
>
> (In case anyone is curious, I have had no confirmation on performance 
> improvements since converting all nodes to SSD boot drives, but I have 
> had no complaints since either.)
>
>     It would be very helpful if you have the engine logs at the time 
>     the host went down.
>
>
> Attached.  Also, I have the USB stick from cloudhost02, so if there 
> are logs on there that might help, just point me to them.  The 
> migration started around 2012-10-09 12:36:28.  I started fighting with 
> it again after 2PM.
>
>
>     Once you have those logs in a safe place, the only way to get the 
>     VM out of that status is to update the database manually. First make 
>     completely sure that the VM is not running in any host, then do the 
>     following:
>
>       # psql -U engine
>       psql (9.1.6)
>       Type "help" for help.
>
>       engine=> update vm_dynamic set status = 0 where vm_guid = (select
>     vm_gui from vm_static where vm_name = 'myvm');
>       UPDATE 1
>
>     (Assuming that the name of your VM is "myvm").
>
>
> That did the trick!  Thank you so much, Juan!  I had to fix a typo by 
> changing 'vm_gui' to 'vm_guid' after 'select', but the error from the 
> first try made that clear.  Regardless, it was all I needed.   Here is 
> the corrected syntax all on one line for others that may be looking 
> for the same fix:
>
> engine=> update vm_dynamic set status = 0 where vm_guid = 
> (select vm_guid from vm_static where vm_name = 'myvm');
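>
> For anyone who wants to double-check the result, a similar query works 
> (again just a sketch; that run_on_vds lives in vm_dynamic next to status 
> is my assumption, the rest comes straight from Juan's query):
>
> engine=> select s.vm_name, d.status, d.run_on_vds
>            from vm_static s join vm_dynamic d on s.vm_guid = d.vm_guid
>           where s.vm_name = 'myvm';
>
> After the update, status should come back as 0 and the VM should show up 
> as "down" in the GUI.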
>
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
