[Users] Vm's being paused
Michal Skrivanek
michal.skrivanek at redhat.com
Wed Jan 29 08:26:45 UTC 2014
On Jan 28, 2014, at 19:18 , Dafna Ron <dron at redhat.com> wrote:
> yes - the engine lost communication with vdsm, and it has no way of knowing whether the host is down or there was a network issue, so a network issue would cause the same errors that I see in the logs.
>
> The error you pasted about the ISO is the reason the VMs failed migration - if a VM is running with a CD attached and the CD is gone, then the VM will not be able to be migrated.
which, as I learned last week, is not entirely correct. A pure libvirt VM seems to migrate fine… so it must be something somewhere in oVirt :(
looking into it
but just for future reference we want it to work:)
>
> after the engine restart, do you still see a problem with the size, or did the reported size change?
>
> Dafna
>
> On 01/28/2014 01:02 PM, Neil wrote:
>> Hi Dafna,
>>
>> Thanks for coming back to me. I'll try answer your queries one by one.
>>
>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron at redhat.com> wrote:
>>> you had a problem with your storage on the 14th of Jan and one of the hosts
>>> rebooted (if you have the vdsm log from that day, then I can see what
>>> happened on the vdsm side).
>>> In the engine log I could see a problem with the export domain, and this
>>> should not have caused a reboot.
>> 1.) Unfortunately I don't have logs going back that far. Looking at
>> the uptime of all 3 hosts, the one with the least uptime is 21 days
>> and the others are all over 40 days, so there definitely wasn't a
>> host that rebooted on the 14th of Jan. Would a network or firewall
>> issue also cause the error you've seen to look as if a host rebooted?
>> There was a bonding mode change on the 14th of January, so perhaps
>> that caused the issue?
>>
>>
>>> Can you tell me if you had a problem with the data
>>> domain as well, or was it just the export domain? Were any VMs being
>>> exported/imported at that time?
>>> In any case - this is a bug.
>> 2.) I think this was the same day that the bonding mode was changed on
>> the host while the host was live (by mistake), and had SPM running on
>> it. I haven't done any importing or exporting for a few years on this
>> oVirt setup.
>>
>>
>>> As for the VMs - if they are no longer in a migrating state, then
>>> please restart the ovirt-engine service (looks like a cache issue)
>> 3.) Restarted ovirt-engine, logging now appears to be normal without any errors.
>>
>>
>>> if they are still in a migrating state - there should have been a
>>> timeout a long time ago.
>>> can you please run 'vdsClient -s 0 list table' and 'virsh -r list'
>>> on all hosts?
>> 4.) Ran on all hosts...
>>
>> node01.blabla.com
>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3 11232 adam Up
>> 502170aa-0fc6-4287-bb08-5844be6e0352 13986 babbage Up
>> ff9036fb-1499-45e4-8cde-e350eee3c489 26733 reports Up
>> 2736197b-6dc3-4155-9a29-9306ca64881d 13804 tux Up
>> 0a3af7b2-ea94-42f3-baeb-78b950af4402 25257 Moodle Up
>>
>> Id Name State
>> ----------------------------------------------------
>> 1 adam running
>> 2 reports running
>> 4 tux running
>> 6 Moodle running
>> 7 babbage running
>>
>> node02.blabla.com
>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b 2879 spam Up
>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1 32454 proxy02 Up
>> ac2a3f99-a6db-4cae-955d-efdfb901abb7 5605 software Up
>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e 8870 zimbra Up
>>
>> Id Name State
>> ----------------------------------------------------
>> 9 proxy02 running
>> 10 spam running
>> 12 software running
>> 13 zimbra running
>>
>> node03.blabla.com
>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef 25809 dhcp Up
>> 16d3f077-b74c-4055-97d0-423da78d8a0c 23939 oliver Up
>>
>> Id Name State
>> ----------------------------------------------------
>> 13 oliver running
>> 14 dhcp running
>>
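The point of running both `vdsClient -s 0 list table` and `virsh -r list` is to confirm that vdsm's view of the running VMs matches libvirt's on each host. A quick sketch of that cross-check, with the VM names taken from the node01 output above:

```python
# Names copied from the node01 listings above: vdsClient output on the
# left, virsh -r list on the right. A VM present in only one set would
# point at a stale entry in vdsm, or a VM libvirt runs that vdsm lost.
vdsm_vms = {"adam", "babbage", "reports", "tux", "Moodle"}
libvirt_vms = {"adam", "reports", "tux", "Moodle", "babbage"}

stale_in_vdsm = vdsm_vms - libvirt_vms
unknown_to_vdsm = libvirt_vms - vdsm_vms
print(stale_in_vdsm, unknown_to_vdsm)  # both empty here: the views agree
```

Here both sets match on all three nodes, which is why restarting ovirt-engine (a cache issue on the engine side) was the suggested fix rather than anything on the hosts.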
>>
>>> One last thing: your ISO domain seems to be having issues as well.
>>> This should not affect the host status, but if any of the VMs were
>>> booted from an ISO or have an ISO attached in the boot sequence,
>>> that would explain the migration issue.
>> There was an ISO domain issue a while back, but it was corrected
>> about 2 weeks ago, after iptables re-enabled itself on boot following
>> updates. I've checked now and the ISO domain appears to be fine, and
>> I can see all the images stored within.
>>
>> I've stumbled across what appears to be another error; all three
>> hosts are showing this over and over in /var/log/messages, and I'm
>> not sure if it's related:
>>
>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>> <AdvancedStatsFunction _highWrite at 0x2ce0998>
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/sampling.py", line 351, in collect
>>     statsFunction()
>>   File "/usr/share/vdsm/sampling.py", line 226, in __call__
>>     retValue = self._function(*args, **kwargs)
>>   File "/usr/share/vdsm/vm.py", line 509, in _highWrite
>>     if not vmDrive.blockDev or vmDrive.format != 'cow':
>> AttributeError: 'Drive' object has no attribute 'format'
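The failing check in the traceback assumes every Drive object carries a `format` attribute. A defensive sketch of that guard - the `Drive` stand-in below is hypothetical, modeled only on the attributes named in the traceback, and this is not vdsm's actual fix:

```python
class Drive:
    """Hypothetical stand-in for vdsm's Drive object, modeled only on the
    attributes named in the traceback above. Real drives built from an
    incomplete config can apparently lack the 'format' attribute."""
    def __init__(self, blockDev=True, fmt=None):
        self.blockDev = blockDev
        if fmt is not None:
            self.format = fmt  # only set when known, mimicking the bug

def needs_watermark_check(drive):
    # The failing line was:
    #   if not vmDrive.blockDev or vmDrive.format != 'cow':
    # which raises AttributeError when 'format' was never set. getattr with
    # a default skips such drives instead of killing the stats thread.
    return drive.blockDev and getattr(drive, "format", None) == "cow"

drives = [Drive(fmt="cow"), Drive(fmt="raw"), Drive()]  # last one lacks 'format'
print([needs_watermark_check(d) for d in drives])  # [True, False, False]
```

The practical effect of the unguarded check is that the stats sampling function dies repeatedly, which is why the error recurs in /var/log/messages on all three hosts.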
>>
>> I've attached the full vdsm log from node02 to this reply.
>>
>> Please shout if you need anything else.
>>
>> Thank you.
>>
>> Regards.
>>
>> Neil Wilson.
>>
>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>> Hi guys,
>>>>
>>>> Sorry for the very late reply, I've been out of the office doing
>>>> installations.
>>>> Unfortunately due to the time delay, my oldest logs are only as far
>>>> back as the attached.
>>>>
>>>> I've only grep'd for Thread-286029 in the vdsm log. The engine.log I'm
>>>> not sure what info is required, so the full log is attached.
>>>>
>>>> Please shout if you need any info or further details.
>>>>
>>>> Thank you very much.
>>>>
>>>> Regards.
>>>>
>>>> Neil Wilson.
>>>>
>>>>
>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin at redhat.com>
>>>> wrote:
>>>>> Could you please attach the engine.log from the same time?
>>>>>
>>>>> thanks!
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Neil" <nwilson123 at gmail.com>
>>>>>> To: dron at redhat.com
>>>>>> Cc: "users" <users at ovirt.org>
>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>
>>>>>> Hi Dafna,
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>>>
>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>>> from June and Oct 2013.
>>>>>>
>>>>>> I'm using FC storage with 11 VMs and 3 nodes/hosts; 9 of the 11
>>>>>> VMs have thin-provisioned disks.
>>>>>>
>>>>>> Please shout if you'd like any further info or logs.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Neil Wilson.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> Can you please attach the vdsm logs?
>>>>>>> also, as for the vm's, do they have any snapshots?
>>>>>>> from your suggestion to allocate more luns, are you using iscsi or FC?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dafna
>>>>>>>
>>>>>>>
>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>> Thanks for the replies guys,
>>>>>>>>
>>>>>>>> Looking in the oVirt GUI at my two VMs that have paused so far,
>>>>>>>> the following sizes show under Disks.
>>>>>>>>
>>>>>>>> VM Reports:
>>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>>> Looking on the Centos OS side, Disk size is 33G and used is 12G with
>>>>>>>> 19G available (40%) usage.
>>>>>>>>
>>>>>>>> VM Babbage:
>>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>>> On the Server 2003 OS side, Disk size is 39.9Gb and used is 16.3G, so
>>>>>>>> under 50% usage.
>>>>>>>>
>>>>>>>>
>>>>>>>> Do you see any issues with the above stats?
>>>>>>>>
>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>
>>>>>>>> Size: 6887 GB
>>>>>>>> Available: 1948 GB
>>>>>>>> Used: 4939 GB
>>>>>>>> Allocated: 1196 GB
>>>>>>>> Over Allocation: 61%
>>>>>>>>
>>>>>>>> Could there be a problem here? I can allocate additional LUNs if
>>>>>>>> you feel the space isn't correctly allocated.
>>>>>>>>
>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>> important machine locks up.
>>>>>>>>
>>>>>>>> Thank you and much appreciated.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> Neil Wilson.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>>> the free-space threshold is configured as a percentage, not a
>>>>>>>>> physical size, so if 20G is less than 10% (the default config) of
>>>>>>>>> your storage it will pause the VMs regardless of how many GB you
>>>>>>>>> still have free.
>>>>>>>>> this is configurable, though, so you can change it to less than
>>>>>>>>> 10% if you like.
>>>>>>>>>
>>>>>>>>> to answer the second question: VMs will not pause on an ENOSPC
>>>>>>>>> error if they run out of space internally, only if the external
>>>>>>>>> storage cannot be consumed. So pausing happens only when you run
>>>>>>>>> out of space on the storage domain, not when a VM runs out of
>>>>>>>>> space in its own filesystem.
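Dafna's rule - pausing keys off a free-space percentage, not an absolute size - can be checked against the domain numbers quoted elsewhere in the thread (6887 GB total, 1948 GB available). The helper below is only an illustration of the rule; the 10% figure is the default she mentions, not a value read from any config:

```python
def below_watermark(total_gb, available_gb, threshold_pct=10):
    """True when free space drops under the threshold percentage - the
    pause condition described above. Illustration only; 10% is the
    default mentioned in the thread, not read from engine-config."""
    return available_gb / total_gb * 100 < threshold_pct

# Neil's data domain: 6887 GB total, 1948 GB available, i.e. about 28%
# free - well above the 10% watermark, so this threshold is not what
# paused his VMs.
print(below_watermark(6887, 1948))
```

Since the domain sits at roughly 28% free, the pauses here point at the per-disk extension path rather than the domain-wide low-space watermark.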
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> Sorry, attached is engine.log I've taken out the two sections where
>>>>>>>>>> each of the VM's were paused.
>>>>>>>>>>
>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space error"
>>>>>>>>>> mean the main storage domain has run out of storage, or that the VM
>>>>>>>>>> has run out?
>>>>>>>>>>
>>>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>>>> paused.
>>>>>>>>>> My vdsm versions are all...
>>>>>>>>>>
>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>
>>>>>>>>>> I currently have a 61% over allocation ratio on my primary storage
>>>>>>>>>> domain, with 1948GB available.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> Neil Wilson.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123 at gmail.com> wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>> The VMs are thin provisioned. The Server 2003 VM hasn't run out
>>>>>>>>>>> of disk space - there is about 20 GB free, and the usage barely
>>>>>>>>>>> grows as the VM only shares printers. The other VM that paused
>>>>>>>>>>> is also on thin-provisioned disks and also has plenty of space;
>>>>>>>>>>> this guest is running CentOS 6.3 64-bit and only runs basic
>>>>>>>>>>> reporting.
>>>>>>>>>>>
>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>>>> unplugged in oVirt, and we had to remove and re-add it to correct
>>>>>>>>>>> the issue. The CentOS VM did not have the same problem.
>>>>>>>>>>>
>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite critical,
>>>>>>>>>>> any thoughts or ideas?
>>>>>>>>>>>
>>>>>>>>>>> The only recent changes have been moving from Dreyou 3.2 to the
>>>>>>>>>>> official CentOS repo and updating to 3.3.1-2. Prior to updating
>>>>>>>>>>> I hadn't had this issue.
>>>>>>>>>>>
>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Do you have the VMs on thin-provisioned storage or sparse disks?
>>>>>>>>>>>>
>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space
>>>>>>>>>>>> on the storage domain, and it is done intentionally, so that the
>>>>>>>>>>>> VM will not experience disk corruption. If you have
>>>>>>>>>>>> thin-provisioned disks, and the VM writes to its disks faster
>>>>>>>>>>>> than the disks can grow, this is exactly what you will see.
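Dan's description can be sketched as a watermark check: thin qcow2-on-LVM disks grow in chunks, and the VM is paused when guest writes outrun the extension round trip. The function and the 512 MB watermark below are illustrative assumptions, not vdsm's actual implementation or defaults:

```python
def should_extend(allocation_mb, capacity_mb, watermark_mb=512):
    """Thin (qcow2-on-LVM) disks are grown in chunks: when the guest's
    highest write offset comes within watermark_mb of the current LV
    size, vdsm asks the SPM to extend the volume. If guest writes outrun
    that round trip, qemu hits ENOSPC and pauses the VM to avoid
    corruption. The 512 MB watermark here is an illustrative value."""
    return capacity_mb - allocation_mb < watermark_mb

print(should_extend(7700, 8192))  # 492 MB headroom left: ask for an extend
print(should_extend(1000, 8192))  # plenty of headroom: nothing to do
```

This also explains why the affected VMs had plenty of free space inside the guest: the pause is about the host-side LV lagging behind, not the guest filesystem filling up.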
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've had two different VMs randomly pause this past week, and
>>>>>>>>>>>>> inside oVirt the error received is something like 'vm ran out
>>>>>>>>>>>>> of storage and was paused'.
>>>>>>>>>>>>> Resuming the VMs didn't work; I had to force them off and then
>>>>>>>>>>>>> on, which resolved the issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I realise this is very vague so if you could please let me know
>>>>>>>>>>>>> which
>>>>>>>>>>>>> logs
>>>>>>>>>>>>> to send in.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dafna Ron
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dafna Ron
>>>>>>
>>>
>>> --
>>> Dafna Ron
>
>
> --
> Dafna Ron