[Users] Vm's being paused
Dafna Ron
dron at redhat.com
Wed Jan 29 11:14:05 UTC 2014
The reason I asked about the size is because this was the original issue,
no? VMs pausing on lack of space?
You're having a problem with your data domains.
Can you check the route from the hosts to the storage? I think that you
have some disconnection to the storage from the hosts.
Since it's random and not from all the VMs, I would suggest that it's a
routing problem?
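
If it helps, here is a rough sketch (assuming multipathed FC, which is
what you describe further down - adjust for your setup) that you could run
as root on each node to flag any dead or flaky paths:

# Rough sketch: flag multipath paths that do not look healthy.
# Assumes multipathed FC storage with the multipath tools installed.
import subprocess

proc = subprocess.Popen(["multipath", "-ll"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
out = out.decode("utf-8", "replace")
suspect = [line.strip() for line in out.splitlines()
           if "failed" in line or "faulty" in line or "offline" in line]
if suspect:
    print("Suspect paths on this host:")
    for line in suspect:
        print("  " + line)
else:
    print("All multipath paths look healthy on this host.")
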
Thanks,
Dafna
On 01/29/2014 08:00 AM, Neil wrote:
> Sorry, more on this issue: I see my logs are rapidly filling up my
> disk space on node02 with this error in /var/log/messages...
>
> Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
> vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed:
> <AdvancedStatsFunction _highWrite at 0x1c2fb90>#012Traceback (most
> recent call last):#012 File "/usr/share/vdsm/sampling.py", line 351,
> in collect#012 statsFunction()#012 File
> "/usr/share/vdsm/sampling.py", line 226, in __call__#012 retValue =
> self._function(*args, **kwargs)#012 File "/usr/share/vdsm/vm.py",
> line 513, in _highWrite#012 self._vm._dom.blockInfo(vmDrive.path,
> 0)#012 File "/usr/share/vdsm/vm.py", line 835, in f#012 ret =
> attr(*args, **kwargs)#012 File
> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line
> 76, in wrapper#012 ret = f(*args, **kwargs)#012 File
> "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in
> blockInfo#012 if ret is None: raise libvirtError
> ('virDomainGetBlockInfo() failed', dom=self)#012libvirtError: invalid
> argument: invalid path
> /rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8
> not assigned to domain
> Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
> vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed:
> <AdvancedStatsFunction _highWrite at 0x1c2fb90>#012Traceback (most
> recent call last):#012 File "/usr/share/vdsm/sampling.py", line 351,
> in collect#012 statsFunction()#012 File
> "/usr/share/vdsm/sampling.py", line 226, in __call__#012 retValue =
> self._function(*args, **kwargs)#012 File "/usr/share/vdsm/vm.py",
> line 509, in _highWrite#012 if not vmDrive.blockDev or
> vmDrive.format != 'cow':#012AttributeError: 'Drive' object has no
> attribute 'format'
>
> Not sure if this is related at all though?
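>
> (As a quick sanity check, here is a minimal sketch of how one might
> confirm on node02 whether that path is actually present and its LV
> active - it assumes the VG is named after the storage domain UUID,
> which is what vdsm normally does for block domains:)
>
> import os
> import subprocess
>
> path = ("/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/"
>         "images/fac8a3bb-e414-43c0-affc-6e2628757a28/"
>         "6c3e5ae8-23fc-4196-ba42-778bdc0fbad8")
> print("path exists: %s" % os.path.exists(path))
>
> # List the LVs of the domain's VG; the attr string contains an 'a' when
> # the LV is active on this host.
> proc = subprocess.Popen(
>     ["lvs", "--noheadings", "-o", "lv_name,lv_attr",
>      "0e6991ae-6238-4c61-96d2-ca8fed35161e"],
>     stdout=subprocess.PIPE)
> out, _ = proc.communicate()
> print(out.decode("utf-8", "replace"))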
>
> Thanks.
>
> Regards.
>
> Neil Wilson.
>
> On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson123 at gmail.com> wrote:
>> Hi Dafna,
>>
>> Thanks for clarifying that. I found the migration issue, and it was
>> resolved once I sorted out the ISO domain problem.
>>
>> I'm sorry, I don't understand your last question:
>> "> after the engine restart, do you still see a problem with the size,
>> or did the reported size change?"
>>
>> The migration issue was resolved; it's now just a matter of tracking
>> down why the two VMs paused on their own, one on the 8th of Jan (I
>> think) and one on the 19th of Jan.
>>
>> Thank you.
>>
>>
>> Regards.
>>
>> Neil Wilson.
>>
>>
>> On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <dron at redhat.com> wrote:
>>> Yes - the engine lost communication with vdsm and has no way of knowing
>>> whether the host is down or whether there was a network issue, so a
>>> network issue would cause the same errors that I see in the logs.
>>>
>>> The error you posted about the ISO is the reason the VMs failed
>>> migration - if a VM is running with a CD attached and the CD is gone,
>>> then the VM will not be able to be migrated.
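>>>
>>> A quick way to see which running VMs still have an ISO attached is
>>> something along these lines (just a rough sketch using read-only virsh;
>>> run it on each host and adjust as needed):
>>>
>>> import subprocess
>>>
>>> def run(cmd):
>>>     # Run a command and return its output as text.
>>>     proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
>>>     out, _ = proc.communicate()
>>>     return out.decode("utf-8", "replace")
>>>
>>> # Skip the two header lines of 'virsh -r list' and read the VM names.
>>> for line in run(["virsh", "-r", "list"]).splitlines()[2:]:
>>>     fields = line.split()
>>>     if len(fields) < 2:
>>>         continue
>>>     name = fields[1]
>>>     # domblklist shows each disk/cdrom target and its source path.
>>>     for dev in run(["virsh", "-r", "domblklist", name]).splitlines():
>>>         if dev.strip().endswith(".iso"):
>>>             print("%s still has an ISO attached: %s" % (name, dev.strip()))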
>>>
>>> After the engine restart, do you still see a problem with the size, or
>>> did the reported size change?
>>>
>>> Dafna
>>>
>>>
>>> On 01/28/2014 01:02 PM, Neil wrote:
>>>> Hi Dafna,
>>>>
>>>> Thanks for coming back to me. I'll try answer your queries one by one.
>>>>
>>>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron at redhat.com> wrote:
>>>>> You had a problem with your storage on the 14th of Jan and one of the
>>>>> hosts rebooted (if you have the vdsm log from that day then I can see
>>>>> what happened on the vdsm side).
>>>>> In the engine I could see a problem with the export domain, and this
>>>>> should not have caused a reboot.
>>>> 1.) Unfortunately I don't have logs going back that far. Looking at
>>>> all 3 hosts' uptime, the one with the least uptime is 21 days and the
>>>> others are all over 40 days, so there definitely wasn't a host that
>>>> rebooted on the 14th of Jan. Would a network issue or firewall issue
>>>> also cause the error you've seen to look as if a host rebooted? There
>>>> was a bonding mode change on the 14th of January, so perhaps this
>>>> caused the issue?
>>>>
>>>>
>>>>> Can you tell me if you had a problem with the data domain as well, or
>>>>> was it just the export domain? Were any VMs being exported/imported at
>>>>> that time?
>>>>> In any case - this is a bug.
>>>> 2.) I think this was the same day that the bonding mode was changed (by
>>>> mistake) on the host while it was live and running SPM. I haven't done
>>>> any importing or exporting on this oVirt setup for a few years.
>>>>
>>>>
>>>>> As for the VMs - if they are no longer in a migrating state then
>>>>> please restart the ovirt-engine service (looks like a cache issue).
>>>> 3.) Restarted ovirt-engine, logging now appears to be normal without any
>>>> errors.
>>>>
>>>>
>>>>> If they are in a migrating state - there should have been a timeout a
>>>>> long time ago.
>>>>> Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>>>>> all hosts?
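>>>>>
>>>>> If it is easier, something along these lines would compare the two
>>>>> listings on a host and show any VM that only one side knows about (a
>>>>> sketch only, assuming the 'list table' columns are vmId, pid, name,
>>>>> status, as in the output below):
>>>>>
>>>>> import subprocess
>>>>>
>>>>> def run(cmd):
>>>>>     proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
>>>>>     out, _ = proc.communicate()
>>>>>     return out.decode("utf-8", "replace")
>>>>>
>>>>> # VM names as vdsm reports them (third column of 'list table').
>>>>> vdsm_vms = set()
>>>>> for line in run(["vdsClient", "-s", "0", "list", "table"]).splitlines():
>>>>>     fields = line.split()
>>>>>     if len(fields) >= 3:
>>>>>         vdsm_vms.add(fields[2])
>>>>>
>>>>> # VM names as libvirt reports them (second column, after two header lines).
>>>>> virsh_vms = set()
>>>>> for line in run(["virsh", "-r", "list"]).splitlines()[2:]:
>>>>>     fields = line.split()
>>>>>     if len(fields) >= 2:
>>>>>         virsh_vms.add(fields[1])
>>>>>
>>>>> print("known only to vdsm:    %s" % sorted(vdsm_vms - virsh_vms))
>>>>> print("known only to libvirt: %s" % sorted(virsh_vms - vdsm_vms))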
>>>> 4.) Ran on all hosts...
>>>>
>>>> node01.blabla.com
>>>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3 11232 adam Up
>>>> 502170aa-0fc6-4287-bb08-5844be6e0352 13986 babbage Up
>>>> ff9036fb-1499-45e4-8cde-e350eee3c489 26733 reports Up
>>>> 2736197b-6dc3-4155-9a29-9306ca64881d 13804 tux Up
>>>> 0a3af7b2-ea94-42f3-baeb-78b950af4402 25257 Moodle Up
>>>>
>>>> Id Name State
>>>> ----------------------------------------------------
>>>> 1 adam running
>>>> 2 reports running
>>>> 4 tux running
>>>> 6 Moodle running
>>>> 7 babbage running
>>>>
>>>> node02.blabla.com
>>>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b 2879 spam Up
>>>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1 32454 proxy02 Up
>>>> ac2a3f99-a6db-4cae-955d-efdfb901abb7 5605 software Up
>>>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e 8870 zimbra Up
>>>>
>>>> Id Name State
>>>> ----------------------------------------------------
>>>> 9 proxy02 running
>>>> 10 spam running
>>>> 12 software running
>>>> 13 zimbra running
>>>>
>>>> node03.blabla.com
>>>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef 25809 dhcp Up
>>>> 16d3f077-b74c-4055-97d0-423da78d8a0c 23939 oliver Up
>>>>
>>>> Id Name State
>>>> ----------------------------------------------------
>>>> 13 oliver running
>>>> 14 dhcp running
>>>>
>>>>
>>>>> One last thing: your ISO domain seems to be having issues as well.
>>>>> This should not affect the host status, but if any of the VMs were
>>>>> booted from an ISO or have an ISO attached in the boot sequence, this
>>>>> would explain the migration issue.
>>>> There was an ISO domain issue a while back, but this was corrected
>>>> about 2 weeks ago after iptables re-enabled itself on boot after
>>>> running updates. I've checked now and the ISO domain appears to be
>>>> fine, and I can see all the images stored within.
>>>>
>>>> I've stumbled across what appears to be another error; all three hosts
>>>> are showing it over and over in /var/log/messages, and I'm not sure if
>>>> it's related? ...
>>>>
>>>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>>>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>>>> <AdvancedStatsFunction _highWrite at 0x2ce0998>#012Traceback (most
>>>> recent call last):#012 File "/usr/share/vdsm/sampling.py", line 351,
>>>> in collect#012 statsFunction()#012 File
>>>> "/usr/share/vdsm/sampling.py", line 226, in __call__#012 retValue =
>>>> self._function(*args, **kwargs)#012 File "/usr/share/vdsm/vm.py",
>>>> line 509, in _highWrite#012 if not vmDrive.blockDev or
>>>> vmDrive.format != 'cow':#012AttributeError: 'Drive' object has no
>>>> attribute 'format'
>>>>
>>>> I've attached the full vdsm log from node02 to this reply.
>>>>
>>>> Please shout if you need anything else.
>>>>
>>>> Thank you.
>>>>
>>>> Regards.
>>>>
>>>> Neil Wilson.
>>>>
>>>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>>>> Hi guys,
>>>>>>
>>>>>> Sorry for the very late reply, I've been out of the office doing
>>>>>> installations.
>>>>>> Unfortunately, due to the time delay, my oldest logs only go back as
>>>>>> far as the attached.
>>>>>>
>>>>>> I've only grep'd for Thread-286029 in the vdsm log. For the engine.log
>>>>>> I'm not sure what info is required, so the full log is attached.
>>>>>>
>>>>>> Please shout if you need any info or further details.
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Neil Wilson.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin at redhat.com>
>>>>>> wrote:
>>>>>>> Could you please attach the engine.log from the same time?
>>>>>>>
>>>>>>> thanks!
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Neil" <nwilson123 at gmail.com>
>>>>>>>> To: dron at redhat.com
>>>>>>>> Cc: "users" <users at ovirt.org>
>>>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>>>
>>>>>>>> Hi Dafna,
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>>>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>>>>>
>>>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>>>>> from June and Oct 2013.
>>>>>>>>
>>>>>>>> I'm using FC storage with 11 VMs and 3 nodes/hosts; 9 of the 11 VMs
>>>>>>>> have thin provisioned disks.
>>>>>>>>
>>>>>>>> Please shout if you'd like any further info or logs.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> Neil Wilson.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>>> Hi Neil,
>>>>>>>>>
>>>>>>>>> Can you please attach the vdsm logs?
>>>>>>>>> Also, as for the VMs, do they have any snapshots?
>>>>>>>>> From your suggestion to allocate more LUNs, are you using iSCSI or
>>>>>>>>> FC?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dafna
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>>>> Thanks for the replies guys,
>>>>>>>>>>
>>>>>>>>>> Looking in the oVirt GUI at the two VMs that have paused so far, the
>>>>>>>>>> following sizes show under Disks.
>>>>>>>>>>
>>>>>>>>>> VM Reports:
>>>>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>>>>> Looking on the CentOS OS side, disk size is 33G and used is 12G with
>>>>>>>>>> 19G available (40% usage).
>>>>>>>>>>
>>>>>>>>>> VM Babbage:
>>>>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>>>>> On the Server 2003 OS side, disk size is 39.9GB and used is 16.3GB,
>>>>>>>>>> so under 50% usage.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Do you see any issues with the above stats?
>>>>>>>>>>
>>>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>>>
>>>>>>>>>> Size: 6887 GB
>>>>>>>>>> Available: 1948 GB
>>>>>>>>>> Used: 4939 GB
>>>>>>>>>> Allocated: 1196 GB
>>>>>>>>>> Over Allocation: 61%
>>>>>>>>>>
>>>>>>>>>> Could there be a problem here? I can allocate additional LUNs if you
>>>>>>>>>> feel the space isn't correctly allocated.
>>>>>>>>>>
>>>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>>>> important machine locks up.
>>>>>>>>>>
>>>>>>>>>> Thank you and much appreciated.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> Neil Wilson.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>>>>> The storage space threshold is configured in percentages and not
>>>>>>>>>>> physical size, so if 20G is less than 10% (the default config) of your
>>>>>>>>>>> storage, it will pause the VMs regardless of how many GB you still
>>>>>>>>>>> have.
>>>>>>>>>>> This is configurable though, so you can change it to less than 10% if
>>>>>>>>>>> you like.
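>>>>>>>>>>>
>>>>>>>>>>> As an illustration only (not the actual engine code), the check is
>>>>>>>>>>> along these lines, which is why a domain can pause VMs while it still
>>>>>>>>>>> holds a lot of GB:
>>>>>>>>>>>
>>>>>>>>>>> # Illustration of a percentage-based low-space check (not the real code).
>>>>>>>>>>> def below_threshold(total_gb, available_gb, threshold_pct=10):
>>>>>>>>>>>     free_pct = 100.0 * available_gb / total_gb
>>>>>>>>>>>     return free_pct < threshold_pct, free_pct
>>>>>>>>>>>
>>>>>>>>>>> # e.g. a 7000 GB domain with 500 GB free is already under a 10%
>>>>>>>>>>> # threshold, even though 500 GB sounds like plenty.
>>>>>>>>>>> critical, free_pct = below_threshold(7000, 500)
>>>>>>>>>>> print("free: %.1f%%  critical: %s" % (free_pct, critical))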
>>>>>>>>>>>
>>>>>>>>>>> To answer the second question, VMs will not pause on an ENOSPC error
>>>>>>>>>>> if they run out of space internally, but only if the external storage
>>>>>>>>>>> cannot be consumed. So the pause happens only if you run out of space
>>>>>>>>>>> on the storage, and not if the VM runs out of space on its own
>>>>>>>>>>> filesystem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, attached is the engine.log; I've taken out the two sections
>>>>>>>>>>>> where each of the VMs was paused.
>>>>>>>>>>>>
>>>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space
>>>>>>>>>>>> error"
>>>>>>>>>>>> mean the main storage domain has run out of storage, or that the
>>>>>>>>>>>> VM
>>>>>>>>>>>> has run out?
>>>>>>>>>>>>
>>>>>>>>>>>> Both VMs appear to have been running on node01 when they were
>>>>>>>>>>>> paused.
>>>>>>>>>>>> My vdsm versions are all...
>>>>>>>>>>>>
>>>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>>>
>>>>>>>>>>>> I currently have a 61% over allocation ratio on my primary storage
>>>>>>>>>>>> domain, with 1948GB available.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>>>> The VMs are thin provisioned. The Server 2003 VM hasn't run out of
>>>>>>>>>>>>> disk space - there is about 20 GB free, and the usage barely grows as
>>>>>>>>>>>>> the VM only shares printers. The other VM that paused is also on thin
>>>>>>>>>>>>> provisioned disks and also has plenty of space; this guest is running
>>>>>>>>>>>>> CentOS 6.3 64-bit and only runs basic reporting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>>>>>> unplugged in oVirt, and we had to remove it and re-add it in order to
>>>>>>>>>>>>> correct the issue. The CentOS VM did not have the same issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite critical -
>>>>>>>>>>>>> any thoughts or ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to the
>>>>>>>>>>>>> official CentOS repo and updating to 3.3.1-2. Prior to updating I
>>>>>>>>>>>>> hadn't had this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny at gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse disks?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space on
>>>>>>>>>>>>>> the storage domain, and it is done intentionally, so that the VM will
>>>>>>>>>>>>>> not experience disk corruption. If you have thin provisioned disks
>>>>>>>>>>>>>> and the VM writes to its disks faster than the disks can grow, this
>>>>>>>>>>>>>> is exactly what you will see.
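>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Very roughly, the host-side logic is along these lines (a conceptual
>>>>>>>>>>>>>> sketch, not vdsm's actual code; the watermark percentage is made up
>>>>>>>>>>>>>> for the example):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # Conceptual sketch of thin-provisioning behaviour (not vdsm's real code).
>>>>>>>>>>>>>> def next_action(written_bytes, allocated_bytes, watermark_pct=80):
>>>>>>>>>>>>>>     if written_bytes >= allocated_bytes:
>>>>>>>>>>>>>>         # The guest hit the end of the allocated space before an
>>>>>>>>>>>>>>         # extension completed (or none was possible) -> qemu gets
>>>>>>>>>>>>>>         # ENOSPC and the VM is paused to avoid corruption.
>>>>>>>>>>>>>>         return "pause"
>>>>>>>>>>>>>>     if written_bytes >= allocated_bytes * watermark_pct / 100.0:
>>>>>>>>>>>>>>         # Close to the allocation: ask for the LV to be extended.
>>>>>>>>>>>>>>         return "extend"
>>>>>>>>>>>>>>     return "ok"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> GiB = 1024 ** 3
>>>>>>>>>>>>>> print(next_action(9 * GiB, 10 * GiB))   # extend
>>>>>>>>>>>>>> print(next_action(10 * GiB, 10 * GiB))  # pause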
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've had two different VMs randomly pause this past week, and inside
>>>>>>>>>>>>>>> oVirt the error received is something like 'VM ran out of storage and
>>>>>>>>>>>>>>> was paused'.
>>>>>>>>>>>>>>> Resuming the VMs didn't work, and I had to force them off and then
>>>>>>>>>>>>>>> on, which resolved the issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I realise this is very vague, so please let me know which logs to
>>>>>>>>>>>>>>> send in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Dafna Ron
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dafna Ron
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>
>>>>> --
>>>>> Dafna Ron
>>>
>>>
>>> --
>>> Dafna Ron
--
Dafna Ron