The reason I asked about the size is because this was the original issue,
no? vm's pausing on lack of space?
You're having a problem with your data domains.
Can you check the route from the hosts to the storage? I think that you
have some disconnection to the storage from the hosts.
Since it's random and not from all the vm's, I would suggest that it's a
routing problem.
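
For example, a rough check along these lines on each host would show
whether the storage is still reachable (a sketch only -
'storage01.example.com' is a placeholder, and for FC block storage the
multipath check is the relevant one rather than the IP route):

#!/usr/bin/env python
# Rough per-host storage-path check; the target below is a placeholder.
import subprocess

STORAGE_TARGET = "storage01.example.com"  # replace with your storage head

# Network path (only meaningful for iSCSI/NFS storage):
subprocess.call(["ping", "-c", "3", STORAGE_TARGET])

# For FC block storage, verify that all paths to the LUNs are active:
subprocess.call(["multipath", "-ll"])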
Thanks,
Dafna
On 01/29/2014 08:00 AM, Neil wrote:
Sorry, more on this issue, I see my logs are rapidly filling up my
disk space on node02 with this error in /var/log/messages...
Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8
not assigned to domain
Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'
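
For anyone hitting the same AttributeError: the check at vm.py line 509 in
the traceback assumes every Drive object carries a 'format' attribute. A
minimal defensive sketch of that check (an illustration only, not the
actual vdsm fix):

# Illustration only - roughly the vm.py line 509 check from the traceback,
# with getattr guards so a Drive lacking the 'format' attribute is skipped
# instead of raising AttributeError.
def should_watch_high_write(vmDrive):
    # Only thin ('cow', i.e. qcow2) block devices need watermark checks.
    if not getattr(vmDrive, 'blockDev', False):
        return False
    return getattr(vmDrive, 'format', None) == 'cow'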
Not sure if this is related at all though?
Thanks.
Regards.
Neil Wilson.
On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson123(a)gmail.com> wrote:
> Hi Dafna,
>
> Thanks for clarifying that, I found the migration issue and this was
> resolved once I sorted out the ISO domain problem.
>
> I'm sorry, I don't understand your last question:
> "> after the engine restart, do you still see a problem with the size
> or did the report of size change?"
>
> The migration issue was resolved; I'm now just trying to track down
> why the two VM's paused on their own, one on the 8th of Jan (I think)
> and one on the 19th of Jan.
>
> Thank you.
>
>
> Regards.
>
> Neil Wilson.
>
>
> On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <dron(a)redhat.com> wrote:
>> yes - engine lost communication with vdsm, and it has no way of knowing
>> whether the host is down or there was a network issue, so a network issue
>> would cause the same errors that I see in the logs.
>>
>> The error you posted about the iso is the reason the vm's have failed
>> migration - if a vm is run with a cd and the cd is gone, then the vm will
>> not be able to be migrated.
>>
>> after the engine restart, do you still see a problem with the size
>> or did the report of size change?
>>
>> Dafna
>>
>>
>> On 01/28/2014 01:02 PM, Neil wrote:
>>> Hi Dafna,
>>>
>>> Thanks for coming back to me. I'll try answer your queries one by one.
>>>
>>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron(a)redhat.com> wrote:
>>>> you had a problem with your storage on the 14th of Jan and one of the
>>>> hosts rebooted (if you have the vdsm log from that day then I can see
>>>> what happened on the vdsm side).
>>>> in engine, I could see a problem with the export domain, and this
>>>> should not have caused a reboot.
>>> 1.) Unfortunately I don't have logs going back that far. Looking at
>>> all 3 hosts' uptime, the one with the least uptime is 21 days and the
>>> others are all over 40 days, so there definitely wasn't a host that
>>> rebooted on the 14th of Jan. Would a network issue or firewall issue
>>> also cause the error you've seen to look as if a host rebooted? There
>>> was a bonding mode change on the 14th of January, so perhaps this
>>> caused the issue?
>>>
>>>
>>>> Can you tell me if you had a problem with the data domain as well, or
>>>> was it just the export domain? were you exporting/importing any vm's
>>>> at that time?
>>>> In any case - this is a bug.
>>> 2.) I think this was the same day that the bonding mode was changed (by
>>> mistake) on the host while it was live and running SPM. I haven't done
>>> any importing or exporting for a few years on this oVirt setup.
>>>
>>>
>>>> As for the vm's - if the vm's are no longer in migrating state then
>>>> please restart the ovirt-engine service (looks like a cache issue)
>>> 3.) Restarted ovirt-engine, logging now appears to be normal without any
>>> errors.
>>>
>>>
>>>> if they are in migrating state - there should have been a timeout a
>>>> long time ago.
>>>> can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>>>> all hosts?
>>> 4.) Ran on all hosts...
>>>
>>>
>>> node01.blabla.com
>>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3 11232 adam Up
>>> 502170aa-0fc6-4287-bb08-5844be6e0352 13986 babbage Up
>>> ff9036fb-1499-45e4-8cde-e350eee3c489 26733 reports Up
>>> 2736197b-6dc3-4155-9a29-9306ca64881d 13804 tux Up
>>> 0a3af7b2-ea94-42f3-baeb-78b950af4402 25257 Moodle Up
>>>
>>> Id Name State
>>> ----------------------------------------------------
>>> 1 adam running
>>> 2 reports running
>>> 4 tux running
>>> 6 Moodle running
>>> 7 babbage running
>>>
>>>
>>> node02.blabla.com
>>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b 2879 spam Up
>>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1 32454 proxy02 Up
>>> ac2a3f99-a6db-4cae-955d-efdfb901abb7 5605 software Up
>>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e 8870 zimbra Up
>>>
>>> Id Name State
>>> ----------------------------------------------------
>>> 9 proxy02 running
>>> 10 spam running
>>> 12 software running
>>> 13 zimbra running
>>>
>>>
>>> node03.blabla.com
>>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef 25809 dhcp Up
>>> 16d3f077-b74c-4055-97d0-423da78d8a0c 23939 oliver Up
>>>
>>> Id Name State
>>> ----------------------------------------------------
>>> 13 oliver running
>>> 14 dhcp running
>>>
>>>
>>>> Last thing is that your ISO domain seems to be having issues as well.
>>>> This should not affect the host status, but if any of the vm's were
>>>> booted from an iso or have an iso attached in the boot sequence, this
>>>> will explain the migration issue.
>>> There was an ISO domain issue a while back, but this was corrected
>>> about 2 weeks ago after iptables re-enabled itself on boot after
>>> running updates, I've checked now and the ISO domain appears to be
>>> fine and I can see all the images stored within.
>>>
>>> I've stumbled across what appears to be another error; all three
>>> hosts are showing this over and over in /var/log/messages, and I'm not
>>> sure if it's related ...
>>>
>>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>>> <AdvancedStatsFunction _highWrite at 0x2ce0998>
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/sampling.py", line 351, in collect
>>>     statsFunction()
>>>   File "/usr/share/vdsm/sampling.py", line 226, in __call__
>>>     retValue = self._function(*args, **kwargs)
>>>   File "/usr/share/vdsm/vm.py", line 509, in _highWrite
>>>     if not vmDrive.blockDev or vmDrive.format != 'cow':
>>> AttributeError: 'Drive' object has no attribute 'format'
>>>
>>> I've attached the full vdsm log from node02 to this reply.
>>>
>>> Please shout if you need anything else.
>>>
>>> Thank you.
>>>
>>> Regards.
>>>
>>> Neil Wilson.
>>>
>>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>>> Hi guys,
>>>>>
>>>>> Sorry for the very late reply, I've been out of the office doing
>>>>> installations.
>>>>> Unfortunately due to the time delay, my oldest logs are only as far
>>>>> back as the attached.
>>>>>
>>>>> I've only grep'd for Thread-286029 in the vdsm log. For the engine.log
>>>>> I'm not sure what info is required, so the full log is attached.
>>>>>
>>>>> Please shout if you need any info or further details.
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Neil Wilson.
>>>>>
>>>>>
>>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin(a)redhat.com>
>>>>> wrote:
>>>>>> Could you please attach the engine.log from the same time?
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Neil" <nwilson123(a)gmail.com>
>>>>>>> To: dron(a)redhat.com
>>>>>>> Cc: "users" <users(a)ovirt.org>
>>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>>
>>>>>>> Hi Dafna,
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> The vdsm logs are quite large, so I've only attached the logs for
>>>>>>> the pause of the VM called Babbage on the 19th of Jan.
>>>>>>>
>>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>>>> from June and Oct 2013.
>>>>>>>
>>>>>>> I'm using FC storage, with 11 VM's and 3 nodes/hosts; 9 of the 11
>>>>>>> VM's have thin provisioned disks.
>>>>>>>
>>>>>>> Please shout if you'd like any further info or logs.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> Neil Wilson.
>>>>>>>
>>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron(a)redhat.com> wrote:
>>>>>>>> Hi Neil,
>>>>>>>>
>>>>>>>> Can you please attach the vdsm logs?
>>>>>>>> also, as for the vm's, do they have any snapshots?
>>>>>>>> from your suggestion to allocate more luns, are you using iscsi or
>>>>>>>> FC?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dafna
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>>> Thanks for the replies guys,
>>>>>>>>>
>>>>>>>>> Looking at my two VM's that have paused so far through the oVirt
>>>>>>>>> GUI, the following sizes show under Disks.
>>>>>>>>>
>>>>>>>>> VM Reports:
>>>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>>>> Looking on the Centos OS side, Disk size is 33G and used is 12G
>>>>>>>>> with 19G available (40%) usage.
>>>>>>>>>
>>>>>>>>> VM Babbage:
>>>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>>>> On the Server 2003 OS side, Disk size is 39.9GB and used is 16.3G,
>>>>>>>>> so under 50% usage.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you see any issues with the above stats?
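
One note on those numbers: actual size exceeding virtual size is expected
for thin disks that carry snapshots, since each COW layer allocates its own
extents. A purely illustrative breakdown for the Reports disk (the layer
split is an assumption, not a measurement):

# Illustrative only - the layer split is assumed; only the 35/41 GB totals
# come from the GUI figures quoted above.
virtual_gb = 35            # disk size presented to the guest
base_layer_gb = 30         # assumed allocation of the base volume
snapshot_layer_gb = 11     # assumed COW layer(s) from the 2013 snapshots
actual_gb = base_layer_gb + snapshot_layer_gb
assert actual_gb == 41     # matches "Actual Size 41GB" in the GUI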
>>>>>>>>>
>>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>>
>>>>>>>>> Size: 6887 GB
>>>>>>>>> Available: 1948 GB
>>>>>>>>> Used: 4939 GB
>>>>>>>>> Allocated: 1196 GB
>>>>>>>>> Over Allocation: 61%
>>>>>>>>>
>>>>>>>>> Could there be a problem here? I can allocate additional LUNS if
>>>>>>>>> you feel the space isn't correctly allocated.
>>>>>>>>>
>>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>>> important machine locks up.
>>>>>>>>>
>>>>>>>>> Thank you and much appreciated.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>> Neil Wilson.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron(a)redhat.com> wrote:
>>>>>>>>>> the storage space is configured in percentages and not physical
>>>>>>>>>> size, so if 20G is less than 10% (default config) of your storage
>>>>>>>>>> it will pause the vms regardless of how many GB you still have.
>>>>>>>>>> this is configurable though, so you can change it to less than
>>>>>>>>>> 10% if you like.
>>>>>>>>>>
>>>>>>>>>> to answer the second question, vm's will not pause on an ENOSPC
>>>>>>>>>> error if they run out of space internally, but only if the
>>>>>>>>>> external storage cannot be consumed. so only if you run out of
>>>>>>>>>> space in the storage, and not if the vm runs out of space on its
>>>>>>>>>> own fs.
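
To put numbers on that 10% rule, a quick sketch using the storage-domain
figures from earlier in the thread (the threshold is the default Dafna
mentions, not a quoted config key):

# Quick arithmetic for the default 10% low-space rule described above.
total_gb = 6887          # domain size from Neil's earlier message
available_gb = 1948      # free space from the same message
threshold = 0.10         # default low-space cutoff, per Dafna

floor_gb = total_gb * threshold     # ~688.7 GB - the pause/block point
print(available_gb > floor_gb)      # True: the domain itself is not low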
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, attached is the engine.log; I've taken out the two
>>>>>>>>>>> sections where each of the VM's were paused.
>>>>>>>>>>>
>>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space
>>>>>>>>>>> error" mean the main storage domain has run out of storage, or
>>>>>>>>>>> that the VM has run out?
>>>>>>>>>>>
>>>>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>>>>> paused.
>>>>>>>>>>> My vdsm versions are all...
>>>>>>>>>>>
>>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>>
>>>>>>>>>>> I currently have a 61% over-allocation ratio on my primary
>>>>>>>>>>> storage domain, with 1948GB available.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123(a)gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>>> The VM's are thin provisioned. The Server 2003 VM hasn't run
>>>>>>>>>>>> out of disk space; there is about 20Gigs free, and the usage
>>>>>>>>>>>> barely grows as the VM only shares printers. The other VM that
>>>>>>>>>>>> paused is also on thin provisioned disks and also has plenty of
>>>>>>>>>>>> space; this guest is running Centos 6.3 64bit and only runs
>>>>>>>>>>>> basic reporting.
>>>>>>>>>>>>
>>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up
>>>>>>>>>>>> as unplugged in ovirt, and we had to remove it and re-add it
>>>>>>>>>>>> again in order to correct the issue. The Centos VM did not have
>>>>>>>>>>>> the same issue.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite
>>>>>>>>>>>> critical; any thoughts or ideas?
>>>>>>>>>>>>
>>>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to
>>>>>>>>>>>> the official Centos repo and updating to 3.3.1-2. Prior to
>>>>>>>>>>>> updating I hadn't had this issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny(a)gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse
>>>>>>>>>>>>> disks?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of
>>>>>>>>>>>>> space on the storage domain, and it is done intentionally, so
>>>>>>>>>>>>> that the VM will not experience disk corruption. If you have
>>>>>>>>>>>>> thin provisioned disks, and the VM writes to its disks faster
>>>>>>>>>>>>> than the disks can grow, this is exactly what you will see.
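
A toy model of the race Dan describes (NOT vdsm code - the chunk and
watermark sizes below are invented purely for illustration):

# Toy model of the thin-provisioning race described above.
CHUNK_GB = 1.0        # how much a thin LV is grown per extension
WATERMARK_GB = 0.5    # free headroom that triggers an extension request

def sample(allocated_gb, written_gb, extend_in_flight):
    """One monitoring pass over a thin disk."""
    if written_gb >= allocated_gb:
        # guest reached the end of the allocated extents between passes
        return "pause VM (ENOSPC) to avoid corruption"
    if allocated_gb - written_gb < WATERMARK_GB and not extend_in_flight:
        return "request extension of %.1f GB" % CHUNK_GB
    return "ok"

# If the guest writes faster than extensions complete, written_gb catches
# up with allocated_gb between passes and the VM is paused:
print(sample(allocated_gb=10.0, written_gb=10.0, extend_in_flight=True))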
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123(a)gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've had two different Vm's randomly pause this past week,
>>>>>>>>>>>>>> and inside ovirt the error received is something like 'vm ran
>>>>>>>>>>>>>> out of storage and was paused'.
>>>>>>>>>>>>>> Resuming the vm's didn't work and I had to force them off and
>>>>>>>>>>>>>> then on, which resolved the issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I realise this is very vague, so if you could please let me
>>>>>>>>>>>>>> know which logs to send in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Dafna Ron
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dafna Ron
>>>> --
>>>> Dafna Ron
>>
>>
>> --
>> Dafna Ron