Yes - the engine lost communication with vdsm, so it has no way of knowing
whether the host was down or whether there was a network issue; a network
issue would cause the same errors that I see in the logs.
The error you reported on the iso domain is the reason the vm's have failed
migration - if a vm is running with a cd and the cd is gone, then the vm
will not be able to be migrated.
Which, as I learned last week, is not entirely correct - a pure libvirt VM
seems to work fine... so it must be something in oVirt :(
looking into it,
but just for future reference we want it to work :)
After the engine restart, do you still see a problem with the size, or did
the reported size change?
Dafna
On 01/28/2014 01:02 PM, Neil wrote:
> Hi Dafna,
>
> Thanks for coming back to me. I'll try answer your queries one by one.
>
> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron(a)redhat.com> wrote:
>> you had a problem with your storage on the 14th of Jan and one of the hosts
>> rebooted (if you have the vdsm log from that day then I can see what
>> happened on the vdsm side)
>> in engine, I could see a problem with the export domain, and this should
>> not have caused a reboot.
> 1.) I don't unfortunately have logs going back that far. Looking at
> all 3 hosts' uptimes, the one with the least uptime is 21 days and the
> others are all over 40 days, so there definitely wasn't a host that
> rebooted on the 14th of Jan. Would a network issue or firewall issue
> also cause the error you've seen to look as if a host rebooted? There
> was a bonding mode change on the 14th of January, so perhaps this
> caused the issue?
>
>
>> Can you tell me if you had a problem with the data
>> domain as well or was it just the export domain? were you having any vm's
>> exported/imported at that time?
>> In any case - this is a bug.
> 2.) I think this was the same day that the bonding mode was changed (by
> mistake) on the host while it was live and running SPM. I haven't done
> any importing or exporting for a few years on this oVirt setup.
>
>
>> As for the vm's - if the vm's are no longer in migrating state then please
>> restart ovirt-engine service (looks like a cache issue)
> 3.) Restarted ovirt-engine, logging now appears to be normal without any errors.
>
>
>> if they are in migrating state - there should have been a timeout a long
>> time ago.
>> can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>> all hosts?
> 4.) Ran on all hosts...
>
>
> node01.blabla.com
> 63da7faa-f92a-4652-90f2-b6660a4fb7b3 11232 adam Up
> 502170aa-0fc6-4287-bb08-5844be6e0352 13986 babbage Up
> ff9036fb-1499-45e4-8cde-e350eee3c489 26733 reports Up
> 2736197b-6dc3-4155-9a29-9306ca64881d 13804 tux Up
> 0a3af7b2-ea94-42f3-baeb-78b950af4402 25257 Moodle Up
>
> Id Name State
> ----------------------------------------------------
> 1 adam running
> 2 reports running
> 4 tux running
> 6 Moodle running
> 7 babbage running
>
>
> node02.blabla.com
> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b 2879 spam Up
> 23b9212c-1e25-4003-aa18-b1e819bf6bb1 32454 proxy02 Up
> ac2a3f99-a6db-4cae-955d-efdfb901abb7 5605 software Up
> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e 8870 zimbra Up
>
> Id Name State
> ----------------------------------------------------
> 9 proxy02 running
> 10 spam running
> 12 software running
> 13 zimbra running
>
>
> node03.blabla.com
> e42b7ccc-ce04-4308-aeb2-2291399dd3ef 25809 dhcp Up
> 16d3f077-b74c-4055-97d0-423da78d8a0c 23939 oliver Up
>
> Id Name State
> ----------------------------------------------------
> 13 oliver running
> 14 dhcp running
>
>
>> Last thing is that your ISO domain seems to be having issues as well.
>> This should not affect the host status, but if any of the vm's were booted
>> from an iso or have an iso attached in the boot sequence, this will explain
>> the migration issue.
> There was an ISO domain issue a while back, but this was corrected
> about 2 weeks ago after iptables re-enabled itself on boot after
> running updates, I've checked now and the ISO domain appears to be
> fine and I can see all the images stored within.
>
> I've stumbled across what appears to be another error and all three
> hosts are showing this over and over in /var/log/messages, and I'm not
> sure if it's related? ...
>
> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
> <AdvancedStatsFunction _highWrite at 0x2ce0998>
> Traceback (most recent call last):
>   File "/usr/share/vdsm/sampling.py", line 351, in collect
>     statsFunction()
>   File "/usr/share/vdsm/sampling.py", line 226, in __call__
>     retValue = self._function(*args, **kwargs)
>   File "/usr/share/vdsm/vm.py", line 509, in _highWrite
>     if not vmDrive.blockDev or vmDrive.format != 'cow':
> AttributeError: 'Drive' object has no attribute 'format'
>
> I've attached the full vdsm log from node02 to this reply.
>
> Please shout if you need anything else.
>
> Thank you.
>
> Regards.
>
> Neil Wilson.
>
>> On 01/28/2014 09:28 AM, Neil wrote:
>>> Hi guys,
>>>
>>> Sorry for the very late reply, I've been out of the office doing
>>> installations.
>>> Unfortunately due to the time delay, my oldest logs are only as far
>>> back as the attached.
>>>
>>> I've only grep'd for Thread-286029 in the vdsm log. For the engine.log
>>> I'm not sure what info is required, so the full log is attached.
>>>
>>> Please shout if you need any info or further details.
>>>
>>> Thank you very much.
>>>
>>> Regards.
>>>
>>> Neil Wilson.
>>>
>>>
>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin(a)redhat.com>
>>> wrote:
>>>> Could you please attach the engine.log from the same time?
>>>>
>>>> thanks!
>>>>
>>>> ----- Original Message -----
>>>>> From: "Neil" <nwilson123(a)gmail.com>
>>>>> To: dron(a)redhat.com
>>>>> Cc: "users" <users(a)ovirt.org>
>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>> Subject: Re: [Users] Vm's being paused
>>>>>
>>>>> Hi Dafna,
>>>>>
>>>>> Thanks.
>>>>>
>>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>>
>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>> from June and Oct 2013.
>>>>>
>>>>> I'm using FC storage, with 11 VM's and 3 nodes/hosts; 9 of the 11 VM's
>>>>> have thin provisioned disks.
>>>>>
>>>>> Please shout if you'd like any further info or logs.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Neil Wilson.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron(a)redhat.com> wrote:
>>>>>> Hi Neil,
>>>>>>
>>>>>> Can you please attach the vdsm logs?
>>>>>> also, as for the vm's, do they have any snapshots?
>>>>>> from your suggestion to allocate more luns, are you using iscsi or FC?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dafna
>>>>>>
>>>>>>
>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>> Thanks for the replies guys,
>>>>>>>
>>>>>>> Looking at my two VM's that have paused so far through the oVirt GUI,
>>>>>>> the following sizes show under Disks.
>>>>>>>
>>>>>>> VM Reports:
>>>>>>> Virtual Size 35GB, Actual Size 41GB
>>>>>>> Looking on the Centos OS side, disk size is 33G and used is 12G, with
>>>>>>> 19G available (40% usage).
>>>>>>>
>>>>>>> VM Babbage:
>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>> On the Server 2003 OS side, disk size is 39.9GB and used is 16.3G, so
>>>>>>> under 50% usage.
>>>>>>>
>>>>>>>
>>>>>>> Do you see any issues with the above stats?
>>>>>>>
>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>
>>>>>>> Size: 6887 GB
>>>>>>> Available: 1948 GB
>>>>>>> Used: 4939 GB
>>>>>>> Allocated: 1196 GB
>>>>>>> Over Allocation: 61%
>>>>>>>
>>>>>>> Could there be a problem here? I can allocate additional LUNS if you
>>>>>>> feel the space isn't correctly allocated.
>>>>>>>
>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>> important machine locks up.
>>>>>>>
>>>>>>> Thank you and much appreciated.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> Neil Wilson.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron(a)redhat.com> wrote:
>>>>>>>> the storage space is configured in percentages and not physical size,
>>>>>>>> so if 20G is less than 10% (the default config) of your storage it
>>>>>>>> will pause the vms regardless of how many GB you still have.
>>>>>>>> this is configurable though, so you can change it to less than 10%
>>>>>>>> if you like.
>>>>>>>>
>>>>>>>> to answer the second question, vm's will not pause on an ENOSPC
>>>>>>>> error if they run out of space internally, but only if the external
>>>>>>>> storage cannot be consumed. so they pause only if you run out of
>>>>>>>> space on the storage domain, not if a vm runs out of space in its
>>>>>>>> own fs.
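To illustrate the percentage-based behaviour Dafna describes, a small sketch - the 10% default and the function name are illustrative of the rule, not actual engine code:

```python
# Sketch of the percentage-based pause rule described above: the trigger is
# the fraction of free space on the storage domain, not the absolute GB left.

def should_pause(total_gb, free_gb, threshold_pct=10.0):
    """True when free space falls below threshold_pct of the domain size."""
    return (free_gb / total_gb) * 100.0 < threshold_pct

# 20G free on a 6887G domain is ~0.3% - far below 10%, so vms would pause
print(should_pause(6887, 20))    # True
# Neil's 1948G free is ~28% of 6887G, comfortably above the default
print(should_pause(6887, 1948))  # False
```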
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> Sorry, attached is the engine.log; I've taken out the two sections
>>>>>>>>> where each of the VM's was paused.
>>>>>>>>>
>>>>>>>>> Does the error "VM babbage has paused due to no Storage space
>>>>>>>>> error" mean the main storage domain has run out of storage, or
>>>>>>>>> that the VM has run out?
>>>>>>>>>
>>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>>> paused.
>>>>>>>>> My vdsm versions are all...
>>>>>>>>>
>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>
>>>>>>>>> I currently have a 61% over-allocation ratio on my primary storage
>>>>>>>>> domain, with 1948GB available.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>> Neil Wilson.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123(a)gmail.com> wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>> The VM's are thin provisioned. The Server 2003 VM hasn't run out
>>>>>>>>>> of disk space; there is about 20Gigs free, and the usage barely
>>>>>>>>>> grows as the VM only shares printers. The other VM that paused is
>>>>>>>>>> also on thin provisioned disks and also has plenty of space; this
>>>>>>>>>> guest is running Centos 6.3 64bit and only runs basic reporting.
>>>>>>>>>>
>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>>> unplugged in ovirt, and we had to remove it and re-add it in
>>>>>>>>>> order to correct the issue. The Centos VM did not have the same
>>>>>>>>>> issue.
>>>>>>>>>>
>>>>>>>>>> I'm concerned that this might happen to a VM that's quite
>>>>>>>>>> critical - any thoughts or ideas?
>>>>>>>>>>
>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to the
>>>>>>>>>> official Centos repo and updating to 3.3.1-2. Prior to updating I
>>>>>>>>>> hadn't had this issue.
>>>>>>>>>>
>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> Neil Wilson.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny(a)gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse disks?
>>>>>>>>>>>
>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space
>>>>>>>>>>> on the storage domain, and it is done intentionally, so that the
>>>>>>>>>>> VM will not experience disk corruption. If you have thin
>>>>>>>>>>> provisioned disks, and the VM writes to its disks faster than
>>>>>>>>>>> the disks can grow, this is exactly what you will see.
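Dan's point about writes outrunning disk growth can be sketched as a toy model - the write rates and extension policy below are made-up numbers for illustration, not oVirt's actual allocation values:

```python
# Toy model of the race described above: a thin-provisioned volume is extended
# in chunks at intervals, and the guest pauses the moment its cumulative
# writes exceed the space allocated so far. All numbers are illustrative.

def first_pause_second(write_mb_s, chunk_mb, interval_s, allocated_mb, seconds):
    """Return the first second at which writes outrun allocation, else None."""
    written = 0.0
    for t in range(1, seconds + 1):
        written += write_mb_s            # guest keeps writing every second
        if t % interval_s == 0:
            allocated_mb += chunk_mb     # host periodically extends the volume
        if written > allocated_mb:
            return t                     # I/O would fail here -> VM is paused
    return None

# 200 MB/s against a 1024 MB extension every 5 s keeps up over a minute
print(first_pause_second(200, 1024, 5, 1024, 60))  # None
# 500 MB/s outruns the same extension policy almost immediately
print(first_pause_second(500, 1024, 5, 1024, 60))  # 3
```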
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123(a)gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>
>>>>>>>>>>>> I've had two different VM's randomly pause this past week, and
>>>>>>>>>>>> inside ovirt the error received is something like 'vm ran out
>>>>>>>>>>>> of storage and was paused'.
>>>>>>>>>>>> Resuming the vm's didn't work and I had to force them off and
>>>>>>>>>>>> then on, which resolved the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>
>>>>>>>>>>>> I realise this is very vague, so if you could please let me
>>>>>>>>>>>> know which logs to send in.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>> Users(a)ovirt.org
>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dafna Ron
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dafna Ron
>>
>> --
>> Dafna Ron
--
Dafna Ron