[Users] Vm's being paused

Neil nwilson123 at gmail.com
Wed Jan 29 07:02:24 UTC 2014


Hi Dafna,

Thanks for clarifying that, I found the migration issue and this was
resolved once I sorted out the ISO domain problem.

I'm sorry, I don't understand your last question:
"> after the engine restart, do you still see a problem with the size
or did the report of size changed?"

The migration issue was resolved, so I'm now just trying to track down
why the two VMs paused on their own, one on the 8th of Jan (I think)
and one on the 19th of Jan.
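
In case it helps anyone following along, here is a rough sketch of the
percentage check Dafna described further down the thread, using the storage
figures I posted earlier (Size 6887 GB, Available 1948 GB). The 10% value is
the default she mentioned; the function names and the exact comparison are
my own assumptions for illustration, not how the engine or vdsm actually
implements it.

```python
# Rough sketch only: these helpers are hypothetical, not oVirt or vdsm code.

def free_percent(size_gb: float, available_gb: float) -> float:
    """Return free space as a percentage of the total domain size."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return 100.0 * available_gb / size_gb


def is_below_free_threshold(size_gb: float, available_gb: float,
                            threshold_percent: float = 10.0) -> bool:
    """True if the free percentage is under the (assumed) threshold."""
    return free_percent(size_gb, available_gb) < threshold_percent


if __name__ == "__main__":
    # Figures from the storage domain summary posted in this thread.
    size_gb, available_gb = 6887, 1948
    print(f"free: {free_percent(size_gb, available_gb):.1f}%")
    print("below 10% threshold:",
          is_below_free_threshold(size_gb, available_gb))
```

On those numbers the domain has roughly 28% free, so if the check works the
way I've assumed, a domain-wide threshold on its own doesn't obviously
explain the pauses.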

Thank you.


Regards.

Neil Wilson.


On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <dron at redhat.com> wrote:
> Yes - the engine lost communication with vdsm and has no way of knowing if
> the host is down or if there was a network issue, so a network issue would
> cause the same errors that I see in the logs.
>
> The error you put on the iso is the reason the vm's have failed migration -
> if a vm is run with a cd and the cd is gone then the vm will not be able to
> be migrated.
>
> After the engine restart, do you still see a problem with the size or did
> the reported size change?
>
> Dafna
>
>
> On 01/28/2014 01:02 PM, Neil wrote:
>>
>> Hi Dafna,
>>
>> Thanks for coming back to me. I'll try answer your queries one by one.
>>
>> On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron at redhat.com> wrote:
>>>
>>> You had a problem with your storage on the 14th of Jan and one of the hosts
>>> rebooted (if you have the vdsm log from that day then I can see what
>>> happened on the vdsm side).
>>> In engine, I could see a problem with the export domain and this should not
>>> have caused a reboot.
>>
>> 1.) Unfortunately I don't have logs going back that far. Looking at
>> all 3 hosts' uptime, the one with the least uptime is 21 days and the
>> others are all over 40 days, so there definitely wasn't a host that
>> rebooted on the 14th of Jan. Would a network issue or firewall issue
>> also cause the error you've seen to look as if a host rebooted? There
>> was a bonding mode change on the 14th of January, so perhaps this
>> caused the issue?
>>
>>
>>> Can you tell me if you had a problem with the data
>>> domain as well or was it just the export domain? were you having any vm's
>>> exported/imported at that time?
>>> In any case - this is a bug.
>>
>> 2.) I think this was the same day that the bonding mode was changed on
>> the host while the host was live (by mistake), and had SPM running on
>> it. I haven't done any importing or exporting for a few years on this
>> oVirt setup.
>>
>>
>>> As for the vm's - if the vm's are no longer in migrating state then please
>>> restart the ovirt-engine service (looks like a cache issue).
>>
>> 3.) Restarted ovirt-engine, logging now appears to be normal without any
>> errors.
>>
>>
>>> If they are in migrating state - there should have been a timeout a long
>>> time ago.
>>> Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
>>> all hosts?
>>
>> 4.) Ran on all hosts...
>>
>> node01.blabla.com
>> 63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam                 Up
>> 502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage              Up
>> ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports              Up
>> 2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux                  Up
>> 0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle               Up
>>
>>   Id    Name                           State
>> ----------------------------------------------------
>>   1     adam                           running
>>   2     reports                        running
>>   4     tux                            running
>>   6     Moodle                         running
>>   7     babbage                        running
>>
>> node02.blabla.com
>> dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam                 Up
>> 23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02              Up
>> ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software             Up
>> 179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra               Up
>>
>>   Id    Name                           State
>> ----------------------------------------------------
>>   9     proxy02                        running
>>   10    spam                           running
>>   12    software                       running
>>   13    zimbra                         running
>>
>> node03.blabla.com
>> e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp                 Up
>> 16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver               Up
>>
>>   Id    Name                           State
>> ----------------------------------------------------
>>   13    oliver                         running
>>   14    dhcp                           running
>>
>>
>>> Last thing is that your ISO domain seems to be having issues as well.
>>> This should not affect the host status, but if any of the vm's were booted
>>> from an iso or have an iso attached in the boot sequence this will explain
>>> the migration issue.
>>
>> There was an ISO domain issue a while back, but this was corrected
>> about 2 weeks ago after iptables re-enabled itself on boot after
>> running updates, I've checked now and the ISO domain appears to be
>> fine and I can see all the images stored within.
>>
>> I've stumbled across what appears to be another error and all three
>> hosts are showing this over and over in /var/log/messages, and I'm not
>> sure if it's related? ...
>>
>> Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
>> vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
>> <AdvancedStatsFunction _highWrite at 0x2ce0998>#012Traceback (most
>> recent call last):#012  File "/usr/share/vdsm/sampling.py", line 351,
>> in collect#012    statsFunction()#012  File
>> "/usr/share/vdsm/sampling.py", line 226, in __call__#012    retValue =
>> self._function(*args, **kwargs)#012  File "/usr/share/vdsm/vm.py",
>> line 509, in _highWrite#012    if not vmDrive.blockDev or
>> vmDrive.format != 'cow':#012AttributeError: 'Drive' object has no
>> attribute 'format'
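
The #012 sequences in the entry above are just how syslog escapes newlines
(octal 012 is the line feed character). A minimal sketch for turning such an
entry back into a readable traceback; the function name is made up for
illustration, and the sample line is shortened from the entry above:

```python
import re


def unescape_syslog(line: str) -> str:
    """Replace syslog-style #ooo octal escapes with the characters they encode.

    Caveat: a literal '#' followed by three octal digits in the message
    itself would also be converted, so treat this as a reading aid only.
    """
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)


if __name__ == "__main__":
    sample = ("Stats function failed:#012Traceback (most recent call last):"
              "#012AttributeError: 'Drive' object has no attribute 'format'")
    print(unescape_syslog(sample))
```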
>>
>> I've attached the full vdsm log from node02 to this reply.
>>
>> Please shout if you need anything else.
>>
>> Thank you.
>>
>> Regards.
>>
>> Neil Wilson.
>>
>>> On 01/28/2014 09:28 AM, Neil wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> Sorry for the very late reply, I've been out of the office doing
>>>> installations.
>>>> Unfortunately due to the time delay, my oldest logs are only as far
>>>> back as the attached.
>>>>
>>>> I've only grep'd for Thread-286029 in the vdsm log. The engine.log I'm
>>>> not sure what info is required, so the full log is attached.
>>>>
>>>> Please shout if you need any info or further details.
>>>>
>>>> Thank you very much.
>>>>
>>>> Regards.
>>>>
>>>> Neil Wilson.
>>>>
>>>>
>>>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin at redhat.com>
>>>> wrote:
>>>>>
>>>>> Could you please attach the engine.log from the same time?
>>>>>
>>>>> thanks!
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Neil" <nwilson123 at gmail.com>
>>>>>> To: dron at redhat.com
>>>>>> Cc: "users" <users at ovirt.org>
>>>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>>>> Subject: Re: [Users] Vm's being paused
>>>>>>
>>>>>> Hi Dafna,
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>>>
>>>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>>>> from June and Oct 2013.
>>>>>>
>>>>>> I'm using FC storage, with 11 VM's and 3 nodes/hosts, 9 of the 11 VM's
>>>>>> have thin provisioned disks.
>>>>>>
>>>>>> Please shout if you'd like any further info or logs.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Neil Wilson.
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> Can you please attach the vdsm logs?
>>>>>>> also, as for the vm's, do they have any snapshots?
>>>>>>> from your suggestion to allocate more luns, are you using iscsi or
>>>>>>> FC?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dafna
>>>>>>>
>>>>>>>
>>>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>>>
>>>>>>>> Thanks for the replies guys,
>>>>>>>>
>>>>>>>> Looking at my two VM's that have paused so far through the oVirt GUI
>>>>>>>> the following sizes show under Disks.
>>>>>>>>
>>>>>>>> VM Reports:
>>>>>>>> Virtual Size 35GB,  Actual Size 41GB
>>>>>>>> Looking on the Centos OS side, Disk size is 33G and used is 12G with
>>>>>>>> 19G available (40% usage).
>>>>>>>>
>>>>>>>> VM Babbage:
>>>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>>>> On the Server 2003 OS side, Disk size is 39.9GB and used is 16.3GB,
>>>>>>>> so under 50% usage.
>>>>>>>>
>>>>>>>>
>>>>>>>> Do you see any issues with the above stats?
>>>>>>>>
>>>>>>>> Then my main Datacenter storage is as follows...
>>>>>>>>
>>>>>>>> Size: 6887 GB
>>>>>>>> Available: 1948 GB
>>>>>>>> Used: 4939 GB
>>>>>>>> Allocated: 1196 GB
>>>>>>>> Over Allocation: 61%
>>>>>>>>
>>>>>>>> Could there be a problem here? I can allocate additional LUNS if you
>>>>>>>> feel the space isn't correctly allocated.
>>>>>>>>
>>>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>>>> something isn't right and I might have a serious problem if an
>>>>>>>> important machine locks up.
>>>>>>>>
>>>>>>>> Thank you and much appreciated.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> Neil Wilson.
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> The storage space is configured in percentages and not physical size,
>>>>>>>>> so if 20G is less than 10% (default config) of your storage it will
>>>>>>>>> pause the vms regardless of how much GB you still have.
>>>>>>>>> This is configurable though, so you can change it to less than 10% if
>>>>>>>>> you like.
>>>>>>>>>
>>>>>>>>> To answer the second question, vm's will not pause on an ENOSpace error
>>>>>>>>> if they run out of space internally, but only if the external storage
>>>>>>>>> cannot be consumed. So only if you run out of space in the storage, and
>>>>>>>>> not if the vm runs out of space in its own fs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> Sorry, attached is engine.log. I've taken out the two sections where
>>>>>>>>>> each of the VM's was paused.
>>>>>>>>>>
>>>>>>>>>> Does the error "VM babbage has paused due to no Storage space error"
>>>>>>>>>> mean the main storage domain has run out of storage, or that the VM
>>>>>>>>>> has run out?
>>>>>>>>>>
>>>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>>>> paused.
>>>>>>>>>> My vdsm versions are all...
>>>>>>>>>>
>>>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>>>
>>>>>>>>>> I currently have a 61% over allocation ratio on my primary storage
>>>>>>>>>> domain, with 1948GB available.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> Neil Wilson.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>>>> The VM's are thin provisioned. The Server 2003 VM hasn't run out of
>>>>>>>>>>> disk space; there is about 20 GB free, and the usage barely grows as
>>>>>>>>>>> the VM only shares printers. The other VM that paused is also on thin
>>>>>>>>>>> provisioned disks and also has plenty of space. This guest is running
>>>>>>>>>>> Centos 6.3 64bit and only runs basic reporting.
>>>>>>>>>>>
>>>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>>>> unplugged in ovirt, and we had to remove it and re-add it again in
>>>>>>>>>>> order to correct the issue. The Centos VM did not have the same issue.
>>>>>>>>>>>
>>>>>>>>>>> I'm concerned that this might happen to a VM that's quite critical,
>>>>>>>>>>> any thoughts or ideas?
>>>>>>>>>>>
>>>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to the
>>>>>>>>>>> official Centos repo and updating to 3.3.1-2. Prior to updating I
>>>>>>>>>>> haven't had this issue.
>>>>>>>>>>>
>>>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse disks?
>>>>>>>>>>>>
>>>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space on
>>>>>>>>>>>> the storage domain, and it is done intentionally, so that the VM will
>>>>>>>>>>>> not experience disk corruption. If you have thin provisioned disks,
>>>>>>>>>>>> and the VM writes to its disks faster than the disks can grow, this
>>>>>>>>>>>> is exactly what you will see.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've had two different VM's randomly pause this past week, and inside
>>>>>>>>>>>>> ovirt the error received is something like 'vm ran out of storage and
>>>>>>>>>>>>> was paused'.
>>>>>>>>>>>>> Resuming the vm's didn't work and I had to force them off and then on,
>>>>>>>>>>>>> which resolved the issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I realise this is very vague, so could you please let me know which
>>>>>>>>>>>>> logs to send in?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Users mailing list
>>>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dafna Ron
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dafna Ron
>>>>>>
>>>>>>
>>>
>>> --
>>> Dafna Ron
>
>
>
> --
> Dafna Ron


