[Users] Vm's being paused

Neil nwilson123 at gmail.com
Tue Jan 28 13:02:27 UTC 2014


Hi Dafna,

Thanks for coming back to me. I'll try to answer your queries one by one.

On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron at redhat.com> wrote:
> you had a problem with your storage on the 14th of Jan and one of the hosts
> rebooted (if you have the vdsm log from that day then I can see what
> happened on vdsm side)
> in engine, I could see a problem with the export domain and this should not
> have caused a reboot.

1.) Unfortunately I don't have logs going back that far. Looking at
the uptime of all 3 hosts, the one with the least uptime is at 21 days
and the others are all over 40 days, so no host rebooted on the 14th
of Jan. Could a network or firewall issue cause the error you've seen
to look as if a host rebooted? There was a bonding mode change on the
14th of January, so perhaps this caused the issue?


> Can you tell me if you had a problem with the data
> domain as well or was it just the export domain? were you having any vm's
> exported/imported at that time?
> In any case - this is a bug.

2.) I think this was the same day that the bonding mode was changed
(by mistake) on the host while it was live and running SPM. I haven't
done any importing or exporting for a few years on this oVirt setup.


> As for the vm's - if the vm's are no longer in migrating state then please
> restart ovirt-engine service (looks like a cache issue)

3.) Restarted ovirt-engine, logging now appears to be normal without any errors.


> if they are in migrating state - there should have been a timeout a long
> time ago.
> can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on all
> hosts?

4.) Ran on all hosts...

node01.blabla.com
63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam                 Up
502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage              Up
ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports              Up
2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux                  Up
0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle               Up

 Id    Name                           State
----------------------------------------------------
 1     adam                           running
 2     reports                        running
 4     tux                            running
 6     Moodle                         running
 7     babbage                        running

node02.blabla.com
dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam                 Up
23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02              Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software             Up
179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra               Up

 Id    Name                           State
----------------------------------------------------
 9     proxy02                        running
 10    spam                           running
 12    software                       running
 13    zimbra                         running

node03.blabla.com
e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp                 Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver               Up

 Id    Name                           State
----------------------------------------------------
 13    oliver                         running
 14    dhcp                           running
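
For anyone wanting to repeat this comparison, here is a minimal Python
sketch (not part of vdsm or oVirt) that cross-checks the two listings
for one host. The function names are hypothetical, and it assumes VM
names contain no spaces and that the column layout matches the output
pasted above.

```python
import re

# Matches the VM UUID in the first column of the vdsClient table.
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def parse_vdsclient_table(text):
    """Return {vm_name: status} from 'vdsClient -s 0 list table' output."""
    vms = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: UUID, PID, name, status (status may be two words).
        if len(fields) >= 4 and UUID_RE.match(fields[0]):
            vms[fields[2]] = " ".join(fields[3:])
    return vms


def parse_virsh_list(text):
    """Return {vm_name: state} from 'virsh -r list' output."""
    vms = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric domain id; header and ruler do not.
        if len(fields) >= 3 and fields[0].isdigit():
            vms[fields[1]] = " ".join(fields[2:])
    return vms


def compare(vdsm_vms, libvirt_vms):
    """Return a list of human-readable mismatches between the two views."""
    problems = []
    for name in sorted(set(vdsm_vms) | set(libvirt_vms)):
        vdsm_status = vdsm_vms.get(name)
        virsh_state = libvirt_vms.get(name)
        if vdsm_status is None:
            problems.append(f"{name}: in libvirt ({virsh_state}) but not in vdsm")
        elif virsh_state is None:
            problems.append(f"{name}: in vdsm ({vdsm_status}) but not in libvirt")
        elif (vdsm_status == "Up") != (virsh_state == "running"):
            problems.append(f"{name}: vdsm={vdsm_status}, libvirt={virsh_state}")
    return problems


# Sample input copied from the node03 listing above.
NODE03_VDSM = """\
e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp                 Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver               Up
"""

NODE03_VIRSH = """\
 Id    Name                           State
----------------------------------------------------
 13    oliver                         running
 14    dhcp                           running
"""

if __name__ == "__main__":
    issues = compare(parse_vdsclient_table(NODE03_VDSM),
                     parse_virsh_list(NODE03_VIRSH))
    print(issues or "vdsm and libvirt agree")
```

An empty list means both tools report the same set of VMs and none is
stuck in a non-running state such as migrating or paused.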


> Last thing is that your ISO domain seems to be having issues as well.
> This should not affect the host status but if any of the vm's were booted
> from an iso or have an iso attached in the boot sequence this will explain
> the migration issue.

There was an ISO domain issue a while back, after iptables re-enabled
itself on boot following updates, but this was corrected about 2 weeks
ago. I've checked now, and the ISO domain appears to be fine: I can
see all the images stored within.

I've also stumbled across what appears to be another error. All three
hosts are showing this over and over in /var/log/messages, and I'm not
sure if it's related:

Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x2ce0998>#012Traceback (most
recent call last):#012  File "/usr/share/vdsm/sampling.py", line 351,
in collect#012    statsFunction()#012  File
"/usr/share/vdsm/sampling.py", line 226, in __call__#012    retValue =
self._function(*args, **kwargs)#012  File "/usr/share/vdsm/vm.py",
line 509, in _highWrite#012    if not vmDrive.blockDev or
vmDrive.format != 'cow':#012AttributeError: 'Drive' object has no
attribute 'format'
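
The #012 sequences above are how rsyslog, in its default configuration,
escapes newline characters (octal 012) inside a message. A minimal
Python sketch for turning such a line back into a readable traceback
follows. The function name is hypothetical, the sample is abbreviated
from the message above with the mail wrapping undone, and any literal
"#" followed by three octal digits in a message would also be decoded.

```python
import re


def decode_syslog_escapes(message):
    """Replace #NNN octal escapes with the characters they stand for."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), message)


# Abbreviated copy of the logged message, joined back into one line.
SAMPLE = (
    "Stats function failed: <AdvancedStatsFunction _highWrite at 0x2ce0998>"
    "#012Traceback (most recent call last):"
    '#012  File "/usr/share/vdsm/sampling.py", line 351, in collect'
    "#012    statsFunction()"
    '#012  File "/usr/share/vdsm/sampling.py", line 226, in __call__'
    "#012    retValue = self._function(*args, **kwargs)"
    '#012  File "/usr/share/vdsm/vm.py", line 509, in _highWrite'
    "#012    if not vmDrive.blockDev or vmDrive.format != 'cow':"
    "#012AttributeError: 'Drive' object has no attribute 'format'"
)

if __name__ == "__main__":
    print(decode_syslog_escapes(SAMPLE))
```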

I've attached the full vdsm log from node02 to this reply.

Please shout if you need anything else.

Thank you.

Regards.

Neil Wilson.

> On 01/28/2014 09:28 AM, Neil wrote:
>>
>> Hi guys,
>>
>> Sorry for the very late reply, I've been out of the office doing
>> installations.
>> Unfortunately due to the time delay, my oldest logs are only as far
>> back as the attached.
>>
>> I've only grep'd for Thread-286029 in the vdsm log. The engine.log I'm
>> not sure what info is required, so the full log is attached.
>>
>> Please shout if you need any info or further details.
>>
>> Thank you very much.
>>
>> Regards.
>>
>> Neil Wilson.
>>
>>
>> On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin at redhat.com>
>> wrote:
>>>
>>> Could you please attach the engine.log from the same time?
>>>
>>> thanks!
>>>
>>> ----- Original Message -----
>>>>
>>>> From: "Neil" <nwilson123 at gmail.com>
>>>> To: dron at redhat.com
>>>> Cc: "users" <users at ovirt.org>
>>>> Sent: Wednesday, January 22, 2014 1:14:25 PM
>>>> Subject: Re: [Users] Vm's being paused
>>>>
>>>> Hi Dafna,
>>>>
>>>> Thanks.
>>>>
>>>> The vdsm logs are quite large, so I've only attached the logs for the
>>>> pause of the VM called Babbage on the 19th of Jan.
>>>>
>>>> As for snapshots, Babbage has one from June 2013 and Reports has two
>>>> from June and Oct 2013.
>>>>
>>>> I'm using FC storage, with 11 VM's and 3 nodes/hosts, 9 of the 11 VM's
>>>> have thin provisioned disks.
>>>>
>>>> Please shout if you'd like any further info or logs.
>>>>
>>>> Thank you.
>>>>
>>>> Regards.
>>>>
>>>> Neil Wilson.
>>>>
>>>> On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron at redhat.com> wrote:
>>>>>
>>>>> Hi Neil,
>>>>>
>>>>> Can you please attach the vdsm logs?
>>>>> also, as for the vm's, do they have any snapshots?
>>>>> from your suggestion to allocate more luns, are you using iscsi or FC?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dafna
>>>>>
>>>>>
>>>>> On 01/22/2014 08:45 AM, Neil wrote:
>>>>>>
>>>>>> Thanks for the replies guys,
>>>>>>
>>>>>> Looking at my two VM's that have paused so far through the oVirt GUI
>>>>>> the following sizes show under Disks.
>>>>>>
>>>>>> VM Reports:
>>>>>> Virtual Size 35GB,  Actual Size 41GB
>>>>>> Looking on the Centos OS side, Disk size is 33G and used is 12G with
>>>>>> 19G available (40% usage).
>>>>>>
>>>>>> VM Babbage:
>>>>>> Virtual Size is 40GB, Actual Size 53GB
>>>>>> On the Server 2003 OS side, Disk size is 39.9Gb and used is 16.3G, so
>>>>>> under 50% usage.
>>>>>>
>>>>>>
>>>>>> Do you see any issues with the above stats?
>>>>>>
>>>>>> Then my main Datacenter storage is as follows...
>>>>>>
>>>>>> Size: 6887 GB
>>>>>> Available: 1948 GB
>>>>>> Used: 4939 GB
>>>>>> Allocated: 1196 GB
>>>>>> Over Allocation: 61%
>>>>>>
>>>>>> Could there be a problem here? I can allocate additional LUNS if you
>>>>>> feel the space isn't correctly allocated.
>>>>>>
>>>>>> Apologies for going on about this, but I'm really concerned that
>>>>>> something isn't right and I might have a serious problem if an
>>>>>> important machine locks up.
>>>>>>
>>>>>> Thank you and much appreciated.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Neil Wilson.
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron at redhat.com> wrote:
>>>>>>>
>>>>>>> the storage space is configured in percentages and not physical size.
>>>>>>> so if 20G is less than 10% (default config) of your storage it will
>>>>>>> pause the vms regardless of how much GB you still have.
>>>>>>> this is configurable though so you can change it to less than 10% if
>>>>>>> you like.
>>>>>>>
>>>>>>> to answer the second question, vm's will not pause on ENOSpace error
>>>>>>> if they run out of space internally but only if the external storage
>>>>>>> cannot be consumed. so only if you run out of space in the storage and
>>>>>>> not if the vm runs out of space in its own fs.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 01/21/2014 09:51 AM, Neil wrote:
>>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> Sorry, attached is engine.log. I've taken out the two sections where
>>>>>>>> each of the VM's were paused.
>>>>>>>>
>>>>>>>> Does the error "VM babbage has paused due to no Storage space error"
>>>>>>>> mean the main storage domain has run out of storage, or that the VM
>>>>>>>> has run out?
>>>>>>>>
>>>>>>>> Both VM's appear to have been running on node01 when they were
>>>>>>>> paused.
>>>>>>>> My vdsm versions are all...
>>>>>>>>
>>>>>>>> vdsm-cli-4.13.0-11.el6.noarch
>>>>>>>> vdsm-python-cpopen-4.13.0-11.el6.x86_64
>>>>>>>> vdsm-xmlrpc-4.13.0-11.el6.noarch
>>>>>>>> vdsm-4.13.0-11.el6.x86_64
>>>>>>>> vdsm-python-4.13.0-11.el6.x86_64
>>>>>>>>
>>>>>>>> I currently have a 61% over allocation ratio on my primary storage
>>>>>>>> domain, with 1948GB available.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> Neil Wilson.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> Sorry for only coming back to you now.
>>>>>>>>> The VM's are thin provisioned. The Server 2003 VM hasn't run out of
>>>>>>>>> disk space, there is about 20Gigs free, and the usage barely grows
>>>>>>>>> as the VM only shares printers. The other VM that paused is also on
>>>>>>>>> thin provisioned disks and also has plenty of space; this guest is
>>>>>>>>> running Centos 6.3 64bit and only runs basic reporting.
>>>>>>>>>
>>>>>>>>> After the 2003 guest was rebooted, the network card showed up as
>>>>>>>>> unplugged in ovirt, and we had to remove it, and re-add it again in
>>>>>>>>> order to correct the issue. The Centos VM did not have the same
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> I'm concerned that this might happen to a VM that's quite critical,
>>>>>>>>> any thoughts or ideas?
>>>>>>>>>
>>>>>>>>> The only recent changes have been updating from Dreyou 3.2 to the
>>>>>>>>> official Centos repo and updating to 3.3.1-2. Prior to updating I
>>>>>>>>> haven't had this issue.
>>>>>>>>>
>>>>>>>>> Any assistance is greatly appreciated.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>> Neil Wilson.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Do you have the VMs on thin provisioned storage or sparse disks?
>>>>>>>>>>
>>>>>>>>>> Pausing happens when the VM has an IO error or runs out of space
>>>>>>>>>> on the storage domain, and it is done intentionally, so that the VM
>>>>>>>>>> will not experience a disk corruption. If you have thin provisioned
>>>>>>>>>> disks, and the VM writes to its disks faster than the disks can
>>>>>>>>>> grow, this is exactly what you will see.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> I've had two different Vm's randomly pause this past week and
>>>>>>>>>>> inside ovirt the error received is something like 'vm ran out of
>>>>>>>>>>> storage and was paused'.
>>>>>>>>>>> Resuming the vm's didn't work and I had to force them off and
>>>>>>>>>>> then on, which resolved the issue.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone had this issue before?
>>>>>>>>>>>
>>>>>>>>>>> I realise this is very vague so if you could please let me know
>>>>>>>>>>> which logs to send in.
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> Neil Wilson
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Users mailing list
>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dafna Ron
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dafna Ron
>>>>
>
>
> --
> Dafna Ron
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.log-node02.tar.bz2
Type: application/x-bzip2
Size: 92728 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20140128/8e262243/attachment-0001.bz2>