
Hi guys,

I've had two different VMs randomly pause this past week, and in oVirt the error received is something like 'vm ran out of storage and was paused'. Resuming the VMs didn't work; I had to force them off and then on again, which resolved the issue.

Has anyone had this issue before? I realise this is very vague, so please let me know which logs to send in.

Thank you. Regards,
Neil Wilson

Do you have the VMs on thin-provisioned storage or sparse disks?

Pausing happens when the VM hits an I/O error or runs out of space on the storage domain, and it is done intentionally, so that the VM will not suffer disk corruption. If you have thin-provisioned disks and the VM writes to its disks faster than the disks can grow, this is exactly what you will see.
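A quick way to check how a given disk is actually provisioned (a sketch; on file-based storage the path below is a placeholder, substitute your own pool/domain/image IDs):

    qemu-img info /rhev/data-center/<pool-id>/<domain-id>/images/<image-id>/<volume-id>
    # a "virtual size" much larger than "disk size" indicates a sparse image

On block storage (FC/iSCSI) each volume is an LV, so 'lvs' on a host shows the currently allocated sizes instead.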

Hi,

this shouldn't happen imho. If the VM writes faster than the virtual disk can grow, I would expect the hypervisor to slow down the writing instead of throwing errors?

On 19.01.2014 19:20, Dan Yasny wrote:
If you have thin provisioned disks, and the VM writes to it's disks faster than the disks can grow, this is exactly what you will see
-- Mit freundlichen Grüßen / Regards

Sven Kieske
Systemadministrator
Mittwald CM Service GmbH & Co. KG

On Jan 20, 2014, at 15:03, Sven Kieske <S.Kieske@mittwald.de> wrote:
Hi,
this shouldn't happen imho.
If the VM writes faster than the virtual disk can grow I would expect the hypervisor to slow down the writing instead of throwing errors?
We extend the disk before it runs out of space, so this should not really happen in real life, not in reasonable setups; unlike the genuinely out-of-space scenario, where there's little we can do other than pause. The threshold for growing the space can be changed... or simply use preallocated disks if the performance is that bad and the VM's demands are that high.

Thanks,
Michal
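If you want to tune that threshold, the watermark lives in vdsm's configuration on each host. A minimal sketch, assuming the key names and defaults below (they may differ between vdsm versions, so verify them against your own /etc/vdsm/vdsm.conf):

    [irs]
    volume_utilization_percent = 50     # extend once the volume is this % full
    volume_utilization_chunk_mb = 1024  # grow by this many MB per extension

Restart vdsmd afterwards (service vdsmd restart) for the change to take effect.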

Good to know there's a threshold. How and where can it be configured? And what is the default value? Thank you for your fast reply! :)

On 20.01.2014 15:10, Michal Skrivanek wrote:
the threshold for growing the space can be changed
-- Mit freundlichen Grüßen / Regards

Sven Kieske
Systemadministrator
Mittwald CM Service GmbH & Co. KG

No... we don't, because we don't want to create a performance issue. The VM keeps writing until it runs out of space; if we can extend the LV, we do (vdsm will run lvextend). However, the user should never see this error at all, even on a thin-provisioned disk, since this is a low-level operation which takes very little time. If we can see the VMs pausing, then it's either a UI refresh issue, an internal communication issue between the host and storage, or an actual bug. Adding Eduardo and Federico.

On 01/20/2014 02:03 PM, Sven Kieske wrote:
Hi,
this shouldn't happen imho.
If the VM writes faster than the virtual disk can grow I would expect the hypervisor to slow down the writing instead of throwing errors?
-- Dafna Ron
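What vdsm does there boils down to something like the following on the SPM host (illustrative only; vdsm derives the VG/LV names and the extension size itself, and the names below are placeholders):

    lvextend -L +1024M /dev/<storage-domain-vg>/<image-volume-lv>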

Hi Dan,

Sorry for only coming back to you now. The VMs are thin provisioned. The Server 2003 VM hasn't run out of disk space; there is about 20 GB free, and the usage barely grows as the VM only shares printers. The other VM that paused is also on thin-provisioned disks and also has plenty of space; this guest runs CentOS 6.3 64-bit and only does basic reporting.

After the 2003 guest was rebooted, its network card showed up as unplugged in oVirt, and we had to remove it and re-add it to correct the issue. The CentOS VM did not have the same problem.

I'm concerned that this might happen to a VM that's quite critical; any thoughts or ideas? The only recent changes have been moving from the Dreyou 3.2 repo to the official CentOS repo and updating to 3.3.1-2. Prior to updating I didn't have this issue.

Any assistance is greatly appreciated. Thank you.

Regards,
Neil Wilson.

Hi Dan,

Sorry, attached is engine.log; I've taken out the two sections where each of the VMs was paused.

Does the error "VM babbage has paused due to no Storage space error" mean the main storage domain has run out of storage, or that the VM has run out? Both VMs appear to have been running on node01 when they were paused.

My vdsm versions are all:

vdsm-cli-4.13.0-11.el6.noarch
vdsm-python-cpopen-4.13.0-11.el6.x86_64
vdsm-xmlrpc-4.13.0-11.el6.noarch
vdsm-4.13.0-11.el6.x86_64
vdsm-python-4.13.0-11.el6.x86_64

I currently have a 61% over-allocation ratio on my primary storage domain, with 1948 GB available.

Thank you. Regards,
Neil Wilson.

The storage space threshold is configured in percentages, not physical size, so if 20 GB is less than 10% (the default config) of your storage, the VMs will be paused regardless of how many GB you still have. This is configurable, though, so you can change it to less than 10% if you like.

To answer the second question: VMs will not pause on an ENOSPC error if they run out of space internally, only if the external storage cannot be consumed. So only if you run out of space on the storage domain, not if a VM runs out of space in its own filesystem.
-- Dafna Ron

Thanks for the replies guys,

Looking at my two VMs that have paused so far, the oVirt GUI shows the following sizes under Disks:

VM Reports: virtual size 35 GB, actual size 41 GB. On the CentOS OS side, disk size is 33G with 12G used and 19G available (40% usage).

VM Babbage: virtual size 40 GB, actual size 53 GB. On the Server 2003 OS side, disk size is 39.9 GB with 16.3 GB used, so under 50% usage.

Do you see any issues with the above stats? Then my main Datacenter storage is as follows:

Size: 6887 GB
Available: 1948 GB
Used: 4939 GB
Allocated: 1196 GB
Over Allocation: 61%

Could there be a problem here? I can allocate additional LUNs if you feel the space isn't correctly allocated.

Apologies for going on about this, but I'm really concerned that something isn't right and that I might have a serious problem if an important machine locks up.

Thank you, much appreciated. Regards,
Neil Wilson.
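For scale, a quick check of these figures against the default 10% watermark mentioned above (treat that default as an assumption to verify on your vdsm version):

    10% of 6887 GB total = ~689 GB
    available            = 1948 GB  -> comfortably above the watermark

So on these numbers the domain itself should not have been the trigger.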

Hi Neil,

Can you please attach the vdsm logs? Also, as for the VMs: do they have any snapshots? And from your suggestion to allocate more LUNs, are you using iSCSI or FC?

Thanks,
Dafna
-- Dafna Ron

Hi Dafna,

Thanks. The vdsm logs are quite large, so I've only attached the logs covering the pause of the VM called Babbage on the 19th of Jan.

As for snapshots: Babbage has one from June 2013, and Reports has two, from June and Oct 2013. I'm using FC storage, with 11 VMs and 3 nodes/hosts; 9 of the 11 VMs have thin-provisioned disks.

Please shout if you'd like any further info or logs.

Thank you. Regards,
Neil Wilson.
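If attachment size is the blocker, compressing the logs per host first helps (a sketch; the output filename is arbitrary):

    tar czf vdsm-logs-$(hostname).tar.gz /var/log/vdsm/vdsm.log*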

Thanks Neil, but looking at the logs you sent, the ERRORs for these VMs started long before that. Can you please grep for this thread in all the logs? Thread-286029 (it's for VM 2736197b-6dc3-4155-9a29-9306ca64881d). Let's try to see when the ERROR started for this thread.

Thanks,
Dafna
-- Dafna Ron
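For example, on each host, covering rotated logs too (a sketch; adjust the path if your logs live elsewhere):

    zgrep 'Thread-286029' /var/log/vdsm/vdsm.log*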

Could you please attach the engine.log from the same time? Thanks!
From: "Neil" <nwilson123@gmail.com> To: dron@redhat.com Cc: "users" <users@ovirt.org> Sent: Wednesday, January 22, 2014 1:14:25 PM Subject: Re: [Users] Vm's being paused
Hi Dafna,
Thanks.
The vdsm logs are quite large, so I've only attached the logs for the pause of the VM called Babbage on the 19th of Jan.
As for snapshots, Babbage has one from June 2013 and Reports has two from June and Oct 2013.
I'm using FC storage, with 11 VM's and 3 nodes/hosts, 9 of the 11 VM's have thin provisioned disks.
Please shout if you'd like any further info or logs.
Thank you.
Regards.
Neil Wilson.
On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron@redhat.com> wrote:
Hi Neil,
Can you please attach the vdsm logs? also, as for the vm's, do they have any snapshots? from your suggestion to allocate more luns, are you using iscsi or FC?
Thanks,
Dafna
On 01/22/2014 08:45 AM, Neil wrote:
Thanks for the replies guys,
Looking at my two VM's that have paused so far through the oVirt GUI the following sizes show under Disks.
VM Reports: Virtual Size 35GB, Actual Size 41GB Looking on the Centos OS side, Disk size is 33G and used is 12G with 19G available (40%) usage.
VM Babbage: Virtual Size is 40GB, Actual Size 53GB On the Server 2003 OS side, Disk size is 39.9Gb and used is 16.3G, so under 50% usage.
Do you see any issues with the above stats?
Then my main Datacenter storage is as follows...
Size: 6887 GB Available: 1948 GB Used: 4939 GB Allocated: 1196 GB Over Allocation: 61%
Could there be a problem here? I can allocate additional LUNS if you feel the space isn't correctly allocated.
Apologies for going on about this, but I'm really concerned that something isn't right and I might have a serious problem if an important machine locks up.
Thank you and much appreciated.
Regards.
Neil Wilson.
On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron@redhat.com> wrote:
the storage space is configured in percentages and not physical size. so if 20G is less than 10% (default config) of your storage it will pause the vms regardless of how much GB you still have. this is configurable though so you can change it to less than 10% if you like.
to answer the second question, vm's will not pause on ENOSpace error if they run out of space internally but only if the external storage cannot be consumed. so only if you run out of space in the storage and and not if vm runs out of space in its on fs.
On 01/21/2014 09:51 AM, Neil wrote:
Hi Dan,
Sorry, attached is engine.log I've taken out the two sections where each of the VM's were paused.
Does the error "VM babbage has paused due to no Storage space error" mean the main storage domain has run out of storage, or that the VM has run out?
Both VM's appear to have been running on node01 when they were paused. My vdsm versions are all...
vdsm-cli-4.13.0-11.el6.noarch vdsm-python-cpopen-4.13.0-11.el6.x86_64 vdsm-xmlrpc-4.13.0-11.el6.noarch vdsm-4.13.0-11.el6.x86_64 vdsm-python-4.13.0-11.el6.x86_64
I currently have a 61% over allocation ratio on my primary storage domain, with 1948GB available.
Thank you.
Regards.
Neil Wilson.
On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123@gmail.com> wrote:
Hi Dan,
Sorry for only coming back to you now. The VM's are thin provisioned. The Server 2003 VM hasn't run out of disk space there is about 20Gigs free, and the usage barely grows as the VM only shares printers. The other VM that paused is also on thin provisioned disks and also has plenty space, this guest is running Centos 6.3 64bit and only runs basic reporting.
After the 2003 guest was rebooted, the network card showed up as unplugged in ovirt, and we had to remove it, and re-add it again in order to correct the issue. The Centos VM did not have the same issue.
I'm concerned that this might happen to a VM that's quite critical, any thoughts or ideas?
The only recent changes have been updating from Dreyou 3.2 to the official Centos repo and updating to 3.3.1-2. Prior to updating I haven't had this issue.
Any assistance is greatly appreciated.
Thank you.
Regards.
Neil Wilson.
On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dyasny@gmail.com> wrote: > > Do you have the VMs on thin provisioned storage or sparse disks? > > Pausing happens when the VM has an IO error or runs out of space on > the > storage domain, and it is done intentionally, so that the VM will not > experience a disk corruption. If you have thin provisioned disks, and > the VM > writes to it's disks faster than the disks can grow, this is exactly > what > you will see > > > On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson123@gmail.com> wrote: >> >> Hi guys, >> >> I've had two different Vm's randomly pause this past week and inside >> ovirt >> the error received is something like 'vm ran out of storage and was >> paused'. >> Resuming the vm's didn't work and I had to force them off and then on >> which >> resolved the issue. >> >> Has anyone had this issue before? >> >> I realise this is very vague so if you could please let me know which >> logs >> to send in. >> >> Thank you >> >> Regards. >> >> Neil Wilson >> >> >> _______________________________________________ >> Users mailing list >> Users@ovirt.org >> http://lists.ovirt.org/mailman/listinfo/users >> > > _______________________________________________ > Users mailing list > Users@ovirt.org > http://lists.ovirt.org/mailman/listinfo/users
-- Dafna Ron
-- Dafna Ron
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

Hi guys,

Sorry for the very late reply; I've been out of the office doing installations. Unfortunately, due to the time delay, my oldest logs only go as far back as the attached.

I've only grep'd for Thread-286029 in the vdsm log. For the engine.log I'm not sure what info is required, so the full log is attached.

Please shout if you need any info or further details.

Thank you very much. Regards,
Neil Wilson.

OK. You have several issues in the setup... so it's a bit tricky...

You had a problem with your storage on the 14th of Jan, and one of the hosts rebooted (if you have the vdsm log from that day I can see what happened on the vdsm side). In the engine log I could see a problem with the export domain, and this should not have caused a reboot. Can you tell me if you had a problem with the data domain as well, or was it just the export domain? Were any VMs being exported/imported at that time? In any case, this is a bug.

As for the VMs: if they are no longer in a migrating state, then please restart the ovirt-engine service (this looks like a cache issue):

2014-01-14 09:38:08,590 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-34) RefreshVmList vm id 2736197b-6dc3-4155-9a29-9306ca64881d status = Down on vds node03.blabla.com ignoring it in the refresh until migration is done

If they are in a migrating state, there should have been a timeout a long time ago. Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on all hosts?

Last thing: your ISO domain seems to be having issues as well. This should not affect the host status, but if any of the VMs were booted from an ISO, or have an ISO attached in the boot sequence, this would explain the migration issue.

Thanks,
Dafna
-- Dafna Ron

Hi Dafna,

Thanks for coming back to me. I'll try to answer your queries one by one.

On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron@redhat.com> wrote:
you had a problem with your storage on the 14th of Jan and one of the hosts rebooted (if you have the vdsm log from that day I can see what happened on the vdsm side). In the engine log I could see a problem with the export domain, and this should not have caused a reboot.
1.) Unfortunately I don't have logs going back that far. Looking at the uptime of all 3 hosts, the one with the least uptime is 21 days and the others are all over 40 days, so there definitely wasn't a host that rebooted on the 14th of Jan. Would a network or firewall issue also make the error you've seen look as if a host had rebooted? There was a bonding mode change on the 14th of January, so perhaps that caused the issue?
Can you tell me if you had a problem with the data domain as well, or was it just the export domain? Were any VMs being exported/imported at that time? In any case, this is a bug.
2.) I think this was the same day that the bonding mode was changed (by mistake) on the live host that was running the SPM. I haven't done any importing or exporting on this oVirt setup for a few years.
As for the VMs: if they are no longer in a migrating state, then please restart the ovirt-engine service (this looks like a cache issue).
3.) Restarted ovirt-engine; logging now appears to be normal, without any errors.
If they are in a migrating state, there should have been a timeout a long time ago. Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on all hosts?
4.) Ran on all hosts...

node01.blabla.com
63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam     Up
502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage  Up
ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports  Up
2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux      Up
0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle   Up

 Id    Name       State
----------------------------------------------------
 1     adam       running
 2     reports    running
 4     tux        running
 6     Moodle     running
 7     babbage    running

node02.blabla.com
dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam      Up
23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02   Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software  Up
179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra    Up

 Id    Name       State
----------------------------------------------------
 9     proxy02    running
 10    spam       running
 12    software   running
 13    zimbra     running

node03.blabla.com
e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp    Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver  Up

 Id    Name       State
----------------------------------------------------
 13    oliver     running
 14    dhcp       running
Last thing: your ISO domain seems to be having issues as well. This should not affect the host status, but if any of the VMs were booted from an ISO, or have an ISO attached in the boot sequence, this would explain the migration issue.
There was an ISO domain issue a while back, but this was corrected about 2 weeks ago, after iptables re-enabled itself on boot following updates. I've checked now and the ISO domain appears to be fine, and I can see all the images stored within.

I've stumbled across what appears to be another error: all three hosts are showing this over and over in /var/log/messages, and I'm not sure if it's related...

Jan 28 14:58:59 node01 vdsm vm.Vm ERROR vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2ce0998>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'

I've attached the full vdsm log from node02 to this reply. Please shout if you need anything else.

Thank you. Regards,
Neil Wilson.
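Since iptables re-enabling itself at boot is what caused the earlier ISO domain trouble, it may be worth pinning it off explicitly (an EL6-style sketch, assuming the stock service names; the alternative is to leave it on and open the ports oVirt needs):

    chkconfig iptables off   # keep it from starting at boot
    service iptables stop    # stop the running instance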
On 01/28/2014 09:28 AM, Neil wrote:
Hi guys,
Sorry for the very late reply, I've been out of the office doing installations. Unfortunately due to the time delay, my oldest logs are only as far back as the attached.
I've only grep'd for Thread-286029 in the vdsm log. The engine.log I'm not sure what info is required, so the full log is attached.
Please shout if you need any info or further details.
Thank you very much.
Regards.
Neil Wilson.
On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin@redhat.com> wrote:
Could you please attach the engine.log from the same time?
thanks!
----- Original Message -----
From: "Neil" <nwilson123@gmail.com> To: dron@redhat.com Cc: "users" <users@ovirt.org> Sent: Wednesday, January 22, 2014 1:14:25 PM Subject: Re: [Users] Vm's being paused
Hi Dafna,
Thanks.
The vdsm logs are quite large, so I've only attached the logs for the pause of the VM called Babbage on the 19th of Jan.
As for snapshots, Babbage has one from June 2013 and Reports has two from June and Oct 2013.
I'm using FC storage, with 11 VMs and 3 nodes/hosts; 9 of the 11 VMs have thin provisioned disks.
Please shout if you'd like any further info or logs.
Thank you.
Regards.
Neil Wilson.
On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <dron@redhat.com> wrote:
Hi Neil,
Can you please attach the vdsm logs? Also, as for the VMs, do they have any snapshots? From your suggestion to allocate more LUNs, are you using iSCSI or FC?
Thanks,
Dafna
On 01/22/2014 08:45 AM, Neil wrote:
Thanks for the replies guys,
Looking at my two VM's that have paused so far through the oVirt GUI the following sizes show under Disks.
VM Reports: Virtual Size 35GB, Actual Size 41GB. Looking on the CentOS OS side, disk size is 33G and used is 12G, with 19G available (40% usage).
VM Babbage: Virtual Size is 40GB, Actual Size 53GB. On the Server 2003 OS side, disk size is 39.9GB and used is 16.3G, so under 50% usage.
Do you see any issues with the above stats?
Then my main Datacenter storage is as follows...
Size: 6887 GB
Available: 1948 GB
Used: 4939 GB
Allocated: 1196 GB
Over Allocation: 61%
Could there be a problem here? I can allocate additional LUNS if you feel the space isn't correctly allocated.
Apologies for going on about this, but I'm really concerned that something isn't right and I might have a serious problem if an important machine locks up.
Thank you and much appreciated.
Regards.
Neil Wilson.
On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <dron@redhat.com> wrote:

The storage space is configured in percentages and not physical size, so if 20G is less than 10% (the default config) of your storage it will pause the VMs regardless of how many GB you still have. This is configurable though, so you can change it to less than 10% if you like.

To answer the second question, VMs will not pause on an ENOSPC error if they run out of space internally, but only if the external storage cannot be consumed. So only if you run out of space on the storage, and not if a VM runs out of space in its own fs.

On 01/21/2014 09:51 AM, Neil wrote:

Hi Dan,

Sorry, attached is engine.log; I've taken out the two sections where each of the VMs were paused.

Does the error "VM babbage has paused due to no Storage space error" mean the main storage domain has run out of storage, or that the VM has run out?

Both VMs appear to have been running on node01 when they were paused. My vdsm versions are all...

vdsm-cli-4.13.0-11.el6.noarch
vdsm-python-cpopen-4.13.0-11.el6.x86_64
vdsm-xmlrpc-4.13.0-11.el6.noarch
vdsm-4.13.0-11.el6.x86_64
vdsm-python-4.13.0-11.el6.x86_64

I currently have a 61% over-allocation ratio on my primary storage domain, with 1948GB available.

On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson123@gmail.com> wrote:

Hi Dan,

Sorry for only coming back to you now. The VMs are thin provisioned. The Server 2003 VM hasn't run out of disk space; there is about 20 gigs free, and the usage barely grows as the VM only shares printers. The other VM that paused is also on thin provisioned disks and also has plenty of space; this guest is running CentOS 6.3 64bit and only runs basic reporting.

After the 2003 guest was rebooted, the network card showed up as unplugged in oVirt, and we had to remove it and re-add it again in order to correct the issue. The CentOS VM did not have the same issue.

I'm concerned that this might happen to a VM that's quite critical - any thoughts or ideas?

The only recent changes have been updating from Dreyou 3.2 to the official CentOS repo and updating to 3.3.1-2. Prior to updating I hadn't had this issue.

Any assistance is greatly appreciated.

Thank you.

Regards.

Neil Wilson.
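A side note on the tunables mentioned above: the vdsm-side watermark that controls when a thin block volume gets extended (a different knob from the engine-side free-space percentage Dafna describes) lives in /etc/vdsm/vdsm.conf. A hedged sketch - the key names are vdsm's, but the values shown are assumed defaults, so verify against your own install:

    [irs]
    # extend a thin block volume once it is this percent utilised...
    volume_utilization_percent = 50
    # ...growing it by this many megabytes per extension
    volume_utilization_chunk_mb = 1024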

Yes - the engine lost communication with vdsm, and it has no way of knowing whether the host went down or there was a network issue, so a network issue would cause the same errors that I see in the logs.

The error you posted about the ISO is the reason the VMs failed migration - if a VM is running with a CD attached and the CD is gone, then the VM will not be able to be migrated.

After the engine restart, do you still see a problem with the size, or did the report of size change?

Dafna

On 01/28/2014 01:02 PM, Neil wrote:
Hi Dafna,
Thanks for coming back to me. I'll try to answer your queries one by one.
On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron@redhat.com> wrote:
You had a problem with your storage on the 14th of Jan and one of the hosts rebooted (if you have the vdsm log from that day then I can see what happened on the vdsm side). In engine, I could see a problem with the export domain, and this should not have caused a reboot.

1.) Unfortunately I don't have logs going back that far. Looking at all 3 hosts' uptime, the one with the least uptime is 21 days and the others are all over 40 days, so there definitely wasn't a host that rebooted on the 14th of Jan. Would a network issue or firewall issue also cause the error you've seen to look as if a host rebooted? There was a bonding mode change on the 14th of January, so perhaps this caused the issue?
Can you tell me if you had a problem with the data domain as well, or was it just the export domain? Were any VMs being exported/imported at that time? In any case - this is a bug.

2.) I think this was the same day that the bonding mode was changed on the host while the host was live (by mistake), and it had SPM running on it. I haven't done any importing or exporting for a few years on this oVirt setup.
As for the VMs - if the VMs are no longer in a migrating state, then please restart the ovirt-engine service (looks like a cache issue).

3.) Restarted ovirt-engine; logging now appears to be normal without any errors.
Hi Dafna,

Thanks for clarifying that. I found the migration issue, and this was resolved once I sorted out the ISO domain problem.

I'm sorry, I don't understand your last question: "after the engine restart, do you still see a problem with the size, or did the report of size change?"

The migration issue is resolved; now it's just a matter of tracking down why the two VMs paused on their own, one on the 8th of Jan (I think) and one on the 19th of Jan.

Thank you.

Regards.

Neil Wilson.
Sorry, more on this issue. I see my logs are rapidly filling up my disk space on node02 with this error in /var/log/messages...

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8 not assigned to domain

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'

Not sure if this is related at all though?

Thanks.

Regards.

Neil Wilson.
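As an aside, the first traceback shows _highWrite polling the disk watermark via libvirt's virDomainGetBlockInfo, and libvirt rejecting the path because it is no longer attached to the domain. A minimal sketch of the same call through the libvirt Python bindings - assuming the VM name 'spam' (matched from the node02 vdsClient listing for vmId dfa2cf7c...) and reusing the path from the log, purely for illustration:

    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')
    dom = conn.lookupByName('spam')  # assumed mapping: vmId dfa2cf7c... is 'spam' on node02

    path = ('/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e'
            '/images/fac8a3bb-e414-43c0-affc-6e2628757a28'
            '/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8')

    # blockInfo() returns [capacity, allocation, physical] for a disk that is
    # currently attached to the domain; for any other path it raises
    # libvirtError "invalid path ... not assigned to domain", as in the log above.
    capacity, allocation, physical = dom.blockInfo(path, 0)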
The reason I asked about the size is because this was the original issue, no? VMs pausing on lack of space?

You're having a problem with your data domains. Can you check the route from the hosts to the storage? I think that you have some disconnection to the storage from the hosts. Since it's random and not from all the VMs, I would suggest that it's a routing problem?

Thanks,

Dafna

On 01/29/2014 08:00 AM, Neil wrote:
Hi Dafna,

On Wed, Jan 29, 2014 at 1:14 PM, Dafna Ron <dron@redhat.com> wrote:
The reason I asked about the size is because this was the original issue, no? vm's pausing on lack of space?
Apologies, I just wanted to make sure it was still about this pausing and not the original migration issue that I think you were also helping me with a few weeks back.
You're having a problem with your data domains. Can you check the route from the hosts to the storage? I think that you have some disconnection to the storage from the hosts. Since it's random and not from all the vm's, I would suggest that it's a routing problem? Thanks, Dafna
The connection to the main data domain is 8Gb Fibre Channel directly from each of the hosts to the FC SAN, so if it is a connection issue then I can't understand how anything would be working. Or am I barking up the wrong tree completely? There were some ethernet network bridging changes on each of the hosts in early January, but these would only affect the NFS mounted ISO domain, or could this be the cause of the problems? Is this disconnection causing the huge log files that I sent previously?

Thank you.

Regards.

Neil Wilson.
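For what it's worth, the FC side can be checked per host without touching oVirt. A minimal sketch, assuming device-mapper-multipath is in use on the hosts:

    # Port state of every FC HBA on this host (healthy links report 'Online'):
    cat /sys/class/fc_host/host*/port_state
    # All paths to the storage domain LUNs; failed/faulty paths show up here:
    multipath -ll

If every path stays Online and multipath shows no failed paths during one of the pauses, the FC route itself is probably not the culprit.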
On 01/29/2014 08:00 AM, Neil wrote:
Sorry, more on this issue, I see my logs are rapidly filling up my disk space on node02 with this error in /var/log/messages...
Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8 not assigned to domain

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'
Not sure if this is related at all though?
Thanks.
Regards.
Neil Wilson.
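As a side note, _highWrite is the vdsm stats function that watches thin-provisioned volumes and requests an extension when they near their watermark, so the invalid-path error above means it is polling a volume that no longer maps to that path. A minimal sketch for checking, using the UUIDs from the error (on block storage the domain is an LVM volume group named after the storage domain UUID, and each volume is an LV named after the volume UUID):

    sd=0e6991ae-6238-4c61-96d2-ca8fed35161e     # storage domain = VG name
    vol=6c3e5ae8-23fc-4196-ba42-778bdc0fbad8    # volume = LV name, from the error
    # Does the LV still exist, and which image do its tags point at?
    lvs --noheadings -o lv_name,lv_tags "$sd" | grep "$vol"
    # Is anything actually mapped at the path vdsm is polling?
    ls -l /rhev/data-center/mnt/blockSD/$sd/images/

If memory serves, the extension watermark itself comes from vdsm.conf (volume_utilization_percent / volume_utilization_chunk_mb under [irs]), but that only matters once the paths resolve again.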
On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson123@gmail.com> wrote:
Hi Dafna,
Thanks for clarifying that, I found the migration issue and this was resolved once I sorted out the ISO domain problem.
I'm sorry I don't understand your last question? "> after the engine restart, do you still see a problem with the size or did the report of size changed?"
The migration issue was resolved; now it's just a matter of tracking down why the two VM's paused on their own, one on the 8th of Jan (I think) and one on the 19th of Jan.
Thank you.
Regards.
Neil Wilson.
On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <dron@redhat.com> wrote:
yes - engine lost communication with vdsm, and it has no way of knowing if the host is down or if there was a network issue, so a network issue would cause the same errors that I see in the logs.
The error you pasted about the iso is the reason the vm's have failed migration - if a vm is run with a cd and the cd is gone then the vm will not be able to be migrated.
after the engine restart, do you still see a problem with the size or did the report of size changed?
Dafna
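If a stale CD turns out to be what blocks migration, it can be ejected through vdsm directly. A hedged sketch - changeCD is a real vdsClient verb, but treating an empty file spec as "eject" is an assumption here (vmID as shown by 'vdsClient -s 0 list table'):

    # Eject whatever CD the guest still references (example vmID is proxy02's,
    # from the list output further down; an empty spec is assumed to eject):
    vdsClient -s 0 changeCD 23b9212c-1e25-4003-aa18-b1e819bf6bb1 ''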
On 01/28/2014 01:02 PM, Neil wrote:
Hi Dafna,
Thanks for coming back to me. I'll try answer your queries one by one.
On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron@redhat.com> wrote:
you had a problem with your storage on the 14th of Jan and one of the hosts rebooted (if you have the vdsm log from that day then I can see what happened on the vdsm side). In engine, I could see a problem with the export domain, and this should not have caused a reboot.
1.) Unfortunately I don't have logs going back that far. Looking at all 3 hosts' uptime, the one with the least uptime is 21 days and the others are all over 40 days, so there definitely wasn't a host that rebooted on the 14th of Jan. Would a network issue or firewall issue also cause the error you've seen to look as if a host rebooted? There was a bonding mode change on the 14th of January, so perhaps this caused the issue?
Can you tell me if you had a problem with the data domain as well, or was it just the export domain? Were you having any vm's exported/imported at that time? In any case - this is a bug.
2.) I think this was the same day that the bonding mode was changed (by mistake) on the host while it was live and running SPM. I haven't done any importing or exporting for a few years on this oVirt setup.
As for the vm's - if the vm's are no longer in migrating state then please restart the ovirt-engine service (looks like a cache issue)
3.) Restarted ovirt-engine, logging now appears to be normal without any errors.
if they are in migrating state - there should have been a timeout a long time ago. Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on all hosts?
4.) Ran on all hosts...
node01.blabla.com

63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam     Up
502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage  Up
ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports  Up
2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux      Up
0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle   Up

 Id    Name       State
----------------------------------------------------
 1     adam       running
 2     reports    running
 4     tux        running
 6     Moodle     running
 7     babbage    running

node02.blabla.com

dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam      Up
23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02   Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software  Up
179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra    Up

 Id    Name       State
----------------------------------------------------
 9     proxy02    running
 10    spam       running
 12    software   running
 13    zimbra     running

node03.blabla.com

e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp    Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver  Up

 Id    Name       State
----------------------------------------------------
 13    oliver     running
 14    dhcp       running
Last thing is that your ISO domain seems to be having issues as well. This should not affect the host status, but if any of the vm's were booted from an iso or have an iso attached in the boot sequence, this will explain the migration issue.
There was an ISO domain issue a while back, but this was corrected about 2 weeks ago after iptables re-enabled itself on boot after running updates. I've checked now and the ISO domain appears to be fine and I can see all the images stored within.
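Since iptables silently re-enabling itself is what broke the ISO domain, pinning its boot state may be worth doing on each host. A small sketch for CentOS 6 (or keep the service enabled and add the NFS rules instead of disabling it outright):

    # Show whether iptables starts in each runlevel:
    chkconfig --list iptables
    # Keep it disabled across reboots, if that is the policy here:
    chkconfig iptables off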
I've stumbled across what appears to be another error; all three hosts are showing this over and over in /var/log/messages, and I'm not sure if it's related? ...
Jan 28 14:58:59 node01 vdsm vm.Vm ERROR vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2ce0998>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'
I've attached the full vdsm log from node02 to this reply.
Please shout if you need anything else.
Thank you.
Regards.
Neil Wilson.
--
Dafna Ron

Sorry for the re-post, I was suddenly unsubscribed from the oVirt users list for the 3rd time this month.

Regards.

Neil Wilson.

mmm... I think that there is a bug with the iso domain... and I am not sure if it was already opened. Can you help me to debug this and see if it's related? :)

I think that you have some intermittent network issues to the iso domain, and every time it happens the vms that have booted with a cd (even if you detached it) would pause.

I have a second suspicion... is it possible that the vms that paused had a cd and you ejected it at some point? Perhaps after or during the network issues you had on the 14th? Can you run dumpxml from libvirt? Let me know if you need help with this command.
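A minimal sketch of that check (domain names as reported by 'virsh -r list'; the read-only flag avoids needing vdsm's libvirt credentials):

    # Dump the live guest XML and look for a CD-ROM device still wired in:
    virsh -r dumpxml proxy02 | grep -B2 -A6 "device='cdrom'"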
Thanks,

Dafna

--
Dafna Ron

Hi Dafna,

My sincere apologies for not coming back to you sooner on this. I've finally had a chance to start investigating, but between my last discussion and now, updates have been done on both the hosts and the engine, so perhaps something there has fixed it, as I haven't had a pause happen in quite a long time.

When trying to gather the info you requested above, I think I've found what is causing all the excessive logging that I sent through previously. I have a VM called Proxy, which a few years back ran out of disk space and wouldn't boot, as it required an fsck. We'd get an unknown storage error when doing an fsck on the image, so we had to attach a new LUN, dd out the entire image, run an fsck, and then re-import the image, which got the VM operational again. A while back we tried to remove the old disk image and received a storage error, and looking at this now I see that the old image was never successfully removed. If I look at the VM under Disks I can see the old disk still attached, but there is an hourglass showing instead of a green arrow. Also, right-clicking on the disk, the only option available is Add, so something seems to still have it locked.

In the logs I have the same error showing over and over...

AttributeError: 'Drive' object has no attribute 'format'

Thread-313::DEBUG::2014-02-24 16:44:30,056::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 8 edom: 10 level: 2 message: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/6128b18f-eee9-422e-bc8a-f3b9fe331b09/38ac4afa-22e9-4359-ac16-3ff5d7b3b6db not assigned to domain
Thread-313::ERROR::2014-02-24 16:44:30,057::sampling::355::vm.Vm::(collect) vmId=`23b9212c-1e25-4003-aa18-b1e819bf6bb1`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c9de30>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 528, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/vm.py", line 2288, in extendDrivesIfNeeded
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/vm.py", line 841, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/6128b18f-eee9-422e-bc8a-f3b9fe331b09/38ac4afa-22e9-4359-ac16-3ff5d7b3b6db not assigned to domain

Any ideas on how to get rid of the "corrupt" disk finally?

Thanks.

Regards.

Neil Wilson.
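The hourglass usually means the engine still flags that image as LOCKED after the failed remove. A hedged sketch to confirm before touching anything, assuming the default local PostgreSQL database named 'engine' (back it up first - editing it by hand is unsupported):

    # List images the engine still considers locked (imagestatus 2 = LOCKED):
    su - postgres -c "psql engine -c \"SELECT image_guid, imagestatus FROM images WHERE imagestatus = 2;\""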
On Wed, Jan 29, 2014 at 5:32 PM, Dafna Ron <dron@redhat.com> wrote:

mmm... I think that there is a bug with the iso domain... and I am not sure if it was already opened.
Can you help me to debug this and see if it's related? :)
I think that you have some intermittent network issues to the iso domain and every time it happens, the vms that have booted with a cd (even if you detached it) would pause.
I have a second suspicion... is it possible that the vms that paused had a cd and you ejected it at some point? Perhaps after or during the network issues you had on the 14th? Can you run dumpxml from libvirt? Let me know if you need help with this command.
Thanks,
Dafna
On 01/29/2014 02:16 PM, Neil wrote:
Hi Dafna,
On Wed, Jan 29, 2014 at 1:14 PM, Dafna Ron <dron@redhat.com> wrote:
The reason I asked about the size if because this was the original issue no? vm's pausing on lack of space?
Apologies, I just wanted to make sure it was still about this pausing and not the original migration issue that I think you were also helping me with a few weeks back.
You're having a problem with your data domains. Can you check the rout from the hosts to the storage? I think that you have some disconnection to the storage from the hosts since it's random and not from all the vm's I would suggest that its a routing problem? Thanks, Dafna
The connections to the main data domain is 8Gb Fibre Channel directly from each of the hosts to the FC SAN, so if it is a connection issue then I can't understand how anything would be working. Or am I barking up the wrong tree completely? There were some ethernet network bridging changes on each of the hosts in early January, but this would only affect the NFS mounted ISO domain, or could this be the cause of the problems?
Is this disconnection causing the huge log files that I sent previously?
Thank you.
Regards.
Neil Wilson.
On 01/29/2014 08:00 AM, Neil wrote:
Sorry, more on this issue, I see my logs are rapidly filling up my disk space on node02 with this error in /var/log/messages...
Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8 not assigned to domain

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'
Not sure if this is related at all though?
Thanks.
Regards.
Neil Wilson.
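On the log growth above: a rough way to see how fast these errors are accumulating and which VMs are affected, assuming the default /var/log/messages location:

    # count the failing stats calls per VM id
    grep "Stats function failed" /var/log/messages \
      | grep -o "vmId=.[0-9a-f-]*" | sort | uniq -c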
On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson123@gmail.com> wrote:
Hi Dafna,
Thanks for clarifying that, I found the migration issue and this was resolved once I sorted out the ISO domain problem.
I'm sorry, I don't understand your last question: "> after the engine restart, do you still see a problem with the size or did the report of size changed?"
The migration issue was resolved; now it's just a matter of tracking down why the two VMs paused on their own, one on the 8th of Jan (I think) and one on the 19th of Jan.
Thank you.
Regards.
Neil Wilson.
On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <dron@redhat.com> wrote:
Yes - engine lost communication with vdsm, and it has no way of knowing whether the host is down or whether there was a network issue, so a network issue would cause the same errors that I see in the logs.
The error you posted for the ISO domain is the reason the VMs have failed migration - if a VM is running with a CD and the CD is gone, then the VM will not be able to migrate.
After the engine restart, do you still see a problem with the size, or did the reported size change?
Dafna
[snip: older quoted thread trimmed; the same exchange appears in full later in this thread]

--
Dafna Ron

----- Original Message -----
From: "Neil" <nwilson123@gmail.com> To: dron@redhat.com Cc: "users" <users@ovirt.org> Sent: Monday, February 24, 2014 5:02:04 PM Subject: Re: [Users] Vm's being paused
[snip: Neil's message of 24 Feb, quoted in full above]
Any ideas on how to get rid of the "corrupt" disk finally?
This may happen when you migrate a VM between machines running different versions of vdsm. Vdsm changed the path to the disk recently, so when you migrate a VM, some disks are not found where vdsm thinks they should be. This leads to the missing 'format' attribute and to libvirt errors when trying to check the status of such disks.

This issue is fixed upstream: http://gerrit.ovirt.org/24202
And in ovirt-3.4: http://gerrit.ovirt.org/24324

I think the best way to avoid this issue is to have the same vdsm version on all hosts in the same cluster.

Nir
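A minimal way to check that the versions match, assuming passwordless ssh from the engine machine to the hosts (the node names are just the ones from earlier in this thread):

    # compare the installed vdsm build on every host in the cluster
    for h in node01 node02 node03; do
        echo -n "$h: "
        ssh "$h" rpm -q vdsm
    done

As for the disk stuck with the hourglass: the engine keeps disk status in its PostgreSQL database, and a failed remove can leave the image in LOCKED state. A commonly suggested sketch for clearing it is below, but treat it as an assumption about the 3.3-era schema rather than a supported procedure: back up the engine database first and verify the table and column names against your own installation. The image group id used here is the one from the log paths above, and restarting ovirt-engine afterwards helps the UI pick up the change.

    # on the engine machine, as a user allowed to reach the engine DB
    pg_dump -U postgres engine > /root/engine-before-unlock.sql
    # imagestatus: 1 = OK, 2 = LOCKED
    psql -U postgres -d engine -c \
      "UPDATE images SET imagestatus = 1 \
       WHERE image_group_id = '6128b18f-eee9-422e-bc8a-f3b9fe331b09' \
         AND imagestatus = 2;"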

On Jan 28, 2014, at 19:18 , Dafna Ron <dron@redhat.com> wrote:
Yes - engine lost communication with vdsm, and it has no way of knowing whether the host is down or whether there was a network issue, so a network issue would cause the same errors that I see in the logs.
The error you posted for the ISO domain is the reason the VMs have failed migration - if a VM is running with a CD and the CD is gone, then the VM will not be able to migrate.
which, as I learned last week, is not entirely correct. A pure libvirt VM seems to migrate fine without the CD… so it must be something somewhere in oVirt :( I'm looking into it, but just for future reference, we want this to work :)
After the engine restart, do you still see a problem with the size, or did the reported size change?
Dafna
On 01/28/2014 01:02 PM, Neil wrote:
Hi Dafna,
Thanks for coming back to me. I'll try answer your queries one by one.
On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <dron@redhat.com> wrote:
> You had a problem with your storage on the 14th of Jan and one of the hosts rebooted (if you have the vdsm log from that day then I can see what happened on the vdsm side). In engine, I could see a problem with the export domain, and this should not have caused a reboot.

1.) Unfortunately I don't have logs going back that far. Looking at all 3 hosts' uptime, the one with the least uptime is 21 days and the others are all over 40 days, so there definitely wasn't a host that rebooted on the 14th of Jan. Would a network or firewall issue also produce the error you've seen, making it look as if a host rebooted? There was a bonding mode change on the 14th of January, so perhaps this caused the issue?
> Can you tell me if you had a problem with the data domain as well, or was it just the export domain? Were you having any VMs exported/imported at that time? In any case - this is a bug.

2.) I think this was the same day that the bonding mode was changed by mistake on the host while it was live and running SPM. I haven't done any importing or exporting for a few years on this oVirt setup.
> As for the VMs - if the VMs are no longer in migrating state, then please restart the ovirt-engine service (looks like a cache issue).

3.) Restarted ovirt-engine; logging now appears to be normal, without any errors.
> If they are in migrating state - there should have been a timeout a long time ago. Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on all hosts?

4.) Ran on all hosts...
node01.blabla.com
63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam     Up
502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage  Up
ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports  Up
2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux      Up
0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle   Up

 Id    Name      State
----------------------------------------------------
 1     adam      running
 2     reports   running
 4     tux       running
 6     Moodle    running
 7     babbage   running

node02.blabla.com
dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam      Up
23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02   Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software  Up
179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra    Up

 Id    Name      State
----------------------------------------------------
 9     proxy02   running
 10    spam      running
 12    software  running
 13    zimbra    running

node03.blabla.com
e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp    Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver  Up

 Id    Name      State
----------------------------------------------------
 13    oliver    running
 14    dhcp      running
> Last thing is that your ISO domain seems to be having issues as well. This should not affect the host status, but if any of the VMs were booted from an ISO or have an ISO attached in the boot sequence, this will explain the migration issue.

There was an ISO domain issue a while back, but this was corrected about 2 weeks ago after iptables re-enabled itself on boot after running updates. I've checked now and the ISO domain appears to be fine, and I can see all the images stored within.
I've stumbled across what appears to be another error; all three hosts are showing this over and over in /var/log/messages, and I'm not sure if it's related...
Jan 28 14:58:59 node01 vdsm vm.Vm ERROR vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2ce0998>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'
I've attached the full vdsm log from node02 to this reply.
Please shout if you need anything else.
Thank you.
Regards.
Neil Wilson.
On 01/28/2014 09:28 AM, Neil wrote:
Hi guys,
Sorry for the very late reply, I've been out of the office doing installations. Unfortunately due to the time delay, my oldest logs are only as far back as the attached.
I've only grep'd for Thread-286029 in the vdsm log. The engine.log I'm not sure what info is required, so the full log is attached.
Please shout if you need any info or further details.
Thank you very much.
Regards.
Neil Wilson.
On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine <mbourvin@redhat.com> wrote:
Could you please attach the engine.log from the same time?
thanks!
----- Original Message -----
From: "Neil" <nwilson123@gmail.com> To: dron@redhat.com Cc: "users" <users@ovirt.org> Sent: Wednesday, January 22, 2014 1:14:25 PM Subject: Re: [Users] Vm's being paused
Hi Dafna,
Thanks.
The vdsm logs are quite large, so I've only attached the logs for the pause of the VM called Babbage on the 19th of Jan.
As for snapshots, Babbage has one from June 2013 and Reports has two from June and Oct 2013.
I'm using FC storage, with 11 VM's and 3 nodes/hosts, 9 of the 11 VM's have thin provisioned disks.
Please shout if you'd like any further info or logs.
Thank you.
Regards.
Neil Wilson.
[snip: older quoted thread trimmed]

--
Dafna Ron
participants (7)

- Dafna Ron
- Dan Yasny
- Meital Bourvine
- Michal Skrivanek
- Neil
- Nir Soffer
- Sven Kieske