[ovirt-users] VM has been paused due to no Storage space error.

nicolas at devels.es
Thu Apr 14 10:23:04 UTC 2016


Hi Nir,

On 2016-04-14 11:02, Nir Soffer wrote:
> On Thu, Apr 14, 2016 at 12:38 PM, Fred Rolland <frolland at redhat.com> 
> wrote:
>> Nir,
>> See attached the repoplot output.
> 
> So we have about one concurrent lvm command without any disk operations,
> and everything seems snappy.
> 
> Nicolás, maybe this storage or the host is overloaded by the vms? Are
> your vms doing a lot of io?
> 

Not that I know of; it should actually have been a "calm" time slot as far
as I/O goes, and the storage wasn't overloaded at that time either. If I'm
not mistaken, the repoplot report shows two LVM operations running at a
time; maybe that has something to do with it? (Although, as you say, the
lvextend is just a metadata change...)

> The lvextend operation should be a very fast operation; it is just a
> metadata change, allocating a couple of extents to that lv.
> 
> Zdenek, how do you suggest debugging slow lvm commands?
> 
> See the attached pdf, lvm commands took 15-50 seconds.
> 
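A rough way to check this without repoplot: the sketch below pairs each
Storage.Misc.excCmd "(cmd)" start line in vdsm.log with its matching
SUCCESS/FAILED line by task id and prints how long each lvm call took. It
is a hypothetical helper based on the log format quoted further down in
this thread, not part of vdsm:

    #!/usr/bin/env python
    # Hypothetical helper: measure lvm command durations in a vdsm log.
    # Assumes lines like:
    #   <task>::DEBUG::2016-04-13 10:52:06,759::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset ...
    #   <task>::DEBUG::2016-04-13 10:52:22,217::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: ...
    import re
    import sys
    from datetime import datetime

    LINE = re.compile(
        r'^(?P<task>[0-9a-f-]+)::DEBUG::'
        r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})'
        r'::lvm::\d+::Storage\.Misc\.excCmd::\(cmd\) (?P<rest>.*)')

    started = {}
    for line in sys.stdin:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group('ts'), '%Y-%m-%d %H:%M:%S,%f')
        task, rest = m.group('task'), m.group('rest')
        if rest.startswith(('SUCCESS', 'FAILED')):
            if task in started:  # end of the command for this task
                delta = (ts - started.pop(task)).total_seconds()
                print('%s %6.1fs' % (task, delta))
        else:                    # start of the command
            started[task] = ts

Running it as "python lvm_times.py < vdsm.log" (the file name is made up)
should make any 15-50 second lvm calls stand out.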
>> 
>> On Thu, Apr 14, 2016 at 12:18 PM, Nir Soffer <nsoffer at redhat.com> 
>> wrote:
>>> 
>>> On Thu, Apr 14, 2016 at 12:02 PM, Fred Rolland <frolland at redhat.com>
>>> wrote:
>>> > From the log, we can see that the lvextend command took 18 sec, which is
>>> > quite long.
>>> 
>>> Fred, can you run repoplot on this log file? It may explain why this
>>> lvm call took 18 seconds.
>>> 
>>> Nir
>>> 
>>> >
>>> > 60decf0c-6d9a-4c3b-bee6-de9d2ff05e85::DEBUG::2016-04-13
>>> > 10:52:06,759::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>> > --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm lvextend --config '
>>> > devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>> > write_cache_state=0 disable_after_error_count=3 filter = [
>>> > '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ] }
>>> > global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>>> > use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } '
>>> > --autobackup n --size 6016m
>>> > 5de4a000-a9c4-489c-8eee-10368647c413/721d09bc-60e7-4310-9ba2-522d2a4b03d0
>>> > (cwd None)
>>> > ....
>>> > 60decf0c-6d9a-4c3b-bee6-de9d2ff05e85::DEBUG::2016-04-13
>>> > 10:52:22,217::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>>> > WARNING: lvmetad is running but disabled. Restart lvmetad before
>>> > enabling it!\n  WARNING: This metadata update is NOT backed up\n';
>>> > <rc> = 0
>>> >
>>> >
>>> > The watermark can be configured by the following value:
>>> >
>>> >     ('volume_utilization_percent', '50',
>>> >         'Together with volume_utilization_chunk_mb, set the minimal free '
>>> >         'space before a thin provisioned block volume is extended. Use '
>>> >         'lower values to extend earlier.')
>>> >
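For context on how that watermark works (together with the companion
option volume_utilization_chunk_mb, whose default is 1024), here is an
illustrative sketch of the trigger condition; the helper and numbers are
mine, not vdsm's actual code:

    # Illustrative sketch of the watermark logic described above.
    # Defaults: volume_utilization_percent = 50, volume_utilization_chunk_mb = 1024.
    MB = 1024 * 1024

    volume_utilization_percent = 50
    volume_utilization_chunk_mb = 1024

    # Free-space threshold below which vdsm asks the SPM to extend the LV.
    # With the defaults: (100 - 50)% of 1024 MB = 512 MB of free space.
    watermark = ((100 - volume_utilization_percent)
                 * volume_utilization_chunk_mb * MB // 100)

    def should_extend(physical_size, guest_allocation):
        # Extend once the guest's writes get within the watermark of the
        # end of the currently allocated LV; lower percent values raise
        # the threshold, i.e. extend earlier.
        return physical_size - guest_allocation < watermark

    # Example: a 6016 MB LV with 5600 MB written -> 416 MB free -> extend.
    print(should_extend(6016 * MB, 5600 * MB))  # True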
>>> > On Thu, Apr 14, 2016 at 11:42 AM, Michal Skrivanek
>>> > <michal.skrivanek at redhat.com> wrote:
>>> >>
>>> >>
>>> >> > On 14 Apr 2016, at 09:57, nicolas at devels.es wrote:
>>> >> >
>>> >> > Ok, that makes sense, thanks for the insight, both Alex and Fred.
>>> >> > I'm attaching the VDSM log of the SPM node at the time of the pause.
>>> >> > I couldn't find anything that would clearly identify the problem,
>>> >> > but maybe you'll be able to.
>>> >>
>>> >> In extreme conditions it will happen. When your storage is slow to
>>> >> respond to the extension request and your write rate is very high, it
>>> >> may happen, as it is happening to you, that you run out of space
>>> >> sooner than the extension finishes. You can change the watermark
>>> >> value, I guess (right, Fred?), but it would be better to plan a bit
>>> >> more ahead and either use preallocated disks, or create thin disks
>>> >> and then allocate the expected size in advance before the operation
>>> >> that causes it (typically it only happens while untarring gigabytes
>>> >> of data, or during a huge database dump/restore).
>>> >> Even then, the VM should always be automatically resumed once the
>>> >> disk space is allocated.
>>> >>
>>> >> Thanks,
>>> >> michal
>>> >>
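For reference, changing the watermark Michal mentions means editing
/etc/vdsm/vdsm.conf on the hosts and restarting vdsmd; the exact values
below are only an example of extending earlier and in bigger chunks, not
a recommendation:

    [irs]
    # Ask for an extension while more free space is still left
    # (lower percent = extend earlier; default is 50).
    volume_utilization_percent = 25
    # Grow the LV in bigger steps, in megabytes (default is 1024).
    volume_utilization_chunk_mb = 2048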
>>> >> >
>>> >> > Thanks.
>>> >> >
>>> >> > Regards.
>>> >> >
>>> >> > On 2016-04-13 13:09, Fred Rolland wrote:
>>> >> >> Hi,
>>> >> >> Yes, just as Alex explained, if the disk has been created as thin
>>> >> >> provisioned, vdsm will extend it once a watermark is reached.
>>> >> >> Usually it should not get to a state where the VM is paused.
>>> >> >> From the log, you can see that the request for extension was sent
>>> >> >> before the VM got to the No Space Error.
>>> >> >> Later, we can see the VM resuming.
>>> >> >> INFO::2016-04-13 10:52:04,182::vm::1026::virt.vm::(extendDrivesIfNeeded)
>>> >> >> vmId=`f9cd282e-110a-4896-98d3-6d320662744d`::Requesting extension for volume
>>> >> >> ....
>>> >> >> INFO::2016-04-13 10:52:29,360::vm::3728::virt.vm::(onIOError)
>>> >> >> vmId=`f9cd282e-110a-4896-98d3-6d320662744d`::abnormal vm stop device
>>> >> >> virtio-disk0 error enospc
>>> >> >> ....
>>> >> >> INFO::2016-04-13 10:52:54,317::vm::5084::virt.vm::(_logGuestCpuStatus)
>>> >> >> vmId=`f9cd282e-110a-4896-98d3-6d320662744d`::CPU running: onResume
>>> >> >> Note that the extension is done on the SPM host, so it would be
>>> >> >> interesting to see the vdsm log from the host that was in the SPM
>>> >> >> role during this timeframe.
>>> >> >> Regards,
>>> >> >> Fred
>>> >> >> On Wed, Apr 13, 2016 at 2:43 PM, Alex Crow <acrow at integrafin.co.uk>
>>> >> >> wrote:
>>> >> >>> Hi,
>>> >> >>> If you have set up VM disks as thin provisioned, the VM may have
>>> >> >>> to pause while the disk image expands. You won't see this on VMs
>>> >> >>> with preallocated storage.
>>> >> >>> It's not the SAN that's running out of space; it's the VM image
>>> >> >>> needing to be expanded incrementally each time.
>>> >> >>> Cheers
>>> >> >>> Alex
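To put numbers on "incrementally each time" (assuming vdsm's default
1024 MB extension chunk; purely back-of-the-envelope):

    # Rough arithmetic: a long sequential write into a thin LV triggers
    # one extension request per chunk of new space.
    chunk_mb = 1024          # vdsm default volume_utilization_chunk_mb
    write_gb = 30            # e.g. untarring 30 GB of data

    rounds = write_gb * 1024 // chunk_mb
    print(rounds)            # ~30 extension requests; any one that
                             # finishes too late pauses the VM (enospc)
                             # until the LV has grown.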
>>> >> >>> On 13/04/16 12:04, nicolas at devels.es wrote:
>>> >> >>> Hi Fred,
>>> >> >>> This is iSCSI storage. I'm attaching the VDSM logs from the host
>>> >> >>> where this machine has been running. Should you need any further
>>> >> >>> info, don't hesitate to ask.
>>> >> >>> Thanks.
>>> >> >>> Regards.
>>> >> >>> On 2016-04-13 11:54, Fred Rolland wrote:
>>> >> >>> Hi,
>>> >> >>> What kind of storage do you have? (iSCSI, FC, NFS...)
>>> >> >>> Can you provide the vdsm logs from the host where this VM runs?
>>> >> >>> Thanks,
>>> >> >>> Freddy
>>> >> >>> On Wed, Apr 13, 2016 at 1:02 PM, <nicolas at devels.es> wrote:
>>> >> >>> Hi,
>>> >> >>> We're running oVirt 3.6.4.1-1. Lately we're seeing a bunch of
>>> >> >>> events like these:
>>> >> >>> 2016-04-13 10:52:30,735 INFO
>>> >> >>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> >> >>> (DefaultQuartzScheduler_Worker-86) [60dea18f] VM
>>> >> >>> 'f9cd282e-110a-4896-98d3-6d320662744d'(vm.domain.com) moved from
>>> >> >>> 'Up' --> 'Paused'
>>> >> >>> 2016-04-13 10:52:30,815 INFO
>>> >> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> >> >>> (DefaultQuartzScheduler_Worker-86) [60dea18f] Correlation ID: null,
>>> >> >>> Call Stack: null, Custom Event ID: -1, Message: VM vm.domain.com
>>> >> >>> has been paused.
>>> >> >>> 2016-04-13 10:52:30,898 ERROR
>>> >> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> >> >>> (DefaultQuartzScheduler_Worker-86) [60dea18f] Correlation ID: null,
>>> >> >>> Call Stack: null, Custom Event ID: -1, Message: VM vm.domain.com
>>> >> >>> has been paused due to no Storage space error.
>>> >> >>> 2016-04-13 10:52:52,320 WARN
>>> >> >>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
>>> >> >>> (org.ovirt.thread.pool-8-thread-38) [] domain
>>> >> >>> '5de4a000-a9c4-489c-8eee-10368647c413:iscsi01' in problem. vds:
>>> >> >>> 'host6.domain.com'
>>> >> >>> 2016-04-13 10:52:55,183 INFO
>>> >> >>> [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>>> >> >>> (DefaultQuartzScheduler_Worker-70) [3da0f3d4] VM
>>> >> >>> 'f9cd282e-110a-4896-98d3-6d320662744d'(vm.domain.com) moved from
>>> >> >>> 'Paused' --> 'Up'
>>> >> >>> 2016-04-13 10:52:55,318 INFO
>>> >> >>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> >> >>> (DefaultQuartzScheduler_Worker-70) [3da0f3d4] Correlation ID: null,
>>> >> >>> Call Stack: null, Custom Event ID: -1, Message: VM vm.domain.com
>>> >> >>> has recovered from paused back to up.
>>> >> >>> The storage domain is far from being full, though (400+ GB available
>>> >> >>> right now). Could this be related to this other issue [1]? If not,
>>> >> >>> how could I debug what's going on?
>>> >> >>> Thanks.
>>> >> >>> [1]: https://www.mail-archive.com/users@ovirt.org/msg32079.html
>>> >> > <vdsm.log.gz>



More information about the Users mailing list