[ovirt-users] Ongoing "VM is not responding" and "ETL sampling errors".

J Brian Ismay jismay at cenic.org
Thu Aug 10 00:11:54 UTC 2017


Some more information: I was doing more digging and found a possible 
correlation between the VM timeouts and the following sanlock 
messages:

Aug  9 07:47:57 hype-02 sanlock[1565]: 2017-08-09 07:47:57-0700 3509532 
[48218]: s49 delta_renew read timeout 10 sec offset 0 
/rhev/data-center/mnt/netapp:_tus__vms/b256c5ec-296a-46f0-8dec-55a1b90fd841/dom_md/ids
Aug  9 07:47:57 hype-02 sanlock[1565]: 2017-08-09 07:47:57-0700 3509532 
[48218]: s49 renewal error -202 delta_length 10 last_success 3509501
Aug  9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 
[51222]: s51 delta_renew read timeout 10 sec offset 0 
/var/run/vdsm/storage/8b22a6bd-50d4-4a6e-9b89-8b2e00e7835c/000b7ce5-fa09-41ad-b0c6-0839d550c3df/5e5c81d1-0075-4a2c-baf5-13a08b9d025b
Aug  9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 
[51222]: s51 renewal error -202 delta_length 10 last_success 3509502
Aug  9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 
[48256]: s50 delta_renew read timeout 10 sec offset 0 
/rhev/data-center/mnt/netapp:_hostedengine/8b22a6bd-50d4-4a6e-9b89-8b2e00e7835c/dom_md/ids
Aug  9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 
[48256]: s50 renewal error -202 delta_length 10 last_success 3509502
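
Side note for anyone reading along: those lines say sanlock's
delta-lease renewal reads of the dom_md/ids files are exceeding its
10-second read timeout on all three lockspaces shown, so the common
factor looks like slow NFS reads rather than anything per-VM. To see
whether plain reads of an ids file ever stall for a comparable time,
something like the rough Python sketch below could be left running on
a node. The path is copied from the log above; the 5-second warning
threshold and the 20-second probe interval are just guesses of mine,
not values sanlock uses.

#!/usr/bin/env python3
"""Rough probe of read latency on a sanlock 'ids' file over NFS.

Only a sketch, not part of oVirt or sanlock: it times a small read of
the dom_md/ids file every INTERVAL seconds and prints a warning when
the read takes longer than WARN_AFTER seconds.
"""
import os
import time

# Path copied from the sanlock messages above; adjust for your domain.
IDS_PATH = ("/rhev/data-center/mnt/netapp:_tus__vms/"
            "b256c5ec-296a-46f0-8dec-55a1b90fd841/dom_md/ids")
WARN_AFTER = 5.0   # seconds; well below sanlock's 10 s read timeout
INTERVAL = 20.0    # seconds between probes (roughly sanlock's renewal interval)


def probe_once(path):
    """Time one 4 KiB read from the start of the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to drop cached pages so the read actually goes
        # out to the NFS server (best effort only).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.monotonic()
        os.pread(fd, 4096, 0)
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    while True:
        elapsed = probe_once(IDS_PATH)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if elapsed >= WARN_AFTER:
            print("%s SLOW read: %.2f s" % (stamp, elapsed))
        else:
            print("%s read ok:  %.3f s" % (stamp, elapsed))
        time.sleep(INTERVAL)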

The sanlock messages above are from one of my nodes; the other "not 
responding" events also match sanlock renewal errors very closely, 
with the sanlock errors preceding the VM errors by a small margin. 
The other nodes have virtually identical logs.
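
To check that timing more systematically, a quick script along these
lines can pair the sanlock renewal errors from a node's
/var/log/messages with the 'Up' --> 'NotResponding' transitions in
engine.log. It is only a sketch based on the log excerpts in this
thread; the file locations, the year, the 7-hour UTC/PDT offset (per
the note in my earlier message below) and the 60-second match window
are assumptions to adjust as needed.

#!/usr/bin/env python3
"""Correlate sanlock renewal errors with engine.log 'NotResponding' events."""
import re
from datetime import datetime, timedelta

MESSAGES = "/var/log/messages"      # from the node
ENGINE_LOG = "engine.log"           # copied from the hosted engine
YEAR = 2017                         # syslog timestamps carry no year
UTC_OFFSET = timedelta(hours=7)     # engine.log is UTC, syslog is local PDT
WINDOW = timedelta(seconds=60)      # how close events must be to count as a match

SANLOCK_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sanlock.*"
                        r"(delta_renew read timeout|renewal error)")
ENGINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
                       r"moved from 'Up' --> 'NotResponding'")


def sanlock_events(path):
    """Return (local_timestamp, line) for every sanlock timeout/renewal error."""
    events = []
    with open(path, errors="replace") as f:
        for line in f:
            m = SANLOCK_RE.match(line)
            if m:
                raw = " ".join(m.group(1).split())   # collapse double spaces
                ts = datetime.strptime("%d %s" % (YEAR, raw), "%Y %b %d %H:%M:%S")
                events.append((ts, line.rstrip()))
    return events


def engine_events(path):
    """Return (local_timestamp, line) for every Up -> NotResponding transition."""
    events = []
    with open(path, errors="replace") as f:
        for line in f:
            m = ENGINE_RE.match(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                events.append((ts - UTC_OFFSET, line.rstrip()))  # shift to local
    return events


if __name__ == "__main__":
    sanlock = sanlock_events(MESSAGES)
    for ts, line in engine_events(ENGINE_LOG):
        nearby = [(sts, sline) for sts, sline in sanlock if abs(sts - ts) <= WINDOW]
        if nearby:
            print(ts, line)
            for sts, sline in nearby:
                print("    ", sts, sline)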

--
Brian


On 2017-08-09 10:54, J Brian Ismay wrote:
> Hello the list,
> 
> I have been slowly bringing up a 9-node cluster for the last few 
> months.
> 
> All nodes are identical: dual 2-port 10G NICs, lots of memory and CPU.
> Storage is a NetApp filer accessed via NFS on a dedicated 10Gb
> dual-switch environment.
> 
> Generally everything is working fine, but ever since our last rebuild
> of the cluster in preparation for a move into production status, we
> have been getting repeated errors in the HostedEngine console:
> 
> VM foo is not responding.
> VM bar is not responding.
> VM baz is not responding.
> 
> These errors happen on a fairly regular basis and generally involve
> multiple VMs hosted by different nodes. When the errors occur I also
> lose external connectivity to the VM in question, both via its
> service IP address and via the oVirt console. The actual outages
> generally last 15-20 seconds, and then things recover and go back to
> normal.
> 
> We are also seeing a much more frequent error:
> 
> ETL service sampling has encountered an error. Please consult the
> service log for more details.
> 
> I have attached snippets from the engine's engine.log from this morning.
> If any other logs are needed to help with diagnosis, I can provide them.
> 
> --
> Brian Ismay
> SR. Systems Administrator
> jismay at cenic.org
> 
> ----
> engine.log: NOTE: the system clock is in UTC and local time is PDT, so
> this occurred at 07:48 AM local time.
> 
> 2017-08-09 14:48:37,237 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> '69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:37,277 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm1 is not responding.
> 2017-08-09 14:48:37,277 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> '4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:37,282 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm2 is not responding.
> 2017-08-09 14:48:38,326 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler5) [cf129f7] VM
> '35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:38,360 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm3 is not responding.
> 2017-08-09 14:48:38,360 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler5) [cf129f7] VM
> 'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:38,365 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm4 is not responding.
> 2017-08-09 14:48:49,075 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler8) [3b1149ff] VM
> 'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:49,130 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm5 is not responding.
> 2017-08-09 14:48:49,131 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler8) [3b1149ff] VM
> 'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:49,136 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm6 is not responding.
> 2017-08-09 14:48:52,221 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler7) [2973c87] VM
> '506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:52,226 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler7) [2973c87] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm7 is not responding.
> 2017-08-09 14:48:52,299 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> '69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:48:52,300 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> '4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:48:53,373 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler5) [cf129f7] VM
> '638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'Up' -->
> 'NotResponding'
> 2017-08-09 14:48:53,379 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack:
> null, Custom Event ID: -1, Message: VM vm8 is not responding.
> 2017-08-09 14:48:54,380 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> '35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:48:54,381 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler6) [2cea1ef7] VM
> 'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:49:04,197 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler7) [2973c87] VM
> 'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:49:04,198 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler7) [2973c87] VM
> 'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:49:07,293 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler8) [3b1149ff] VM
> '506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'NotResponding'
> --> 'Up'
> 2017-08-09 14:49:09,388 INFO
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
> (DefaultQuartzScheduler7) [2973c87] VM
> '638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'NotResponding'
> --> 'Up'

