Some more information: I was doing more digging and found a possible
correlation between the VM timeouts and the following sanlock messages:
Aug 9 07:47:57 hype-02 sanlock[1565]: 2017-08-09 07:47:57-0700 3509532 [48218]: s49 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp:_tus__vms/b256c5ec-296a-46f0-8dec-55a1b90fd841/dom_md/ids
Aug 9 07:47:57 hype-02 sanlock[1565]: 2017-08-09 07:47:57-0700 3509532 [48218]: s49 renewal error -202 delta_length 10 last_success 3509501
Aug 9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 [51222]: s51 delta_renew read timeout 10 sec offset 0 /var/run/vdsm/storage/8b22a6bd-50d4-4a6e-9b89-8b2e00e7835c/000b7ce5-fa09-41ad-b0c6-0839d550c3df/5e5c81d1-0075-4a2c-baf5-13a08b9d025b
Aug 9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 [51222]: s51 renewal error -202 delta_length 10 last_success 3509502
Aug 9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 [48256]: s50 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp:_hostedengine/8b22a6bd-50d4-4a6e-9b89-8b2e00e7835c/dom_md/ids
Aug 9 07:47:58 hype-02 sanlock[1565]: 2017-08-09 07:47:58-0700 3509533 [48256]: s50 renewal error -202 delta_length 10 last_success 3509502
This is from one of my nodes; the other "not responding" events match
other sanlock renewal errors very closely, with the sanlock errors
preceding the VM errors by a small amount. The other nodes have
virtually identical logs.
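If anyone wants to check for the same pattern on their own hosts, something
along these lines should work as a rough sketch. The file names and regexes
are placeholders for wherever you have the log snippets saved, and it assumes
the engine.log timestamps are UTC while the sanlock lines carry their own
-0700 offset, as in the excerpts above:

#!/usr/bin/env python3
# Rough correlation check: for each "NotResponding" transition in engine.log,
# report the nearest preceding sanlock renewal error on this host.
import re
from datetime import datetime, timezone

SANLOCK_LOG = "sanlock-messages.txt"   # sanlock lines grepped from /var/log/messages
ENGINE_LOG = "engine.log"              # snippet from the HostedEngine

# "... sanlock[1565]: 2017-08-09 07:47:57-0700 3509532 [48218]: s49 renewal error -202 ..."
sanlock_re = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) \d+ \[\d+\]:.*renewal error")
# "2017-08-09 14:48:37,237 INFO ... moved from 'Up' --> 'NotResponding'"
engine_re = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*moved from 'Up' --> 'NotResponding'")

def timestamps(path, pattern, fmt, tz=None):
    found = []
    with open(path) as fh:
        for line in fh:
            m = pattern.search(line)
            if m:
                t = datetime.strptime(m.group(1), fmt)
                if tz is not None:
                    t = t.replace(tzinfo=tz)   # engine.log carries no offset; treat it as UTC
                found.append(t)
    return found

renewal_errors = timestamps(SANLOCK_LOG, sanlock_re, "%Y-%m-%d %H:%M:%S%z")
not_responding = timestamps(ENGINE_LOG, engine_re, "%Y-%m-%d %H:%M:%S", tz=timezone.utc)

for hang in not_responding:
    earlier = [t for t in renewal_errors if t <= hang]
    if earlier:
        gap = int((hang - max(earlier)).total_seconds())
        print(f"{hang.isoformat()}  NotResponding  {gap}s after last sanlock renewal error")
    else:
        print(f"{hang.isoformat()}  NotResponding  (no earlier renewal error found)")

Run against the snippets below, it should show each NotResponding transition
landing within a minute of the renewal errors from this host.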
--
Brian
On 2017-08-09 10:54, J Brian Ismay wrote:
Hello the list,
I have been slowly bringing up a 9-node cluster for the last few
months. All nodes are identical: dual 2-port 10G NICs, plenty of memory
and CPU. Storage is a NetApp filer accessed via NFS on a dedicated 10Gb
dual-switch network.
Generally everything is working fine, but ever since our last rebuild
of the cluster in preparation for a move into production we have been
getting repeated errors in the HostedEngine console:
VM foo is not responding.
VM bar is not responding.
VM baz is not responding.
These errors happen on a fairly regular basis and generally involve
multiple VMs hosted on different nodes. When an error occurs I also
lose external connectivity to the VM in question, both via its service
IP address and via the oVirt console. The actual outages generally last
15-20 seconds, after which everything recovers and goes back to normal.
We are also seeing the following error much more frequently:
ETL service sampling has encountered an error. Please consult the service log for more details.
I have attached snippets from the Engine's engine.log from this
morning. If any other logs are needed to help with diagnosis, I can
provide them.
--
Brian Ismay
SR. Systems Administrator
jismay(a)cenic.org
----
engine.log: NOTE: the system clock is in UTC; local time is PDT, so
this occurred at 07:48 AM local time.
2017-08-09 14:48:37,237 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM '69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:37,277 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm1 is not responding.
2017-08-09 14:48:37,277 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM '4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:37,282 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [2cea1ef7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm2 is not responding.
2017-08-09 14:48:38,326 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [cf129f7] VM '35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:38,360 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm3 is not responding.
2017-08-09 14:48:38,360 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [cf129f7] VM 'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:38,365 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm4 is not responding.
2017-08-09 14:48:49,075 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [3b1149ff] VM 'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:49,130 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm5 is not responding.
2017-08-09 14:48:49,131 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [3b1149ff] VM 'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:49,136 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [3b1149ff] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm6 is not responding.
2017-08-09 14:48:52,221 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler7) [2973c87] VM '506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:52,226 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [2973c87] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm7 is not responding.
2017-08-09 14:48:52,299 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM '69880324-2d2e-4a70-8071-4ae0f0ae342e'(vm1) moved from 'NotResponding' --> 'Up'
2017-08-09 14:48:52,300 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM '4471e3ee-9f69-4903-b68f-c1293aea047f'(vm2) moved from 'NotResponding' --> 'Up'
2017-08-09 14:48:53,373 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [cf129f7] VM '638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'Up' --> 'NotResponding'
2017-08-09 14:48:53,379 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [cf129f7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm8 is not responding.
2017-08-09 14:48:54,380 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM '35fd4afa-12a1-4326-9db5-a86939a01fa8'(vm3) moved from 'NotResponding' --> 'Up'
2017-08-09 14:48:54,381 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [2cea1ef7] VM 'd83e9633-3597-4046-95ee-2a166682b85e'(vm4) moved from 'NotResponding' --> 'Up'
2017-08-09 14:49:04,197 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler7) [2973c87] VM 'd41984d0-4418-4991-9af0-25593abac976'(vm5) moved from 'NotResponding' --> 'Up'
2017-08-09 14:49:04,198 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler7) [2973c87] VM 'ed87b37d-5b79-4105-ba89-29a59361eb4e'(vm6) moved from 'NotResponding' --> 'Up'
2017-08-09 14:49:07,293 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [3b1149ff] VM '506980f4-6764-4cc6-bb20-c1956d8ed201'(vm7) moved from 'NotResponding' --> 'Up'
2017-08-09 14:49:09,388 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler7) [2973c87] VM '638b2aab-e4f7-43e0-a2a8-95c75813e669'(vm8) moved from 'NotResponding' --> 'Up'