Hi there, we recently had a network failure in our oVirt infrastructure, which caused the oVirt engine to become unstable. The engine restarts after 10-20 minutes, the load average is high, and commands hang when issued.

Looking at the host logs, there are endless locking errors in /var/log/sanlock.log (excerpt below).

I tried to re-initialize the lockspace by stopping the HE HA agent/broker on all hosts and issuing the following commands on one of the hosts:

# su - vdsm -s /bin/bash

$ sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/192.168.10.10\\:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/ha_agent/hosted-engine.lockspace:0

and then restarted both the agent and the broker on the same host.

However, I am still getting the same problem.

Any advice on this matter?

Installation info:
----------------------
oVirt 3.5.3
vdsm-xmlrpc-4.16.24-0.el6.noarch
vdsm-python-zombiereaper-4.16.24-0.el6.noarch
vdsm-python-4.16.24-0.el6.noarch
vdsm-jsonrpc-4.16.24-0.el6.noarch
vdsm-4.16.24-0.el6.x86_64
vdsm-cli-4.16.24-0.el6.noarch
vdsm-yajsonrpc-4.16.24-0.el6.noarch
ovirt-hosted-engine-ha-1.2.6-2.el6.noarch
ovirt-hosted-engine-setup-1.2.6-0.0.master.20150812080635.git5295df1.el6.noarch

----- end installation info

------ /var/log/sanlock.log

2015-08-18 04:20:02+0800 1704 [9385]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/dom_md/ids

2015-08-18 04:20:02+0800 1704 [9385]: s2 renewal error -202 delta_length 11 last_success 1662

2015-08-18 04:20:11+0800 1713 [9385]: a184f8ac aio collect 0 0x7fc5040008c0:0x7fc5040008d0:0x7fc50b9f7000 result 1048576:0 other free

2015-08-18 04:20:11+0800 1713 [9833]: hosted-e aio collect 0 0x7fc4f80008c0:0x7fc4f80008d0:0x7fc50baf9000 result 1048576:0 other free

2015-08-18 04:20:11+0800 1713 [9385]: a184f8ac aio collect 0 0x7fc504000910:0x7fc504000920:0x7fc50bbfb000 result 1048576:0 other free

2015-08-18 04:20:11+0800 1713 [9833]: hosted-e aio collect 0 0x7fc4f8000910:0x7fc4f8000920:0x7fc50beff000 result 1048576:0 other free

2015-08-18 04:21:43+0800 1805 [9385]: a184f8ac aio timeout 0 0x7fc5040008c0:0x7fc5040008d0:0x7fc50adf2000 ioto 10 to_count 18

2015-08-18 04:21:43+0800 1805 [9385]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/dom_md/ids

2015-08-18 04:21:43+0800 1805 [9385]: s2 renewal error -202 delta_length 10 last_success 1774

2015-08-18 04:21:43+0800 1805 [9833]: hosted-e aio timeout 0 0x7fc4f80008c0:0x7fc4f80008d0:0x7fc50aef4000 ioto 10 to_count 14

2015-08-18 04:21:43+0800 1805 [9833]: s3 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/images/190f4d2a-77f4-4403-af0d-62853560c653/2be7db4d-f30e-4873-b4ef-cff9e757341c

2015-08-18 04:21:43+0800 1805 [9833]: s3 renewal error -202 delta_length 10 last_success 1774

2015-08-18 04:21:52+0800 1814 [9385]: a184f8ac aio collect 0 0x7fc5040008c0:0x7fc5040008d0:0x7fc50adf2000 result 1048576:0 other free

2015-08-18 04:21:52+0800 1814 [9833]: hosted-e aio collect 0 0x7fc4f80008c0:0x7fc4f80008d0:0x7fc50aef4000 result 1048576:0 other free

2015-08-18 04:23:04+0800 1885 [9833]: hosted-e aio timeout 0 0x7fc4f80008c0:0x7fc4f80008d0:0x7fc50bbfb000 ioto 10 to_count 15

2015-08-18 04:23:04+0800 1885 [9833]: s3 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/images/190f4d2a-77f4-4403-af0d-62853560c653/2be7db4d-f30e-4873-b4ef-cff9e757341c

2015-08-18 04:23:04+0800 1885 [9833]: s3 renewal error -202 delta_length 10 last_success 1855

2015-08-18 04:23:04+0800 1886 [9385]: a184f8ac aio timeout 0 0x7fc5040008c0:0x7fc5040008d0:0x7fc50beff000 ioto 10 to_count 19

2015-08-18 04:23:04+0800 1886 [9385]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/dom_md/ids

2015-08-18 04:23:04+0800 1886 [9385]: s2 renewal error -202 delta_length 10 last_success 1855

2015-08-18 04:23:15+0800 1896 [9833]: hosted-e aio timeout 0 0x7fc4f8000910:0x7fc4f8000920:0x7fc50baf9000 ioto 10 to_count 16

2015-08-18 04:23:15+0800 1896 [9833]: s3 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/images/190f4d2a-77f4-4403-af0d-62853560c653/2be7db4d-f30e-4873-b4ef-cff9e757341c

2015-08-18 04:23:15+0800 1896 [9833]: s3 renewal error -202 delta_length 10 last_success 1855

2015-08-18 04:23:15+0800 1897 [9385]: a184f8ac aio timeout 0 0x7fc504000910:0x7fc504000920:0x7fc50b9f7000 ioto 10 to_count 20

2015-08-18 04:23:15+0800 1897 [9385]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/dom_md/ids

2015-08-18 04:23:15+0800 1897 [9385]: s2 renewal error -202 delta_length 11 last_success 1855

2015-08-18 04:23:26+0800 1907 [9833]: hosted-e aio timeout 0 0x7fc4f8000960:0x7fc4f8000970:0x7fc50aef4000 ioto 10 to_count 17

2015-08-18 04:23:26+0800 1907 [9833]: s3 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/images/190f4d2a-77f4-4403-af0d-62853560c653/2be7db4d-f30e-4873-b4ef-cff9e757341c

2015-08-18 04:23:26+0800 1907 [9833]: s3 renewal error -202 delta_length 10 last_success 1855

2015-08-18 04:23:26+0800 1908 [9385]: a184f8ac aio timeout 0 0x7fc504000960:0x7fc504000970:0x7fc50adf2000 ioto 10 to_count 21

2015-08-18 04:23:26+0800 1908 [9385]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.10.10:_engine/a184f8ac-b779-4bf8-81c3-751115e15436/dom_md/ids

2015-08-18 04:23:26+0800 1908 [9385]: s2 renewal error -202 delta_length 11 last_success 1855

----- end /var/log/sanlock.log
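
In case it helps with reading the log above, here is a rough Python sketch that tallies the "renewal error" lines per lockspace and reports the longest gap since the last successful renewal. The function name, the regular expression, and the sample list are my own, inferred only from the lines pasted above; it also assumes that the number after the timestamp and the last_success value are on the same clock (seconds), so that subtracting them is meaningful.

```python
import re
from collections import Counter

# Pattern inferred from the pasted log lines, e.g.
# "2015-08-18 04:20:02+0800 1704 [9385]: s2 renewal error -202 delta_length 11 last_success 1662"
# Assumption: the field after the timestamp ("mono") and last_success share one clock.
RENEWAL_ERROR = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<mono>\d+) \[(?P<thread>\d+)\]: "
    r"(?P<space>\S+) renewal error (?P<rv>-?\d+) "
    r"delta_length (?P<delta>\d+) last_success (?P<last>\d+)"
)


def summarize_renewal_errors(lines):
    """Return {lockspace: (error_count, largest mono - last_success gap)}."""
    counts = Counter()
    worst_gap = {}
    for line in lines:
        match = RENEWAL_ERROR.match(line.strip())
        if match is None:
            continue  # skip blank lines and every other kind of log entry
        space = match.group("space")
        gap = int(match.group("mono")) - int(match.group("last"))
        counts[space] += 1
        worst_gap[space] = max(worst_gap.get(space, 0), gap)
    return {space: (counts[space], worst_gap[space]) for space in counts}


# A few lines copied from the log above, used only as sample input.
SAMPLE = [
    "2015-08-18 04:20:02+0800 1704 [9385]: s2 renewal error -202 delta_length 11 last_success 1662",
    "",
    "2015-08-18 04:21:43+0800 1805 [9385]: s2 renewal error -202 delta_length 10 last_success 1774",
    "2015-08-18 04:21:43+0800 1805 [9833]: s3 renewal error -202 delta_length 10 last_success 1774",
    "2015-08-18 04:21:52+0800 1814 [9385]: a184f8ac aio collect 0 0x7fc5040008c0:0x7fc5040008d0:0x7fc50adf2000 result 1048576:0 other free",
]

for space, (count, gap) in sorted(summarize_renewal_errors(SAMPLE).items()):
    print(f"{space}: {count} renewal errors, up to {gap}s since last success")
```

To run it against the real file, pass an open file object instead of SAMPLE, for example `summarize_renewal_errors(open("/var/log/sanlock.log"))`.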