Suddenly all VMs down including HostedEngine & NFS shares (except HE) unmounted

Hi all,

I'm trying to tackle an issue on 4.0.2 where suddenly all VMs, including the HostedEngine, go down at once.

I have also seen that all NFS shares get unmounted except the HostedEngine storage, which is on the same NFS device as well.

I have checked the logs and there is nothing strange to see there, but as I run a VRRP setup and also do some tests with it, I wonder whether a duplicate IP was brought up. Could that make the whole system go down and cause the Engine or VDSM to unmount the NFS shares? My switches don't complain.

It's strange that only the HE share is still available after it happens.

If so, this would be quite fragile and we should track down where it goes wrong.

Has anyone seen this behaviour?

Thanks,
Matt
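On the duplicate-IP suspicion: a quick way to rule it in or out from one of the hosts is ARP duplicate address detection with arping. A minimal sketch; the address and interface below are placeholders, not values from this setup:

  VIP=172.16.30.1     # placeholder: the VRRP/storage address you suspect
  IFACE=em1           # placeholder: the interface that carries it
  # -D runs duplicate address detection: probes are sent with a 0.0.0.0 source,
  # so any reply means another machine is already answering for that address.
  arping -D -I "$IFACE" -c 3 "$VIP" && echo "no duplicate seen" || echo "duplicate answered"
  # Also compare which MAC the hosts have cached for the address against the
  # expected VRRP virtual MAC (00:00:5e:00:01:xx for plain VRRP).
  ip neigh show | grep "$VIP"

If another machine answered for the storage or VRRP address even briefly, NFS traffic from the hosts could stall in the way described above, so this is worth running from each hypervisor.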

This usually happens when the SPM falls off or the master storage domain was unreachable for a brief period in some capacity. Your logs should say something about an underlying storage problem, since oVirt takes VMs offline or pauses them to avoid problems. I'd check the pathway to your master storage domain. You're probably right that something else brought up a conflicting IP. This happened to me one time when someone brought up a system on an IP that matched my SPM.
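To check that pathway, a few quick probes from each host toward the filer will show whether NFS itself is reachable and exporting; a sketch, with nfs-filer.example.tld as a placeholder for the real filer hostname:

  NFS_SERVER=nfs-filer.example.tld     # placeholder: substitute the real filer name or VIP
  ping -c 3 "$NFS_SERVER"                          # basic reachability over the storage network
  rpcinfo -p "$NFS_SERVER" | grep -E 'nfs|mountd'  # NFSv3 needs portmapper, mountd and nfs registered
  showmount -e "$NFS_SERVER"                       # exports the filer is currently willing to serve

Running these in a loop during one of the VRRP failover tests would show whether the storage path actually blips when the VIP moves.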

The strange thing is that there are no duplicated IPs in the oVirt environment, the storage, or anything the VMs bring up.

What does happen is that the status of all agents changes, and why... don't ask me. There is really nothing in the logs that shows this behaviour.

Restarting the broker and agent and rebooting the hosts doesn't help. The only host I can start the HostedEngine on now is Host-4, whereas I was able to start it on the other hosts in their current states as well.

Something is wobbly in the communication between the agents, if you ask me. This has been happening since 4.0.1.

--== Host 1 status ==--

Status up-to-date          : False
Hostname                   : host-01.mydomain.tld
Host ID                    : 1
Engine status              : unknown stale-data
Score                      : 0
stopped                    : True
Local maintenance          : False
crc32                      : 6b73a02e
Host timestamp             : 2710
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2710 (Sun Aug 21 21:52:56 2016)
    host-id=1
    score=0
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 2 status ==--

Status up-to-date          : False
Hostname                   : host-02.mydomain.tld
Host ID                    : 2
Engine status              : unknown stale-data
Score                      : 0
stopped                    : True
Local maintenance          : False
crc32                      : 8e647fca
Host timestamp             : 509
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=509 (Sun Aug 21 21:53:00 2016)
    host-id=2
    score=0
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 3 status ==--

Status up-to-date          : False
Hostname                   : host-01.mydomain.tld
Host ID                    : 3
Engine status              : unknown stale-data
Score                      : 0
stopped                    : True
Local maintenance          : False
crc32                      : 73748f9f
Host timestamp             : 2888
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2888 (Sun Aug 21 00:16:12 2016)
    host-id=3
    score=0
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 4 status ==--

Status up-to-date          : False
Hostname                   : host-02.mydomain.tld
Host ID                    : 4
Engine status              : unknown stale-data
Score                      : 3400
stopped                    : False
Local maintenance          : False
crc32                      : 86ef0447
Host timestamp             : 67879
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=67879 (Sun Aug 21 18:30:38 2016)
    host-id=4
    score=3400
    maintenance=False
    state=GlobalMaintenance
    stopped=False
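For what it's worth, that status block reads like hosted-engine --vm-status output (an assumption on my part). With every host except Host 4 reporting state=AgentStopped and stale-data, the per-host HA services and their logs are the next thing to compare across hosts; a minimal sketch of what to collect on each one:

  hosted-engine --vm-status
  systemctl status ovirt-ha-agent ovirt-ha-broker vdsmd
  tail -n 200 /var/log/ovirt-hosted-engine-ha/agent.log
  tail -n 200 /var/log/ovirt-hosted-engine-ha/broker.log
  journalctl -u sanlock -u wdmd --since "1 hour ago"

The stale-data flag only means those hosts have stopped refreshing their slots in the shared metadata, so the interesting part is usually the agent.log on the hosts that show score 0.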

I see some very strange behaviour: hosts are randomly rebooting, and I get the feeling it crashes on sanlock, or at least that is the last thing logged in /var/log/messages.

I saw this happen on the latest kernel (yum update) and on different hardware; the filer is OK!

Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5613 [5294]: hosted-e close_task_aio 2 0x7f3728000960 busy
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5613 [5294]: hosted-e close_task_aio 3 0x7f37280009b0 busy
Aug 22 01:35:27 host-01 ovirt-ha-broker: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Aug 22 01:35:27 host-01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Aug 22 01:35:27 host-01 journal: vdsm root ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/lib/python2.$
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614 [5901]: s2 delta_renew read rv -2 offset 0 /rhev/data-center/mnt/flr-01....
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614 [5901]: s2 renewal error -2 delta_length 10 last_success 5518
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614 [1024]: s2 kill 6871 sig 15 count 17
Aug 22 01:35:28 host-01 wdmd[988]: test failed rem 24 now 5614 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:28 host-01 sanlock[1024]: 2016-08-22 01:35:28+0200 5615 [1024]: s2 kill 6871 sig 15 count 18
Aug 22 01:35:29 host-01 wdmd[988]: test failed rem 23 now 5615 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:29 host-01 sanlock[1024]: 2016-08-22 01:35:29+0200 5616 [1024]: s2 kill 6871 sig 15 count 19
Aug 22 01:35:30 host-01 wdmd[988]: test failed rem 22 now 5616 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:30 host-01 sanlock[1024]: 2016-08-22 01:35:30+0200 5617 [1024]: s2 kill 6871 sig 15 count 20
Aug 22 01:35:31 host-01 wdmd[988]: test failed rem 21 now 5617 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:31 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:31 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:31 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:31 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:31 host-01 sanlock[1024]: 2016-08-22 01:35:31+0200 5618 [1024]: s2 kill 6871 sig 15 count 21
Aug 22 01:35:32 host-01 wdmd[988]: test failed rem 20 now 5618 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:32 host-01 momd: /usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52: DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:32 host-01 momd: self.stats_error('getVmMemoryStats(): %s' % e.message)
Aug 22 01:35:32 host-01 momd: /usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52: DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:32 host-01 momd: self.stats_error('getVmMemoryStats(): %s' % e.message)
Aug 22 01:35:32 host-01 sanlock[1024]: 2016-08-22 01:35:32+0200 5619 [1024]: s2 kill 6871 sig 15 count 22
Aug 22 01:35:33 host-01 ovirt-ha-broker: WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine is not up!
Aug 22 01:35:33 host-01 wdmd[988]: test failed rem 19 now 5619 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:33 host-01 sanlock[1024]: 2016-08-22 01:35:33+0200 5620 [1024]: s2 kill 6871 sig 15 count 23
Aug 22 01:35:34 host-01 wdmd[988]: test failed rem 18 now 5620 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:34 host-01 sanlock[1024]: 2016-08-22 01:35:34+0200 5621 [1024]: s2 kill 6871 sig 15 count 24
Aug 22 01:35:35 host-01 wdmd[988]: test failed rem 17 now 5621 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622 [10454]: 60c7bc7a close_task_aio 0 0x7f37180008c0 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622 [10454]: 60c7bc7a close_task_aio 1 0x7f3718000910 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622 [10454]: 60c7bc7a close_task_aio 2 0x7f3718000960 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622 [10454]: 60c7bc7a close_task_aio 3 0x7f37180009b0 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622 [1024]: s2 kill 6871 sig 15 count 25
Aug 22 01:35:36 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:36 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:36 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:36 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:36 host-01 wdmd[988]: test failed rem 16 now 5622 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:36 host-01 sanlock[1024]: 2016-08-22 01:35:36+0200 5623 [1024]: s2 kill 6871 sig 15 count 26
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [10364]: 0ec21a57 close_task_aio 0 0x7f37240008c0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [10364]: 0ec21a57 close_task_aio 1 0x7f3724000910 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [10364]: 0ec21a57 close_task_aio 2 0x7f3724000960 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [10364]: 0ec21a57 close_task_aio 3 0x7f37240009b0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [5294]: hosted-e close_task_aio 0 0x7f37280008c0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [5294]: hosted-e close_task_aio 1 0x7f3728000910 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [5294]: hosted-e close_task_aio 2 0x7f3728000960 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623 [5294]: hosted-e close_task_aio 3 0x7f37280009b0 busy
Aug 22 01:35:37 host-01 wdmd[988]: test failed rem 15 now 5623 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624 [1024]: s2 kill 6871 sig 15 count 27
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624 [5901]: s2 delta_renew read rv -2 offset 0 /rhev/data-center/mnt/flr-01....
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624 [5901]: s2 renewal error -2 delta_length 10 last_success 5518
Aug 22 01:35:38 host-01 wdmd[988]: test failed rem 14 now 5624 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:39 host-01 sanlock[1024]: 2016-08-22 01:35:39+0200 5625 [1024]: s2 kill 6871 sig 15 count 28
Aug 22 01:35:39 host-01 wdmd[988]: test failed rem 13 now 5625 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:39 host-01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Aug 22 01:35:39 host-01 journal: vdsm root ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/lib/python2.$
Aug 22 01:35:39 host-01 ovirt-ha-broker: INFO:mem_free.MemFree:memFree: 22380
Aug 22 01:35:39 host-01 ovirt-ha-broker: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Aug 22 01:35:40 host-01 sanlock[1024]: 2016-08-22 01:35:40+0200 5626 [1024]: s2 kill 6871 sig 15 count 29
Aug 22 01:35:40 host-01 momd: /usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52: DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:40 host-01 momd: self.stats_error('getVmMemoryStats(): %s' % e.message)
Aug 22 01:35:40 host-01 wdmd[988]: test failed rem 12 now 5626 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:41 host-01 sanlock[1024]: 2016-08-22 01:35:41+0200 5627 [1024]: s2 kill 6871 sig 15 count 30
Aug 22 01:35:41 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:41 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:41 host-01 ovirt-ha-agent: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:41 host-01 ovirt-ha-agent: pending = getattr(dispatcher, 'pending', lambda: 0)
Aug 22 01:35:41 host-01 wdmd[988]: test failed rem 11 now 5627 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:42 host-01 sanlock[1024]: 2016-08-22 01:35:42+0200 5628 [1024]: s2 kill 6871 sig 15 count 31
Aug 22 01:35:42 host-01 wdmd[988]: test failed rem 10 now 5628 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:42 host-01 ovirt-ha-broker: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Aug 22 01:35:42 host-01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Aug 22 01:35:42 host-01 journal: vdsm root ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/lib/python2.$
Aug 22 01:35:43 host-01 sanlock[1024]: 2016-08-22 01:35:43+0200 5629 [1024]: s2 kill 6871 sig 15 count 32
Aug 22 01:35:43 host-01 ovirt-ha-broker: WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine is not up!
Aug 22 01:35:43 host-01 wdmd[988]: test failed rem 9 now 5629 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259a$
Aug 22 01:35:44 host-01 sanlock[1024]: 2016-08-22 01:35:44+0200 5630 [1024]: s2 kill 6871 sig 15 count 33
Aug 22 01:35:44 host-01 wdmd[988]: test failed rem 8 now 5630 ping 5568 close 5578 renewal 5518 expire 5598 client 1024 sanlock_4093ad17-bef5-4e4b-9a16-259a$
Aug 22 01:35:44 host-01 ovirt-ha-broker: INFO:ping.Ping:Successfully pinged 172.16.31.252
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Aug 22 01:40:19 host-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7$
Aug 22 01:40:11 host-01 journal: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpuset
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpu
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpuacct
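A hedged reading of those lines, based on general sanlock/wdmd behaviour rather than anything specific to this setup: the "s2 delta_renew read rv -2" and "s2 renewal error -2" entries mean sanlock could not renew its lease on the storage domain, the "s2 kill 6871 sig 15" lines are sanlock trying to terminate the process holding that lease before the lease expires, and the wdmd "test failed rem N" countdown is the watchdog daemon heading toward a hardware reset. That would explain both the "random" reboots and the run of null bytes in /var/log/messages right before the host comes back up. If a host is caught in that state again, a sketch of what to grab:

  sanlock client status                     # lockspaces and resources this host currently holds
  sanlock client log_dump | tail -n 200     # sanlock's internal log, including renewal attempts
  grep -E 'renewal error|delta_renew|check_our_lease' /var/log/sanlock.log

In other words, the reboots look like a consequence of the storage path going away, not the cause of it.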

Some extra info: I see a very high load on all hosts without any VM started:

top - 02:36:36 up 56 min,  1 user,  load average: 9.95, 8.11, 7.67
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.8 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32780472 total, 31680580 free,   676704 used,   423188 buff/cache
KiB Swap: 25165820 total, 25165820 free,        0 used. 31827176 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  951 root      15  -5 1371512  35676   8648 S   4.7  0.1   0:48.57 supervdsmServer
  953 vdsm      20   0 5713492  52204   6496 S   4.7  0.2   1:00.16 ovirt-ha-broker
 4949 vdsm       0 -20 5148232 124364  12144 S   2.7  0.4   2:33.33 vdsm
 4952 vdsm      20   0  599528  21272   4932 S   0.7  0.1   0:16.44 python

The logs also show this:

periodic/91::DEBUG::2016-08-22 02:33:29,621::task::597::Storage.TaskManager.Task::(_updateState) Task=`65f12a56-dab5-4a19-a9ef-967e4b617087`::moving from state preparing -> state finished
periodic/91::DEBUG::2016-08-22 02:33:29,621::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
periodic/91::DEBUG::2016-08-22 02:33:29,622::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
periodic/91::DEBUG::2016-08-22 02:33:29,622::task::995::Storage.TaskManager.Task::(_decref) Task=`65f12a56-dab5-4a19-a9ef-967e4b617087`::ref 0 aborting False
periodic/89::ERROR::2016-08-22 02:33:29,636::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
periodic/89::ERROR::2016-08-22 02:33:29,637::api::253::root::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 232, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '4093ad17-bef5-4e4b-9a16-259a98e20321'}: Connection timed out
periodic/89::DEBUG::2016-08-22 02:33:29,638::executor::182::Executor::(_run) Worker was discarded
periodic/86::WARNING::2016-08-22 02:33:30,936::periodic::269::virt.periodic.VmDispatcher::(__call__) could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on ['5576ec24-112e-4995-89f8-57e40c43cc5a']
periodic/90::WARNING::2016-08-22 02:33:32,937::periodic::269::virt.periodic.VmDispatcher::(__call__) could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on ['5576ec24-112e-4995-89f8-57e40c43cc5a']
Reactor thread::INFO::2016-08-22 02:33:33,234::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from ::ffff:172.16.30.11:56176

The strange thing is that I can mount the NFS share my VMs and the HE are on by hand without any issue. But somehow vdsmd fails, the /rhev mounts that oVirt creates die, and a df -h times out.

I'm very curious what is going on here.
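Two things that are cheap to check given those symptoms (hedged, this is general Linux/NFS behaviour, not oVirt-specific): the load average counts tasks stuck in uninterruptible D state, so a high load with ~98% idle CPU usually means processes hung on a dead mount, and the /rhev/data-center/mnt mountpoints can be probed with a timeout so the check itself does not hang the way df does. A sketch:

  # Processes stuck in D state -- these inflate the load average without using CPU.
  ps -eo state,pid,wchan:32,cmd | awk '$1 ~ /^D/'
  # NFS mounts currently known to the host, with their options.
  nfsstat -m
  # Probe each oVirt mountpoint with a hard timeout; modern NFS waits are
  # killable, so SIGKILL normally gets a stuck probe unstuck.
  for m in /rhev/data-center/mnt/*; do
      timeout -s KILL 5 stat -t "$m" >/dev/null 2>&1 && echo "OK   $m" || echo "HUNG $m"
  done

If the probes hang only on the data-domain mounts while a fresh manual mount of the same export works, that points at the existing vdsm mounts going stale (for example after the VIP moved) rather than at the filer itself.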
participants (2)
- Charles Kozler
- Matt .