[ovirt-users] Suddenly all VM's down including HostedEngine & NFSshares (except HE) unmounted

Matt . yamakasi.014 at gmail.com
Mon Aug 22 00:01:12 UTC 2016


I see some very strange behaviour: hosts are randomly rebooting, and I
get the feeling it crashes on sanlock, or at least that is the last
logline in /var/log/messages.

I have seen this happen on the latest kernel (after a yum update) and on
different hardware; the filer itself is OK!
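
If I read the messages below correctly (please correct me if I am wrong),
the "s2 delta_renew read rv -2" / "s2 renewal error -2" lines mean sanlock
can no longer read its lease area on the storage domain. It then tries to
kill the process holding the lease (the "kill 6871 sig 15" lines), and if
that does not finish before the lease expires, wdmd lets the hardware
watchdog reset the host, which would explain the "random" reboots. For
anyone checking along, something like this should show the sanlock/lease
state on a host (just a sketch, adjust paths and mounts to your setup):

  # lockspaces/resources and whether renewals are currently failing
  sanlock client status
  # recent sanlock internal log (also written to /var/log/sanlock.log)
  sanlock client log_dump | tail -n 50
  grep -E 'renewal error|delta_renew' /var/log/sanlock.log | tail -n 20
  # are the storage domain mounts still reachable?
  ls -l /rhev/data-center/mnt/
  # watchdog daemon and sanlock service state
  systemctl status wdmd sanlock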



Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5613
[5294]: hosted-e close_task_aio 2 0x7f3728000960 busy
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5613
[5294]: hosted-e close_task_aio 3 0x7f37280009b0 busy
Aug 22 01:35:27 host-01 ovirt-ha-broker:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Aug 22 01:35:27 host-01 journal: vdsm
ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection
closed: Connection timed out
Aug 22 01:35:27 host-01 journal: vdsm root ERROR failed to retrieve
Hosted Engine HA info#012Traceback (most recent call last):#012  File
"/usr/lib/python2.$
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614
[5901]: s2 delta_renew read rv -2 offset 0
/rhev/data-center/mnt/flr-01....
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614
[5901]: s2 renewal error -2 delta_length 10 last_success 5518
Aug 22 01:35:27 host-01 sanlock[1024]: 2016-08-22 01:35:27+0200 5614
[1024]: s2 kill 6871 sig 15 count 17
Aug 22 01:35:28 host-01 wdmd[988]: test failed rem 24 now 5614 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:28 host-01 sanlock[1024]: 2016-08-22 01:35:28+0200 5615
[1024]: s2 kill 6871 sig 15 count 18
Aug 22 01:35:29 host-01 wdmd[988]: test failed rem 23 now 5615 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:29 host-01 sanlock[1024]: 2016-08-22 01:35:29+0200 5616
[1024]: s2 kill 6871 sig 15 count 19
Aug 22 01:35:30 host-01 wdmd[988]: test failed rem 22 now 5616 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:30 host-01 sanlock[1024]: 2016-08-22 01:35:30+0200 5617
[1024]: s2 kill 6871 sig 15 count 20
Aug 22 01:35:31 host-01 wdmd[988]: test failed rem 21 now 5617 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:31 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:31 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:31 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:31 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:31 host-01 sanlock[1024]: 2016-08-22 01:35:31+0200 5618
[1024]: s2 kill 6871 sig 15 count 21
Aug 22 01:35:32 host-01 wdmd[988]: test failed rem 20 now 5618 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:32 host-01 momd:
/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52:
DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:32 host-01 momd: self.stats_error('getVmMemoryStats():
%s' % e.message)
Aug 22 01:35:32 host-01 momd:
/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52:
DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:32 host-01 momd: self.stats_error('getVmMemoryStats():
%s' % e.message)
Aug 22 01:35:32 host-01 sanlock[1024]: 2016-08-22 01:35:32+0200 5619
[1024]: s2 kill 6871 sig 15 count 22
Aug 22 01:35:33 host-01 ovirt-ha-broker:
WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine
is not up!
Aug 22 01:35:33 host-01 wdmd[988]: test failed rem 19 now 5619 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:33 host-01 sanlock[1024]: 2016-08-22 01:35:33+0200 5620
[1024]: s2 kill 6871 sig 15 count 23
Aug 22 01:35:34 host-01 wdmd[988]: test failed rem 18 now 5620 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:34 host-01 sanlock[1024]: 2016-08-22 01:35:34+0200 5621
[1024]: s2 kill 6871 sig 15 count 24
Aug 22 01:35:35 host-01 wdmd[988]: test failed rem 17 now 5621 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622
[10454]: 60c7bc7a close_task_aio 0 0x7f37180008c0 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622
[10454]: 60c7bc7a close_task_aio 1 0x7f3718000910 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622
[10454]: 60c7bc7a close_task_aio 2 0x7f3718000960 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622
[10454]: 60c7bc7a close_task_aio 3 0x7f37180009b0 busy
Aug 22 01:35:35 host-01 sanlock[1024]: 2016-08-22 01:35:35+0200 5622
[1024]: s2 kill 6871 sig 15 count 25
Aug 22 01:35:36 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:36 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:36 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:36 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:36 host-01 wdmd[988]: test failed rem 16 now 5622 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:36 host-01 sanlock[1024]: 2016-08-22 01:35:36+0200 5623
[1024]: s2 kill 6871 sig 15 count 26
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[10364]: 0ec21a57 close_task_aio 0 0x7f37240008c0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[10364]: 0ec21a57 close_task_aio 1 0x7f3724000910 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[10364]: 0ec21a57 close_task_aio 2 0x7f3724000960 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[10364]: 0ec21a57 close_task_aio 3 0x7f37240009b0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[5294]: hosted-e close_task_aio 0 0x7f37280008c0 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[5294]: hosted-e close_task_aio 1 0x7f3728000910 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[5294]: hosted-e close_task_aio 2 0x7f3728000960 busy
Aug 22 01:35:37 host-01 sanlock[1024]: 2016-08-22 01:35:37+0200 5623
[5294]: hosted-e close_task_aio 3 0x7f37280009b0 busy
Aug 22 01:35:37 host-01 wdmd[988]: test failed rem 15 now 5623 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624
[1024]: s2 kill 6871 sig 15 count 27
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624
[5901]: s2 delta_renew read rv -2 offset 0
/rhev/data-center/mnt/flr-01....
Aug 22 01:35:38 host-01 sanlock[1024]: 2016-08-22 01:35:38+0200 5624
[5901]: s2 renewal error -2 delta_length 10 last_success 5518
Aug 22 01:35:38 host-01 wdmd[988]: test failed rem 14 now 5624 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:39 host-01 sanlock[1024]: 2016-08-22 01:35:39+0200 5625
[1024]: s2 kill 6871 sig 15 count 28
Aug 22 01:35:39 host-01 wdmd[988]: test failed rem 13 now 5625 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:39 host-01 journal: vdsm
ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection
closed: Connection timed out
Aug 22 01:35:39 host-01 journal: vdsm root ERROR failed to retrieve
Hosted Engine HA info#012Traceback (most recent call last):#012  File
"/usr/lib/python2.$
Aug 22 01:35:39 host-01 ovirt-ha-broker: INFO:mem_free.MemFree:memFree: 22380
Aug 22 01:35:39 host-01 ovirt-ha-broker:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Aug 22 01:35:40 host-01 sanlock[1024]: 2016-08-22 01:35:40+0200 5626
[1024]: s2 kill 6871 sig 15 count 29
Aug 22 01:35:40 host-01 momd:
/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py:52:
DeprecationWarning: BaseException.message has been deprecat$
Aug 22 01:35:40 host-01 momd: self.stats_error('getVmMemoryStats():
%s' % e.message)
Aug 22 01:35:40 host-01 wdmd[988]: test failed rem 12 now 5626 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:41 host-01 sanlock[1024]: 2016-08-22 01:35:41+0200 5627
[1024]: s2 kill 6871 sig 15 count 30
Aug 22 01:35:41 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:41 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:41 host-01 ovirt-ha-agent:
/usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352:
DeprecationWarning: Dispatcher.pending is deprecated. Use D$
Aug 22 01:35:41 host-01 ovirt-ha-agent: pending = getattr(dispatcher,
'pending', lambda: 0)
Aug 22 01:35:41 host-01 wdmd[988]: test failed rem 11 now 5627 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:42 host-01 sanlock[1024]: 2016-08-22 01:35:42+0200 5628
[1024]: s2 kill 6871 sig 15 count 31
Aug 22 01:35:42 host-01 wdmd[988]: test failed rem 10 now 5628 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259$
Aug 22 01:35:42 host-01 ovirt-ha-broker:
INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
established
Aug 22 01:35:42 host-01 journal: vdsm
ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection
closed: Connection timed out
Aug 22 01:35:42 host-01 journal: vdsm root ERROR failed to retrieve
Hosted Engine HA info#012Traceback (most recent call last):#012  File
"/usr/lib/python2.$
Aug 22 01:35:43 host-01 sanlock[1024]: 2016-08-22 01:35:43+0200 5629
[1024]: s2 kill 6871 sig 15 count 32
Aug 22 01:35:43 host-01 ovirt-ha-broker:
WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine
is not up!
Aug 22 01:35:43 host-01 wdmd[988]: test failed rem 9 now 5629 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259a$
Aug 22 01:35:44 host-01 sanlock[1024]: 2016-08-22 01:35:44+0200 5630
[1024]: s2 kill 6871 sig 15 count 33
Aug 22 01:35:44 host-01 wdmd[988]: test failed rem 8 now 5630 ping
5568 close 5578 renewal 5518 expire 5598 client 1024
sanlock_4093ad17-bef5-4e4b-9a16-259a$
Aug 22 01:35:44 host-01 ovirt-ha-broker: INFO:ping.Ping:Successfully
pinged 172.16.31.252
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Aug 22 01:40:19 host-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7$
Aug 22 01:40:11 host-01 journal: Runtime journal is using 8.0M (max
allowed 1.5G, trying to leave 2.3G free of 15.6G available → current
limit 1.5G).
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpuset
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpu
Aug 22 01:40:11 host-01 kernel: Initializing cgroup subsys cpuacct
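
The run of ^@ (NUL) characters above, followed by a fresh kernel boot at
01:40, looks like a hard reset by the watchdog rather than a clean reboot:
the wdmd "test failed rem N" countdown simply runs out. The journal line
above only mentions the runtime journal, so there is probably nothing from
before the reset in journalctl; /var/log/messages and sanlock.log are all
I have. To see whether the NFS path itself was the problem, something
along these lines should do (a rough sketch; the filer export and domain
UUID are placeholders for my flr-01 mount, adjust to yours):

  # is the NFS export still mounted and reachable right now?
  mount | grep nfs
  # does a direct read of the domain metadata hang or fail?
  time dd if=/rhev/data-center/mnt/<filer_export>/<domain-uuid>/dom_md/metadata \
      of=/dev/null bs=4096 count=1 iflag=direct
  # NFS client-side retransmission / error counters
  nfsstat -c | head -n 20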

2016-08-21 22:34 GMT+02:00 Matt . <yamakasi.014 at gmail.com>:
> The strange thing is that there are no duplicated IPs in the oVirt
> environment, the storage, or anything else the VMs depend on.
>
> What happens though is that the statuses of all agents change, and
> why... don't ask me.
>
> There is really nothing in the logs that shows this behaviour.
>
> Restarting the broker and agent and rebooting the hosts doesn't work
> out. The only host I can start the HostedEngine on now is Host-4,
> although earlier I was able to start it on the other hosts in their
> current states as well.
>
> Something is wobbling in the communication between the agents if you
> ask me. This has been happening since 4.0.1.
>
> --== Host 1 status ==--
>
> Status up-to-date                  : False
> Hostname                           : host-01.mydomain.tld
> Host ID                            : 1
> Engine status                      : unknown stale-data
> Score                              : 0
> stopped                            : True
> Local maintenance                  : False
> crc32                              : 6b73a02e
> Host timestamp                     : 2710
> Extra metadata (valid at timestamp):
>         metadata_parse_version=1
>         metadata_feature_version=1
>         timestamp=2710 (Sun Aug 21 21:52:56 2016)
>         host-id=1
>         score=0
>         maintenance=False
>         state=AgentStopped
>         stopped=True
>
>
> --== Host 2 status ==--
>
> Status up-to-date                  : False
> Hostname                           : host-02.mydomain.tld
> Host ID                            : 2
> Engine status                      : unknown stale-data
> Score                              : 0
> stopped                            : True
> Local maintenance                  : False
> crc32                              : 8e647fca
> Host timestamp                     : 509
> Extra metadata (valid at timestamp):
>         metadata_parse_version=1
>         metadata_feature_version=1
>         timestamp=509 (Sun Aug 21 21:53:00 2016)
>         host-id=2
>         score=0
>         maintenance=False
>         state=AgentStopped
>         stopped=True
>
>
> --== Host 3 status ==--
>
> Status up-to-date                  : False
> Hostname                           : host-01.mydomain.tld
> Host ID                            : 3
> Engine status                      : unknown stale-data
> Score                              : 0
> stopped                            : True
> Local maintenance                  : False
> crc32                              : 73748f9f
> Host timestamp                     : 2888
> Extra metadata (valid at timestamp):
>         metadata_parse_version=1
>         metadata_feature_version=1
>         timestamp=2888 (Sun Aug 21 00:16:12 2016)
>         host-id=3
>         score=0
>         maintenance=False
>         state=AgentStopped
>         stopped=True
>
>
> --== Host 4 status ==--
>
> Status up-to-date                  : False
> Hostname                           : host-02.mydomain.tld
> Host ID                            : 4
> Engine status                      : unknown stale-data
> Score                              : 3400
> stopped                            : False
> Local maintenance                  : False
> crc32                              : 86ef0447
> Host timestamp                     : 67879
> Extra metadata (valid at timestamp):
>         metadata_parse_version=1
>         metadata_feature_version=1
>         timestamp=67879 (Sun Aug 21 18:30:38 2016)
>         host-id=4
>         score=3400
>         maintenance=False
>         state=GlobalMaintenance
>         stopped=False
>
>
>
> 2016-08-21 22:09 GMT+02:00 Charles Kozler <ckozleriii at gmail.com>:
>> This usually happens when the SPM falls off or the master storage domain
>> was unreachable for a brief period of time in some capacity. Your logs
>> should say something about an underlying storage problem, so oVirt
>> offlined or paused the VMs to avoid problems. I'd check the pathway to
>> your master storage domain. You're probably right that something else had
>> a conflicting IP. This happened to me one time when someone brought up a
>> system on an IP that matched my SPM.
>>
>>
>> On Aug 21, 2016 3:33 PM, "Matt ." <yamakasi.014 at gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to tackle an issue on 4.0.2 where suddenly all VMs,
>>> including the HostedEngine, are just down at once.
>>>
>>> I have also seen that all NFS shares get unmounted except the
>>> HostedEngine storage, which is on the same NFS device.
>>>
>>> I have checked the logs and there is nothing strange to see there, but
>>> as I run a VRRP setup and am also doing some tests, I wonder whether a
>>> duplicate IP was brought up. Could that make the whole system go down
>>> and make the Engine or VDSM unmount the NFS shares? My switches don't
>>> complain.
>>>
>>> It's strange that the HE share is the only one still available after
>>> it happens.
>>>
>>> If so, this would be quite fragile and we should tackle where it goes
>>> wrong.
>>>
>>> Has anyone seen this behaviour?
>>>
>>> Thanks,
>>>
>>> Matt
>>> _______________________________________________
>>> Users mailing list
>>> Users at ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
