Hi Artur
Hope you are well. Please see below; this is after I restarted the engine:
host:
[root@ovirt-aa-1-21 ~]# tcpdump -i ovirtmgmt -c 1000 -ttttnnvvS dst ovirt-engine-aa-1-01
tcpdump: listening on ovirtmgmt, link-type EN10MB (Ethernet), capture size
262144 bytes
2020-08-07 12:09:32.553543 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:05.584594 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0x5cd5
(incorrect -> 0xc8ca), seq 4036072905, ack 3265413231, win 28960, options
[mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:10:10.589276 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:15.596230 IP (tos 0x0, ttl 64, id 48438, offset 0, flags
[DF], proto TCP (6), length 52)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x5ccd
(incorrect -> 0x40b8), seq 4036072906, ack 3265413231, win 227, options
[nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:10:20.596429 ARP, Ethernet (len 6), IPv4 (len 4), Request
who-has 172.140.220.23 tell 172.140.220.111, length 28
2020-08-07 12:10:20.663699 IP (tos 0x0, ttl 64, id 64726, offset 0, flags
[DF], proto TCP (6), length 40)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20
(correct), seq 4036072907, win 0, length 0
engine:
[root@ovirt-engine-aa-1-01 ~]# tcpdump -i eth0 -c 1000 -ttttnnvvS src ovirt-aa-1-21
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
262144 bytes
2020-08-07 12:09:31.891242 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0xc8ca
(correct), seq 4036072905, ack 3265413231, win 28960, options [mss
1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:09:36.895502 ARP, Ethernet (len 6), IPv4 (len 4), Reply
172.140.220.111 is-at 00:25:b5:04:00:25, length 42
2020-08-07 12:09:41.901981 IP (tos 0x0, ttl 64, id 48438, offset 0, flags
[DF], proto TCP (6), length 52)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x40b8
(correct), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val
3039514647 ecr 341411251], length 0
2020-08-07 12:09:46.901681 ARP, Ethernet (len 6), IPv4 (len 4), Request
who-has 172.140.220.23 tell 172.140.220.111, length 42
2020-08-07 12:09:46.968911 IP (tos 0x0, ttl 64, id 64726, offset 0, flags
[DF], proto TCP (6), length 40)
172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20
(correct), seq 4036072907, win 0, length 0
Regards
Nar
On Fri, 7 Aug 2020 at 11:54, Artur Socha <asocha(a)redhat.com> wrote:
Hi Nardus,
There is one more thing to be checked.
1) Could you check if there are any packets sent from the affected host to
the engine?
on host:
# outgoing traffic
sudo tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst <engine_host>
2) Same the other way round: check if there are packets received on the
engine side from the affected host.
on engine:
# incoming traffic
sudo tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src <affected_host>
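(If it helps, both directions can also be captured to a pcap file for offline
analysis. This is just a sketch; the <interface_name> and <other_end>
placeholders are mine, and the port 54321 filter is taken from the VDSM port
seen in the captures above:)
sudo tcpdump -i <interface_name> -w /tmp/vdsm-traffic.pcap host <other_end> and port 54321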
Artur
On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <asocha(a)redhat.com> wrote:
> Thanks Nardus,
> After a quick look I found what I was suspecting: there are way too many
> threads in the Blocked state. I don't know the reason yet, but this is very
> helpful. I'll let you know about the findings/investigation. Meanwhile, you
> may try restarting the engine as a (very brute and ugly) workaround.
> You may also try to set up a slightly bigger thread pool; it may save you
> some time until the next hiccup. However, please be aware that this may come
> at the cost of higher memory usage and higher CPU usage (due to increased
> context switching).
> Here are some docs:
>
> # Specify the thread pool size for jboss managed scheduled executor service
> # used by commands to periodically execute methods. It is generally not
> # necessary to increase the number of threads in this thread pool. To change
> # the value permanently create a conf file 99-engine-scheduled-thread-pool.conf
> # in /etc/ovirt-engine/engine.conf.d/
> ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
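>
> For example (just a sketch, assuming the standard oVirt config layout on the
> engine host; the value 500 is purely an illustration, not a recommendation):
>
> echo 'ENGINE_SCHEDULED_THREAD_POOL_SIZE=500' > /etc/ovirt-engine/engine.conf.d/99-engine-scheduled-thread-pool.conf
> systemctl restart ovirt-engine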
>
>
> A.
>
>
> On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nardusg(a)gmail.com>
> wrote:
>
>> Hi Artur
>>
>> Please find attached; also let me know if I need to rerun. They are 5
>> minutes apart.
>>
>> [root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>> 27390
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
>> [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
>>
>> Regards
>>
>> Nar
>>
>> On Thu, 6 Aug 2020 at 15:55, Artur Socha <asocha(a)redhat.com> wrote:
>>
>>> Sure thing.
>>> On the engine host, find the jboss pid. You can use this command:
>>>
>>> ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
>>>
>>> or jps tool from jdk. Sample output on my dev environment is:
>>>
>>> ± % jps
>>> 64853 jboss-modules.jar
>>> 196217 Jps
>>>
>>> Then use jstack from jdk:
>>> jstack <pid> > your_engine_thread_dump.txt
>>> 2 or 3 dumps taken at approximately 5-minute intervals would be even
>>> more useful.
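>>>
>>> For example, a sketch that grabs all three dumps at 5-minute intervals in
>>> one go (assuming a single jboss pid, found as above):
>>>
>>> PID=$(ps -ef | grep jboss | grep -v grep | awk '{ print $2 }')
>>> for i in 1 2 3; do jstack $PID > engine_thread_dump_$i.txt; sleep 300; done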
>>>
>>> Here you can find even more options:
>>> https://www.baeldung.com/java-thread-dump
>>>
>>> Artur
>>>
>>> On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nardusg(a)gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I can create a thread dump; please send details on how to.
>>>>
>>>> Regards
>>>>
>>>> Nardus
>>>>
>>>> On Thu, 6 Aug 2020 at 14:17, Artur Socha <asocha(a)redhat.com> wrote:
>>>>
>>>>> Hi Nardus,
>>>>> You might have hit an issue I have been hunting for some time ([1]
>>>>> and [2]).
>>>>> [1] could not be properly resolved because at the time I was not able to
>>>>> recreate the issue on a dev setup.
>>>>> I suspect [2] is related.
>>>>>
>>>>> Would you be able to prepare a thread dump from your engine instance?
>>>>> Additionally, please check for potential libvirt errors/warnings.
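>>>>> (For instance, a sketch assuming libvirtd runs as a systemd service on
>>>>> the host:)
>>>>> sudo journalctl -u libvirtd -p warning --since "1 hour ago"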
>>>>> Can you also paste the output of:
>>>>> sudo yum list installed | grep vdsm
>>>>> sudo yum list installed | grep ovirt-engine
>>>>> sudo yum list installed | grep libvirt
>>>>>
>>>>> Usually, according to previous reports, restarting the engine helps
>>>>> to restore connectivity with hosts ... at least for some time.
>>>>>
>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
>>>>>
>>>>> regards,
>>>>> Artur
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nardusg(a)gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I also see this in the engine:
>>>>>>
>>>>>> Aug 6, 2020, 7:37:17 AM
>>>>>> VDSM someserver command Get Host Capabilities failed: Message
>>>>>> timeout which can be caused by communication issues
>>>>>>
>>>>>> On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86_bg(a)yahoo.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you check for errors on the affected host? Most probably you
>>>>>>> need the vdsm logs.
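>>>>>>> (On the host, typically /var/log/vdsm/vdsm.log; a quick sketch:)
>>>>>>> grep -iE 'error|warn' /var/log/vdsm/vdsm.log | tail -n 50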
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>>>>
>>>>>>> On 6 August 2020 at 7:40:23 GMT+03:00, Nardus Geldenhuys <
>>>>>>> nardusg(a)gmail.com> wrote:
>>>>>>> >Hi Strahil
>>>>>>> >
>>>>>>> >Hope you are well. I get the following error when I tried to
>>>>>>> >confirm reboot:
>>>>>>> >
>>>>>>> >Error while executing action: Cannot confirm 'Host has been
>>>>>>> >rebooted' Host.
>>>>>>> >Valid Host statuses are "Non operational", "Maintenance" or
>>>>>>> >"Connecting".
>>>>>>> >
>>>>>>> >And I can't put it in maintenance; the only options are "restart" or
>>>>>>> >"stop".
>>>>>>> >
>>>>>>> >Regards
>>>>>>> >
>>>>>>> >Nar
>>>>>>> >
>>>>>>> >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86_bg(a)yahoo.com>
>>>>>>> >wrote:
>>>>>>> >
>>>>>>> >> After rebooting the node, have you "marked" it as rebooted?
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Strahil Nikolov
>>>>>>> >>
>>>>>>> >> On 5 August 2020 at 21:29:04 GMT+03:00, Nardus Geldenhuys <
>>>>>>> >> nardusg(a)gmail.com> wrote:
>>>>>>> >> >Hi oVirt land
>>>>>>> >> >
>>>>>>> >> >Hope you are well. Got a bit of an issue, actually a big issue. We
>>>>>>> >> >had some sort of dip. All the VMs are still running, but some of
>>>>>>> >> >the hosts are showing "Unassigned" or "NonResponsive". All the
>>>>>>> >> >hosts were showing UP and were fine before our dip. So I did
>>>>>>> >> >increase vdsHeartbeatInSeconds to 240, no luck.
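>>>>>>> >> >(For reference, a sketch of how that change would be made, assuming
>>>>>>> >> >the usual engine-config tool on the engine host:)
>>>>>>> >> > engine-config -s vdsHeartbeatInSeconds=240
>>>>>>> >> > systemctl restart ovirt-engine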
>>>>>>> >> >
>>>>>>> >> >I still get a timeout in the engine log even though I can connect
>>>>>>> >> >to that host from the engine, using nc to test port 54321. I also
>>>>>>> >> >restarted vdsmd and rebooted the host, with no luck.
>>>>>>> >> >
>>>>>>> >> > nc -v someserver 54321
>>>>>>> >> >Ncat: Version 7.50 ( https://nmap.org/ncat )
>>>>>>> >> >Ncat: Connected to 172.40.2.172:54321.
>>>>>>> >> >
>>>>>>> >> >2020-08-05 20:20:34,256+02 ERROR
>>>>>>> >> >[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID:
>>>>>>> >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host
>>>>>>> >> >Capabilities failed: Message timeout which can be caused by
>>>>>>> >> >communication issues
>>>>>>> >> >
>>>>>>> >> >Any troubleshooting ideas will be gladly appreciated.
>>>>>>> >> >
>>>>>>> >> >Regards
>>>>>>> >> >
>>>>>>> >> >Nar
>>>>>>> >>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>>>> oVirt Code of Conduct:
>>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives:
>>>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/C4HB2J3MH76...
>>>>>
>>>>>
>>>>> --
>>>>> Artur Socha
>>>>> Senior Software Engineer, RHV
>>>>> Red Hat
>>>>>
>>>>
>>>
>>> --
>>> Artur Socha
>>> Senior Software Engineer, RHV
>>> Red Hat
>>>
>>
>
> --
> Artur Socha
> Senior Software Engineer, RHV
> Red Hat
>
--
Artur Socha
Senior Software Engineer, RHV
Red Hat