Milan Zamazal <mzamazal(a)redhat.com> writes:
Michal Skrivanek <michal.skrivanek(a)redhat.com> writes:
>> On 16 Sep 2019, at 10:30, Milan Zamazal <mzamazal(a)redhat.com> wrote:
>>
>> Dusan Fodor <dfodor(a)redhat.com> writes:
>>
>>> After further investigation, the root of the issue seems to be vdsm receiving
>>> SIGTERM on the only host that is in state Up [1]:
>>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>>
>> I see, thank you for looking into it and finding the signal. Can you
>> see in the logs what could cause this? Are Engine fencing attempts
>> issued before or after this signal? If it is not caused by Engine
>> fencing, is there anything in the system logs explaining that SIGTERM?
>
> unrelated
>
>>
>> Let's take the upcoming OST gating as an opportunity to fix that host
>> status flipping problem. It must be fixed before OST gating is enabled.
>
> it seems rather infra-related to the initOnVdsUp() processing. Best
> for now would be to wait a little and try again to check the Host
> status once it’s Up for the first time.
Is there any alternative to waiting? Such as checking that some VDS Up
event or similar has appeared twice?
Is anybody working on a fix for the failure?
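
To make the "wait a little and check again" idea concrete, here is a minimal
sketch of how a test could require the Host status to stay Up over several
consecutive polls before relying on it. This is only an illustration, not
existing OST code: wait_for_stable_status, get_status, the "up" status string
and all default values are hypothetical and would have to be mapped to whatever
the test suite actually uses to query the Engine.

```python
import time


def wait_for_stable_status(get_status, wanted="up", stable_checks=3,
                           interval=1.0, timeout=60.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Return True once get_status() returned `wanted` on `stable_checks`
    consecutive polls; return False if `timeout` seconds pass first.

    get_status is a hypothetical callable supplied by the test; it is
    assumed to return the current host status as a string.
    """
    deadline = clock() + timeout
    consecutive = 0
    while True:
        if get_status() == wanted:
            consecutive += 1
            if consecutive >= stable_checks:
                return True
        else:
            # The host flipped away from the wanted status; count again.
            consecutive = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A single Up reading followed by a flip resets the counter, so the helper only
succeeds after the status has settled. It is still a form of waiting, so it
would not replace an event-based check if one turns out to be possible.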
>> Thanks,
>> michal
>>
>>>
>>>> while the other host is still in status Installing (so it cannot be used
>>>> for fencing, hence the fence action failure).
>>>> The vdsm then goes back up in a few moments, but the engine, expecting the
>>>> host to be up all the time, meanwhile fails doing an operation that requires
>>>> the host to be up.
>>>>
>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/arti...
>>>>
>>>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <dfodor(a)redhat.com> wrote:
>>>>
>>>>> For brave investigators, a similar issue in a later stage of the same
>>>>> test can be found here [1]. Same symptom of a failed fence action, but
>>>>> this time it causes a failure when adding the storage itself:
>>>>> *2019-09-12 09:53:32,571-04 ERROR
>>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>>>> Host in the Data Center.]*
>>>>>
>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>>>
>>>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <dfodor(a)redhat.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> lately I have witnessed multiple failures of the
>>>>>> add_master_storage_domain test, which were related neither to the
>>>>>> changes themselves nor to any infra issue. One example can be found
>>>>>> here [1].
>>>>>> After investigation, with huge help from Milan, the issue is that the
>>>>>> Host suddenly falls from the up state to whatever-but-not-up.
>>>>>>
>>>>>>
>>>>>> 1. add_storage_domain picks a random host that is in up state
>>>>>> 2. meanwhile the engine starts a fence action for it, so probably
>>>>>> something went wrong with the host; the fence action fails with:
>>>>>> *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>>> (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run
>>>>>> fence action on host 'lago-basic-suite-master-host-0', no suitable proxy
>>>>>> host was found.*
>>>>>> 3. the test fails because it cannot attach the domain to a non-up
>>>>>> host:
>>>>>> *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>>> (default task-1) [] Operation Failed: [Cannot add storage server connection
>>>>>> when Host status is not up]*
>>>>>>
>>>>>> For better orientation in the failed job's engine log [1], the fence
>>>>>> action for the host fails at
>>>>>> :46:12,842-04
>>>>>> engine learns it cannot connect storage to host at
>>>>>> :46:16,105-04
>>>>>>
>>>>>> The test itself add_master_storage_domain starts at ~ :46:13,753
>>>>>> (according to lago log).
>>>>>>
>>>>>> Could you please check this?
>>>>>> Thanks
>>>>>>
>>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>>>> [2] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/arti...
>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> Devel mailing list -- devel(a)ovirt.org
>>>> To unsubscribe send an email to devel-leave(a)ovirt.org
>>>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/MMH7DGCH24G...