OST fails, nothing provides nmstate

Hi,
OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It fails with

FAILED! => {"changed": false, "failures": [], "msg": "Depsolve Error occured:
 Problem 1: cannot install the best update candidate for package vdsm-network-4.40.0-1236.git63ea8cb8b.el8.x86_64
  - nothing provides nmstate needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64
 Problem 2: package vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires vdsm-network = 4.40.0-1271.git524e08c8a.el8, but none of the providers can be installed
  - cannot install the best update candidate for package vdsm-python-4.40.0-1236.git63ea8cb8b.el8.noarch
  - nothing provides nmstate needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64
[...]

See [2] for full error.

Can someone please take a look?
Thanks
Vojta

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log

On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek <vjuranek@redhat.com> wrote:
> OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It fails with
> "nothing provides nmstate needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64"

nmstate should be provided by the copr repo enabled by ovirt-release-master.
Who installs this rpm in OST?
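
A minimal way to check this on a failing host, assuming dnf is available, is something like the following (the commands are illustrative and not taken from the job logs):

    # list the enabled repos and check whether a copr repo is among them
    dnf repolist enabled | grep -i copr

    # ask dnf which repo, if any, can provide nmstate
    dnf repoquery --whatprovides nmstate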

On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer <nsoffer@redhat.com> wrote:
> nmstate should be provided by the copr repo enabled by ovirt-release-master.
I re-triggered as https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131; maybe https://gerrit.ovirt.org/#/c/104825/ was missing.

> Who installs this rpm in OST?
I do not understand the question.

On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler <dholler@redhat.com> wrote:
> I re-triggered as https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131; maybe https://gerrit.ovirt.org/#/c/104825/ was missing.

Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.
Miguel, do you think merging https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in would solve this?

On Fri, Nov 22, 2019 at 9:41 AM Dominik Holler <dholler@redhat.com> wrote:
> Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.
Right.

> Miguel, do you think merging https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in would solve this?

It should. I'll break that patch down into smaller pieces: one adding the nmstate / NM copr repos, another enabling nmstate.

On Fri, Nov 22, 2019 at 9:41:26 CET Dominik Holler wrote:
> Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.

Maybe not. You re-triggered with [1], which really missed this patch. I did a rebase and am now running with this patch in build #6132 [2]. Let's wait for it to see if gerrit #104825 helps.

[1] https://jenkins.ovirt.org/job/standard-manual-runner/909/
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/

On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
> Miguel, do you think merging https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in would solve this?

I've split the patch Dominik mentions above in two, one of them adding the nmstate / networkmanager copr repos [3]. Let's see if it fixes it.

[3] https://gerrit.ovirt.org/#/c/104897/
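
For reference, a copr repo entry of this kind in a yum-repos file generally looks like the sketch below; OWNER and PROJECT are placeholders, the real entries are in [3], and the baseurl simply follows the standard copr results layout:

    # sketch of a copr repo stanza for a yum-repos file (OWNER/PROJECT are placeholders)
    cat > /etc/yum.repos.d/copr-nmstate.repo <<'EOF'
    [copr-nmstate]
    name=Copr repo for nmstate
    baseurl=https://copr-be.cloud.fedoraproject.org/results/OWNER/PROJECT/epel-8-$basearch/
    gpgcheck=1
    gpgkey=https://copr-be.cloud.fedoraproject.org/results/OWNER/PROJECT/pubkey.gpg
    enabled=1
    EOF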

On Fri, Nov 22, 2019 at 9:56:56 CET Miguel Duarte de Mora Barroso wrote:
> I've split the patch Dominik mentions above in two, one of them adding the nmstate / networkmanager copr repos [3]. Let's see if it fixes it.

It fixes the original issue, but OST still fails in 098_ovirt_provider_ovn.use_ovn_provider:
https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134

On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
> It fixes the original issue, but OST still fails in 098_ovirt_provider_ovn.use_ovn_provider:
> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134

I think Dominik was looking into this issue; +Dominik Holler, please confirm.
Let me know if you need any help, Dominik.

On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso <mdbarroso@redhat.com> wrote:
> I think Dominik was looking into this issue; +Dominik Holler, please confirm.
> Let me know if you need any help, Dominik.

Thanks. The problem is that the hosts lost connection to storage:
https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... :

2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)

I failed to reproduce this locally to analyze it; I will try again, any hints welcome.
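
The check that fails above is a read of the storage domain metadata file over NFS, so a rough way to approximate it by hand on a host is something like the following (the path is copied from the log above; the 10-second timeout is arbitrary, not the value vdsm uses):

    # rough manual approximation of the metadata read that timed out
    timeout 10 dd if=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
    echo "dd exit status: $?"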

On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler <dholler@redhat.com> wrote:
> I failed to reproduce this locally to analyze it; I will try again, any hints welcome.

https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem.
Is there someone around with knowledge of basic_ui_sanity?

On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler <dholler@redhat.com> wrote:
> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem.
> Is there someone around with knowledge of basic_ui_sanity?

Marcin, could you please take a look?
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

On 11/22/19 4:54 PM, Martin Perina wrote:
> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem. Is there someone around with knowledge of basic_ui_sanity?
> Marcin, could you please take a look?

How do you think it's related? By commenting out the UI sanity tests and seeing OST finish successfully?

Looking at the 6134 run you were discussing:

- timing of the UI sanity set-up [1]:
  11:40:20 @ Run test: 008_basic_ui_sanity.py:
- timing of the first encountered storage error [2]:
  2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
      delay = result.delay()
    File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
      raise exception.MiscFileReadException(self.path, self.rc, self.err)
  vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')

Timezone difference aside, it seems to me that these storage errors occurred before doing anything UI-related.

I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues, and he said he got a response from Nir that "it's a known issue". Nir, Amit, can you comment on this?

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export...
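
When comparing the Jenkins console timeline with the vdsm.log timestamps, it may help to normalize both to UTC first; the vdsm.log entries carry an explicit -0500 offset, while the console timestamp's timezone depends on the Jenkins server and is not stated in the logs quoted here. For example, with GNU date:

    # normalize the vdsm.log timestamp (-0500) to UTC for comparison
    date -u -d '2019-11-22 05:39:12 -0500' '+%Y-%m-%d %H:%M:%S UTC'
    # prints: 2019-11-22 10:39:12 UTC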

On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk <msobczyk@redhat.com> wrote:
On 11/22/19 4:54 PM, Martin Perina wrote:
On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso < mdbarroso@redhat.com> wrote:
On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
On Friday 22 November 2019 at 9:56:56 CET, Miguel Duarte de Mora Barroso wrote:
> On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
> > On Friday 22 November 2019 at 9:41:26 CET, Dominik Holler wrote:
> > > On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler <dholler@redhat.com> wrote:
> > > > I re-triggered as https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131, maybe https://gerrit.ovirt.org/#/c/104825/ was missing.
> > > Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.
> > > Miguel, do you think merging https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in would solve this?
> > maybe not. You re-triggered with [1], which really missed this patch. I did a rebase and now running with this patch in build #6132 [2]. Let's wait for it to see if gerrit #104825 helps.
> > [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/
> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/
I've split the patch Dominik mentions above in two, one of them adding the nmstate / networkmanager copr repos - [3].
Let's see if it fixes it.
It fixes the original issue, but OST still fails in 098_ovirt_provider_ovn.use_ovn_provider:
https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134
I think Dominik was looking into this issue; +Dominik Holler please confirm.
Let me know if you need any help Dominik.
Thanks. The problem is that the hosts lost connection to storage:
https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... :
2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)
I failed to reproduce this locally to analyze it; I will try again, any hints welcome.
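One way to gather such hints on the host is a crude read-latency probe like the sketch below (an illustration only, not OST or vdsm code; the path is copied from the log above, and since vdsm's own checker reads with direct I/O as far as I can tell, a plain buffered read like this is only a rough indicator):

    import time

    # Hypothetical standalone probe: time a small read of the storage domain
    # metadata file every few seconds and report reads that stall or fail.
    PATH = ("/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/"
            "d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata")

    def probe(path, interval=10, threshold=5.0):
        while True:
            start = time.monotonic()
            try:
                with open(path, "rb") as f:
                    f.read(4096)
                elapsed = time.monotonic() - start
                if elapsed > threshold:
                    print("slow read: %.1f seconds" % elapsed)
            except OSError as e:
                print("read failed after %.1f seconds: %s"
                      % (time.monotonic() - start, e))
            time.sleep(interval)

    if __name__ == "__main__":
        probe(PATH)

Left running on the host during a run, it would show whether the NFS mount itself becomes unresponsive, independent of vdsm.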
https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem. Is there someone with knowledge about the basic_ui_sanity around?
How do you think it's related? By commenting out the UI sanity tests and seeing OST finish successfully?
Looking at the 6134 run you were discussing:
- timing of the ui sanity set-up [1]:
11:40:20 @ Run test: 008_basic_ui_sanity.py:
- timing of first encountered storage error [2]:
2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
Timezone difference aside, it seems to me that these storage errors occurred before doing anything UI-related. I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues, and he said he got a response from Nir that "it's a known issue".
Nir, Amit, can you comment on this?
The error mentioned here is not a vdsm error but a warning about storage accessibility. We should convert the tracebacks to warnings. The reason for such an issue can be a misconfigured network (maybe the network team is testing negative flows?), or some issue in the NFS server. One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for a couple of minutes; after that the engine can deactivate the storage domain, or some hosts if only those hosts are having trouble accessing storage. In OST we never expect such conditions, since we don't test negative flows and we should have good connectivity with the VMs running on the same host.
Nir
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export...
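To make the point about single versus consistent failures concrete, here is a minimal sketch of that kind of escalation policy (an illustration with made-up names, not vdsm's actual monitor code): a single failed check only warns, and the path is treated as a real problem only after minutes of uninterrupted failures.

    import time

    class PathHealth:
        # Hypothetical tracker: escalate only after the path has been
        # failing continuously for 'grace' seconds (default five minutes).
        def __init__(self, grace=300):
            self.grace = grace
            self.failing_since = None

        def report(self, check_ok):
            now = time.monotonic()
            if check_ok:
                self.failing_since = None   # one good read clears the state
                return "OK"
            if self.failing_since is None:
                self.failing_since = now    # first timeout: warn only
                return "WARNING"
            if now - self.failing_since >= self.grace:
                return "INVALID"            # minutes of failures: real problem
            return "WARNING"

Under such a policy the single 'Read timeout' above stays a warning, and escalation happens only when failures persist for minutes, which is the distinction drawn above.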
Marcin, could you please take a look?
[3] - https://gerrit.ovirt.org/#/c/104897/

On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <nsoffer@redhat.com> wrote:
[...]
The error mentioned here is not a vdsm error but a warning about storage accessibility. We should convert the tracebacks to warnings.
The reason for such an issue can be a misconfigured network (maybe the network team is testing negative flows?),
No.
or some issue in the NFS server.
The only hint I found is "Exiting Time2Retain handler because session_reinstatement=1", but I have no idea what this means or whether it is relevant at all.
One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for a couple of minutes; after that the engine can deactivate the storage domain, or some hosts if only those hosts are having trouble accessing storage.
In OST we never expect such conditions, since we don't test negative flows and we should have good connectivity with the VMs running on the same host.
Ack, this seems to be the problem.

On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <dholler@redhat.com> wrote:
[...]
Timezone difference aside, it seems to me that these storage errors occurred before doing anything UI-related.
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
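For readers following along, that probe amounts to idling for eight minutes where the UI suite would normally run, roughly like the sketch below (an illustration of the idea only, with a made-up test name; the actual change is in the gerrit link above):

    import time

    # Hypothetical stand-in for the UI sanity suite: do nothing for 8 minutes.
    # If the storage monitor reports read timeouts during this idle window,
    # the UI tests themselves are not what breaks storage access.
    def test_idle_instead_of_ui_sanity():
        time.sleep(8 * 60)

That the failure still shows up with a plain sleep suggests the trigger is the elapsed time, not anything specific the UI suite does.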

On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <dholler@redhat.com> wrote:
[...]
Timezone difference aside, it seems to me that these storage errors occurred before doing anything UI-related.
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
Nir or Steve, can you please confirm that this is a storage problem?

On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
[...]
Timezone difference aside, it seems to me that these storage errors occurred before doing anything UI-related.
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?
Nir or Steve, can you please confirm that this is a storage problem?
Why do you think we have a storage problem?

On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
[...]
How do you think it's related? By commenting out the UI sanity tests and seeing OST finish successfully?
Looking at the 6134 run you were discussing:
- timing of the UI sanity set-up [1]:
11:40:20 @ Run test: 008_basic_ui_sanity.py:
- timing of the first encountered storage error [2]:
2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
Timezone difference aside, it seems to me that these storage errors occurred before anything UI-related was done.
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
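For reference, the diagnostic change in that patch amounts to replacing the UI sanity scenario with a plain wait of roughly the same length. A minimal sketch of that kind of check in Python; the test name is illustrative, not the exact OST code:

import time

def test_sleep_instead_of_ui_sanity():
    # Wait about as long as the UI sanity scenario would take, then let the
    # rest of the suite continue; if the storage monitor warning still shows
    # up in vdsm.log, the UI tests themselves are not the trigger.
    time.sleep(8 * 60)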
So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?
To my eyes this looks like storage just stops working after some time.
Nir or Steve, can you please confirm that this is a storage problem?
Why do you think we have a storage problem?
I understand from the posted log snippets that the storage is not accessible anymore, while the host is still responsive. This might be triggered by something outside storage, e.g. the network providing the storage stopped working. But I think a possible next step in analysing this issue would be to find the reason why storage is not happy.
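One possible next step along those lines: time a read of the metadata path the monitor complained about, directly on the host. This is only a sketch; vdsm's checker reads with direct I/O and a timeout (the log above reports 'Read timeout'), while this simplified probe just measures a buffered read of the path taken from the log.

import time

# Path taken from the vdsm.log snippet quoted earlier in the thread.
PATH = ("/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/"
        "d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata")

start = time.monotonic()
with open(PATH, "rb") as f:
    data = f.read(4096)
elapsed = time.monotonic() - start
print("read %d bytes in %.3f seconds" % (len(data), elapsed))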
I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues, and he said he got a response from Nir that "it's a known issue".
Nir, Amit, can you comment on this?
The error mentioned here is not a vdsm error but a warning about storage accessibility. We should convert the tracebacks to warnings.
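A minimal sketch of what that conversion could look like, assuming a check handler shaped roughly like the one in the traceback above; the function and names are hypothetical, not the actual vdsm code:

import logging

log = logging.getLogger("storage.Monitor")

def handle_path_check(path, result):
    """Return the check delay, or None if the check failed."""
    try:
        return result.delay()
    except Exception as e:
        # One line with the path and the error instead of an ERROR with a
        # full traceback, since a single read timeout is usually transient.
        log.warning("Checking path %s failed: %s", path, e)
        return None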
The reason for such an issue can be a misconfigured network (maybe the network team is testing negative flows?),
No.
or some issue in the NFS server.
The only hint I found is "Exiting Time2Retain handler because session_reinstatement=1", but I have no idea what it means or whether it is relevant at all.
One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for a couple of minutes; after that the engine can deactivate the storage domain, or some hosts if only these hosts are having trouble accessing storage.
In OST we never expect such conditions, since we don't test negative flows and we should have good connectivity with the VMs running on the same host.
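As an illustration of that rule of thumb (a sketch of the idea only, not the actual engine or vdsm logic): react to a domain only after its checks have been failing continuously for a few minutes, so an isolated read timeout like the one above is ignored.

import time

FAILURE_WINDOW = 5 * 60  # seconds of continuous failures before reacting

class DomainHealth:
    """Track check results and ignore short, isolated read timeouts."""

    def __init__(self):
        self.first_failure = None

    def report(self, check_ok):
        if check_ok:
            self.first_failure = None
            return "ok"
        if self.first_failure is None:
            self.first_failure = time.monotonic()
        if time.monotonic() - self.first_failure >= FAILURE_WINDOW:
            return "deactivate"  # sustained failures: act on the domain
        return "transient"       # isolated failure: just note it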
Ack, this seems to be the problem.
Nir
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export...
Marcin, could you please take a look?
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <nsoffer@redhat.com> wrote:
> [...]
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?
To my eyes this looks like storage just stops working after some time.
Nir or Steve, can you please confirm that this is a storage problem?
Why do you think we have a storage problem?
I understand from the posted log snippets that the storage is not accessible anymore,
No, so far one read timeout was reported; this does not mean storage is not available anymore. It can be a temporary issue that does not harm anything.
while the host is still responsive. This might be triggered by something outside storage, e.g. the network providing the storage stopped working. But I think a possible next step in analysing this issue would be to find the reason why storage is not happy.
First step is to understand which test fails, and why. This can be done by the owner of the test, understanding what the test does and what is the expected system behavior. If the owner of the test thinks that the test failed because of a storage issue, someone from storage can look at this. But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir

On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <nsoffer@redhat.com> wrote:
> [...]
Sounds like there was a miscommunication in this thread. I'll try to address all of your points; please let me know if something is missing or not clearly expressed.
First step is to understand which test fails,
098_ovirt_provider_ovn.use_ovn_provider
and why. This can be done by the owner of the test,
The test was added by the network team.
understanding what the test does
The test tries to add a vNIC.
and what is the expected system behavior.
It is expected that adding a vNIC works, because the VM should be up.
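For context, the failing step boils down to something like the following ovirtsdk4 sketch; the engine URL, credentials, VM name and network name are placeholders, and this is not the actual OST test code:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details, not the OST environment.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)

# Pick the vNIC profile belonging to the OVN provider network (name assumed).
profiles = connection.system_service().vnic_profiles_service().list()
profile = next(p for p in profiles if p.name == "ovn-network")

# Add a vNIC to a running VM; this is the operation the test expects to work.
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=vm0")[0]
vms_service.vm_service(vm.id).nics_service().add(
    types.Nic(name="ovn_nic", vnic_profile=types.VnicProfile(id=profile.id))
)

connection.close()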
If the owner of the test thinks that the test failed because of a storage issue
I am not sure who is the owner, but I do.
someone from storage can look at this.
Thanks, I would appreciate this.
But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir

On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <dholler@redhat.com> wrote:
> [...]
Sounds like there was a miscommunication in this thread. I'll try to address all of your points; please let me know if something is missing or not clearly expressed.
First step is to understand which test fails,
098_ovirt_provider_ovn.use_ovn_provider
and why. This can be done by the owner of the test,
The test was added by the network team.
understanding what the test does
The test tries to add a vNIC.
and what is the expected system behavior.
It is expected that adding a vNIC works, because the VM should be up.
What was the actual behavior?
If the owner of the test thinks that the test failed because of a storage issue
I am not sure who is the owner, but I do.
Can you explain how adding a vNIC failed because of a storage issue? Can you explain how adding an 8 minute sleep instead of the UI tests reproduced the issue?
someone from storage can look at this.
Thanks, I would appreciate this.
But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir

On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com> wrote:
> [...]
> > > >> > >> understanding what the test does > > > > > > The test tries to add a vNIC. > > > >> > >> and what is the expected system behavior. > >> > > > > It is expected that adding a vNIC works, because the VM should be up. > > What was the actual behavior? > > >> If the owner of the test thinks that the test failed because of a > storage issue > > > > > > I am not sure who is the owner, but I do. > > Can you explain why how a vNIC failed because of a storage issue? > > Test fails with: Cannot add a Network Interface when VM is not Down, Up or Image-Locked. engine.log says: {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}} vdsm.log says: 2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked delay = result.delay() File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay raise exception.MiscFileReadException(self.path, self.rc, self.err) vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout') ... 2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282) 2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus self.domain.selftest() File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest self.oop.os.statvfs(self.domaindir) File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs return self._iop.statvfs(path) File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs resdict = self._sendCommand("statvfs", {"path": path}, self.timeout) File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand raise Timeout(os.strerror(errno.ETIMEDOUT)) ioprocess.Timeout: Connection timed out > Can you explain how adding 8 minutes sleep instead of the UI tests > reproduced the issue? > > This shows that the issue is not triggered by the UI test, but maybe by passing time. > >> someone from storage can look at this. > >> > > > > Thanks, I would appreciate this. > > > >> > >> But the fact that adding long sleep reproduce the issue means it is not > related > >> in any way to storage. > >> > >> Nir > >> > >> > > >> >> > >> >> > > >> >> >> > >> >> >> > >> >> >>>>> > >> >> >>>>> I remember talking with Steven Rosenberg on IRC a couple of > days ago about some storage metadata issues and he said he got a response > from Nir, that "it's a known issue". 
> >> >> >>>>> > >> >> >>>>> Nir, Amit, can you comment on this? > >> >> >>>> > >> >> >>>> > >> >> >>>> The error mentioned here is not vdsm error but warning about > storage accessibility. We sould convert the tracebacks to warning. > >> >> >>>> > >> >> >>>> The reason for such issue can be misconfigured network (maybe > network team is testing negative flows?), > >> >> >>> > >> >> >>> > >> >> >>> No. > >> >> >>> > >> >> >>>> > >> >> >>>> or some issue in the NFS server. > >> >> >>>> > >> >> >>> > >> >> >>> Only hint I found is > >> >> >>> "Exiting Time2Retain handler because session_reinstatement=1" > >> >> >>> but I have no idea what this means or if this is relevant at all. > >> >> >>> > >> >> >>>> > >> >> >>>> One read timeout is not an issue. We have a real issue only if > we have consistent read timeouts or errors for couple of minutes, after > that engine can deactivate the storage domain or some hosts if only these > hosts are having trouble to access storage. > >> >> >>>> > >> >> >>>> In OST we never expect such conditions since we don't test > negative flows, and we should have good connectivity with the vms running > on the same host. > >> >> >>>> > >> >> >>> > >> >> >>> Ack, this seems to be the problem. > >> >> >>> > >> >> >>>> > >> >> >>>> Nir > >> >> >>>> > >> >> >>>> > >> >> >>>>> [1] > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console > >> >> >>>>> [2] > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exported-artifacts/test_logs/basic-suite-master/post-098_ovirt_provider_ovn.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log > >> >> >>>>>> > >> >> >>>>>> > >> >> >>>>> > >> >> >>>>> Marcin, could you please take a look? > >> >> >>>>>> > >> >> >>>>>> > >> >> >>>>>> > >> >> >>>>>>>> > >> >> >>>>>>>> > > >> >> >>>>>>>> > > [3] - https://gerrit.ovirt.org/#/c/104897/ > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > >> Who installs this rpm in OST? > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > > I do not understand the question. > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > > > >> >> >>>>>>>> > > > > >> > [...] > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > See [2] for full error. > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > Can someone please take a look? 
> >> >> >>>>>>>> > > > > >> > Thanks > >> >> >>>>>>>> > > > > >> > Vojta > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > > >> >> >>>>>>>> > > > > >> > [1] > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/ > >> >> >>>>>>>> > > > > >> > [2] > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact > >> >> >>>>>>>> > > > > >> / > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > > exported-artifacts/test_logs/basic-suite-master/ > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> post-002_bootstrap.py/lago- > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > basic-suite-master-engine/_var_log/ovirt-engine/engine.log___________ > >> >> >>>>>>>> > > > > >> ____ > >> >> >>>>>>>> > > > > >> ________________________________>> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > Devel mailing list -- devel@ovirt.org > >> >> >>>>>>>> > > > > >> > To unsubscribe send an email to > devel-leave@ovirt.org > >> >> >>>>>>>> > > > > >> > Privacy Statement: > https://www.ovirt.org/site/privacy-policy/ > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > oVirt Code of Conduct: > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > https://www.ovirt.org/community/about/community-guidelines/ > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > List Archives: > >> >> >>>>>>>> > > > > >> > >> >> >>>>>>>> > > > > >> > https://lists.ovirt.org/archives/list/devel@ovirt.org/message/4K5N3VQ > >> >> >>>>>>>> > > > > >> N26B > >> >> >>>>>>>> > > > > >> L73K7D45A2IR7R3UMMM23/ > >> >> >>>>>>>> > > > > >> _______________________________________________ > >> >> >>>>>>>> > > > > >> Devel mailing list -- devel@ovirt.org > >> >> >>>>>>>> > > > > >> To unsubscribe send an email to > devel-leave@ovirt.org > >> >> >>>>>>>> > > > > >> Privacy Statement: > https://www.ovirt.org/site/privacy-policy/ > >> >> >>>>>>>> > > > > >> oVirt Code of Conduct: > >> >> >>>>>>>> > > > > >> > https://www.ovirt.org/community/about/community-guidelines/ > >> >> >>>>>>>> > > > > >> List Archives: > >> >> >>>>>>>> > > > > >> > https://lists.ovirt.org/archives/list/devel@ovirt.org/message/JN7MNUZ > >> >> >>>>>>>> > > > > >> N5K3 > >> >> >>>>>>>> > > > > >> NS5TGXFCILYES77KI5TZU/ > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > _______________________________________________ > >> >> >>>>>>>> > > Devel mailing list -- devel@ovirt.org > >> >> >>>>>>>> > > To unsubscribe send an email to devel-leave@ovirt.org > >> >> >>>>>>>> > > Privacy Statement: > https://www.ovirt.org/site/privacy-policy/ > >> >> >>>>>>>> > > oVirt Code of Conduct: > >> >> >>>>>>>> > > > https://www.ovirt.org/community/about/community-guidelines/ List Archives: > >> >> >>>>>>>> > > > https://lists.ovirt.org/archives/list/devel@ovirt.org/message/UPJ5SEAV5Z65H > >> >> >>>>>>>> > > 5BQ3SCHOYZX6JMTQPBW/ > >> >> >>>>>>>> > > >> >> >>>>>>>> > >> >> >>>>> > >> >> >>>>> > >> >> >>>>> -- > >> >> >>>>> Martin Perina > >> >> >>>>> Manager, Software Engineering > >> >> >>>>> Red Hat Czech s.r.o. > >> >> >>>>> > >> >> >>>>> > >> >> > >> > >

On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
[...]
Timezone difference aside, it seems to me that these storage errors occurred before doing anything ui-related.
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
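[Editorial note for readers following along: the patch linked above presumably boils down to replacing the UI steps with a pause of similar length; a minimal sketch of that idea, with made-up module and function names rather than the actual OST code, could look like this.]

import time

SLEEP_MINUTES = 8


def test_idle_instead_of_ui_sanity():
    # Do nothing UI-related; just let the deployment age for roughly the
    # time the real UI sanity tests take, to see whether elapsed time
    # alone is enough to break storage access.
    time.sleep(SLEEP_MINUTES * 60)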
So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?
To my eyes this looks like storage just stops working after some time.
Nir or Steve, can you please confirm that this is a storage problem?
Why do you think we have a storage problem?
I understand from the posted log snippets that the storage is not accessible anymore,
No, so far only one read timeout was reported; this does not mean the storage is not available anymore. It can be a temporary issue that does not harm anything.
while the host is still responsive. This might be triggered by something outside storage, e.g. the network providing the storage stopped working. But I think a possible next step in analysing this issue would be to find out why the storage is not happy.
Sounds like there was a miscommunication in this thread. I will try to address all of your points; please let me know if something is missing or not clearly expressed.
First step is to understand which test fails,
098_ovirt_provider_ovn.use_ovn_provider
and why. This can be done by the owner of the test,
The test was added by the network team.
understanding what the test does
The test tries to add a vNIC.
and what is the expected system behavior.
It is expected that adding a vNIC works, because the VM should be up.
What was the actual behavior?
If the owner of the test thinks that the test failed because of a storage issue
I am not sure who the owner is, but I do think so.
Can you explain how adding a vNIC failed because of a storage issue?
Test fails with:
Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
engine.log says: {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}
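[Editorial note: for reference, the interesting fields of that event can be pulled out with a few lines of Python; the payload below is copied verbatim from the engine.log line above and only the keys visible there are used.]

import json

# Event payload copied from the engine.log line quoted above.
EVENT = '''
{"jsonrpc": "2.0",
 "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf",
 "params": {"308bd254-9af9-4570-98ea-822609550acf": {
     "status": "Paused", "pauseCode": "EOTHER",
     "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6",
                 "name": "vda",
                 "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"},
     "notify_time": 4298388570}}}
'''

event = json.loads(EVENT)
for vm_id, state in event["params"].items():
    print(vm_id, state["status"], state.get("pauseCode"))
    ioerror = state.get("ioerror")
    if ioerror:
        # The path points at the disk image on the NFS domain that timed out.
        print("I/O error on", ioerror["name"], "->", ioerror["path"])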
So you think adding a vNIC failed because the VM was paused?
vdsm.log says:
2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
Is this related to the paused VM? You did not provide a timestamp for the engine event above.
...
2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60-second timeout). A 60-second timeout is bad: if we have leases on this storage domain (e.g. the SPM lease), they will expire 20 seconds after this event and vdsm on the SPM host will be killed. Do we have network tests changing the network used by the NFS storage domain before this event? What were the changes in the network tests or code since OST last succeeded?
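[Editorial note: a rough stand-in for what the domain monitor is doing here, under the assumption that a single direct read of the dom_md/metadata file with a deadline approximates vdsm's path checker (vdsm actually drives dd/O_DIRECT and ioprocess through its own event loop, not this code).]

import subprocess
import time

# Path taken from the log above; adjust for the domain being monitored.
METADATA = ("/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/"
            "64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata")


def check_path(path, timeout=10):
    """Read one block of the metadata file directly; return the delay in seconds."""
    start = time.monotonic()
    subprocess.run(
        ["dd", "if=" + path, "of=/dev/null", "bs=4096", "count=1", "iflag=direct"],
        check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        timeout=timeout)
    return time.monotonic() - start


if __name__ == "__main__":
    try:
        print("read delay: %.3f s" % check_path(METADATA))
    except subprocess.TimeoutExpired:
        print("read did not complete within the deadline - storage looks blocked")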
Can you explain how adding an 8-minute sleep instead of the UI tests reproduced the issue?
This shows that the issue is not triggered by the UI test, but maybe just by the passage of time.
Do we run the ovn tests after the UI tests?
someone from storage can look at this.
Thanks, I would appreciate this.
But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir

On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
[...]
You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.
So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?
To my eyes this looks like storage just stops working after some time.
Nir or Steve, can you please confirm that this is a storage problem?
Why do you think we have a storage problem?
I understand from the posted log snippets that the storage is not accessible anymore,
No, so far only one read timeout was reported; this does not mean the storage is not available anymore. It can be a temporary issue that does not harm anything.
while the host is still responsive. This might be triggered by something outside storage, e.g. the network providing the storage stopped working. But I think a possible next step in analysing this issue would be to find out why the storage is not happy.
Sounds like there was a miscommunication in this thread. I will try to address all of your points; please let me know if something is missing or not clearly expressed.
First step is to understand which test fails,
098_ovirt_provider_ovn.use_ovn_provider
and why. This can be done by the owner of the test,
The test was added by the network team.
understanding what the test does
The test tries to add a vNIC.
and what is the expected system behavior.
It is expected that adding a vNIC works, because the VM should be up.
What was the actual behavior?
If the owner of the test thinks that the test failed because of a storage issue
I am not sure who the owner is, but I do think so.
Can you explain how adding a vNIC failed because of a storage issue?
Test fails with:
Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
engine.log says: {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}
So you think adding a vNIC failed because the VM was paused?
Yes, because of the error message "Cannot add a Network Interface when VM is not Down, Up or Image-Locked."
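[Editorial note: the test drives the engine REST API; a hedged ovirtsdk4 sketch of the status check that explains this error is shown below. The connection details, VM name and profile id are placeholders, not the actual OST code.]

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details; OST wires these up differently.
connection = sdk.Connection(
    url="https://engine.example/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=vm0")[0]

# AddVmInterface is only allowed while the VM is Down, Up or Image-Locked,
# which is why the call is rejected once the VM pauses on an I/O error.
allowed = (types.VmStatus.DOWN, types.VmStatus.UP, types.VmStatus.IMAGE_LOCKED)
if vm.status in allowed:
    nics_service = vms_service.vm_service(vm.id).nics_service()
    # "<profile-uuid>" is a placeholder for the OVN vNIC profile id.
    nics_service.add(types.Nic(name="ovn_nic",
                               vnic_profile=types.VnicProfile(id="<profile-uuid>")))
else:
    print("VM is %s; adding a NIC would be rejected" % vm.status)

connection.close()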
vdsm.log says:
2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
Is this related to the paused VM?
The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror": ...}' makes me think so.
You did not provide a timestamp for the engine event above.
I can't find last week's logs; maybe they have already aged out. Please find more recent logs in https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
...
2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60-second timeout).
A 60-second timeout is bad. If we have leases on this storage domain (e.g. the SPM lease), they will expire 20 seconds after this event and vdsm on the SPM host will be killed.
Do we have network tests changing the network used by the NFS storage domain before this event?
No.
What were the changes in the network tests or code since OST last succeeded?
I am not aware of any relevant change. Maybe the fact that the hosts are on CentOS 8, while the Engine (which serves the storage) is on CentOS 7, is relevant. Also, the occurrence of this issue does not seem to be 100% deterministic, I guess because it is timing-related. The error is reproducible locally by running OST and just keeping the environment alive after basic-suite-master succeeds: after some time, the storage becomes inaccessible.
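[Editorial note: in case it helps the local reproduction, a small watcher that polls the domain metadata file from the host and reports the first blocked read could pinpoint when the NFS export stops answering; the path and intervals below are assumptions for a kept-alive basic-suite-master environment.]

import concurrent.futures
import datetime
import time

# Path of one of the NFS domains seen in the logs; adjust for the local setup.
METADATA = ("/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/"
            "bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata")


def read_once(path):
    with open(path, "rb") as f:
        return len(f.read())


def main():
    # A single worker is enough; a read hung on a dead NFS mount wedges the
    # worker thread, which is exactly the condition we want to observe.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            future = pool.submit(read_once, METADATA)
            try:
                future.result(timeout=10)
                status = "ok"
            except concurrent.futures.TimeoutError:
                status = "BLOCKED for more than 10 seconds"
            except OSError as exc:
                status = "error: %s" % exc
            print(datetime.datetime.now().isoformat(), status)
            time.sleep(30)


if __name__ == "__main__":
    main()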
Can you explain how adding an 8-minute sleep instead of the UI tests reproduced the issue?
This shows that the issue is not triggered by the UI test, but maybe just by the passage of time.
Do we run the ovn tests after the UI tests?
someone from storage can look at this.
Thanks, I would appreciate this.
But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir

On Tue, 26 Nov 2019, 10:19 Dominik Holler, <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
> > > > On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote: >> >> On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler < dholler@redhat.com> wrote: >> > >> > >> > >> > On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler < dholler@redhat.com> wrote: >> >> >> >> >> >> >> >> On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler < dholler@redhat.com> wrote: >> >>> >> >>> >> >>> >> >>> On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer < nsoffer@redhat.com> wrote: >> >>>> >> >>>> >> >>>> >> >>>> On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk < msobczyk@redhat.com> wrote: >> >>>>> >> >>>>> >> >>>>> >> >>>>> On 11/22/19 4:54 PM, Martin Perina wrote: >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler < dholler@redhat.com> wrote: >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler < dholler@redhat.com> wrote: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso <mdbarroso@redhat.com> wrote: >> >>>>>>>> >> >>>>>>>> On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek < vjuranek@redhat.com> wrote: >> >>>>>>>> > >> >>>>>>>> > On pátek 22. listopadu 2019 9:56:56 CET Miguel Duarte de Mora Barroso wrote: >> >>>>>>>> > > On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek < vjuranek@redhat.com> >> >>>>>>>> > > wrote: >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > On pátek 22. listopadu 2019 9:41:26 CET Dominik Holler wrote: >> >>>>>>>> > > > >> >>>>>>>> > > > > On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler < dholler@redhat.com> >> >>>>>>>> > > > > wrote: >> >>>>>>>> > > > > >> >>>>>>>> > > > > > On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer < nsoffer@redhat.com> >> >>>>>>>> > > > > > wrote: >> >>>>>>>> > > > > > >> >>>>>>>> > > > > >> On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek >> >>>>>>>> > > > > >> <vjuranek@redhat.com> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> wrote: >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > Hi, >> >>>>>>>> > > > > >> > OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It >> >>>>>>>> > > > > >> > fails >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> with >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > FAILED! 
=> {"changed": false, "failures": [], "msg": "Depsolve >> >>>>>>>> > > > > >> > Error >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> occured: >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > \n Problem 1: cannot install the best update candidate for package >> >>>>>>>> > > > > >> > vdsm- >> >>>>>>>> > > > > >> > network-4.40.0-1236.git63ea8cb8b.el8.x86_64\n - nothing provides >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> nmstate >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n >> >>>>>>>> > > > > >> > Problem 2: >> >>>>>>>> > > > > >> > package vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> vdsm-network >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > = 4.40.0-1271.git524e08c8a.el8, but none of the providers can be >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> installed\n >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> > - cannot install the best update candidate for package vdsm- >> >>>>>>>> > > > > >> >
>> >>>>>>>> > > > > >> > nmstate >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> nmstate should be provided by copr repo enabled by >> >>>>>>>> > > > > >> ovirt-release-master. >> >>>>>>>> > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> > > > > > >> >>>>>>>> > > > > > I re-triggered as >> >>>>>>>> > > > > > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131 >> >>>>>>>> > > > > > maybe >> >>>>>>>> > > > > > https://gerrit.ovirt.org/#/c/104825/ >> >>>>>>>> > > > > > was missing >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > Looks like >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104825/ is ignored by OST. >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > maybe not. You re-triggered with [1], which really missed this patch. >> >>>>>>>> > > > I did a rebase and now running with this patch in build #6132 [2]. Let's >> >>>>>>>> > > > wait >> >>>>>>>> > for it to see if gerrit #104825 helps. >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/ >> >>>>>>>> > > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/ >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > > Miguel, do you think merging >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-hos >> >>>>>>>> > > > > t-cq >> >>>>>>>> > .repo.in >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > would solve this? >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> > > I've split the patch Dominik mentions above in two, one of them adding >> >>>>>>>> > > the nmstate / networkmanager copr repos - [3]. >> >>>>>>>> > > >> >>>>>>>> > > Let's see if it fixes it. >> >>>>>>>> > >> >>>>>>>> > it fixes original issue, but OST still fails in >> >>>>>>>> > 098_ovirt_provider_ovn.use_ovn_provider: >> >>>>>>>> > >> >>>>>>>> > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134 >> >>>>>>>> >> >>>>>>>> I think Dominik was looking into this issue; +Dominik Holler please confirm. >> >>>>>>>> >> >>>>>>>> Let me know if you need any help Dominik. >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> Thanks. >> >>>>>>> The problem is that the hosts lost connection to storage: >> >>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... 
: >> >>>>>>> >> >>>>>>> 2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153) >> >>>>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501) >> >>>>>>> Traceback (most recent call last): >> >>>>>>> File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked >> >>>>>>> delay = result.delay() >> >>>>>>> File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay >> >>>>>>> raise exception.MiscFileReadException(self.path, self.rc, self.err) >> >>>>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout') >> >>>>>>> 2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472) >> >>>>>>> >> >>>>>>> >> >>>>>>> I failed to reproduce local to analyze this, I will try again, any hints welcome. >> >>>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem. >> >>>>>> Is there someone with knowledge about the basic_ui_sanity around? >> >>>>> >> >>>>> How do you think it's related? By commenting out the ui sanity tests and seeing OST with successful finish? >> >>>>> >> >>>>> Looking at 6134 run you were discussing: >> >>>>> >> >>>>> - timing of the ui sanity set-up [1]: >> >>>>> >> >>>>> 11:40:20 @ Run test: 008_basic_ui_sanity.py: >> >>>>> >> >>>>> - timing of first encountered storage error [2]: >> >>>>> >> >>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501) >> >>>>> Traceback (most recent call last): >> >>>>> File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked >> >>>>> delay = result.delay() >> >>>>> File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay >> >>>>> raise exception.MiscFileReadException(self.path, self.rc, self.err) >> >>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout') >> >>>>> >> >>>>> Timezone difference aside, it seems to me that these storage errors occured before doing anything ui-related. 
>> >> >> >> >> >> >> >> You are right, a time.sleep(8*60) in >> >> https://gerrit.ovirt.org/#/c/104925/2 >> >> has triggers the issue the same way. >> >> So this is a test issues, assuming that the UI tests can complete in >> less than 8 minutes? >> > > To my eyes this looks like storage is just stop working after some time. > >> >> >> >> > >> > Nir or Steve, can you please confirm that this is a storage
problem?
>> >> Why do you think we have a storage problem? >> > > I understand from the posted log snippets that the storage is not accessible anymore,
No, so far only one read timeout was reported; this does not mean storage is not available anymore. It can be a temporary issue that does not harm anything.
> while the host is still responsive. > This might be triggered by something outside storage, e.g. the network providing the storage stopped working, > But I think a possible next step in analysing this issue would be to find the reason why storage is not happy.
Sounds like there was a miscommunication in this thread. I try to address all of your points, please let me know if something is missing or not clearly expressed.
First step is to understand which test fails,
098_ovirt_provider_ovn.use_ovn_provider
and why. This can be done by the owner of the test,
The test was added by the network team.
understanding what the test does
The test tries to add a vNIC.
and what is the expected system behavior.
It is expected that adding a vNIC works, because the VM should be up.
What was the actual behavior?
If the owner of the test thinks that the test failed because of a storage issue
I am not sure who is the owner, but I do.
Can you explain how adding a vNIC failed because of a storage issue?
Test fails with:
Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
engine.log says: {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}
So you think adding vNIC failed because the VM was paused?
Yes, because of the error message "Cannot add a Network Interface when VM is not Down, Up or Image-Locked."
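For reference, the add-vNIC step the test performs boils down to something like the sketch below, which also makes the precondition explicit instead of assuming the VM stayed up. This is a hedged sketch with ovirtsdk4, not the actual OST test code; the URL, credentials and VM name are placeholders:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',
    password='secret',                               # placeholder credentials
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm0')[0]          # placeholder VM name
vm_service = vms_service.vm_service(vm.id)

# Fail with a clear message if the VM is not UP (e.g. it was paused on an
# I/O error) instead of letting the engine reject the nic with a vaguer error.
deadline = time.monotonic() + 300
while vm_service.get().status != types.VmStatus.UP:
    if time.monotonic() > deadline:
        raise RuntimeError('VM is %s, not UP - cannot add a vNIC'
                           % vm_service.get().status)
    time.sleep(5)

vm_service.nics_service().add(
    types.Nic(name='ovn_nic', interface=types.NicInterface.VIRTIO))

connection.close()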
vdsm.log says:
2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
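The read that timed out can be reproduced by hand on the host to see whether it is a one-off or persistent. A minimal sketch; the path is copied from the log above, the 10-second timeout is arbitrary, and this is not vdsm's actual checker:

import subprocess

METADATA = ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/'
            'bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata')

# Read one 4 KiB block with O_DIRECT, similar in spirit to the monitor's check.
cmd = ['dd', 'if=' + METADATA, 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct']
try:
    subprocess.run(cmd, check=True, timeout=10,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print('read ok')
except subprocess.TimeoutExpired:
    print('read timed out - storage unresponsive')
except subprocess.CalledProcessError as e:
    print('read failed, dd exited with %d' % e.returncode)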
Is this related to the paused vm?
The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes me think so.
You did not provide a timestamp for the engine event above.
I can't find last week's logs, maybe they have already been rotated away. Please find more recent logs in https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
...
2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out
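The blocking behavior behind this traceback can be reproduced in isolation: a plain statvfs() on a hung NFS mount may block forever, so any check has to be wrapped with its own timeout. A minimal sketch, roughly in the spirit of what ioprocess does for vdsm; the domain path is copied from the log and the 60-second value mirrors the timeout discussed below:

import concurrent.futures
import os

# Domain path taken from the log above; adjust to the actual mount.
DOMAIN_DIR = ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/'
              '64daa060-1d83-46b9-b7e8-72a902e1134b')

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(os.statvfs, DOMAIN_DIR)
try:
    st = future.result(timeout=60)
    print('statvfs ok, %d free blocks' % st.f_bfree)
except concurrent.futures.TimeoutError:
    print('statvfs blocked for 60 seconds - the mount looks hung')
    # A worker stuck in uninterruptible NFS I/O never returns, so bail out hard.
    os._exit(1)
pool.shutdown(wait=False)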
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60-second timeout).
A 60-second timeout is bad. If we have leases on this storage domain (e.g. the SPM lease), they will expire 20 seconds after this event and vdsm on the SPM host will be killed.
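To see what is at stake lease-wise on the SPM host, sanlock can be queried directly. A minimal sketch, assuming the sanlock CLI is present (as it normally is on a vdsm host):

import subprocess

# 'sanlock client status' lists the lockspaces and resources (leases) this host
# holds; a lockspace whose storage stopped responding is the one whose lease
# will eventually expire and get vdsm on the SPM host killed.
result = subprocess.run(['sanlock', 'client', 'status'],
                        capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)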
Do we have network tests changing the network used by the NFS storage domain before this event?
No.
What changes were made to the network tests or code since OST was last successful?
I am not aware of any change which might be relevant. Maybe the fact that the hosts are on CentOS 8, while the Engine (which serves the NFS storage) is on CentOS 7, is relevant. Also, the occurrence of this issue does not seem to be 100% deterministic, I guess because it is timing related.
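One quick check for the CentOS 7 server / CentOS 8 client angle is which NFS version and options the hosts actually negotiated. A minimal sketch that just reads /proc/mounts on a host; nothing oVirt-specific is assumed:

# Print every NFS mount with its negotiated options (vers=, proto=, timeo=, ...),
# so an unexpected protocol version or timeout setting stands out.
with open('/proc/mounts') as mounts:
    for line in mounts:
        fields = line.split()
        device, mountpoint, fstype, options = fields[0], fields[1], fields[2], fields[3]
        if fstype.startswith('nfs'):
            print('%s on %s type %s' % (device, mountpoint, fstype))
            print('    %s' % options)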
The error is reproducible locally by running OST and just keeping the environment alive after basic-suite-master succeeds. After some time, the storage becomes inaccessible.
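Since the reproducer is simply letting the environment idle, it might help to leave a small probe running on one of the hosts and log when access to the mounts starts to stall. A rough sketch; the paths are taken from the logs above, and the interval and timeout are arbitrary choices, not part of OST:

import subprocess
import time

# NFS mount points on the host, taken from the logs above.
PATHS = [
    '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1',
    '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2',
]

while True:
    for path in PATHS:
        start = time.monotonic()
        try:
            # Listing the mount is enough to notice when the server stops answering.
            # Note: a child stuck in uninterruptible NFS I/O may survive the kill,
            # so treat this as a rough probe, not a robust checker.
            subprocess.run(['ls', path], timeout=10,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            status = 'ok'
        except subprocess.TimeoutExpired:
            status = 'TIMEOUT'
        print('%s %-7s %5.2fs %s'
              % (time.strftime('%H:%M:%S'), status, time.monotonic() - start, path))
    time.sleep(30)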
When this happens, does the storage domain change its state and go south, or is it a temporary glitch that only halts VMs? Do the host or the storage server logs show anything suspicious at that time (kernel messages, nfs logs)?
Can you explain how adding 8 minutes sleep instead of the UI tests reproduced the issue?
This shows that the issue is not triggered by the UI test, but maybe by passing time.
Do we run the ovn tests after the UI tests?
someone from storage can look at this.
Thanks, I would appreciate this.
But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.
Nir
> >> >> > >> >> >> >> >> >>>>> >> >>>>> I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues and he said he got a response from Nir, that "it's a known issue". >> >>>>> >> >>>>> Nir, Amit, can you comment on this? >> >>>> >> >>>> >> >>>> The error mentioned here is not vdsm error but warning about storage accessibility. We sould convert the tracebacks to warning. >> >>>> >> >>>> The reason for such issue can be misconfigured network (maybe network team is testing negative flows?), >> >>> >> >>> >> >>> No. >> >>> >> >>>> >> >>>> or some issue in the NFS server. >> >>>> >> >>> >> >>> Only hint I found is >> >>> "Exiting Time2Retain handler because session_reinstatement=1" >> >>> but I have no idea what this means or if this is relevant at all. >> >>> >> >>>> >> >>>> One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for couple of minutes, after
which the engine can deactivate the storage domain, or some hosts if only these hosts are having trouble accessing storage.
>> >>>> In OST we never expect such conditions since we don't test negative flows, and we should have good connectivity with the vms running on the same host. >> >>> Ack, this seems to be the problem. >> >>>> Nir >> >>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console >> >>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... >> >>>>> Marcin, could you please take a look? [...]

I've just merged https://gerrit.ovirt.org/105111 which only silences the issue, but we really need to unblock OST, as it has been suffering from this for more than 2 weeks now.

Tal/Nir, could someone really investigate why the storage becomes unavailable after some time? It may be caused by the recent switch of hosts to CentOS 8, but it may also be unrelated.

Thanks,
Martin

On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <dholler@redhat.com> wrote:
[...]
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

Hi,

I ran OST on my physical server. I'm experiencing probably the same issues as described in the thread below.

On one of the hosts:

[root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
total 0
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_exported
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share1
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share2
drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD

I think there's some problem with the nfs shares on engine. I can mount engine's nfs shares directly from server's native OS:

➜ /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
➜ /tmp ls -l /tmp/aaa
total 4
drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
➜ /tmp

But trying to do that from one of the hosts fails:

[root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
mount.nfs: mount(2): Operation not permitted
mount.nfs: trying text-based options 'addr=192.168.200.4'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Remote system error - No route to host

On the engine side, '/var/log/messages' seems to be flooded with nfs issues, example failures:

Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 406 slot_seqid 405
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)

Regards,
Marcin
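The "request from insecure port" lines look like the interesting part: nfsd on the engine is rejecting requests coming from a non-privileged source port, which would also explain the "Operation not permitted" seen on the hosts. Whether the exports allow such ports can be checked on the engine; a minimal sketch, assuming the exports live in /etc/exports or /etc/exports.d (which may not match the actual lago setup):

import glob

# Flag NFS exports that lack the 'insecure' option. Without it, nfsd only
# accepts requests from source ports below 1024; clients connecting from a
# higher port are refused, which shows up as 'Operation not permitted'.
paths = ['/etc/exports'] + sorted(glob.glob('/etc/exports.d/*.exports'))
for path in paths:
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        continue
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue
        # Crude check: 'insecure' also matches 'insecure_locks', good enough here.
        if 'insecure' not in stripped:
            print('%s: export without "insecure": %s' % (path, stripped))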
On 11/26/19 8:40 PM, Martin Perina wrote:
I've just merged https://gerrit.ovirt.org/105111 which only silences the issue, but we really need to unblock OST, as it has been suffering from this for more than 2 weeks now.
Tal/Nir, could someone really investigate why the storage becomes unavailable after some time? It may be caused by the recent switch of hosts to CentOS 8, but it may also be unrelated.
Thanks, Martin
On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <dholler@redhat.com> wrote:
[...]
So you think adding a vNIC failed because the VM was paused?
Yes, because of the error message "Cannot add a Network Interface when VM is not Down, Up or Image-Locked."
vdsm.log says:

2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
Is this related to the paused VM?
The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes me think so.
You did not provide a timestamp for the engine event above.
I can't find last week's logs, maybe they have already been rotated out. Please find more recent logs in https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
...

2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60-second timeout).

A 60-second timeout is bad. If we have leases on this storage domain (e.g. the SPM lease), they will expire within 20 seconds after this event and vdsm on the SPM host will be killed.
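If it helps the investigation, a hedged way to correlate these events and to see whether a lease really expired; the log paths are the standard oVirt locations and the sanlock/wdmd units are assumed to be present, so treat this only as a sketch and adjust it to the actual run:

# On the host (vdsm side): when did reads block or time out?
grep -nE "Read timeout|is blocked for" /var/log/vdsm/vdsm.log | tail
# On the engine: when did VMs get paused with an ioerror?
grep -n 'pauseCode' /var/log/ovirt-engine/engine.log | tail
# On the SPM host: did sanlock fail renewals around the same time?
sanlock client status
journalctl -u sanlock -u wdmd --since today | grep -iE 'renew|fail|timeout' | tail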
Do we have network tests changing the network used by the NFS storage domain before this event?
No.
What changed in the network tests or code since OST was last successful?
I am not aware of any change that might be relevant. Maybe the fact that the hosts are on CentOS 8, while the engine (which serves the storage) is on CentOS 7, is relevant. Also, the occurrence of this issue does not seem to be 100% deterministic, I guess because it is timing related.
The error is reproducible locally by running OST and just keeping the environment alive after basic-suite-master succeeds. After some time, the storage becomes inaccessible.
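For reproducing this locally, a minimal watcher one could leave running on a host in the kept-alive environment; the metadata path below is copied from the vdsm.log snippets above and would need to be adjusted to the actual run:

#!/bin/bash
# Time a small direct read of the domain metadata every 10 seconds and log
# the moment it starts blocking or failing.
META='/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata'
while true; do
    start=$(date +%s)
    if ! timeout 10 dd if="$META" of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null; then
        echo "$(date -Is) read blocked or failed after $(( $(date +%s) - start ))s"
    fi
    sleep 10
done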
Can you explain how adding an 8-minute sleep instead of the UI tests reproduced the issue?

This shows that the issue is not triggered by the UI test, but maybe just by the passage of time.
Do we run the ovn tests after the UI tests?
>> >> someone from storage can look at this. >> >> >> > >> > Thanks, I would appreciate this. >> > >> >> >> >> But the fact that adding long sleep reproduce the issue means it is not related >> >> in any way to storage. >> >> >> >> Nir >> >> >> >> > >> >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >>>>> >> >> >> >>>>> I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues and he said he got a response from Nir, that "it's a known issue". >> >> >> >>>>> >> >> >> >>>>> Nir, Amit, can you comment on this? >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> The error mentioned here is not vdsm error but warning about storage accessibility. We sould convert the tracebacks to warning. >> >> >> >>>> >> >> >> >>>> The reason for such issue can be misconfigured network (maybe network team is testing negative flows?), >> >> >> >>> >> >> >> >>> >> >> >> >>> No. >> >> >> >>> >> >> >> >>>> >> >> >> >>>> or some issue in the NFS server. >> >> >> >>>> >> >> >> >>> >> >> >> >>> Only hint I found is >> >> >> >>> "Exiting Time2Retain handler because session_reinstatement=1" >> >> >> >>> but I have no idea what this means or if this is relevant at all. >> >> >> >>> >> >> >> >>>> >> >> >> >>>> One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for couple of minutes, after that engine can deactivate the storage domain or some hosts if only these hosts are having trouble to access storage. >> >> >> >>>> >> >> >> >>>> In OST we never expect such conditions since we don't test negative flows, and we should have good connectivity with the vms running on the same host. >> >> >> >>>> >> >> >> >>> >> >> >> >>> Ack, this seems to be the problem. >> >> >> >>> >> >> >> >>>> >> >> >> >>>> Nir >> >> >> >>>> >> >> >> >>>> >> >> >> >>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console >> >> >> >>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... >> >> >> >>>>>> >> >> >> >>>>>> >> >> >> >>>>> >> >> >> >>>>> Marcin, could you please take a look? >> >> >> >>>>>> >> >> >> >>>>>> >> >> >> >>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> > >> >> >> >>>>>>>> > > [3] - https://gerrit.ovirt.org/#/c/104897/ >> >> >> >>>>>>>> > > >> >> >> >>>>>>>> > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> Who installs this rpm in OST? >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > > I do not understand the question. >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > > >> >> >> >>>>>>>> > > > > >> > [...] >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > See [2] for full error. >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > >> >> >> >>>>>>>> > > > > >> > Can someone please take a look? 
-- Martin Perina Manager, Software Engineering Red Hat Czech s.r.o.

On Wed, Nov 27, 2019 at 1:27 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
I ran OST on my physical server. I'm experiencing probably the same issues as described in the thread below.
On one of the hosts:
[root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
total 0
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_exported
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_share1
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_share2
drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
I think there's some problem with the NFS shares on the engine.
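Not sure if it helps, but one way to tell a root-squash/ownership problem from a mount that is broken for everyone would be to repeat the check as the vdsm user (uid 36 on the hosts) instead of root; this is only a suggestion, with the paths taken from the listing above:

# Confirm the local uid/gid mapping of the vdsm user, then retry as that user.
id vdsm
sudo -u vdsm ls -l '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1'
sudo -u vdsm head '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1'/*/dom_md/metadata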
We saw it recently with the move to RHEL 8. Nir, isn't that the same issue with the NFS squashing?
I can mount the engine's NFS shares directly from the server's native OS:
➜ /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
➜ /tmp ls -l /tmp/aaa
total 4
drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
➜ /tmp
But trying to do that from one of the hosts fails:
[root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
mount.nfs: mount(2): Operation not permitted
mount.nfs: trying text-based options 'addr=192.168.200.4'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Remote system error - No route to host
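The "No route to host" in the v3 fallback might just be the portmapper port being filtered rather than real routing trouble; a few quick, hedged checks from the host (addresses taken from the output above):

ping -c 3 192.168.200.4
ip route get 192.168.200.4
# NFSv4 only needs TCP 2049; the v3 fallback additionally needs the portmapper (111).
nc -zv 192.168.200.4 2049
nc -zv 192.168.200.4 111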
On the engine side, '/var/log/messages' seems to be flooded with nfs issues, example failures:
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 406 slot_seqid 405
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Regards, Marcin
On 11/26/19 8:40 PM, Martin Perina wrote:
I've just merged https://gerrit.ovirt.org/105111, which only silences the issue, but we really need to unblock OST, as it has been suffering from this for more than 2 weeks now.
Tal/Nir, could someone really investigate why the storage becomes unavailable after some time? It may be caused by the recent switch of hosts to CentOS 8, but may not be related.
Thanks, Martin

On Wed, Nov 27, 2019 at 5:54 PM Tal Nisan <tnisan@redhat.com> wrote:
On Wed, Nov 27, 2019 at 1:27 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
I ran OST on my physical server. I'm experiencing probably the same issues as described in the thread below.
On one of the hosts:
[root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
total 0
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_exported
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_share1
d?????????? ? ?    ?   ?            ? 192.168.200.4:_exports_nfs_share2
drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
I think there's some problem with the NFS shares on the engine.
We saw it recently with the move to RHEL 8. Nir, isn't that the same issue with the NFS squashing?
Root not being able to access NFS is expected if the NFS server is not configured with anonuid=36,anongid=36. This is not new and did not change in RHEL 8. The change is probably in libvirt, trying to access a disk it should not access, since we disable DAC in the XML for disks. When this happens VMs do not start, and here the issue seems to be that the VM gets paused after some time because storage becomes inaccessible.

I can mount the engine's NFS shares directly from the server's native OS:
➜ /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
➜ /tmp ls -l /tmp/aaa
total 4
drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
➜ /tmp
But trying to do that from one of the hosts fails:
[root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
mount.nfs: mount(2): Operation not permitted
mount.nfs: trying text-based options 'addr=192.168.200.4'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Remote system error - No route to host
Smells like broken network.
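If anyone wants to dig further, a couple of hedged checks that would separate the export configuration from plain network breakage; the exports path is the one used in this thread and the options are the ones mentioned above, so treat this only as a sketch:

# On the engine (NFS server): how are the shares actually exported?
exportfs -v
# An exports line with the options mentioned above would look roughly like:
#   /exports/nfs/share1  *(rw,sync,anonuid=36,anongid=36)
# From the failing host: is only the v3/portmapper fallback blocked,
# or is the server unreachable in general?
rpcinfo -p 192.168.200.4
showmount -e 192.168.200.4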
On the engine side, '/var/log/messages' seems to be flooded with nfs
issues, example failures:
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529! Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 406 slot_seqid 405 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991) Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529! Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000 Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Regards, Marcin
On 11/26/19 8:40 PM, Martin Perina wrote:
I've just merged https://gerrit.ovirt.org/105111 which only silence the issue, but we really need to unblock OST, as it's suffering from this for more than 2 weeks now.
Tal/Nir, could someone really investigate why the storage become unavailable after some time? It may be caused by recent switch of hosts to CentOS 8, but may be not related
Thanks, Martin
On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <dholler@redhat.com> wrote:
On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com>
On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com>
wrote:
> > > > On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote: >> >> On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler < dholler@redhat.com> wrote: >> > >> > >> > >> > On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote: >> >> >> >> On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler < dholler@redhat.com> wrote: >> >> > >> >> > >> >> > >> >> > On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler < dholler@redhat.com> wrote: >> >> >> >> >> >> >> >> >> >> >> >> On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler < dholler@redhat.com> wrote: >> >> >>> >> >> >>> >> >> >>> >> >> >>> On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer < nsoffer@redhat.com> wrote: >> >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk < msobczyk@redhat.com> wrote: >> >> >>>>> >> >> >>>>> >> >> >>>>> >> >> >>>>> On 11/22/19 4:54 PM, Martin Perina wrote: >> >> >>>>> >> >> >>>>> >> >> >>>>> >> >> >>>>> On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler < dholler@redhat.com> wrote: >> >> >>>>>> >> >> >>>>>> >> >> >>>>>> On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler < dholler@redhat.com> wrote: >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso <mdbarroso@redhat.com> wrote: >> >> >>>>>>>> >> >> >>>>>>>> On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek < vjuranek@redhat.com> wrote: >> >> >>>>>>>> > >> >> >>>>>>>> > On pátek 22. listopadu 2019 9:56:56 CET Miguel Duarte de Mora Barroso wrote: >> >> >>>>>>>> > > On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek < vjuranek@redhat.com> >> >> >>>>>>>> > > wrote: >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > On pátek 22. listopadu 2019 9:41:26 CET Dominik Holler wrote: >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > > On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler <dholler@redhat.com> >> >> >>>>>>>> > > > > wrote: >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > > On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer <nsoffer@redhat.com> >> >> >>>>>>>> > > > > > wrote: >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > >> On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek >> >> >>>>>>>> > > > > >> <vjuranek@redhat.com> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> wrote: >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > Hi, >> >> >>>>>>>> > > > > >> > OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It >> >> >>>>>>>> > > > > >> > fails >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> with >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > FAILED! 
=> {"changed": false, "failures": [], "msg": "Depsolve >> >> >>>>>>>> > > > > >> > Error >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> occured: >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > \n Problem 1: cannot install the best update candidate for package >> >> >>>>>>>> > > > > >> > vdsm- >> >> >>>>>>>> > > > > >> > network-4.40.0-1236.git63ea8cb8b.el8.x86_64\n - nothing provides >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> nmstate >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n >> >> >>>>>>>> > > > > >> > Problem 2: >> >> >>>>>>>> > > > > >> > package vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> vdsm-network >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > = 4.40.0-1271.git524e08c8a.el8, but none of the providers can be >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> installed\n >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> > - cannot install the best update candidate for package vdsm- >> >> >>>>>>>> > > > > >> >
>> >> >>>>>>>> > > > > >> > nmstate >> >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> >> >> >>>>>>>> > > > > >> nmstate should be provided by copr repo enabled by >> >> >>>>>>>> > > > > >> ovirt-release-master. >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > > >> >> >>>>>>>> > > > > > I re-triggered as >> >> >>>>>>>> > > > > > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131 >> >> >>>>>>>> > > > > > maybe >> >> >>>>>>>> > > > > > https://gerrit.ovirt.org/#/c/104825/ >> >> >>>>>>>> > > > > > was missing >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > Looks like >> >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104825/ is ignored by OST. >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > maybe not. You re-triggered with [1], which really missed this patch. >> >> >>>>>>>> > > > I did a rebase and now running with this patch in build #6132 [2]. Let's >> >> >>>>>>>> > > > wait >> >> >>>>>>>> > for it to see if gerrit #104825 helps. >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/ >> >> >>>>>>>> > > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/ >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > >> >> >>>>>>>> > > > > Miguel, do you think merging >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-hos >> >> >>>>>>>> > > > > t-cq >> >> >>>>>>>> > .repo.in >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > >> >> >>>>>>>> > > > > would solve this? >> >> >>>>>>>> > > >> >> >>>>>>>> > > >> >> >>>>>>>> > > I've split the patch Dominik mentions above in two, one of them adding >> >> >>>>>>>> > > the nmstate / networkmanager copr repos - [3]. >> >> >>>>>>>> > > >> >> >>>>>>>> > > Let's see if it fixes it. >> >> >>>>>>>> > >> >> >>>>>>>> > it fixes original issue, but OST still fails in >> >> >>>>>>>> > 098_ovirt_provider_ovn.use_ovn_provider: >> >> >>>>>>>> > >> >> >>>>>>>> > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134 >> >> >>>>>>>> >> >> >>>>>>>> I think Dominik was looking into this issue; +Dominik Holler please confirm. >> >> >>>>>>>> >> >> >>>>>>>> Let me know if you need any help Dominik. >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> Thanks. >> >> >>>>>>> The problem is that the hosts lost connection to storage: >> >> >>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export... 
2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)

I failed to reproduce this locally to analyze it, I will try again, any hints welcome.

https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem. Is there someone with knowledge about the basic_ui_sanity around?

How do you think it's related? By commenting out the ui sanity tests and seeing OST finish successfully?

Looking at the 6134 run you were discussing:

- timing of the ui sanity set-up [1]:

11:40:20 @ Run test: 008_basic_ui_sanity.py:

- timing of the first encountered storage error [2]:

2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')

Timezone difference aside, it seems to me that these storage errors occurred before doing anything ui-related.

You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2 triggers the issue the same way.

So this is a test issue, assuming that the UI tests can complete in less than 8 minutes?

To my eyes this looks like storage just stops working after some time.

Nir or Steve, can you please confirm that this is a storage problem?

Why do you think we have a storage problem?

I understand from the posted log snippets that they say that the storage is not accessible anymore, while the host is still responsive. This might be triggered by something outside storage, e.g. the network providing the storage stopped working. But I think a possible next step in analysing this issue would be to find the reason why storage is not happy.

No, so far one read timeout was reported, this does not mean storage is not available anymore. It can be a temporary issue that does not harm anything.

Sounds like there was a miscommunication in this thread. I will try to address all of your points, please let me know if something is missing or not clearly expressed.

First step is to understand which test fails,

098_ovirt_provider_ovn.use_ovn_provider

and why. This can be done by the owner of the test,

The test was added by the network team.

understanding what the test does

The test tries to add a vNIC.

and what is the expected system behavior.

It is expected that adding a vNIC works, because the VM should be up.
What was the actual behavior?
If the owner of the test thinks that the test failed because of a storage issue,

I am not sure who is the owner, but I do.
Can you explain how adding a vNIC failed because of a storage issue?
Test fails with:
Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
engine.log says: {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}
So you think adding vNIC failed because the VM was paused?
Yes, because of the error message "Cannot add a Network Interface when VM is not Down, Up or Image-Locked."
vdsm.log says:
2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')
Is this related to the paused vm?
The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes me think so.
You did not provide a timestamp for the engine event above.
I can't find last week's logs, maybe they have already been rotated out. Please find more recent logs in
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492
...
2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60 second timeout).
A 60 second timeout is bad. If we have leases on this storage domain (e.g. the SPM lease), they will expire 20 seconds after this event and vdsm on the SPM host will be killed.
Do we have network tests changing the network used by the NFS storage domain before this event?
No.
What were the changes to the network tests or code since OST was last successful?
I am not aware of a change that might be relevant. Maybe the fact that the hosts are on CentOS 8, while the Engine (which serves the storage) is on CentOS 7, is relevant. Also the occurrence of this issue does not seem to be 100% deterministic, I guess because it is timing related.
The error is reproducible locally by running OST and just keeping the environment alive after basic-suite-master succeeds. After some time, the storage becomes inaccessible.
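While such an idle environment is kept alive, a small probe like the one below could be left running on a host to catch the moment the mount starts to stall. This is only a rough sketch of what the vdsm path checker does (a small direct read of dom_md/metadata); the path is the one from the logs above and the 10 second limit is an arbitrary choice, stricter than the 60 second ioprocess timeout.

    #!/bin/bash
    # Rough sketch of the vdsm path check: a tiny direct read of the storage
    # domain metadata file, with a hard timeout. Path taken from the logs in
    # this thread; adjust for the domain you are watching.
    METADATA="/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata"

    while true; do
        start=$(date +%s)
        # iflag=direct bypasses the page cache so a stalled NFS server is not hidden.
        if timeout 10 dd if="$METADATA" of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null; then
            echo "$(date -Is) read ok after $(( $(date +%s) - start ))s"
        else
            echo "$(date -Is) read FAILED or timed out after $(( $(date +%s) - start ))s"
        fi
        sleep 10
    done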
Can you explain how adding 8 minutes sleep instead of the UI tests reproduced the issue?
This shows that the issue is not triggered by the UI test, but maybe by passing time.
Do we run the ovn tests after the UI tests?
someone from storage can look at this.

Thanks, I would appreciate this.

But the fact that adding a long sleep reproduces the issue means it is not related in any way to storage.

Nir

I remember talking with Steven Rosenberg on IRC a couple of days ago about some storage metadata issues and he said he got a response from Nir, that "it's a known issue".

Nir, Amit, can you comment on this?

The error mentioned here is not a vdsm error but a warning about storage accessibility. We should convert the tracebacks to warnings.

The reason for such an issue can be a misconfigured network (maybe the network team is testing negative flows?),

No.

or some issue in the NFS server.

The only hint I found is "Exiting Time2Retain handler because session_reinstatement=1", but I have no idea what this means or if this is relevant at all.

One read timeout is not an issue. We have a real issue only if we have consistent read timeouts or errors for a couple of minutes; after that the engine can deactivate the storage domain, or some hosts if only these hosts are having trouble accessing storage.

In OST we never expect such conditions since we don't test negative flows, and we should have good connectivity with the VMs running on the same host.

Ack, this seems to be the problem.

Nir

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/export...

Marcin, could you please take a look?
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

On Wed, Nov 27, 2019 at 5:44 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Wed, Nov 27, 2019 at 5:54 PM Tal Nisan <tnisan@redhat.com> wrote:
On Wed, Nov 27, 2019 at 1:27 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
I ran OST on my physical server. I'm experiencing probably the same issues as described in the thread below.
On one of the hosts:
[root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
total 0
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_exported
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share1
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share2
drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
I think there's some problem with the nfs shares on engine.
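A few quick checks on the host could narrow down whether these mount points are merely unreadable for root (a permission problem) or actually hung. These are generic diagnostics, not something OST runs itself; the paths are the ones from the listing above:

    # Generic diagnostics, not part of OST. The "d??????????" entries mean that
    # stat() on the mount point itself fails (here with EPERM), while the parent
    # directory can still be listed.
    mount -t nfs4                    # which NFS mounts the kernel still considers mounted
    stat "/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1"    # immediate EPERM vs. hanging tell different stories
    cat /proc/fs/nfsfs/servers       # per-server NFS client state
    sudo -u vdsm ls -l "/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1"   # does it work as vdsm:kvm?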
We saw it recently with the move to RHEL 8. Nir, isn't that the same issue with the NFS root squashing?
root not being able to access NFS is expected if the NFS server is not configured with anonuid=36,anongid=36.
This is not new and did not change in RHEL 8. The change is probably in libvirt, trying to access a disk it should not access, since we disable DAC in the XML for disks.
When this happens VMs do not start, and here the issue seems to be that the VM gets paused after some time because storage becomes inaccessible.
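For reference, an NFS export configured the way Nir describes might look roughly like the line below. The export path and client network are taken from this thread, but the full option set is an assumption, not OST's actual configuration:

    # Illustrative /etc/exports entry only - verify against OST's real setup.
    # anonuid/anongid map squashed (anonymous) users, including root under the
    # default root_squash, to vdsm:kvm (36:36) instead of nobody.
    echo '/exports/nfs/share1  192.168.200.0/24(rw,sync,no_subtree_check,anonuid=36,anongid=36)' >> /etc/exports
    exportfs -ra   # re-read /etc/exports without restarting nfs-server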
I can mount engine's nfs shares directly from server's native OS:
➜ /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
➜ /tmp ls -l /tmp/aaa
total 4
drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
➜ /tmp
But trying to do that from one of the hosts fails:
[root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
mount.nfs: mount(2): Operation not permitted
mount.nfs: trying text-based options 'addr=192.168.200.4'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Remote system error - No route to host
Smells like broken network.
As I reproduced this scenario, ping was working, while NFS not working.
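When ICMP works but NFS does not, it can help to probe the RPC/NFS layer explicitly from the host. These are generic checks against the engine's address from this thread, not something the suite runs:

    #!/bin/bash
    # Generic NFS reachability checks, assuming the engine at 192.168.200.4
    # is the NFS server (as in this thread).
    SERVER=192.168.200.4

    ping -c 3 "$SERVER"                  # ICMP - reported as working
    rpcinfo -T tcp "$SERVER" nfs 4       # is the NFSv4 service answering over TCP?
    showmount -e "$SERVER"               # uses the MOUNT protocol (the NFSv3 path that failed above)
    timeout 30 mount -t nfs -o vers=4.2 "$SERVER:/exports/nfs/share1" /mnt   # fresh v4.2 mount attempt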
On the engine side, '/var/log/messages' seems to be flooded with nfs issues, example failures:
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 406 slot_seqid 405
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
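One detail in that log that may be worth ruling out: the repeated "request from insecure port" lines suggest a client connecting from a non-privileged source port to an export that enforces the default 'secure' option, and the compound returning status 1 right afterwards looks like a permission error. This is only a hypothesis; a quick way to check it on the engine could be:

    # Hypothesis check only - run on the engine (the NFS server in OST).
    exportfs -v | grep -E '/exports/nfs/(share1|share2|exported)'   # effective options per export
    # If connections from ports >1023 are expected (e.g. through NAT), the
    # 'insecure' export option would allow them; illustrative entry, not OST's real config:
    #   /exports/nfs/share1  192.168.200.0/24(rw,sync,no_subtree_check,insecure,anonuid=36,anongid=36)
    # then: exportfs -ra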
Regards, Marcin
On 11/26/19 8:40 PM, Martin Perina wrote:
I've just merged https://gerrit.ovirt.org/105111 which only silences the issue, but we really need to unblock OST, as it has been suffering from this for more than 2 weeks now.
Tal/Nir, could someone really investigate why the storage becomes unavailable after some time? It may be caused by the recent switch of hosts to CentOS 8, but it may not be related.
Thanks, Martin
participants (8)

- Dan Kenigsberg
- Dominik Holler
- Marcin Sobczyk
- Martin Perina
- Miguel Duarte de Mora Barroso
- Nir Soffer
- Tal Nisan
- Vojtech Juranek