Hi,
I ran OST on my physical server and I'm probably hitting the same issues described in the thread below.
On one of the hosts:
[root@lago-basic-suite-master-host-0 tmp]# ls -l /rhev/data-center/mnt/
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2': Operation not permitted
ls: cannot access '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_exported': Operation not permitted
total 0
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_exported
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share1
d?????????? ? ? ? ? ? 192.168.200.4:_exports_nfs_share2
drwxr-xr-x. 3 vdsm kvm 50 Nov 27 04:22 blockSD
I think there's some problem with the NFS shares on the engine.
I can mount the engine's NFS shares directly from the server's native OS:
➜ /tmp mkdir -p /tmp/aaa && mount "192.168.200.4:/exports/nfs/share1" /tmp/aaa
➜ /tmp ls -l /tmp/aaa
total 4
drwxr-xr-x. 5 36 kvm 4096 Nov 27 10:18 3332759c-a943-4fbd-80aa-a5f72cd87c7c
➜ /tmp
But trying to do that from one of the hosts fails:
[root@lago-basic-suite-master-host-0 tmp]# mkdir -p /tmp/aaa && mount -v "192.168.200.4:/exports/nfs/share1" /tmp/aaa
mount.nfs: timeout set for Wed Nov 27 06:26:19 2019
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.200.4,clientaddr=192.168.201.2'
mount.nfs: mount(2): Operation not permitted
mount.nfs: trying text-based options 'addr=192.168.200.4'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Remote system error - No route to host
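A couple of quick checks from the host might narrow down where the v4.2 attempt ('Operation not permitted') and the v3 fallback ('No route to host' on the portmap query) diverge; this is only a sketch, assuming rpcinfo/showmount are installed on the host and reusing the addresses from the transcripts above:

  rpcinfo -p 192.168.200.4       # is rpcbind (port 111) on the engine reachable at all?
  showmount -e 192.168.200.4     # which exports the server advertises, and to whom
  mount -t nfs -o vers=4.2 192.168.200.4:/exports/nfs/share1 /tmp/aaa   # force v4.2 only, skipping the v3 fallback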
On the engine side, '/var/log/messages' is flooded with nfsd messages; an example of the failures:
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 405 slot_seqid 404
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd_dispatch: vers 4 proc 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: __find_in_sessionid_hashtbl: 1574853151:3430996717:11:0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd4_sequence: slotid 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: check_slot_seqid enter. seqid 406 slot_seqid 405
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #1: 53: status 0
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: fh_verify(28: 00070001 00340001 00000000 e50ae88b 5c44c45a 2b7c3991)
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsd: request from insecure port 192.168.200.1, port=51529!
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound op ffff9042fc202080 opcnt 3 #2: 22: status 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: nfsv4 compound returned 1
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: --> nfsd4_store_cache_entry slot ffff9042c4d97000
Nov 27 06:25:25 lago-basic-suite-master-engine kernel: renewing client (clientid 5dde5a1f/cc80daed)
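If I read the 'request from insecure port' lines correctly, nfsd is refusing these requests because they arrive from a source port above 1023 (192.168.200.1:51529 here), which is what the default 'secure' export option enforces; that would also explain the 'Operation not permitted' from the v4.2 mount attempt on the host. Assuming that is the cause, adding 'insecure' to the exports on the engine should make the mounts work again, roughly like this (the rest of the option list is just an illustration, 'insecure' is the relevant addition):

  # /etc/exports on the engine
  /exports/nfs/share1    *(rw,sync,anonuid=36,anongid=36,insecure)
  /exports/nfs/share2    *(rw,sync,anonuid=36,anongid=36,insecure)
  /exports/nfs/exported  *(rw,sync,anonuid=36,anongid=36,insecure)

  exportfs -ra

The remaining question would be why the requests suddenly come from a high port at all, possibly because host-0 (clientaddr 192.168.201.2 above) reaches the engine through the 192.168.200.1 gateway and the NAT there rewrites the source port.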
Regards, Marcin
On 11/26/19 8:40 PM, Martin Perina wrote:
I've just merged https://gerrit.ovirt.org/105111 which only silences the issue, but we really need to unblock OST, as it's been suffering from this for more than 2 weeks now.
Tal/Nir, could someone really investigate why the storage becomes unavailable after some time? It may be caused by the recent switch of hosts to CentOS 8, but it may not be related.
Thanks,
Martin
On Tue, Nov 26, 2019 at 9:17 AM Dominik Holler <dholler@redhat.com> wrote:

On Mon, Nov 25, 2019 at 7:12 PM Nir Soffer <nsoffer@redhat.com> wrote:

On Mon, Nov 25, 2019 at 7:15 PM Dominik Holler <dholler@redhat.com> wrote:
>
> On Mon, Nov 25, 2019 at 6:03 PM Nir Soffer <nsoffer@redhat.com> wrote:
>>
>> On Mon, Nov 25, 2019 at 6:48 PM Dominik Holler <dholler@redhat.com> wrote:
>> >
>> > On Mon, Nov 25, 2019 at 5:16 PM Nir Soffer <nsoffer@redhat.com> wrote:
>> >>
>> >> On Mon, Nov 25, 2019 at 6:05 PM Dominik Holler <dholler@redhat.com> wrote:
>> >> >
>> >> > On Mon, Nov 25, 2019 at 4:50 PM Nir Soffer <nsoffer@redhat.com> wrote:
>> >> >>
>> >> >> On Mon, Nov 25, 2019 at 11:00 AM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >
>> >> >> > On Fri, Nov 22, 2019 at 8:57 PM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >>
>> >> >> >> On Fri, Nov 22, 2019 at 5:54 PM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >>>
>> >> >> >>> On Fri, Nov 22, 2019 at 5:48 PM Nir Soffer <nsoffer@redhat.com> wrote:
>> >> >> >>>>
>> >> >> >>>> On Fri, Nov 22, 2019, 18:18 Marcin Sobczyk <msobczyk@redhat.com> wrote:
>> >> >> >>>>>
>> >> >> >>>>> On 11/22/19 4:54 PM, Martin Perina wrote:
>> >> >> >>>>>
>> >> >> >>>>> On Fri, Nov 22, 2019 at 4:43 PM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >>>>>>
>> >> >> >>>>>> On Fri, Nov 22, 2019 at 12:17 PM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >>>>>>>
>> >> >> >>>>>>> On Fri, Nov 22, 2019 at 12:00 PM Miguel Duarte de Mora Barroso <mdbarroso@redhat.com> wrote:
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> On Fri, Nov 22, 2019 at 11:54 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
>> >> >> >>>>>>>> >
>> >> >> >>>>>>>> > On Friday, November 22, 2019 9:56:56 CET Miguel Duarte de Mora Barroso wrote:
>> >> >> >>>>>>>> > > On Fri, Nov 22, 2019 at 9:49 AM Vojtech Juranek <vjuranek@redhat.com> wrote:
>> >> >> >>>>>>>> > > >
>> >> >> >>>>>>>> > > > On Friday, November 22, 2019 9:41:26 CET Dominik Holler wrote:
>> >> >> >>>>>>>> > > > >
>> >> >> >>>>>>>> > > > > On Fri, Nov 22, 2019 at 8:40 AM Dominik Holler <dholler@redhat.com> wrote:
>> >> >> >>>>>>>> > > > > >
>> >> >> >>>>>>>> > > > > > On Thu, Nov 21, 2019 at 10:54 PM Nir Soffer <nsoffer@redhat.com> wrote:
>> >> >> >>>>>>>> > > > > >
>> >> >> >>>>>>>> > > > > >> On Thu, Nov 21, 2019 at 11:24 PM Vojtech Juranek <vjuranek@redhat.com> wrote:
>> >> >> >>>>>>>> > > > > >>
>> >> >> >>>>>>>> > > > > >> > Hi,
>> >> >> >>>>>>>> > > > > >> > OST fails (see e.g. [1]) in 002_bootstrap.check_update_host. It fails with
>> >> >> >>>>>>>> > > > > >> > FAILED! => {"changed": false, "failures": [], "msg": "Depsolve Error occured:
>> >> >> >>>>>>>> > > > > >> > \n Problem 1: cannot install the best update candidate for package
>> >> >> >>>>>>>> > > > > >> > vdsm-network-4.40.0-1236.git63ea8cb8b.el8.x86_64\n  - nothing provides nmstate
>> >> >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n Problem 2:
>> >> >> >>>>>>>> > > > > >> > package vdsm-python-4.40.0-1271.git524e08c8a.el8.noarch requires vdsm-network
>> >> >> >>>>>>>> > > > > >> > = 4.40.0-1271.git524e08c8a.el8, but none of the providers can be installed\n
>> >> >> >>>>>>>> > > > > >> > - cannot install the best update candidate for package
>> >> >> >>>>>>>> > > > > >> > vdsm-python-4.40.0-1236.git63ea8cb8b.el8.noarch\n  - nothing provides nmstate
>> >> >> >>>>>>>> > > > > >> > needed by vdsm-network-4.40.0-1271.git524e08c8a.el8.x86_64\n
>> >> >> >>>>>>>> > > > > >>
>> >> >> >>>>>>>> > > > > >> nmstate should be provided by copr repo enabled by ovirt-release-master.
>> >> >> >>>>>>>> > > > > >
>> >> >> >>>>>>>> > > > > > I re-triggered as
>> >> >> >>>>>>>> > > > > > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6131
>> >> >> >>>>>>>> > > > > > maybe https://gerrit.ovirt.org/#/c/104825/ was missing
>> >> >> >>>>>>>> > > > >
>> >> >> >>>>>>>> > > > > Looks like https://gerrit.ovirt.org/#/c/104825/ is ignored by OST.
>> >> >> >>>>>>>> > > >
>> >> >> >>>>>>>> > > > maybe not. You re-triggered with [1], which really missed this patch.
>> >> >> >>>>>>>> > > > I did a rebase and now running with this patch in build #6132 [2]. Let's wait
>> >> >> >>>>>>>> > > > for it to see if gerrit #104825 helps.
>> >> >> >>>>>>>> > > >
>> >> >> >>>>>>>> > > > [1] https://jenkins.ovirt.org/job/standard-manual-runner/909/
>> >> >> >>>>>>>> > > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6132/
>> >> >> >>>>>>>> > > >
>> >> >> >>>>>>>> > > > > Miguel, do you think merging
>> >> >> >>>>>>>> > > > > https://gerrit.ovirt.org/#/c/104495/15/common/yum-repos/ovirt-master-host-cq.repo.in
>> >> >> >>>>>>>> > > > > would solve this?
>> >> >> >>>>>>>> > >
>> >> >> >>>>>>>> > > I've split the patch Dominik mentions above in two, one of them adding
>> >> >> >>>>>>>> > > the nmstate / networkmanager copr repos - [3].
>> >> >> >>>>>>>> > >
>> >> >> >>>>>>>> > > Let's see if it fixes it.
>> >> >> >>>>>>>> >
>> >> >> >>>>>>>> > it fixes original issue, but OST still fails in
>> >> >> >>>>>>>> > 098_ovirt_provider_ovn.use_ovn_provider:
>> >> >> >>>>>>>> > https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> I think Dominik was looking into this issue; +Dominik Holler please confirm.
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> Let me know if you need any help Dominik.
>> >> >> >>>>>>>
>> >> >> >>>>>>> Thanks.
>> >> >> >>>>>>> The problem is that the hosts lost connection to storage:
>> >> >> >>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exp... :
>> >> >> >>>>>>>
>> >> >> >>>>>>> 2019-11-22 05:39:12,326-0500 DEBUG (jsonrpc/5) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405107ea8b4e3ac4ddeb3e19890f$|^/dev/mapper/360014054924c91df75e41178e4b8a80c$|^/dev/mapper/3600140561c0d02829924b77ab7323f17$|^/dev/mapper/3600140582feebc04ca5409a99660dbbc$|^/dev/mapper/36001405c3c53755c13c474dada6be354$|", "r|.*|"] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name (cwd None) (commands:153)
>> >> >> >>>>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>> >> >> >>>>>>> Traceback (most recent call last):
>> >> >> >>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>> >> >> >>>>>>>     delay = result.delay()
>> >> >> >>>>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>> >> >> >>>>>>>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>> >> >> >>>>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
>> >> >> >>>>>>> 2019-11-22 05:39:12,416-0500 INFO (check/loop) [storage.Monitor] Domain d10879c6-8de1-40ba-87fa-f447844eed2a became INVALID (monitor:472)
>> >> >> >>>>>>>
>> >> >> >>>>>>> I failed to reproduce this locally to analyze it, I will try again, any hints welcome.
>> >> >> >>>>>>
>> >> >> >>>>>> https://gerrit.ovirt.org/#/c/104925/1/ shows that 008_basic_ui_sanity.py triggers the problem.
>> >> >> >>>>>> Is there someone with knowledge about the basic_ui_sanity around?
>> >> >> >>>>>
>> >> >> >>>>> How do you think it's related? By commenting out the ui sanity tests and seeing OST finish successfully?
>> >> >> >>>>>
>> >> >> >>>>> Looking at the 6134 run you were discussing:
>> >> >> >>>>>
>> >> >> >>>>> - timing of the ui sanity set-up [1]:
>> >> >> >>>>>
>> >> >> >>>>> 11:40:20 @ Run test: 008_basic_ui_sanity.py:
>> >> >> >>>>>
>> >> >> >>>>> - timing of the first encountered storage error [2]:
>> >> >> >>>>>
>> >> >> >>>>> 2019-11-22 05:39:12,415-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata (monitor:501)
>> >> >> >>>>> Traceback (most recent call last):
>> >> >> >>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>> >> >> >>>>>     delay = result.delay()
>> >> >> >>>>>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>> >> >> >>>>>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
>> >> >> >>>>> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/d10879c6-8de1-40ba-87fa-f447844eed2a/dom_md/metadata', 1, 'Read timeout')
>> >> >> >>>>>
>> >> >> >>>>> Timezone difference aside, it seems to me that these storage errors occurred before doing anything ui-related.
>> >> >> >>
>> >> >> >> You are right, a time.sleep(8*60) in https://gerrit.ovirt.org/#/c/104925/2
>> >> >> >> triggers the issue the same way.
>> >> >>
>> >> >> So this is a test issue, assuming that the UI tests can complete in
>> >> >> less than 8 minutes?
>> >> >>
>> >> > To my eyes this looks like storage just stops working after some time.
>> >> >
>> >> >> >
>> >> >> > Nir or Steve, can you please confirm that this is a storage problem?
>> >> >>
>> >> >> Why do you think we have a storage problem?
>> >> >>
>> >> > I understand from the posted log snippets that they say that the storage is not accessible anymore,
>> >>
>> >> No, so far one read timeout was reported, this does not mean storage
>> >> is not available anymore.
>> >> It can be a temporary issue that does not harm anything.
>> >>
>> >> > while the host is still responsive.
>> >> > This might be triggered by something outside storage, e.g. the network providing the storage stopped working.
>> >> > But I think a possible next step in analysing this issue would be to find the reason why storage is not happy.
>> >>
>> >
>> > Sounds like there was a miscommunication in this thread.
>> > I try to address all of your points, please let me know if something is missing or not clearly expressed.
>> >
>> >> First step is to understand which test fails,
>> >
>> > 098_ovirt_provider_ovn.use_ovn_provider
>> >
>> >> and why. This can be done by the owner of the test,
>> >
>> > The test was added by the network team.
>> >
>> >> understanding what the test does
>> >
>> > The test tries to add a vNIC.
>> >
>> >> and what is the expected system behavior.
>> >
>> > It is expected that adding a vNIC works, because the VM should be up.
>>
>> What was the actual behavior?
>>
>> >> If the owner of the test thinks that the test failed because of a storage issue
>> >
>> > I am not sure who is the owner, but I do.
>>
>> Can you explain how a vNIC failed because of a storage issue?
>>
>
>
> Test fails with:
>
> Cannot add a Network Interface when VM is not Down, Up or Image-Locked.
>
> engine.log says:
> {"jsonrpc": "2.0", "method": "|virt|VM_status|308bd254-9af9-4570-98ea-822609550acf", "params": {"308bd254-9af9-4570-98ea-822609550acf": {"status": "Paused", "pauseCode": "EOTHER", "ioerror": {"alias": "ua-953dd722-5e8b-4b24-bccd-a2a5d5befeb6", "name": "vda", "path": "/rhev/data-center/38c691d4-8556-4882-8f04-a88dff5d0973/bcd1622c-876b-460c-95a7-d09536c42ffe/images/953dd722-5e8b-4b24-bccd-a2a5d5befeb6/dcb5fec4-f219-4d3f-986c-628b0d00b349"}}, "notify_time": 4298388570}}

So you think adding a vNIC failed because the VM was paused?

Yes, because of the error message "Cannot add a Network Interface when VM is not Down, Up or Image-Locked."

> vdsm.log says:
>
> 2019-11-20 10:51:06,026-0500 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata (monitor:501)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 499, in _pathChecked
>     delay = result.delay()
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
> vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata', 1, 'Read timeout')

Is this related to the paused vm?

The log entry '{"status": "Paused", "pauseCode": "EOTHER", "ioerror"' makes me think so.

You did not provide a timestamp for the engine event above.

I can't find last week's logs, maybe they have faded out already.
Please find more recent logs in
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/6492

> ...
>
> 2019-11-20 10:51:56,249-0500 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share2/64daa060-1d83-46b9-b7e8-72a902e1134b/dom_md/metadata' is blocked for 60.00 seconds (check:282)
> 2019-11-20 10:51:56,885-0500 ERROR (monitor/775b710) [storage.Monitor] Error checking domain 775b7102-7f2c-4eee-a4d0-a41b55451f7e (monitor:427)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 408, in _checkDomainStatus
>     self.domain.selftest()
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 710, in selftest
>     self.oop.os.statvfs(self.domaindir)
>   File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 242, in statvfs
>     return self._iop.statvfs(path)
>   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in statvfs
>     resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
>   File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
>     raise Timeout(os.strerror(errno.ETIMEDOUT))
> ioprocess.Timeout: Connection timed out
This shows that storage was not accessible for 60 seconds (ioprocess uses a 60-second timeout).
A 60-second timeout is bad. If we have leases on this storage domain (e.g. the SPM lease) they will expire 20 seconds after this event and vdsm on the SPM host will be killed.
Do we have network tests changing the network used by the NFS storage domain before this event?

No.

What were the changes in the network tests or code since OST was successful?

I am not aware of a change which might be relevant.
Maybe the fact that the hosts are on CentOS 8, while the Engine (storage) is on CentOS 7, is relevant.
Also, the occurrence of this issue seems not to be 100% deterministic, I guess because it is timing related.
The error is reproducible locally by running OST and just keeping the environment alive after basic-suite-master succeeded.
After some time, the storage will become inaccessible.
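Once the environment is in that state, a plain loop on the host against one of the domain metadata files (path taken from the vdsm.log above) should pin down the exact moment the mount stops answering, independently of vdsm; this is only a sketch, and roughly the same kind of direct read vdsm's checker performs:

  while true; do
      date
      # read one block with O_DIRECT so the page cache cannot hide a dead mount
      timeout 10 dd if=/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/bcd1622c-876b-460c-95a7-d09536c42ffe/dom_md/metadata \
          of=/dev/null bs=4096 count=1 iflag=direct || echo "read failed or timed out"
      sleep 10
  done

Correlating that timestamp with the engine's /var/log/messages and the host's network state should tell whether it is the NFS server or the path to it that goes away.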
>> Can you explain how adding an 8 minute sleep instead of the UI tests
>> reproduced the issue?
>
> This shows that the issue is not triggered by the UI test, but maybe by passing time.

Do we run the ovn tests after the UI tests?

>> >> someone from storage can look at this.
>> >
>> > Thanks, I would appreciate this.
>> >
>> >> But the fact that adding a long sleep reproduces the issue means it is not related
>> >> in any way to storage.
>> >>
>> >> Nir
>> >>
>> >> >> >>>>>
>> >> >> >>>>> I remember talking with Steven Rosenberg on IRC a couple of days ago about some
>> >> >> >>>>> storage metadata issues and he said he got a response from Nir that "it's a known issue".
>> >> >> >>>>>
>> >> >> >>>>> Nir, Amit, can you comment on this?
>> >> >> >>>>
>> >> >> >>>> The error mentioned here is not a vdsm error but a warning about storage
>> >> >> >>>> accessibility. We should convert the tracebacks to warnings.
>> >> >> >>>>
>> >> >> >>>> The reason for such an issue can be a misconfigured network (maybe the network
>> >> >> >>>> team is testing negative flows?),
>> >> >> >>>
>> >> >> >>> No.
>> >> >> >>>
>> >> >> >>>> or some issue in the NFS server.
>> >> >> >>>
>> >> >> >>> Only hint I found is
>> >> >> >>> "Exiting Time2Retain handler because session_reinstatement=1"
>> >> >> >>> but I have no idea what this means or if this is relevant at all.
>> >> >> >>>
>> >> >> >>>> One read timeout is not an issue. We have a real issue only if we have consistent
>> >> >> >>>> read timeouts or errors for a couple of minutes; after that the engine can deactivate
>> >> >> >>>> the storage domain, or some hosts if only these hosts are having trouble accessing storage.
>> >> >> >>>>
>> >> >> >>>> In OST we never expect such conditions since we don't test negative flows, and we
>> >> >> >>>> should have good connectivity with the vms running on the same host.
>> >> >> >>>
>> >> >> >>> Ack, this seems to be the problem.
>> >> >> >>>
>> >> >> >>>> Nir
>> >> >> >>>>
>> >> >> >>>>> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/console
>> >> >> >>>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6134/artifact/exp...
>> >> >> >>>>>
>> >> >> >>>>> Marcin, could you please take a look?
>> >> >> >>>>>>
>> >> >> >>>>>>>> >
>> >> >> >>>>>>>> > > [3] - https://gerrit.ovirt.org/#/c/104897/
>> >> >> >>>>>>>> > >
>> >> >> >>>>>>>> > > > > >> Who installs this rpm in OST?
>> >> >> >>>>>>>> > > > > >
>> >> >> >>>>>>>> > > > > > I do not understand the question.
>> >> >> >>>>>>>> > > > >
>> >> >> >>>>>>>> > > > > >> > [...]
>> >> >> >>>>>>>> > > > > >> >
>> >> >> >>>>>>>> > > > > >> > See [2] for full error.
>> >> >> >>>>>>>> > > > > >> >
>> >> >> >>>>>>>> > > > > >> > Can someone please take a look?
>> >> >> >>>>>>>> > > > > >> > Thanks
>> >> >> >>>>>>>> > > > > >> > Vojta
>> >> >> >>>>>>>> > > > > >> >
>> >> >> >>>>>>>> > > > > >> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/
>> >> >> >>>>>>>> > > > > >> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6128/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> --
>> >> >> >>>>> Martin Perina
>> >> >> >>>>> Manager, Software Engineering
>> >> >> >>>>> Red Hat Czech s.r.o.
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >>
>> >>
>>
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.