On Sun, Oct 4, 2020 at 2:07 AM Gianluca Cecchi <gianluca.cecchi(a)gmail.com>
wrote:
On Sat, Oct 3, 2020 at 9:42 PM Amit Bawer <abawer(a)redhat.com>
wrote:
>
>
> On Sat, Oct 3, 2020 at 10:24 PM Amit Bawer <abawer(a)redhat.com> wrote:
>
>>
>>
>> For the gluster bricks being filtered out in 4.4.2, this seems like [1].
>>
>> [1]
https://bugzilla.redhat.com/show_bug.cgi?id=1883805
>>
>
> Maybe remove the LVM filter from /etc/lvm/lvm.conf while in 4.4.2
> maintenance mode.
> If the fs is mounted as read-only, try
>
> mount -o remount,rw /
>
> then sync and try to reboot 4.4.2.
>
>
Indeed, if I run the following command from the emergency shell in 4.4.2:
lvs --config 'devices { filter = [ "a|.*|" ] }'
I also see all the gluster volumes, so I think the update injected the
nasty filter.
Possibly during the update the command
# vdsm-tool config-lvm-filter -y
was executed and erroneously created the filter?
Since there wasn't a filter set on the node, the 4.4.2 update added the
default filter for the root-lv PV. If there had been a filter set before the
upgrade, it would not have been added by the 4.4.2 update.
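For reference, the filter that vdsm-tool config-lvm-filter writes normally
accepts only the PVs backing the host's own volume groups and rejects
everything else, so it typically looks something like this (the PV UUID
below is just a placeholder):

filter = [ "a|^/dev/disk/by-id/lvm-pv-uuid-XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX$|", "r|.*|" ]

With such a filter the gluster brick PVs are rejected, so their LVs are not
activated at boot, which matches what you saw.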
Anyway, remounting the root filesystem read-write, removing the filter line
from lvm.conf and rebooting worked: 4.4.2 booted OK and I was able to exit
global maintenance and bring the engine up.
Thanks Amit for the help and all the insights.
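For anyone hitting the same issue, the recovery sequence was roughly this,
from the 4.4.2 emergency shell (assuming the filter line is the one added
during the upgrade):

mount -o remount,rw /
vi /etc/lvm/lvm.conf     # remove or comment out the added "filter = [...]" line
sync
reboot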
Right now there are only two problems:
1) A long-running problem: from the engine web admin all the volumes are
seen as up and the storage domains as up, while in reality only the
hosted-engine one is up and "data" and "vmstore" are down, as I can verify
from the host, where there is only one /rhev/data-center/ mount:
[root@ovirt01 ~]# df -h
Filesystem                                               Size  Used Avail Use% Mounted on
devtmpfs                                                  16G     0   16G   0% /dev
tmpfs                                                     16G   16K   16G   1% /dev/shm
tmpfs                                                     16G   18M   16G   1% /run
tmpfs                                                     16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/onn-ovirt--node--ng--4.4.2--0.20200918.0+1   133G  3.9G  129G   3% /
/dev/mapper/onn-tmp                                     1014M   40M  975M   4% /tmp
/dev/mapper/gluster_vg_sda-gluster_lv_engine             100G  9.0G   91G   9% /gluster_bricks/engine
/dev/mapper/gluster_vg_sda-gluster_lv_data               500G  126G  375G  26% /gluster_bricks/data
/dev/mapper/gluster_vg_sda-gluster_lv_vmstore             90G  6.9G   84G   8% /gluster_bricks/vmstore
/dev/mapper/onn-home                                    1014M   40M  975M   4% /home
/dev/sdb2                                                976M  307M  603M  34% /boot
/dev/sdb1                                                599M  6.8M  593M   2% /boot/efi
/dev/mapper/onn-var                                       15G  263M   15G   2% /var
/dev/mapper/onn-var_log                                  8.0G  541M  7.5G   7% /var/log
/dev/mapper/onn-var_crash                                 10G  105M  9.9G   2% /var/crash
/dev/mapper/onn-var_log_audit                            2.0G   79M  2.0G   4% /var/log/audit
ovirt01st.lutwyn.storage:/engine                         100G   10G   90G  10% /rhev/data-center/mnt/glusterSD/ovirt01st.lutwyn.storage:_engine
tmpfs                                                    3.2G     0  3.2G   0% /run/user/1000
[root@ovirt01 ~]#
I can also wait 10 minutes and nothing changes. The way I get out of this
stalled situation is to power on a VM, which obviously fails:

VM f32 is down with error. Exit message: Unable to get volume size for
domain d39ed9a3-3b10-46bf-b334-e8970f5deca1 volume
242d16c6-1fd9-4918-b9dd-0d477a86424c.
10/4/20 12:50:41 AM

and suddenly all the data storage domains are deactivated (from the
engine's point of view, because actually they were not active...):

Storage Domain vmstore (Data Center Default) was deactivated by system
because it's not visible by any of the hosts.
10/4/20 12:50:31 AM

Then I can go to Data Centers --> Default --> Storage, activate the
"vmstore" and "data" storage domains, and suddenly they get activated and
the filesystems mounted.
[root@ovirt01 ~]# df -h | grep rhev
ovirt01st.lutwyn.storage:/engine   100G   10G   90G  10% /rhev/data-center/mnt/glusterSD/ovirt01st.lutwyn.storage:_engine
ovirt01st.lutwyn.storage:/data     500G  131G  370G  27% /rhev/data-center/mnt/glusterSD/ovirt01st.lutwyn.storage:_data
ovirt01st.lutwyn.storage:/vmstore   90G  7.8G   83G   9% /rhev/data-center/mnt/glusterSD/ovirt01st.lutwyn.storage:_vmstore
[root@ovirt01 ~]#
and the VM starts OK now.
I already reported this, but I don't know if there is yet a bugzilla open
for it.
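In case it helps with triage, a rough sketch of host-side checks one could
run while the domains show as down (assuming the gluster volume names match
the mounts above), to confirm the volumes themselves are healthy and simply
not mounted under /rhev:

gluster volume status data
gluster volume heal data info
grep glusterSD /proc/mounts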
Did you get any response to the original mail? I haven't seen it on the
users list.
2) I see that I cannot connect to the cockpit console of the node.
In Firefox (version 80) on my Fedora 31 I get:
"
Secure Connection Failed
An error occurred during a connection to ovirt01.lutwyn.local:9090.
PR_CONNECT_RESET_ERROR
The page you are trying to view cannot be shown because the
authenticity of the received data could not be verified.
Please contact the website owners to inform them of this problem.
Learn more…
"
In Chrome (build 85.0.4183.121) I get:
"
Your connection is not private
Attackers might be trying to steal your information from
ovirt01.lutwyn.local (for example, passwords, messages, or credit cards).
Learn more
NET::ERR_CERT_AUTHORITY_INVALID
"
If I click Advanced, it shows:
"
This server could not prove that it is ovirt01.lutwyn.local; its security
certificate is not trusted by your computer's operating system. This may be
caused by a misconfiguration or an attacker intercepting your connection."
and if I then select to proceed to the site anyway, I get:
"
This page isn’t working ovirt01.lutwyn.local didn’t send any data.
ERR_EMPTY_RESPONSE
"
NOTE: the host is not resolved by DNS, but I put an entry in the hosts file
on my client.
Setting up DNS might be required for authenticity; maybe other members on
the list can tell better.
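One way to check which certificate cockpit is actually serving would be
something like this (just a sketch; use whatever hostname the browser uses):

openssl s_client -connect ovirt01.lutwyn.local:9090 -servername ovirt01.lutwyn.local </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates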
On host:
[root@ovirt01 ~]# systemctl status cockpit.socket --no-pager
● cockpit.socket - Cockpit Web Service Socket
   Loaded: loaded (/usr/lib/systemd/system/cockpit.socket; disabled; vendor preset: enabled)
   Active: active (listening) since Sun 2020-10-04 00:36:36 CEST; 25min ago
     Docs: man:cockpit-ws(8)
   Listen: [::]:9090 (Stream)
  Process: 1425 ExecStartPost=/bin/ln -snf active.motd /run/cockpit/motd (code=exited, status=0/SUCCESS)
  Process: 1417 ExecStartPost=/usr/share/cockpit/motd/update-motd localhost (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 202981)
   Memory: 1.6M
   CGroup: /system.slice/cockpit.socket

Oct 04 00:36:36 ovirt01.lutwyn.local systemd[1]: Starting Cockpit Web Service Socket.
Oct 04 00:36:36 ovirt01.lutwyn.local systemd[1]: Listening on Cockpit Web Service Socket.
[root@ovirt01 ~]#
[root@ovirt01 ~]# systemctl status cockpit.service --no-pager
● cockpit.service - Cockpit Web Service
   Loaded: loaded (/usr/lib/systemd/system/cockpit.service; static; vendor preset: disabled)
   Active: active (running) since Sun 2020-10-04 00:58:09 CEST; 3min 30s ago
     Docs: man:cockpit-ws(8)
  Process: 19260 ExecStartPre=/usr/sbin/remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t (code=exited, status=0/SUCCESS)
 Main PID: 19263 (cockpit-tls)
    Tasks: 1 (limit: 202981)
   Memory: 1.4M
   CGroup: /system.slice/cockpit.service
           └─19263 /usr/libexec/cockpit-tls

Oct 04 00:59:59 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: connect(http-redirect.sock) failed: Permission denied
Oct 04 00:59:59 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: connect(http-redirect.sock) failed: Permission denied
Oct 04 01:00:11 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: gnutls_handshake failed: A TLS fatal alert has been received.
Oct 04 01:00:11 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: connect(https-factory.sock) failed: Permission denied
Oct 04 01:00:11 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: gnutls_handshake failed: A TLS fatal alert has been received.
Oct 04 01:00:11 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: connect(https-factory.sock) failed: Permission denied
Oct 04 01:00:16 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: gnutls_handshake failed: A TLS fatal alert has been received.
Oct 04 01:00:16 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: gnutls_handshake failed: A TLS fatal alert has been received.
Oct 04 01:00:16 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: gnutls_handshake failed: A TLS fatal alert has been received.
Oct 04 01:00:16 ovirt01.lutwyn.local cockpit-tls[19263]: cockpit-tls: connect(https-factory.sock) failed: Permission denied
[root@ovirt01 ~]#
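The connect(...) failed: Permission denied messages make me wonder whether
SELinux is involved; a quick way to check for denials would be something
like (assuming auditd is running):

ausearch -m avc -ts recent | grep -i cockpit
journalctl -u cockpit.service --since "1 hour ago"

but I have not verified that this is actually the cause.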
Gianluca