On Mon, Aug 26, 2019 at 6:13 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Mon, Aug 26, 2019 at 12:44 PM Ales Musil <amusil@redhat.com> wrote:


On Mon, Aug 26, 2019 at 12:30 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Mon, Aug 26, 2019 at 11:58 AM Ales Musil <amusil@redhat.com> wrote:
 
I can see that MOM is failing to start because some of the MOM dependencies are not starting. Can you please post the output of 'systemctl status momd'?


 
 ● momd.service - Memory Overcommitment Manager Daemon
   Loaded: loaded (/usr/lib/systemd/system/momd.service; static; vendor preset: disabled)
   Active: inactive (dead)

Perhaps the status of any other daemon?
Or is there any momd-related log file generated?

BTW: on a running oVirt 4.3.5 node from another environment I see that the status of momd is the same: inactive (dead)
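
In the meantime, the only generic checks that come to my mind are these (just a sketch, nothing oVirt-specific assumed):

# what momd.service pulls in / depends on
systemctl list-dependencies momd
# the unit file actually in use
systemctl cat momd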

 
What happens if you try to start momd?
 
[root@ovirt01 ~]# systemctl status momd
● momd.service - Memory Overcommitment Manager Daemon
   Loaded: loaded (/usr/lib/systemd/system/momd.service; static; vendor preset: disabled)
   Active: inactive (dead)
[root@ovirt01 ~]# systemctl start momd
[root@ovirt01 ~]#

[root@ovirt01 ~]# systemctl status momd -l
● momd.service - Memory Overcommitment Manager Daemon
   Loaded: loaded (/usr/lib/systemd/system/momd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Mon 2019-08-26 18:10:20 CEST; 6s ago
  Process: 18417 ExecStart=/usr/sbin/momd -c /etc/momd.conf -d --pid-file /var/run/momd.pid (code=exited, status=0/SUCCESS)
 Main PID: 18419 (code=exited, status=0/SUCCESS)

Aug 26 18:10:20 ovirt01.mydomain systemd[1]: Starting Memory Overcommitment Manager Daemon...
Aug 26 18:10:20 ovirt01.mydomain systemd[1]: momd.service: Supervising process 18419 which is not our child. We'll most likely not notice when it exits.
Aug 26 18:10:20 ovirt01.mydomain systemd[1]: Started Memory Overcommitment Manager Daemon.
Aug 26 18:10:20 ovirt01.mydomain python[18419]: No worthy mechs found
[root@ovirt01 ~]# 

[root@ovirt01 ~]# ps -fp 18419
UID        PID  PPID  C STIME TTY          TIME CMD
[root@ovirt01 ~]#

[root@ovirt01 vdsm]# ps -fp 18417
UID        PID  PPID  C STIME TTY          TIME CMD
[root@ovirt01 vdsm]#

No log file update under /var/log/vdsm

[root@ovirt01 vdsm]# ls -lt | head -5
total 118972
-rw-r--r--. 1 root root 3406465 Aug 23 00:25 supervdsm.log
-rw-r--r--. 1 root root   73621 Aug 23 00:25 upgrade.log
-rw-r--r--. 1 vdsm kvm        0 Aug 23 00:01 vdsm.log
-rw-r--r--. 1 vdsm kvm   538480 Aug 22 23:46 vdsm.log.1.xz
[root@ovirt01 vdsm]#
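
Since nothing shows up under /var/log/vdsm, the only other place I can think of looking is the journal (a sketch, assuming the unit name is simply momd):

journalctl -u momd --no-pager -n 50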

Gianluca

It seems that the steps below solved the problem (dunno what it was, though...).
Based on this similar error ("No worthy mechs found") I found inspiration here:

https://lists.ovirt.org/pipermail/users/2017-January/079009.html
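
As far as I understand, "No worthy mechs found" comes from the Cyrus SASL library when it cannot find a usable authentication mechanism (here presumably for the connection to libvirt), so having a quick look at the SASL side might be worth it too. Just a sketch of what one could check, not necessarily the root cause:

# which SASL mechanism plugins are installed
rpm -qa 'cyrus-sasl*'
# libvirt's SASL configuration (path on CentOS/RHEL, as far as I know)
cat /etc/sasl2/libvirt.conf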


[root@ovirt01 ~]# vdsm-tool configure

Checking configuration status...

abrt is not configured for vdsm
Managed volume database is already configured
lvm is configured for vdsm
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts
Manual override for multipath.conf detected - preserving current configuration
This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives

Running configure...
Reconfiguration of abrt is done.

Done configuring modules to VDSM.
[root@ovirt01 ~]#
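
For the archives: vdsm-tool can also (re)configure a single module, and --force redoes modules that are already configured (a sketch; check 'vdsm-tool configure --help' for the exact options of your version):

vdsm-tool configure --module libvirt --force
vdsm-tool configure --force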

[root@ovirt01 ~]# systemctl restart vdsmd
[root@ovirt01 ~]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/etc/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-08-26 18:23:29 CEST; 19s ago
  Process: 27326 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
  Process: 27329 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 27401 (vdsmd)
    Tasks: 75
   CGroup: /system.slice/vdsmd.service
           ├─27401 /usr/bin/python2 /usr/share/vdsm/vdsmd
           ├─27524 /usr/libexec/ioprocess --read-pipe-fd 49 --write-pipe-fd 47 --max-threads 10 --max-queued-requests 10
           ├─27531 /usr/libexec/ioprocess --read-pipe-fd 55 --write-pipe-fd 54 --max-threads 10 --max-queued-requests 10
           ├─27544 /usr/libexec/ioprocess --read-pipe-fd 60 --write-pipe-fd 59 --max-threads 10 --max-queued-requests 10
           ├─27553 /usr/libexec/ioprocess --read-pipe-fd 67 --write-pipe-fd 66 --max-threads 10 --max-queued-requests 10
           ├─27559 /usr/libexec/ioprocess --read-pipe-fd 72 --write-pipe-fd 71 --max-threads 10 --max-queued-requests 10
           └─27566 /usr/libexec/ioprocess --read-pipe-fd 78 --write-pipe-fd 77 --max-threads 10 --max-queued-requests 10

Aug 26 18:23:29 ovirt01.mydomain vdsmd_init_common.sh[27329]: vdsm: Running dummybr
Aug 26 18:23:29 ovirt01.mydomain vdsmd_init_common.sh[27329]: vdsm: Running tune_system
Aug 26 18:23:29 ovirt01.mydomain vdsmd_init_common.sh[27329]: vdsm: Running test_space
Aug 26 18:23:29 ovirt01.mydomain vdsmd_init_common.sh[27329]: vdsm: Running test_lo
Aug 26 18:23:29 ovirt01.mydomain systemd[1]: Started Virtual Desktop Server Manager.
Aug 26 18:23:30 ovirt01.mydomain vdsm[27401]: WARN unhandled write event
Aug 26 18:23:30 ovirt01.mydomain vdsm[27401]: WARN MOM not available.
Aug 26 18:23:30 ovirt01.mydomain vdsm[27401]: WARN MOM not available, KSM stats will be missing.
Aug 26 18:23:31 ovirt01.mydomain vdsm[27401]: WARN Not ready yet, ignoring event '|virt|VM_status|4dae6016-ff01-4a...r shu
Aug 26 18:23:45 ovirt01.mydomain vdsm[27401]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operatio...back:
                                                File: "/usr/lib64/python2.7/threading.py", line 785, i
Hint: Some lines were ellipsized, use -l to show in full.
[root@ovirt01 ~]#

Previously I had restarted vdsmd many times without effect...
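
To see whether MOM actually becomes available after the restart, the only thing I can think of is watching the logs (just a sketch):

# MOM's own log, which shows up under /var/log/vdsm once MOM is running
tail -f /var/log/vdsm/mom.log
# MOM-related warnings from vdsm itself
journalctl -u vdsmd -f | grep -i mom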

After a while (2 minutes):

[root@ovirt01 ~]# hosted-engine --vm-status


!! Cluster is in GLOBAL MAINTENANCE mode !!



--== Host ovirt01.mydomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt01.mydomain
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Score                              : 3000
stopped                            : False
Local maintenance                  : False
crc32                              : a68d97bb
local_conf_timestamp               : 324335
Host timestamp                     : 324335
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=324335 (Mon Aug 26 18:30:39 2019)
host-id=1
score=3000
vm_conf_refresh_time=324335 (Mon Aug 26 18:30:39 2019)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@ovirt01 ~]#

That was the state when I updated the only node present.

Exiting from global maintenance:
[root@ovirt01 ~]# hosted-engine --set-maintenance --mode=none
[root@ovirt01 ~]#
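
For completeness, these are the maintenance modes I know of (double-check with 'hosted-engine --help'):

hosted-engine --set-maintenance --mode=global   # HA agents stop managing the engine VM cluster-wide
hosted-engine --set-maintenance --mode=local    # only this host stops hosting the engine VM
hosted-engine --set-maintenance --mode=none     # back to normal HA operation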


[root@ovirt01 ~]# hosted-engine --vm-status


--== Host ovirt01.mydomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt01.mydomain
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Down"}
Score                              : 3000
stopped                            : False
Local maintenance                  : False
crc32                              : 7b58fabd
local_conf_timestamp               : 324386
Host timestamp                     : 324386
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=324386 (Mon Aug 26 18:31:29 2019)
host-id=1
score=3000
vm_conf_refresh_time=324386 (Mon Aug 26 18:31:30 2019)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
[root@ovirt01 ~]#

[root@ovirt01 ~]# hosted-engine --vm-status


--== Host ovirt01.mydomain (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt01.mydomain
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5e824330
local_conf_timestamp               : 324468
Host timestamp                     : 324468
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=324468 (Mon Aug 26 18:32:51 2019)
host-id=1
score=3400
vm_conf_refresh_time=324468 (Mon Aug 26 18:32:51 2019)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
[root@ovirt01 ~]#

And I'm able to connect to my engine web admin GUI.
After a few more minutes, 5 or so, the data domain comes up and I'm able to power on the other VMs.

[root@ovirt01 vdsm]# ls -lt | head -5
total 123972
-rw-r--r--. 1 vdsm kvm   201533 Aug 26 18:38 mom.log
-rw-r--r--. 1 vdsm kvm  2075421 Aug 26 18:38 vdsm.log
-rw-r--r--. 1 root root 3923102 Aug 26 18:38 supervdsm.log
-rw-r--r--. 1 root root   73621 Aug 23 00:25 upgrade.log
[root@ovirt01 vdsm]#
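
If useful, I can also grep mom.log for the policy/KSM lines, something like (just a sketch):

grep -iE 'policy|ksm' /var/log/vdsm/mom.log | tail -20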

Let me know if you want any files to read and think about the reason...

Thanks for now.

Gianluca