On Fri, Oct 23, 2015 at 5:55 PM, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Fri, Oct 23, 2015 at 5:05 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:


OK, can you please try the whole reboot procedure again, just to make sure that it was only a temporary NFS glitch?


It seems reproducible.

This time I was able to shut down the hypervisor without a manual power off.
The only strange thing is that I ran

shutdown -h now

and the VM at some point (I was able to see that the watchdog stopped...) actually booted...?

Related lines in messages:
Oct 23 17:33:32 ovc71 systemd: Unmounting RPC Pipe File System...
Oct 23 17:33:32 ovc71 systemd: Stopping Session 11 of user root.
Oct 23 17:33:33 ovc71 systemd: Stopped Session 11 of user root.
Oct 23 17:33:33 ovc71 systemd: Stopping user-0.slice.
Oct 23 17:33:33 ovc71 systemd: Removed slice user-0.slice.
Oct 23 17:33:33 ovc71 systemd: Stopping vdsm-dhclient.slice.
Oct 23 17:33:33 ovc71 systemd: Removed slice vdsm-dhclient.slice.
Oct 23 17:33:33 ovc71 systemd: Stopping vdsm.slice.
Oct 23 17:33:33 ovc71 systemd: Removed slice vdsm.slice.
Oct 23 17:33:33 ovc71 systemd: Stopping Sound Card.
Oct 23 17:33:33 ovc71 systemd: Stopped target Sound Card.
Oct 23 17:33:33 ovc71 systemd: Stopping LVM2 PV scan on device 8:2...
Oct 23 17:33:33 ovc71 systemd: Stopping LVM2 PV scan on device 8:16...
Oct 23 17:33:33 ovc71 systemd: Stopping Dump dmesg to /var/log/dmesg...
Oct 23 17:33:33 ovc71 systemd: Stopped Dump dmesg to /var/log/dmesg.
Oct 23 17:33:33 ovc71 systemd: Stopping Watchdog Multiplexing Daemon...
Oct 23 17:33:33 ovc71 systemd: Stopping Multi-User System.
Oct 23 17:33:33 ovc71 systemd: Stopped target Multi-User System.
Oct 23 17:33:33 ovc71 systemd: Stopping ABRT kernel log watcher...
Oct 23 17:33:33 ovc71 systemd: Stopping Command Scheduler...
Oct 23 17:33:33 ovc71 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="690" x-info="http://www.rsyslog.com"] exiting on signal 15.
Oct 23 17:36:24 ovc71 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="697" x-info="http://www.rsyslog.com"] start
Oct 23 17:36:21 ovc71 journal: Runtime journal is using 8.0M (max 500.0M, leaving 750.0M of free 4.8G, current limit 500.0M).
Oct 23 17:36:21 ovc71 kernel: Initializing cgroup subsys cpuset


Coming back to the oVirt processes, I see:

[root@ovc71 ~]# systemctl status ovirt-ha-broker
ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled)
   Active: inactive (dead) since Fri 2015-10-23 17:36:25 CEST; 31s ago
  Process: 849 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-broker stop (code=exited, status=0/SUCCESS)
  Process: 723 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start (code=exited, status=0/SUCCESS)
 Main PID: 844 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/ovirt-ha-broker.service

Oct 23 17:36:24 ovc71.localdomain.local systemd-ovirt-ha-broker[723]: Starting ovirt-ha-broker: [...
Oct 23 17:36:24 ovc71.localdomain.local systemd[1]: Started oVirt Hosted Engine High Availabili...r.
Oct 23 17:36:25 ovc71.localdomain.local systemd-ovirt-ha-broker[849]: Stopping ovirt-ha-broker: [...
Hint: Some lines were ellipsized, use -l to show in full.

And
[root@ovc71 ~]# systemctl status nfs-server
nfs-server.service - NFS server and services
   Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled)
   Active: active (exited) since Fri 2015-10-23 17:36:27 CEST; 1min 9s ago
  Process: 1123 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=0/SUCCESS)
  Process: 1113 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
 Main PID: 1123 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/nfs-server.service

Oct 23 17:36:27 ovc71.localdomain.local systemd[1]: Starting NFS server and services...
Oct 23 17:36:27 ovc71.localdomain.local systemd[1]: Started NFS server and services.

So it seems that the broker tries to start and fails (17:36:25) before the NFS server start phase completes (17:36:27)...?

Again, if I then manually start ha-broker and ha-agent, they start OK and I'm operational again with the self-hosted (sh) engine up.

The systemd unit file for the broker is this:

[Unit]
Description=oVirt Hosted Engine High Availability Communications Broker

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/ovirt-ha-broker
ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start
ExecStop=/usr/lib/systemd/systemd-ovirt-ha-broker stop

[Install]
WantedBy=multi-user.target

Probably inside the [Unit] section I should add
After=nfs-server.service


OK, understood.
You are right: the broker was failing because the NFS storage was not ready, since it is served in loopback and there isn't any explicit service dependency on that.

We are not imposing it because an NFS shared domain is generally expected to be served from an external system, while a loopback NFS is just a degenerate case.
Simply fix it manually.
 
But this should be true only for a self-hosted engine configured with NFS... so should it be done at install/setup time?

If you want I can set this change for my environment and verify...
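
A minimal sketch of how such a change could be applied without editing the packaged unit file, assuming a systemd drop-in is acceptable here. The function name and the drop-in file name nfs-loopback.conf are hypothetical choices for illustration, not something shipped by oVirt. Note that After= only orders the two units; it does not by itself guarantee that the export is already mountable when the broker starts.

```shell
#!/bin/sh
# Write a drop-in that orders a unit after nfs-server.service.
#   $1 = systemd configuration root (normally /etc/systemd/system)
#   $2 = unit name, e.g. ovirt-ha-broker.service
write_nfs_ordering_dropin() {
    dropin_dir="$1/$2.d"
    mkdir -p "$dropin_dir" || return 1
    # nfs-loopback.conf is an arbitrary (hypothetical) file name
    printf '[Unit]\nAfter=nfs-server.service\n' > "$dropin_dir/nfs-loopback.conf"
}

# Real use (as root), followed by a reload so systemd picks up the drop-in:
#   write_nfs_ordering_dropin /etc/systemd/system ovirt-ha-broker.service
#   systemctl daemon-reload
```

After a reboot, the effective ordering can be checked with "systemctl show -p After ovirt-ha-broker.service".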


 

The issue was here:  --spice-host-subject="C=EN, L=Test, O=Test, CN=Test"
This was just the temporary subject used by hosted-engine-setup during the bootstrap sequence, when your engine did not exist yet.
At the end that cert got replaced by the engine CA-signed ones, so you have to change that subject to match the one you used during your setup.
 

Even using the correct certificate I have a problem.
On the hypervisor:

[root@ovc71 ~]# openssl x509 -in /etc/pki/vdsm/libvirt-spice/ca-cert.pem -text | grep Subject
        Subject: C=US, O=localdomain.local, CN=shengine.localdomain.local.75331
        Subject Public Key Info:
            X509v3 Subject Key Identifier: 

On the engine:
[root@shengine ~]# openssl x509 -in  /etc/pki/ovirt-engine/ca.pem -text | grep Subject
        Subject: C=US, O=localdomain.local, CN=shengine.localdomain.local.75331
        Subject Public Key Info:
            X509v3 Subject Key Identifier: 

but

[root@ovc71 ~]# hosted-engine --add-console-password
Enter password: 
code = 0
message = 'Done'

[root@ovc71 ~]# remote-viewer --spice-ca-file=/etc/pki/vdsm/libvirt-spice/ca-cert.pem spice://localhost?tls-port=5900 --spice-host-subject="C=US, O=localdomain.local, CN=shengine.localdomain.local.75331"

it should be:  
remote-viewer --spice-ca-file=/etc/pki/vdsm/libvirt-spice/ca-cert.pem spice://ovc71.localdomain.local?tls-port=5900 --spice-host-subject="C=US, O=localdomain.local, CN=ovc71.localdomain.local" 
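
Rather than typing the subject by hand, it can be read from the certificate the host actually presents. This is only a sketch under two assumptions: that the SPICE server certificate lives at /etc/pki/vdsm/libvirt-spice/server-cert.pem (next to the ca-cert.pem used above), and that openssl prints the subject in the slash-separated OpenSSL 1.0.x form ("subject= /C=.../O=.../CN=..."). The helper name subject_to_spice is hypothetical, and it would break on field values that themselves contain a "/".

```shell
#!/bin/sh
# Convert "subject= /C=US/O=example/CN=host" (OpenSSL 1.0.x output, read from
# stdin) into the comma-separated form used with --spice-host-subject.
subject_to_spice() {
    sed -e 's/^subject= *//' -e 's|^/||' -e 's|/|, |g'
}

# Intended use on the hypervisor (path is an assumption, see above):
#   openssl x509 -in /etc/pki/vdsm/libvirt-spice/server-cert.pem -noout -subject | subject_to_spice
```

The printed string can then be passed as the value of --spice-host-subject.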



** (remote-viewer:4297): WARNING **: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-Gb5xXSKiKK: Connection refused
GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
(/usr/bin/remote-viewer:4297): Spice-Warning **: ssl_verify.c:492:openssl_verify: ssl: subject 'C=US, O=localdomain.local, CN=shengine.localdomain.local.75331' verification failed
(/usr/bin/remote-viewer:4297): Spice-Warning **: ssl_verify.c:494:openssl_verify: ssl: verification failed

(remote-viewer:4297): GSpice-WARNING **: main-1:0: SSL_connect: error:00000001:lib(0):func(0):reason(1)


and the remote-viewer window with


 Unable to connect to the graphic server spice://localhost?tls-port=5900