Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)

Wednesday, 11 March 2015

----- Original Message -----
...
 From: "Bob Doolittle" <bob(a)doolittle.us.com>
 To: "Simone Tiraboschi" <stirabos(a)redhat.com>
 Cc: "users-ovirt" <users(a)ovirt.org>
 Sent: Tuesday, March 10, 2015 7:29:44 PM
 Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM
 host was found in a failed state)

 On 03/10/2015 10:20 AM, Simone Tiraboschi wrote:
 >
 > ----- Original Message -----
 >> From: "Bob Doolittle" <bob(a)doolittle.us.com>
 >> To: "Simone Tiraboschi" <stirabos(a)redhat.com>
 >> Cc: "users-ovirt" <users(a)ovirt.org>
 >> Sent: Tuesday, March 10, 2015 2:40:13 PM
 >> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on
 >> F20 (The VDSM host was found in a failed
 >> state)
 >>
 >>
 >> On 03/10/2015 04:58 AM, Simone Tiraboschi wrote:
 >>> ----- Original Message -----
 >>>> From: "Bob Doolittle" <bob(a)doolittle.us.com>
 >>>> To: "Simone Tiraboschi" <stirabos(a)redhat.com>
 >>>> Cc: "users-ovirt" <users(a)ovirt.org>
 >>>> Sent: Monday, March 9, 2015 11:48:03 PM
 >>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on
 >>>> F20 (The VDSM host was found in a failed
 >>>> state)
 >>>>
 >>>>
 >>>> On 03/09/2015 02:47 PM, Bob Doolittle wrote:
 >>>>> Resending with CC to list (and an update).
 >>>>>
 >>>>> On 03/09/2015 01:40 PM, Simone Tiraboschi wrote:
 >>>>>> ----- Original Message -----
 >>>>>>> From: "Bob Doolittle" <bob(a)doolittle.us.com>
 >>>>>>> To: "Simone Tiraboschi" <stirabos(a)redhat.com>
 >>>>>>> Cc: "users-ovirt" <users(a)ovirt.org>
 >>>>>>> Sent: Monday, March 9, 2015 6:26:30 PM
 >>>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1
 >>>>>>> on F20 (Cannot add the host to cluster ... SSH
 >>>>>>> has failed)
 >>>>>>>
 >> ...
 >>>>>>> OK, I've started over. Simply removing the storage domain was
 >>>>>>> insufficient, the hosted-engine deploy failed when it found the HA and
 >>>>>>> Broker services already configured. I decided to just start over fresh
 >>>>>>> starting with re-installing the OS on my host.
 >>>>>>>
 >>>>>>> I can't deploy DNS at the moment, so I have to simply replicate
 >>>>>>> /etc/hosts files on my host/engine. I did that this time, but have run
 >>>>>>> into a new problem:
 >>>>>>>
 >>>>>>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
 >>>>>>>           Enter the name of the cluster to which you want to add the host
 >>>>>>>           (Default) [Default]:
 >>>>>>> [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
 >>>>>>> [ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
 >>>>>>> [ ERROR ] Unable to add ovirt-vm to the manager
 >>>>>>>           Please shutdown the VM allowing the system to launch it as a
 >>>>>>>           monitored service.
 >>>>>>>           The system will wait until the VM is down.
 >>>>>>> [ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
 >>>>>>> [ INFO  ] Stage: Clean up
 >>>>>>> [ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
 >>>>>>>
 >>>>>>>
 >>>>>>> I've attached my engine log and the ovirt-hosted-engine-setup log. I
 >>>>>>> think I had an issue with resolving external hostnames, or else a
 >>>>>>> connectivity issue during the install.
 >>>>>> For some reason your engine wasn't able to deploy your hosts but the SSH
 >>>>>> session this time was established.
 >>>>>> 2015-03-09 13:05:58,514 ERROR
 >>>>>> [org.ovirt.engine.core.bll.InstallVdsInternalCommand]
 >>>>>> (org.ovirt.thread.pool-8-thread-3) [3cf91626] Host installation failed
 >>>>>> for host 217016bb-fdcd-4344-a0ca-4548262d10a8, ovirt-vm.:
 >>>>>> java.io.IOException: Command returned failure code 1 during SSH session
 >>>>>> 'root(a)xion2.smartcity.net'
 >>>>>>
 >>>>>> Can you please attach host-deploy logs from the engine VM?
 >>>>> OK, attached.
 >>>>>
 >>>>> Like I said, it looks to me like a name-resolution issue during the yum
 >>>>> update on the engine. I think I've fixed that, but do you have a better
 >>>>> suggestion for cleaning up and re-deploying other than installing the OS
 >>>>> on my host and starting all over again?
 >>>> I just finished starting over from scratch, starting with OS installation
 >>>> on my host/node, and wound up with a very similar problem - the engine
 >>>> couldn't reach the hosts during the yum operation. But this time the error
 >>>> was "Network is unreachable". Which is weird, because I can ssh into the
 >>>> engine and ping many of those hosts, after the operation has failed.
 >>>>
 >>>> Here's my latest host-deploy log from the engine. I'd appreciate any
 >>>> clues.
 >>> It seems that your host is now able to resolve those addresses but is
 >>> not able to connect over HTTP.
 >>> On your hosts some of them resolve to IPv6 addresses; can you please try
 >>> to use curl to fetch one of the files that it wasn't able to get?
 >>> Can you please check your network configuration before and after
 >>> host-deploy?
 >> I can give you the network configuration after host-deploy, at least for
 >> the host/Node. The engine won't start for me this morning, after I shut
 >> down the host for the night.
 >>
 >>
 >> In order to give you the config before host-deploy (or, apparently for the
 >> engine), I'll have to re-install the OS on the host and start again from
 >> scratch. Obviously I'd rather not do that unless absolutely necessary.
 >>
 >> Here's the host config after the failed host-deploy:
 >>
 >> Host/Node:
 >>
 >> # ip route
 >> 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1007
 >> 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
 > You are missing a default gateway, hence the issue.
 > Are you sure that it was properly configured before trying to deploy that
 > host?
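
A minimal sketch of how the missing default gateway could be detected from
"ip route" output. The function name has_default_route is hypothetical and is
not part of oVirt or iproute2; it only looks for a line starting with
"default", as in a routing table that has a gateway configured.

```shell
# Hypothetical helper: reads "ip route" output on stdin and succeeds
# only if at least one line is a default route entry.
has_default_route() {
    grep -Eq '^default[[:space:]]'
}

# Example usage (commented out so that sourcing this file has no side effects):
#   ip route | has_default_route || echo "no default gateway configured"
```

Running the check both before and after host-deploy would show whether the
default route was lost during deployment.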

 It should have been, it was a fresh OS install. So I'm starting again, and
 keeping careful records of my network config.

 Here is my initial network config of my host/node, immediately following a
 new OS install:

 % ip route
 default via 172.16.0.1 dev p3p1  proto static  metric 1024
 172.16.0.0/16 dev p3p1  proto kernel  scope link  src 172.16.0.58

 % ip addr
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
     inet 172.16.0.58/16 brd 172.16.255.255 scope global p3p1
        valid_lft forever preferred_lft forever
     inet6 fe80::baca:3aff:fe79:2212/64 scope link
        valid_lft forever preferred_lft forever
 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff

 After the VM is first created, the host/node config is:

 # ip route
 default via 172.16.0.1 dev ovirtmgmt
 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1006
 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58

 # ip addr
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::baca:3aff:fe79:2212/64 scope link
        valid_lft forever preferred_lft forever
 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
 4: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
     link/ether 92:cb:9d:97:18:36 brd ff:ff:ff:ff:ff:ff
 5: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
     link/ether 9a:bc:29:52:82:38 brd ff:ff:ff:ff:ff:ff
 6: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
        valid_lft forever preferred_lft forever
     inet6 fe80::baca:3aff:fe79:2212/64 scope link
        valid_lft forever preferred_lft forever
 7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN group default qlen 500
     link/ether fe:16:3e:16:a4:37 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::fc16:3eff:fe16:a437/64 scope link
        valid_lft forever preferred_lft forever

 At this point, I was already seeing a problem on the host/node. I remembered
 that a newer version of sos package is delivered from the ovirt
 repositories. So I tried to do a "yum update" on my host, and got a similar
 problem:

 % sudo yum update
 [sudo] password for rad:
 Loaded plugins: langpacks, refresh-packagekit
 Resolving Dependencies
 --> Running transaction check
 ---> Package sos.noarch 0:3.1-1.fc20 will be updated
 ---> Package sos.noarch 0:3.2-0.2.fc20.ovirt will be an update
 --> Finished Dependency Resolution

 Dependencies Resolved

 ================================================================================
  Package      Arch        Version                   Repository           Size
 ================================================================================
 Updating:
  sos          noarch      3.2-0.2.fc20.ovirt        ovirt-3.5           292 k

 Transaction Summary
 ================================================================================
 Upgrade  1 Package

 Total download size: 292 k
 Is this ok [y/d/N]: y
 Downloading packages:
 No Presto metadata available for ovirt-3.5
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

http://www.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3...:
 [Errno 14] curl#6 - "Could not resolve host: www.gtlib.gatech.edu"
 Trying other mirror.
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

ftp://ftp.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3....:
 [Errno 14] curl#6 - "Could not resolve host: ftp.gtlib.gatech.edu"
 Trying other mirror.
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

http://resources.ovirt.org/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20...:
 [Errno 14] curl#6 - "Could not resolve host: resources.ovirt.org"
 Trying other mirror.
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

http://ftp.snt.utwente.nl/pub/software/ovirt/ovirt-3.5/rpm/fc20/noarch/so...:
 [Errno 14] curl#6 - "Could not resolve host: ftp.snt.utwente.nl"
 Trying other mirror.
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

http://ftp.nluug.nl/os/Linux/virtual/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-...:
 [Errno 14] curl#6 - "Could not resolve host: ftp.nluug.nl"
 Trying other mirror.
 sos-3.2-0.2.fc20.ovirt.noarch. FAILED

http://mirror.linux.duke.edu/ovirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-...:
 [Errno 14] curl#6 - "Could not resolve host: mirror.linux.duke.edu"
 Trying other mirror.

 Error downloading packages:
   sos-3.2-0.2.fc20.ovirt.noarch: [Errno 256] No more mirrors to try.

 This was similar to my previous failures. I took a look, and the problem was
 that /etc/resolv.conf had no nameservers, and the
 /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt file contained no entries for
 DNS1 or DOMAIN.

 So, it appears that when hosted-engine set up my bridged network, it
 neglected to carry over the DNS configuration necessary to the bridge. 
Unfortunately you have found a known bug:
VDSM doesn't report static DNS entries (DNS1 from /etc/sysconfig/network-scripts/ifcfg-ethX),
so they are lost simply by deploying the host:
https://bugzilla.redhat.com/show_bug.cgi?id=1160667
https://bugzilla.redhat.com/show_bug.cgi?id=1160423
We are going to fix it for 3.6; thanks for reporting.
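
To check whether a host has been hit by this, something along these lines
should work. This is only a sketch: the function names has_nameserver and
has_static_dns are hypothetical helpers, not oVirt tools, and the file paths in
the usage comments are the ones mentioned in this thread.

```shell
# Hypothetical helper: succeeds if the given resolv.conf-style file
# contains at least one "nameserver <address>" line.
has_nameserver() {
    grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]+' "$1"
}

# Hypothetical helper: succeeds if the given ifcfg-style file
# defines a non-empty DNS1 entry.
has_static_dns() {
    grep -Eq '^[[:space:]]*DNS1="?[^"[:space:]]' "$1"
}

# Example usage (commented out so that sourcing this file has no side effects):
#   has_nameserver /etc/resolv.conf || echo "resolv.conf has no nameserver"
#   has_static_dns /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt \
#       || echo "ifcfg-ovirtmgmt has no DNS1"
```

If both checks fail after the bridge has been created, the static DNS settings
were most likely dropped as described in the bugs above.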

...
 Note that I am using *static* network configuration, rather than DHCP. During
 installation of the OS I am setting up the network configuration as Manual.
 Perhaps the hosted-engine script is not properly prepared to deal with that?

 I went ahead and modified the ifcfg-ovirtmgmt network script (for the next
 service restart/boot) and resolv.conf (I was afraid to restart the network
 in the middle of hosted-engine execution since I don't know what might
 already be connected to the engine). This time it got further, but
 ultimately it still failed at the very end: 
Manually fixing /etc/resolv.conf is a valid workaround.
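
A minimal sketch of that workaround, assuming you know the address of your
name server. add_nameserver is a hypothetical helper, not something shipped
with hosted-engine, and 192.0.2.1 in the usage comment is only a placeholder
address.

```shell
# Hypothetical helper: appends "nameserver <addr>" to a resolv.conf-style
# file unless that exact line is already present.
add_nameserver() {
    file=$1
    addr=$2
    if ! grep -qxF "nameserver $addr" "$file"; then
        printf 'nameserver %s\n' "$addr" >> "$file"
    fi
}

# Example usage as root (commented out so that sourcing this file
# does not modify the system):
#   add_nameserver /etc/resolv.conf 192.0.2.1
```

The helper is idempotent, so running it again does not add duplicate entries.
To keep the setting across reboots, DNS1 still has to be added to the
ifcfg-ovirtmgmt file by hand.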

...
 [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
 [ INFO  ] Still waiting for VDSM host to become operational...
 [ INFO  ] The VDSM Host is now operational
           Please shutdown the VM allowing the system to launch it as a
           monitored service.
           The system will wait until the VM is down.
 [ ERROR ] Failed to execute stage 'Closing up': Error acquiring VM status
 [ INFO  ] Stage: Clean up
 [ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150310140028.conf'
 [ INFO  ] Stage: Pre-termination
 [ INFO  ] Stage: Termination

 At that point, neither the ovirt-ha-broker nor the ovirt-ha-agent service was
 running.

 Note there was no significant pause after it said "The system will wait until
 the VM is down".

 After the script completed, I shut down the VM, and manually started the ha
 services, and the VM came up. I could login to the Administration Portal,
 and finally see my HostedEngine VM. :-)

 I seem to be in a bad state however: The Data Center has *no* storage domains
 attached. I'm not sure what else might need cleaning up. Any assistance
 appreciated. 
No, that's correct: the hosted-engine storage domain is a special one and is currently not
reported by the engine because you cannot use it for other VMs.
Simply add another storage domain and you are done.

...
 -Bob

 >> # ip addr
 >> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
 >>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 >>     inet 127.0.0.1/8 scope host lo
 >>        valid_lft forever preferred_lft forever
 >>     inet6 ::1/128 scope host
 >>        valid_lft forever preferred_lft forever
 >> 2: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
 >>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
 >>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
 >>        valid_lft forever preferred_lft forever
 >> 3: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
 >>     link/ether 56:56:f7:cf:73:27 brd ff:ff:ff:ff:ff:ff
 >> 4: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
 >>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
 >> 6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
 >>     link/ether 22:a1:01:9e:30:71 brd ff:ff:ff:ff:ff:ff
 >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
 >>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
 >>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
 >>        valid_lft forever preferred_lft forever
 >>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
 >>        valid_lft forever preferred_lft forever
 >>
 >>
 >> The only unusual thing about my setup that I can think of, from the
 >> network
 >> perspective, is that my physical host has a wireless interface, which I've
 >> not configured. Could it be confusing hosted-engine --deploy?
 >>
 >> -Bob
 >>
 >>
