[ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)

Simone Tiraboschi stirabos at redhat.com
Wed Mar 11 05:57:15 EDT 2015



----- Original Message -----
> From: "Bob Doolittle" <bob at doolittle.us.com>
> To: "Simone Tiraboschi" <stirabos at redhat.com>
> Cc: "users-ovirt" <users at ovirt.org>
> Sent: Tuesday, March 10, 2015 7:29:44 PM
> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed
> state)
> 
> 
> On 03/10/2015 10:20 AM, Simone Tiraboschi wrote:
> >
> > ----- Original Message -----
> >> From: "Bob Doolittle" <bob at doolittle.us.com>
> >> To: "Simone Tiraboschi" <stirabos at redhat.com>
> >> Cc: "users-ovirt" <users at ovirt.org>
> >> Sent: Tuesday, March 10, 2015 2:40:13 PM
> >> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on
> >> F20 (The VDSM host was found in a failed
> >> state)
> >>
> >>
> >> On 03/10/2015 04:58 AM, Simone Tiraboschi wrote:
> >>> ----- Original Message -----
> >>>> From: "Bob Doolittle" <bob at doolittle.us.com>
> >>>> To: "Simone Tiraboschi" <stirabos at redhat.com>
> >>>> Cc: "users-ovirt" <users at ovirt.org>
> >>>> Sent: Monday, March 9, 2015 11:48:03 PM
> >>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on
> >>>> F20 (The VDSM host was found in a failed
> >>>> state)
> >>>>
> >>>>
> >>>> On 03/09/2015 02:47 PM, Bob Doolittle wrote:
> >>>>> Resending with CC to list (and an update).
> >>>>>
> >>>>> On 03/09/2015 01:40 PM, Simone Tiraboschi wrote:
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Bob Doolittle" <bob at doolittle.us.com>
> >>>>>>> To: "Simone Tiraboschi" <stirabos at redhat.com>
> >>>>>>> Cc: "users-ovirt" <users at ovirt.org>
> >>>>>>> Sent: Monday, March 9, 2015 6:26:30 PM
> >>>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1
> >>>>>>> on
> >>>>>>> F20 (Cannot add the host to cluster ... SSH
> >>>>>>> has failed)
> >>>>>>>
> >> ...
> >>>>>>> OK, I've started over. Simply removing the storage domain was
> >>>>>>> insufficient,
> >>>>>>> the hosted-engine deploy failed when it found the HA and Broker
> >>>>>>> services
> >>>>>>> already configured. I decided to just start over fresh starting with
> >>>>>>> re-installing the OS on my host.
> >>>>>>>
> >>>>>>> I can't deploy DNS at the moment, so I have to simply replicate
> >>>>>>> /etc/hosts
> >>>>>>> files on my host/engine. I did that this time, but have run into a
> >>>>>>> new
> >>>>>>> problem:
> >>>>>>>
> >>>>>>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
> >>>>>>>           Enter the name of the cluster to which you want to add the
> >>>>>>>           host
> >>>>>>>           (Default) [Default]:
> >>>>>>> [ INFO  ] Waiting for the host to become operational in the engine.
> >>>>>>> This
> >>>>>>> may
> >>>>>>> take several minutes...
> >>>>>>> [ ERROR ] The VDSM host was found in a failed state. Please check
> >>>>>>> engine
> >>>>>>> and
> >>>>>>> bootstrap installation logs.
> >>>>>>> [ ERROR ] Unable to add ovirt-vm to the manager
> >>>>>>>           Please shutdown the VM allowing the system to launch it as
> >>>>>>>           a
> >>>>>>>           monitored service.
> >>>>>>>           The system will wait until the VM is down.
> >>>>>>> [ ERROR ] Failed to execute stage 'Closing up': [Errno 111]
> >>>>>>> Connection
> >>>>>>> refused
> >>>>>>> [ INFO  ] Stage: Clean up
> >>>>>>> [ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection
> >>>>>>> refused
> >>>>>>>
> >>>>>>>
> >>>>>>> I've attached my engine log and the ovirt-hosted-engine-setup log. I
> >>>>>>> think I
> >>>>>>> had an issue with resolving external hostnames, or else a
> >>>>>>> connectivity
> >>>>>>> issue
> >>>>>>> during the install.
> >>>>>> For some reason your engine wasn't able to deploy your hosts but the
> >>>>>> SSH
> >>>>>> session this time was established.
> >>>>>> 2015-03-09 13:05:58,514 ERROR
> >>>>>> [org.ovirt.engine.core.bll.InstallVdsInternalCommand]
> >>>>>> (org.ovirt.thread.pool-8-thread-3) [3cf91626] Host installation failed
> >>>>>> for host 217016bb-fdcd-4344-a0ca-4548262d10a8, ovirt-vm.:
> >>>>>> java.io.IOException: Command returned failure code 1 during SSH
> >>>>>> session
> >>>>>> 'root at xion2.smartcity.net'
> >>>>>>
> >>>>>> Can you please attach host-deploy logs from the engine VM?
> >>>>> OK, attached.
> >>>>>
> >>>>> Like I said, it looks to me like a name-resolution issue during the yum
> >>>>> update on the engine. I think I've fixed that, but do you have a better
> >>>>> suggestion for cleaning up and re-deploying other than installing the
> >>>>> OS
> >>>>> on my host and starting all over again?
> >>>> I just finished starting over from scratch, starting with OS
> >>>> installation
> >>>> on
> >>>> my host/node, and wound up with a very similar problem - the engine
> >>>> couldn't
> >>>> reach the hosts during the yum operation. But this time the error was
> >>>> "Network is unreachable". Which is weird, because I can ssh into the
> >>>> engine
> >>>> and ping many of those hosts, after the operation has failed.
> >>>>
> >>>> Here's my latest host-deploy log from the engine. I'd appreciate any
> >>>> clues.
> >>> It seems that now your host is able to resolve those addresses, but it's
> >>> not able to connect over HTTP.
> >>> On your host some of them resolve to IPv6 addresses; can you please try
> >>> to use curl to fetch one of the files that it couldn't download?
> >>> Can you please check your network configuration before and after
> >>> host-deploy?
> >> I can give you the network configuration after host-deploy, at least for
> >> the
> >> host/Node. The engine won't start for me this morning, after I shut down
> >> the
> >> host for the night.
> >>
> >> In order to give you the config before host-deploy (or, apparently for the
> >> engine), I'll have to re-install the OS on the host and start again from
> >> scratch. Obviously I'd rather not do that unless absolutely necessary.
> >>
> >> Here's the host config after the failed host-deploy:
> >>
> >> Host/Node:
> >>
> >> # ip route
> >> 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1007
> >> 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
> > You are missing a default gateway, and that is the issue.
> > Are you sure that it was properly configured before trying to deploy that
> > host?
> 
> It should have been, it was a fresh OS install. So I'm starting again, and
> keeping careful records of my network config.
> 
> Here is my initial network config of my host/node, immediately following a
> new OS install:
> 
> % ip route
> default via 172.16.0.1 dev p3p1  proto static  metric 1024
> 172.16.0.0/16 dev p3p1  proto kernel  scope link  src 172.16.0.58
> 
> % ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
> default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
> group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global p3p1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
> group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
> 
> 
> After the VM is first created, the host/node config is:
> 
> # ip route
> default via 172.16.0.1 dev ovirtmgmt
> 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1006
> 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
> 
> # ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
> default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master
> ovirtmgmt state UP group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
> group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
> 4: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
> state DOWN group default
>     link/ether 92:cb:9d:97:18:36 brd ff:ff:ff:ff:ff:ff
> 5: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> default
>     link/ether 9a:bc:29:52:82:38 brd ff:ff:ff:ff:ff:ff
> 6: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state
> UP group default
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master
> ovirtmgmt state UNKNOWN group default qlen 500
>     link/ether fe:16:3e:16:a4:37 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::fc16:3eff:fe16:a437/64 scope link
>        valid_lft forever preferred_lft forever
> 
> 
> At this point, I was already seeing a problem on the host/node. I remembered
> that a newer version of sos package is delivered from the ovirt
> repositories. So I tried to do a "yum update" on my host, and got a similar
> problem:
> 
> % sudo yum update
> [sudo] password for rad:
> Loaded plugins: langpacks, refresh-packagekit
> Resolving Dependencies
> --> Running transaction check
> ---> Package sos.noarch 0:3.1-1.fc20 will be updated
> ---> Package sos.noarch 0:3.2-0.2.fc20.ovirt will be an update
> --> Finished Dependency Resolution
> 
> Dependencies Resolved
> 
> ================================================================================================================
>  Package             Arch                   Version
>  Repository                 Size
> ================================================================================================================
> Updating:
>  sos                 noarch                 3.2-0.2.fc20.ovirt
>  ovirt-3.5                 292 k
> 
> Transaction Summary
> ================================================================================================================
> Upgrade  1 Package
> 
> Total download size: 292 k
> Is this ok [y/d/N]: y
> Downloading packages:
> No Presto metadata available for ovirt-3.5
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://www.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: www.gtlib.gatech.edu"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> ftp://ftp.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: ftp.gtlib.gatech.edu"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://resources.ovirt.org/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: resources.ovirt.org"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://ftp.snt.utwente.nl/pub/software/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: ftp.snt.utwente.nl"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://ftp.nluug.nl/os/Linux/virtual/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: ftp.nluug.nl"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://mirror.linux.duke.edu/ovirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm:
> [Errno 14] curl#6 - "Could not resolve host: mirror.linux.duke.edu"
> Trying other mirror.
> 
> 
> Error downloading packages:
>   sos-3.2-0.2.fc20.ovirt.noarch: [Errno 256] No more mirrors to try.
> 
> 
> This was similar to my previous failures. I took a look, and the problem was
> that /etc/resolv.conf had no nameservers, and the
> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt file contained no entries for
> DNS1 or DOMAIN.
> 
> So, it appears that when hosted-engine set up my bridged network, it
> neglected to carry over the DNS configuration necessary to the bridge.

Unfortunately you found a known bug:
VDSM doesn't report static DNS settings (DNS1 from /etc/sysconfig/network-scripts/ifcfg-ethX), so we lose them simply by deploying the host:
https://bugzilla.redhat.com/show_bug.cgi?id=1160667
https://bugzilla.redhat.com/show_bug.cgi?id=1160423
We are going to fix it for 3.6; thanks for reporting.
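
Until that is fixed, a workaround is to put the DNS entries back into the
bridge's ifcfg file so they survive the next network restart or reboot.
A minimal sketch, assuming your gateway at 172.16.0.1 is also your
nameserver and smartcity.net is your search domain (adjust both to your
real values):

  # appended to /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
  DNS1=172.16.0.1
  DOMAIN=smartcity.net

With a static configuration these keys are written into /etc/resolv.conf
the next time the interface comes up.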
 
> Note that I am using *static* network configuration, rather than DHCP. During
> installation of the OS I am setting up the network configuration as Manual.
> Perhaps the hosted-engine script is not properly prepared to deal with that?
> 
> I went ahead and modified the ifcfg-ovirtmgmt network script (for the next
> service restart/boot) and resolv.conf (I was afraid to restart the network
> in the middle of hosted-engine execution since I don't know what might
> already be connected to the engine). This time it got further, but
> ultimately it still failed at the very end:

Manually fixing /etc/resolv.conf is a valid workaround.
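For reference, a resolv.conf along these lines is enough (again, the
nameserver and search domain below are only examples to replace with
your own):

  search smartcity.net
  nameserver 172.16.0.1

Just keep in mind that anything that rewrites resolv.conf (a network
restart, NetworkManager, dhclient) can overwrite it, which is why fixing
the ifcfg file as well is worthwhile.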

> [ INFO  ] Waiting for the host to become operational in the engine. This may
> take several minutes...
> [ INFO  ] Still waiting for VDSM host to become operational...
> [ INFO  ] The VDSM Host is now operational
>           Please shutdown the VM allowing the system to launch it as a
>           monitored service.
>           The system will wait until the VM is down.
> [ ERROR ] Failed to execute stage 'Closing up': Error acquiring VM status
> [ INFO  ] Stage: Clean up
> [ INFO  ] Generating answer file
> '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150310140028.conf'
> [ INFO  ] Stage: Pre-termination
> [ INFO  ] Stage: Termination
> 
> 
> At that point, neither the ovirt-ha-broker or ovirt-ha-agent services were
> running.
> 
> Note there was no significant pause after it said "The system will wait until
> the VM is down".
> 
> After the script completed, I shut down the VM, and manually started the ha
> services, and the VM came up. I could login to the Administration Portal,
> and finally see my HostedEngine VM. :-)
> 
> I seem to be in a bad state however: The Data Center has *no* storage domains
> attached. I'm not sure what else might need cleaning up. Any assistance
> appreciated.

No, that's expected: the hosted-engine storage domain is a special one and is currently not reported by the engine because you cannot use it for other VMs.
Simply add another storage domain and you are done.
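
If you'd rather script it than use the Admin Portal (Storage tab -> New
Domain), a rough sketch with the REST API would look something like the
following; the engine FQDN, password, NFS export and host name are all
placeholders to replace with your own:

  curl -k -u 'admin@internal:password' \
       -H 'Content-Type: application/xml' \
       -d '<storage_domain>
             <name>data1</name>
             <type>data</type>
             <storage>
               <type>nfs</type>
               <address>nfs.example.com</address>
               <path>/exports/data</path>
             </storage>
             <host><name>hosted_engine_1</name></host>
           </storage_domain>' \
       https://engine.example.com/ovirt-engine/api/storagedomains

After that you still have to attach the new domain to the Default data
center (the Admin Portal does both steps for you); once the first data
domain is up, the data center will activate.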

> -Bob
> 
> 
> >> # ip addr
> >> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
> >> default
> >>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>     inet 127.0.0.1/8 scope host lo
> >>        valid_lft forever preferred_lft forever
> >>     inet6 ::1/128 scope host
> >>        valid_lft forever preferred_lft forever
> >> 2: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
> >> master
> >> ovirtmgmt state UP group default qlen 1000
> >>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
> >>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
> >>        valid_lft forever preferred_lft forever
> >> 3: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc
> >> noqueue
> >> state DOWN group default
> >>     link/ether 56:56:f7:cf:73:27 brd ff:ff:ff:ff:ff:ff
> >> 4: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
> >> DOWN
> >> group default qlen 1000
> >>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
> >> 6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> >> default
> >>     link/ether 22:a1:01:9e:30:71 brd ff:ff:ff:ff:ff:ff
> >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
> >> state
> >> UP group default
> >>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
> >>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
> >>        valid_lft forever preferred_lft forever
> >>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
> >>        valid_lft forever preferred_lft forever
> >>
> >>
> >> The only unusual thing about my setup that I can think of, from the
> >> network
> >> perspective, is that my physical host has a wireless interface, which I've
> >> not configured. Could it be confusing hosted-engine --deploy?
> >>
> >> -Bob
> >>
> >>
> 
> 

