[ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)

Wed Mar 11 11:37:38 EDT 2015

For the record, once I added a new storage domain, the Data Center came up.

So in the end, this seems to have been due to known bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1160667
https://bugzilla.redhat.com/show_bug.cgi?id=1160423

Effectively, for hosts with static/manual IP addressing (i.e. not DHCP), the DNS and default route information are not set up correctly by hosted-engine-setup. I'm not sure why that's not considered a higher priority bug (e.g. blocker for 3.5.2?) since I believe the most typical configuration for servers is static IP addressing.
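
For anyone who wants to catch the missing default route before or after running the setup, here is a minimal Python sketch. It is only an illustration, not part of hosted-engine-setup: the function name is my own, and it assumes the usual iproute2 behaviour of printing the default route as a line beginning with "default". The two sample tables are the `ip route` outputs quoted further down in this thread.

```python
def has_default_route(ip_route_output: str) -> bool:
    """Return True if any line of `ip route` output describes a default route."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        # Assumption: iproute2 prints the default route as "default via ...".
        if fields and fields[0] == "default":
            return True
    return False


# Routing table quoted in this thread after the failed host-deploy.
after_failed_deploy = """\
169.254.0.0/16 dev ovirtmgmt  scope link  metric 1007
172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
"""

# Routing table quoted in this thread from the fresh OS install.
fresh_install = """\
default via 172.16.0.1 dev p3p1  proto static  metric 1024
172.16.0.0/16 dev p3p1  proto kernel  scope link  src 172.16.0.58
"""

print(has_default_route(after_failed_deploy))  # False
print(has_default_route(fresh_install))        # True
```

On a live host the same function could be fed the text returned by `subprocess.check_output(["ip", "route"], text=True)`.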

All seems to be working now. Many thanks to Simone for the invaluable assistance.

-Bob

On Mar 10, 2015 2:29 PM, "Bob Doolittle" <bob at doolittle.us.com> wrote:
>
>
> On 03/10/2015 10:20 AM, Simone Tiraboschi wrote:
>>
>>
>> ----- Original Message -----
>>>
>>> From: "Bob Doolittle" <bob at doolittle.us.com>
>>> To: "Simone Tiraboschi" <stirabos at redhat.com>
>>> Cc: "users-ovirt" <users at ovirt.org>
>>> Sent: Tuesday, March 10, 2015 2:40:13 PM
>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)
>>>
>>>
>>> On 03/10/2015 04:58 AM, Simone Tiraboschi wrote:
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>> From: "Bob Doolittle" <bob at doolittle.us.com>
>>>>> To: "Simone Tiraboschi" <stirabos at redhat.com>
>>>>> Cc: "users-ovirt" <users at ovirt.org>
>>>>> Sent: Monday, March 9, 2015 11:48:03 PM
>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (The VDSM host was found in a failed state)
>>>>>
>>>>>
>>>>> On 03/09/2015 02:47 PM, Bob Doolittle wrote:
>>>>>>
>>>>>> Resending with CC to list (and an update).
>>>>>>
>>>>>> On 03/09/2015 01:40 PM, Simone Tiraboschi wrote:
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>>
>>>>>>>> From: "Bob Doolittle" <bob at doolittle.us.com>
>>>>>>>> To: "Simone Tiraboschi" <stirabos at redhat.com>
>>>>>>>> Cc: "users-ovirt" <users at ovirt.org>
>>>>>>>> Sent: Monday, March 9, 2015 6:26:30 PM
>>>>>>>> Subject: Re: [ovirt-users] Error during hosted-engine-setup for 3.5.1 on F20 (Cannot add the host to cluster ... SSH has failed)
>>>>>>>>
>>> ...
>>>>>>>>
>>>>>>>> OK, I've started over. Simply removing the storage domain was
>>>>>>>> insufficient: the hosted-engine deploy failed when it found the HA
>>>>>>>> and Broker services already configured. I decided to start over
>>>>>>>> fresh, beginning with re-installing the OS on my host.
>>>>>>>>
>>>>>>>> I can't deploy DNS at the moment, so I have to replicate the
>>>>>>>> /etc/hosts files on my host and engine. I did that this time, but
>>>>>>>> have run into a new problem:
>>>>>>>>
>>>>>>>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
>>>>>>>>           Enter the name of the cluster to which you want to add the host (Default) [Default]:
>>>>>>>> [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
>>>>>>>> [ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
>>>>>>>> [ ERROR ] Unable to add ovirt-vm to the manager
>>>>>>>>           Please shutdown the VM allowing the system to launch it as a monitored service.
>>>>>>>>           The system will wait until the VM is down.
>>>>>>>> [ ERROR ] Failed to execute stage 'Closing up': [Errno 111] Connection refused
>>>>>>>> [ INFO  ] Stage: Clean up
>>>>>>>> [ ERROR ] Failed to execute stage 'Clean up': [Errno 111] Connection refused
>>>>>>>>
>>>>>>>>
>>>>>>>> I've attached my engine log and the ovirt-hosted-engine-setup log. I
>>>>>>>> think I had an issue with resolving external hostnames, or else a
>>>>>>>> connectivity issue during the install.
>>>>>>>
>>>>>>> For some reason your engine wasn't able to deploy your hosts but the SSH
>>>>>>> session this time was established.
>>>>>>> 2015-03-09 13:05:58,514 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-8-thread-3) [3cf91626] Host installation failed for host 217016bb-fdcd-4344-a0ca-4548262d10a8, ovirt-vm.: java.io.IOException: Command returned failure code 1 during SSH session 'root at xion2.smartcity.net'
>>>>>>>
>>>>>>> Can you please attach host-deploy logs from the engine VM?
>>>>>>
>>>>>> OK, attached.
>>>>>>
>>>>>> Like I said, it looks to me like a name-resolution issue during the yum
>>>>>> update on the engine. I think I've fixed that, but do you have a better
>>>>>> suggestion for cleaning up and re-deploying other than installing the OS
>>>>>> on my host and starting all over again?
>>>>>
>>>>> I just finished starting over from scratch, beginning with OS
>>>>> installation on my host/node, and wound up with a very similar problem:
>>>>> the engine couldn't reach the hosts during the yum operation. This time
>>>>> the error was "Network is unreachable", which is weird, because I can
>>>>> ssh into the engine and ping many of those hosts after the operation
>>>>> has failed.
>>>>>
>>>>> Here's my latest host-deploy log from the engine. I'd appreciate any
>>>>> clues.
>>>>
>>>> It seems that your host is now able to resolve those addresses, but it's
>>>> not able to connect over HTTP.
>>>> On your host some of them resolve to IPv6 addresses; can you please try
>>>> to use curl to fetch one of the files that it wasn't able to get?
>>>> Can you please check your network configuration before and after
>>>> host-deploy?
>>>
>>> I can give you the network configuration after host-deploy, at least for the
>>> host/Node. The engine won't start for me this morning, after I shut down the
>>> host for the night.
>>>
>>> In order to give you the config before host-deploy (or, apparently for the
>>> engine), I'll have to re-install the OS on the host and start again from
>>> scratch. Obviously I'd rather not do that unless absolutely necessary.
>>>
>>> Here's the host config after the failed host-deploy:
>>>
>>> Host/Node:
>>>
>>> # ip route
>>> 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1007
>>> 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
>>
>> You are missing a default gateway, and that is the cause of the issue.
>> Are you sure that it was properly configured before trying to deploy that host?
>
>
> It should have been; it was a fresh OS install. So I'm starting again, and keeping careful records of my network config.
>
> Here is my initial network config of my host/node, immediately following a new OS install:
>
> % ip route
> default via 172.16.0.1 dev p3p1  proto static  metric 1024
> 172.16.0.0/16 dev p3p1  proto kernel  scope link  src 172.16.0.58
>
> % ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global p3p1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
>
>
> After the VM is first created, the host/node config is:
>
> # ip route
> default via 172.16.0.1 dev ovirtmgmt
> 169.254.0.0/16 dev ovirtmgmt  scope link  metric 1006
> 172.16.0.0/16 dev ovirtmgmt  proto kernel  scope link  src 172.16.0.58
>
> # ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 3: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
> 4: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
>     link/ether 92:cb:9d:97:18:36 brd ff:ff:ff:ff:ff:ff
> 5: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
>     link/ether 9a:bc:29:52:82:38 brd ff:ff:ff:ff:ff:ff
> 6: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
>        valid_lft forever preferred_lft forever
>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>        valid_lft forever preferred_lft forever
> 7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN group default qlen 500
>     link/ether fe:16:3e:16:a4:37 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::fc16:3eff:fe16:a437/64 scope link
>        valid_lft forever preferred_lft forever
>
>
> At this point, I was already seeing a problem on the host/node. I remembered that a newer version of the sos package is delivered from the oVirt repositories, so I tried to do a "yum update" on my host and got a similar problem:
>
> % sudo yum update
> [sudo] password for rad:
> Loaded plugins: langpacks, refresh-packagekit
> Resolving Dependencies
> --> Running transaction check
> ---> Package sos.noarch 0:3.1-1.fc20 will be updated
> ---> Package sos.noarch 0:3.2-0.2.fc20.ovirt will be an update
> --> Finished Dependency Resolution
>
> Dependencies Resolved
>
> ================================================================================================================
>  Package             Arch                   Version                             Repository                 Size
> ================================================================================================================
> Updating:
>  sos                 noarch                 3.2-0.2.fc20.ovirt                  ovirt-3.5                 292 k
>
> Transaction Summary
> ================================================================================================================
> Upgrade  1 Package
>
> Total download size: 292 k
> Is this ok [y/d/N]: y
> Downloading packages:
> No Presto metadata available for ovirt-3.5
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://www.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: www.gtlib.gatech.edu"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> ftp://ftp.gtlib.gatech.edu/pub/oVirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: ftp.gtlib.gatech.edu"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://resources.ovirt.org/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: resources.ovirt.org"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://ftp.snt.utwente.nl/pub/software/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: ftp.snt.utwente.nl"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://ftp.nluug.nl/os/Linux/virtual/ovirt/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: ftp.nluug.nl"
> Trying other mirror.
> sos-3.2-0.2.fc20.ovirt.noarch. FAILED
> http://mirror.linux.duke.edu/ovirt/pub/ovirt-3.5/rpm/fc20/noarch/sos-3.2-0.2.fc20.ovirt.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: mirror.linux.duke.edu"
> Trying other mirror.
>
>
> Error downloading packages:
>   sos-3.2-0.2.fc20.ovirt.noarch: [Errno 256] No more mirrors to try.
>
>
> This was similar to my previous failures. I took a look, and the problem was that /etc/resolv.conf had no nameservers, and the /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt file contained no entries for DNS1 or DOMAIN.
>
> So, it appears that when hosted-engine set up my bridged network, it neglected to carry over the necessary DNS configuration to the bridge.
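
A note inline for anyone hitting the same symptom: the check I did by hand can be sketched in Python roughly as below. This is an illustration under stated assumptions, not anything the setup tool provides. The function names are my own, the sample file contents are hypothetical (only the 172.16.0.58 address comes from this thread), and the parsing covers just simple KEY=value lines and "nameserver" entries.

```python
def parse_ifcfg(text: str) -> dict:
    """Parse simple KEY=value lines from an ifcfg-style file, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def missing_dns_keys(ifcfg_text: str, required=("DNS1", "DOMAIN")) -> list:
    """Return the required keys that are absent or empty in the ifcfg text."""
    settings = parse_ifcfg(ifcfg_text)
    return [key for key in required if not settings.get(key)]


def nameservers(resolv_conf_text: str) -> list:
    """Return the addresses listed on 'nameserver' lines of a resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical contents resembling a bridge config that lost its DNS entries.
sample_ifcfg = """\
DEVICE=ovirtmgmt
TYPE=Bridge
BOOTPROTO=none
IPADDR=172.16.0.58
NETMASK=255.255.0.0
ONBOOT=yes
"""

# Hypothetical resolv.conf with no nameserver lines at all.
sample_resolv = "# Generated file\n"

print(missing_dns_keys(sample_ifcfg))  # ['DNS1', 'DOMAIN']
print(nameservers(sample_resolv))      # []
```

On a real host the two texts would be read from /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt and /etc/resolv.conf.
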
>
> Note that I am using *static* network configuration, rather than DHCP. During installation of the OS I am setting up the network configuration as Manual. Perhaps the hosted-engine script is not properly prepared to deal with that?
>
> I went ahead and modified the ifcfg-ovirtmgmt network script (for the next service restart/boot) and resolv.conf (I was afraid to restart the network in the middle of hosted-engine execution since I don't know what might already be connected to the engine). This time it got further, but ultimately it still failed at the very end:
>
> [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
> [ INFO  ] Still waiting for VDSM host to become operational...
> [ INFO  ] The VDSM Host is now operational
>           Please shutdown the VM allowing the system to launch it as a monitored service.
>           The system will wait until the VM is down.
> [ ERROR ] Failed to execute stage 'Closing up': Error acquiring VM status
> [ INFO  ] Stage: Clean up
> [ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150310140028.conf'
> [ INFO  ] Stage: Pre-termination
> [ INFO  ] Stage: Termination
>
>
> At that point, neither the ovirt-ha-broker nor the ovirt-ha-agent service was running.
>
> Note there was no significant pause after it said "The system will wait until the VM is down".
>
> After the script completed, I shut down the VM and manually started the HA services, and the VM came up. I could log in to the Administration Portal, and finally see my HostedEngine VM. :-)
>
> I seem to be in a bad state, however: the Data Center has no storage domains attached. I'm not sure what else might need cleaning up. Any assistance appreciated.
>
> -Bob
>
>
>
>>> # ip addr
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>     inet 127.0.0.1/8 scope host lo
>>>        valid_lft forever preferred_lft forever
>>>     inet6 ::1/128 scope host
>>>        valid_lft forever preferred_lft forever
>>> 2: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovirtmgmt state UP group default qlen 1000
>>>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>>>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>>>        valid_lft forever preferred_lft forever
>>> 3: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default
>>>     link/ether 56:56:f7:cf:73:27 brd ff:ff:ff:ff:ff:ff
>>> 4: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>>>     link/ether 1c:3e:84:50:8d:c3 brd ff:ff:ff:ff:ff:ff
>>> 6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
>>>     link/ether 22:a1:01:9e:30:71 brd ff:ff:ff:ff:ff:ff
>>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>>>     link/ether b8:ca:3a:79:22:12 brd ff:ff:ff:ff:ff:ff
>>>     inet 172.16.0.58/16 brd 172.16.255.255 scope global ovirtmgmt
>>>        valid_lft forever preferred_lft forever
>>>     inet6 fe80::baca:3aff:fe79:2212/64 scope link
>>>        valid_lft forever preferred_lft forever
>>>
>>>
>>> The only unusual thing about my setup that I can think of, from the network
>>> perspective, is that my physical host has a wireless interface, which I've
>>> not configured. Could it be confusing hosted-engine --deploy?
>>>
>>> -Bob
>>>
>>>
>
