Host remains Non-Responsive after reboot

I am running oVirt Engine Version 3.5.0.1-1.el6. I have 4 hosts in the cluster. Each host has a drac5 and it is configured and working. I am trying to simulate a node failure. I am running one HA VM on one of the hosts for testing. I simulate the failure by powering off the host with the VM running.

Here is what is happening.

* Host is powered off
* ~4 minutes pass and the host is recognized as not responding
* Automatic fence runs and the VM migrates. Another host in the node is chosen as a proxy to execute Status command on the host.
* Same host is chosen as proxy to execute Start command on the host.
* Same host is chosen as proxy to execute Status command on the host.
* The host DOES physically start.
* The host never shows status of UP.
* I select “confirm host has been rebooted” and I see a manual fence start.
* Host stays non-responsive.
* I put the host in maintenance and then activate it.
* Host still non-responsive
* I put the host in maintenance and do a reinstall
* Reinstall finishes and host becomes UP

So, everything seems to go fine with the HA functionality, but the host never recovers without being reinstalled. Please let me know which logs you need to look at to help me out with this.

Thanks

Hi Rob,

Thanks for this report. Would you please provide these logs, covering the time frame in which the host failure occurred:

1. oVirt Engine: /var/log/ovirt-engine/engine.log
2. Host: /var/log/vdsm/vdsm.log

If it is reproducible, please add this info as well.

You can also check the vdsm service status on the host while it is reported as Non Responsive, by running 'service vdsmd status' on the host. There might be some problem that prevented the vdsm service from coming up on the host.

Ilanit.

----- Original Message -----
From: "Rob Abshear" <rabshear@citytwist.net>
To: users@ovirt.org
Sent: Friday, January 23, 2015 9:22:42 PM
Subject: [ovirt-users] Host remains Non-Responsive after reboot

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
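The request above for logs "at the time frame" of the failure could be met with something like the following. This is only a minimal sketch: log_window is a hypothetical helper name, and it assumes every relevant line begins with a "YYYY-MM-DD HH:MM:SS" timestamp, which may hold for engine.log but not necessarily for vdsm.log. Continuation lines without a leading timestamp (for example stack traces) are dropped.

```shell
#!/bin/sh
# Hypothetical helper (not part of oVirt): print the lines of a log file
# whose leading timestamp lies between two bounds, inclusive.
# Assumption: lines start with "YYYY-MM-DD HH:MM:SS", so a plain string
# comparison on the first 19 characters orders them chronologically.
log_window() {
    # $1 = start timestamp, $2 = end timestamp, $3 = log file
    awk -v start="$1" -v end="$2" '
        { ts = substr($0, 1, 19) }      # leading timestamp of this line
        ts >= start && ts <= end        # print only lines inside the window
    ' "$3"
}

# Example call (times are placeholders, adjust to the actual failure):
#   log_window "2015-01-23 21:15:00" "2015-01-23 21:45:00" \
#       /var/log/ovirt-engine/engine.log > engine-window.log
```

On the host itself, 'service vdsmd status' (mentioned above) shows whether vdsmd is running; on an EL6 host, 'chkconfig --list vdsmd' should additionally show whether it is enabled at boot.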

I have done a bit more investigating on this matter.

If I restart the node from within oVirt using the power management option "restart", then the node restarts and vdsmd DOES NOT start.

If I go into the DRAC and issue the command to power cycle the machine, then the machine restarts and vdsmd DOES start.

I can run the following command from another node in the cluster:

    fence_drac5 -a 192.168.200.105 -l root -p <password> -x -o reboot

and the node restarts and vdsmd DOES start.

On Sun, Jan 25, 2015 at 1:56 AM, ILanit Stein <istein@redhat.com> wrote:
You can also check vdsm service status, on host, while host reported as Non responsive, by running on host 'service vdsmd status'
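One way to narrow this down might be to issue the power actions as separate steps from another node, instead of a single "reboot". The original report lists separate Status and Start commands sent through the proxy; that the engine's "restart" is a power off followed by a power on is an assumption here, not something the thread confirms. The sketch below uses only the fence_drac5 options already shown in the thread (-a, -l, -p, -x, -o). The names fence_step, fence_off_on, DRY_RUN, FENCE_USER, FENCE_PASS and FENCE_WAIT are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run one fence_drac5 action against a DRAC, or only
# print the command (password masked) when DRY_RUN=1.
fence_step() {
    addr=$1      # DRAC address
    action=$2    # status, off, on or reboot
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "fence_drac5 -a $addr -l ${FENCE_USER:-root} -p <password> -x -o $action"
    else
        fence_drac5 -a "$addr" -l "${FENCE_USER:-root}" -p "$FENCE_PASS" -x -o "$action"
    fi
}

# Power the host off and on as two separate actions, checking status around
# each one. Assumption: this resembles what the engine does on "restart".
fence_off_on() {
    fence_step "$1" status
    fence_step "$1" off
    sleep "${FENCE_WAIT:-30}"    # give the host time to power down
    fence_step "$1" status
    fence_step "$1" on
    fence_step "$1" status
}

# Example: preview the commands without touching the host:
#   DRY_RUN=1 fence_off_on 192.168.200.105
```

If vdsmd fails to start after the off/on sequence but starts after "-o reboot", that would point at the difference between the two power paths rather than at the engine itself.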

It might be a bug. Would you please attach the logs I mentioned earlier (engine.log and vdsm.log), which can give more details on the failure?

Adding Eli, who may want to give some input on this issue.

Thanks,
Ilanit.

----- Original Message -----
From: "Rob Abshear" <rabshear@citytwist.net>
To: "ILanit Stein" <istein@redhat.com>
Cc: users@ovirt.org
Sent: Monday, January 26, 2015 9:43:14 PM
Subject: Re: [ovirt-users] Host remains Non-Responsive after reboot
participants (2)
- ILanit Stein
- Rob Abshear