[Users] host remains in status unassigned

Hello,
oVirt 3.2.1 on F18, one mgmt server + one host.
At some point the host had some problem and was rebooted; at reboot there was a problem because I got

detected unhandled Python exception in '/usr/share/vdsm/vdsm'

Looking at http://www.ovirt.org/Troubleshooting I verified that for some reason vdsmd.log was owned root.root. So I set the correct permissions and restarted the node.
Now it stays in "unassigned" status from the engine's point of view, and I don't know what that means...

In the node's messages:

Jul 14 16:55:02 management02 vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed.

[root@management02 ~]# systemctl status vdsmd
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: active (running) since Sun 2013-07-14 16:54:58 CEST; 6min ago
  Process: 1015 ExecStart=/lib/systemd/systemd-vdsmd start (code=exited, status=0/SUCCESS)
 Main PID: 2128 (respawn)
   CGroup: name=systemd:/system/vdsmd.service
           ├─2128 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn....
           ├─2131 /usr/bin/python /usr/share/vdsm/vdsm
           ├─2151 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py 62a4a000-a73a-4e2c-b1a8-25e24...
           └─2152 /usr/bin/python /usr/share/vdsm/supervdsmServer.py 62a4a000-a73a-4e2c-b1a8-25e24587b558 2131 /var...

Jul 14 16:54:58 management02.localdomain.local runuser[2125]: pam_unix(runuser:session): session opened for user vdsm ...d=0)
Jul 14 16:54:58 management02.localdomain.local runuser[2125]: pam_unix(runuser:session): session closed for user vdsm
Jul 14 16:54:58 management02.localdomain.local systemd-vdsmd[1015]: [27B blob data]
Jul 14 16:54:58 management02.localdomain.local systemd[1]: Started Virtual Desktop Server Manager.
Jul 14 16:54:59 management02.localdomain.local python[2131]: DIGEST-MD5 client step 2
Jul 14 16:54:59 management02.localdomain.local python[2131]: DIGEST-MD5 client step 2
Jul 14 16:54:59 management02.localdomain.local python[2131]: DIGEST-MD5 client step 3
Jul 14 16:54:59 management02.localdomain.local vdsm[2131]: vdsm fileUtils WARNING Dir /rhev/data-center/mnt already exists
Jul 14 16:55:02 management02.localdomain.local vdsm[2131]: vdsm vds WARNING Unable to load the json rpc server module....led.
Jul 14 16:55:04 management02.localdomain.local vdsm[2131]: vdsm fileUtils WARNING Dir /rhev/data-center/hsm-tasks alre...ists

In vdsm.log:

storageRefresh::DEBUG::2013-07-14 16:55:01,919::misc::1064::SamplingMethod::(__call__) Returning last result
MainProcess|storageRefresh::DEBUG::2013-07-14 16:55:01,921::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/dd of=/sys/class/scsi_host/host0/scan' (cwd None)
MainProcess|storageRefresh::DEBUG::2013-07-14 16:55:01,932::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI scan, this will take up to 30 seconds
MainThread::INFO::2013-07-14 16:55:02,085::momIF::36::MOM::(__init__) Starting up MOM
MainThread::INFO::2013-07-14 16:55:02,091::vmChannels::140::vds::(settimeout) Setting channels' timeout to 30 seconds.
VM Channels Listener::INFO::2013-07-14 16:55:02,091::vmChannels::128::vds::(run) Starting VM channels listener thread.
MainThread::WARNING::2013-07-14 16:55:02,139::clientIF::175::vds::(_prepareBindings) Unable to load the json rpc server module. Please make sure it is installed.
storageRefresh::DEBUG::2013-07-14 16:55:04,290::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
storageRefresh::DEBUG::2013-07-14 16:55:04,341::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
storageRefresh::DEBUG::2013-07-14 16:55:04,342::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,342::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,342::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,343::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,343::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,343::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
storageRefresh::DEBUG::2013-07-14 16:55:04,344::misc::1064::SamplingMethod::(__call__) Returning last result
storageRefresh::WARNING::2013-07-14 16:55:04,344::fileUtils::185::fileUtils::(createdir) Dir /rhev/data-center/hsm-tasks already exists

Full engine logs and vdsm logs are here:
https://docs.google.com/file/d/0BwoPbcrMv8mvTXpFNmw3UG9NVzg/edit?usp=sharing
and
https://docs.google.com/file/d/0BwoPbcrMv8mvRDViZGo1S1hRcUU/edit?usp=sharing

Thanks for any help. I should use them tomorrow; they kept OK for many weeks, but Murphy is always alive ;-)
Gianluca
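
P.S. In case it is useful to anyone searching the archives later, the permission fix described above was essentially the following (a sketch only; the exact log file path and the vdsm:kvm ownership are assumptions, check what your vdsm package expects):

    # check who currently owns the vdsm log (path assumed, use whatever file vdsm complained about)
    ls -l /var/log/vdsm/vdsm.log
    # give it back to the vdsm service user (vdsm:kvm assumed, verify on your install)
    chown vdsm:kvm /var/log/vdsm/vdsm.log
    # restart the daemon so it reopens its log with the right ownership
    systemctl restart vdsmd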

Gianluca Cecchi wrote:
Hello,
oVirt 3.2.1 on F18 One mgmt server + one host.
At some point the host had some problem and was rebooted; at reboot there was a problem because I got
detected unhandled Python exception in '/usr/share/vdsm/vdsm'
Looking at http://www.ovirt.org/Troubleshooting
I verified that for some reason vdsmd.log was owned root.root. So I set the correct permissions and restarted the node.
OK, I have had this too and managed to get out of it until now; maybe my experience will help you, though you may have already tried what I'm going to suggest. Put the host in maintenance, log in with ssh on the host, restart the vdsmd service, then activate the host. I think I amended vdsm's logrotate.conf with a user statement so that it wouldn't make root the owner of vdsmd.log. There is a race condition that has been fixed, I think, in 3.2.2 and onwards.
Hope this helps you,
Joop
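
P.S. A minimal sketch of the kind of logrotate amendment I mean (the path /etc/logrotate.d/vdsm and the vdsm:kvm owner are assumptions here, adjust to what your vdsm package actually ships):

    # /etc/logrotate.d/vdsm (illustrative sketch, not the packaged file)
    /var/log/vdsm/vdsm.log {
        # rotate and create the new log as the vdsm service user, not root
        su vdsm kvm
        create 0644 vdsm kvm
        rotate 100
        size 15M
        missingok
        compress
    }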

On 14 Jul 2013 at 17:35, "Joop" <jvdwege@xs4all.nl> wrote:
Gianluca Cecchi wrote:
Hello,
oVirt 3.2.1 on F18 One mgmt server + one host.
At some point the host had some problem and was rebooted; at reboot there was a problem because I got
detected unhandled Python exception in '/usr/share/vdsm/vdsm'
Looking at http://www.ovirt.org/Troubleshooting
I verified that for some reason vdsmd.log was owned root.root. So I set the correct permissions and restarted the node.
OK, I have had this too and managed to get out of it until now; maybe my experience will help you, though you may have already tried what I'm going to suggest.
Put the host in maintenance, log in with ssh on the host, restart the vdsmd service, then activate the host.
I think I amended vdsm's logrotate.conf with a user statement so that it wouldn't make root the owner of vdsmd.log. There is a race condition that has been fixed, I think, in 3.2.2 and onwards.
Hope this helps you,
Joop
I'm going to check, but it seems to me that I could only select "host has rebooted", and that failed because "unassigned" is not a good starting point for it. Not sure I can put it in maintenance...
Gianluca

On Sun 14 Jul 2013 08:12:21 PM IDT, Gianluca Cecchi wrote:
On 14 Jul 2013 at 17:35, "Joop" <jvdwege@xs4all.nl> wrote:
Gianluca Cecchi wrote:
Hello,
oVirt 3.2.1 on F18 One mgmt server + one host.
At some point the host had some problem and was rebooted; at reboot there was a problem because I got
detected unhandled Python exception in '/usr/share/vdsm/vdsm'
Looking at http://www.ovirt.org/Troubleshooting
I verified that for some reason vdsmd.log was owned root.root. So I set the correct permissions and restarted the node.
OK, I have had this too and managed to get out of it until now; maybe my experience will help you, though you may have already tried what I'm going to suggest.
Put the host in maintenance, log in with ssh on the host, restart the vdsmd service, then activate the host.
I think I amended vdsm's logrotate.conf with a user statement so that it wouldn't make root the owner of vdsmd.log. There is a race condition that has been fixed, I think, in 3.2.2 and onwards.
Hope this helps you,
Joop
I'm going to check, but it seems to me that I could only select "host has rebooted", and that failed because "unassigned" is not a good starting point for it. Not sure I can put it in maintenance...
Gianluca
Unfortunately you can't. "Unassigned" puts the host in a transitional state, and its status should be moved along automatically by the engine. It's a bug that it didn't move to any other state (Non-Operational/Up/Non-Responsive, etc.). Can you help me with a stack trace of your engine, i.e. the output of "jstack -l <jboss-pid>"?
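Something along these lines should capture it (the 'jboss' process pattern is just an example; on some installs the engine's java process shows up under a different name, so adapt the search to your setup):

    # find the pid of the engine's java (jboss) process
    pgrep -f jboss
    # dump all java thread stacks, including lock info, to a file you can attach
    jstack -l <jboss-pid> > /tmp/engine-jstack.txt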

On Mon, Jul 15, 2013 at 8:43 AM, Roy Golan wrote:
Unfortunately you can't. "Unassigned" puts the host in a transitional state, and its status should be moved along automatically by the engine. It's a bug that it didn't move to any other state (Non-Operational/Up/Non-Responsive, etc.).
Can you help me with a stack trace of your engine, i.e. the output of "jstack -l <jboss-pid>"?
In fact, after my final recovery operations (changing the permissions of vdsm.log and rebooting the host and engine servers), I connected to the infra a few hours later and discovered that it had actually recovered automatically a few minutes after the engine reboot, putting the host and all related storage domains back online, and I was able to start my VMs again without problems. If this is confirmed fixed in 3.2.2, I'm going to update in the coming days, after the activity I had to do today.

Details:
reboot of host at 16:54 with start of vdsmd
reboot of engine at 17:34

These lines in the engine log should show the host becoming operational again after 4 minutes, if I understand correctly:

2013-07-14 17:37:41,658 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-7) spmStart polling ended, spm status: SPM
2013-07-14 17:37:41,660 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-7) START, HSMClearTaskVDSCommand(HostName = management02, HostId = 605a7bda-a122-40be-a5ea-8f12e11484be, taskId=11aef1c7-cc52-42d4-b884-bbb1e3810260), log id: 2d48c7f0
2013-07-14 17:37:41,668 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, HSMClearTaskVDSCommand, log id: 2d48c7f0
2013-07-14 17:37:41,668 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@57ce4b98, log id: 27089775
2013-07-14 17:37:41,672 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-7) Initialize Irs proxy from vds: management02
2013-07-14 17:37:51,385 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-6) Storage Domain 24f30005-1f28-474d-9855-14932f8e0f1e:EXP_DOMAIN was reported by Host management02 as Active in Pool bc50e9ae-0206-47bb-929f-c5aaf653fb14, moving to active status
2013-07-14 17:37:51,393 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-6) Storage Domain 9b9fef06-694e-4ec4-b090-73508cf11119:222b30001551a8c76 was reported by Host management02 as Active in Pool bc50e9ae-0206-47bb-929f-c5aaf653fb14, moving to active status

Thanks again
Gianluca
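
P.S. For anyone who wants to check the same thing on their own setup, something like this is enough to follow the host/SPM transitions in the engine log (the path /var/log/ovirt-engine/engine.log is the usual default and management02 is my host name; adjust both to your environment):

    # pull out the SPM start / storage-domain / host lines for a given host
    grep -E 'SpmStartVDSCommand|IrsBrokerCommand|management02' /var/log/ovirt-engine/engine.log | tail -n 50
    # or watch live while the host comes back up
    tail -f /var/log/ovirt-engine/engine.log | grep --line-buffered management02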

On Mon 15 Jul 2013 09:59:01 AM IDT, Gianluca Cecchi wrote:
On Mon, Jul 15, 2013 at 8:43 AM, Roy Golan wrote:
Unfortunately you can't. "Unassigned" puts the host in a transitional state, and its status should be moved along automatically by the engine. It's a bug that it didn't move to any other state (Non-Operational/Up/Non-Responsive, etc.).
Can you help me with a stack trace of your engine, i.e. the output of "jstack -l <jboss-pid>"?
In fact, after my final recovery operations (changing the permissions of vdsm.log and rebooting the host and engine servers), I connected to the infra a few hours later and discovered that it had actually recovered automatically a few minutes after the engine reboot, putting the host and all related storage domains back online, and I was able to start my VMs again without problems. If this is confirmed fixed in 3.2.2, I'm going to update in the coming days, after the activity I had to do today.
Details: reboot of host at 16:54 with start of vdsmd; reboot of engine at 17:34.
These lines in the engine log should show the host becoming operational again after 4 minutes, if I understand correctly:
This is the expected behavior. I'm not sure whether the vdsm.log-owned-by-root issue has been handled in 3.2.2; can someone confirm that? (Or just post the question to the vdsm list or on the oVirt IRC channel.)
2013-07-14 17:37:41,658 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-7) spmStart polling ended, spm status: SPM
2013-07-14 17:37:41,660 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-7) START, HSMClearTaskVDSCommand(HostName = management02, HostId = 605a7bda-a122-40be-a5ea-8f12e11484be, taskId=11aef1c7-cc52-42d4-b884-bbb1e3810260), log id: 2d48c7f0
2013-07-14 17:37:41,668 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, HSMClearTaskVDSCommand, log id: 2d48c7f0
2013-07-14 17:37:41,668 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@57ce4b98, log id: 27089775
2013-07-14 17:37:41,672 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-7) Initialize Irs proxy from vds: management02
2013-07-14 17:37:51,385 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-6) Storage Domain 24f30005-1f28-474d-9855-14932f8e0f1e:EXP_DOMAIN was reported by Host management02 as Active in Pool bc50e9ae-0206-47bb-929f-c5aaf653fb14, moving to active status
2013-07-14 17:37:51,393 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-6) Storage Domain 9b9fef06-694e-4ec4-b090-73508cf11119:222b30001551a8c76 was reported by Host management02 as Active in Pool bc50e9ae-0206-47bb-929f-c5aaf653fb14, moving to active status
Thanks again
Gianluca