[Users] NFS Domains down because of single node failure

Hello,

maybe a stupid one but ...

When I create a (NFS) storage domain I have to provide a node host that makes the initial contact. All other node hosts will connect directly to that domain. So no bottlenecks.

Today I stopped one of my two nodes outside ovirt-engine. For simplicity we assume the node crashed. The machine is up right now but VDSM is down.

All domains that were set up with this host are "down" now (red arrow down). After searching the web interface I found "Data center" -> "Select your DC" -> "Storage" -> "Activate". Trying to activate only results in a failure message. To ensure that I can recover from those situations in the future I'd like to know what this node binding is all about and what to do next.

Logs attached & thanks in advance

Markus

2013-09-17 08:54:54,985 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-50) [1be325f3] spm vds is non responsive, stopping spm selection.
2013-09-17 08:54:54,986 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-6-thread-50) [1be325f3] FINISH, ActivateStorageDomainVDSCommand, log id: 4c38e98
2013-09-17 08:54:54,987 ERROR [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-6-thread-50) [1be325f3] Command org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand throw Vdc Bll exception. With error message VdcBLLException: Cannot allocate IRS server (Failed with VDSM error IRS_REPOSITORY_NOT_FOUND and code 5009)
2013-09-17 08:54:54,989 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-6-thread-50) [1be325f3] Command [id=a0dbe909-fbb1-40ff-b77a-8e43bd075ace]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot [id=storagePoolId = b054727d-fe4a-41ed-8393-a81e36b8a1af, storageId = ecf7f507-b0fa-47ee-a8b2-d621fbd7b8bf, status=Unknown].
2013-09-17 08:54:55,004 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-99) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-09-17 08:54:55,008 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-50) [1be325f3] Correlation ID: 1be325f3, Job ID: c88c00ba-0298-4f42-bc0e-a720d79c5f49, Call Stack: null, Custom Event ID: -1, Message: Failed to activate Storage Domain NAS5_IB (Data Center Collogia) by admin@internal
2013-09-17 08:54:56,263 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (ajp--127.0.0.1-8702-2) [5c6218c1] Lock Acquired to object EngineLock [exclusiveLocks= key: ecf7f507-b0fa-47ee-a8b2-d621fbd7b8bf value: STORAGE , sharedLocks= ]
2013-09-17 08:54:56,272 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-6-thread-50) [5c6218c1] Running command: ActivateStorageDomainCommand internal: false. Entities affected : ID: ecf7f507-b0fa-47ee-a8b2-d621fbd7b8bf Type: Storage
2013-09-17 08:54:56,291 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-6-thread-50) [5c6218c1] Lock freed to object EngineLock [exclusiveLocks= key: ecf7f507-b0fa-47ee-a8b2-d621fbd7b8bf value: STORAGE , sharedLocks= ]
2013-09-17 08:54:56,292 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (pool-6-thread-50) [5c6218c1] ActivateStorage Domain. Before Connect all hosts to pool. Time:9/17/13 8:54 AM
2013-09-17 08:54:56,296 INFO [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-6-thread-47) Running command: ConnectStorageToVdsCommand internal: true. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: System
2013-09-17 08:54:56,299 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-47) START, ConnectStorageServerVDSCommand(HostName = colovn3, HostId = 0fdccd63-f5d7-41e4-8350-5941bbc29270, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 68c31a49-0e37-4438-a8fe-fc28be62cd3f, connection: 10.10.30.251:/var/nas5/ovirt, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 75a9c6a0
2013-09-17 08:54:56,317 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-6-thread-47) FINISH, ConnectStorageServerVDSCommand, return: {68c31a49-0e37-4438-a8fe-fc28be62cd3f=0}, log id: 75a9c6a0

****************************************************************************
This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.

E-mails sent over the internet may have been written under a wrong name or been manipulated. That is why this message sent as an e-mail is not a legally binding declaration of intention.

Collogia Unternehmensberatung AG
Ubierring 11
D-50678 Köln

Executive board: Kadir Akin, Dr. Michael Höhnerbach
President of the supervisory board: Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
****************************************************************************
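For readers skimming the engine log in the message above, the following is a minimal sketch of how the WARN and ERROR entries could be pulled out of such an excerpt. The regular expression is an assumption inferred only from the line shape visible in this thread, not an official description of the oVirt engine log format, and the function name `problems` is hypothetical.

```python
import re

# Assumed line shape, inferred only from the excerpt in this thread:
# "<date> <time>,<ms> <LEVEL> [<logger>] (<thread>) <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"(?P<message>.*)$"
)


def problems(lines, levels=("WARN", "ERROR")):
    """Return (timestamp, level, short logger name, message) for matching lines."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in levels:
            # Keep only the class name, e.g. "IrsBrokerCommand".
            short = m.group("logger").rsplit(".", 1)[-1]
            found.append((m.group("ts"), m.group("level"), short, m.group("message")))
    return found


# Three lines copied from the log excerpt in the original message.
SAMPLE = [
    "2013-09-17 08:54:54,985 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] "
    "(pool-6-thread-50) [1be325f3] spm vds is non responsive, stopping spm selection.",
    "2013-09-17 08:54:54,986 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] "
    "(pool-6-thread-50) [1be325f3] FINISH, ActivateStorageDomainVDSCommand, log id: 4c38e98",
    "2013-09-17 08:54:55,004 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] "
    "(DefaultQuartzScheduler_Worker-99) Command GetCapabilitiesVDS execution failed. "
    "Exception: VDSNetworkException: java.net.ConnectException: Connection refused",
]

if __name__ == "__main__":
    for ts, level, logger, message in problems(SAMPLE):
        print(ts, level, logger, message)
```

On the sample lines this keeps the "spm vds is non responsive" warning and the "Connection refused" error, which are the two entries that point at the stopped VDSM rather than at the NFS export itself.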

On 09/17/2013 10:01 AM, Markus Stockhausen wrote:
> When I create a (NFS) storage domain I have to provide a node host that makes the initial contact.
> To ensure that I can recover from those situations in the future I'd like to know what this node binding is all about and what to do next.

The node you used to set up the storage does not affect later runs. Can the remaining node access the storage?
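One rough way to start answering that question from the remaining node is a plain TCP check against the NAS. This is only a sketch under assumptions: the address 10.10.30.251 is taken from the `connectionList` entry in the engine log, port 2049 is the usual NFS port and may not match this NAS, and the helper name `tcp_reachable` is hypothetical. A successful connection does not prove that the export can be mounted; it only rules out basic network problems.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # Address from the engine log in this thread; 2049 is an assumption.
    nas = "10.10.30.251"
    print("NFS port reachable:", tcp_reachable(nas, 2049))
```

If this prints False on the remaining node, the problem is likely on the network or NAS side; if it prints True, a manual test mount of the export would be the next step.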

From: Itamar Heim [iheim@redhat.com]
Sent: Wednesday, 18 September 2013 14:04
To: Markus Stockhausen
Cc: users; Allon Mureinik
Subject: Re: [Users] NFS Domains down because of single node failure

> The node you used to set up the storage does not affect later runs.
> Can the remaining node access the storage?

These drive-by errors are really hard ones: you have to remember what you did before the error, and all the while you are close to restarting all cluster components...

Nevertheless, I provided a reproducible test case. More in
https://bugzilla.redhat.com/show_bug.cgi?id=1009610

Markus
participants (2)
- Itamar Heim
- Markus Stockhausen