John,

Thanks again for the reply.  Yes, the API at the path you mentioned shows the domain.  This has to have been a bug, as things began working after I changed values in the database.  Somehow, setting the new IP for the storage connection in the database for both NFS and iSCSI resulted in the NFS domain becoming master again, and at that point the iSCSI domain "magically" went active once NFS (master) was active.  I don't pretend to know how this happened, and even my boss laughed when I shrugged at the question "how did you fix it?".  I'd be glad to supply the devs with whatever information I can, but I can't change much now, as the goal for today was to get back online and that's been achieved.

One thing I did that may have contributed to iSCSI not coming back: once I lost the IB fabric, in order to disconnect the iSCSI that was running over iSER, I issued "vgchange -an <domain ID>" and then logged out of the iSCSI session on each oVirt node.  One of my hosts would not re-activate once everything was back online; doing a "vgchange -ay <domain ID>" and then removing the host from maintenance worked.  Since I had to switch from one network to another and from iSER to iSCSI, I wanted all active connections closed, and the only way I could make the block devices disconnect cleanly was to deactivate the volume group on the LUN.
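
For reference, the sequence described above can be sketched roughly as follows. This is only a dry-run illustration that prints the commands and executes nothing. The VG name, IQN and portal are the values that appear elsewhere in this thread, the function names are made up for the example, and the `-u` / `-l` flags are iscsiadm's logout and login options. On a live oVirt host VDSM normally manages these steps itself, so treat this as a record of what was done by hand, not a recommended procedure.

```python
import shlex

# Values taken from this thread; substitute your own.
VG = "4eeb8415-c912-44bf-b482-2673849705c9"  # storage domain ID, used as the VG name
IQN = "iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi"
PORTAL = "10.0.0.10:3260,1"


def disconnect_commands(vg, iqn, portal):
    """Commands to deactivate the VG first, then log out of the iSCSI session."""
    return [
        ["vgchange", "-an", vg],
        ["iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "-u"],
    ]


def reconnect_commands(vg, iqn, portal):
    """Reverse order: log back in to the target, then reactivate the VG."""
    return [
        ["iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "-l"],
        ["vgchange", "-ay", vg],
    ]


def render(commands):
    """Format argv lists as shell-quoted lines for review (nothing is executed)."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in commands]


if __name__ == "__main__":
    for line in render(disconnect_commands(VG, IQN, PORTAL)):
        print(line)
    for line in render(reconnect_commands(VG, IQN, PORTAL)):
        print(line)
```

The ordering is the point: the volume group is deactivated before the session is logged out so the block devices are no longer held open, and it is reactivated only after the session is back.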

Thanks,
- Trey

On Tue, Oct 21, 2014 at 4:06 PM, Sandra Taylor <jtt77777@gmail.com> wrote:
Trey,
The thread that keeps repeating is the call to repoStats. I believe
it's part of the storage monitoring, and in my environment it repeats
every 15 seconds.
Mine looks like:
Thread-168::INFO::2014-10-21
15:02:42,616::logUtils::44::dispatcher::(wrapper) Run and protect:
repoStats(options=None)
Thread-168::INFO::2014-10-21
15:02:42,617::logUtils::47::dispatcher::(wrapper) Run and protect:
repoStats, Return response: {'86f0a388-dc9d-4e44-a599-b3f2c9e58922':
{'code': 0, 'version': 3, 'acquired': True, 'delay': '0.00066814',
'lastCheck': '1.8', 'valid': True}}

but yours isn't returning anything; the response is just: {}
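
If it helps to spot this in a long vdsm.log, here is a minimal sketch that pulls the repoStats response out of a log line and flags the empty ones. It assumes each log entry sits on a single line (the entries quoted in this mail are wrapped by the mail client) and that the response is printed as a Python literal, which is how it looks in both of our logs. The function names are just for illustration, and the two sample lines are copied from this thread.

```python
import ast
import re

# Matches the "Return response" half of a repoStats log entry.
REPOSTATS_RE = re.compile(r"repoStats, Return response: (\{.*\})\s*$")

# Sample entries from this thread, joined back onto one line each.
HEALTHY = (
    "Thread-168::INFO::2014-10-21 15:02:42,617::logUtils::47::dispatcher::"
    "(wrapper) Run and protect: repoStats, Return response: "
    "{'86f0a388-dc9d-4e44-a599-b3f2c9e58922': {'code': 0, 'version': 3, "
    "'acquired': True, 'delay': '0.00066814', 'lastCheck': '1.8', "
    "'valid': True}}"
)
EMPTY = (
    "Thread-13::INFO::2014-10-21 15:13:18,282::logUtils::47::dispatcher::"
    "(wrapper) Run and protect: repoStats, Return response: {}"
)


def parse_repostats(line):
    """Return the repoStats response dict from one log line, or None."""
    match = REPOSTATS_RE.search(line)
    if match is None:
        return None
    return ast.literal_eval(match.group(1))


def empty_repostats(lines):
    """Yield the lines whose repoStats response lists no storage domains."""
    for line in lines:
        stats = parse_repostats(line)
        if stats is not None and not stats:
            yield line


if __name__ == "__main__":
    for line in empty_repostats([HEALTHY, EMPTY]):
        print("no monitored domains:", line)
```

An empty dict here only says the host is not monitoring any domain at that moment; it does not say why.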

But I think the problem is that the HSM isn't finding any volume
groups in its call to lvm vgs, and thus no storage domains (see
"No volume groups found" and "Found SD uuids: ()" below):

Thread-14::DEBUG::2014-10-21
15:12:56,768::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n
/sbin/lvm vgs --config " devices { preferred_names =
[\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0
disable_after_error_count=3)
Thread-14::DEBUG::2014-10-21
15:12:56,968::lvm::296::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
No volume groups found\n'; <rc> = 0
Thread-14::DEBUG::2014-10-21
15:12:56,969::lvm::415::OperationMutex::(_reloadvgs) Operation 'lvm
reload operation' released the operation mutex
Thread-14::DEBUG::2014-10-21
15:12:56,974::hsm::2352::Storage.HSM::(__prefetchDomains) Found SD
uuids: ()
Thread-14::DEBUG::2014-10-21
15:12:56,974::hsm::2408::Storage.HSM::(connectStorageServer) knownSDs:
{}

But I don't really know how that's possible, considering you show what
looks to be a domain in the lvscan output.
The only thing that comes to mind is that there was a bug in some of
the iSCSI initiator tools where an error was returned if a
session was already logged in, but that doesn't look to be the case from
the logs. Or maybe something like lvmetad caching, but vdsm uses its
own config to turn lvmetad off (at /var/run/vdsm/lvm, I think).

Does the storage domain with that ID exist?
It should be visible at /api/storagedomains/4eeb8415-c912-44bf-b482-2673849705c9
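
A minimal sketch of how that check could be scripted, in case the browser is inconvenient. The engine hostname and the credentials below are placeholders, not values from your setup, and the helper names are made up for the example. It only builds the request; it assumes the engine accepts HTTP basic authentication on the REST API.

```python
import base64
import urllib.request

SD_ID = "4eeb8415-c912-44bf-b482-2673849705c9"  # domain ID from the lvscan output


def storage_domain_url(engine_base, sd_id):
    """Build the REST URL for one storage domain."""
    return "{}/api/storagedomains/{}".format(engine_base.rstrip("/"), sd_id)


def build_request(engine_base, sd_id, user, password):
    """Prepare a GET request with a basic-auth header; nothing is sent here."""
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(storage_domain_url(engine_base, sd_id))
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/xml")
    return request


if __name__ == "__main__":
    # Hypothetical engine address and login; replace with real ones.
    req = build_request("https://engine.example.com", SD_ID,
                        "admin@internal", "changeme")
    print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the lookup. An engine with a self-signed certificate would also need its CA trusted first, and an HTTP 404 would suggest the engine does not know the domain.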

-John



On Tue, Oct 21, 2014 at 4:17 PM, Trey Dockendorf <treydock@gmail.com> wrote:
> John,
>
> Thanks for the reply.  The Discover function in the GUI works...it's once I try to
> log in (click the arrow next to the target) that things just hang indefinitely.
>
> # iscsiadm -m session
> tcp: [2] 10.0.0.10:3260,1
> iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi
>
> # iscsiadm -m node
> 10.0.0.10:3260,1 iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi
>
> # multipath -ll
> 1IET_00010001 dm-3 IET,VIRTUAL-DISK
> size=500G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   `- 8:0:0:1 sdd 8:48 active ready running
> 1ATA_WDC_WD5003ABYZ-011FA0_WD-WMAYP0DNSAEZ dm-2 ATA,WDC WD5003ABYZ-0
> size=466G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   `- 3:0:0:0 sdc 8:32 active ready running
>
> The first entry, 1IET_00010001, is the iSCSI LUN.
>
> The log when I click the arrow in the interface for the target is this:
>
> Thread-14::DEBUG::2014-10-21
> 15:12:49,900::BindingXMLRPC::251::vds::(wrapper) client [192.168.202.99]
> flowID [7177dafe]
> Thread-14::DEBUG::2014-10-21
> 15:12:49,901::task::595::TaskManager.Task::(_updateState)
> Task=`01d8d01e-8bfd-4764-890f-2026fdeb78d9`::moving from state init -> state
> preparing
> Thread-14::INFO::2014-10-21
> 15:12:49,901::logUtils::44::dispatcher::(wrapper) Run and protect:
> connectStorageServer(domType=3,
> spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection':
> '10.0.0.10', 'iqn': 'iqn.2014-04.edu.tamu.brazos.)
> Thread-14::DEBUG::2014-10-21
> 15:12:49,902::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) '/usr/bin/sudo -n
> /sbin/iscsiadm -m node -T
> iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi -I default -p
> 10.0.0.10:3260,1 --op=new' (cwd None)
> Thread-14::DEBUG::2014-10-21
> 15:12:56,684::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> =
> ''; <rc> = 0
> Thread-14::DEBUG::2014-10-21
> 15:12:56,685::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) '/usr/bin/sudo -n
> /sbin/iscsiadm -m node -T
> iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi -I default -p
> 10.0.0.10:3260,1 -l' (cwd None)
> Thread-14::DEBUG::2014-10-21
> 15:12:56,711::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> =
> ''; <rc> = 0
> Thread-14::DEBUG::2014-10-21
> 15:12:56,711::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) '/usr/bin/sudo -n
> /sbin/iscsiadm -m node -T
> iqn.2014-04.edu.tamu.brazos.vmstore1:ovirt-data_iscsi -I default -p
> 10.0.0.10:3260,1 -n node.startup -v manual --op)
> Thread-14::DEBUG::2014-10-21
> 15:12:56,767::iscsiadm::92::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> =
> ''; <rc> = 0
> Thread-14::DEBUG::2014-10-21
> 15:12:56,767::lvm::373::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' got the operation mutex
> Thread-14::DEBUG::2014-10-21
> 15:12:56,768::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n
> /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3)
> Thread-14::DEBUG::2014-10-21
> 15:12:56,968::lvm::296::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  No
> volume groups found\n'; <rc> = 0
> Thread-14::DEBUG::2014-10-21
> 15:12:56,969::lvm::415::OperationMutex::(_reloadvgs) Operation 'lvm reload
> operation' released the operation mutex
> Thread-14::DEBUG::2014-10-21
> 15:12:56,974::hsm::2352::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
> Thread-14::DEBUG::2014-10-21
> 15:12:56,974::hsm::2408::Storage.HSM::(connectStorageServer) knownSDs: {}
> Thread-14::INFO::2014-10-21
> 15:12:56,974::logUtils::47::dispatcher::(wrapper) Run and protect:
> connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id':
> '00000000-0000-0000-0000-000000000000'}]}
> Thread-14::DEBUG::2014-10-21
> 15:12:56,974::task::1185::TaskManager.Task::(prepare)
> Task=`01d8d01e-8bfd-4764-890f-2026fdeb78d9`::finished: {'statuslist':
> [{'status': 0, 'id': '00000000-0000-0000-0000-000000000000'}]}
> Thread-14::DEBUG::2014-10-21
> 15:12:56,975::task::595::TaskManager.Task::(_updateState)
> Task=`01d8d01e-8bfd-4764-890f-2026fdeb78d9`::moving from state preparing ->
> state finished
> Thread-14::DEBUG::2014-10-21
> 15:12:56,975::resourceManager::940::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-14::DEBUG::2014-10-21
> 15:12:56,975::resourceManager::977::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-14::DEBUG::2014-10-21
> 15:12:56,975::task::990::TaskManager.Task::(_decref)
> Task=`01d8d01e-8bfd-4764-890f-2026fdeb78d9`::ref 0 aborting False
> Thread-13::DEBUG::2014-10-21
> 15:13:18,281::task::595::TaskManager.Task::(_updateState)
> Task=`8674b6b0-5e4c-4f0c-8b6b-c5fa5fef6126`::moving from state init -> state
> preparing
> Thread-13::INFO::2014-10-21
> 15:13:18,281::logUtils::44::dispatcher::(wrapper) Run and protect:
> repoStats(options=None)
> Thread-13::INFO::2014-10-21
> 15:13:18,282::logUtils::47::dispatcher::(wrapper) Run and protect:
> repoStats, Return response: {}
> Thread-13::DEBUG::2014-10-21
> 15:13:18,282::task::1185::TaskManager.Task::(prepare)
> Task=`8674b6b0-5e4c-4f0c-8b6b-c5fa5fef6126`::finished: {}
> Thread-13::DEBUG::2014-10-21
> 15:13:18,282::task::595::TaskManager.Task::(_updateState)
> Task=`8674b6b0-5e4c-4f0c-8b6b-c5fa5fef6126`::moving from state preparing ->
> state finished
> Thread-13::DEBUG::2014-10-21
> 15:13:18,282::resourceManager::940::ResourceManager.Owner::(releaseAll)
> Owner.releaseAll requests {} resources {}
> Thread-13::DEBUG::2014-10-21
> 15:13:18,282::resourceManager::977::ResourceManager.Owner::(cancelAll)
> Owner.cancelAll requests {}
> Thread-13::DEBUG::2014-10-21
> 15:13:18,283::task::990::TaskManager.Task::(_decref)
> Task=`8674b6b0-5e4c-4f0c-8b6b-c5fa5fef6126`::ref 0 aborting False
>
> The lines prefixed with "Thread-13" just repeat over and over, only changing
> the Task value.
>
> I'm unsure what can be done to restore things.  The iSCSI connection is good
> and I'm able to see the logical volumes:
>
> # lvscan
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/metadata'
> [512.00 MiB] inherit
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/leases' [2.00
> GiB] inherit
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/ids' [128.00
> MiB] inherit
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/inbox'
> [128.00 MiB] inherit
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/outbox'
> [128.00 MiB] inherit
>   ACTIVE            '/dev/4eeb8415-c912-44bf-b482-2673849705c9/master' [1.00
> GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/aced9726-5a28-4d52-96f5-89553ba770af'
> [100.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/87bf28aa-be25-4a93-9b23-f70bfd8accc0'
> [1.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/27256587-bf87-4519-89e7-260e13697de3'
> [20.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/ac2cb7f9-1df9-43dc-9fda-8a9958ef970f'
> [20.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/d8c41f05-006a-492b-8e5f-101c4e113b28'
> [100.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/83f17e9b-183e-4bad-ada5-bcef1c5c8e6a'
> [20.00 GiB] inherit
>   inactive
> '/dev/4eeb8415-c912-44bf-b482-2673849705c9/cf79052e-b4ef-4bda-96dc-c53b7c2acfb5'
> [20.00 GiB] inherit
>   ACTIVE            '/dev/vg_ovirtnode02/lv_swap' [46.59 GiB] inherit
>   ACTIVE            '/dev/vg_ovirtnode02/lv_root' [418.53 GiB] inherit
>
> Thanks,
> - Trey
>
>
>
> On Tue, Oct 21, 2014 at 2:49 PM, Sandra Taylor <jtt77777@gmail.com> wrote:
>>
>> Hi Trey,
>> Sorry for your trouble.
>> I don't know if I can help, but I run iSCSI here as my primary domain, so
>> I've had some experience with it.
>> I don't know the answer to the master domain question.
>>
>> Does iSCSI show as connected using iscsiadm -m session and -m node?
>> The vdsm log should contain the iscsiadm commands that were
>> executed to connect.
>> Does multipath -ll show anything?
>>
>> -John
>>
>> On Tue, Oct 21, 2014 at 3:18 PM, Trey Dockendorf <treydock@gmail.com>
>> wrote:
>> > I was able to get iSCSI over TCP working...but now the task of adding the
>> > LUN to the GUI has been stuck at the "spinning" icon for about 20 minutes.
>> >
>> > I see these entries in vdsm.log over and over with the Task value changing:
>> >
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,086::task::595::TaskManager.Task::(_updateState)
>> > Task=`ebcd8e0a-54b1-43d2-92a2-ed9fd62d00fa`::moving from state init ->
>> > state
>> > preparing
>> > Thread-14::INFO::2014-10-21
>> > 14:16:50,086::logUtils::44::dispatcher::(wrapper) Run and protect:
>> > repoStats(options=None)
>> > Thread-14::INFO::2014-10-21
>> > 14:16:50,086::logUtils::47::dispatcher::(wrapper) Run and protect:
>> > repoStats, Return response: {}
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,087::task::1185::TaskManager.Task::(prepare)
>> > Task=`ebcd8e0a-54b1-43d2-92a2-ed9fd62d00fa`::finished: {}
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,087::task::595::TaskManager.Task::(_updateState)
>> > Task=`ebcd8e0a-54b1-43d2-92a2-ed9fd62d00fa`::moving from state preparing
>> > ->
>> > state finished
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,087::resourceManager::940::ResourceManager.Owner::(releaseAll)
>> > Owner.releaseAll requests {} resources {}
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,087::resourceManager::977::ResourceManager.Owner::(cancelAll)
>> > Owner.cancelAll requests {}
>> > Thread-14::DEBUG::2014-10-21
>> > 14:16:50,087::task::990::TaskManager.Task::(_decref)
>> > Task=`ebcd8e0a-54b1-43d2-92a2-ed9fd62d00fa`::ref 0 aborting False
>> >
>> > What can I do to get my storage back online?  Right now my iSCSI domain is
>> > master (something I did not want), which is odd considering the NFS data
>> > domain was added as master when I set up oVirt.  Nothing will come back until
>> > I get the master domain online, and I'm unsure what to do now.
>> >
>> > Thanks,
>> > - Trey
>> >
>> > On Tue, Oct 21, 2014 at 12:58 PM, Trey Dockendorf <treydock@gmail.com>
>> > wrote:
>> >>
>> >> I had a catastrophic failure of the IB switch that was used by all my
>> >> storage domains.  I had one data domain that was NFS and one that was iSCSI.
>> >> I managed to get the iSCSI LUN detached using the docs [1], but now I noticed
>> >> that somehow my master domain went from the NFS domain to the iSCSI domain
>> >> and I'm unable to switch them back.
>> >>
>> >> How does one change the master?  Right now I am having issues getting
>> >> iSCSI over TCP to work, so I am sort of stuck with 30 VMs down and an entire
>> >> cluster inaccessible.
>> >>
>> >> Thanks,
>> >> - Trey
>> >>
>> >> [1] http://www.ovirt.org/Features/Manage_Storage_Connections
>> >
>> >
>> >
>> > _______________________________________________
>> > Users mailing list
>> > Users@ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/users
>> >
>
>