Hi All,
I've got some issues with connecting my oVirt Cluster to my Ceph Cluster via iSCSI.
There are two issues, and I don't know whether one is causing the other or whether they
are two separate, unrelated issues. Let me explain.
The Situation
-------------
- I have a working three node Ceph Cluster (Ceph Quincy on Rocky Linux 8.6)
- The Ceph Cluster has four Storage Pools of between 4 and 8 TB each
- The Ceph Cluster has three iSCSI Gateways
- There is a single iSCSI Target on the Ceph Cluster
- The iSCSI Target has all three iSCSI Gateways attached
- The iSCSI Target has all four Storage Pools attached
- The four Storage Pools have been assigned LUNs 0-3
- I have set up (Discovery) CHAP Authorisation on the iSCSI Target
- I have a working three node self-hosted oVirt Cluster (oVirt v4.5.3 on Rocky Linux 8.6)
- The oVirt Cluster has (in addition to the hosted_storage Storage Domain) three GlusterFS
Storage Domains
- I can ping all three Ceph Cluster Nodes to/from all three oVirt Hosts
- The iSCSI Target on the Ceph Cluster has all three oVirt Hosts' Initiators attached
- Each Initiator has all four Ceph Storage Pools attached
- I have set up CHAP Authorisation on the iSCSI Target's Initiators
- The Ceph Cluster Admin Portal reports that all three Initiators are
"logged_in"
- I have previously connected Ceph iSCSI LUNs to the oVirt Cluster successfully (as an
experiment), but had to remove and re-instate them for the "final" version
- The oVirt Admin Portal (ie the HostedEngine) reports that Initiators 1 & 2 (ie oVirt
Hosts 1 & 2) are "logged_in" to all three iSCSI Gateways
- The oVirt Admin Portal reports that Initiator 3 (ie oVirt Host 3) is
"logged_in" to iSCSI Gateways 1 & 2
- I can "force" Initiator 3 to become "logged_in" to iSCSI Gateway 3,
but when I do this it is *not* persistent
- oVirt Hosts 1 & 2 can/have discovered all three iSCSI Gateways
- oVirt Hosts 1 & 2 can/have discovered all four LUNs/Targets on all three iSCSI
Gateways
- oVirt Host 3 can only discover 2 of the iSCSI Gateways
- For Target/LUN 0 oVirt Host 3 can only "see" the LUN provided by iSCSI Gateway 1
- For Targets/LUNs 1-3 oVirt Host 3 can only "see" the LUNs provided by iSCSI
Gateways 1 & 2
- oVirt Host 3 can *not* "see" any of the Targets/LUNs provided by iSCSI Gateway 3
- When I create a new oVirt Storage Domain for any of the four LUNs:
- I am presented with a message saying "The following LUNs are already in
use..."
- I am asked to "Approve operation" via a checkbox, which I do
- As I watch the oVirt Admin Portal I can see the new iSCSI Storage Domain appear in the
Storage Domain list, and then after a few minutes it is removed
- After those few minutes I am presented with this failure message: "Error while
executing action New SAN Storage Domain: Network error during communication with the
Host."
- I have looked in the engine.log and all I could find that was relevant (as far as I
know) was this:
~~~
2022-11-28 19:59:20,506+11 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (default task-1)
[77b0c12d] Command 'CreateStorageDomainVDSCommand(HostName = ovirt_node_1.mynet.local,
CreateStorageDomainVDSCommandParameters:{hostId='967301de-be9f-472a-8e66-03c24f01fa71',
storageDomain='StorageDomainStatic:{name='data',
id='2a14e4bd-c273-40a0-9791-6d683d145558'}',
args='s0OGKR-80PH-KVPX-Fi1q-M3e4-Jsh7-gv337P'})' execution failed:
VDSGenericException: VDSNetworkException: Message timeout which can be caused by
communication issues
2022-11-28 19:59:20,507+11 ERROR
[org.ovirt.engine.core.bll.storage.domain.AddSANStorageDomainCommand] (default task-1)
[77b0c12d] Command
'org.ovirt.engine.core.bll.storage.domain.AddSANStorageDomainCommand' failed:
EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
VDSGenericException: VDSNetworkException: Message timeout which can be caused by
communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~
I cannot see/detect any "communication issue" - but then again, I'm not 100% sure what
I should be looking for.
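These are the checks I have thought of for hunting down such a communication issue
(hostnames below are placeholders for our real ones - suggestions for others welcome):
~~~
# On oVirt Host 3: is Gateway 3's iSCSI portal reachable at all?
ping -c 3 ceph-gw3.mynet.local
nc -zv ceph-gw3.mynet.local 3260

# On the Ceph Gateway 3 node: are the iSCSI gateway services healthy?
systemctl status rbd-target-gw rbd-target-api

# And does the gateway itself think the Initiators are logged in?
gwcli ls
~~~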
I have looked on-line for an answer and, apart from not being able to get past Red
Hat's subscription "wall" to see the solutions that they have, all I could find that
was relevant was this:
https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/AVLORQNOLJHR...
If this *is* relevant then there is not enough context for me to proceed (eg on *which*
host/vm should that command be run?).
I also found (for a previous version of oVirt) notes about manually modifying the
Postgres DB to resolve a similar issue. While I am more than comfortable doing this
(I've been an SQL DBA for well over 20 years), it seems like asking for trouble - at
least until I hear back from the oVirt Devs that it is OK to do - and of course, I'll
need the relevant commands / locations / authorisations to get into the DB.
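My assumption - and please correct me if this is wrong - is that on the HostedEngine VM
the engine DB credentials live in the engine-setup config, so a read-only look would be
something like this (the table/column names are from memory, so treat them as
illustrative):
~~~
# On the HostedEngine VM: the DB credentials written by engine-setup
cat /etc/ovirt-engine/engine.conf.d/10-setup-database.conf

# Connect as the postgres superuser and (read-only) list the Storage
# Domains the engine knows about
su - postgres -c "psql -d engine -c 'SELECT id, storage_name FROM storage_domains;'"
~~~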
Questions
---------
- Are the two issues (oVirt Host 3 not having a full picture of the Ceph iSCSI environment
and the oVirt iSCSI Storage Domain creation failure) related?
- Do I need to "refresh" the iSCSI info on the oVirt Hosts, and if so, how do I
do this?
- Do I need to "flush" the old LUNs from the oVirt Cluster, and if so, how do I
do this?
- Where else should I be looking for info in the logs (& which logs)?
- Does *anyone* have any other ideas on how to resolve the situation - especially when
using the Ceph iSCSI Gateways?
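To clarify what I mean by "refresh"/"flush": this is what I would try on each oVirt
Host, based on generic open-iscsi practice rather than anything oVirt-specific (the IQN
and portal are the same placeholders as above) - though I don't know whether oVirt/VDSM
expects to manage this itself:
~~~
# Log out of the stale target and delete its node record
iscsiadm -m node -T iqn.2003-01.com.redhat.iscsi-gw:ceph-igw \
         -p 192.168.1.203:3260 --logout
iscsiadm -m node -T iqn.2003-01.com.redhat.iscsi-gw:ceph-igw \
         -p 192.168.1.203:3260 -o delete

# Re-discover, then check what VDSM now sees on this host
iscsiadm -m discovery -t sendtargets -p 192.168.1.203:3260
vdsm-client Host getDeviceList
~~~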
Thanks in advance
Cheers
Dulux-Oz