On Wed, Sep 1, 2021 at 6:21 PM Sketch <ovirt@rednsx.org> wrote:
My cluster was originally built on 4.3, and things were working as long as
my SPM was on 4.3.  I just killed off the last 4.3 host and rebuilt it as
4.4, and upgraded my cluster and DC to compatibility level 4.6.

We had cephfs mounted as a posix FS which worked fine, but oddly in 4.3 we
would end up with two mounts for the same volume.  The configuration had a
comma separated list of IPs as that is how ceph was configured for
redundancy, and this is the mount that shows up on both 4.3 and 4.4 hosts
(/rhev/data-center/mnt/10.1.88.75,10.1.88.76,10.1.88.77:_vmstore/). 

Using a comma separated list of servers in the mount path was never supported.

We had this old fix, which was rejected:
https://gerrit.ovirt.org/c/vdsm/+/94027

but it would not solve the task argument issue shown below.
 
But the 4.3 hosts would also have a duplicate mount which had the FQDN of
one of the servers instead of the comma separated list.

In 4.4, there's only a single mount and existing VMs will start just fine,
but you can't create new disks or migrate existing disks onto the posix
storage volume.  My suspicion is this is an issue with the mount parser
not liking the comma in the name of the mount from the error that I get on
the SPM host when it tries to create a volume (migration would also fail
on the volume creation task):

2021-08-31 19:34:07,767-0700 INFO  (jsonrpc/6) [vdsm.api] START createVolume(sdUUID='e8ec5645-fc1b-4d64-a145-44aa8ac5ef48', spUUID='2948c860-9bdf-11e8-a6b3-00163e0419f0', imgUUID='7d704b4d-1ebe-462f-b11e-b91039f43637', size='1073741824', volFormat=5, preallocate=1, diskType='DATA', volUUID='be6cb033-4e42-4bf5-a4a3-6ab5bf03edee', desc='{"DiskAlias":"test","DiskDescription":""}', srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize=None, addBitmaps=False) from=::ffff:10.1.2.37,43490, flow_id=bb137995-1ffa-429f-b6eb-5b9ca9f8dfd7, task_id=2ddfd1bc-d7e1-4a1e-877a-68e1c2a897ed (api:48)
2021-08-31 19:34:07,767-0700 INFO  (jsonrpc/6) [IOProcessClient] (Global) Starting client (__init__:340)
2021-08-31 19:34:07,782-0700 INFO  (ioprocess/3193398) [IOProcess] (Global) Starting ioprocess (__init__:465)
2021-08-31 19:34:07,803-0700 INFO  (jsonrpc/6) [vdsm.api] FINISH createVolume return=None from=::ffff:10.1.2.37,43490, flow_id=bb137995-1ffa-429f-b6eb-5b9ca9f8dfd7, task_id=2ddfd1bc-d7e1-4a1e-877a-68e1c2a897ed (api:54)
2021-08-31 19:34:07,844-0700 INFO  (tasks/5) [storage.ThreadPool.WorkerThread] START task 2ddfd1bc-d7e1-4a1e-877a-68e1c2a897ed (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f4894279860>>, args=None) (threadPool:146)
2021-08-31 19:34:07,869-0700 INFO  (tasks/5) [storage.StorageDomain] Create placeholder /rhev/data-center/mnt/10.1.88.75,10.1.88.76,10.1.88.77:_vmstore/e8ec5645-fc1b-4d64-a145-44aa8ac5ef48/images/7d704b4d-1ebe-462f-b11e-b91039f43637 for image's volumes (sd:1718)
2021-08-31 19:34:07,869-0700 ERROR (tasks/5) [storage.TaskManager.Task] (Task='2ddfd1bc-d7e1-4a1e-877a-68e1c2a897ed') Unexpected error (task:877)
Traceback (most recent call last):
   File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
     return fn(*args, **kargs)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
     return self.cmd(*self.argslist, **self.argsdict)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
     return method(self, *args, **kwargs)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1945, in createVolume
     initial_size=initialSize, add_bitmaps=addBitmaps)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1216, in createVolume
     initial_size=initial_size, add_bitmaps=add_bitmaps)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1174, in create
     imgPath = dom.create_image(imgUUID)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1721, in create_image
     "create_image_rollback", [image_dir])
   File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 385, in __init__
     self.params = ParamList(argslist)
   File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 298, in __init__
     raise ValueError("ParamsList: sep %s in %s" % (sep, i))
ValueError: ParamsList: sep , in /rhev/data-center/mnt/10.1.88.75,10.1.88.76,10.1.88.77:_vmstore/e8ec5645-fc1b-4d64-a145-44aa8ac5ef48/images/7d704b4d-1ebe-462f-b11e-b91039f43637
2021-08-31 19:34:07,964-0700 INFO  (tasks/5) [storage.ThreadPool.WorkerThread] FINISH task 2ddfd1bc-d7e1-4a1e-877a-68e1c2a897ed (threadPool:148)

I think the issue is the task arguments parser: task arguments are separated
by ",", so an argument containing "," breaks the parser.
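
To illustrate, here is a minimal sketch of the failure mode. It is not the
vdsm code; ParamList and naive_roundtrip are simplified, hypothetical
stand-ins, and it assumes the arguments are stored as one string joined by
the separator, as the ValueError in the traceback suggests.

```python
class ParamList:
    """Hypothetical, simplified stand-in for a task parameter list.

    Assumption: parameters are persisted as a single string joined by
    `sep`, so no parameter may contain `sep` itself.
    """

    def __init__(self, params, sep=","):
        for p in params:
            if sep in p:
                # Same message shape as the error in the traceback.
                raise ValueError("ParamsList: sep %s in %s" % (sep, p))
        self.params = list(params)
        self.sep = sep

    def dumps(self):
        # Serialize all parameters into one separator-joined string.
        return self.sep.join(self.params)


def naive_roundtrip(params, sep=","):
    """Join and split without validation, to show what would go wrong."""
    return sep.join(params).split(sep)


single_ip = "/rhev/data-center/mnt/10.1.88.75:_vmstore/images/img"
multi_ip = "/rhev/data-center/mnt/10.1.88.75,10.1.88.76,10.1.88.77:_vmstore/images/img"

# One argument goes in, one comes back: the path has no comma.
print(naive_roundtrip([single_ip]))

# One argument goes in, three fragments come back: the path was split.
print(naive_roundtrip([multi_ip]))

try:
    ParamList([multi_ip])
except ValueError as e:
    print(e)
```

Without the check, the single path argument would be read back as three
separate arguments, so rejecting it early is the safer behavior.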
 
This is a pretty major issue since we can no longer create new VMs.  As a
workaround, I could change the mount path of the volume to only reference
a single IP, but oVirt won't let me edit the mount.  I wonder if I could
manually edit in the database, then reboot the hosts one by one to make
the change take effect without having to shut down hundreds of VMs at
once?

Editing the mount path in the database and then rebooting the hosts one by
one should work.

Please file a bug for this, so we can consider this for the next release.

Nir