On Fri, Mar 29, 2019 at 12:47 PM Krutika Dhananjay <kdhananj@redhat.com> wrote:
Questions/comments inline ...

On Thu, Mar 28, 2019 at 10:18 PM <olaf.buitelaar@gmail.com> wrote:
Dear All,

I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While previous upgrades (4.1 to 4.2, etc.) went rather smoothly, this one was a different experience. After a test upgrade on a 3-node setup went fine, I proceeded to upgrade the 9-node production platform, unaware of the backward compatibility issues between gluster 3.12.15 and 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn't start. Vdsm wasn't able to mount the engine storage domain, since /dom_md/metadata was missing or couldn't be accessed.

I restored this file as follows: I took a good copy from the underlying bricks, removed the file (and the corresponding gfid's) from the bricks where it was 0 bytes and marked with the sticky bit, removed the file from the mount point, and copied the good copy back onto the mount point. After manually mounting the engine domain, manually creating the corresponding symbolic links in /rhev/data-center and /var/run/vdsm/storage, and fixing the ownership back to vdsm.kvm (it was root.root), I was able to start the HA engine again.

With the engine up again but things still rather unstable, I decided to continue the upgrade on the other nodes. Suspecting an incompatibility between gluster versions, I thought it best to have them all on the same version sooner rather than later. However, things went from bad to worse: the engine stopped again, and all VMs stopped working as well. So, on a machine outside the setup, I restored a backup of the engine taken from version 4.2.8 just before the upgrade. With this engine I was at least able to start some VMs again and finalize the upgrade. Once upgraded, things didn't stabilize, and I also lost 2 VMs during the process due to image corruption. After figuring out that gluster 5.3 had quite some issues, I was lucky to see that gluster 5.5 was about to be released; the moment the RPMs were available I installed them. This helped a lot in terms of stability, for which I'm very grateful!

However, the performance is unfortunately terrible: it's about 15% of what it was running gluster 3.12.15. It's strange, since a simple dd shows OK performance but our actual workload doesn't. I would have expected the performance to be better, given all the improvements made since gluster 3.12. Does anybody share the same experience?

I really hope gluster 6 will soon be tested with oVirt and released, and that things start to perform and stabilize again, like the good old days. Of course, if I can do anything, I'm happy to help.

I think the following is a short list of the issues we have after the migration:
Gluster 5.5:
-       Poor performance for our workload (mostly write dependent)

For this, could you share the volume-profile output specifically for the affected volume(s)? Here's what you need to do -

1. # gluster volume profile $VOLNAME stop
2. # gluster volume profile $VOLNAME start
3. Run the test inside the vm wherein you see bad performance
4. # gluster volume profile $VOLNAME info # save the output of this command into a file
5. # gluster volume profile $VOLNAME stop
6. Attach the output file from step 4
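
Steps 1-5 above can be wrapped in a small shell function so that the profile session brackets exactly one test run. This is only a sketch: profile_workload and its arguments are hypothetical names, and the workload command is a placeholder, since the real test runs inside the VM (for example, trigger it over ssh, or pass a command that simply waits until you have run the test by hand). Only the gluster volume profile subcommands come from the steps above.

```shell
#!/usr/bin/env bash
# profile_workload VOLNAME OUTFILE COMMAND [ARGS...]
# Hypothetical helper: brackets one workload run with a profile session.
profile_workload() {
    local volname="$1" outfile="$2" rc=0
    shift 2

    # Step 1: stop any earlier profiling session.
    # Errors are ignored because profiling may not have been running.
    gluster volume profile "$volname" stop >/dev/null 2>&1 || true

    # Step 2: start profiling.
    gluster volume profile "$volname" start || return 1

    # Step 3: run the workload and remember its exit status.
    "$@" || rc=$?

    # Step 4: save the profile output to a file.
    gluster volume profile "$volname" info > "$outfile"

    # Step 5: stop profiling again.
    gluster volume profile "$volname" stop

    return "$rc"
}

# Example (placeholder host name and test script):
#   profile_workload "$VOLNAME" /tmp/profile.txt ssh root@test-vm ./run-test.sh
```

The function returns the exit status of the workload command, so a failed test run is still visible to the caller even though profiling is stopped afterwards.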

-       VMs randomly pause on unknown storage errors, which turn out to be "stale file" errors. Corresponding log: Lookup on shard 797 failed. Base file gfid = 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]

Could you share the complete gluster client log file (its filename would match the pattern rhev-data-center-mnt-glusterSD-*)?
Also, please share the output of `gluster volume info $VOLNAME`.

 
-       Some files are listed twice in a directory (probably related to the stale file issue?)
Example:
ls -la  /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
total 3081
drwxr-x---.  2 vdsm kvm    4096 Mar 18 11:34 .
drwxr-xr-x. 13 vdsm kvm    4096 Mar 19 09:42 ..
-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
-rw-rw----.  1 vdsm kvm 1048576 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta

Adding DHT and readdir-ahead maintainers regarding entries getting listed twice.
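
To see how widespread the double listing is, a quick check along the following lines may help. It is a sketch only: duplicate_names and list_duplicate_entries are hypothetical helper names, and the check assumes file names without embedded newlines. On a healthy filesystem it prints nothing; on the affected mount it should print each name that a directory listing returns more than once.

```shell
#!/usr/bin/env bash
# duplicate_names: read one name per line on stdin and print every name
# that occurs more than once (each printed once).
duplicate_names() {
    LC_ALL=C sort | uniq -d
}

# list_duplicate_entries DIR: apply the check to a directory listing.
# Assumes names contain no newline characters.
list_duplicate_entries() {
    ls -1A -- "$1" | duplicate_names
}

# Example, using the image directory from the listing above:
#   list_duplicate_entries /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
```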
 

-       Brick processes sometimes start multiple times. Sometimes I have 5 brick processes for a single volume. Killing all glusterfsd processes for the volume on the machine and running gluster v start <vol> force usually starts just one; from then on things look all right.

Did you mean 5 brick processes for a single brick directory?
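
One way to answer that is to count glusterfsd processes per brick directory on the affected node. The sketch below assumes that each glusterfsd command line carries a "--brick-name <path>" argument; please verify that against the ps output on your nodes before relying on it. count_brick_processes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_brick_processes: read process command lines on stdin (one per line)
# and print "<count> <brick path>" for every brick that has a glusterfsd.
# Assumption: glusterfsd is started with "--brick-name <path>".
count_brick_processes() {
    awk '
        $1 ~ /glusterfsd$/ {
            for (i = 2; i < NF; i++)
                if ($i == "--brick-name") count[$(i + 1)]++
        }
        END { for (b in count) print count[b], b }
    ' | LC_ALL=C sort -k2
}

# Example: any brick with a count above 1 has duplicate processes.
#   ps -eo args= | count_brick_processes
```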

Mohit - Could this be because the following commit is missing from the release-5 branch? It might be worth backporting this fix.

commit 66986594a9023c49e61b32769b7e6b260b600626
Author: Mohit Agrawal <moagrawal@redhat.com>
Date:   Fri Mar 1 13:41:24 2019 +0530

    glusterfsd: Multiple shd processes are spawned on brick_mux environment
   
    Problem: Multiple shd processes are spawned while starting volumes
             in the loop on brick_mux environment.glusterd spawn a process
             based on a pidfile and shd daemon is taking some time to
             update pid in pidfile due to that glusterd is not able to
             get shd pid
   
    Solution: Commit cd249f4cb783f8d79e79468c455732669e835a4f changed
              the code to update pidfile in parent for any gluster daemon
              after getting the status of forking child in parent.To resolve
              the same correct the condition update pidfile in parent only
              for glusterd and for rest of the daemon pidfile is updated in
              child
   
    Change-Id: Ifd14797fa949562594a285ec82d58384ad717e81
    fixes: bz#1684404
    Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>

 

-Krutika


Ovirt 4.3.2.1-1.el7
-       The ownership of all VM images is changed to root.root after the VM is shut down, probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1666795, but not limited to the HA engine. I'm still in compatibility mode 4.2 for the cluster and for the VMs, but upgraded to oVirt version 4.3.2.
-       The network provider is set to ovn, which is fine (actually cool), only "ovs-vswitchd" is a CPU hog and utilizes 100%.
-       It seems that on all nodes vdsm tries to get the stats for the HA engine, which is filling the logs with the following (not sure if this is new):
[api.virt] FINISH getStats return={'status': {'message': "Virtual machine does not exist: {'vmId': u'20d69acd-edfd-4aeb-a2ae-49e9c121b7e9'}", 'code': 1}} from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9 (api:54)
-       It seems the package os_brick is missing; the following message fills the vdsm.log: [root] managedvolume not supported: Managed Volume Not Supported. Missing package os-brick.: ('Cannot import os_brick',) (caps:149). I also saw another message about this, so I suspect it will be resolved shortly.
-       The machine I used to run the backup HA engine doesn't want to get removed from hosted-engine --vm-status, not even after running hosted-engine --clean-metadata --host-id=10 --force-clean, or hosted-engine --clean-metadata --force-clean from the machine itself.
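
For the ownership issue in the first item of this list, a check like the following can show which image files are currently affected. It is a sketch: find_wrong_owner is a hypothetical helper name, and vdsm / kvm are the expected owner and group as described above.

```shell
#!/usr/bin/env bash
# find_wrong_owner DIR USER GROUP
# Hypothetical helper: print regular files under DIR that are not owned
# by USER or do not belong to GROUP.
find_wrong_owner() {
    find "$1" -type f \( ! -user "$2" -o ! -group "$3" \) -print
}

# Example: list files under the data center tree not owned by vdsm:kvm.
#   find_wrong_owner /rhev/data-center vdsm kvm
```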

Think that's about it.

Don't get me wrong, I don't want to rant; I just wanted to share my experience and see where things can be made better.


Best Olaf