hyperconverged single node with SSD cache fails gluster creation

I am seeing more success than failure at creating single- and triple-node hyperconverged setups after some weeks of experimentation, so I am branching out to additional features: in this case the ability to use SSDs as cache media for hard disks.

I tried first with a single node that combined caching and compression, and that fails during the creation of the LVs. I tried again without the VDO compression, but the results were identical, while VDO compression without the LV cache worked fine. I tried various combinations, using less space etc., but the results are always the same and unfortunately rather cryptic (I substituted the physical disk label with {disklabel}):

TASK [gluster.infra/roles/backend_setup : Extend volume group] *****************
failed: [{hostname}] (item={u'vgname': u'gluster_vg_{disklabel}p1', u'cachethinpoolname': u'gluster_thinpool_gluster_vg_{disklabel}p1', u'cachelvname': u'cachelv_gluster_thinpool_gluster_vg_{disklabel}p1', u'cachedisk': u'/dev/sda4', u'cachemetalvname': u'cache_gluster_thinpool_gluster_vg_{disklabel}p1', u'cachemode': u'writeback', u'cachemetalvsize': u'70G', u'cachelvsize': u'630G'}) => {"ansible_loop_var": "item", "changed": false, "err": " Physical volume \"/dev/mapper/vdo_{disklabel}p1\" still in use\n", "item": {"cachedisk": "/dev/sda4", "cachelvname": "cachelv_gluster_thinpool_gluster_vg_{disklabel}p1", "cachelvsize": "630G", "cachemetalvname": "cache_gluster_thinpool_gluster_vg_{disklabel}p1", "cachemetalvsize": "70G", "cachemode": "writeback", "cachethinpoolname": "gluster_thinpool_gluster_vg_{disklabel}p1", "vgname": "gluster_vg_{disklabel}p1"}, "msg": "Unable to reduce gluster_vg_{disklabel}p1 by /dev/dm-15.", "rc": 5}

Somewhere in there I see something that points to a race condition ("still in use"). Unfortunately I have not been able to pinpoint the raw logs used at that stage, so I wasn't able to obtain more information.

At this point quite a bit of storage setup is already done, so rolling back for a clean new attempt can be a bit complicated, with reboots to reconcile the kernel with the data on disk.

I don't actually believe it is related to single node, and I'd be quite happy to move the creation of the SSD cache to a later stage, but in a VDO setup that looks slightly complex to someone without intimate knowledge of LVs-with-cache-and-perhaps-thin/VDO/Gluster all thrown into one. Needless to say, the feature set (SSD caching & compressed dedup) sounds terribly attractive, but when things don't just work, it's more terrifying.
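
(For readability, the failing loop item above corresponds to cache variables along these lines; this is just the same values from the error output laid out as the role's gluster_infra_cache_vars, nothing new:)

    gluster_infra_cache_vars:
      - vgname: gluster_vg_{disklabel}p1
        cachedisk: /dev/sda4
        cachelvname: cachelv_gluster_thinpool_gluster_vg_{disklabel}p1
        cachethinpoolname: gluster_thinpool_gluster_vg_{disklabel}p1
        cachelvsize: '630G'
        cachemetalvname: cache_gluster_thinpool_gluster_vg_{disklabel}p1
        cachemetalvsize: '70G'
        cachemode: writeback

The "Extend volume group" task is thus trying to extend gluster_vg_{disklabel}p1, whose existing PV is the VDO device /dev/mapper/vdo_{disklabel}p1, with the cache disk /dev/sda4.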

On Wed, Sep 4, 2019 at 9:27 PM <thomas@hoberg.net> wrote:
Hi Thomas, the problem is the way we have to write the variables for Ansible 2.8 while setting up the cache. Currently we write something like this:
gluster_infra_cache_vars:
   - vgname: vg_sdb2
     cachedisk: /dev/sdb3
     cachelvname: cachelv_thinpool_vg_sdb2
     cachethinpoolname: thinpool_vg_sdb2
     cachelvsize: '10G'
     cachemetalvsize: '2G'
     cachemetalvname: cache_thinpool_vg_sdb2
     cachemode: writethrough
===================
Note that cachedisk is provided as /dev/sdb3, which the volume group vg_sdb2 will be extended with ... this works well. The module takes care of extending the VG with /dev/sdb3.
However, with Ansible 2.8 we cannot provide it like this, but have to be more explicit and mention the PV underlying this volume group vg_sdb2. So, with respect to 2.8, we have to write that variable like:
gluster_infra_cache_vars:
   - vgname: vg_sdb2
     cachedisk: '/dev/sdb2,/dev/sdb3'
     cachelvname: cachelv_thinpool_vg_sdb2
     cachethinpoolname: thinpool_vg_sdb2
     cachelvsize: '10G'
     cachemetalvsize: '2G'
     cachemetalvname: cache_thinpool_vg_sdb2
     cachemode: writethrough
=====================
Note that I have mentioned both /dev/sdb2 and /dev/sdb3. This change is backward compatible, that is, it works with 2.7 as well. I have raised an issue with Ansible as well, which can be found here: https://github.com/ansible/ansible/issues/56501 However, @olafbuitelaar has fixed this in gluster-ansible-infra, and the patch is merged in master. If you can check out the master branch, you should be fine.
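
(To make the 2.8 behavior change concrete: the "Extend volume group" task in backend_setup uses Ansible's lvg module, and from 2.8 onward lvg appears to treat the pvs list as the complete set of PVs for the VG, trying to remove anything not listed - hence the "Unable to reduce ... by /dev/dm-15." message in the first post. A hypothetical stand-alone equivalent of the two forms, using the vg_sdb2 example names from above:)

    # 2.7-style: only the new cache disk is listed; with 2.8 the module then tries
    # to reduce the VG by the PV that is missing from the list, and fails.
    - name: Extend volume group (old form)
      lvg:
        vg: vg_sdb2
        pvs: /dev/sdb3

    # 2.8-compatible: list the existing PV plus the new cache disk.
    - name: Extend volume group (explicit form)
      lvg:
        vg: vg_sdb2
        pvs: /dev/sdb2,/dev/sdb3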

Thanks a ton! On one hand I'm glad it's a bug that is now known and fixed; on the other hand I am more scared than ever that oVirt is too raw to upgrade without intensive QA.

I'll try both the manual approach and the new Ansible scripts once I've overcome a new problem that keeps me busy (that will be a new post). So when would the change flow into the current oVirt release? 4.3.6 or 4.4?

(This could become a double post because I am using e-mail to attach the logs...)

Hi URS,

I have tried again using the latest release (4.3.7) and noted that the more "explicit" variant you quote is now generated. The behavior has changed, but it still fails, now complaining about /dev/sdb being mounted (or inaccessible in some other way). I am attaching the logs.

I have an HDD RAID on /dev/sdb and an SSD partition on /dev/sda3 with >600GB of space left. I have mostly gone with defaults everywhere, used an arbiter (at least for the vmstore and data volumes), VDO, and write-through caching with 550GB size (note that it fails to apply that value beyond the first node).

Has anyone else recently tried a hyperconverged 3-node setup with SSD caching, with success?

Thanks for your feedback and help so far,
Thomas
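
(For comparison with the explicit form quoted above, a hypothetical sketch of what the cache variables might need to look like for this layout - HDD RAID on /dev/sdb, cache on /dev/sda3, 550G cache. All names here are assumptions, and the metadata size is just a guess; in particular, with VDO enabled the PV actually underlying the volume group is the VDO mapper device, as seen in the very first error message, not the raw /dev/sdb:)

    gluster_infra_cache_vars:
      - vgname: gluster_vg_sdb
        # first entry: the PV already in the VG (the VDO device when VDO is enabled),
        # second entry: the SSD partition to add as the cache device
        cachedisk: '/dev/mapper/vdo_sdb,/dev/sda3'
        cachelvname: cachelv_gluster_thinpool_gluster_vg_sdb
        cachethinpoolname: gluster_thinpool_gluster_vg_sdb
        cachelvsize: '550G'
        cachemetalvsize: '30G'
        cachemetalvname: cache_gluster_thinpool_gluster_vg_sdb
        cachemode: writethrough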


Hi Thomas, can you please share the "hc_wizard_inventory.yml" file which is under /etc/ansible/?

On Thu, Nov 28, 2019 at 11:26 PM Thomas Hoberg <thomas@hoberg.net> wrote:
-- Thanks, Gobinda
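
(For anyone following along: hc_wizard_inventory.yml is the inventory the cockpit HC wizard generates for gluster-ansible, and the cache settings discussed in this thread end up in it per host. A rough, hypothetical sketch of the relevant fragment - the host name, devices and surrounding structure are assumptions, not copied from a real file:)

    hc_nodes:
      hosts:
        host1.example.com:
          gluster_infra_volume_groups:
            - vgname: gluster_vg_sdb
              pvname: /dev/sdb
          gluster_infra_cache_vars:
            - vgname: gluster_vg_sdb
              cachedisk: '/dev/sdb,/dev/sda3'
              cachelvname: cachelv_gluster_thinpool_gluster_vg_sdb
              cachethinpoolname: gluster_thinpool_gluster_vg_sdb
              cachelvsize: '550G'
              cachemetalvsize: '30G'
              cachemetalvname: cache_gluster_thinpool_gluster_vg_sdb
              cachemode: writethrough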

Hi Gobinda,

unfortunately it's long gone, because I went back to an un-cached setup. It was mostly a trial anyway: I had to re-do the 3-node HCI because it had died rather horribly on me (a repeating issue I have so far had on distinct sets of hardware, that I am still trying to hunt down... separate topic). And since it was a blank(ed) set of servers, I just decided to try the SSD cache, to see if the Ansible script generation issue had been sorted out upstream as described. I was rather encouraged to see that the Ansible script now included the changes that URS had described as becoming necessary with a new Ansible version.

It doesn't actually make a lot of sense in this setup, because the SSD cache is a single Samsung EVO 860 1TB unit while the storage is a RAID6 of 7x 4TB 2.5" drives (per server): both have similar bandwidth, and IOPS would be very much workload dependent (the second SSD I intended to use as a mirror was unfortunately cut from the budget). The SSD has space left over because the OS doesn't need that much, but I don't dare use a single SSD as a write-back cache, especially because the RAID controller (HP420i) hides all wear information and doesn't seem to pass TRIM either, and for write-through I'm not sure it would do noticeably better than the RAID controller (which I configured not to cache the SSD, too). So after it failed, I simply went back to no cache for now. This HCI cluster uses relatively low-power hardware recalled from retirement that will host functional VMs, not high-performance workloads. The machines are well equipped with RAM, and that's always the fastest cache anyway.

I guess you should be able to add and remove the SSD as a cache layer at any time during operation, because it sits at a level oVirt doesn't manage, and I'd love to see examples of how it's done. Especially the removal part would be important to know, in case your SSD signals unexpected levels of wear and you need to swap it out on the fly.

If I come across another opportunity to test (most likely a single node), I will update here and make sure to collect a full set of log files including the main Ansible config file.

Thank you for your interest and the follow-up,
Thomas
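
(Since the question above is how a cache layer could be attached and detached on the fly: here is a minimal, hypothetical sketch written as Ansible tasks wrapping the plain LVM commands. The VG/LV names mirror the wizard-style names earlier in the thread but are assumptions, and this is not what gluster-ansible itself runs, just one way the attach/detach could look. The detach step, lvconvert --uncache, flushes dirty blocks back to the slow device before dropping the cache pool, which is the part that matters when an SSD starts reporting wear.)

    # Hypothetical ad-hoc playbook; device, VG and LV names are examples only.
    - hosts: gluster_nodes
      become: true
      tasks:
        - name: Create the cache data LV on the SSD partition (assumes /dev/sda3 is already a PV in the VG)
          command: lvcreate -L 550G -n cachelv_gluster_thinpool_gluster_vg_sdb gluster_vg_sdb /dev/sda3

        - name: Create the cache metadata LV on the same SSD
          command: lvcreate -L 30G -n cache_gluster_thinpool_gluster_vg_sdb gluster_vg_sdb /dev/sda3

        - name: Combine the data and metadata LVs into a cache pool
          command: >
            lvconvert --yes --type cache-pool
            --poolmetadata gluster_vg_sdb/cache_gluster_thinpool_gluster_vg_sdb
            --cachemode writethrough
            gluster_vg_sdb/cachelv_gluster_thinpool_gluster_vg_sdb

        - name: Attach the cache pool to the Gluster thin pool
          command: >
            lvconvert --yes --type cache
            --cachepool gluster_vg_sdb/cachelv_gluster_thinpool_gluster_vg_sdb
            gluster_vg_sdb/gluster_thinpool_gluster_vg_sdb

        # Removal later on (e.g. on SSD wear): flush the cache and detach it.
        # - name: Flush and detach the cache
        #   command: lvconvert --uncache gluster_vg_sdb/gluster_thinpool_gluster_vg_sdb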

When I first deployed my oVirt lab (v4.2.7 was the latest and greatest), the Ansible playbook didn't work for me. So I decided to stop the gluster processes on one of the nodes, wipe all LVM and recreate it manually. In the end I managed to use my SSD for write-back cache - but I found out that if your chunk size is larger than the default limit, it will never push the data out to the spinning disks. For details you can check bug 1668163 – "LVM cache cannot flush buffer, change cache type or lvremove LV (CachePolicy 'cleaner' also doesn't work)".

As we use either 'replica 2 arbiter 1' (old name 'replica 3 arbiter 1') or a pure replica 3, we can afford a gluster node going 'pouf' as long as we have decent bandwidth and we use sharding. So far I have changed my brick layout at least twice (for the cluster) without the VMs being affected - so you can still try the caching, but please check the comments in #1668163 about the chunk size of the cache.

Best Regards,
Strahil Nikolov
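
(Regarding the chunk-size pitfall Strahil mentions: a small, hypothetical check, again as an ad-hoc Ansible task, to report the chunk size, cache policy and dirty-block count of every LV in the VG before trusting a write-back cache. The VG and group names are assumptions:)

    - hosts: gluster_nodes
      become: true
      tasks:
        - name: Report chunk size, cache policy and dirty blocks for every LV in the VG
          command: lvs -a -o lv_name,chunk_size,cache_policy,cache_dirty_blocks gluster_vg_sdb
          register: lvs_report
          changed_when: false

        - name: Show the report
          debug:
            var: lvs_report.stdout_lines
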
participants (5)
- Gobinda Das
- Sachidananda URS
- Strahil Nikolov
- Thomas Hoberg
- thomas@hoberg.net