Hi,

I have a 2-node cluster with Hosted Engine attached to a storage domain (NFS share) served by Windows Server 2016.  I run about a dozen VMs.

I need to improve the availability and resilience of the storage domain, as well as its I/O performance.

Any time we need to reboot the Windows server, it's a nightmare for the cluster: we have to put everything into maintenance and take it down.  When the storage server crashes (has happened once) or Windows decides to install an update and reboot (has happened once), the storage domain obviously goes down, and sometimes the hosts have a difficult time reconnecting.

I can afford a second bare-metal server and am looking for input on the best way to provide a highly available storage domain.  Ideally I'd like to be able to reboot either storage server without disrupting oVirt. Should I be looking at clustering with Windows Server, or moving to a different OS?

I currently run the storage on RAID 10 (spinning disks) and have the option of adding CacheCade to the array with SSDs.  Would that help I/O for small random reads and writes?

What are the suggested options for this scenario?

Thanks