On Wed, May 25, 2016 at 6:17 PM, David Caro <dcaro@redhat.com> wrote:
On 05/25 17:06, David Caro wrote:
> On 05/25 16:09, Barak Korren wrote:
> > On 25 May 2016 at 14:52, David Caro <dcaro@redhat.com> wrote:
> > > On 05/25 14:42, Barak Korren wrote:
> > >> On 25 May 2016 at 12:44, Eyal Edri <eedri@redhat.com> wrote:
> > >> > OK,
> > >> > I suggest testing with a VM on local disk (preferably on a host with an SSD
> > >> > configured), and if it works,
> > >> > let's expedite moving all VMs, or at least a large number of them, to it
> > >> > until we see the network load reduced.
> > >> >
> > >>
> > >> This is not that easy: oVirt doesn't support mixing local disk and
> > >> shared storage in the same cluster, so we would need to move hosts to a
> > >> new cluster for this.
> > >> We would also lose the ability to use templates, or else have to
> > >> create the templates on each and every local disk.
> > >>
> > >> The scratch disk is a good solution for this, where you can have the
> > >> OS image on the central storage and the ephemeral data on the local
> > >> disk.
> > >>
> > >> WRT the storage architecture - a single huge (10.9T) ext4 filesystem is
> > >> used on top of the DRBD device; this is probably not the most efficient
> > >> thing one can do (XFS would probably have been better, and raw LUNs over
> > >> iSCSI even better).
> > >
> > > That was done >3 years ago; XFS was not as stable, widely used or well
> > > supported back then.
> > >
> > AFAIK it pre-dates EXT4
>
> It does, but on el6 it performed considerably worse and had more bugs
> (according to the reviews of it at the time).
>
> > in any case this does not detract from the
> > fact that the current configuration is not as efficient as we can make
> > it.
> >
>
> It does not; I agree it's better to focus on what we can do from now on, not
> on what should have been done then.
>
> >
> > >>
> > >> I'm guessing that those 10.9TB are not a single disk but
> > >> a hardware RAID of some sort. In that case, deactivating the
> > >> hardware RAID and re-exposing it as multiple separate iSCSI LUNs (that
> > >> are then re-joined into a single storage domain in oVirt) will enable
> > >> different VMs to work concurrently on different disks. This should
> > >> lower the per-VM storage latency.
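
Just to make the idea concrete - here is a rough sketch of how re-adding those
LUNs as a single iSCSI data domain could look with the oVirt Python SDK
(ovirtsdk4). All names, addresses, credentials and LUN IDs below are made up,
not our actual setup:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical engine URL and credentials, just for illustration.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

sds_service = connection.system_service().storage_domains_service()

# One storage domain built from several separate LUNs, so different VMs can
# end up doing IO against different physical disks.
sds_service.add(
    types.StorageDomain(
        name='jenkins-iscsi',
        type=types.StorageDomainType.DATA,
        host=types.Host(name='host1.example.com'),  # any host that sees the target
        storage=types.HostStorage(
            type=types.StorageType.ISCSI,
            logical_units=[
                types.LogicalUnit(
                    id=lun_id,                      # LUN IDs as exposed by the target
                    address='storage.example.com',  # iSCSI portal address
                    port=3260,
                    target='iqn.2016-05.com.example:jenkins',
                )
                for lun_id in ('lun-id-1', 'lun-id-2', 'lun-id-3')
            ],
        ),
    ),
)

connection.close()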
> > >
> > > That would get rid of the DRBD too; it's a totally different setup, built
> > > from scratch (no NFS either).
> >
> > We can and should still use DRBD, just set up a device for each disk.
> > But yeah, NFS should probably go away.
> > (We are seeing dramatically better performance for iSCSI in
> > integration-engine)
>
> Then I don't understand what you said about splitting the hardware RAIDs; do
> you mean setting up one DRBD device on top of each hard drive instead?


Though I really think we should move to Gluster/Ceph instead for the Jenkins
VMs. Does anyone know what the current status of hyperconverged is?


I don't think either Gluster or hyperconverged is stable enough yet to move all of our production infra onto.
Hyperconverged is also not yet supported by oVirt (it might be a 4.x feature)
 
That would allow us better scalable distributed storage and would let us
properly use the hosts' local disks (right now we have more space on the
combined hosts than on the storage servers).

I agree a stable distributed storage solution is the way to go if we can find one :)
 

>
>
> btw, I think the NFS is also used for something more than just the engine
> storage domain (just keep in mind that this has to be checked if we are
> going to get rid of it)
>
> >
> > >
> > >>
> > >> Looking at the storage machine I see strong indications that it is IO-bound
> > >> - the load average is ~12 while there are just 1-5 running processes
> > >> and the CPU is ~80% idle, with the rest being IO wait.
> > >>
> > >> Running 'du *' at:
> > >> /srv/ovirt_storage/jenkins-dc/658e5b87-1207-4226-9fcc-4e5fa02b86b4/images
> > >> one can see that most images are ~40G in size (that is _real_ 40G, not
> > >> sparse!). This means that despite most VMs being created from templates,
> > >> the VMs are full template copies rather than COW clones.
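
To double-check that on the storage machine itself, a quick Python sketch like
the one below (the path is the one Barak mentioned, everything else is stdlib)
compares each image's apparent size with the blocks actually allocated on disk:

import os

IMAGES = '/srv/ovirt_storage/jenkins-dc/658e5b87-1207-4226-9fcc-4e5fa02b86b4/images'

for root, _dirs, files in os.walk(IMAGES):
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        apparent = st.st_size           # size the file claims to be
        allocated = st.st_blocks * 512  # bytes actually allocated on disk
        print('%6.1fG allocated of %6.1fG apparent  %s' % (
            allocated / 1024.0**3, apparent / 1024.0**3, path))

If allocated is close to apparent for most images, they really are full copies
rather than thin/COW ones.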
> > >
> > > It should not be like that; maybe the templates are misconfigured? Or the
> > > Foreman images?
> >
> > This is the expected behaviour when creating a VM from a template in the
> > oVirt admin UI. I thought Foreman might behave differently, but it
> > seems it does not.
> >
> > This behaviour is determined by the parameters you pass to the engine
> > API when instantiating a VM, so it most probably doesn't have anything
> > to do with the template configuration.
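
For reference, this is roughly how that choice looks through the Python SDK
(ovirtsdk4) - as far as I know it's the 'clone' flag on the add call that
decides between a thin COW copy and a full independent copy. The VM, cluster
and template names below are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical engine URL and credentials, just for illustration.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()

vms_service.add(
    types.Vm(
        name='jenkins-slave-01',
        cluster=types.Cluster(name='jenkins-cluster'),
        template=types.Template(name='el7-slave-template'),
    ),
    # clone=False -> thin/COW copy of the template disks,
    # clone=True  -> full independent copy (what we seem to be getting now).
    clone=False,
)

connection.close()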
>
> So maybe a misconfiguration in Foreman?
>
> >
> > >
> > >> What this means is that using pools (where all VMs are COW copies of
> > >> the single pool template) is expected to significantly reduce the
> > >> storage utilization and therefore the IO load on it (the less you
> > >> store, the less you need to read back).
> > >
> > > That should also happen without pools, with normal qcow templates.
> >
> > Not unless you create all the VMs via the API and pass the right
> > parameters. Pools are the easiest way to ensure you never mess that
> > up...
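
And for completeness, a rough sketch of creating such a pool through the SDK,
where every VM taken from the pool stays a COW copy of the single pool
template (again, the names, size and credentials here are made up):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical engine URL and credentials, just for illustration.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

pools_service = connection.system_service().vm_pools_service()

pools_service.add(
    types.VmPool(
        name='jenkins-slaves',
        size=20,  # number of VMs in the pool (placeholder value)
        cluster=types.Cluster(name='jenkins-cluster'),
        template=types.Template(name='el7-slave-template'),
    ),
)

connection.close()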
>
> That was the idea
>
> >
> > > And in any case, that will not lower the normal IO when not actually
> > > creating VMs, as any read and write will still hit the disk anyhow; it
> > > only alleviates the IO when creating new VMs.
> >
> > Since you are reading the same bits over and over (for different VMs),
> > you let the various buffer caches along the way (in the storage
> > machines and in the hypervisors) do what they are supposed to.
>
>
> Once the VM is started, mostly everything that's needed is in RAM, so there
> aren't that many reads from disk unless you start writing to it, and that's
> mostly what we are hitting: lots of writes.
>
> >
> > > The local disk (scratch disk) is the best option
> > > imo, now and for the foreseeable future.
> >
> > This is not an either/or thing, IMO we need to do both.
>
> I think that it's way more useful, because it will solve our current issues
> faster and for longer, so IMO it should get more attention sooner.
>
> Any improvement that does not remove the current bottleneck is not really
> giving any value to the overall infra (even if it might become valuable later).
>
> >
> > --
> > Barak Korren
> > bkorren@redhat.com
> > RHEV-CI Team
>
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>
> Tel.: +420 532 294 605
> Email: dcaro@redhat.com
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> Web: www.redhat.com
> RHT Global #: 82-62605



--
David Caro

Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D

Tel.: +420 532 294 605
Email: dcaro@redhat.com
IRC: dcaro|dcaroest@{freenode|oftc|redhat}
Web: www.redhat.com
RHT Global #: 82-62605



--
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)