I have seen the same performance problem when creating a new template or a new
stateful image from a template. I have traced it to the use of qemu-img convert, which is
very slow.
Currently, creating a template on my setup takes around 29m for a 45G image, whereas a
simple copy over NFS takes 1m26s over a 10G network. The copy time is expected, since
there are roughly as many reads as writes going over the network (about 4Gbps).
Adding --cache writeback to qemu-img improves this from 29m to around 8m, but according
to a prior comment on this mailing list it might not work on all platforms.
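
As a rough sanity check on the timings above, here is a small sketch that turns them
into effective throughput. It assumes "45G" means 45 decimal gigabytes and counts the
image size once per run; the function name is only for illustration.

```python
def throughput_gbps(size_gb: float, seconds: float) -> float:
    """Effective throughput in gigabits per second for size_gb gigabytes."""
    return size_gb * 8 / seconds


IMAGE_GB = 45  # assumption: decimal gigabytes

# Durations reported above, converted to seconds.
timings = {
    "qemu-img convert (default)": 29 * 60,
    "qemu-img convert (writeback cache)": 8 * 60,
    "plain copy over NFS": 1 * 60 + 26,
}

for label, seconds in timings.items():
    print(f"{label}: {throughput_gbps(IMAGE_GB, seconds):.2f} Gbps")
```

Under those assumptions this gives roughly 0.2 Gbps for the default convert, 0.75 Gbps
with the writeback cache, and 4.2 Gbps for the plain copy.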
Hope that helps