----- Original Message -----
From: "Nir Soffer" <nsoffer(a)redhat.com>
To: "Sander Grendelman" <sander(a)grendelman.com>
Cc: users(a)ovirt.org, "Michal Skrivanek" <mskrivan(a)redhat.com>
Sent: Wednesday, December 18, 2013 5:46:37 PM
Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
----- Original Message -----
> From: "Sander Grendelman" <sander(a)grendelman.com>
> To: "Nir Soffer" <nsoffer(a)redhat.com>
> Cc: "Michal Skrivanek" <mskrivan(a)redhat.com>, users(a)ovirt.org,
"Ayal Baron"
> <abaron(a)redhat.com>
> Sent: Wednesday, December 18, 2013 4:00:21 PM
> Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
>
> On Wed, Dec 18, 2013 at 1:01 PM, Nir Soffer <nsoffer(a)redhat.com> wrote:
> >> > We should check why vmDrive does not have the format attribute in
your
> >> > case.
> >>
> >> Probably because this was checked during migration it's a race between
> >> migration finish and monitoring checking this.
> >> You can see the error repeated in other cases as well (in bugzilla):
> >> 988047,
> >> 994534
> >
> > Then this patch should fix your problem:
> >
http://gerrit.ovirt.org/22518
>
> I applied this patch and restarted vdsmd on both nodes
> (maintenance->restart->activate).
> The errors still occur, but it is limited to fewer machines ( see [1] ).
>
> - Almost all VMs were imported with virt-v2v (not only the VMs with
> errors).
> - Almost all VMs were migrated to another storage domain (offline).
> - Migration to another host was done with all VMs
>
> The current vdsm logs for both are attached to this e-mail.
> Please let me know if you need any additional information.
>
> [1]: unique sampling.py errors on both nodes.
>
> #### Node 1
> ## Before patch
> [root@gnkvm01 ~]# tail -n 2000 /var/log/messages-20131215 | awk
> '/sampling.py/ {print $8}' | sort -u
> vmId=`0ae3a3d7-ead9-4c0d-9df0-3901b6e6859c`::Stats
> vmId=`22654002-cbef-454d-b001-7823da5f592f`::Stats
> vmId=`3e481c73-57df-4dc3-8b1c-421a74308a5e`::Stats
> vmId=`57dbe688-4e18-4358-aa3e-f3f6022ef9b3`::Stats
> vmId=`66aa5555-2299-4d93-931d-b7a2e421b7e9`::Stats
> vmId=`6df65698-4995-4c75-9433-75affe9b9c38`::Stats
> vmId=`9260c69c-93a2-4f8a-b5e9-eaab5e4f4708`::Stats
> vmId=`9edb3e08-f098-4633-a122-e5ba29ae12ea`::Stats
> vmId=`c6f56584-1ccd-4c02-be94-897a4e747d34`::Stats
> vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
>
> ## After patch
> [root@gnkvm01 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/
> {print $8}' | sort -u
> vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats
> vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats
>
> #### Node 2
> ## Before patch
> [root@gnkvm02 ~]# tail -n 2000 /var/log/messages-20131215 | awk
> '/sampling.py/ {print $8}' | sort -u
> vmId=`00317758-16fe-4ac6-b9fd-d522c9908861`::Stats
> vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats
> vmId=`06405f12-d763-4bd6-b5e5-997e3f6bb1f6`::Stats
> vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats
> vmId=`1bba8930-9c04-4c5c-8b15-c9fe14022cb5`::Stats
> vmId=`2036c21d-e0a4-4d55-a9a7-4cd9dd9d250d`::Stats
> vmId=`5fff0cc7-24e4-4e4a-b220-ba49f9145060`::Stats
> vmId=`86708f62-fcc6-4d0f-978a-3788a61f9775`::Stats
> vmId=`9b8e6d07-295c-404d-a672-efc94a24b6bc`::Stats
> vmId=`aa0445b6-8ca5-4557-9f9b-ee543d6435df`::Stats
>
> ## After patch
> [root@gnkvm02 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/
> {print $8}' | sort -u
> vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
I think I understand your logs:
On node1 you started vdsm with the patch on 2013-12-18 13:33:18.
The error "object has no attribute 'format'" is gone with this patch.
On node2 I don't see this error at all, and I don't see a restart of vdsm.
Maybe the entire log was generated with the patch?
It would be nice if you open a bug for this issue.
So one problem is gone, but you have other problems:
Thread-361::DEBUG::2013-12-18
14:41:42,512::libvirtconnection::108::libvirtconnection::(wrapper) Unknown
libvirterror: ecode: 8 edom: 10 level: 2 message: invalid argument: invalid
path
/rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc
not assigned to domain
$ grep 'not assigned to domain' vdsm-node1.log | wc -l
11488
$ grep 'not assigned to domain' vdsm-node2.log | wc -l
3326
$ grep 'not assigned to domain' vdsm-node1.log | awk '{print $6}' | sort
-u
/rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11
/rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/47015859-1995-4ce2-880c-a3c7068a67dd/341b32d6-4276-454d-b3f0-789b705c99cc
/rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc
$ grep 'not assigned to domain' vdsm-node2.log | awk '{print $6}' | sort
-u
/rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11
This issue deserve its own bug.
The version you run has several problems:
1. Sometimes drive does not have a format attribute. We do not know yet why this
happens.
2. Sometimes sampling drive fails because the drive path is unknown to libvirt.
This may be a non-existing drive that should have been removed. We do not
know yet why this happens.
3. When such error happen, and you have multiple disks on a vm, sampling stop
and next disks in the disk list are skipped. This can explain why your disk
stats are constant.
4. The errors spam your vdsm log and syslog
These 3 patches do not fix the underlying problems causing the errors, but
improve error handling and decrease log spam.
1.
http://gerrit.ovirt.org/22518 - vm: Avoid log spamming when drive format is undefined
This is a cleaner replacement for the patch you already applied, and required by
the next two patches. This patch is likely to be merged to master
2.
http://gerrit.ovirt.org/22575 - vm: Continue to sample after errors
This patch fix error handling so failure to sample one disk does not effect sampling
of other disks. You should see disk stats for "good" drives even if you have
one "bad"
drive on a vm.
3.
http://gerrit.ovirt.org/22576 - vm: Decrease log spam for expected errors
This patch logs one line for expected errors instead of 15 lines traceback,
reducing log spam.
It will be nice if you test these patches and report the results.
You ping me on #vdsm if you like to discuss this or need more help.
Thanks,
Nir