
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Sander Grendelman" <sander@grendelman.com> Cc: users@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com> Sent: Wednesday, December 18, 2013 5:46:37 PM Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
----- Original Message -----
From: "Sander Grendelman" <sander@grendelman.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Michal Skrivanek" <mskrivan@redhat.com>, users@ovirt.org, "Ayal Baron" <abaron@redhat.com> Sent: Wednesday, December 18, 2013 4:00:21 PM Subject: Re: [Users] Excessive syslog logging from vdsm/sampling.py
On Wed, Dec 18, 2013 at 1:01 PM, Nir Soffer <nsoffer@redhat.com> wrote:
We should check why vmDrive does not have the format attribute in your case.
Probably because this was checked during migration it's a race between migration finish and monitoring checking this. You can see the error repeated in other cases as well (in bugzilla): 988047, 994534
Then this patch should fix your problem: http://gerrit.ovirt.org/22518
I applied this patch and restarted vdsmd on both nodes (maintenance->restart->activate). The errors still occur, but it is limited to fewer machines ( see [1] ).
- Almost all VMs were imported with virt-v2v (not only the VMs with errors). - Almost all VMs were migrated to another storage domain (offline). - Migration to another host was done with all VMs
The current vdsm logs for both are attached to this e-mail. Please let me know if you need any additional information.
[1]: unique sampling.py errors on both nodes.
#### Node 1 ## Before patch [root@gnkvm01 ~]# tail -n 2000 /var/log/messages-20131215 | awk '/sampling.py/ {print $8}' | sort -u vmId=`0ae3a3d7-ead9-4c0d-9df0-3901b6e6859c`::Stats vmId=`22654002-cbef-454d-b001-7823da5f592f`::Stats vmId=`3e481c73-57df-4dc3-8b1c-421a74308a5e`::Stats vmId=`57dbe688-4e18-4358-aa3e-f3f6022ef9b3`::Stats vmId=`66aa5555-2299-4d93-931d-b7a2e421b7e9`::Stats vmId=`6df65698-4995-4c75-9433-75affe9b9c38`::Stats vmId=`9260c69c-93a2-4f8a-b5e9-eaab5e4f4708`::Stats vmId=`9edb3e08-f098-4633-a122-e5ba29ae12ea`::Stats vmId=`c6f56584-1ccd-4c02-be94-897a4e747d34`::Stats vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
## After patch [root@gnkvm01 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/ {print $8}' | sort -u vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats
#### Node 2 ## Before patch [root@gnkvm02 ~]# tail -n 2000 /var/log/messages-20131215 | awk '/sampling.py/ {print $8}' | sort -u vmId=`00317758-16fe-4ac6-b9fd-d522c9908861`::Stats vmId=`007ca72e-d0d0-4477-87d4-fb60328cd882`::Stats vmId=`06405f12-d763-4bd6-b5e5-997e3f6bb1f6`::Stats vmId=`1075a178-a4c6-4a8f-a199-56401cd0652f`::Stats vmId=`1bba8930-9c04-4c5c-8b15-c9fe14022cb5`::Stats vmId=`2036c21d-e0a4-4d55-a9a7-4cd9dd9d250d`::Stats vmId=`5fff0cc7-24e4-4e4a-b220-ba49f9145060`::Stats vmId=`86708f62-fcc6-4d0f-978a-3788a61f9775`::Stats vmId=`9b8e6d07-295c-404d-a672-efc94a24b6bc`::Stats vmId=`aa0445b6-8ca5-4557-9f9b-ee543d6435df`::Stats
## After patch [root@gnkvm02 ~]# tail -n 2000 /var/log/messages | awk '/sampling.py/ {print $8}' | sort -u vmId=`d3dae626-279b-4bcf-afc4-7a3c198a3035`::Stats
I think I understand your logs:
On node1 you started vdsm with the patch on 2013-12-18 13:33:18.
The error "object has no attribute 'format'" is gone with this patch.
On node2 I don't see this error at all, and I don't see a restart of vdsm. Maybe the entire log was generated with the patch?
It would be nice if you open a bug for this issue.
So one problem is gone, but you have other problems:
Thread-361::DEBUG::2013-12-18 14:41:42,512::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 8 edom: 10 level: 2 message: invalid argument: invalid path /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc not assigned to domain
$ grep 'not assigned to domain' vdsm-node1.log | wc -l 11488 $ grep 'not assigned to domain' vdsm-node2.log | wc -l 3326
$ grep 'not assigned to domain' vdsm-node1.log | awk '{print $6}' | sort -u /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11 /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/47015859-1995-4ce2-880c-a3c7068a67dd/341b32d6-4276-454d-b3f0-789b705c99cc /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/c928f2c7-9ba0-496f-8be9-b8804fdc1a6d/a2da1070-8de7-4e1f-b736-4c88d089a5cc
$ grep 'not assigned to domain' vdsm-node2.log | awk '{print $6}' | sort -u /rhev/data-center/mnt/blockSD/b526b148-b810-47c6-9bdd-4fd8d8226855/images/347f2238-c018-4370-94df-bd1e81f8b854/9e5dad95-73ea-4e5c-aa13-522efd9bad11
This issue deserve its own bug.
The version you run has several problems: 1. Sometimes drive does not have a format attribute. We do not know yet why this happens. 2. Sometimes sampling drive fails because the drive path is unknown to libvirt. This may be a non-existing drive that should have been removed. We do not know yet why this happens. 3. When such error happen, and you have multiple disks on a vm, sampling stop and next disks in the disk list are skipped. This can explain why your disk stats are constant. 4. The errors spam your vdsm log and syslog These 3 patches do not fix the underlying problems causing the errors, but improve error handling and decrease log spam. 1. http://gerrit.ovirt.org/22518 - vm: Avoid log spamming when drive format is undefined This is a cleaner replacement for the patch you already applied, and required by the next two patches. This patch is likely to be merged to master 2. http://gerrit.ovirt.org/22575 - vm: Continue to sample after errors This patch fix error handling so failure to sample one disk does not effect sampling of other disks. You should see disk stats for "good" drives even if you have one "bad" drive on a vm. 3. http://gerrit.ovirt.org/22576 - vm: Decrease log spam for expected errors This patch logs one line for expected errors instead of 15 lines traceback, reducing log spam. It will be nice if you test these patches and report the results. You ping me on #vdsm if you like to discuss this or need more help. Thanks, Nir