Heads up! Influence of our recent Mock/Proxy changes on Lago jobs

Hi infra team members!

As you may know, we've recently changed our proxied Mock configuration so that the 'http_proxy' environment variable gets defined inside the Mock environment. This was done in an effort to make 'pip', 'curl' and 'wget' commands go through our PHX proxy. As it turns out, this also has an unforeseen influence on yum tools.

Now, when it comes to yum as it is used inside the Mock environment, we have long had the proxied configuration hard-wire it to use the proxy by setting it in "yum.conf". However, so far, yum tools (such as reposync) that brought their own configuration essentially bypassed the "yum.conf" file and hence were not using the proxy. Well, now it turns out that 'yum' and the derived tools also respect the 'http_proxy' environment variable [1]:

    10.2. Configuring Proxy Server Access for a Single User

    To enable proxy access for a specific user, add the lines in the example box below to the user's shell profile. For the default bash shell, the profile is the file ~/.bash_profile. The settings below enable yum to use the proxy server mycache.mydomain.com, connecting to port 3128.

        # The Web proxy server used by this account
        http_proxy="http://mycache.mydomain.com:3128"
        export http_proxy

This is generally a good thing, but it can lead to unexpected consequences.

Case in point: the Lago job reposync failures of last Thursday (Dec 22nd, 2016).

The root cause behind the failures was that the "ovirt-web-ui-0.1.0-4.el7.centos.x86_64.rpm" file was changed in the "ovirt-master-snapshot-static" repo. Updating an RPM file without changing the version or release numbers breaks yum's rules and makes reposync choke. We already knew about this and actually had a work-around in the Lago code [2].

When I came in Thursday morning and saw reposync failing in all the Lago jobs, I just assumed that our work-around had simply failed to work. My assumption was reinforced by the fact that I was able to reproduce the issue by running 'reposync' manually on the Lago hosts, and also managed to rectify it by removing the offending file from the reposync cache. I spent the next few hours chasing down failing jobs and cleaning up the caches on the hosts they ran on. It took me a while to figure out that I was seeing the problem (essentially, the older version of the package file) reappear on the same hosts over and over again! Wondering how that could be, and after ensuring the older package file was nowhere to be found in any of the repos the jobs were using, Gal and I took a look at the Lago code to see if it could be causing the issue. Imagine our puzzlement when we realized the work-around code was doing _exactly_ what I was doing manually, and still somehow managed to make the very issue it was designed to solve reappear! Eventually the problem seemed to disappear on its own.

Now, armed with the knowledge above, I can provide a plausible explanation for what we were seeing. The difference between my manual executions of 'reposync' and the way Lago was running it was that Lago was running within Mock, where 'http_proxy' was defined. What was probably happening is that reposync kept getting the old RPM file from the proxy while still getting newer yum metadata.

Conclusion - the next time such an issue arises, we must make sure to clear the PHX proxy cache. There is actually no need to clear the cache on the Lago hosts themselves, because our work-around will resolve the issue there. Longer term, we may configure the proxy not to cache files coming from resources.ovirt.org.
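To summarize the moving parts in one place (the proxy address and paths below are made up for illustration, not our actual PHX settings):

    # 1. The Mock change: 'http_proxy' is now exported inside the chroot, so
    #    anything that honours it (pip, curl, wget - and, it turns out, the yum
    #    tools) will go through the proxy:
    export http_proxy="http://proxy.phx.example.com:3128"

    # 2. yum itself was already pinned to the proxy via yum.conf:
    #        [main]
    #        proxy=http://proxy.phx.example.com:3128

    # 3. reposync invoked with its own configuration used to bypass yum.conf
    #    (and therefore the proxy); with 'http_proxy' exported it now goes
    #    through the proxy as well:
    reposync -c /etc/custom-reposync.conf \
             -r ovirt-master-snapshot-static \
             -p /var/cache/lago-repos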
[1]: https://www.centos.org/docs/5/html/yum/sn-yum-proxy-server.html
[2]: https://github.com/lago-project/lago/blob/master/ovirtlago/reposetup.py#L141-L153

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
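For the record, the manual reproduction and clean-up on the Lago hosts amounted to roughly the following (the cache path is illustrative; the exact location depends on how reposync is invoked):

    # running reposync by hand on a Lago host reproduced the failure:
    reposync -r ovirt-master-snapshot-static -p /var/cache/lago-repos

    # removing the stale copy of the package from the local cache "fixed" it...
    rm -f /var/cache/lago-repos/ovirt-master-snapshot-static/ovirt-web-ui-0.1.0-4.el7.centos.x86_64.rpm

    # ...until a job running inside Mock pulled the old file back in through the
    # caching proxy, which is why the problem kept reappearing.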

Hello Barak.

But why should this be handled on the infra side? Was it infra code that produced two RPMs with the same name and version but different content? If not, then I would file a bug against whoever's code is creating such RPMs; it should then be rebuilt with at least the rpm release incremented, and hence would not require cache invalidation.

Any reason we are not doing that?

Anton.
--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat

On 23 December 2016 at 21:02, Anton Marchukov <amarchuk@redhat.com> wrote:
Hello Barak.
But why should this be handled on the infra side? Was it infra code that produced two RPMs with the same name and version but different content? If not, then I would file a bug against whoever's code is creating such RPMs; it should then be rebuilt with at least the rpm release incremented, and hence would not require cache invalidation.
I did not query Sandro about why he did the update the way he did. He knows far more than I do about the various build processes of the various packages in oVirt, and I tend to trust his judgement.
Any reason we are not doing that?
This can take time; every maintainer has his own (bad) habits, and not everyone will agree to do what we want (some downright regard oVirt as a "downstream" consumer and refuse to do anything they regard as oVirt-specific!). In the meantime we need to be resilient to such issues if we can. We can't just let everything fail while we try to "fix the world". Also, next time around we could be seeing similar caching issues with a non-yum/rpm file, so it's good to have a deep understanding of the data paths into our system.

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

I did not query Sandro about why he did the update the way he did. He knows far more than I do about the various build processes of the various packages in oVirt, and I tend to trust his judgement.
But we should (CCing Sandro). I do not see any mistrust here. Even if we can make our infra behave in an abnormal way, there is no guarantee that such an RPM will not get out into the wild (or be prevented from going out). We cannot clear the caches of all of our users. I am also not sure how repoman will behave when oVirt releases are composed in this case; it does not compare RPM content and assumes that RPMs are immutable, as designed. So it makes sense to check this.

This can take time; every maintainer has his own (bad) habits, and not everyone will agree to do what we want (some downright regard oVirt as a "downstream" consumer and refuse to do anything they regard as oVirt-specific!).
This is correct. So it sounds like we need to send an announcement that RPM versions are supposed to change when the content changes, see who has issues with this, and try to analyse why and help if needed.

In the meantime we need to be resilient to such issues if we can. We can't just let everything fail while we try to "fix the world".
Meantime, if we need to support this, we have to disable caching for everything except repos we know follow the rules. But I would insist that this is not a normal situation, and we need to have an idea of how we notify people to fix it and how we can help. And we are not fixing the world: yum has been around for a long time and the world is OK. It is just that we have a problem in our own stuff that we need to fix, so we can use all the systems the world designed for us.
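As a concrete illustration of the fix we would be asking maintainers for (the package and spec file names here are just examples), bumping the Release field before rebuilding is usually enough to give the changed content a new NEVR; with rpmdevtools installed it is a one-liner:

    # bump Release and add a changelog entry, then rebuild - the resulting
    # package no longer collides with the previously published file:
    rpmdev-bumpspec -c "Rebuild with updated content" ovirt-web-ui.spec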
Also, next time around we could be seeing similar caching issues with a non-yum/rpm file, so it's good to have a deep understanding of the data paths into our system.
AFAIK we have a configuration on the proxy server that caches only immutable content for yum repos? So this is indeed an interesting discovery!

Anton.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat
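For reference, if the PHX proxy is Squid (this thread does not say which proxy software is actually in use), the longer-term idea of not caching anything served from resources.ovirt.org could look roughly like this (config path is illustrative):

    # append a no-cache rule for resources.ovirt.org and reload the proxy
    cat >> /etc/squid/squid.conf <<'EOF'
    acl ovirt_resources dstdomain resources.ovirt.org
    cache deny ovirt_resources
    EOF
    squid -k reconfigure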