[JIRA] (OVIRT-1468) Use CI mirrors for the slaves
by Barak Korren (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1468?page=com.atlassian.jir... ]
Barak Korren commented on OVIRT-1468:
-------------------------------------
Here are some thoughts about doing this from Jenkins.
Besides reducing the need for external CM systems on the slaves this has another nice benefit where, because of the way the 'jenkins' repo is organised and tested, we can get CI for repo URL changes for "free". Essentially once we send a patch to the Jenkins repo that updates repo URLs for slaves, the updated configuration will be used in the '{{check-patch}}' run for that very patch.
But this also raises an issue where if we send a broken patch, it will leave the slave that it ran on in a broken state.
We can work around this by making the '{{check-patch.sh}}' script in the Jenkins repo roll back the configuration change on the slave. But there can be cases where we won't even get to '{{check-patch.sh}}', so we'll need to have some kind of catch-all code that will revert changes to the slave in the case of breakage.
Another way to look at this is to think in terms of a transaction. We will have a script that runs at the beginning of the job, "opens" a transaction and updates the slave repo configuration. Then we will also have a script at the end of the job that "applies" the transaction if the job was successful or rolls it back otherwise.
So we need a tool that can make the change transactional. This can probably be easily implemented by using some backup files (e.g. with '{{cp --backup}}'). But if we implement this ourselves in the 'jenkins' repo then we run the risk of having a patch that breaks the very rollback mechanism we're trying to provide... So we'll need transactional updates to the transactional updates tool... My brain starts to hurt from running in a loop now...
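To make the backup-file idea a bit more concrete, here is a minimal Python sketch of such a transaction. The function names ('{{begin}}', '{{commit}}', '{{rollback}}') and the '{{.txn-backup}}' suffix are hypothetical and not taken from any existing tool; the sketch also assumes a single, already existing repo configuration file per transaction.

```python
import os
import shutil

# Hypothetical suffix for the backup copy, chosen only for this sketch.
BACKUP_SUFFIX = ".txn-backup"


def begin(config_path, new_content):
    """Open a transaction: back up the current file, then write the new one.

    Assumes config_path already exists on the slave.
    """
    backup_path = config_path + BACKUP_SUFFIX
    if os.path.exists(backup_path):
        raise RuntimeError("a transaction is already open for " + config_path)
    shutil.copy2(config_path, backup_path)
    with open(config_path, "w") as f:
        f.write(new_content)


def commit(config_path):
    """Apply the transaction: keep the new file and drop the backup."""
    backup_path = config_path + BACKUP_SUFFIX
    if os.path.exists(backup_path):
        os.remove(backup_path)


def rollback(config_path):
    """Restore the backup if one exists; do nothing after a commit."""
    backup_path = config_path + BACKUP_SUFFIX
    if os.path.exists(backup_path):
        os.replace(backup_path, config_path)
```

Because '{{rollback}}' does nothing once the backup is gone, it could be called unconditionally at the end of a job, after the step that commits on success.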
Ok, let me try to get more practical:
# We will need to either find a tool that allows setting the repo configuration and rolling it back, or decide to implement this ourselves.
# However we choose to implement this, we will probably need to call it from the '{{mock_setup.sh}}' script that gets run at the start of every job.
# We will need to add a new post-build step (publisher) that will only run for successful jobs and applies the transaction.
# We will add code to '{{mock_cleanup.sh}}' to make it roll back the transaction if it has not been applied, and sort things so that it runs after the step mentioned in #3 above.
# We will need to implement things so that the code that does the rollback does not get updated in '{{check-patch}}'. (But gets checked!)
(As a side note, I know '{{mock_setup.sh}}' and '{{mock_cleanup.sh}}' need renaming, bear with me...)
Since I mention some testing in #5, it may be a good idea to implement some of this in Python and use '{{pytest}}' (The 'jenkins' repo is already configured to run '{{pytest}}' tests), but we could also use '{{bats}}'.
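As a rough illustration of what a '{{pytest}}' check of the rollback logic could look like, the sketch below relies on pytest's built-in '{{tmp_path}}' fixture. The helpers '{{update_repo_config}}' and '{{rollback_repo_config}}', the '{{.bak}}' suffix and the example URLs are hypothetical stand-ins for whatever tool ends up being used.

```python
import shutil
from pathlib import Path

OLD = "baseurl=http://old.example.com/\n"
BROKEN = "baseurl=http://broken.example.com/\n"


def update_repo_config(path: Path, new_content: str) -> None:
    """Hypothetical stand-in: back up the file, then overwrite it."""
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(new_content)


def rollback_repo_config(path: Path) -> None:
    """Hypothetical stand-in: restore the backup if it is still there."""
    backup = path.with_name(path.name + ".bak")
    if backup.exists():
        backup.replace(path)


def test_rollback_restores_original(tmp_path):
    repo_file = tmp_path / "ci.repo"
    repo_file.write_text(OLD)
    update_repo_config(repo_file, BROKEN)
    rollback_repo_config(repo_file)
    assert repo_file.read_text() == OLD


def test_rollback_without_open_transaction_is_noop(tmp_path):
    repo_file = tmp_path / "ci.repo"
    repo_file.write_text(OLD)
    rollback_repo_config(repo_file)
    assert repo_file.read_text() == OLD
```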
[~ederevea] would you like to take a swing at this yourself?
> Use CI mirrors for the slaves
> -----------------------------
>
> Key: OVIRT-1468
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1468
> Project: oVirt - virtualization made easy
> Issue Type: Improvement
> Components: oVirt Infra
> Reporter: Barak Korren
> Assignee: infra
> Labels: jenkins, mirrors, slaves
>
> Since we've seen that issues like OVIRT-1467 can very quickly break our entire infrastructure, it may be advisable to use our mirrors to isolate the slave VMs from the upstream repos as well, and not just the testing environments.
--
This message was sent by Atlassian JIRA
(v1000.1092.0#100053)
7 years, 4 months
[JIRA] (OVIRT-1494) RCA for PHX Storage outage on 29.06.2017
by Evgheni Dereveanchin (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1494?page=com.atlassian.jir... ]
Evgheni Dereveanchin reassigned OVIRT-1494:
-------------------------------------------
Assignee: Evgheni Dereveanchin (was: infra)
> RCA for PHX Storage outage on 29.06.2017
> ----------------------------------------
>
> Key: OVIRT-1494
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1494
> Project: oVirt - virtualization made easy
> Issue Type: Task
> Reporter: Evgheni Dereveanchin
> Assignee: Evgheni Dereveanchin
>
> The PHX storage stopped working around 5:40 GMT today and was brought up manually at 9:12 GMT by shutting down ovirt-storage02 and starting services on the remaining node.
> ovirt-storage02 was the active cluster member and some unknown condition triggered a cluster failover attempt. This event however failed with all cluster resources going offline and not coming up on either of the nodes until one of them was shut down completely.
> Opening this ticket to analyze logs and confirm what triggered the failover and why it eventually failed.
[JIRA] (OVIRT-1494) RCA for PHX Storage outage on 29.06.2017
by Evgheni Dereveanchin (oVirt JIRA)
Evgheni Dereveanchin created OVIRT-1494:
-------------------------------------------
Summary: RCA for PHX Storage outage on 29.06.2017
Key: OVIRT-1494
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1494
Project: oVirt - virtualization made easy
Issue Type: Task
Reporter: Evgheni Dereveanchin
Assignee: infra
The PHX storage stopped working around 5:40 GMT today and was brought up manually at 9:12 GMT by shutting down ovirt-storage02 and starting services on the remaining node.
ovirt-storage02 was the active cluster member and some unknown condition triggered a cluster failover attempt. This event however failed with all cluster resources going offline and not coming up on either of the nodes until one of them was shut down completely.
Opening this ticket to analyze logs and confirm what triggered the failover and why it eventually failed.
[JIRA] (OVIRT-1490) Simplify oVirt storage configuration
by Evgheni Dereveanchin (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1490?page=com.atlassian.jir... ]
Evgheni Dereveanchin reassigned OVIRT-1490:
-------------------------------------------
Assignee: Evgheni Dereveanchin (was: infra)
> Simplify oVirt storage configuration
> ------------------------------------
>
> Key: OVIRT-1490
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1490
> Project: oVirt - virtualization made easy
> Issue Type: Improvement
> Components: storage
> Reporter: Barak Korren
> Assignee: Evgheni Dereveanchin
> Priority: Highest
> Labels: infra, storage
>
> Today's outage was a clear reminder that our current storage configuration does not serve us well. We hardly know how to debug it, it does not seem to be resistant to the very issues it was supposed to protect against, and it introduces potential failure scenarios of its own.
> I suggest we implement a new storage layout that meets the following criteria:
> # Ultimate simplicity at the lower level of the stack. More specifically:
> ## The storage servers should be simple NFS or iSCSI servers. No DRBD and no exotic file-systems.
> ## Only simple storage will be presented to oVirt for use as storage domains
> # Separation of resources between critical services - The "Jenkins" master, for example, should not share resources with the "resources" server or anything else. The separation should hold true down to the physical spindle level.
> # Duplication of services and use of local storage where possible - this is a longer term effort - but we have some low-hanging fruit here like artifactory, where simple DNS/LB-based fail-over between two identical hosts would probably suffice.
> # Complexity only where needed and up the stack. For example we can just have the storage for Jenkins be mirrored at the VM level with fail-over to a backup VM.