OpenShift CI has a concept called "rehearsals". These are special jobs that run when a job is added or changed, so that the job can be tested on the CI build farm before the config change merges. The feature is popular among CI job maintainers because it is a simple way to confirm that the jobs they create or change work as intended.
Original Runner Job
We originally had a "rehearsal runner job" that was implemented like any other ProwJob. This runner job would execute some complex logic to determine the changed jobs, and then spawn executions of them. These executions were linked to the PR in openshift/release that changed or added the configuration. This job would execute on every PR to openshift/release, and would spawn a sampling of up to 10 randomly selected jobs each time it ran.
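The sampling step described above can be sketched in a few lines. This is only an illustration of "up to 10 randomly selected jobs", not the original tool's code; the function name `pick_rehearsals` and its parameters are hypothetical.

```python
import random

MAX_REHEARSALS = 10  # the cap mentioned in the text


def pick_rehearsals(changed_jobs, limit=MAX_REHEARSALS, rng=None):
    """Return up to `limit` randomly chosen job names from `changed_jobs`.

    Hypothetical helper: the real runner's selection logic is not shown
    in this post, so this only mirrors the behaviour it describes.
    """
    if rng is None:
        rng = random.Random()
    jobs = sorted(set(changed_jobs))  # de-duplicate and give the sampler a stable input
    if len(jobs) <= limit:
        return jobs  # few enough jobs: rehearse all of them
    return sorted(rng.sample(jobs, limit))
```

With 25 changed jobs, each run picks a different set of 10, which is why authors had to retry until the job they cared about happened to be chosen.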
[Figure: Original Runner Job Status]
Problems With This Approach
- Spawns new rehearsal jobs for every push to every PR, with no way to opt out
- No way to select which jobs are rehearsed; authors had to retry the random selection until the desired job was chosen
- No simple way to abort all of the rehearsal jobs when a mistake in the PR is noticed
- A confusing user experience: people were often unclear why the runner job was failing, and were not told that it was because one of the spawned jobs had failed
External Prow Plugins
Prow allows for the creation and configuration of External Plugins to handle tasks that are generally outside of its scope. These plugins are configured as a GitHub event server responding to webhooks that GitHub sends, such as when a PR is created or commented on. The plugin can decide which events it cares about, and act on them accordingly.
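As a rough sketch of what "deciding which events it cares about" looks like, the snippet below routes a single webhook delivery. It is generic GitHub webhook handling in Python, not Prow's own plugin framework; the function names and the returned strings are hypothetical. It relies on GitHub's `X-GitHub-Event` and `X-Hub-Signature-256` headers, and on the fact that comments on a PR arrive as `issue_comment` events.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)


def route_event(event_type: str, body: bytes) -> str:
    """Classify one delivery; `event_type` is the X-GitHub-Event header value.

    The return values ("pr_opened", "pr_comment", "ignored") are
    illustrative labels, not part of any real API.
    """
    payload = json.loads(body)
    action = payload.get("action")
    if event_type == "pull_request" and action == "opened":
        return "pr_opened"
    # PR comments are issue_comment events whose issue has a pull_request key.
    if (
        event_type == "issue_comment"
        and action == "created"
        and "pull_request" in payload.get("issue", {})
    ):
        return "pr_comment"
    return "ignored"
```

A server would call `signature_is_valid` first and drop any delivery that fails the check before routing it.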
Converting to a Plugin
As part of an ongoing effort to reduce the cost of running our CI, and to make rehearsals easier for job maintainers to use, we decided to convert the runner job into an external Prow plugin. This was a complex effort that included the following:
- The logic of the rehearsal tool was significantly refactored to support running as a plugin
- An event server was created and configured to handle PR creation and comment events
- When a PR is created, the list of affected jobs is determined and posted as a table in a comment on the PR itself, along with detailed steps for interacting with the plugin
- When a user comments something like "/pj-rehearse" on a PR, the plugin is notified and executes the appropriate command
- If there is an error determining the affected jobs, or spawning the chosen rehearsal jobs, it is surfaced to the user as a comment on the PR
- A new label, 'rehearsals-ack', was added to the openshift/release repo, and it is required in order to merge
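Two of the steps above, posting the affected jobs as a table and requiring the label, are easy to picture in code. The sketch below is an assumption about shape only: the column headings, the empty-list message, and the function names are hypothetical, and the real plugin's comment format is not shown in this post.

```python
REHEARSALS_ACK = "rehearsals-ack"  # label name taken from the text


def affected_jobs_comment(jobs):
    """Render affected jobs as a Markdown table for a PR comment.

    Column titles and the fallback message are illustrative guesses.
    """
    if not jobs:
        return "No affected jobs found."
    lines = ["| Job | Rehearse command |", "| --- | --- |"]
    for name in sorted(jobs):
        lines.append(f"| {name} | `/pj-rehearse {name}` |")
    return "\n".join(lines)


def has_rehearsals_ack(labels):
    """Return True when the PR carries the label required for merge."""
    return REHEARSALS_ACK in labels
```

Putting the exact command next to each job name is what lets an author copy and paste instead of guessing the syntax.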
New Rehearsal Workflow
With this plugin in place, the new workflow for a PR author adding or maintaining CI job configuration became pretty simple. The author opens a PR on openshift/release changing one or more jobs. The plugin notices this and comments with a table of all affected jobs, along with commands for running one or more of them. Then:
- If the author elects to skip the rehearsals, they can just comment: '/pj-rehearse skip' to add the 'rehearsals-ack' label and allow the merge once other criteria are met.
- If they want to run up to 10 rehearsals using the original sampling logic, they simply comment: '/pj-rehearse' to kick them off.
- If they want to rehearse one or more specific jobs, they can follow the directions and comment: '/pj-rehearse {job-name}'.
- If they decide to abort the running rehearsals they can comment: '/pj-rehearse abort' to immediately abort them, saving any additional cost that would be incurred.
- Once they are satisfied with the results of the rehearsals, they can comment: '/pj-rehearse ack' to add the appropriate label.
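The commands listed above map naturally onto a small parser. The following is a minimal sketch, not the plugin's implementation: the function name, the returned action strings, and the choice to accept several whitespace-separated job names on one line are all assumptions.

```python
COMMAND = "/pj-rehearse"
KEYWORDS = ("skip", "ack", "abort")  # sub-commands named in the text


def parse_rehearse_comment(comment):
    """Return (action, job_names) for the first /pj-rehearse line, else None.

    Actions are illustrative: "sample" for the bare command, one of
    KEYWORDS for a sub-command, or "rehearse" with explicit job names.
    """
    for line in comment.splitlines():
        words = line.split()
        if not words or words[0] != COMMAND:
            continue  # not a command line; keep scanning the comment
        args = words[1:]
        if not args:
            return ("sample", [])  # bare command: original sampling logic
        if len(args) == 1 and args[0] in KEYWORDS:
            return (args[0], [])
        return ("rehearse", args)  # assumed: names separated by whitespace
    return None
```

For example, `parse_rehearse_comment("/pj-rehearse abort")` yields `("abort", [])`, while a comment with no command line yields `None` and would be ignored.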
Results
PR authors now have much greater control over how many jobs are rehearsed, which jobs are run, and when their jobs are rehearsed. The process is simpler to understand, and instructions for following the workflow are included with every PR. The primary goal of reducing CI cost was also achieved: costs for PRs on openshift/release fell by a factor of 2 to 3!