Each of these commits produces different types of builds and tests. For a given build, we can end up queuing up to 14 test suites plus 8 different Talos jobs for a given OS.
How easily can the test pool run out of capacity? Three builds for a certain OS finishing around the same time can generate up to 66 test jobs, more than the entire testing pool for that OS (we have 48 to 54 machines per OS), for a variable amount of time. Test jobs can take from 5 minutes to more than 60 minutes depending on the OS and the test suite.
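To make the arithmetic behind those numbers explicit, here is a quick back-of-the-envelope sketch. The per-build job counts (14 test suites plus 8 Talos jobs) and the pool sizes (48 to 54 machines per OS) are the figures from this post; everything else is just multiplication.

```python
# Back-of-the-envelope check of the capacity claim above.
# Figures are taken from this post; this is only an illustration.
TEST_SUITES_PER_BUILD = 14
TALOS_JOBS_PER_BUILD = 8
BUILDS_FINISHING_TOGETHER = 3
POOL_SIZE_MIN, POOL_SIZE_MAX = 48, 54

jobs_per_build = TEST_SUITES_PER_BUILD + TALOS_JOBS_PER_BUILD  # 22 jobs per build
queued_jobs = BUILDS_FINISHING_TOGETHER * jobs_per_build       # 66 jobs queued at once

print(queued_jobs)                  # 66
print(queued_jobs > POOL_SIZE_MAX)  # True: exceeds even the largest pool
```

So even in the best case (54 machines, every one of them idle), three near-simultaneous builds leave at least a dozen test jobs waiting in the queue.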
For further information on test times, I have some raw data from back in December (out-of-date warning) and three blog posts where I drew conclusions from it.
This high load of pushes, and how closely they are bunched together, causes test jobs to be queued and to wait to be processed (this can be seen in the daily Wait Time emails on dev.tree-management). We need more machines (and we are working on it), but here are a few things you can do to improve the situation until then:
- Use the TryChooser syntax. Spending a moment to choose a subset of build and test jobs for your change helps use the right amount of resources. If you need all builds and tests, do not hesitate to request them all. Note that at some point this syntax will become mandatory.
- Cancel unneeded jobs. Use self-serve (linked from tbpl) to stop running or pending jobs once you know they are not needed, for example because you pushed something incorrect or the build is going to fail. Once a build or test is not needed, please cancel it to free up resources. Everyone will thank you.
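As an illustration of the TryChooser point above, a try push that only needs Linux builds with mochitests and no Talos runs might carry a syntax line like the following in its commit message. The exact flags here are an example, not a prescription; check the TryChooser documentation for the options that apply to your change:

```
try: -b do -p linux -u mochitests -t none
```

Requesting only what you need this way keeps the rest of the pool free for everyone else's jobs.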
[EDIT] 4:15pm PDT - I want to highlight that a series of blog posts is coming that will explain the work, and the purchase of new testing machines, that we will be undertaking to handle such bad wait times.
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.