Tuesday, April 12, 2011

Yesterday's load

I will do a longer analysis at some point, but for now I would like to share a link and a screenshot of it.
These two diagrams show commits over 24 hours (from Mon, 11 Apr 2011 00:00 PDT to Tue, 12 Apr 2011 00:00 PDT) across all of our currently supported project branches. The first diagram shows pushes per hour, and the second shows how those pushes are distributed among the different project branches.

Each of these commits produces different types of builds and tests. A single build can queue up to 14 test suites plus 8 different talos jobs for a given OS.
How easily can the test pool run out of capacity? Three builds for the same OS finishing around the same time can generate up to 66 test jobs (3 builds × 22 jobs each), which is more than the whole testing pool for that OS (we have 48 to 54 machines per OS), and the pool can stay saturated for a variable amount of time. Test jobs can take from 5 minutes to more than 60 minutes depending on the OS and the test suite.
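
The arithmetic above can be sketched in a few lines of Python. This is only an illustration built on the numbers quoted in this post; the function names are made up for the example, and it assumes the best case where every machine in the pool is idle when the builds finish, which is rarely true in practice.

```python
# Figures quoted in this post (upper bounds per build, for one OS).
TEST_SUITES_PER_BUILD = 14
TALOS_JOBS_PER_BUILD = 8
POOL_SIZES = (48, 54)  # quoted range of test machines per OS


def jobs_for_builds(builds: int) -> int:
    """Upper bound on test jobs queued when `builds` builds of one OS finish together."""
    return builds * (TEST_SUITES_PER_BUILD + TALOS_JOBS_PER_BUILD)


def pending_jobs(builds: int, pool_size: int) -> int:
    """Jobs left waiting, assuming (hypothetically) that the whole pool is idle."""
    return max(0, jobs_for_builds(builds) - pool_size)


if __name__ == "__main__":
    for pool in POOL_SIZES:
        print(f"pool={pool}: {jobs_for_builds(3)} jobs, {pending_jobs(3, pool)} pending")
```

Even in that best case, three simultaneous builds leave between 12 and 18 jobs waiting, while two builds (44 jobs) would still fit in the pool.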

For further information on test times, I have some raw data from back in December (warning: it is out of date) and three blog posts where I drew conclusions from it.

This high load of pushes, together with how close the pushes are to each other, causes test jobs to queue up and wait to be processed (this can be seen in the daily Wait Time emails on dev.tree-management). We need more machines (and we are working on it), but here are a few things that you can do to improve things until then:
  • Use the TryChooser syntax. Spending a moment to choose a subset of build and test jobs for your change helps to use the right amount of resources. If you need all builds and tests, do not hesitate to request them all. Note that at some point this syntax will be mandatory.
  • Cancel unneeded jobs. Use self-serve (which shows up on tbpl) to stop running or pending jobs once you know that they are not needed, because you pushed something incorrectly or because it is going to fail. Once a build or test is not needed, please cancel it to free up resources. Everyone will thank you.
There are also things that could be fixed, like improving reftests and xpcshell for Win7, but that is not something that everyone can help with in a reasonable amount of time.

[EDIT] 4:15pm PDT - I want to highlight that there is going to be a series of blog posts explaining the work we will be undertaking, including the purchase of new testing machines, to deal with such bad wait times.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.