Tuesday, April 12, 2011

Yesterday's load

I will write a longer analysis at some point, but for now I would like to share a link and a screenshot of it.
These two diagrams show commits over 24 hours (from Mon, 11 Apr 2011 00:00 PDT to Tue, 12 Apr 2011 00:00 PDT) from all of our currently supported project branches. On the first diagram we can see pushes per hour and on the second diagram we can see a distribution of these pushes among the different project branches.

Each one of these commits produces different types of builds and tests. For a given build we can end up queuing up to 14 test suites plus 8 different talos jobs for a given OS, that is, up to 22 test jobs per build.
How easily can the test pool run out of capacity? Three builds for a certain OS finishing around the same time can generate up to 66 testing jobs, which is more than the whole testing pool for that OS (we have 48 to 54 machines per OS), and they can tie it up for a variable amount of time. Test jobs can take from 5 minutes to more than 60 minutes depending on the OS and the test suites.
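As a rough sketch of that arithmetic (not how the scheduler itself works), the snippet below uses the upper-bound figures from this post. The function names are hypothetical, and it assumes every build triggers the full set of jobs and that each job occupies exactly one machine.

```python
# Upper-bound figures quoted in the post.
TEST_SUITES_PER_BUILD = 14
TALOS_JOBS_PER_BUILD = 8
POOL_SIZES = (48, 54)  # machines per OS

# Hypothetical helpers, for illustration only.
def jobs_for_builds(builds: int) -> int:
    """Maximum number of test jobs queued by `builds` builds for one OS."""
    return builds * (TEST_SUITES_PER_BUILD + TALOS_JOBS_PER_BUILD)


def builds_to_exceed(pool_size: int) -> int:
    """Smallest number of simultaneous builds whose jobs outnumber the pool."""
    per_build = TEST_SUITES_PER_BUILD + TALOS_JOBS_PER_BUILD
    return pool_size // per_build + 1


if __name__ == "__main__":
    for pool in POOL_SIZES:
        builds = builds_to_exceed(pool)
        queued = jobs_for_builds(builds) - pool
        print(f"pool of {pool}: {builds} builds -> "
              f"{jobs_for_builds(builds)} jobs, {queued} left waiting")
```

Under these assumptions both pool sizes are exceeded by the third build, leaving 18 and 12 jobs waiting respectively, before counting any jobs that are already running.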

For further information on test times I have some raw data from back in December (out-of-date warning) and three blog posts where I drew conclusions from it.

This high load of pushes, and the clustering of pushes (how close they are to each other), causes test jobs to be queued and to wait to be processed (this can be seen in the daily Wait Time emails on dev.tree-management). We need more machines (and we are working on it), but here are a few things that you can do to improve things until then:
  • Use the TryChooser syntax. Spending a moment to choose a subset of build and test jobs for your change helps to use the right amount of resources. If you need all builds and tests, do not hesitate to request them all. Note that at some point this syntax will be mandatory.
  • Cancel unneeded jobs. Use self-serve (which shows up on tbpl) to stop running or pending jobs once you know that they are not needed, because you pushed something incorrectly or it is going to fail. Once a build or test is not needed, please cancel it to free up resources. Everyone will thank you.
There are also things that could be fixed, like improving reftests and xpcshell for Win7, but that is not something that everyone can help with in a reasonable amount of time.

[EDIT] 4:15pm PDT - I want to highlight that there is going to be a series of blog posts explaining the work and the purchase of new testing machines that we will be undertaking to deal with such bad wait times.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.


  1. 1. Require trychooser syntax. No-one will complain - we all need these resources, and the only reason that people don't use trychooser is that they forget.

    2. Make it easier to cancel unused jobs. The self-serve interface is painful to cancel jobs individually. I filed a bug on this ages ago.

    3. Buy more machines, or use EC2. We're hiring like crazy, and I already need to wait half a work-day for results of a _push_, never mind a try. That means that 100% of the time I have to push code that hasn't run on try (because I need to merge with what's been landed in the meantime).

  2. I just updated the post to note that we are undertaking a large purchase, and this is part of the research into how many machines we need to handle the expected growth and bring capacity to a new level.

    For #1 we are looking into making it mandatory, to help the people who forget. There are other improvements for the tryserver but those come later.

    For #2, yes, it should be improved. I don't know which bug it is but it is important to raise the priority on it now that aurora is settling down.

    For #3, you are right. We are going to buy more and we will look into the EC2 option as well.

  3. Yeah, a "cancel all jobs" button would be great.

  4. @Nicholas the bug filed for it is this https://bugzilla.mozilla.org/show_bug.cgi?id=636027

    @Paul I believe there is something more going on than just the lack of slaves. Please check https://bugzilla.mozilla.org/show_bug.cgi?id=649734 for more details