Tuesday, August 31, 2010

Conversing with co-workers can spark new ideas

A couple of weeks ago, I got a chance to speak with shaver about my concerns regarding the load on our testing pool of Rev3 minis. In short, we have switched from running unit tests on the same pool of machines that produce Firefox development builds to the pool of Rev3 Mac minis (the ones released at the end of 2009) where we run our performance tests (aka Talos). Unfortunately, we also redesigned our try server, and our developers loved it so much that they decided to squeeze everything out of it, producing an immense amount of load that we had not expected to reach before the end of the year. Guess what? We are going to have even more load as soon as we add more super-fast hardware machines (aka IX machines; we are ordering more than 150, though we are not sure how many will actually become slaves, and remember that we are adding Linux 64 and Windows 64 as new IX-based platforms), and this will put even more load on the testing pool of slaves.

These are some of the things we discussed:
  • There is no need to tie unit tests to the Rev3 minis. Adding a pool of faster machines would let us dedicate the minis to performance tests and run the unit tests on faster hardware, without requiring identical hardware for each platform. The drawback is that it would mean maintaining a third pool of slaves and many more reference images. We can revisit this in another quarter, as there are shorter-term options to reduce the time that unit test jobs take. The good news is that our infrastructure is now flexible enough to run unit test jobs on different pools of slaves.
  • We can shave set-up and tear-down times. To run our unit tests we have to download the build and the tests, remove everything from the previous run, and check out the tools repository to unpack the Mac .dmg files. We also download the symbols in case the browser crashes. We need to determine which steps can be optimized so that we get to running the test suite sooner. These set-up and tear-down steps could be greatly improved; for instance, a quick glance at Windows suggested we could save between 20% and 30%.
  • We can investigate whether the test framework itself could be optimized. I don't recall much of this, but I believe Bob Moss' team could help us speed up our functional and performance tests. For instance, we could leave it to the framework to download and unpack the symbols only if the build crashed.
  • Our minis are dual-core; how could we take advantage of that? Could we run two buildbot instances? Could we hand off two jobs, each in a different thread? There is a lot of experimentation and technical consideration required here, especially since we have to reboot after every job and would have to wait for both jobs to finish.
  • We need better tools to determine step times. Imagine if I could tell you that suite A, on average, wastes X% of its time on platform Y doing set-up and tear-down. It would also be great if we could determine when a spike in test run times appeared. Yesterday I saw our new intern Syed playing with SQL queries to determine some of these things. Happy to see this happening :)
  • Quick-format instead of remove. The step that removes the previous build and tests can take a few minutes on Windows, and that is way too much time. Instead, we could quick-format the drive where these get unpacked, which is supposed to be really fast. There is a bug filed where this investigation will happen. This could also help make our Talos numbers more reliable.
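A step-timing tool like the one described above could start as a small script over per-step duration records. Here is a minimal sketch; the record layout and the numbers are made up for illustration, and real data would come from buildbot's per-step logs or the database Syed was querying:

```python
from collections import defaultdict

# Hypothetical records: (suite, platform, step, seconds).
RECORDS = [
    ("mochitest", "win32", "download build", 120),
    ("mochitest", "win32", "clobber previous run", 180),
    ("mochitest", "win32", "run tests", 700),
    ("mochitest", "macosx", "download build", 60),
    ("mochitest", "macosx", "clobber previous run", 40),
    ("mochitest", "macosx", "run tests", 900),
]

# Steps that count as set-up/tear-down overhead rather than real testing.
OVERHEAD_STEPS = {"download build", "clobber previous run"}

def overhead_fraction(records):
    """Return {(suite, platform): fraction of total time spent on overhead}."""
    totals = defaultdict(float)
    overhead = defaultdict(float)
    for suite, platform, step, seconds in records:
        key = (suite, platform)
        totals[key] += seconds
        if step in OVERHEAD_STEPS:
            overhead[key] += seconds
    return {key: overhead[key] / totals[key] for key in totals}

fractions = overhead_fraction(RECORDS)
print(fractions)  # win32 spends 30% on overhead here, macosx 10%
```

With real data behind it, the same aggregation would answer both questions above: which suite/platform pairs waste the most time on overhead, and (grouped by date instead of platform) when a spike appeared.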
I love discussing our problems with other people, since it often sparks good ideas that can help us all. Notice that I said "discussing" and not just "hearing": there are many considerations that people outside our team are not aware of when we have to make a decision, and the back-and-forth of a proper discussion helps them give us even better suggestions.

Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.


  1. One little math experiment I did a while back in bug 489333 may be interesting, too. Basically, you can calculate the optimal number of slaves for a parallelizable task if there is a constant setup time per slave.

    Just something that comes to mind: could it save time to download and unpack the packages in one go? Like, wget -O- | tar -jx- ?

  2. Both very interesting points. Thanks for your input!

    First, as you said, we can try to achieve a more constant set-up/tear-down time by optimizing it. One difference is that we don't clobber on the minis as we do on the builders, since there are hundreds of GB available. We should optimize the removal of the previous run or set it aside.

    The wget/tar combination is very out of the box :). There are other ideas lurking around in the bugs, such as extracting only the subset of tests we need.

  3. There's a bug filed about doing crash processing on another server.

    All the code is basically written, it just needs to be hooked up to buildbot. I think catlee started looking at it but ran out of time.

    Running the tests on faster hardware (but the same OS environment) is just plain smart. Why didn't anyone think of that before?

    We should look into fixing our test suites so that they can run some tests in parallel. It's a good idea in general.
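The math experiment the first commenter mentions can be sketched quickly. This is my own reconstruction under simple assumptions (perfectly parallelizable work, constant per-slave setup cost), not necessarily the actual model in bug 489333: with W seconds of work and s seconds of setup per slave, total time is T(n) = W/n + s*n, which is minimized at n = sqrt(W/s).

```python
import math

def optimal_slaves(work_seconds, setup_seconds):
    """Minimize T(n) = work/n + setup*n over the number of slaves n.

    Assumes perfectly parallelizable work and a constant per-slave
    setup cost (a simplification for illustration).
    """
    n = math.sqrt(work_seconds / setup_seconds)
    total = work_seconds / n + setup_seconds * n
    return n, total

# Example: 400s of test work and 4s of setup per slave.
n, total = optimal_slaves(400, 4)
print(n, total)  # 10.0 80.0: ten slaves, 40s running tests + 40s of setup
```

Beyond the optimum, each extra slave adds more setup time than it removes in test time, which is why "just add more machines" stops paying off.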