These are some of the things we discussed:
- There is no need to tie ourselves to Rev3 minis for unit tests. Adding another pool of faster machines would let us keep the Rev3 pool for performance tests only and run the unit tests on faster machines, without needing the same hardware for each platform. The downside is that a third pool of slaves means more maintenance and many more reference images. We can revisit this in another quarter, as there are other short-term options to reduce the time that unit test jobs take. The good news is that our infrastructure can now run unit test jobs on different pools of slaves, since we have made our code more flexible.
- We can shave setup and teardown times. To run our unit tests we have to download the builds and the tests, remove everything from the previous run and check out the tools repository, which we need to unpack the Mac dmg files. We also download the symbols in case the browser crashes. We will have to determine which of these steps can be made shorter so that the test suite starts running sooner. These setup and teardown steps could be greatly optimized; for instance, on Windows we determined at a quick glance that we could save between 20% and 30%.
- We can investigate whether the test framework could be optimized. I don't recall much of this part of the discussion, but I believe that Bob Moss' team could help us speed up our functional and performance tests. For instance, we could leave it to the framework to download and unpack the symbols only if the browser crashed.
- Our minis are dual core. How could we take advantage of that? Could we run two buildbot instances? Could we hand off two jobs, each one in a different thread? There is a lot of experimentation ahead and many technical considerations, especially the fact that we have to reboot after every job, so we would have to wait for both jobs to finish.
- We need better tools to determine step times. Imagine if I could tell you that suite A, on average, wastes X% of its time on platform Y doing setup and teardown. It would also be cool if we could determine when a spike in test run times first appeared. Yesterday I saw our new intern Syed playing with SQL queries to determine some of these things. Happy to see this happening :)
- Quick format instead of remove. The step that removes the previous build and tests can take a few minutes on Windows, and that is way too much time. Instead we could quick-format the drive where these get unpacked, which is supposed to be really fast. Here is the bug where the investigation is to happen. This could also help make our Talos times more reliable.
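
To make the step-time idea a bit more concrete, here is a minimal sketch of the kind of report I have in mind: the percentage of each suite's time, per platform, that goes into setup and teardown. Everything in it is an assumption for illustration. The step names, the record layout and the sample durations are made up, and this is not how our buildbot data is actually stored or queried.

```python
from collections import defaultdict

# Hypothetical step names counted as setup/teardown overhead.
# The real buildbot step names will differ.
OVERHEAD_STEPS = {
    "download_build",
    "download_tests",
    "download_symbols",
    "remove_previous_run",
    "checkout_tools",
}


def overhead_percentages(records):
    """Return {(suite, platform): percent of total time spent in overhead steps}.

    Each record is a (suite, platform, step, seconds) tuple.
    """
    total = defaultdict(float)
    overhead = defaultdict(float)
    for suite, platform, step, seconds in records:
        key = (suite, platform)
        total[key] += seconds
        if step in OVERHEAD_STEPS:
            overhead[key] += seconds
    # Skip groups with no recorded time to avoid dividing by zero.
    return {
        key: 100.0 * overhead[key] / total[key]
        for key in total
        if total[key] > 0
    }


# Made-up sample durations, for illustration only.
sample = [
    ("mochitest", "win7", "remove_previous_run", 120.0),
    ("mochitest", "win7", "download_build", 60.0),
    ("mochitest", "win7", "run_tests", 420.0),
    ("reftest", "linux", "remove_previous_run", 10.0),
    ("reftest", "linux", "run_tests", 190.0),
]

if __name__ == "__main__":
    for (suite, platform), pct in sorted(overhead_percentages(sample).items()):
        print(f"{suite} on {platform}: {pct:.1f}% setup/teardown")
```

Feeding it one record per step per job would give the "suite A wastes X% on platform Y" number directly, and grouping by date as well would be one way to see when a spike first shows up.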
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.