Monday, September 21, 2015

Minimal job scheduling

One of the biggest benefits of moving the scheduling into the tree is that you can adjust the decisions on what to schedule from within the tree.

As chmanchester and I were recently discussing, this is very important as we believe we can do much better on deciding what to schedule on try.

Currently, too many developers push to try with  -p all -u all (which schedules all platforms and all tests). It is understandable, it is the easiest way to reduce the risk of your change being backed out when it lands on one of the integration trees (e.g. mozilla-inbound).

In-tree scheduling analysis

What if your changes would get analysed and we would determine the best educated guess set of platforms and test jobs required to test your changes in order to not be backed out on an integration tree?

For instance, when I push Mozharness changes to mozilla-inbound, I wish I could tell the system that I only need these set of platforms and not those other ones.

If everyone had the minimum amount of jobs added to their pushes, our systems would be able to return results faster (less load) and no one would need to take short cuts.

This would be the best approximation and we would need to fine tune the logic over time to get things as right as possible. We would need to find the right balance of some changes being backed out because we did not get the right scheduling on try and getting results faster for everyone.

Prioritized tests

There is already some code that chmanchester landed where we can tell the infrastructure to run a small set of tests based on files changed. In this case we hijack one of the jobs (e.g. mochitest-1) to run the most relevant tests to your changes which would can normally be tested on different chunks. Once the prioritized tests are run, we can run the remaining tests as we would normally do. Prioritized tests also applies to suites that are not chunked (run a subset of tests instead of all).

There are some UI problems in here that we would need to figure out with Treeherder and Buildbot.

Tiered testing

Soon, we will have all technological pieces to create a multi tiered job scheduling system.

For instance, we could run things in this order (just a suggestion):
  • Builds
  • Prioritized tests (run in 5 to 15 mins)
  • Remaining tests in normal mode

This has the advantage of using prioritized tests as a canary job which would prevent running the remaining tests if we fail the canary (shorter) job.

Post minimal run (automatic) precise scheduling (manual)

This is not specifically to scheduling the right thing automatically but to extending what gets scheduled automatically.
Imagine that you're not satisfied with what gets scheduled automatically and you would like to add more jobs (e.g. missing platforms or missing suites).
You will be able to add those missing jobs later directly from Treeherder by selecting which jobs are missing.
This will be possible once bug 1194830 lands.

NOTE: Mass scheduling (e.g. all mochitests across all platforms) would be a bit of a pain to do through Treeherder. We might want to do a second version of try-extender.



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Friday, September 18, 2015

Mozharness' support for Buildbot Bridge triggered test jobs

I have landed today [1] some code which allows Buildbot *test* jobs triggered through the Buildbot Bridge (BBB) to work properly.

In my previous post I explain a bit on how Mozci works with the Buildbot Bridge.
In this post I will only explain what we fixed on the Mozharness side.

Buildbot Changes

If a Buildbot test job is scheduled through TaskCluster (The Buildbot Bridge supports this), then the generated Buildbot Change associated to a test job does not have the installer and
test urls necessary for Mozharness to download for a test job.

What is a Buildbot Change? It is an object which represents the files touched associated to a code push. For the build jobs, this value gets set as part of the process of polling the Mercurial repositories, however, the test jobs are triggered via  a "buildbot sendchange" step part of the build job.
This sendchange creates the Buildbot Change for the test job which Mozharness can then use.
The BBB does not listen to sendchanges, hence, jobs triggered via the BBB have an empty changes object. Therefore, we can't download the files needed to run a test job and fail to execute.

In order to overcome this limtation, we have to detect when a Buildbot job is triggered normally or through the Buildbot Bridge.
Buildbot Bridge triggered jobs have a 'taskId' property defined (this represents the task associated to this Buildbot job). Through this 'taskId' we can determine the parent task and find a file called properties.json [2], which, it is uploaded by every BBB triggered job.
In such file, we can find both the installer and test urls.

Mozilla CI tools: Upcoming support for Buildbot Bridge

What is the Buildbot Bridge?

The Buildbot Bridge (BBB) allows scheduling Buildbot jobs through TaskCluster.
In other words, you can have taskcluster tasks which represent Buildbot jobs.
This allows having TaskCluster graphs composed of tasks which will be executed either on Buildbot or TaskCluster, hence, allowing for *all* relationships between tasks to happen in TaskCluster.

Read my recent post on the benefits of scheduling every job via TaskCluster.

The next Mozilla CI tools (mozci) release will have support for BBB.

Brief explanation

You can see in this try push both types of Buildbot jobs [1].
One set of jobs were triggered through Buildbot's analysis of the try syntax in the commit message while two of the jobs should not have been scheduled.

Those two jobs were triggered off-band via Mozci submitting a task graph.
You can see the TaskCluster graph representing them in here [2].

These jobs were triggered using this simple data structure:
{
  'Linux x86-64 try build': [
    'Ubuntu VM 12.04 x64 try opt test mochitest-1'
  ]
}

Mozci turns this simple graph into a TaskCluster graph.
The graph is composed of tasks which follow this structure [3]

Notes about the Buildbot Bridge

bhearsum's original post, recording and slides:
http://hearsum.ca/blog/buildbot-taskcluster-bridge-an-overview.html
https://vreplay.mozilla.com/replay/showRecordDetails.html?recId=1879
http://hearsum.ca/slides/buildbot-bridge-overview/#/

Some notes which Selena took about the topic:
http://www.chesnok.com/daily/2015/06/03/taskcluster-migration-about-the-buildbot-bridge/

The repository is in here:
https://github.com/mozilla/buildbot-bridge

---------
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9417c6151f2c
[2] https://tools.taskcluster.net/task-graph-inspector/#mAI0H1GyTJSo-YwpklZqng
[3] https://github.com/armenzg/mozilla_ci_tools/blob/mozci_bbb/mozci/sources/buildbot_bridge.py#L37


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Thursday, September 17, 2015

Platform Operations lightning talks (Whistler 2015)

You can read and watch in here about the Platform Operations lighting talks:
https://wiki.mozilla.org/Platform_Operations/Ligthning_talks

Here the landing pages for the various Platform Operations teams:
https://wiki.mozilla.org/Platform_Operations

PS = I don't know what composes platform operations so feel free to add your team if I'm missing it.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Thursday, September 10, 2015

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help to transition the release process piece by piece to TaskCluster without having to move large pieces of code at once to TaskCluster. You can have graphs of

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post around try syntax and schedulling please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

  • Per-branch scheduling matrix can be done in-tree
    • We can define which platforms and jobs run on each tree
    • TaskCluster tasks already do this
  • Accurate Treeherder job visualization
    • Currently, jobs that run through Buildbot do not necessarily show up properly
    • Jobs run through TaskCluster show up accurately
    • This is due to some issues with how Buildbot jobs are represented in between states and the difficulty to have a way to related them
    • It could be fixed but it is not worth the effort if we're transitioning to TaskCluster
  • Control when non-green jobs are run
    • Currently on try we can't say run all unit tests jobs *but* the ones that should not run by default
    • We would save resources (do not run non-green jobs) and confusion for developers (do not have to ask why is this job non-green)
  • The try syntax parser can be done in-tree
    • This allows for improving and extending the try parser
    • Unit tests can be added
    • The parser can be tested with a push
    • try parser changes become atomic (it won't affecting all trees and can ride the trains)
  • SETA analysis can be done in-tree
    • SETA changes can become atomic (it won't affecting all trees and can ride the trains)
    • We would not need to wait on Buildbot reconfigurations for new changes to be live.
  • Per push scheduling analysis can be done in-tree
    • We currently only will schedule jobs for a specific change if files for that product are being touched (e.g. Firefox for Android for mobile/* changes)
  • PGO scheduling can be done in-tree
    • PGO scheduling changes become atomic (it won't affecting all trees and can ride the trains)
  • Environment awareness hooks (new)
    • If the trees are closed, we can teach the scheduling system to not schedule jobs until further notice
    • If we're backlogged, we can teach the scheduling system to not schedule certain platforms or to schedule a reduced set of jobs or to skip certain pushes
  • Help the transition to TaskCluster
    • Without it we would need to transition builds and associated tests to TaskCluster in one shot (not possible for Talos)
  • Deprecate Self-serve/BuildApi
    • Making changes to BuildApi is very difficult due to the lack of testing environments and set-up burden
    • Moving to the BBB will help us move away from this old system
There are various parts that will need to be in place before we can do this. Here's some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise/decrease task priorities through self-serve/buildapi. This is used by developers, specially on Try. to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model to scheduling changes
    • If we want to move away from self-server/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the 
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.