Thursday, September 10, 2015

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help to transition the release process piece by piece to TaskCluster without having to move large pieces of code at once to TaskCluster. You can have graphs of

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post around try syntax and schedulling please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

  • Per-branch scheduling matrix can be done in-tree
    • We can define which platforms and jobs run on each tree
    • TaskCluster tasks already do this
  • Accurate Treeherder job visualization
    • Currently, jobs that run through Buildbot do not necessarily show up properly
    • Jobs run through TaskCluster show up accurately
    • This is due to some issues with how Buildbot jobs are represented in between states and the difficulty to have a way to related them
    • It could be fixed but it is not worth the effort if we're transitioning to TaskCluster
  • Control when non-green jobs are run
    • Currently on try we can't say run all unit tests jobs *but* the ones that should not run by default
    • We would save resources (do not run non-green jobs) and confusion for developers (do not have to ask why is this job non-green)
  • The try syntax parser can be done in-tree
    • This allows for improving and extending the try parser
    • Unit tests can be added
    • The parser can be tested with a push
    • try parser changes become atomic (it won't affecting all trees and can ride the trains)
  • SETA analysis can be done in-tree
    • SETA changes can become atomic (it won't affecting all trees and can ride the trains)
    • We would not need to wait on Buildbot reconfigurations for new changes to be live.
  • Per push scheduling analysis can be done in-tree
    • We currently only will schedule jobs for a specific change if files for that product are being touched (e.g. Firefox for Android for mobile/* changes)
  • PGO scheduling can be done in-tree
    • PGO scheduling changes become atomic (it won't affecting all trees and can ride the trains)
  • Environment awareness hooks (new)
    • If the trees are closed, we can teach the scheduling system to not schedule jobs until further notice
    • If we're backlogged, we can teach the scheduling system to not schedule certain platforms or to schedule a reduced set of jobs or to skip certain pushes
  • Help the transition to TaskCluster
    • Without it we would need to transition builds and associated tests to TaskCluster in one shot (not possible for Talos)
  • Deprecate Self-serve/BuildApi
    • Making changes to BuildApi is very difficult due to the lack of testing environments and set-up burden
    • Moving to the BBB will help us move away from this old system
There are various parts that will need to be in place before we can do this. Here's some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise/decrease task priorities through self-serve/buildapi. This is used by developers, specially on Try. to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model to scheduling changes
    • If we want to move away from self-server/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the 
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.