Thursday, September 10, 2015

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important, as it will help us transition the release process to TaskCluster piece by piece, without having to move large pieces of code at once. You can have graphs of

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).
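
To make the idea concrete, here is a minimal sketch of what a graph of Buildbot jobs might look like when expressed as TaskCluster tasks. The field names (`provisionerId`, `workerType`, `buildername`, `sourcestamp`), the helper functions and the builder names are assumptions for illustration; this is not the actual Mozilla CI tools code nor a verified BBB schema.

```python
import uuid


def buildbot_bridge_task(buildername, branch, revision):
    """Return one graph node describing a Buildbot job for the bridge.

    All field names here are assumed for illustration only.
    """
    return {
        # Placeholder id; real TaskCluster ids use a different slug format.
        "taskId": uuid.uuid4().hex,
        "requires": [],
        "task": {
            "provisionerId": "buildbot-bridge",  # assumed value
            "workerType": "buildbot-bridge",  # assumed value
            "payload": {
                # The Buildbot builder this task stands for.
                "buildername": buildername,
                "sourcestamp": {"branch": branch, "revision": revision},
            },
            "metadata": {"name": buildername},
        },
    }


def build_graph(buildernames, branch, revision):
    """Build a graph with one task per Buildbot builder name."""
    return {
        "tasks": [
            buildbot_bridge_task(name, branch, revision)
            for name in buildernames
        ]
    }
```

Each Buildbot job maps to exactly one task, so a caller would only need to list builder names and a revision; submitting the graph to TaskCluster is left out of the sketch.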

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits to moving the scheduling into the tree (TaskCluster already works like this; look for the gecko decision task in Treeherder).

For another great blog post about try syntax and scheduling, please read ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

  • Per-branch scheduling matrix can be done in-tree
    • We can define which platforms and jobs run on each tree
    • TaskCluster tasks already do this
  • Accurate Treeherder job visualization
    • Currently, jobs that run through Buildbot do not necessarily show up properly
    • Jobs run through TaskCluster show up accurately
    • This is due to issues with how Buildbot jobs are represented between states and the difficulty of relating them to each other
    • It could be fixed but it is not worth the effort if we're transitioning to TaskCluster
  • Control when non-green jobs are run
    • Currently on try we can't say "run all unit test jobs *but* the ones that should not run by default"
    • We would save resources (non-green jobs are not run) and avoid confusion for developers (no need to ask why a job is non-green)
  • The try syntax parser can be done in-tree
    • This allows for improving and extending the try parser
    • Unit tests can be added
    • The parser can be tested with a push
    • Try parser changes become atomic (they won't affect all trees at once and can ride the trains)
  • SETA analysis can be done in-tree
    • SETA changes can become atomic (they won't affect all trees at once and can ride the trains)
    • We would not need to wait on Buildbot reconfigurations for new changes to be live.
  • Per push scheduling analysis can be done in-tree
    • We currently only schedule jobs for a specific change if files for that product are being touched (e.g. Firefox for Android for mobile/* changes)
  • PGO scheduling can be done in-tree
    • PGO scheduling changes become atomic (they won't affect all trees at once and can ride the trains)
  • Environment awareness hooks (new)
    • If the trees are closed, we can teach the scheduling system to not schedule jobs until further notice
    • If we're backlogged, we can teach the scheduling system to not schedule certain platforms or to schedule a reduced set of jobs or to skip certain pushes
  • Help the transition to TaskCluster
    • Without it we would need to transition builds and associated tests to TaskCluster in one shot (not possible for Talos)
  • Deprecate Self-serve/BuildApi
    • Making changes to BuildApi is very difficult due to the lack of testing environments and set-up burden
    • Moving to the BBB will help us move away from this old system
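
Several of the benefits above (the per-branch matrix, per-push analysis of touched files, and environment awareness) can be pictured as one in-tree decision function. The following is only a sketch: the matrix contents, the file patterns and every function name are hypothetical, and closing a tree is simplified to "schedule nothing" rather than deferring jobs until the tree reopens.

```python
from fnmatch import fnmatchcase

# Hypothetical per-branch matrix: which platforms and jobs run on each tree.
BRANCH_MATRIX = {
    "mozilla-central": {
        "android": ["build", "mochitest"],
        "linux64": ["build", "mochitest", "talos"],
    },
    "try": {
        "android": ["build"],
        "linux64": ["build", "mochitest"],
    },
}

# Hypothetical: platforms whose jobs only run when matching files change.
PLATFORM_FILE_PATTERNS = {"android": ["mobile/*"]}


def platform_is_affected(platform, changed_files):
    """Return True if the push touches files relevant to the platform."""
    patterns = PLATFORM_FILE_PATTERNS.get(platform)
    if patterns is None:
        # No restriction recorded, so the platform always runs.
        return True
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


def jobs_for_push(branch, changed_files, tree_closed=False):
    """Return the (platform, job) pairs to schedule for one push."""
    if tree_closed:
        # Environment awareness hook: schedule nothing while closed.
        return []
    jobs = []
    for platform, job_names in sorted(BRANCH_MATRIX.get(branch, {}).items()):
        if platform_is_affected(platform, changed_files):
            jobs.extend((platform, name) for name in job_names)
    return jobs
```

Because such a function would live in the tree, a change to the matrix or to the file patterns would land, ride the trains and be backed out like any other patch.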
There are various parts that will need to be in place before we can do this. Here are some that I can think of:
  • TaskCluster's big-graph scheduling
    • This is important since it will allow for the concept of coalescing to exist in TaskCluster
  • Task prioritization
    • This is important if we're to have different levels of priority for jobs on TaskCluster
    • On Buildbot we have release repositories with the highest priority and the try repo having the lowest
    • We also currently have the ability to raise or lower task priorities through self-serve/buildapi. This is used by developers, especially on Try, to allow their jobs to be picked up sooner.
  • Treeherder to support LDAP authentication
    • It is a better security model for scheduling changes
    • If we want to move away from self-serve/buildapi we need this
  • Allow test jobs to find installer and test packages
    • Currently test jobs scheduled through the BBB cannot find the Firefox installer and the 
Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!



Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

3 comments:

  1. Will this also enable jobs' taskcluster group and symbol (eg SM(r)) to be specified from within the tree? How about task graphs, like the jobs involved and the edges between them? I would love to be able to create a new pair of jobs, a build job and a dependent test job, entirely within the tree and have them automagically show up on try (esp. if it could be off by default for other people, so I could roll it out gradually) with the right group and symbol on treeherder. Bonus points if I could make them scheduled once a day (or scheduled on every push but have some extra upload-all-the-stuff setting turned on once a day or something.)

    1. Adding new tasks in-tree is already possible today if the tasks you're adding run on TaskCluster (which I think all new tasks should; there is work on adding Windows and later OS X support for TC).
      The in-tree configuration format and associated tools are still a bit rough, but they'll get better over time.

      Ehsan has a post about his experience adding a new build type in-tree:
      http://ehsanakhgari.org/blog/2015-09-29/my-experience-adding-new-build-type-taskcluster

      To me, one of the major benefits is that it'll be possible to refactor test environments in-tree and test changes by pushing to try. No one will need to stand up a staging Buildbot server, and if changes to the test environments are broken they can be backed out like any other patch.

      Note that for some environments, like OS X or Windows, it might not be possible to define the entire test environment in-tree. However, we would still reference the workerType, which defines the environment, in-tree, so that changes ride the trains.

      The experience is of course much better with docker on Linux, as we can reference a docker image in-tree. Hence, we'll rarely have to modify our test machines, so long as we just make a new docker image.

      Greg Arndt is currently working on the idea that we can have the Dockerfile from which a docker image is built in-tree, and then automatically rebuild the docker image on push if the Dockerfile changed. Earlier this year, I successfully built a docker image on TaskCluster, and last quarter support for loading docker images from artifacts was added. So this is all under way, granted there are still a few kinks to work out before this process is perfect. Amongst those kinks is the fact that a Dockerfile isn't necessarily deterministic. I think it was gps who suggested using Debian packages to get this feature, but that's a whole other story we can dream about when the other pieces are in place :)

  2. @armen, I'm not sure how essential the big-graph scheduler is to the concept of coalescing. Last time I discussed it with dustin, different requirements came up, and the implementation might be different.
    Big graph scheduler is still important to me, as it'll be more stable, allow for cross push task dependencies, and be a much more elegant concept in so many ways :)

    Regarding priority, I have so many mixed feelings... I jump from one strong stand to another every time I come across it.
    In short, we have an Azure storage queue for each workerType and each priority level. So workers have a list of queues to pull work from, and the list is ordered by priority. Currently we only have two priority levels, which means each worker has two queues to poll from. This concept works great, but if we increase the number of priority levels, workers will have to poll from a lot of queues (most of them probably empty).
    Conclusion: We can have multiple priority-levels, but it won't scale well if we keep increasing the number of priority-levels.
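
The polling cost can be illustrated with a small in-memory sketch. It is not the TaskCluster queue implementation: the class, its methods and the priority names are hypothetical, and plain deques stand in for the Azure storage queues.

```python
from collections import deque


class QueueSet:
    """In-memory stand-in for one queue per (workerType, priority) pair."""

    def __init__(self, priorities):
        # Priorities are listed from highest to lowest.
        self.priorities = list(priorities)
        self.queues = {}

    def put(self, worker_type, priority, task_id):
        """Append a task to the queue for this workerType and priority."""
        key = (worker_type, priority)
        self.queues.setdefault(key, deque()).append(task_id)

    def claim(self, worker_type):
        """Poll queues in priority order; return (task_id, polls_made)."""
        polls = 0
        for priority in self.priorities:
            polls += 1
            queue = self.queues.get((worker_type, priority))
            if queue:
                return queue.popleft(), polls
        # Every queue was empty: the worker paid one poll per level.
        return None, polls
```

In this model an idle worker makes one poll per priority level on every attempt, so the number of polls grows linearly with the number of levels.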
    Workarounds have been suggested, like making a secondary queue that uses an SQL database and allows for more expressive priority levels...
    Then there is the issue of changing priority levels. I don't want to see task definitions being mutable, as this breaks the idempotency of createTask... And immutability is generally such a nice feature. That said, one might still be able to up-prioritize dynamically without having it reflected in the task definition (granted, it would feel like a hack).
    ---
    On the other side of the priority issue: I don't like solving a lack of capacity with an infinite number of priority levels, as it so easily becomes "a screw with no ending" (a Danish expression that Google translates to "bottomless") :)
    All of that said I don't see an alternative to priority levels... I just wish there was a better solution.
