Armen Zambrano's battlefield: The benefits of moving per-push Buildbot scheduling into the tree

Thursday, September 10, 2015

The benefits of moving per-push Buildbot scheduling into the tree

Some of you may be aware of the Buildbot Bridge (aka BBB) work that bhearsum worked on during Q2 of this year. This system allows scheduling TaskCluster graphs for Buildbot builders. For every Buildbot job, there is a TaskCluster task that represents it.
This is very important as it will help us transition the release process to TaskCluster piece by piece, without having to move large pieces of code at once. You can have graphs of

I recently added to Mozilla CI tools the ability to schedule Buildbot jobs by submitting a TaskCluster graph (the BBB makes this possible).

Even though the initial work for the BBB is intended for Release tasks, I believe there are various benefits if we moved the scheduling into the tree (currently TaskCluster works like this; look for the gecko decision task in Treeherder).

To read another great blog post about try syntax and scheduling, please visit ahal's post "Looking beyond Try Syntax".

NOTE: Try scheduling might not have try syntax in the future so I will not talk much about trychooser and try syntax. Read ahal's post to understand a bit more.

Benefits of in-tree scheduling:

Per-branch scheduling matrix can be done in-tree

We can define which platforms and jobs run on each tree
TaskCluster tasks already do this
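
As a rough illustration of what an in-tree, per-branch matrix could look like, here is a minimal sketch. The data layout, the branch, platform and job names, and the `jobs_for_branch` helper are all hypothetical; this is not how TaskCluster or Buildbot actually store it.

```python
# Hypothetical in-tree scheduling matrix: branch -> platform -> job names.
# None of these names come from the real Buildbot or TaskCluster configs.
SCHEDULING_MATRIX = {
    "mozilla-central": {
        "linux64": ["build", "mochitest", "reftest"],
        "android": ["build", "robocop"],
    },
    "try": {
        "linux64": ["build", "mochitest"],
    },
}


def jobs_for_branch(branch, matrix=SCHEDULING_MATRIX):
    """Return the sorted (platform, job) pairs that should run on a branch.

    Unknown branches get an empty list, i.e. nothing is scheduled.
    """
    platforms = matrix.get(branch, {})
    return sorted(
        (platform, job)
        for platform, jobs in platforms.items()
        for job in jobs
    )


if __name__ == "__main__":
    for platform, job in jobs_for_branch("try"):
        print(platform, job)
```

Because the matrix would live in the tree, a change to it would land, get backed out and merge between branches like any other patch.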

Accurate Treeherder job visualization

Currently, jobs that run through Buildbot do not necessarily show up properly
Jobs run through TaskCluster show up accurately
This is due to some issues with how Buildbot jobs are represented between states and the difficulty of relating them to each other
It could be fixed but it is not worth the effort if we're transitioning to TaskCluster

Control when non-green jobs are run

Currently on try we can't say "run all unit test jobs *but* the ones that should not run by default"
We would save resources (non-green jobs would not run) and avoid confusion for developers (they would not have to ask why a job is non-green)

The try syntax parser can be done in-tree

This allows for improving and extending the try parser
Unit tests can be added
The parser can be tested with a push
Try parser changes become atomic (they won't affect all trees at once and can ride the trains)

SETA analysis can be done in-tree

SETA changes can become atomic (they won't affect all trees at once and can ride the trains)
We would not need to wait on Buildbot reconfigurations for new changes to be live.

Per push scheduling analysis can be done in-tree

We currently only schedule jobs for a specific change if files for that product are being touched (e.g. Firefox for Android for mobile/* changes)
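
A minimal sketch of this kind of per-push analysis follows. The `PRODUCT_PATHS` mapping and the `products_touched` function are made up for illustration, and the rule "a file outside every known prefix schedules all products" is my assumption rather than a description of what Buildbot does today.

```python
# Hypothetical mapping from product to the path prefixes that belong to it.
PRODUCT_PATHS = {
    "firefox-android": ("mobile/",),
    "firefox-desktop": ("browser/",),
}


def products_touched(changed_files, product_paths=PRODUCT_PATHS):
    """Return the set of products that need jobs for a push.

    Assumption: a file that matches no product prefix is treated as shared
    code, so every product gets scheduled.
    """
    touched = set()
    for path in changed_files:
        matched = False
        for product, prefixes in product_paths.items():
            # str.startswith accepts a tuple of prefixes.
            if path.startswith(prefixes):
                touched.add(product)
                matched = True
        if not matched:
            return set(product_paths)
    return touched


if __name__ == "__main__":
    print(products_touched(["mobile/android/base/Foo.java"]))
```

With the analysis in-tree, the prefix list could be reviewed and changed alongside the code it describes.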

PGO scheduling can be done in-tree

PGO scheduling changes become atomic (they won't affect all trees at once and can ride the trains)

Environment awareness hooks (new)

If the trees are closed, we can teach the scheduling system to not schedule jobs until further notice
If we're backlogged, we can teach the scheduling system to not schedule certain platforms or to schedule a reduced set of jobs or to skip certain pushes
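
To make the idea concrete, here is a small sketch of what such a hook could look like. The function name, its parameters and the (platform, job) tuples are all hypothetical; how the scheduler would actually learn about tree closures or backlog is left out.

```python
def apply_environment_hooks(jobs, tree_open, backlogged,
                            low_priority_platforms=frozenset()):
    """Filter a list of (platform, job) pairs based on the environment.

    All inputs are assumed to be provided by the caller:
    - tree_open: False while the tree is closed, so nothing is scheduled.
    - backlogged: True when we should drop the low-priority platforms.
    """
    if not tree_open:
        return []
    if backlogged:
        return [job for job in jobs if job[0] not in low_priority_platforms]
    return list(jobs)


if __name__ == "__main__":
    jobs = [("linux64", "build"), ("android", "build")]
    print(apply_environment_hooks(jobs, tree_open=True, backlogged=True,
                                  low_priority_platforms={"android"}))
```

Skipping whole pushes when backlogged would need state across pushes, which this sketch does not attempt.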

Help the transition to TaskCluster

Without it we would need to transition builds and associated tests to TaskCluster in one shot (not possible for Talos)

Deprecate Self-serve/BuildApi

Making changes to BuildApi is very difficult due to the lack of testing environments and set-up burden
Moving to the BBB will help us move away from this old system

There are various parts that will need to be in place before we can do this. Here's some that I can think of:

TaskCluster's big-graph scheduling

This is important since it will allow for the concept of coalescing to exist in TaskCluster

Task prioritization

This is important if we're to have different levels of priority for jobs on TaskCluster
On Buildbot we have release repositories with the highest priority and the try repo having the lowest
We also currently have the ability to raise or lower task priorities through self-serve/buildapi. This is used by developers, especially on Try, to allow their jobs to be picked up sooner.
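
The behaviour described above can be sketched with a plain priority queue. This is only an illustration: the repository names, the numeric levels and the `adjustment` parameter are assumptions of mine, and it does not reflect how Buildbot or the TaskCluster queue implement priorities.

```python
import heapq
import itertools

# Hypothetical base priorities per repository; a lower number is picked first.
REPO_PRIORITY = {"mozilla-release": 0, "mozilla-central": 1, "try": 2}

# Tie-breaker so jobs with equal priority come out in insertion order.
_counter = itertools.count()


def push_job(queue, repo, job, adjustment=0):
    """Queue a job; a negative adjustment raises its priority."""
    base = REPO_PRIORITY.get(repo, max(REPO_PRIORITY.values()))
    heapq.heappush(queue, (base + adjustment, next(_counter), job))


def pop_job(queue):
    """Return the name of the next job to run."""
    return heapq.heappop(queue)[2]


if __name__ == "__main__":
    queue = []
    push_job(queue, "try", "try-build")
    push_job(queue, "mozilla-release", "release-build")
    push_job(queue, "try", "urgent-try-build", adjustment=-3)
    while queue:
        print(pop_job(queue))
```

The `adjustment` argument stands in for what developers do today through self-serve when they bump a Try job.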

Treeherder to support LDAP authentication

It is a better security model for scheduling changes
If we want to move away from self-serve/buildapi we need this

Allow test jobs to find installer and test packages

Currently test jobs scheduled through the BBB cannot find the Firefox installer and the test packages

Can you think of other benefits? Can you think of problems with this model? Are you aware of other pieces needed before moving forward to this model? Please let me know!

This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

3 comments:

Steve Fink, 14 Sept 2015, 12:53:00
Will this also enable jobs' taskcluster group and symbol (eg SM(r)) to be specified from within the tree? How about task graphs, like the jobs involved and the edges between them? I would love to be able to create a new pair of jobs, a build job and a dependent test job, entirely within the tree and have them automagically show up on try (esp. if it could be off by default for other people, so I could roll it out gradually) with the right group and symbol on treeherder. Bonus points if I could make them scheduled once a day (or scheduled on every push but have some extra upload-all-the-stuff setting turned on once a day or something.)
Jonas Finnemann Jensen, 17 Oct 2015, 18:37:00
@armen, I'm not sure how essential the big-graph scheduler is to the concept of coalescing. Last I discussed it with dustin, different requirements came up and the implementation might be different.
Big graph scheduler is still important to me, as it'll be more stable, allow for cross push task dependencies, and be a much more elegant concept in so many ways :)

Regarding priority, I have so many mixed feelings... I jump from one strong stand to another every time I come across it.
In short, we have an Azure storage queue for each workerType and each priority level. So workers have a list of queues to pull work from, and the list is ordered by priority. Currently we only have two priority levels, which means each worker has two queues to poll from. This concept works great, but if we increase the number of priority levels, workers will have to poll from a lot of queues (most of them probably empty).
Conclusion: We can have multiple priority-levels, but it won't scale well if we keep increasing the number of priority-levels.
Workarounds have been suggested, like making a secondary queue that uses an SQL database and allows for more expressive priority levels...
Then there is the issue of changing priority levels. I don't want to see task definitions being mutable, as this breaks idempotency of createTask... And immutability is generally such a nice feature. That said, one might still be able to up-prioritize dynamically without having it reflected in the task definition (granted, it would feel like a hack).
---
On the other side of the priority issue: I don't like solving lack of capacity with an infinite number of priority levels, as it so easily becomes "a screw with no ending" -- that's a Danish expression that google translates to "bottomless" :)
All of that said I don't see an alternative to priority levels... I just wish there was a better solution.