Over the last few years as a Mozilla Release Engineer, I have completed hundreds of bugs and dozens of large and complex cross-team projects. As time has gone by, I have learned from my mistakes and developed a mental list of steps that I follow in order to complete cross-team projects as effectively as possible and on good terms with the developers involved. I have never attempted to write this down, so I am excited to see what we can glean from this blog post. It may save you from having to build these habits through trial and error over time. Feel free to point out places where I have not figured everything out yet, or where I am wrong.
This post is probably geared towards Release Engineering and the people who work closely with us; however, it may also help you understand what a release engineer has to take into consideration before your jobs can run visibly on tbpl.mozilla.org.
I would like to use a recent project as an example, namely the Android x86 emulator test infrastructure project. It has taken more than two months to get 80% of this project completed. We are currently blocked on issues external to Mozilla related to the actual emulator being unstable.
You should expect the following sections in this post:
Tips
Document as much as possible on the bug
Communicate often with the people you're working with
Make it clear what blocks who
Make it clear what you're working on
Set expectations
Report back when you're not meeting expectations
Ask for help!
If you can pick one of many tasks, ask the dev which one would benefit him/her the most
Mention when you can’t work on the project for a period of time (e.g. PTO, buildduty, release)
Communicate major slowdowns or changes of plans to stakeholders on both sides of the project
File bugs in order to clarify the scope, the dependency and the ownership
Meet the members of the other parties through a video call as soon as possible if you have not met them before, especially if the project is rather complex
It makes it so much easier to understand/know each other and work together
You would have the opportunity to read the non-verbal communication
You become a human being in the eyes of each other, rather than IRC nicknames
Not necessary if you have worked together often and/or the project is very simple
If there’s confusion and/or conflict, schedule another video meeting
Restate in your own words what you are taking away with you from the bugs and email conversations
Keep your word
Do not try to force artificial deadlines
Consult the team when in doubt of which approach to use for a big problem
Do not ask for help if you have not even tried for yourself
If you get stuck, ask for help
Checklist
NOTE: These questions do not apply to every project we take on; however, much of this applies when setting up a new platform.
have you read all bugs with regard to the project?
have you written down the questions that you need to ask the other team?
on which machines is this going to run?
do we have enough capacity?
who else should know about this project? have you got them in the loop?
has the developer verified that his scripts run as expected on one of our machines?
what happens when we run a job twice on the same machine?
has the developer *recently* run *all* types of jobs required for the project?
can the developer set up the job to run multiple types in a row and always get the expected results? (a sketch of this kind of check follows the checklist)
this is a new question that I have not asked in the past; however, it might help spot instability issues early
for instance, we might have been able to catch the QEMU issue on the emulators earlier. Instead, we found it two months in, when we started running the emulator jobs at scale
which artifacts will we need to deploy?
how are you going to distribute the artifacts?
what privacy do those artifacts need?
how often will the artifacts need to be recreated or updated?
what are the expectations? deadlines?
where is the source code?
how long does each test suite take to run?
have you *manually* reproduced the steps specified by the developer?
this will generally be your highest priority and will initially be on your critical path
unless you have figured out *all* of these with him, you may regret not doing so
how does the machine need to be set up? when was the last time that this set up was done?
have you verified the setup steps as instructed by the developer?
have you made the action items clear to each other?
have you found a new blocker? have you discussed it with the developer and does he understand why it is a blocker?
are the blockers clearly filed or specified on the bug?
is there anything particularly different to this project compared to the way we run other projects?
e.g. running four emulators with four different test suites was clearly new
if so, notify those people that might be affected
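To make the questions about re-running jobs concrete (running a job twice on the same machine, running several job types in a row), here is a minimal sketch of the kind of check I have in mind: run a couple of job commands back to back on the same machine, twice over, and fail loudly if the second pass behaves differently. The commands below are placeholders, not the actual project's scripts.

```python
# Hypothetical repeat-run check; the commands are placeholders,
# not the real project's entry points.
import subprocess
import sys

# Several job types, run back to back on the same machine, twice over.
JOBS = [
    ["python", "run_tests.py", "--suite", "mochitest-1"],      # placeholder
    ["python", "run_tests.py", "--suite", "reftest-plain-1"],  # placeholder
]

for round_number in (1, 2):
    for cmd in JOBS:
        print("=== round %d: %s ===" % (round_number, " ".join(cmd)))
        status = subprocess.call(cmd)
        # The second pass should not be affected by leftovers (processes,
        # artifacts, device state) from the first one.
        if status != 0:
            print("Job failed on round %d; the setup is not reliable yet"
                  % round_number)
            sys.exit(status)
print("All jobs passed on both rounds")
```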
I hope this is useful when tackling new Release Engineering projects.
Feel free to read the following case study or skip it completely as it is very long!
regards,
Armen
##############################
NOTE: The following two sections can feel *very* long/boring, as the bug ended up having more than 200 comments and I had to file many, many bugs.
Context of the Android x86 project
First of all, I would like to set the context of this project: I had just come back from three weeks of holidays, I was not mentally prepared to take on an unexpected and large project, I had to catch up with my intern (his internship was ending shortly), I was trying to cover for a co-worker who was taking four weeks of absence, I had been looking forward to working on a different, more exciting project instead, and the number of unforeseen interruptions following my return was very, very high. This is important to keep in mind, as you will notice in this blog post that I made some drastic requests of my managers in order for me to meet expectations.
Sequence of events of the Android x86 project
NOTE: As I did an analysis of my work, I could see where I made mistakes or missed the opportunity to ask the right questions. Unfortunately, these oversights delayed the whole project further down the road.
Quote from gbrown after reading this post: “I thought it was really helpful meeting on Vidyo when we did, and I found your frequent in-bug status updates very useful and re-assuring.”
gbrown (a-team developer) filed the bug on 2013-07-17:
some preliminary scripts had been developed by that time
this bug was initially about preliminary discussions on how to integrate with buildbot and the releng infrastructure
a fair amount of bugs had been filed on the testing and mobile development area (this was the first releng bug)
it seems that the project had been in development for several months (at least since April)
on the same day that the bug got filed, Callek (member of releng - my team) asked me if I would be able to coordinate this.
The next day, I immediately requested a video meeting with gbrown
I knew gbrown from the mobile testing meetings
I used to be the releng liaison for mobile in 2012 and that is where I met him
I had not directly worked with him on any project; however, the times that he spoke at the meetings had left a good impression
I made sure that my calendar was up-to-date before asking him to pick a time
I assigned the bug to me to make it clear that there was an owner on the releng side
Leading up to the meeting (and after it):
I browsed all bugs involved
I read all of them
I cleaned up the dependencies
when the ownership of a bug was not clear, I asked before making changes
I asked questions about bugs where concepts were unclear to me
I tried paraphrasing what I got out of some bugs to make sure that I understood them properly
I clarified some terms and concepts about our infra
Initial meeting:
we decided to meet a few days after our initial contact on the 22nd
this allowed me to catch up on reading the bugs and prepare any questions that I needed to ask
I made it clear from the beginning that I might ask silly questions, since this is an unfamiliar area for me
I needed to know the following:
NOTE: I might have not asked all of these questions on that day, however, I should have asked them
has this been run on one of our releng machines?
where is the source code?
which test suites had been run?
when was the last time that you ran the tests?
which artifacts will I need?
how do you set a releng machine up for this?
how do you run the code?
what are the action items we each have to take?
what are we waiting on each other for?
what is the priority of this project?
how long does each test suite take to run? how does it compare to pandas/tegras?
on which machines are we going to run this?
are you aware of whether or not we have requested the purchase of the machines?
what are the expectations? What are the deadlines?
Post-meeting:
I wrote a summary of the meeting (closing the loop so others can follow)
I wrote down the action items
I wrote down the constraints on the setup (gathered from meeting and other bugs)
I made it clear that I'm the releng owner on the bug by setting the assignee
I made gbrown explicitly accountable by setting a needinfo request to declare that I'm waiting for him on an action item
Started trying to manually reproduce the steps as explained by gbrown
08-02 (2 days after last comment) I made an incorrect proposal about how to distribute the avd artifacts due to my lack of understanding (comment 12)
I double-checked with gbrown before going down that road
I requested for sample avd files to be uploaded somewhere for me to use
I requested a new meeting with gbrown to ask smarter questions
I informed him of the PTO time that I was going to take
I wrote a note on the bug for myself to remember to enable the project on Ash/Cedar so we could see how it would run within the production automation rather than on staging
08-06 - we met again
08-06 - we debugged back and forth concerning the setup steps
08-07 - gbrown provided new avd files; the setup steps are re-adjusted
08-08 - this is the first time that I noticed that compiz was getting in our way towards running this reliably (comment 30):
I was not fully aware at that point that it was a blocker
gbrown believed that staggering the emulators was what prevented this from happening
later down the road we discovered that we had to kill compiz before starting any emulators
we didn't know how to debug why this was happening
08-08 - I asked gbrown for feedback on my patch
08-09 - I asked aki for feedback for the first time; I was doing things significantly differently than anything we had seen before in our code. I had to make sure that it wouldn't be negatively reviewed too late into the process
08-09 - I created a repo where my changes would be pushed to, and I notified dminor and gbrown of this
08-09 - I asked gbrown which test suites we were going to run
08-16 - compiz was preventing us from running reliably on staging (comment 54)
I didn't know yet that compiz was at fault
I mentioned to gbrown that it was a blocker for me
I made a reference to comment 30 where we first spoke about it
I attached a log to facilitate gbrown's expert eyes
I also mentioned that I would continue working on other things that were not on the critical path
it was only at this point, when I tried to run many suites repeatedly through buildbot, that I saw that the setup was not yet reliable
08-16 - every new patch had a comment logging what gets fixed
08-20 - I notified gbrown that I wouldn't be able to do any more work for a week as I was on buildduty
I attached my latest code in case he had time to make changes on his own (or any other person)
this gave him and his manager a chance to discuss if they would need to request someone else to sub for me
08-25
I read an IRC comment that made me realize that Android x86 was more important than I had initially thought
By then, I had mentioned to coop (my manager) that I needed to focus more on this project
We discussed that adding Callek to the project and starting a "project mode" would help us
08-26
I gave a heads-up to sheriffs about running 4 jobs in 1
This was important as they sheriff tbpl, and running jobs this way was different than any other platform we had seen before (except mochitest-other)
Thread: "Reporting 4 jobs in 1"
Reply: "I don't have a strong preference at this point in time - however we'll just need to bear in mind that TBPL doesn't support splitting one job up into several, so we'll need to handle this similar to mochitest-other (ie: ideally TinderboxPrinting pass/fail counts for each sub-suite & ensuring it's clear what's failed where). "
08-26
Callek and I met to discuss the project and split the work
We created an etherpad with clear action items for each other
We discussed back and forth all the gotchas and the different approaches for our deployment plans
Every day afterwards, we scratched off our action items on the etherpad
Callek helped me for a couple of weeks with deploying the avd artifacts through Puppet
08-26
I emailed blassey (mobile manager) to clarify what the expectations were since it had been over a month since we last spoke
I stated that I knew this project was important in order to prevent regressions on Android x86; however, I knew that it was _not_ a release blocker
We could still have released Firefox for Android x86 without the automated tests
I explicitly listed compiz as my main worry concerning whether or not we could complete this project on time
08-26 - I had an IRC conversation with Callek that made me believe that rail (another releng member) might know more about compiz with respect to the linux64 machines, as he had developed some puppet code for it (comment 68)
I added rail to the bug and asked him for his help
I gave him a brief description of what we were doing so he could grasp what had happened in the last 60+ comments and what I specifically needed help with
08-27 - First time that I ran tests on two different emulators (comment 76)
I still used a separate script to launch the emulators
I deferred including this code into my scripts in order to avoid departing from gbrown's original setup
This would ensure that I got to a working state first rather than rewriting from scratch
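To give an idea of what launching the emulators involved at this stage: kill compiz first (the blocker from comment 30), then start each emulator on its own console port with a staggered delay, and wait for them to boot. This is a simplified sketch with made-up paths, AVD names and timings, not gbrown's actual script.

```python
# Simplified sketch: the emulator path, AVD names, and timings are made up.
import subprocess
import time

EMULATOR = "/builds/android-sdk/tools/emulator"  # illustrative path
AVD_NAMES = ["test-1", "test-2", "test-3", "test-4"]
BASE_PORT = 5554  # emulator console ports go up in steps of two

# compiz interfered with the emulators on the linux64 test machines,
# so it gets killed before any emulator starts.
subprocess.call(["pkill", "compiz"])

processes = []
for i, avd in enumerate(AVD_NAMES):
    port = BASE_PORT + 2 * i
    cmd = [EMULATOR, "-avd", avd, "-port", str(port)]
    processes.append(subprocess.Popen(cmd))
    # Stagger the startups; launching all four at once was less reliable.
    time.sleep(30)

# A real script would poll something like
# "adb -s emulator-<port> shell getprop sys.boot_completed" here,
# instead of assuming the emulators are ready.
```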
08-29 - By this day, Callek had deployed all the artifacts that I needed across the Linux 64-bit pool of machines
Around this time, we decided to let Callek go back to his other projects as the urgency of Android x86 was clearly set by blassey and we had things under control
09-03
I spotted an issue in mozharness where ADB interactions were not integrated into the logs
I discussed with aki what was needed to fix this
I didn't touch it as it was not important at that moment
09-03
my work was very close to completion and I asked dminor and gbrown for feedback
this gave them time to get back to me
this gave me time to continue to clean up the patch and deal with minor issues
I was not blocked on their feedback since I left a lot of small non-critical tasks to deal with later on in my development
I got gbrown's feedback on the same day!
09-05 - aki gave me a big congratulations and a big r+
"Awesome work, Armen! I'm pretty impressed you got parallel processes working..."
These kinds of compliments are very valuable to me; it's great to see them, especially coming from aki since he has so much more experience than I do
aki nevertheless pointed out the unit tests that my code broke
I made sure that I regained the habit of running them before my next review
Getting ready for reporting inside of tbpl
Running at scale:
Once I had the 5 different sets of test jobs running on tbpl and had many machines to run the jobs, I could start seeing a lot of intermittent oranges
It is important to point out that going from a proof of concept to a production-capable releng script took close to a month and a half
You will now notice the large number of issues that were only discovered once we started running things at scale on tbpl
You will also notice how many adjustments gbrown and I made to deal with all sorts of permanent test oranges as well as intermittent test oranges
Running a job a few dozen times locally or even on staging does not guarantee readiness for running at scale
Any of the following minor issues are major for sheriffs when running at scale and trying to find regressions
Read this wiki to understand what the expectations for sheriffs are:
Not setting minidump stackwalk path properly (comment 103)
This is used when Fennec crashes so that we can see the crash stack with symbols
If we wanted to run the jobs at scale we had to fix this
This was not a component of our infra that I understood (I kind of understand it now)
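For readers unfamiliar with this piece of the infrastructure: the test harness needs to be told where the minidump_stackwalk binary lives (via an environment variable) so that, when Fennec crashes, the minidump can be turned into a readable, symbolicated stack. A rough sketch of the kind of setup that was missing, with an illustrative binary path and a placeholder harness command:

```python
# Rough sketch; the binary path and harness command are illustrative, and the
# real mozharness configuration handles this differently.
import os
import subprocess

env = os.environ.copy()
# Tell the harness where minidump_stackwalk lives so crashes get symbolicated
# instead of showing up as bare minidump files.
env["MINIDUMP_STACKWALK"] = "/builds/minidump_stackwalk"

subprocess.call(["python", "run_tests.py", "--suite", "reftest-1"], env=env)
```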
What happens when we run a job twice on the same machine?
What sort of artifacts would we have to clobber?
"gbrown: should I delete ~/.android/avds and unpack clean templates before each run?"
Timeouts
Logs
""Output exceeded 52428800 bytes, remaining output has been truncated"
This happens when all of the emulators do not actually fail right away."
"Updating the maxLogSize is very unwanted since it affects the performance of the masters."
After reporting the issue I also established various strategies on how we could mitigate it
I also mentioned that I would not deal with it this time around as there were other more important items
Lots of back/forth
gbrown and I worked comment after comment on every single issue we noticed
sometimes we missed a parameter to one of the test suites
sometimes we had to file bugs to fix the test failures
fixed relative paths VS absolute paths
android manifest modifications
Status summaries
After a bunch of comments I would re-summarize what the heck was going on
Unless it was clear where we were, we could easily have lost track of who was supposed to be fixing what
Crash dumps not working for Android x86 (09-16)
TryChooser (09-12)
"put complexity behind simplicity"
instead of having to have a special TrySyntax for Android x86 we decided to make it follow the same syntax as other platforms
instead of saying "androidx86-set-1" a developer could say "reftest-plain-1"
this was a new feature for TryChooser
the original writer of TryChooser was not in the releng team anymore (this was challenging for me)
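The gist of "put complexity behind simplicity" here: developers keep writing the per-suite names they already know (e.g. reftest-plain-1) and the scheduling code figures out which combined Android x86 job set actually contains that suite. The following is only a hypothetical illustration of that mapping with made-up set contents, not the actual TryChooser change:

```python
# Hypothetical illustration of mapping requested suites to combined job sets;
# the set contents are made up and the real TryChooser code is more involved.
ANDROID_X86_SETS = {
    "androidx86-set-1": ["mochitest-1", "reftest-plain-1", "reftest-plain-2"],
    "androidx86-set-2": ["mochitest-2", "reftest-plain-3", "crashtest-1"],
}

def sets_for_requested_suites(requested):
    """Return the combined job sets that cover the suites a try push asked for."""
    needed = set()
    for job_set, suites in ANDROID_X86_SETS.items():
        if any(suite in requested for suite in suites):
            needed.add(job_set)
    return sorted(needed)

# A developer asking for "reftest-plain-3" gets androidx86-set-2 scheduled,
# which also runs the other suites bundled in that set.
print(sets_for_requested_suites(["reftest-plain-3"]))
```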
Seeing the light (09-20)
Once I saw the bugs line up, I sent an email to gbrown and krudnitski (mobile product manager) to let them know where we were and asked them if the importance of the project had changed since the last time we spoke
I also asked them if I could slow down so that I could focus on some Summit prep work that I needed to do
I gave them two possible schedules, depending on whether they still needed me to continue at full steam
I heard back and they told me to slow down if needed
Running across the board
Weird spot
We tried to complete the project before the end of the quarter
Unfortunately, when running at scale we discovered the bug inside the emulator and all of our progress slowed to a crawl
At this point, I decided to pass the bugs (including the tracking bug) to gbrown as most remaining bugs were on the testing and emulator side
I wrote a status update on the main bugs and a whiteboard status pointing to the right comment #
Notify stakeholders (09-30)
As soon as I saw our project slowing down due to external factors, I had to notify the stakeholders so they would not hear about it late
I gave a very high-level picture and pointers to the specific blocker (e.g. the emulator bug)
I notified both of my managers as well as blassey and krudnitski
I made sure that there were no releng bugs blocking gbrown
As of today (10-17)
We are still working out the emulator bugs and other minor bugs
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.