Armen Zambrano's battlefield: October 2013

Over the last few years as a Mozilla Release Engineer, I have completed hundreds of bugs and dozens of large and complex cross-team projects. As time has gone by, I have gained experience from my mistakes and I have developed a mental list of steps that I follow in order to complete cross-team projects as effectively as possible and on good terms with the developers involved. I have never attempted to write this down, so I am excited to see what we can glean from this blog post. This may save you from having to build these habits through trial and error over time. Feel free to point out places where I have not figured everything out yet, or where I am wrong.

This post is probably geared towards Release Engineering and people who work closely with us. However, it may also be valuable for understanding what a release engineer has to take into consideration before running your jobs visibly on tbpl.mozilla.org.

I would like to use a recent project as an example, namely the Android x86 emulator test infrastructure project. It has taken more than two months to get 80% of this project completed. We are currently blocked on issues external to Mozilla with regards to the actual emulator being unstable.

You should expect the following sections in this post:

Tips
Checklist
Context of the Android x86 project
Sequence of events of the Android x86 project

Tips

Document as much as possible on the bug

this makes it easier for external people to follow along

Communicate often with the people you're working with

Make it clear what blocks who
Make it clear what you're working on
Set expectations
Report back when you're not meeting expectations
Ask for help!
If you can pick one of many tasks, ask the dev which one would benefit him/her the most
Mention when you can’t work on the project for a period of time (e.g. PTO, buildduty, release)

Communicate major slow downs or change of plans to stakeholders on both sides of the project

Your own managers
The managers of the other developer

File bugs in order to clarify the scope, the dependency and the ownership

Make sure that the dependencies are logical (rather than just adding all bugs to the tracking bugs)

Meet the members of the other parties through a video call as soon as possible if you have not met them before, especially if the project is rather complex

It makes it so much easier to understand/know each other and work together
You would have the opportunity to read the non-verbal communication
You become a human being in the eyes of each other, rather than IRC nicknames
Not necessary if you have worked together often and/or the project is very simple

If there’s confusion and/or conflict, schedule another video meeting
Restate in your own words what you are taking away with you from the bugs and email conversations

This allows the other developer to debug your understanding

Keep your word

It builds trust

Do not try to force artificial deadlines

You ruin the trust you have gained

Consult the team when in doubt of which approach to use for a big problem
Do not ask for help if you have not even tried for yourself

This builds a reputation that you won’t try to take other people’s time for granted

If you get stuck, ask for help

Restate what you’re trying to solve and why it is important

Checklist

NOTE: These questions do not apply to every project we take on; however, many of them apply when setting up a new platform.

have you read all bugs with regard to the project?
have you written down the questions that you need to ask the other team?
on which machines is this going to run?
do we have enough capacity?
who else should know about this project? have you got them in the loop?

IT? (more machines, method of deployment of artifacts)
A-team?
Sheriffs?
Your manager(s)?
Your own team?

has the developer verified that their scripts run as expected on one of our machines?

their local machine does not count
loan them a machine if needed

what happens when we run a job twice on a machine?

what artifacts do we need to clobber?
test files and application files always get clobbered

has the developer *recently* run *all* types of jobs required for the project?

not just a handful; must be all and recently

can the developer set up the job to run multiple types in a row and always get the expected results?

this is a new question that I have not asked in the past; however, it might be helpful to spot instability issues early
for instance, we might have been able to catch the QEMU issue on the emulators earlier. Instead, we found it after two months, when we started running the emulator jobs at scale

which artifacts will we need to deploy?

e.g. the android sdk
e.g. the android emulator template definitions

how are you going to distribute the artifacts?

through puppet?
from tooltool?
from in-tree?
will we need to build it on the build machines and upload it to ftp?

what privacy do those artifacts need?

public/behind LDAP/VPN only

how often will the artifacts need to be recreated or updated?

do we have documentation about it?

what are the expectations? deadlines?
where is the source code?

put it on the bug or a link to the public repo

how long does each test suite take to run?

this is important to know in order to help capacity planning as well as planning how much we need to chunk the suites

have you *manually* reproduced the steps specified by the developer?

this will generally be your highest priority and will initially be on your critical path
unless you have worked through *all* of the steps with the developer, you may regret it later

how does the machine need to be set up? when was the last time that this set up was done?

request *recently* *verified* step-by-step instructions on a *clean* machine

have you verified the setup steps as instructed by the developer?
have you made the action items clear to each other?
have you found a new blocker? have you discussed it with the developer and does he understand why it is a blocker?
are the blockers clearly filed or specified on the bug?
is there anything particularly different to this project compared to the way we run other projects?

e.g. running four emulators with four different test suites was clearly new
if so, notify those people that might be affected
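
The checklist question about how long each test suite takes can be turned into simple arithmetic. What follows is a minimal sketch, not something taken from the project: the function names are hypothetical, the suite durations in the usage note are made up, and the 120-minute limit is only the "2 hours (I think)" figure mentioned later in this post.

```python
import math


def chunks_needed(suite_minutes, max_job_minutes, slowdown=1.0):
    """Return how many chunks keep every chunk under the job time limit.

    `slowdown` scales a run time measured on another platform, e.g. 1.3
    if the new platform is assumed to be about 30% slower.
    """
    if suite_minutes < 0 or max_job_minutes <= 0 or slowdown <= 0:
        raise ValueError("durations and slowdown must be positive")
    scaled = suite_minutes * slowdown
    # Even an empty suite still occupies one job.
    return max(1, math.ceil(scaled / max_job_minutes))


def machine_hours_per_push(suite_minutes_by_name, slowdown=1.0):
    """Rough machine time (in hours) consumed by one full run of all suites."""
    return sum(suite_minutes_by_name.values()) * slowdown / 60
```

For example, with invented numbers, a suite that takes 300 minutes against a 120-minute limit needs chunks_needed(300, 120) chunks, which is 3. This does not replace measuring real run times; it only shows why those measurements matter for capacity planning.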

I hope this is useful when tackling new Release Engineering projects.

Feel free to read the following case study or skip it completely as it is very long!

regards,

Armen

##############################

NOTE: The following two sections can feel *very* long/boring, as the bug ended up having more than 200 comments and I had to file many many bugs.

Context of the Android x86 project

First of all, I would like to set the context of this project: I had just come back from three weeks of holidays, I was not mentally prepared to take on an unexpected and large project, I had to catch up with my intern (his internship was ending shortly), I was trying to cover for a co-worker who was taking four weeks of absence, I had been looking forward to working on a different, more exciting project instead, and the number of unforeseen interruptions following my return was very, very high. This is important to have in mind, as you will notice in this blogpost that I made some drastic requests of my managers in order for me to meet expectations.

Sequence of events of the Android x86 project

NOTE: As I did an analysis of my work, I could see where I made mistakes or missed the opportunity to ask the right questions. Unfortunately, these oversights delayed the whole project further down the road.

Quote from gbrown after reading this post: “I thought it was really helpful meeting on Vidyo when we did, and I found your frequent in-bug status updates very useful and re-assuring.”

gbrown (a-team developer) filed on 2013-07-17 the bug:

some preliminary scripts had been developed by that time

they were not yet attached to the bug

this bug was initially about preliminary discussions on how to integrate with buildbot and the releng infrastructure
a fair number of bugs had been filed on the testing and mobile development area (this was the first releng bug)
it seems that the project had been in development for several months (at least since April)

dminor (A-team member) had created some scripts and had asked for feedback from aki (member of releng - my team)

on the same day that the bug got filed, Callek (member of releng - my team) asked me if I would be able to coordinate this.

Callek has been working with releng for more than a year and I have a strong trust and respect towards him. If he was asking me to pick it up, it was probably because he was very entangled with other projects and could make use of my help.

The next day, I immediately requested a video meeting with gbrown

I knew gbrown from the mobile testing meetings
I used to be the releng liaison for mobile in 2012 and that is where I met him
I had not directly worked with him in any project; however, the times that he spoke at the meetings had left a good impression
I made sure that my calendar was up-to-date before asking him to pick a time
I assigned the bug to me to make it clear that there was an owner on the releng side

Until the meeting (and post the meeting):

I browsed all bugs involved
I read all of them
I cleaned up the dependencies
when the ownership of a bug was not clear, I asked before making changes
I asked questions concerning bugs where there were concepts unclear to me
I tried paraphrasing what I got out of some bugs to make sure that I understood them properly
I clarified some terms and concepts about our infra

e.g. mozpool is only used for panda boards (not for tegras or anything else)
e.g. foopies will not be necessary to run this project

foopies are *only* used for tegra boards

Initial meeting:

we decided to meet a few days after our initial contact on the 22nd
this allowed me to catch up on reading the bugs and prepare any questions that I needed to ask
I made it clear from the beginning that I might ask silly questions, since this is an unfamiliar area for me
I needed to know the following:

NOTE: I might not have asked all of these questions on that day; however, I should have asked them
has this been run on one of our releng machines?

gbrown had already received one and had done so

where is the source code?

gbrown had to clean the scripts and get them to me from local copies

which test suites had been run?
when was the last time that you ran the tests?
which artifacts will I need?

e.g. the android sdk
e.g. the android emulator template definitions

how do you set a releng machine up for this?

gbrown had filed a bug with detailed documentation; his attention to small details paid off down the road

how do you run the code?
what are the action items we each have to take?
what are we waiting on each other for?
what is the priority of this project?
how long does each test suite take to run? how does it compare to pandas/tegras?

this is important in order to anticipate the load

on which machines are we going to run this?
are you aware of whether or not we have requested the purchase of the machines?

the purchase was under way; I did not need to be involved with this

what are the expectations? What are the deadlines?

Post-meeting:

I wrote a summary of the meeting (closing the loop so others can follow)
I wrote down the action items

"gbrown has some mozharness scripts and configs that he's going to get me"
"I can try to integrate it to my staging master with an iX box"

I wrote down the constraints on the setup (gathered from meeting and other bugs)

"We can't do this testing on EC2 due to some issues with OpenGL which crashes."

I wrote this down since I knew that people joining the bug might ask this question; it was also likely that they had not read all the other bugs

"Run times are a bit slower (20-30% slower) than on Pandas and Tegras."

this is important since we have to keep a rough understanding of how many machines we will need for this project

I made it clear that I'm the releng owner on the bug by setting the assignee

this makes it easier for stakeholders to know that it is a project that I'm clearly committed to
it also confirms to gbrown that I'm serious about helping him

I made gbrown explicitly accountable by setting a needinfo request to declare that I'm waiting for him on an action item

Set flag: needinfo?(:gbrown) - "Adding needinfo to keep track that I need the scripts."
This also helps gbrown to not lose the action item since it will show up on his bugzilla dashboard (rather than writing down the action item as a comment)

The dashboard is not necessarily something that every developer uses
Comments can be missed; flags remain until they are cleared
The needinfo request can be seen by others even if they have not read all comments

Started trying to reproduce manually the steps as explained by gbrown

07-24 (7 days after bug creation)

I spoke with my manager and asked to drop something from my plate so I could spend more time on the Android x86 project
He agreed
My performance on Android x86 started picking up

07-31 (9 days after meeting with gbrown) missing adb (comment 10)

I needed to determine either how to install it on my machine (deploy a new artifact) or if it is already installed somewhere on the linux64 machines
gbrown clarified it right away

08-02 (2 days after last comment) I made an incorrect proposal about how to distribute the avd artifacts due to my lack of understanding (comment 12)

I double-checked with gbrown before going down that road
I requested for sample avd files to be uploaded somewhere for me to use

gbrown provided them on the same day

I requested a new meeting with gbrown to ask smarter questions
I informed him of PTO time that I'm going to take

I didn’t want him to be frustrated if he was unable to find me

I wrote a note on the bug for myself to remember to enable the project on Ash/Cedar so we could see how it would run within the production automation rather than on staging

This is important since we don't have tbpl support for staging :(

08-02 - comment 13

I notified him of differences between what I was seeing and the set of instructions he had described when he set his releng machine up a while back
Android had released a newer version of the SDK by that time

08-06 - we met again

I don't remember what we spoke about; however, I noticed that I sped up with regard to making comments on the bug

08-06 - we debugged back and forth concerning the setup steps

08-07 - gbrown provided new avd files; the setup steps are re-adjusted

08-08 - this was the first time that I noticed that compiz was getting in our way towards running this reliably (comment 30):

I was not fully aware at that point that it was a blocker
gbrown believes that staggering the emulators was what prevented this from happening
later down the road we discovered that we have to kill compiz before starting any emulators
we don't know how to debug why this was happening

08-08 - I asked gbrown for feedback on my patch

that was the first of many patches to come; however, I left comments in the code so that he could help me clarify them directly in the code, to ensure that I was on the right track

08-09 - I asked aki for feedback for the first time; I was doing things significantly differently than anything we had seen before in our code. I had to make sure that it wouldn't be negatively reviewed too late into the process

aki did not complain about my approach

08-09 - I created a repo where my changes would be pushed to, and I notified dminor and gbrown of this

this ensured that they would be able to test my code and contribute to my development if they wanted to, without having to re-attach my patches every day

08-09 - I asked gbrown which test suites we were going to run

I should have asked this way earlier in our initial meetings
It did not delay the project

08-16 - compiz was preventing us from running reliably on staging (comment 54)

I didn't know yet that compiz was at fault
I mentioned to gbrown that it was a blocker for me

We couldn't have unreliable jobs on tbpl

I made a reference to comment 30 where we first spoke about it
I attached a log to facilitate gbrown's expert eyes
I also mentioned that I will continue working on other things that were not on the critical path

getting the emulators started and killed reliably was my main priority as it is common to all test suites
spotting unreliable issues early is my main duty

not raising these issues early enough would have derailed the whole project

it is only now, when I try to run many suites repeatedly through buildbot, that I see that the setup is not yet reliable

08-14

my intern left and his project finally stuck in production
I had more time for the Android x86 project

08-16 - every new patch had a comment logging what gets fixed

this became a changelog for all parties

08-20 - I notified gbrown that I wouldn't be able to do any more work for a week as I was on buildduty

I attached my latest code in case he had time to make changes on his own (or any other person)
this gave him and his manager a chance to discuss if they would need to request someone else to sub for me

08-25

I read an IRC comment that made me realize that Android x86 was more important than I had initially thought
By then, I had mentioned to coop (my manager) that I needed to focus more on this project
We discussed that adding Callek to the project and starting a "project mode" would help us

"project mode" allowed us to disconnect from IRC and even bugmail for long periods of time to work on just one project

08-26

I gave a heads-up to sheriffs about running 4 jobs in 1
This was important as they sheriff tbpl, and running jobs this way was different than any other platform we had seen before (except mochitest-other)
Thread: "Reporting 4 jobs in 1"
Reply: "I don't have a strong preference at this point in time - however we'll just need to bear in mind that TBPL doesn't support splitting one job up into several, so we'll need to handle this similar to mochitest-other (ie: ideally TinderboxPrinting pass/fail counts for each sub-suite & ensuring it's clear what's failed where). "
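
The reply above asks for pass/fail counts per sub-suite when several suites report inside one job. Here is a minimal sketch of what that could look like. The "TinderboxPrint:" prefix comes from the reply itself; the text after the prefix, the function names and the shape of the results dictionary are assumptions made for illustration, not the format the real harnesses emitted.

```python
def tinderbox_summary(results):
    """Build one summary line per sub-suite.

    `results` maps a sub-suite name to a (passed, failed) tuple.
    The layout after the "TinderboxPrint:" prefix is an assumption.
    """
    lines = []
    for suite in sorted(results):
        passed, failed = results[suite]
        lines.append(
            "TinderboxPrint: %s: %d passed, %d failed" % (suite, passed, failed)
        )
    return lines


def overall_ok(results):
    """The combined job is only green if every sub-suite had zero failures."""
    return all(failed == 0 for _passed, failed in results.values())
```

Printing these lines at the end of the combined job would make it clear which sub-suite failed, even though the job itself is reported as a single result.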

08-26

Callek and I met to discuss the project and split the work
We created an etherpad with clear action items for each other
We discussed back and forth all the gotchas and the different approaches for our deployment plans
Every day afterwards, we scratched our action items on the etherpad
Callek helped me for a couple of weeks with deploying the avd artifacts through Puppet

08-26

I emailed blassey (mobile manager) to clarify what the expectations were since it had been over a month since we last spoke
I stated that I knew that this project was important in order to prevent regressions to Android x86, however, I knew that it was _not_ a release blocker
We could still have released Firefox for Android x86 without the automated tests
I explicitly listed compiz as my main worry concerning whether or not we could complete this project on time

once blassey replied and said that we were not on a very hot seat, I looped my manager right away

08-26 - I had an IRC conversation with Callek that made me believe that rail (another releng member) might know more about compiz with regard to the linux64 machines as he had developed some puppet code for it (comment 68)

I added rail to the bug and asked him for his help
I gave him a brief description of what we were doing so he could grasp what had happened in the last 60+ comments and what I specifically needed help with

08-27 - First time that I ran tests on two different emulators (comment 76)

I still used a separate script to launch the emulators
I deferred including this code into my scripts in order to avoid departing from gbrown's original setup
This would ensure that I would get to a working state first rather than re-write from scratch

08-29 - By this day, Callek had deployed all the artifacts that I needed across the Linux 64-bit pool of machines

the adb tools
the avd definitions
It was great that he focused on this aspect since he had more experience and I could focus on the actual scripts

Around this time, we decided to let Callek go back to his other projects as the urgency of Android x86 was clearly set by blassey and we had things under control

09-03

I decided to kill compiz *always* before starting any emulators
This improved the setup reliability
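
A minimal sketch of "always kill compiz before starting any emulators" follows. It assumes a Linux machine where the pkill command is available and where the process is simply named compiz; the function names and the launch callback are hypothetical and are not the mozharness code used in the project.

```python
import subprocess


def kill_compiz(run=subprocess.call):
    """Best-effort kill of compiz; returns True if a process was signalled.

    pkill exits with 0 when at least one process matched and with 1 when
    none did, so a non-zero result is not treated as an error here.
    """
    return run(["pkill", "-x", "compiz"]) == 0


def start_emulators(avd_names, launch, run=subprocess.call):
    """Kill compiz first, then launch one emulator per AVD name.

    `launch` is a caller-supplied function (hypothetical) that starts a
    single emulator and returns a handle for it.
    """
    kill_compiz(run)
    return [launch(name) for name in avd_names]
```

Passing the runner in as a parameter keeps the ordering easy to check in a unit test without touching real processes.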

09-03

I spotted an issue on mozharness where ADB interactions were not integrated into the logs
I discussed with aki what was needed to fix this
I didn't touch it as it was not important as of that moment

09-03

my work was very close to completion and I asked dminor and gbrown for feedback
this gave them time to get back to me
this gave me time to continue to clean up the patch and deal with minor issues
I was not blocked on their feedback since I left a lot of small non-critical tasks to deal with later on in my development
I got gbrown's feedback on the same day!

09-05 - aki gives me a big congratulations and big r+

"Awesome work, Armen! I'm pretty impressed you got parallel processes working..."

Getting these kinds of compliments is very valuable to me; it's great to see them, especially coming from aki since he has so much more experience than I do

aki nevertheless pointed out the unit tests that my code broke
I made sure that I re-gained the habit of running them before my next review

Getting ready for reporting inside of tbpl

Until this point I was mainly focusing on reliability, repeatability and proper machine set up
09-05

I enabled the jobs on Cedar and Ash
This allowed us to see the jobs running on tbpl
This allowed me to see that every artifact got deployed to the rest of the pool correctly

rather than my own staging machine, which I set up manually

The Ash branch was special because it allowed me to push mozharness changes to a repo, and re-trigger jobs that use the latest code

This is like a try repo but for release and auto-tools engineers
This setup lets me iterate fast and test on production

09-05

Now that the jobs were running on tbpl we could ask to differentiate them from the regular "U" for unidentified jobs
https://bugzilla.mozilla.org/show_bug.cgi?id=913174

09-12

Sheriffs raised concerns on the approach we chose to run 4 test jobs within 1 (even though 2 weeks ago this had been raised and the approach was accepted).
I wrote a thread to my team to see if they could think of something I had not: "Reporting of parallel test runs inside of the same jobs (from running 4 emulator jobs for Android x86)"
Reply: "Unfortunately, the way that we group the jobs is not liked by the sheriffs as they would prefer each job to report independently as other jobs do. This would also allow tbpl to group tbpl jobs into their various categories (e.g. reftests, mochitests et al)."
I asked for brainstorming concerning what other ways we could run the jobs to please the sheriffs
Unfortunately, everything that was proposed was either not easy or rather different from our normal approaches

Running at scale:

Once I had the 5 different sets of test jobs running on tbpl and had many machines to run the jobs, I could start seeing a lot of intermittent oranges
It is important to point out that going from POC to a working releng production capable script took close to a month and a half

You will now notice the large number of issues that were only discovered once we started running things at scale on tbpl
You will also notice how many adjustments gbrown and I made to deal with all sorts of permanent test oranges as well as intermittent test oranges
Running a job a few dozen times locally or even on staging does not guarantee readiness for running at scale
Any of the following minor issues are major for sheriffs when running at scale and trying to find regressions
Read this wiki to understand what the expectations for sheriffs are:

https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

Not setting minidump stackwalk path properly (comment 103)

This was used when Fennec crashes so we can see the crash symbols
If we wanted to run the jobs at scale we had to fix this
This was not a component of our infra that I understood (I kind of understand it now)

Tear up/tear down (comment 110)

What happens when we run a job twice on the same machine?
What sort of artifacts would we have to clobber?
"gbrown: should I delete ~/.android/avds and unpack clean templates before each run?"
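
The question quoted above (delete the AVD directory and unpack clean templates before each run) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the templates are assumed to be shipped as a tar archive, and both paths are passed in by the caller rather than taken from the real setup.

```python
import os
import shutil


def reset_avds(avd_dir, template_archive):
    """Throw away any leftover emulator state and restore clean templates.

    `avd_dir` is the directory holding the emulator definitions and
    `template_archive` is an archive (assumed to be a tarball) with the
    pristine copies. Running this before every job makes the second run
    on a machine start from the same state as the first one.
    """
    # Remove whatever the previous job left behind; a missing directory
    # is fine on a freshly imaged machine.
    shutil.rmtree(avd_dir, ignore_errors=True)
    os.makedirs(avd_dir)
    # Unpack the clean templates into the now-empty directory.
    shutil.unpack_archive(template_archive, avd_dir)
```

A caller might use something like reset_avds(os.path.expanduser("~/.android/avds"), "/path/to/templates.tar.gz"), where the archive path is a placeholder.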

Timeouts

buildbot has "no output" timeouts (comment 105)

We spawned four emulators which each had their own output and I didn't want to mix them on the main log until they had been completed
We had to output something every so often to prevent this timeout

buildbot has max time timeouts

This timeout would kill a job if it has not finished within 2 hours (I think)
This made us have to chunk a bunch of suites (like crashtest and reftests) into more chunks

test harnesses have their own timeouts

If a mochitest runs for too long, we can have the harness kill itself
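
The "no output" timeout described above can be worked around by keeping each emulator's output in its own file while printing a short heartbeat to the main log. The following is a rough sketch under assumptions: the function name, the heartbeat wording and the default intervals are invented for illustration, and this is not the code that landed in mozharness.

```python
import subprocess
import sys
import time


def run_quietly_in_parallel(commands, log_paths, heartbeat_secs=60.0,
                            poll_secs=1.0, out=sys.stdout):
    """Run commands in parallel, each logging to its own file.

    While any child is still running, a heartbeat line is written to
    `out` roughly every `heartbeat_secs` seconds so that a "no output"
    watchdog does not fire. Child output is copied to `out` only after
    every child has exited. Returns the list of exit codes.
    """
    logs = [open(path, "wb") for path in log_paths]
    procs = [
        subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        for cmd, log in zip(commands, logs)
    ]
    last_beat = time.monotonic()
    while True:
        running = sum(1 for p in procs if p.poll() is None)
        if running == 0:
            break
        if time.monotonic() - last_beat >= heartbeat_secs:
            out.write("heartbeat: %d process(es) still running\n" % running)
            out.flush()
            last_beat = time.monotonic()
        time.sleep(poll_secs)
    for log in logs:
        log.close()
    # Only now merge the per-process logs, one process at a time, so
    # that their lines are never interleaved in the main log.
    for path in log_paths:
        with open(path, "r", errors="replace") as handle:
            out.write(handle.read())
    return [p.returncode for p in procs]
```

The heartbeat interval has to be shorter than the watchdog's limit; the actual limit is not stated in this post, so the 60-second default is a placeholder.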

Logs

Max size (comment 118)

"'Output exceeded 52428800 bytes, remaining output has been truncated'
This happens when all of the emulators do not actually fail right away."
"Updating the maxLogSize is very unwanted since it affects the performance of the masters."
After reporting the issue I also established various strategies on how we could mitigate it
I also mentioned that I would not deal with it this time around as there were other more important items
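
One possible mitigation for the log size limit, shown purely as an illustration and not necessarily one of the strategies discussed on the bug, is to keep only the tail of each emulator log when merging them. The 52428800 figure is the limit quoted in the truncation message above; the function names and the even split of the budget between logs are assumptions.

```python
MAX_LOG_BYTES = 52428800  # limit quoted in the truncation message


def keep_tail(data, budget):
    """Return (last `budget` bytes of data, number of bytes dropped)."""
    if budget < 0:
        raise ValueError("budget must not be negative")
    dropped = max(0, len(data) - budget)
    return data[dropped:], dropped


def merge_logs(logs, total_budget=MAX_LOG_BYTES):
    """Merge per-emulator logs, giving each an equal share of the budget.

    `logs` maps a name to that log's bytes. The short "bytes dropped"
    markers are not counted against the budget, so a real implementation
    would need to leave some headroom.
    """
    per_log = total_budget // max(1, len(logs))
    merged = []
    for name in sorted(logs):
        kept, dropped = keep_tail(logs[name], per_log)
        if dropped:
            merged.append(("[%s: %d bytes dropped]\n" % (name, dropped)).encode())
        merged.append(kept)
    return b"".join(merged)
```

Keeping the tail rather than the head is a design choice: failures usually show up at the end of a log, so the end tends to be more useful to whoever is investigating.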

Lots of back/forth

gbrown and I worked comment after comment on every single issue we noticed
sometimes we missed a parameter to one of the test suites
sometimes we had to file bugs to fix the test failures
fixed relative paths VS absolute paths
android manifest modifications

Status summaries

After a bunch of comments I would re-summarize what the heck was going on
Unless it was clear where we were, we could have easily lost track of who was supposed to be fixing what

Crash dumps not working for Android x86 (09-16)

https://bugzilla.mozilla.org/show_bug.cgi?id=916923
we asked ted to look into why our minidump crash adjustments did not yield success
froydnj had to give us a hand

TryChooser (09-12)

"put complexity behind simplicity"
instead of having to have a special TrySyntax for Android x86 we decided to make it follow the same syntax as other platforms
instead of saying "androidx86-set-1" a developer could say "reftest-plain-1"
this was a new feature for TryChooser
the original writer of TryChooser was not in the releng team anymore (this was challenging for me)
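
The idea of hiding the Android x86 sets behind ordinary suite names can be sketched as a simple lookup. Everything specific here is hypothetical: the post does not say which suites belong to which set, so the table below is invented, and the function is not the real TryChooser code.

```python
# Hypothetical grouping; the real membership of each set is not given
# in this post.
SETS = {
    "androidx86-set-1": ["reftest-plain-1", "crashtest-1"],
    "androidx86-set-2": ["mochitest-1", "xpcshell"],
}


def jobs_for_suites(requested, sets=SETS):
    """Map ordinary suite names to the set jobs that contain them.

    A developer asks for "reftest-plain-1" as on any other platform and
    gets back the name of the combined job that actually runs it.
    Unknown suite names simply match nothing.
    """
    wanted = set(requested)
    return [
        set_name
        for set_name in sorted(sets)
        if wanted.intersection(sets[set_name])
    ]
```

With this invented table, asking for "reftest-plain-1" selects "androidx86-set-1", so developers never have to learn the set names.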

Seeing the light (09-20)

Once I saw the bugs line up, I sent an email to gbrown and krudnitski (mobile product manager) to let them know where we were and asked them if the importance of the project had changed since the last time we spoke
I also asked them if I could slow down so I could focus on some Summit prep-work that I needed to do
I gave them two possible schedules depending on if they still needed me to continue with full steam
I heard back and they told me to slow down if needed

I passed the information to my manager

Running across the board

We enabled sets 1 & 2 since they were running green
They had to be run hidden until we met the Job Visibility Policy
After a few weeks we were asked to only run on Cedar/Ash until we fixed the emulator bug we found in bug 917562

Weird spot

We tried to complete the project before the end of the quarter
Unfortunately, when running at scale we discovered the bug inside of the emulator and all of our progress came to a slow crawl
At this point, I decided to pass the bugs (including the tracking bug) to gbrown as most remaining bugs were on the testing and emulator side

I discussed this with him on IRC first

I wrote a status update on the main bugs and a whiteboard status pointing to the right comment #

Notify stakeholders (09-30)

As soon as I saw our project slowing down due to external sources, I had to notify the stakeholders so they didn’t hear it late
I gave a very high-level picture and pointers to the specific blocker (e.g. the emulator bug)
I notified both of my managers as well as blassey and krudnitski
I made sure that there were no releng bugs blocking gbrown

As of today (10-17)

We are still working out the emulator bugs and other minor bugs

This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.


Thursday, October 17, 2013

How to drive pseudo-effectively a cross-team project for a new platform you know little about
