Over the last few years as a Mozilla Release Engineer, I have completed hundreds of bugs and dozens of large and complex cross-team projects. Over time, I have learned from my mistakes and developed a mental list of steps that I follow in order to complete cross-team projects as effectively as possible and on good terms with the developers involved. I have never attempted to write this down, so I am excited to see what we can glean from this blog post. This may save you from having to build these habits through trial and error over time. Feel free to point out places where I have not figured everything out yet, or where I am wrong.
This post is probably geared towards Release Engineering and people who work closely with us; however, it may also help you understand what a release engineer has to take into consideration before your jobs can run visibly on tbpl.mozilla.org.
I would like to use a recent project as an example, namely the Android x86 emulator test infrastructure project. It has taken more than two months to get 80% of this project completed. We are currently blocked on issues external to Mozilla with regards to the actual emulator being unstable.
You should expect the following sections in this post:
- Tips
- Checklist
- Context of the Android x86 project
- Sequence of events of the Android x86 project
Tips
- Document as much as possible on the bug
- this makes it easier for external people to follow along
- Communicate often with the people you're working with
- Make it clear what blocks who
- Make it clear what you're working on
- Set expectations
- Report back when you're not meeting expectations
- Ask for help!
- If you can pick one of many tasks, ask the dev which one would benefit him/her the most
- Mention when you can’t work on the project for a period of time (e.g. PTO, buildduty, release)
- Communicate major slow downs or change of plans to stakeholders on both sides of the project
- Your own managers
- The managers of the other developer
- File bugs in order to clarify the scope, the dependency and the ownership
- Make sure that the dependencies are logical (rather than just adding all bugs to the tracking bugs)
- Meet the members of the other parties through a video call as soon as possible if you have not met them before, especially if the project is rather complex
- It makes it so much easier to understand/know each other and work together
- You would have the opportunity to read the non-verbal communication
- You become a human being in the eyes of each other, rather than IRC nicknames
- Not necessary if you have worked together often and/or the project is very simple
- If there’s confusion and/or conflict schedule another video meeting
- Restate in your own words what you are taking away with you from the bugs and email conversations
- This allows the other developer to debug your understanding
- Keep your word
- It builds trust
- Do not try to force artificial deadlines
- You ruin the trust you have gained
- Consult the team when in doubt of which approach to use for a big problem
- Do not ask for help if you have not even tried for yourself
- This builds a reputation that you won’t try to take other people’s time for granted
- If you get stuck, ask for help
- Restate what you’re trying to solve and why it is important
Checklist
NOTE: These questions do not apply to every project we take on; however, many of them apply when setting up a new platform.
- have you read all bugs with regard to the project?
- have you written down the questions that you need to ask the other team?
- on which machines is this going to run?
- do we have enough capacity?
- who else should know about this project? have you got them in the loop?
- IT? (more machines, method of deployment of artifacts)
- A-team?
- Sheriffs?
- Your manager(s)?
- Your own team?
- has the developer verified that their scripts run as expected on one of our machines?
- their local machine does not count
- loan them a machine if needed
- what happens when we run a job twice on the same machine?
- what artifacts do we need to clobber?
- test files and application files always get clobbered
- has the developer *recently* run *all* types of jobs required for the project?
- not just a handful; must be all and recently
- can the developer set up the job to run multiple types in a row and always get the expected results?
- this is a new question that I have not asked in the past, however, it might be helpful to spot instability issues early
- for instance, we might have been able to catch the QEMU issue on the emulators earlier. Instead, we found it after two months, when we started running the emulator jobs at scale
- which artifacts will we need to deploy?
- e.g. the android sdk
- e.g. the android emulator template definitions
- how are you going to distribute the artifacts?
- through puppet?
- from tooltool?
- from in-tree?
- will we need to build it on the build machines and upload it to ftp?
- what privacy do those artifacts need?
- public/behind LDAP/VPN only
- how often will the artifacts need to be recreated or updated?
- do we have documentation about it?
- what are the expectations? deadlines?
- where is the source code?
- put it on the bug or a link to the public repo
- how long does each test suite take to run?
- this is important to know in order to help capacity planning as well as planning how much we need to chunk the suites
- have you *manually* reproduced the steps specified by the developer?
- this will generally be your highest priority and will initially be on your critical path
- unless you have figured out *all* of these with him, you may regret not doing so
- how does the machine need to be set up? when was the last time that this set up was done?
- request *recently* *verified* step-by-step instructions on a *clean* machine
- have you verified the setup steps as instructed by the developer?
- have you made the action items clear to each other?
- have you found a new blocker? have you discussed it with the developer and does he understand why it is a blocker?
- are the blockers clearly filed or specified on the bug?
- is there anything particularly different to this project compared to the way we run other projects?
- e.g. running four emulators with four different test suites was clearly new
- if so, notify those people that might be affected
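The checklist question about how long each test suite takes feeds directly into chunking and capacity planning. Here is a minimal sketch of that arithmetic in Python. The function names, the suite durations, the per-chunk limit and the push rate are all hypothetical values chosen for illustration, not numbers from the Android x86 project.

```python
import math


def plan_chunks(suite_minutes, max_chunk_minutes):
    """Return how many chunks a suite needs so that no chunk exceeds the limit."""
    if suite_minutes <= 0 or max_chunk_minutes <= 0:
        raise ValueError("durations must be positive")
    return math.ceil(suite_minutes / max_chunk_minutes)


def machines_needed(job_minutes, pushes_per_hour):
    """Estimate the machines kept busy on average.

    Total machine-minutes per push, times pushes per hour, divided by the
    60 minutes each machine can work per hour, rounded up.
    """
    return math.ceil(sum(job_minutes) * pushes_per_hour / 60)


if __name__ == "__main__":
    # Hypothetical suite run times in minutes (assumed, not measured).
    suites = {"mochitest": 240, "reftest": 150, "xpcshell": 45}
    for name, minutes in suites.items():
        print(name, plan_chunks(minutes, max_chunk_minutes=60))
    print("machines:", machines_needed(suites.values(), pushes_per_hour=2))
```

An estimate like this is only a starting point. It ignores setup time per chunk and peak load, which is one more reason to ask the developer for recent, measured run times.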
I hope this is useful when tackling new Release Engineering projects.
Feel free to read the following case study or skip it completely as it is very long!
regards,
Armen
##############################
NOTE: The following two sections can feel *very* long/boring, as the bug ended up having more than 200 comments and I had to file many many bugs.
Context of the Android x86 project
First of all, I would like to set the context of this project: I had just come back from three weeks of holidays, I was not mentally prepared to take on an unexpected and large project, I had to catch up with my intern (his internship was ending shortly), I was trying to cover for a co-worker who was taking four weeks of absence, I had been looking forward to working on a different, more exciting project instead, and the number of unforeseen interruptions following my return was very high. This is important to keep in mind, as you will notice in this blog post that I made some drastic requests of my managers in order to meet expectations.
Sequence of events of the Android x86 project
NOTE: As I did an analysis of my work, I could see where I made mistakes or missed the opportunity to ask the right questions. Unfortunately, these oversights delayed the whole project further down the road.
Quote from gbrown after reading this post: “I thought it was really helpful meeting on Vidyo when we did, and I found your frequent in-bug status updates very useful and re-assuring.”
gbrown (a-team developer) filed on 2013-07-17 the bug:
- some preliminary scripts had been developed by that time
- they were not yet attached to the bug
- this bug was initially about preliminary discussions on how to integrate with buildbot and the releng infrastructure
- a fair amount of bugs had been filed on the testing and mobile development area (this was the first releng bug)
- it seems that the project had been in development for several months (at least since April)
- dminor (A-team member) had created some scripts and had asked for feedback from aki (member of releng - my team)
- on the same day that the bug got filed, Callek (member of releng - my team) asked me if I would be able to coordinate this.
- Callek has been working with releng for more than a year and I have a strong trust and respect towards him. If he was asking me to pick it up, it was probably because he was very entangled with other projects and could make use of my help.
The next day, I immediately requested a video meeting with gbrown
- I knew gbrown from the mobile testing meetings
- I used to be the releng liaison for mobile in 2012 and that is where I met him
- I had not directly worked with him in any project; however, the times that he spoke at the meetings had left a good impression
- I made sure that my calendar was up-to-date before asking him to pick a time
- I assigned the bug to me to make it clear that there was an owner on the releng side
Before the meeting (and after it):
- I browsed all bugs involved
- I read all of them
- I cleaned up the dependencies
- when the ownership of a bug was not clear, I asked before making changes
- I asked questions concerning bugs where there were concepts unclear to me
- I tried paraphrasing what I got out of some bugs to make sure that I understood them properly
- I clarified some terms and concepts about our infra
- e.g. mozpool is only used for panda boards (not for tegras or anything else)
- e.g. foopies will not be necessary to run this project
- foopies are *only* used for tegra boards
Initial meeting:
- we decided to meet a few days after our initial contact on the 22nd
- this allowed me to catch up on reading the bugs and prepare any questions that I needed to ask
- I made it clear from the beginning that I might ask silly questions, since this is an unfamiliar area for me
- I needed to know the following:
- NOTE: I might not have asked all of these questions on that day; however, I should have asked them
- has this been run on one of our releng machines?
- gbrown had already received one and had done so
- where is the source code?
- gbrown had to clean the scripts and get them to me from local copies
- which test suites had been run?
- when was the last time that you ran the tests?
- which artifacts will I need?
- e.g. the android sdk
- e.g. the android emulator template definitions
- how do you set a releng machine up for this?
- gbrown had filed a bug with detailed documentation; his attention to small details paid off down the road
- how do you run the code?
- what are the action items we each have to take?
- what are we waiting on each other for?
- what is the priority of this project?
- how long does each test suite take to run? how does it compare to pandas/tegras?
- this is important in order to anticipate the load
- on which machines are we going to run this?
- are you aware of whether or not we have requested the purchase of the machines?
- the purchase was under way; I did not need to be involved with this
- what are the expectations? What are the deadlines?
Post-meeting:
- I wrote a summary of the meeting (closing the loop so others can follow)
- I wrote down the action items
- "gbrown has some mozharness scripts and configs that he's going to get me"
- "I can try to integrate it to my staging master with an iX box"
- I wrote down the constraints on the setup (gathered from meeting and other bugs)
- "We can't do this testing on EC2 due to some issues with OpenGL which crashes."
- I wrote this down since I knew that people joining the bug might ask this question; it is also likely that they had not read all the other bugs
- "Run times are a bit slower (20-30% slower) than on Pandas and Tegras."
- this is important since we have to keep a rough understanding of how many machines we will need for this project
- I made it clear that I'm the releng owner on the bug by setting the assignee
- this makes it easier for stakeholders to know that it is a project that I'm clearly committed to
- it also confirms to gbrown that I'm serious about helping him
- I made gbrown explicitly accountable by setting a needinfo request to declare that I'm waiting for him on an action item
- Set flag: needinfo?(:gbrown) - "Adding needinfo to keep track that I need the scripts."
- This also helps gbrown to not lose the action item since it will show up on his bugzilla dashboard (rather than writing down the action item as a comment)
- The dashboard is not necessarily something that every developer uses
- Comments can be missed; flags remain until they are cleared
- The needinfo request can be seen by others even if they have not read all comments
Started trying to reproduce manually the steps as explained by gbrown
- 07-24 (7 days after bug creation)
- I spoke with my manager and asked to drop something from my plate so I could spend more time on the Android x86 project
- He agreed
- My performance on Android x86 started picking up
- 07-31 (9 days after meeting with gbrown) missing adb (comment 10)
- I needed to determine either how to install it on my machine (deploy a new artifact) or if it is already installed somewhere on the linux64 machines
- gbrown clarified it right away
- 08-02 (2 days after last comment) I made an incorrect proposal about how to distribute the avd artifacts due to my lack of understanding (comment 12)
- I double-checked with gbrown before going down that road
- I requested for sample avd files to be uploaded somewhere for me to use
- gbrown provided them on the same day
- I requested a new meeting with gbrown to ask smarter questions
- I informed him of PTO time that I'm going to take
- I didn’t want him to be frustrated if he was unable to find me
- I wrote a note on the bug for myself to remember to enable the project on Ash/Cedar so we could see how it would run within the production automation rather than on staging
- This is important since we don't have tbpl support for staging :(
- 08-02 - comment 13
- I notified him of setup differences compared to the set of instructions he had written when he set up his releng machine a while back
- Android had released a newer version of the SDK by that time
- 08-06 - we met again
- I don't remember what we spoke about; however, I noticed that I sped up with respect to making comments on the bug
- 08-06 - we debugged back and forth concerning the setup steps
- 08-07 - gbrown provided new avd files; the setup steps are re-adjusted
- 08-08 - this was the first time that I noticed that compiz was getting in the way of running this reliably (comment 30):
- I was not fully aware at that point that it was a blocker
- gbrown believes that staggering the emulators was what prevented this from happening
- later down the road we discovered that we have to kill compiz before starting any emulators
- we don't know how to debug why this was happening
- 08-08 - I asked gbrown for feedback on my patch
- that was the first of many patches to come; I left comments in the code so that he could help me clarify them directly in the code, to ensure that I was on the right track
- 08-09 - I asked aki for feedback for the first time; I was doing things significantly differently than anything we had seen before in our code. I had to make sure that it wouldn't be negatively reviewed too late into the process
- aki did not complain about my approach
- 08-09 - I created a repo where my changes would be pushed to, and I notified dminor and gbrown of this
- this ensured that they would be able to test my code and contribute to my development if they wanted to, without having to re-attach my patches every day
- 08-09 - I asked gbrown which test suites we were going to run
- I should have asked this way earlier in our initial meetings
- It did not delay the project
- 08-16 - compiz was preventing us from running reliably on staging (comment 54)
- I didn't know yet that compiz was at fault
- I mentioned to gbrown that it was a blocker for me
- We couldn't have unreliable jobs on tbpl
- I made a reference to comment 30 where we first spoke about it
- I attached a log to facilitate gbrown's expert eyes
- I also mentioned that I will continue working on other things that were not on the critical path
- getting the emulators started and killed reliably was my main priority as it is common to all test suites
- spotting unreliable issues early is my main duty
- not raising these issues early enough would have derailed the whole project
- it is only now, when I try to run many suites repeatedly through buildbot, that I see that the setup is not yet reliable
- 08-14
- my intern left and his project finally stuck to production
- I had more time for the Android x86 project
- 08-16 - every new patch had a comment logging what gets fixed
- this became a changelog for all parties
- 08-20 - I notified gbrown that I wouldn't be able to do any more work for a week as I was on buildduty
- I attached my latest code in case he had time to make changes on his own (or any other person)
- this gave him and his manager a chance to discuss if they would need to request someone else to sub for me
- 08-25
- I read an IRC comment that made me realize that Android x86 was more important than I had initially thought
- By then, I had mentioned to coop (my manager) that I needed to focus more on this project
- We discussed that adding Callek to the project and starting a "project mode" would help us
- "project mode" allowed us to disconnect from IRC and even bugmail for long periods of time to work on just one project
- 08-26
- I gave a heads-up to sheriffs about running 4 jobs in 1
- This was important as they sheriff tbpl, and running jobs this way was different than any other platform we had seen before (except mochitest-other)
- Thread: "Reporting 4 jobs in 1"
- Reply: "I don't have a strong preference at this point in time - however we'll just need to bear in mind that TBPL doesn't support splitting one job up into several, so we'll need to handle this similar to mochitest-other (ie: ideally TinderboxPrinting pass/fail counts for each sub-suite & ensuring it's clear what's failed where). "
- 08-26
- Callek and I met to discuss the project and split the work
- We created an etherpad with clear action items for each other
- We discussed back and forth all the gotchas and the different approaches for our deployment plans
- Every day afterwards, we scratched our action items on the etherpad
- Callek helped me for a couple of weeks with deploying the avd artifacts through Puppet
- 08-26
- I emailed blassey (mobile manager) to clarify what the expectations were since it had been over a month since we last spoke
- I stated that I knew that this project was important in order to prevent regressions to Android x86, however, I knew that it was _not_ a release blocker
- We could still have released Firefox for Android x86 without the automated tests
- I explicitly listed compiz as my main worry concerning whether or not we could complete this project on time
- once blassey replied and said that we were not on a very hot seat, I looped my manager right away
- 08-26 - I had an IRC conversation with Callek that made me believe that rail (another releng member) might know more about compiz with respect to the linux64 machines as he had developed some puppet code for it (comment 68)
- I added rail to the bug and asked him for his help
- I gave him a brief description of what we were doing so he could grasp what had happened in the last 60+ comments and what I specifically needed help with
- 08-27 - First time that I ran tests on two different emulators (comment 76)
- I still used a separate script to launch the emulators
- I deferred including this code into my scripts in order to avoid departing from gbrown's original setup
- This would ensure that I would get to a working state first rather than re-write from scratch
- 08-29 - By this day, Callek had deployed all the artifacts that I needed across the Linux 64-bit pool of machines
- the adb tools
- the avd definitions
- It was great that he focused on this aspect since he had more experience and I could focus on the actual scripts
- Around this time, we decided to let Callek go back to his other projects as the urgency of Android x86 was clearly set by blassey and we had things under control
- 09-03
- I decided to kill compiz *always* before starting any emulators
- This improved the setup reliability
- 09-03
- I spotted an issue on mozharness where ADB interactions were not integrated into the logs
- I discussed with aki what is needed to fix this
- I didn't touch it as it was not important as of that moment
- 09-03
- my work was very close to completion and I asked dminor and gbrown for feedback
- this gave them time to get back to me
- this gave me time to continue to clean up the patch and deal with minor issues
- I was not blocked on their feedback since I left a lot of small non-critical tasks to deal with later on in my development
- I got gbrown's feedback on the same day!
- 09-05 - aki gave me big congratulations and a big r+
- "Awesome work, Armen! I'm pretty impressed you got parallel processes working..."
- Getting these kinds of compliments is very valuable to me; it's great to see them, especially coming from aki since he has so much more experience than I do
- aki nevertheless pointed out the unit tests that my code broke
- I made sure that I re-gained the habit of running them before my next review
Getting ready for reporting inside of tbpl
- Until this point I was mainly focusing on reliability, repeatability and proper machine set up
- 09-05
- I enabled the jobs on Cedar and Ash
- This allowed us to see the jobs running on tbpl
- This allowed me to see that every artifact got deployed to the rest of the pool correctly
- rather than my own staging machine, which I set up manually
- The Ash branch was special because it allowed me to push mozharness changes to a repo, and re-trigger jobs that use the latest code
- This is like a try repo but for release and auto-tools engineers
- This setup lets me iterate fast and test on production
- 09-05
- Now that the jobs were running on tbpl we could ask to differentiate them from the regular "U" for unidentified jobs
- 09-12
- Sheriffs raised concerns on the approach we chose to run 4 test jobs within 1 (even though 2 weeks ago this had been raised and the approach was accepted).
- I wrote a thread to my team to see if they can think of something I had not: "Reporting of parallel test runs inside of the same jobs (from running 4 emulator jobs for Android x86)"
- Reply: "Unfortunately, the way that we group the jobs is not liked by the sheriffs as they would prefer each job to report independently as other jobs do. This would also allow tbpl to group tbpl jobs into their various categories (e.g. reftests, mochitests et al)."
- I asked for brainstorming concerning what other ways we could run the jobs to please the sheriffs
- Unfortunately, everything that was proposed was either hard to implement or very different from our normal approaches
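The sheriffs' earlier reply asked for pass/fail counts per sub-suite and for it to be clear what failed where, similar to mochitest-other. The sketch below shows one way a combined job could print such a summary. The `summarize` function and the sample counts are hypothetical, and the exact line format that TBPL parses is an assumption here; this is not the code that landed in mozharness.

```python
def summarize(results):
    """Build one summary line per sub-suite and list the suites that failed.

    `results` maps a sub-suite name to a (passed, failed) tuple.
    The "TinderboxPrint: name<br/>pass/fail" layout is an assumed format.
    """
    lines = []
    failed_suites = []
    for suite, (passed, failed) in sorted(results.items()):
        lines.append("TinderboxPrint: %s<br/>%d/%d" % (suite, passed, failed))
        if failed:
            failed_suites.append(suite)
    return lines, failed_suites


if __name__ == "__main__":
    # Hypothetical counts for two emulators running inside the same job.
    lines, failed_suites = summarize({"reftest-1": (120, 0), "crashtest-1": (40, 2)})
    for line in lines:
        print(line)
    if failed_suites:
        print("Sub-suites with failures: %s" % ", ".join(failed_suites))
```

Printing the failing sub-suites by name keeps a single orange job readable, although it does not give the sheriffs the independent per-job reporting they preferred.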
Running at scale:
- Once I had the 5 different sets of test jobs running on tbpl and had many machines to run the jobs, I could start seeing a lot of intermittent oranges
- It is important to point out that going from POC to a working releng production capable script took close to a month and a half
- You will now notice the large amount of issues that were only discovered once we started running things at scale on tbpl
- You will also notice how many adjustments gbrown and I made to deal with all sorts of permanent test oranges as well as intermittent test oranges
- Running a job a few dozen times locally or even on staging does not guarantee readiness for running at scale
- Any of the following minor issues are major for sheriffs when running at scale and trying to find regressions
- Read this wiki to understand what the expectations for sheriffs are:
- Not setting minidump stackwalk path properly (comment 103)
- This is used when Fennec crashes so we can see the crash symbols
- If we wanted to run the jobs at scale we had to fix this
- This was not a component of our infra that I understood (I kind of understand it now)
- Tear up/tear down (comment 110)
- What happens when we run a job twice on the same machine?
- What sort of artifacts would we have to clobber?
- "gbrown: should I delete ~/.android/avds and unpack clean templates before each run?"
- Timeouts
- buildbot has "no output" timeouts (comment 105)
- We spawned four emulators which each had their own output and I didn't want to mix them on the main log until they had been completed
- We had to output something every so often to prevent this timeout
- buildbot has max time timeouts
- This timeout would kill a job if it has not finished within 2 hours (I think)
- This made us have to chunk a bunch of suites (like crashtest and reftests) into more chunks
- test harnesses have their own timeouts
- If a mochitest runs for too long, we can have the harness kill itself
- Logs
- Max size (comment 118)
- "'Output exceeded 52428800 bytes, remaining output has been truncated'
- This happens when all of the emulators do not actually fail right away."
- "Updating the maxLogSize is very unwanted since it affects the performance of the masters."
- After reporting the issue I also established various strategies on how we could mitigate it
- I also mentioned that I would not deal with it this time around as there were other more important items
- Lots of back/forth
- gbrown and I worked comment after comment on every single issue we noticed
- sometimes we missed a parameter to one of the test suites
- sometimes we had to file bugs to fix the test failures
- fixed relative paths VS absolute paths
- android manifest modifications
- Status summaries
- After a bunch of comments I would re-summarize what the heck is going on
- Unless it was clear where we were, we could easily have lost track of who was supposed to be fixing what
- Crash dumps not working for Android x86 (09-16)
- we asked ted to look into why our minidump crash adjustments did not yield success
- froydnj had to give us a hand
- TryChooser (09-12)
- "put complexity behind simplicity"
- instead of having to have a special TrySyntax for Android x86 we decided to make it follow the same syntax as other platforms
- instead of saying "androidx86-set-1" a developer could say "reftest-plain-1"
- this was a new feature for TryChooser
- the original writer of TryChooser was not in the releng team anymore (this was challenging for me)
- Seeing the light (09-20)
- Once I saw the bugs line up, I sent an email to gbrown and krudnitski (mobile product manager) to let them know where we were and asked them if the importance of the project had changed since the last time we spoke
- I also asked them whether I could slow down so I could focus on some Summit prep-work that I needed to do
- I gave them two possible schedules depending on if they still needed me to continue with full steam
- I heard back and they told me to slow down if needed
- I passed the information to my manager
- Running across the board
- We enabled sets 1 & 2 since they were running green
- They had to be run hidden until we met the Job Visibility Policy
- After a few weeks we were asked to only run on Cedar/Ash until we fixed the emulator bug we found in bug 917562
- Weird spot
- We tried to complete the project before the end of the quarter
- Unfortunately, when running at scale we discovered the bug inside of the emulator and all of our progress came to a slow crawl
- At this point, I decided to pass the bugs (including the tracking bug) to gbrown as most remaining bugs were on the testing and emulator side
- I discussed this with him on IRC first
- I wrote a status update on the main bugs and a whiteboard status pointing to the right comment #
- Notify stakeholders (09-30)
- As soon as I saw our project slowing down due to external sources, I had to notify the stakeholders so they didn’t hear it late
- I gave a very high-level picture and pointers to the specific blocker (e.g. the emulator bug)
- I notified both of my managers as well as blassey and krudnitski
- I made sure that there were no releng bugs blocking gbrown
- As of today (10-17)
- We are still working out the emulator bugs and other minor bugs
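Looking back at the buildbot "no output" timeout described under Timeouts: we ran four emulators whose output I did not want to interleave in the main log, yet the job had to print something every so often. Here is a minimal sketch of that idea in Python. The `run_parallel` function, its parameters and the heartbeat text are hypothetical and written for this post; this is not the mozharness code from the project.

```python
import subprocess
import sys
import tempfile
import time


def run_parallel(commands, heartbeat_seconds=60, poll_seconds=1, out=sys.stdout):
    """Run commands in parallel, buffering each one's output in its own file.

    While any process is still running, write a short heartbeat line so that
    a "no output" timeout in the calling automation is not triggered.
    Returns a dict mapping each command name to its exit code.
    """
    jobs = []
    for name, argv in commands.items():
        log = tempfile.TemporaryFile(mode="w+")
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        jobs.append((name, proc, log))

    last_beat = time.monotonic()
    while any(proc.poll() is None for _, proc, _ in jobs):
        time.sleep(poll_seconds)
        if time.monotonic() - last_beat >= heartbeat_seconds:
            running = [name for name, proc, _ in jobs if proc.poll() is None]
            out.write("still running: %s\n" % ", ".join(running))
            out.flush()
            last_beat = time.monotonic()

    # Only now copy each buffered log into the main log, one job at a time.
    exit_codes = {}
    for name, proc, log in jobs:
        log.seek(0)
        out.write("##### output of %s #####\n" % name)
        out.write(log.read())
        log.close()
        exit_codes[name] = proc.returncode
    out.flush()
    return exit_codes
```

The heartbeat only addresses the "no output" timeout. The maximum run time and the maximum log size are separate limits, which is why we also had to chunk the suites further and think about how much output each emulator produced.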
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.