After reading kmoir's post, I decided to give a talk at the "2nd International Workshop on Release Engineering 2014", hosted at Google's Mountain View headquarters.
Here's the talk proposal that I submitted, in case you're interested in attending (I have not yet been selected).
This event will happen on April 11th, 2014 (6 weeks from now).
Mozilla’s hybrid continuous integration
Opportunities and challenges faced by Mozilla’s hybrid cloud/in-house continuous integration as it supports fast-paced development
Abstract - Issues and considerations when designing and using a hybrid cloud/in-house continuous integration (CI) infrastructure to support a very intense and fast-paced development process.
Keywords—release engineering, cloud, Amazon, EC2, continuous integration, infrastructure, high scale, high availability
I. SUMMARY
Over the last few years, Mozilla has seen large growth in its number of contributors. This growth comes with an increased infrastructure load. Many times, it has led to the development load outpacing the current infrastructure capacity. Unfortunately, our procurement process can be too slow and expensive to keep up. In the last two years, we have been moving an increasing number of builds and tests of Firefox for desktop, Firefox for mobile and Firefox OS to the cloud. Doing so has allowed us to increase our capacity on demand to handle the sudden spikes in load produced by development. This shift in the nature of the infrastructure comes with great opportunities as well as challenges.
II. BACKGROUND
After the release of Firefox 4, Mozilla started accelerating its Firefox for desktop development by introducing a train development model [1]. This model allowed for smaller and more frequent releases and, at the same time, sped up the development process through shorter code stabilization periods. This new development process placed a greater demand on Release Engineering’s continuous integration.
After Firefox for desktop started running this train model, Firefox for mobile adopted the same development model by merging back into the Firefox for desktop code repository. This improved the integration period and enabled developers to identify regressions faster. During this period, a complete revamp of Firefox for Android was undertaken. All of these changes placed a greater demand on Release Engineering’s continuous integration.
In 2012, Firefox OS moved from being an experiment to a product that was fully supported by Release Engineering’s continuous integration. Adding this third product, as well as the heavy investment in its success, increased the demand on our infrastructure.
In the same year, Release Engineering started running builds and tests for our continuous integration in Amazon Elastic Compute Cloud (Amazon EC2). Since then, we have been running the majority of our load in EC2 as well as the in-house cloud and physical infrastructure.
Keeping the capacity growth rate of our continuous integration systems the same as the load growth rate is crucial to our organization’s success. Mozilla has grown from supporting 15,000 pushes/year in 2009 to 80,000 pushes/year, and this number is rising. Over the years, we have also increased the number of test suites and operating systems that we support, thereby making each code push more expensive. Each year, Mozilla’s CI runs more than 1M build jobs and more than 10M test jobs, and consumes more than 2,000 CPU years. The ability to scale quickly and reliably is essential.
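To put these figures in perspective, the short Python calculation below derives the growth factor and the average daily push rate from the two yearly numbers quoted above. The function and constant names are illustrative, and the daily figure assumes pushes are spread evenly over a 365-day year, which real development load is not.

```python
# Yearly push counts quoted in the paragraph above.
PUSHES_2009 = 15_000
PUSHES_CURRENT = 80_000


def growth_factor(before: int, after: int) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def average_per_day(per_year: int, days_in_year: int = 365) -> float:
    """Average daily rate, assuming an even spread over the year (a simplification)."""
    return per_year / days_in_year


if __name__ == "__main__":
    print(f"Growth factor: {growth_factor(PUSHES_2009, PUSHES_CURRENT):.1f}x")
    print(f"Average pushes per day: {average_per_day(PUSHES_CURRENT):.0f}")
```

Under these assumptions the load has grown by roughly a factor of five, and the yearly total works out to a couple of hundred pushes on an average day, before accounting for spikes.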
III. BENEFITS
Running build and test jobs on EC2 has helped speed up our scaling process and reduce our procurement needs. This creates more time for improving the overall infrastructure and reduces the time spent coordinating procurement procedures. It also enables Mozilla’s development to grow without build and test results slowing down for lack of machines.
IV. CHALLENGES
Unfortunately, adding the cloud to the CI does not come for free. Running Release Engineering’s infrastructure across multiple EC2 regions and multiple in-house data centers comes with its own unique challenges.
For example, one challenge is re-designing the main infrastructure services to accommodate cloud services outside of our own data centers.
Another challenge is the financial burden that comes from running so many jobs in the cloud. Various optimizations will be discussed.
A further challenge is that a larger infrastructure can undermine the benefits of caching, which can become quite noticeable when running several hundred pushes a day.
V. FUTURE DIRECTIONS
While the current setup works, we are going to optimize it to improve reliability, reduce cost and improve end-to-end times.
On-going development is focused on:
- Improving EC2 bidding algorithms, in order to reduce costs by dynamically using less expensive machines
- Reducing infrastructure issues due to dependencies on inter-datacenter infrastructure
- Optimizing the usage of resources
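As a rough illustration of the first item in the list above, the sketch below picks the offer with the lowest current price per unit of capacity among those under a price ceiling. It is not Mozilla's actual bidding algorithm: the `Offer` type, the instance and region names, the prices and the capacity figures are all hypothetical, and a real implementation would fetch live spot prices from EC2 rather than use hard-coded values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """A hypothetical spot-market offer for one instance type in one region."""
    instance_type: str
    region: str
    price_per_hour: float  # current price in USD (made-up values below)
    capacity: float        # relative job throughput of this instance type


def cheapest_offer(offers: Iterable[Offer], max_price: float) -> Optional[Offer]:
    """Return the offer with the lowest price per unit of capacity.

    Offers priced above `max_price` are skipped; returns None if none qualify.
    """
    affordable = [
        o for o in offers
        if o.price_per_hour <= max_price and o.capacity > 0
    ]
    if not affordable:
        return None
    return min(affordable, key=lambda o: o.price_per_hour / o.capacity)


if __name__ == "__main__":
    # Illustrative numbers only, not real EC2 prices.
    offers = [
        Offer("type-small", "region-a", 0.04, 1.0),
        Offer("type-large", "region-a", 0.12, 4.0),
        Offer("type-large", "region-b", 0.30, 4.0),
    ]
    print(cheapest_offer(offers, max_price=0.20))
```

Ranking by price per unit of capacity rather than by raw price is one possible design choice: a larger machine can be the cheaper option if it completes proportionally more work per hour.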
Acknowledgment
Even though I have the privilege of talking about this topic, the engineers behind all these improvements are:
- Rail Aliiev
- Chris AtLee
- Taras Glek
- Ben Hearsum
- John Hopkins
- Mike Hommey
My deepest gratitude to Chris Cooper, Kim Moir and Hal Wine for encouraging me to submit this talk as well as for reviewing it.
References
- [1] Bhavana Bajaj: “Release Process” [https://wiki.mozilla.org/RapidRelease]
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.