The on-call process is nothing new at tech companies. When your service goes down unexpectedly, someone needs to be around to fix issues and save your company from frustrated customers and lost revenue. Similar to product development, a mature, efficient on-call process takes effort to develop and iterate on. At Zip, we’ve gone through a couple of rounds of iteration - adapting as our company has grown from a baby startup to a teenage unicorn.
“This is fine”
In the early days, when we only had a handful of customers, our incident response process was simple. Immediately after each bug got reported, the entire engineering team swarmed on it. The founding engineers, including our CTO Lu, were effectively permanently “on call” – they knew how to fix any issue quickly. While this worked well when we were a team of six, it quickly became unscalable once we doubled, tripled, and quadrupled in size.
To give engineers the precious focus time that they needed, we introduced a lightweight on-call process and rotation. Easy, right? We had all learned a thing or two about on-call from our past experiences at big companies. Not quite. It soon became apparent that on-call engineers didn’t understand the boundaries of their responsibilities and were not well-equipped with the knowledge they needed. Issues and incidents were repeatedly triaged to the same group of early engineers. A couple of months into the new process, we felt like we were back to square one.
Additionally, as the product became more complex, we instrumented a comprehensive alert system for all the endpoints and offline jobs based on simple heuristics. At a glance, we had strong guardrails with hundreds of alerts. In reality, however, most alerts were not well-tuned, and engineers didn't have a clear understanding of what to do when they fired. Being on-call became a sleepless job of mechanically acknowledging a large volume of noisy, unactionable alerts. Engineers carried around the guilt of not having the bandwidth to tune and follow up on all the flaky alerts.
Time for a change
After observing and intimately feeling the pain for a while, we realized that just setting up an on-call rotation wasn't enough. We pinpointed a few core issues:
- Lack of observability on the health of the alert system and on-call workload
- Lack of education for on-call to understand their responsibilities and develop their investigation skills
- Lack of protected time for on-call to do their on-call work
- Lack of accountability when on-call isn’t performing their duty
In the spirit of Zip’s core value to “Just own it”, a couple of us got together and proposed a few changes to revamp our on-call process.
Define ideal state and metrics
The ideal state of a healthy on-call process, as we envisioned it, was fairly straightforward: fewer pages, faster incident resolution. To understand how far we were from the ideal state, we needed to have a concrete idea of the current on-call workload and whether on-call engineers were fulfilling their duties. We developed a set of success metrics aimed at measuring operational excellence (triaging, mitigation and resolution SLAs) and workload (number of pages across teams or services).
After we aligned the metrics with engineering managers and tech leads across teams, we built data pipelines and a Superset dashboard to monitor any trends and build OKRs around these metrics.
Besides the quantitative metrics, we wanted to make sure on-call engineers were happy and their voices were heard. We automated sending out an on-call happiness survey to each on-call engineer as they got off schedule every week. We tracked general sentiment and an estimate of the hours spent on on-call duties. This qualitative feedback was valuable in evaluating team-level support and resource allocation. We observed that some teams were struggling with forecasting and felt they didn't have enough time for the necessary on-call work. On the bright side, a majority of respondents felt their on-call work was crucial to the quality of Zip’s product and felt motivated to contribute to the process.
Our goal was to use these metrics to bring visibility to the hard work of every on-call engineer and continue to build a culture where time was set aside for on-call engineers to do their best work. We shared a scorecard of key on-call metrics with the broader team on a regular cadence, and set some benchmarks to evaluate our overall on-call performance. For instance, when the number of on-call pages skyrocketed and the resolution SLA rate dropped under a certain threshold, we discussed it openly and worked with the corresponding teams to find ways to tune their alerts and identify any systemic issues that needed fixing.
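To make the scorecard idea concrete, here is a minimal Python sketch of how page volume and a resolution SLA rate could be computed per team and compared against benchmarks. It is an illustration only: the `Page` record, the function names, and all three thresholds are assumptions made for this example, not Zip's actual pipeline or values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical benchmarks; the real values are not stated in this post.
RESOLUTION_SLA = timedelta(hours=24)
MAX_PAGES_PER_WEEK = 20
MIN_SLA_RATE = 0.9


@dataclass
class Page:
    team: str
    opened_at: datetime
    resolved_at: Optional[datetime]  # None while the page is still open


def scorecard(pages):
    """Return {team: {"pages": count, "sla_rate": share resolved within SLA}}."""
    stats = {}
    for page in pages:
        entry = stats.setdefault(page.team, {"pages": 0, "met": 0})
        entry["pages"] += 1
        resolved_in_time = (
            page.resolved_at is not None
            and page.resolved_at - page.opened_at <= RESOLUTION_SLA
        )
        if resolved_in_time:
            entry["met"] += 1
    return {
        team: {"pages": s["pages"], "sla_rate": s["met"] / s["pages"]}
        for team, s in stats.items()
    }


def teams_needing_review(card):
    """Teams whose page volume is high AND whose SLA rate is below the benchmark."""
    return sorted(
        team
        for team, s in card.items()
        if s["pages"] > MAX_PAGES_PER_WEEK and s["sla_rate"] < MIN_SLA_RATE
    )
```

Feeding one week of pages into `scorecard` and then `teams_needing_review` gives a short list of teams to start a conversation with. The check flags a team only when both conditions hold, mirroring the example above where pages spike and the SLA rate drops at the same time.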
Enrich on-call education and tooling
Not knowing what to do as an on-call was a failure in our process. We wanted to educate on-call engineers on their responsibilities and ensure they were familiar with the available tools to help them respond to all kinds of issues quickly. We created centralized on-call run-books to standardize shared best practices across teams.
One common pain point on-call engineers raised was the number of systems that they had to monitor, and the constant context-switching cost that ensued. The challenge was that we used multiple tools for production monitoring: Datadog for alerts, Bugsnag for online exceptions, Asana for sprints, and PagerDuty for pages. To reduce the overhead, we built integrations among these systems to create tickets in Asana for all kinds of critical issues, and to route Asana tickets to the correct on-call based on team ownership configuration. On-call engineers could easily track the tickets that needed resolution in a centralized system, and move tickets to their team’s Asana sprints so that their team could have clear visibility into the on-call’s work and plan accordingly.
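As a rough illustration of ownership-based routing, the Python sketch below maps an incoming issue to an owning team and that team's current on-call engineer. Everything in it is hypothetical: the config shape, the service and engineer names, the fallback team, and the `Issue` and `Ticket` types are assumptions for the example. A real integration would also call the Asana and PagerDuty APIs, which is omitted here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership config: service name -> owning team.
OWNERSHIP = {"checkout-api": "payments", "sync-worker": "integrations"}
# Hypothetical current rotation: team -> on-call engineer.
ON_CALL = {"payments": "alice", "integrations": "bob"}
# Assumed catch-all team for services with no configured owner.
FALLBACK_TEAM = "platform"


@dataclass
class Issue:
    source: str   # monitoring tool that raised it, e.g. "datadog" or "bugsnag"
    service: str
    title: str


@dataclass
class Ticket:
    title: str
    team: str
    assignee: Optional[str]  # None if the team has no on-call configured


def route(issue, ownership=OWNERSHIP, on_call=ON_CALL):
    """Build a ticket assigned to the on-call of the team that owns the service."""
    team = ownership.get(issue.service, FALLBACK_TEAM)
    return Ticket(
        title=f"[{issue.source}] {issue.title}",
        team=team,
        assignee=on_call.get(team),
    )
```

Keeping ownership in one plain mapping means a new service only needs a single config entry to be routed correctly, and unowned services land with a fallback team instead of being dropped.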
Manage on-call engineers in a decentralized way
As the company grew in size, we needed a strong voice on each team to ensure that the on-call process was followed properly and improved continuously. We asked each team to nominate an on-call “champion”. The on-call champion is the advocate for improving and managing the on-call process on the team. Their responsibilities include:
- Regularly reviewing the team's on-call quality and keeping it high
- Refreshing team-level on-call run-books and documentation as the product evolves
- Organizing regular on-call training for new team members
Ship a high-quality product
While we were revamping the on-call process, the engineering team also worked tirelessly on guardrails and mechanisms to help engineers ship a high-quality product. We invested heavily in test coverage, typing and new CI checks to prevent breaking changes from getting merged in the first place. Various testing initiatives, including manual QA testing and end-to-end integration tests, were put into place to catch issues before they reached the production environment. All of these efforts collectively reduced the on-call load and contributed to a reliable product.
Any disruptive change to existing machinery will inevitably cause hardship and pushback. Throughout the process, we learned a few lessons:
- Leave room for teams to define team-specific best practices. For instance, our Integrations team had a lot of external dependencies that created a large number of noisy alerts and required extremely detailed, team-specific documentation so that each on-call engineer could ramp up and work on specific integration issues quickly. Their on-call engineers not only had to handle pages, but also a lot of internal questions in various channels due to the intricate details of integration systems. The team iterated on their own on-call experience, expanded their on-call responsibilities and had more generous time allocation for their on-call engineers.
- Avoid creating overly specific metrics that lead to micromanagement. Initially, to make sure on-call engineers were actively working on each ticket when a page triggered, we introduced a new SLA metric that tracked “time to triage” on top of the existing “time to resolve” SLA. This metric quickly became obsolete, as the “time to resolve” metric was already capturing the resolution speed that users perceived, and “time to triage” caused unnecessary mental overhead.
- On-call problems need to be solved in the context of team management processes. For example, on-call engineers often had difficulty completing on-call work when the team roadmap was tightly planned without enough buffer for product maintenance work. We later aligned with engineering managers to work with their cross-functional partners and allocate enough time for on-call engineers during roadmap planning, and regularly asked each on-call through the on-call happiness survey whether their team was providing enough resources and time for their work.
Developing a healthy and effective on-call process is an organization-wide effort. We quantified the ideal state with metrics and brought visibility into system health and on-call impact. We enabled on-call engineers to do their work effectively with education and tooling. We listened to our engineers and kept improving the process, trying to avoid unnecessary operational overhead. We went to the root of the problem and attempted to solve engineering quality issues with a broader lens. We are only at the beginning of the journey, and we will continue to build towards a culture of shipping a high-quality product and always putting customers first.
Many thanks to Jimmy Zhuang and all of our on-call champions for contributing to this effort and making Zip’s product more stable and available!