Here at Swrve we work hard both to build software that is efficient and reliable and to operate that software at a high level of availability, so that our customers get a consistent, high-quality service. We’ve evolved as a company from the simpler early days of running a straightforward LAMP-style stack (back in 2009) to today’s scalable SaaS platform, with a custom parallel event stream processor at its core, which now handles over 60k events per second gathered from millions of active user devices around the globe.
One of the big challenges to any organization as it scales is to develop the appropriate processes and behaviors within the team and to adapt those incrementally with growth. That’s all about trying to stay on the right side of the pragmatic tradeoffs that are inevitable in the absence of infinite budgets.
We’re frequently asked by customers how we go about delivering our service, so I thought I could give a high-level overview here. To begin with, let’s look at what’s probably the most visible measure of reliability: the service uptime (and latency). We use www.pingdom.com as an independent reporting service, and here’s what Pingdom reports for our primary API endpoint for calendar year 2014:
The key thing here is the uptime report (which Pingdom is currently rounding up to 100%). In reality, over 2014 we had 6 minutes of API unavailability, which is approximately 99.998% availability (nearly, but not quite, five 9s!).
A lot of energy goes into achieving this record, and much of it is spent on the architecture and reliability of the software components that make up the service itself - but that’s beyond the scope of this article.
Instead I wanted to focus on the systems we have in place to monitor and maintain the service, and the processes we have in place to deal with the issues that inevitably arise. We’ve drawn a lot of inspiration from our peers in the industry, and in particular I’d recommend you check out all the posts on the topic from Stephanie Dean (who’s worked with Amazon, Activision and Twitter). Stephanie has seen it all before and you’d do a lot worse than just implement verbatim what she writes about.
Here’s a brief run-through of some of the infrastructure and processes we use at Swrve. This is a high-level summary, glossing over most of the details, but feel free to reach out to me (steve at swrve) if you want more information on any topic.
Monitoring
You can’t respond to an issue unless you know it’s happening. With hundreds or thousands of server instances running (in our case on Amazon Web Services), each with its own set of operational characteristics, you need to depend on automation to alert you when ‘bad stuff’ is going down. We use a combination of AWS CloudWatch metrics, Graphite, Nagios, Logstash and other tools to build our monitoring fabric. Across the service we’re collecting thousands of metrics per second, and these are piped into a central metrics service, which feeds our live operations dashboards and our alerting tools. While automation is critical, it needs to be complemented by a systematic approach to reviewing metrics and services on a regular basis. To help with this, we pump all key operational metrics to central live dashboards visible to all of engineering (not just our operations staff). You can see a dashboard below:
We conduct periodic reviews of the entire service, component by component, and also run a burn-in review of new services (to check how they’re operating after 6 to 8 weeks, ensure the alerting thresholds are correct, and confirm we’re gathering all the necessary metrics and logs). These less automated behaviors really help with proactive identification of issues before they become incidents.
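To make that metrics pipeline a little more concrete, here’s a minimal sketch of how a service might push a metric into Graphite using Carbon’s plaintext protocol. The hostname and metric path are illustrative assumptions, not our actual configuration:

```python
# Minimal sketch: push a metric into Graphite using Carbon's plaintext
# protocol ("<path> <value> <timestamp>\n" over TCP, port 2003 by default).
# The hostname and metric path are hypothetical placeholders.
import socket
import time

CARBON_HOST = "graphite.internal.example.com"  # hypothetical Carbon relay
CARBON_PORT = 2003                              # default plaintext listener

def send_metric(path, value, timestamp=None):
    """Send a single metric line to Carbon."""
    timestamp = timestamp or int(time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. report how many events the API tier processed in the last second
send_metric("swrve.api.events_per_second", 61000)
```

In practice you would batch these lines and send them through a local relay or statsd-style aggregator rather than opening a connection per metric, but the wire format is that simple.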
Alerting & Escalation
With all these metrics reporting a huge variety of different operational characteristics every second, we rely on a system of automated alerts to allow us to respond quickly to issues.
Rules are set up in AWS CloudWatch and Nagios to trigger a variety of responses depending on the criticality of the issue:
- Critical issues: if an issue is in danger of impacting the customer experience, it is automatically escalated to PagerDuty, which applies a set of escalation policies. At any time there is a rota of operations engineers on call, and PagerDuty raises the issue via mobile texts and pages, automatically failing over to other engineers if no response is received within a time threshold. We aim to have an engineer respond and acknowledge the issue within 5 minutes of the alert firing (a sketch of how such an alarm might be wired up follows this list).
- High priority issues: these events generate emails targeting specific distribution lists. Different team members are on each of the distros and can respond accordingly; while we don’t actually wake someone up, the issue will be dealt with within a 24-hour period (and usually much less, given our office locations in different time zones).
- Low Priority Issues: generally these are informational only, and are recorded in emails or logs, which are subsequently analysed for frequencies and trends.
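To give a flavour of the critical path above, here’s a hedged sketch of how one such CloudWatch alarm might be defined using boto3, with the alarm action pointing at an SNS topic that a PagerDuty integration could subscribe to. The metric, namespace, threshold and topic ARN are all hypothetical placeholders, not our actual alerting rules:

```python
# Sketch of a 'critical' alarm: page the on-call engineer (via an SNS topic
# wired to a PagerDuty integration) if API 5xx errors stay high. The names,
# namespace, threshold and topic ARN are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-errors-critical",
    Namespace="Swrve/API",                      # hypothetical custom namespace
    MetricName="HTTP5xxCount",
    Statistic="Sum",
    Period=60,                                  # evaluate per-minute sums
    EvaluationPeriods=3,                        # 3 consecutive bad minutes
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        # SNS topic that the on-call paging integration subscribes to
        "arn:aws:sns:us-east-1:123456789012:ops-critical-pages",
    ],
    AlarmDescription="Pages on-call if the API 5xx rate stays high for 3 minutes",
)
```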
In addition, we have issues raised by customers, by our customer success team working directly with customer data, and by engineering. In all cases, issues are recorded in the operations ticketing system, so that the full lifecycle of each issue can be tracked and all relevant parties can be kept informed about the status of the service.
Communications
It’s so important to ensure that communications flow, especially when you’re under pressure dealing with an emergency and the instinct is to hunker down and solve the problem before coming up for air. We use a variety of tools to help us with this.
Internally, we are big users of Zendesk as our primary ticketing system (both for customer support queries and for operations incident tracking). Externally, we use StatusPage.io (status.swrve.com and @swrve_status) to communicate the current status of the system.
This allows us to alert customers directly within the main Swrve app (via banner alerts) when something is up, and customers can refer to the status pages for updates on ongoing issues. They can also subscribe to our Twitter alert stream to keep abreast of our progress in real time.
While an incident is in progress we keep this system up to date, so that customers can track how we’re responding. Here’s an example of an “incident” where we were carrying out some essential database maintenance, bringing our application dashboards down for a few minutes:
If we’re planning maintenance, we’ll also use a combination of emails and intercom.io embedded messaging to alert customers of the maintenance window, well in advance of the expected impact to the service.
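For illustration, here’s a rough sketch of how an incident might be opened on a StatusPage-hosted status page programmatically via its REST API. The page ID, API token and exact endpoint and field names are placeholders based on StatusPage’s public API and should be checked against the current documentation rather than taken as our actual tooling:

```python
# Hedged sketch: opening an incident on a StatusPage-hosted status page via
# its REST API. The page ID, API token and field names are placeholders;
# check the current StatusPage API docs before relying on this.
import requests

STATUSPAGE_API = "https://api.statuspage.io/v1"
PAGE_ID = "your_page_id"          # placeholder
API_TOKEN = "your_api_token"      # placeholder; keep out of source control

def open_incident(name, body, status="investigating"):
    """Create an incident so customers see it on the public status page."""
    resp = requests.post(
        f"{STATUSPAGE_API}/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_TOKEN}"},
        json={"incident": {"name": name, "status": status, "body": body}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

open_incident(
    "Scheduled database maintenance",
    "Application dashboards will be unavailable for a few minutes.",
)
```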
Incident Response
At the heart of our systems for site reliability is our process for dealing with critical incidents. These situations don’t happen very often, thankfully, but it’s inevitable that stuff will go wrong, despite all best precautions and proactive monitoring of the service. Once an incident has been acknowledged by the on-call operations team, and subsequently escalated to a critical incident, our response plan kicks in.
The first and most important point is that the Critical Response Plan has a primary goal and a secondary goal. The primary goal is to re-establish a working service for our customers. The secondary goal is to identify the issue that is causing the critical drop in service. We’ve learned that these are not the same thing, so rather than immediately diving into the problem to try to find the cause, we focus first on service restoration; very often the root cause can be dealt with later. Here are a few of the important aspects of our response plan:
- A response leader has been identified in advance, and the incident will be escalated to them by the on-call operations engineer. The leader is responsible for assembling the appropriate response team, and for initial escalation to company management as required.
- The leader is also responsible for all communications: internally, ensuring the response team is communicating effectively and that management and other team members are kept up to date on the incident status; and externally, ensuring our customer notification systems are being used and kept current.
- During office hours, the response team is assembled in a response room, to take advantage of team proximity. Out of hours, a virtual response room is created using a combination of video conferencing and a Slack channel. We are big users of Slack, and during an incident it serves as the primary log of the event: all steps are recorded there, graphs and logs are shared, and every major update made to the service is noted.
- The team is split in two from the outset: a service recovery team and a root cause team. The recovery team is tasked with restoring the service as quickly as possible, usually through a combination of graceful degradation of non-essential services (as sketched below), routing traffic to alternative infrastructure, rollbacks of recent service updates or restarts of servers. The root cause team immediately begins a deep dive into application logs, service metrics, crash dumps and any other data available to try to figure out what triggered the issue in the first place. It’s often the case that you can restore a fully working service long before you ever figure out the root cause.
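As a purely illustrative sketch of the graceful degradation idea mentioned above (not our actual mechanism), non-essential work can be guarded by a flag that the recovery team flips at runtime, while essential processing carries on untouched:

```python
# Purely illustrative sketch of a graceful-degradation 'kill switch': the
# recovery team can disable non-essential work at runtime while essential
# processing continues. All names are hypothetical, and in practice the flags
# would live in a shared config store rather than in-process.
import logging

DEGRADED_FEATURES = set()  # features currently switched off

def degrade(feature):
    """Called by the recovery team to shed non-essential load."""
    DEGRADED_FEATURES.add(feature)
    logging.warning("Feature degraded: %s", feature)

def feature_enabled(feature):
    return feature not in DEGRADED_FEATURES

def persist_event(event):
    pass  # essential path: store the event (stub for illustration)

def update_realtime_dashboards(event):
    pass  # non-essential path: refresh live dashboards (stub)

def handle_event(event):
    persist_event(event)  # essential work is never skipped
    if feature_enabled("realtime_dashboards"):
        update_realtime_dashboards(event)  # shed this during an incident

# During an incident the recovery team might run:
degrade("realtime_dashboards")
```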
During this time, the most important thing is to maintain communications and a calm and professional approach to the incident. The leader will update the entire team methodically and at regular intervals so everyone is in sync with what is happening. Never assume, even if everyone is in the same room, that everyone is on the same page.
Ultimately the aim is to restore the service - recover data, reduce latencies, whatever it takes - as efficiently as possible. Once service is restored, the leader is responsible for updating management, standing down the team, and communicating with customers as required. The incident may remain at high priority (as opposed to critical) status for days thereafter, as the root cause team continue to work on the core issues that triggered the incident in the first place.
Post Mortem
Having restored the service, and once the root cause has been determined and potentially dealt with, it’s very important to conduct a five-whys analysis and post mortem, to get to the roots of the problems that arose. This discussion, involving engineering and operations staff, will produce a list of actions, in our case with 3 levels of priority.
All high and medium priority items are recorded in our JIRA tracking system for action. High priority actions are scheduled for work in the current active engineering sprints, medium priority items are placed on the backlog for subsequent scheduling, and low priority items are recorded for future reference and generally taken on board by the architects for consideration in future planning sessions.
Summary
I’ve really only skimmed the surface here, but hopefully this gives a sense of what goes on under the covers! All highly available systems stay that way through a combination of automation and human processes; the old analogy of the swan really applies here: a calm, serene swan above the water, without a care in the world, and highly industrious, vigorous paddling below the surface, balancing planning, exertion and turbulence to achieve the overall graceful performance.