SaaS platforms like Swrve’s Mobile Marketing Automation service need to deal with significant scale, handling event streams from millions of devices per day, and serving content, configurations and messaging to those devices in real-time. This puts a lot of pressure on the back-end systems and operations teams, but it is vital to ensure a highly available service with fast response times. Which we do!

There’s no one solution to how you build systems like this. John Allspaw is a great source of inspiration, particularly for those of us building web and mobile operations at scale, and highscalability.com is a great place to keep track of recent developments and to compare notes with other companies’ approaches.

But I figured I’d note down some of the things we think about as we grow and operate the Swrve service. This isn’t meant to be any sort of definitive list; it’s simply a set of practices, in no particular order, that we believe are important.

1. No single points of failure

The Achilles heel of any system is a single point of failure: a single service that, if interrupted, could bring down the entire system. It’s funny (or not) how easily these can creep into designs. You’ll find that it won’t be the obvious systems, like a central database storing configuration data for your service, but rather something like a tool you use for aggregating logs, for orchestrating deployments, or for sequencing services (we recently had one of these creep into a design, and it was caught pretty late in the day).

If there’s a single point of failure, then you’re pretty much guaranteed that the system will eventually fail. Always favour systems that have automated failover and can run happily, and in sync, in multiple locations.
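To make this concrete, here’s a minimal sketch (in Python, with hypothetical hostnames and endpoints; not a description of our actual tooling) of a client that fails over between redundant endpoints based on a health check, rather than depending on any single one:

```python
# Minimal failover sketch: try each replica's health check and fall through
# to the next one on any failure, so no single endpoint is a point of failure.
import requests

ENDPOINTS = [
    "https://events-us-east.example.com",   # hypothetical replicas
    "https://events-eu-west.example.com",
]

def healthy(base_url, timeout=0.5):
    """Return True if the endpoint answers its health check in time."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).ok
    except requests.RequestException:
        return False

def post_event(event):
    """Try each healthy endpoint in turn; raise only if all of them fail."""
    for base_url in ENDPOINTS:
        if not healthy(base_url):
            continue
        try:
            return requests.post(f"{base_url}/events", json=event, timeout=2)
        except requests.RequestException:
            continue  # fall through to the next replica
    raise RuntimeError("no healthy endpoint available")
```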

2. “Shard Hard”

This seems obvious, but the route to scaling is to plan your sharding strategy (how a workload is distributed across a cluster), and ideally to ensure that the strategy you choose initially does not lead you down a blind alley.

The sharding approach you choose should grow with you. It might seem that you can happily shard by geography, for example, but eventually you grow to the point where the number of users or activity related to a single geography becomes too large and itself needs to be sharded. Think about what entity represents the most atomic unit of the transactions you process (in our case it’s the user of an app), and plan ultimately to shard on that basis.
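As a rough illustration of sharding on the most atomic unit, here’s a minimal sketch that maps a user ID to a shard deterministically; the shard count and hashing scheme are illustrative assumptions, not our implementation:

```python
# Shard on the user (the most atomic unit of our transactions), not on
# geography: every event for a given user lands on the same shard, and each
# shard can be processed and rebalanced independently of the others.
import hashlib

NUM_SHARDS = 64  # illustrative; grow this as the service grows

def shard_for_user(user_id: str) -> int:
    """Map a user ID to a shard deterministically."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same user always maps to the same shard.
assert shard_for_user("user-12345") == shard_for_user("user-12345")
```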

3. Aim for the happy path, but plan for the sad path

Your smart engineers will be building great systems that achieve wonderful things (e.g. real-time processing of signals from apps), but ultimately you need to think about your failover strategies: what do you do during an incident with the service? How do you deal with sharp discontinuities in load (Super Bowl ad breaks, or time-linked super sales promotions), or with hitting a system limit (hopefully never, but it’s tough to automate everything from the start)?

Your best friend in these circumstances is a super resilient ‘sad path’. For example, Swrve has a pretty cool real-time event streaming infrastructure; we can segment and cluster users pretty much in real-time using grids of state machine processors and a custom event dispatch system.

This is great when everything is running as expected, but if we hit an issue (perhaps an EC2 instance on AWS degrades, or a customer’s update to their app has an error and starts to spam our API endpoints), we need a backup. Our approach is to degrade our event processing to a “sad path” slow transport layer, which bounces all event traffic through AWS’s S3 storage layer.

This way we can make guarantees about the resilience of our event storage, and “do our best” to maintain the real-time happy path. Another strategy is to allow individual services to be degraded to take pressure off the systems that are in trouble; being able to do this from a dashboard is critical. Hailo has a nice post about this.
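As a sketch of the degrade-to-slow-transport idea (with a hypothetical bucket name and dispatch function; the real pipeline is more involved), the fallback can be as simple as catching a failure on the happy path and persisting the raw events durably to S3 for later replay:

```python
# Happy path / sad path sketch: prefer real-time dispatch, and on any failure
# write the raw events to S3 so nothing is lost and they can be replayed later.
import json
import uuid
import boto3

s3 = boto3.client("s3")
SAD_PATH_BUCKET = "example-event-overflow"  # hypothetical bucket name

def handle_events(events, dispatch_realtime):
    """Prefer the happy path; fall back to durable S3 storage on any error."""
    try:
        dispatch_realtime(events)          # happy path: real-time processing
    except Exception:
        key = f"overflow/{uuid.uuid4()}.json"
        s3.put_object(                     # sad path: durable, replayable later
            Bucket=SAD_PATH_BUCKET,
            Key=key,
            Body=json.dumps(events).encode("utf-8"),
        )
```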

4. Always have a roll-back strategy

When pushing new code to production (and you want to be able to do this rapidly and frequently to avoid code going stale, and to react quickly to customer requests) always have a roll-back capability. This is sometimes a lot harder to implement than you think, and very occasionally you may be forced into a deployment where pragmatically having a roll-back is just not something you can afford.

Avoid this if you possibly can. In some cases you’ll be able to do deployments using simple DNS switches that bring in new services which have been quietly shadowing the current live service. In other cases you’ll be making changes to services where it’s not possible to switch cleanly, and here you really want to be able to revert to the previous service if you see any issues.

Classic blockers to having a roll-back are core database migrations that are one-way, or simultaneous deployments to an entire fleet of services. Where possible, build in the ability to do rolling updates of servers, so that you can quickly remove a bad update from the fleet, and avoid database migrations that are not backwards compatible. A common source of these sorts of migrations is the ORMs used to build web applications (I’m looking at you, Ruby on Rails!).
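A rolling update with an automatic roll-back can be sketched in a few lines; the deploy and health-check helpers below are hypothetical placeholders for whatever your orchestration tooling provides:

```python
# Rolling deploy sketch: update one server at a time, verify it, and revert
# the whole fleet to the old version if any host fails its health check.
def rolling_deploy(servers, new_version, old_version, deploy, health_check):
    """Deploy new_version host by host; roll back everything on failure."""
    updated = []
    for server in servers:
        deploy(server, new_version)
        if health_check(server):
            updated.append(server)
            continue
        # Bad update: revert the failed host and every host already updated.
        for done in [server] + updated:
            deploy(done, old_version)
        raise RuntimeError(f"roll-back triggered by {server}")
```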

5. Minimise service dependencies

The goal of any architect of a scalable web service is to build that service as a set of individual services with well-defined APIs and protocols, each of which operates independently and can be tested and deployed in isolation.

This holy grail state does not remove the need for system testing (a set of interacting services may fail even if each of the individual services operates correctly), but removing first-order dependencies between systems significantly simplifies testing, deployment and maintenance.

Each sub-system should ideally not “know” about any of the other systems, and thus a change in one should not require updates to other services. Once these dependencies creep into a system, maintenance becomes a nightmare, and at the limit a change to a service triggers an entire redeployment of the full system. That’s not a place you want to be!
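One way to picture this is two services that agree only on a message schema and a queue, never on each other; the schema fields and queue abstraction below are illustrative, not a real API:

```python
# Decoupling sketch: producer and consumer share only a message contract.
# Neither "knows" about the other, so either can change independently as
# long as the contract holds.
from dataclasses import dataclass, asdict
import json

@dataclass
class UserEvent:
    """The shared contract: the only thing producer and consumer agree on."""
    user_id: str
    name: str
    timestamp: float

def publish(queue, event: UserEvent):
    # The producer serialises to the agreed schema and stops caring.
    queue.put(json.dumps(asdict(event)))

def consume(queue, handler):
    # The consumer depends only on the schema, never on the producer.
    handler(UserEvent(**json.loads(queue.get())))
```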

6. Stateless as a goal

It’d be a more pleasant (but possibly less interesting!) world if we could cost-effectively implement everything as a stateless system. With a purely functional view of the world, we’d have a set of services operating independently on data that flows through those services.

The services would be idempotent, trivially parallelizable and scalable, and you’d find yourself in a place where you can “throw money at the problem”. Now, your CFO might not like this, but from an operational perspective you always want the comfort of knowing “hey, I can just spin up more instances in this cluster and we’ll be good until we figure out a better idea”.

But persistence of data is nearly always a fact of life. We have many stateless components in Swrve, but at the core of what we do is a pretty stateful system that maintains a record of each user session (cached locally on the event stream processors). That just doesn’t fit the stateless model (at least not easily), so we spend a lot of time and effort building tools that let us persist, recover and rebuild state.
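As a rough sketch of the persist/recover/rebuild pattern (the durable store and event-replay source are hypothetical), session state can be checkpointed, and rebuilt from the archived event stream when a checkpoint is missing:

```python
# Stateful-core sketch: checkpoint session state to a durable store; if a
# checkpoint is missing, rebuild the session by replaying archived events.
import json

def save_session(store, user_id, session):
    """Checkpoint the current session state."""
    store.put(f"session/{user_id}", json.dumps(session))

def load_session(store, user_id, replay_events):
    """Recover from a checkpoint, or rebuild by replaying the event log."""
    raw = store.get(f"session/{user_id}")
    if raw is not None:
        return json.loads(raw)
    session = {}
    for event in replay_events(user_id):   # e.g. events archived on S3
        session = apply_event(session, event)
    return session

def apply_event(session, event):
    # Idempotent state transition: applying the same event twice is harmless.
    session[event["name"]] = event["timestamp"]
    return session
```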

7. “If it’s not monitored, it’s probably already failed”

Without metrics (and associated automated alerts) you’re essentially flying blind. As you scale out the size of your service, you need to spend more time and energy on your logging and metrics services. We’ve invested pretty heavily in our monitoring infrastructure at Swrve. It’s often difficult to keep pace with the growth of the service, and even more difficult to decide on the level of monitoring and the expiry policy for the time-series data. Ultimately, though, having detailed metrics and logs is absolutely key to maintaining system integrity and to analysing issues, ideally before they become critical.

Right now, Swrve generates about 25k metric events per second, which are aggregated by a sharded Graphite cluster. There’s lots of exciting work on using machine learning to predict or analyse large volumes of metrics data; for now, though, we rely on our all-seeing operations team to build appropriate alerts (the same team responds to the alerts, so they spend time getting these tuned correctly) and to monitor key metrics by eye, which is sometimes just as valuable as an automated system.
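For illustration, shipping a metric to Graphite can be as simple as writing a line to its plaintext protocol (metric path, value and timestamp over TCP port 2003); the hostname and metric path below are made up:

```python
# Graphite plaintext protocol sketch: "<path> <value> <timestamp>\n" to port 2003.
import socket
import time

def send_metric(path, value, host="graphite.example.com", port=2003):
    """Emit a single data point to a Graphite/Carbon listener."""
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=1) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. count events as they are processed
send_metric("events.processed.count", 125)
```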

8. There’s no test like real production load

No matter how much energy and effort you spend on pre-production testing, it’s nearly always impossible to fully test a system without exposing it to real production load, and operating it in concert with the other components of your system.

Do what you can to simulate this in advance (a good approach is to “record” a snapshot of real production load on your system and play it back), but ultimately the best test is to run the system in a live production environment, either with a failover to an older system or behind feature flags that allow the load on the system to be quickly controlled.

In our case, we always try to push a feature to production as soon as possible (to avoid code getting stale), but we always hide it behind a feature flag. We then expose the feature to a subset of our users, check that everything is looking good, and eventually ramp it up to full general availability. With a controlled approach (and with appropriate roll-back capabilities) this really reduces the risk of a new system operating at production scale.
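A percentage roll-out behind a feature flag can be as simple as hashing the user ID into a fixed number of buckets; the flag names and percentages below are illustrative:

```python
# Feature-flag roll-out sketch: hash each user into one of 100 buckets and
# enable the flag for the first N of them, so ramping up is just a config change.
import hashlib

ROLLOUT_PERCENT = {"new_event_pipeline": 5}  # hypothetical flag: 5% of users

def feature_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user and compare against the roll-out %."""
    percent = ROLLOUT_PERCENT.get(flag, 0)
    bucket = int(hashlib.sha1(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent

# The same user always lands in the same bucket, so their experience stays
# stable as the roll-out percentage is ramped towards 100 (full availability).
```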

...

Now clearly this is a short list, and only a small sample of the many things we think about when planning the roadmap for our platform.  There are many more strategies and processes that we’ve learned from others or built ourselves through experience, so feel free to reach out if you want to hear any war stories or more details on how we built Swrve.