Reliability and scalability of service
Few things are more damaging than an outage of an application that serves all your customers. These systems take significant time and investment to deliver, and they typically fail when they’re most needed: when the most people are trying to use them.
Outages result in lost revenue and, often, bad press too.
Scale – elastically
A web or mobile service’s demand for computing resources grows with customer demand for the service. Sometimes the capacity needed at peak is thousands of times higher than in the middle of a quiet night. To meet that demand, services need to scale elastically, adding resources as they’re needed and releasing them when they’re not. Otherwise, you have to provision for the peak at all times, and the spare capacity goes to waste while you still pay for it.
This requires infrastructure automation, commonly supported by cloud computing services, but also a system design that can use the added resources effectively, a property known as horizontal scalability. Applications need to be packaged in a portable way and be able to start within seconds when a new, independent instance is needed to serve more customers. These assumptions are the foundation of current infrastructure tools such as Docker and Kubernetes.
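To make the idea of elastic, horizontal scaling concrete, here is a minimal sketch of the sizing decision an autoscaler has to make. It is an illustration only, not how any particular tool implements it: the function name, the per-instance capacity figure and the minimum and maximum bounds are all assumptions chosen for the example.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 2,
                      max_instances: int = 200) -> int:
    """Return how many identical instances the current load calls for.

    All parameters are hypothetical: capacity_per_instance is the load one
    instance is assumed to handle, and the min/max bounds keep a safety
    floor at night and a cost ceiling at peak.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up: a fraction of an instance still needs a whole instance.
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp the result between the floor and the ceiling.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    print(desired_instances(50, 100))      # quiet night
    print(desired_instances(10_000, 100))  # busy peak
```

With these assumed numbers, a quiet night stays on the two-instance floor while a peak of 10,000 requests per second calls for 100 instances, which only works if each instance is independent and quick to start.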
Look to smart integration layers
Horizontal scalability is especially challenging for applications that integrate with legacy systems running on physical hardware. These systems can’t scale easily. Smart integration layers, which might for example cache, queue or limit requests, are often a good way to shield the customer-facing application from the effects of those systems reaching their capacity limits.
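One possible shape for such an integration layer is sketched below, under stated assumptions: the legacy system can only handle a fixed number of concurrent calls, and serving the last known value is acceptable when it is saturated. The class name LegacyGateway, the call_legacy callable and the max_concurrent limit are hypothetical names introduced for this example.

```python
import threading
from typing import Callable, Dict, Optional


class LegacyGateway:
    """Limits concurrent calls to a legacy system and falls back to a cache."""

    def __init__(self, call_legacy: Callable[[str], str], max_concurrent: int):
        self._call_legacy = call_legacy
        # One slot per call the legacy system is assumed to tolerate at once.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._cache: Dict[str, str] = {}
        self._cache_lock = threading.Lock()

    def fetch(self, key: str) -> Optional[str]:
        # Do not wait for a slot: if the legacy system is at its limit,
        # answer from the cache (or None) instead of piling on more load.
        if not self._slots.acquire(blocking=False):
            with self._cache_lock:
                return self._cache.get(key)
        try:
            value = self._call_legacy(key)
            with self._cache_lock:
                self._cache[key] = value  # remember the last known value
            return value
        finally:
            self._slots.release()
```

The design choice here is to fail fast rather than queue: the customer-facing application gets a possibly stale answer immediately, and the legacy system never sees more than max_concurrent calls at a time.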
Unlock the power of continuous delivery
An added benefit of an application and infrastructure design that scales easily is the ability to create development and testing environments in minutes, enabling the continuous delivery of new features. The risk posed by changes can be reduced by releasing them gradually: to a small subset of customers first, then dialling the number up. And if a change does cause an outage, full automation makes it much easier to restore the service quickly.
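A gradual release of this kind is often done by assigning each customer to a stable bucket and enabling the change for buckets below a chosen percentage. The sketch below shows one common way to do that with a hash; the function names, the feature label and the 100-bucket granularity are assumptions for illustration rather than a description of any specific product.

```python
import hashlib


def rollout_bucket(customer_id: str, feature: str) -> int:
    """Map a customer to a stable bucket from 0 to 99 for a given feature."""
    # Hashing feature and customer together gives each feature its own
    # independent split of the customer base.
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(customer_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the feature is switched on for this customer."""
    # A customer enabled at 5% stays enabled at 20%, 50% and 100%,
    # because their bucket never changes.
    return rollout_bucket(customer_id, feature) < rollout_percent


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(1000)]
    for percent in (1, 10, 50, 100):
        enabled = sum(is_enabled(c, "new-checkout", percent) for c in customers)
        print(f"{percent}% rollout reaches {enabled} of {len(customers)} customers")
```

Dialling the number up is then a configuration change rather than a new deployment, and dialling it back to zero is an equally quick way to withdraw a change that misbehaves.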
Manage risk
Special consideration is needed for monitoring web-scale production systems. Real-time performance metrics, observability, and well-tuned alarms are a must. Why? Because every minute of outage can cost thousands of pounds. Think of e-commerce systems going down in the weeks leading up to Christmas, or flight operations system outages causing chaos.
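As an illustration of what a well-tuned alarm involves, the sketch below fires when the error rate over a sliding time window exceeds a threshold, and stays quiet when there is too little traffic to judge. The class name ErrorRateAlarm and the default window, threshold and minimum request count are hypothetical values; real monitoring tools offer this kind of rule as configuration.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlarm:
    """Fires when the share of failed requests in a time window is too high."""

    def __init__(self, window_seconds: float = 60.0,
                 threshold: float = 0.05, min_requests: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, failed: bool) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def is_firing(self, now: float) -> bool:
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False  # too little traffic to judge; avoids noisy alerts
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total > self.threshold
```

The minimum request count is the tuning knob that matters most on a quiet night: without it, a single failed request out of three would page someone for a 33% error rate.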
Empower your teams
Scalable and reliable systems are far more complex than many people first assume. Teams should be given the time and autonomy to learn to operate them safely. Some incidents in the early stages are inevitable, but they should be used as lessons to improve the system.
Cloud computing is a great enabler of scalable and reliable systems, but many expect it to reduce their like-for-like infrastructure costs as well. Be warned: this is rarely the case. But the spend is well worth it to prevent expensive disasters.