Reliable, Scalable, and Maintainable Applications

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

Databases: To store data
Caches: To remember result of expensive operation, to speed up reads
Search indexes: Search data by keyword or filter on various fields
Stream processing: Send a message to another process, to be handled async
Batch processing: Periodically compute large amount of acc. data

Quote

Increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs.

Three important concerns for most software systems:

Reliability
- Work correctly in the fact of adversity (hardware or software faults, and even human error)
Scalability
- As system grows (volume, traffic, or complexity), there should be ways to dealing with that growth
Maintainability
- Maintaining current behaviour and adapting the system to new use cases

I. Reliability

Quote

Continuing to work correctly, even when things go wrong

Things that can go wrong are called *faults

It only makes sense to talk about tolerating certain type of faults

Types of issues: Hardware, Software errors, Human errors

Well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourages “the wrong thing”

II. Scalability

One common reason for degradation is increased load: System has grown from 10k concurrent users to 100k concurrent users

Quote

a system’s ability to cope with increased load

There are a few dimensions when discussing scalability of a system:

Dimensions of scalability

Describing load — called load parameters

RPS, Ratio of reads to writes, Simultaneously active users.

What is fan-out?

From EE, describes # of logic gate inputs attached to another gate’s output.

Number of requests to other services that we need to make in order to serve on incoming request.

Describing performance — what happens when load increases

Notes on latency and response time

They are not the same.

Response time — what the client sees besides the actual time to process the request (Round trip time, RTT), it includes network delays and queuing delays

Latency — duration that a request is waiting to be handled - during which it is awaiting services

Knobs to control: Load parameter, system resources (# of CPUs, Memory, storage)

Two ways to think about this:

When you increase load parameter and keep the system resources unchanged, how is the performance of your system affected?
When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Median a good metric if you want to know how long users typically have to wait

The median is also known as the 50th percentile, and sometimes abbreviated as p50
High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service

Understanding performance — Median

If the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more

Percentiles in Practice

Tail latency amplification: When several backend calls are needed to serve a request to user, it takes just a single slow backend request to slow down the entire end-user request.

While adding response time percentiles to the monitoring dashboards:

Keep a rolling window of response times of requests in the last 10 minutes
Calculate the median and various percentiles over the values in that window and plot those metrics on a graph
- This can be expensive
- Algorithms to calculate a good approximation of percentiles
  - Forward decay
  - t-digest
  - HdrHistogram

Averaging percentiles is mathematically meaningless

Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms

III. Maintainability

Three design principles for software systems:

Operability - Easy for the ops team to keep the system running
Simplicity - Easy for any new engineers to understand the system
Evolvability - Easy to make changes to the system

Unlike reliability and scalability, there are no easy solutions to achieve maintainability

Operability

Quote

“good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”

Monitoring the health of the system and quickly restoring service if it goes into a bad state
Tracking down the cause of problems, such as system failures or degraded performance
Keeping software and platforms up to date, including security patches
Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

Simplicity

Simplicity should be a key goal for the systems we build.

Symptoms of complexity:

Explosion of the state space
Tight coupling of modules
Tangled dependencies
Inconsistent naming and terminology
Hacks aimed at solving performance problems
Special-casing to work around issues elsewhere
and the list goes on…

Evolvability

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions:

simple and easy-to understand systems are usually easier to modify than complex ones

Quick summary

Functional requirements: what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways

Non functional requirements: security, reliability, compliance, scalability, compatibil‐ ity, and maintainability

Reliability means making systems work correctly, even when faults occur.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstrac‐ tions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

Appendix

Follow ups

What is “Netflix Chaos Monkey”?
Why is using histogram the right way of aggregating response time data?
- Read: https://bravenewgeek.com/everything-you-know-about-latency-is-wrong

The Commonplace Book

Explorer

Reliable, Scalable, and Maintainable Applications

I. Reliability

II. Scalability

Dimensions of scalability

Describing load — called load parameters

Describing performance — what happens when load increases

Percentiles in Practice

III. Maintainability

Operability

Simplicity

Evolvability

Appendix

Follow ups

Graph View

Table of Contents

Backlinks