A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Databases: To store data
  • Caches: To remember result of expensive operation, to speed up reads
  • Search indexes: Search data by keyword or filter on various fields
  • Stream processing: Send a message to another process, to be handled async
  • Batch processing: Periodically compute large amount of acc. data

Quote

Increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs.

Three important concerns for most software systems:

  • Reliability
    • Work correctly in the fact of adversity (hardware or software faults, and even human error)
  • Scalability
    • As system grows (volume, traffic, or complexity), there should be ways to dealing with that growth
  • Maintainability
    • Maintaining current behaviour and adapting the system to new use cases

I. Reliability

Quote

Continuing to work correctly, even when things go wrong

Things that can go wrong are called *faults

  • It only makes sense to talk about tolerating certain type of faults

Types of issues: Hardware, Software errors, Human errors

  • Well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourages “the wrong thing”

II. Scalability

One common reason for degradation is increased load: System has grown from 10k concurrent users to 100k concurrent users

Quote

a system’s ability to cope with increased load

There are a few dimensions when discussing scalability of a system:

Dimensions of scalability

Describing load — called load parameters

RPS, Ratio of reads to writes, Simultaneously active users.

What is fan-out?

From EE, describes # of logic gate inputs attached to another gate’s output.

Number of requests to other services that we need to make in order to serve on incoming request.

Describing performance — what happens when load increases

Notes on latency and response time

They are not the same.

Response time — what the client sees besides the actual time to process the request (Round trip time, RTT), it includes network delays and queuing delays

Latency — duration that a request is waiting to be handled - during which it is awaiting services

Knobs to control: Load parameter, system resources (# of CPUs, Memory, storage)

Two ways to think about this:

  • When you increase load parameter and keep the system resources unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Median a good metric if you want to know how long users typically have to wait

  • The median is also known as the 50th percentile, and sometimes abbreviated as p50
  • High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service

Understanding performance — Median

If the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more

Percentiles in Practice

Tail latency amplification: When several backend calls are needed to serve a request to user, it takes just a single slow backend request to slow down the entire end-user request.

While adding response time percentiles to the monitoring dashboards:

  • Keep a rolling window of response times of requests in the last 10 minutes
  • Calculate the median and various percentiles over the values in that window and plot those metrics on a graph
    • This can be expensive
    • Algorithms to calculate a good approximation of percentiles
      • Forward decay
      • t-digest
      • HdrHistogram

Averaging percentiles is mathematically meaningless

Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms

III. Maintainability

Three design principles for software systems:

  • Operability - Easy for the ops team to keep the system running
  • Simplicity - Easy for any new engineers to understand the system
  • Evolvability - Easy to make changes to the system

Unlike reliability and scalability, there are no easy solutions to achieve maintainability

Operability

Quote

“good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems, such as system failures or degraded performance
  • Keeping software and platforms up to date, including security patches
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

Simplicity

Simplicity should be a key goal for the systems we build.

Symptoms of complexity:

  1. Explosion of the state space
  2. Tight coupling of modules
  3. Tangled dependencies
  4. Inconsistent naming and terminology
  5. Hacks aimed at solving performance problems
  6. Special-casing to work around issues elsewhere
  7. and the list goes on…

Evolvability

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions:

simple and easy-to understand systems are usually easier to modify than complex ones


Quick summary

  • Functional requirements: what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways

  • Non functional requirements: security, reliability, compliance, scalability, compatibil‐ ity, and maintainability

    Reliability means making systems work correctly, even when faults occur.

    Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively.

    Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstrac‐ tions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

Appendix

Follow ups