A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
- Databases: To store data
- Caches: To remember result of expensive operation, to speed up reads
- Search indexes: Search data by keyword or filter on various fields
- Stream processing: Send a message to another process, to be handled async
- Batch processing: Periodically compute large amount of acc. data
Quote
Increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs.
Three important concerns for most software systems:
- Reliability
- Work correctly in the fact of adversity (hardware or software faults, and even human error)
- Scalability
- As system grows (volume, traffic, or complexity), there should be ways to dealing with that growth
- Maintainability
- Maintaining current behaviour and adapting the system to new use cases
I. Reliability
Quote
Continuing to work correctly, even when things go wrong
Things that can go wrong are called *faults
- It only makes sense to talk about tolerating certain type of faults
Types of issues: Hardware, Software errors, Human errors
- Well-designed abstractions, APIs, and admin interfaces make it easy to do âthe right thingâ and discourages âthe wrong thingâ
II. Scalability
One common reason for degradation is increased load: System has grown from 10k concurrent users to 100k concurrent users
Quote
a systemâs ability to cope with increased load
There are a few dimensions when discussing scalability of a system:
Dimensions of scalability
Describing load â called load parameters
RPS, Ratio of reads to writes, Simultaneously active users.
What is fan-out?
From EE, describes # of logic gate inputs attached to another gateâs output.
Number of requests to other services that we need to make in order to serve on incoming request.
Describing performance â what happens when load increases
Notes on latency and response time
They are not the same.
Response time â what the client sees besides the actual time to process the request (Round trip time, RTT), it includes network delays and queuing delays
Latency â duration that a request is waiting to be handled - during which it is awaiting services
Knobs to control: Load parameter, system resources (# of CPUs, Memory, storage)
Two ways to think about this:
- When you increase load parameter and keep the system resources unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Median a good metric if you want to know how long users typically have to wait
- The median is also known as the 50th percentile, and sometimes abbreviated as p50
- High percentiles of response times, also known as tail latencies, are important because they directly affect usersâ experience of the service
Understanding performance â Median
If the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more
Percentiles in Practice
Tail latency amplification: When several backend calls are needed to serve a request to user, it takes just a single slow backend request to slow down the entire end-user request.
While adding response time percentiles to the monitoring dashboards:
- Keep a rolling window of response times of requests in the last 10 minutes
- Calculate the median and various percentiles over the values in that window and plot those metrics on a graph
- This can be expensive
- Algorithms to calculate a good approximation of percentiles
- Forward decay
- t-digest
- HdrHistogram
Averaging percentiles is mathematically meaningless
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaninglessâthe right way of aggregating response time data is to add the histograms
III. Maintainability
Three design principles for software systems:
- Operability - Easy for the ops team to keep the system running
- Simplicity - Easy for any new engineers to understand the system
- Evolvability - Easy to make changes to the system
Unlike reliability and scalability, there are no easy solutions to achieve maintainability
Operability
Quote
âgood operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operationsâ
- Monitoring the health of the system and quickly restoring service if it goes into a bad state
- Tracking down the cause of problems, such as system failures or degraded performance
- Keeping software and platforms up to date, including security patches
- Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
Simplicity
Simplicity should be a key goal for the systems we build.
Symptoms of complexity:
- Explosion of the state space
- Tight coupling of modules
- Tangled dependencies
- Inconsistent naming and terminology
- Hacks aimed at solving performance problems
- Special-casing to work around issues elsewhere
- and the list goes onâŚ
Evolvability
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions:
simple and easy-to understand systems are usually easier to modify than complex ones
Quick summary
Functional requirements: what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways
Non functional requirements: security, reliability, compliance, scalability, compatibilâ ity, and maintainability
Reliability means making systems work correctly, even when faults occur.
Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively.
Maintainability has many facets, but in essence itâs about making life better for the engineering and operations teams who need to work with the system. Good abstracâ tions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the systemâs health, and having effective ways of managing it.
Appendix
Follow ups
- What is âNetflix Chaos Monkeyâ?
- Why is using histogram the right way of aggregating response time data?