The OS is a resource broker. The interesting parts — for someone shipping services, not writing kernels — are the abstractions it leaks and the places where its scheduling decisions become your latency.
## The scheduler is your roommate
The kernel scheduler decides which thread runs on which core when. You don’t pick — but you can observe and bias.
- Run queue depth (`vmstat`’s `r` column, `sar -q`’s `runq-sz`) > number of cores → you’re CPU-bound and waiting on the scheduler. Adding threads makes it worse.
- Context switches are not free: ~1–10µs for the switch itself, much more for cache eviction. A service doing 100k context switches/sec is burning real CPU on overhead.
- Voluntary vs involuntary switches matter. Voluntary = you blocked on I/O (fine). Involuntary = preempted (you have too many threads, or noisy neighbors).
- Linux’s CFS is fair, not real-time. If you need bounded latency, pin threads (`taskset`, `sched_setaffinity`) and isolate cores (`isolcpus`); see the sketch after this list.
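A minimal sketch of both knobs in C: pin the calling thread with `sched_setaffinity`, then read the voluntary/involuntary switch counters from `getrusage`. The core number is an arbitrary choice for illustration, and error handling is pared down.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // Pin the calling thread (pid 0 = self) to core 2 (arbitrary choice).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        perror("sched_setaffinity");

    // ... latency-sensitive work would run here ...

    // Voluntary vs involuntary switches accumulated by this process:
    // a high involuntary count means the scheduler keeps preempting you.
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary: %ld involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}
```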
## Memory, the four-level lie
User-space sees a flat virtual address space. Reality:
- Virtual → physical via page tables. TLB miss = ~100 cycles. Working set bigger than TLB coverage → death by page-table walks.
- Page faults: minor (page in cache, just map it, ~µs) vs major (read from disk, ~ms). Major faults in a hot path are a smell.
- Swap: if you’re swapping, you’ve already lost. Set `vm.swappiness=1` or disable swap on servers; OOM-kill is faster than thrashing.
- Huge pages: 2MB pages instead of 4KB reduce TLB pressure for large heaps. Worth turning on for databases and big JVMs; transparent huge pages can cause latency spikes, so the usual move is to disable THP and use explicit hugepages (sketch after this list).
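To make the fault arithmetic concrete, here is a small C sketch that touches 64 MB twice: once with normal 4KB pages, once with explicit 2MB huge pages via `MAP_HUGETLB`. It assumes 2MB is the default huge page size and that pages were reserved beforehand; expect roughly 16k minor faults for the first mapping and ~32 for the second.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t len = 64UL << 20;                      // 64 MB

    // Normal 4KB pages: one minor fault per page touched.
    long before = minor_faults();
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(p, 1, len);
    printf("4KB pages: %ld minor faults\n", minor_faults() - before);
    munmap(p, len);

    // Explicit 2MB huge pages: needs pages reserved first, e.g.
    //   echo 64 > /proc/sys/vm/nr_hugepages
    before = minor_faults();
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    memset(p, 1, len);
    printf("2MB pages: %ld minor faults\n", minor_faults() - before);
    munmap(p, len);
    return 0;
}
```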
## I/O: blocking, non-blocking, async
- Blocking syscall: thread parks until ready. Simple, costs you a thread per inflight op.
- Non-blocking + readiness (`select`/`poll`/`epoll`/`kqueue`): one thread tracks many fds and is woken when any is ready. The basis of Node, Nginx, Go’s netpoller, and Python’s `asyncio`. See the epoll sketch after this list.
- Async submission (`io_uring`, Windows IOCP): you submit operations, the kernel completes them into a ring, you reap the results. Lowest overhead; the future for high-throughput servers. See the io_uring sketch below.
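The readiness model in ~40 lines of C: one epoll instance watching a listening socket plus every accepted connection, echoing bytes back. The port is arbitrary and most error handling is omitted; this is a sketch of the loop’s shape, not a production server.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9000) };  // arbitrary port
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, SOMAXCONN);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(ep, ready, 64, -1);  // park until something is ready
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            if (fd == lfd) {                    // listening socket: new connection
                int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {                            // connection readable: echo
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) close(fd);          // peer gone; epoll drops it too
                else        write(fd, buf, r);  // sketch: no partial-write handling
            }
        }
    }
}
```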
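And the completion model, sketched with liburing (link with `-luring`; a recent kernel is assumed, and `data.bin` is a hypothetical file): describe the read, submit it in one syscall, reap the result from the completion ring.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                 // 8-entry rings

    int fd = open("data.bin", O_RDONLY);              // hypothetical file
    char buf[4096];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);  // describe the op...
    io_uring_submit(&ring);                           // ...one syscall submits it

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                   // reap from the completion ring
    printf("read %d bytes\n", cqe->res);              // res < 0 would be -errno
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```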
Rule of thumb: C10K stopped being interesting once C100K became routine. Connection density is now bound by memory per connection (TLS state, buffers) more than by scheduling.
## Files: what you think `write()` does vs what it does
`write()` returns when the data has been copied into the kernel page cache, not when it’s on disk. Until you `fsync`, a power loss loses it. Databases that promise durability `fsync` in the commit path; that `fsync` latency (1–10ms on SSD, 10–100ms on spinning rust) is your write latency floor.
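The whole durability story fits in a few lines of C (the filename is hypothetical; the comments on the two calls are the point):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "commit 42\n";
    write(fd, rec, strlen(rec)); // returns once the bytes are in the page cache
    fsync(fd);                   // returns once the device says they're durable;
                                 // this call is the commit-latency floor
    close(fd);
    return 0;
}
```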
Filesystem mount options that bite: `noatime` (turn it on; atime updates are pointless writes) and `data=writeback` vs `data=ordered` (ext4 journaling modes that trade durability for throughput).
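A minimal sketch of how these land in `/etc/fstab` (device and mountpoint are hypothetical; `data=ordered` is already the ext4 default):

```
# /etc/fstab sketch: device and mountpoint are hypothetical
/dev/nvme0n1p1  /data  ext4  noatime,data=ordered  0  2
```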
## CPU caches, the silent killer
| Level | Latency | Size |
|---|---|---|
| L1 | ~1ns | 32–64 KB / core |
| L2 | ~3ns | 256 KB–1 MB / core |
| L3 | ~10ns | 4–32 MB shared |
| RAM | ~80ns | GB |
| NVMe | ~100µs | TB |
| Network | ~100µs LAN, ~50ms WAN | — |
Sequential access is ~10× faster than random at every level. A linked list with 1M nodes is a worst case at every layer. This is why arrays-of-structs beat structs-of-pointers in hot loops.
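A sketch that makes the claim measurable: sum the same million longs as a flat array and as a linked list threaded through memory in shuffled order. The exact ratio varies by machine; expect the array to win by roughly an order of magnitude.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

struct node { struct node *next; long v; };

static double secs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long *arr   = malloc(N * sizeof *arr);
    long *order = malloc(N * sizeof *order);
    struct node *pool = malloc(N * sizeof *pool);

    for (long i = 0; i < N; i++) { arr[i] = i; order[i] = i; }
    for (long i = N - 1; i > 0; i--) {             // Fisher-Yates shuffle, so the
        long j = rand() % (i + 1);                 // list hops randomly through RAM
        long t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (long i = 0; i < N; i++) {                 // thread the list in shuffled order
        pool[order[i]].v = i;
        pool[order[i]].next = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
    }

    double t0 = secs();
    long s1 = 0;
    for (long i = 0; i < N; i++) s1 += arr[i];     // sequential: prefetcher's dream
    double t1 = secs();
    long s2 = 0;
    for (struct node *p = &pool[order[0]]; p; p = p->next)
        s2 += p->v;                                // pointer chase: a miss per node
    double t2 = secs();

    printf("array %.2f ms, list %.2f ms (sums %ld %ld)\n",
           (t1 - t0) * 1e3, (t2 - t1) * 1e3, s1, s2);
    return 0;
}
```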