The OS is a resource broker. The interesting parts — for someone shipping services, not writing kernels — are the abstractions it leaks and the places where its scheduling decisions become your latency.
## The scheduler is your roommate
The kernel scheduler decides which thread runs on which core when. You don’t pick — but you can observe and bias.
- Run queue depth (`vmstat`’s `r` column, `sar -q`’s `runq-sz`) > number of cores → you’re CPU-bound and waiting on the scheduler. Adding threads makes it worse.
- Context switches are not free: ~1–10µs for the switch itself, much more for cache eviction. A service doing 100k context switches/sec is burning real CPU on overhead.
- Voluntary vs involuntary switches matter. Voluntary = you blocked on I/O (fine). Involuntary = preempted (you have too many threads, or noisy neighbors).
- Linux’s CFS is fair, not real-time. If you need bounded latency, pin threads (`taskset`, `sched_setaffinity`) and isolate cores (`isolcpus`); see the sketch after this list.
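A minimal sketch of both knobs in C: pin the calling thread with `sched_setaffinity`, then read the voluntary/involuntary switch counters from `getrusage`. The core number is an arbitrary choice for illustration, and error handling is pared down.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // Pin the calling thread (pid 0 = self) to core 2 (arbitrary choice).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0)
        perror("sched_setaffinity");

    // ... latency-sensitive work would run here ...

    // Voluntary vs involuntary switches accumulated by this process:
    // a high involuntary count means the scheduler keeps preempting you.
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary: %ld involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}
```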
## Memory, the four-level lie
User-space sees a flat virtual address space. Reality:
- Virtual → physical via page tables. TLB miss = ~100 cycles. Working set bigger than TLB coverage → death by page-table walks.
- Page faults: minor (page in cache, just map it, ~µs) vs major (read from disk, ~ms). Major faults in a hot path are a smell.
- Swap: if you’re swapping, you’ve already lost. Set `vm.swappiness=1` or disable swap on servers; OOM-kill is faster than thrashing.
- Huge pages: 2MB pages instead of 4KB reduce TLB pressure for large heaps. Worth turning on for databases and big JVMs; transparent huge pages can cause latency spikes, so the usual move is to disable THP and use explicit hugepages (sketch after this list).
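To make the fault arithmetic concrete, here is a small C sketch that touches 64 MB twice: once with normal 4KB pages, once with explicit 2MB huge pages via `MAP_HUGETLB`. It assumes 2MB is the default huge page size and that pages were reserved beforehand; expect roughly 16k minor faults for the first mapping and ~32 for the second.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t len = 64UL << 20;                      // 64 MB

    // Normal 4KB pages: one minor fault per page touched.
    long before = minor_faults();
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(p, 1, len);
    printf("4KB pages: %ld minor faults\n", minor_faults() - before);
    munmap(p, len);

    // Explicit 2MB huge pages: needs pages reserved first, e.g.
    //   echo 64 > /proc/sys/vm/nr_hugepages
    before = minor_faults();
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    memset(p, 1, len);
    printf("2MB pages: %ld minor faults\n", minor_faults() - before);
    munmap(p, len);
    return 0;
}
```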
## I/O: blocking, non-blocking, async
- Blocking syscall: thread parks until ready. Simple, costs you a thread per inflight op.
- Non-blocking + readiness (`select`/`poll`/`epoll`/`kqueue`): one thread tracks many fds and is woken when any is ready. The basis of Node, Nginx, Go’s netpoller, and Python’s `asyncio`. See the epoll sketch after this list.
- Async submission (`io_uring`, Windows IOCP): you submit operations, the kernel completes them into a ring, you reap the results. Lowest overhead; the future for high-throughput servers. See the io_uring sketch below.
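The readiness model in ~40 lines of C: one epoll instance watching a listening socket plus every accepted connection, echoing bytes back. The port is arbitrary and most error handling is omitted; this is a sketch of the loop’s shape, not a production server.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9000) };  // arbitrary port
    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, SOMAXCONN);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(ep, ready, 64, -1);  // park until something is ready
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            if (fd == lfd) {                    // listening socket: new connection
                int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {                            // connection readable: echo
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) close(fd);          // peer gone; epoll drops it too
                else        write(fd, buf, r);  // sketch: no partial-write handling
            }
        }
    }
}
```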
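And the completion model, sketched with liburing (link with `-luring`; a recent kernel is assumed, and `data.bin` is a hypothetical file): describe the read, submit it in one syscall, reap the result from the completion ring.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                 // 8-entry rings

    int fd = open("data.bin", O_RDONLY);              // hypothetical file
    char buf[4096];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);  // describe the op...
    io_uring_submit(&ring);                           // ...one syscall submits it

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                   // reap from the completion ring
    printf("read %d bytes\n", cqe->res);              // res < 0 would be -errno
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```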
Rule of thumb: C10K stopped being interesting once C100K became routine. Connection density is now bound by memory per connection (TLS state, buffers) more than by scheduling.
## Files: what you think `write()` does vs what it does
`write()` returns when the data has been copied into the kernel page cache, not when it’s on disk. Until you `fsync`, a power loss loses it. Databases that promise durability `fsync` in the commit path; that `fsync` latency (1–10ms on SSD, 10–100ms on spinning rust) is your write latency floor.
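The whole durability story fits in a few lines of C (the filename is hypothetical; the comments on the two calls are the point):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "commit 42\n";
    write(fd, rec, strlen(rec)); // returns once the bytes are in the page cache
    fsync(fd);                   // returns once the device says they're durable;
                                 // this call is the commit-latency floor
    close(fd);
    return 0;
}
```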
Filesystem mount options that bite: `noatime` (turn it on; atime updates are pointless writes) and `data=writeback` vs `data=ordered` (ext4 journaling modes that trade durability for throughput).
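A minimal sketch of how these land in `/etc/fstab` (device and mountpoint are hypothetical; `data=ordered` is already the ext4 default):

```
# /etc/fstab sketch: device and mountpoint are hypothetical
/dev/nvme0n1p1  /data  ext4  noatime,data=ordered  0  2
```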
## CPU caches, the silent killer
| Level | Latency | Size |
|---|---|---|
| L1 | ~1ns | 32–64 KB / core |
| L2 | ~3ns | 256 KB–1 MB / core |
| L3 | ~10ns | 4–32 MB shared |
| RAM | ~80ns | GB |
| NVMe | ~100µs | TB |
| Network | ~100µs LAN, ~50ms WAN | — |
Sequential access is ~10× faster than random at every level. A linked list with 1M nodes is a worst case at every layer. This is why arrays-of-structs beat structs-of-pointers in hot loops.
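A sketch that makes the claim measurable: sum the same million longs as a flat array and as a linked list threaded through memory in shuffled order. The exact ratio varies by machine; expect the array to win by roughly an order of magnitude.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

struct node { struct node *next; long v; };

static double secs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long *arr   = malloc(N * sizeof *arr);
    long *order = malloc(N * sizeof *order);
    struct node *pool = malloc(N * sizeof *pool);

    for (long i = 0; i < N; i++) { arr[i] = i; order[i] = i; }
    for (long i = N - 1; i > 0; i--) {             // Fisher-Yates shuffle, so the
        long j = rand() % (i + 1);                 // list hops randomly through RAM
        long t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (long i = 0; i < N; i++) {                 // thread the list in shuffled order
        pool[order[i]].v = i;
        pool[order[i]].next = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
    }

    double t0 = secs();
    long s1 = 0;
    for (long i = 0; i < N; i++) s1 += arr[i];     // sequential: prefetcher's dream
    double t1 = secs();
    long s2 = 0;
    for (struct node *p = &pool[order[0]]; p; p = p->next)
        s2 += p->v;                                // pointer chase: a miss per node
    double t2 = secs();

    printf("array %.2f ms, list %.2f ms (sums %ld %ld)\n",
           (t1 - t0) * 1e3, (t2 - t1) * 1e3, s1, s2);
    return 0;
}
```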