The mental model I keep returning to: a process is an isolated address space with at least one thread; a thread is a scheduling unit inside that address space. Everything else — GIL, mutexes, fork-vs-spawn quirks — falls out of that.
Stack vs heap, in one paragraph
Stack frames are bump-allocated per call and freed on return — fast, but lifetime is bounded by the call. The heap holds anything whose lifetime outlives the function that created it; allocation is more expensive (free lists, fragmentation, locking) and freeing requires GC, refcounting, or explicit free. In Python, all user objects live on the heap; the “stack” only holds frame objects and references. So `x = 10; y = x` doesn’t copy 10 — it bumps the refcount on the cached small-int.
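A quick REPL check of that last claim (exact refcounts vary with interpreter state, and 3.12+ makes small ints immortal):

```python
import sys

x = 10
y = x            # no new object: y is another reference to the cached int 10
print(x is y)    # True: CPython interns small ints (roughly -5..256)

# The assignment bumped a refcount rather than copying anything. On 3.12+,
# small ints are immortal, so this may print a large sentinel rather than
# a small count.
print(sys.getrefcount(x))
```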
Process vs thread
| | Process | Thread |
|---|---|---|
| Address space | Isolated | Shared with siblings |
| Crash blast radius | Itself | The whole process |
| IPC cost | High (sockets, shm, pipes) | Just shared memory |
| Spin-up | ~ms | ~µs |
| Use when | Fault isolation, parallel CPU work in Python | Concurrent I/O, shared state |
Threads share heap and code; each gets its own stack and registers. The shared heap is what makes them cheap and what makes them dangerous — every read of mutable state is a race unless something says otherwise.
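A minimal sketch to make the table concrete (names are illustrative): a thread mutates the parent’s heap, while a child process only ever mutates its own copy, whichever start method you get.

```python
import multiprocessing
import threading

counter = {"n": 0}

def bump() -> None:
    counter["n"] += 1

if __name__ == "__main__":        # required on spawn platforms (macOS, Windows)
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print("after thread:", counter["n"])    # 1: the thread mutated our heap

    p = multiprocessing.Process(target=bump)
    p.start(); p.join()
    print("after process:", counter["n"])   # still 1: the child mutated its own copy
```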
Synchronization, in order of preference
- Don’t share mutable state. Message passing (channels, queues) sidesteps most bugs. Default here; see the queue sketch after this list.
- Immutable data. No coordination needed by definition.
- Atomic primitives (`std::atomic`, `AtomicInteger`). Lock-free, but reorderings will surprise you — read the memory model.
- Mutex. A mutex protects an invariant, not a variable. Hold it for the shortest critical section; never call user code inside one.
- Semaphore / RWLock. Specialized — semaphore for counting (connection pools, rate limits), RWLock when reads vastly outnumber writes and the read critical section is non-trivial. RWLock often loses to plain mutex due to writer starvation and bookkeeping cost.
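For the first preference, a minimal message-passing sketch with `queue.Queue` (names illustrative): the worker thread owns the mutable total, everything else goes through the queues, and nothing needs a lock.

```python
import queue
import threading

tasks = queue.Queue()    # work in
results = queue.Queue()  # answers out

def worker() -> None:
    # Only this thread touches `total`, so no coordination is needed.
    total = 0
    while True:
        item = tasks.get()
        if item is None:         # sentinel: shut down and report
            results.put(total)
            return
        total += item

t = threading.Thread(target=worker)
t.start()
for n in range(100):
    tasks.put(n)
tasks.put(None)
t.join()
print(results.get())             # 4950
```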
Deadlock checklist: take locks in a global order, never hold a lock across an await, never call back into user code inside a critical section.
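For the global-order rule, one mechanical option is to sort the locks by a stable key before acquiring them. A sketch, using `id` as the key (illustrative, fine for long-lived locks):

```python
import threading
from contextlib import ExitStack

def acquire_in_order(*locks):
    # Acquire in one global order (sorted by id) so two call sites can never
    # grab the same pair of locks in opposite orders.
    stack = ExitStack()
    with stack:
        for lock in sorted(locks, key=id):
            stack.enter_context(lock)
        return stack.pop_all()   # transfer ownership to the caller's `with`

accounts = {"a": threading.Lock(), "b": threading.Lock()}

def transfer(src: str, dst: str) -> None:
    # transfer("a", "b") and transfer("b", "a") acquire in the same order.
    with acquire_in_order(accounts[src], accounts[dst]):
        ...  # critical section whose invariant spans both accounts
```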
The Python GIL — what it actually does
The CPython GIL serializes bytecode execution: only one thread runs Python at a time. Implications:
- Threads are still useful for I/O-bound work — the GIL is released around blocking syscalls.
- Threads are useless for CPU-bound Python. Use `multiprocessing`, subinterpreters (3.12+), or rewrite the hot path in C/Rust/numpy where the GIL is dropped.
- Threads are not safe despite the GIL. The GIL guarantees bytecode atomicity, not statement atomicity. `x += 1` is a load, an add, and a store; you can lose updates (sketch below).
- The free-threaded build (3.13+, PEP 703) removes the GIL but introduces real data races where Python code previously got away without locking.
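A sketch of the lost-update point. Whether the unsafe run actually comes up short depends on the CPython version and timing, which is exactly why you can’t rely on it; the locked run is always correct.

```python
import threading

N_THREADS, N_INCREMENTS = 4, 100_000
counter = 0
lock = threading.Lock()

def unsafe() -> None:
    global counter
    for _ in range(N_INCREMENTS):
        counter += 1              # load, add, store; the GIL does not make this atomic

def safe() -> None:
    global counter
    for _ in range(N_INCREMENTS):
        with lock:                # the lock protects the invariant, not the variable
            counter += 1

def run(target) -> int:
    global counter
    counter = 0
    threads = [threading.Thread(target=target) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run(unsafe))   # nothing guarantees 400000 here
print(run(safe))     # always 400000
```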
Endianness — when it matters
Almost never inside one machine. It matters at exactly two boundaries: wire protocols (network byte order is big-endian by convention) and on-disk formats if you ship files between architectures. Use `struct.pack`/`htonl` etc. — never assume.
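A quick `struct` sketch of the wire-protocol case:

```python
import struct

# A 32-bit length prefix in network byte order (big-endian), then read back.
frame = struct.pack("!I", 1024) + b"payload"
(length,) = struct.unpack("!I", frame[:4])
print(frame[:4].hex(), length)   # 00000400 1024

# "<" / ">" pin little-/big-endian explicitly; "=" uses the host's order (avoid in wire formats).
print(struct.pack("<I", 1024).hex(), struct.pack(">I", 1024).hex())   # 00040000 00000400
```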