The mental model I keep returning to: a process is an isolated address space with at least one thread; a thread is a scheduling unit inside that address space. Everything else — GIL, mutexes, fork-vs-spawn quirks — falls out of that.

Stack vs heap, in one paragraph

Stack frames are bump-allocated per call and freed on return — fast, but lifetime is bounded by the call. The heap holds anything whose lifetime outlives the function that created it; allocation is more expensive (free list, fragmentation, locking) and freeing requires GC, refcounting, or explicit free. In Python, all user objects live on the heap; the “stack” only holds frame objects and references. So x = 10; y = x doesn’t copy 10 — it bumps the refcount on the cached small-int.
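You can watch that aliasing happen. A minimal sketch in CPython (the small-int cache for roughly -5..256 is a CPython implementation detail, not a language guarantee):

```python
x = 10
y = x          # no copy of 10: y is a second reference to the same object
assert x is y  # both names alias one cached small-int object

# Rebinding y creates a new reference; it never mutates the shared object.
y = y + 1
assert x == 10 and y == 11
```

The `is` check succeeds because assignment in Python binds names to objects; the integer itself is never duplicated.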

Process vs thread

|                    | Process                                       | Thread                       |
|--------------------|-----------------------------------------------|------------------------------|
| Address space      | Isolated                                      | Shared with siblings         |
| Crash blast radius | Itself                                        | The whole process            |
| IPC cost           | High (sockets, shm, pipes)                    | Just shared memory           |
| Spin-up            | ~ms                                           | ~µs                          |
| Use when           | Fault isolation, parallel CPU work in Python  | Concurrent I/O, shared state |

Threads share heap and code; each gets its own stack and registers. The shared heap is what makes them cheap and what makes them dangerous — every read of mutable state is a race unless something says otherwise.

Synchronization, in order of preference

  1. Don’t share mutable state. Message passing (channels, queues) sidesteps most bugs. Default here.
  2. Immutable data. No coordination needed by definition.
  3. Atomic primitives (std::atomic, AtomicInteger). Lock-free, but reorderings will surprise you — read the memory model.
  4. Mutex. A mutex protects an invariant, not a variable. Hold for the shortest critical section; never call user code inside one.
  5. Semaphore / RWLock. Specialized — semaphore for counting (connection pools, rate limits), RWLock when reads vastly outnumber writes and the read critical section is non-trivial. RWLock often loses to plain mutex due to writer starvation and bookkeeping cost.
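Preference #1 in Python terms: a sketch of message passing with `queue.Queue`, where the worker owns no shared mutable state and a `None` sentinel (an arbitrary convention here, not a library feature) signals shutdown:

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # The worker touches only messages it receives; nothing is shared by reference.
    while True:
        item = inbox.get()
        if item is None:       # sentinel: shut down cleanly
            break
        outbox.put(item * item)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)                # ask the worker to exit
t.join()

results = sorted(outbox.get() for _ in range(5))
assert results == [0, 1, 4, 9, 16]
```

`queue.Queue` does use a lock internally, but that lock is somebody else's problem — your code never sees a race.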

Deadlock checklist: take locks in a global order, never hold a lock across an await, never call back into user code inside a critical section.
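The "global order" item can be mechanized. A sketch, using object identity as the (assumed, arbitrary) global ordering key — any stable total order works:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
counter = 0

def acquire_in_order(*locks):
    # Sorting by a stable global key (here: id) means no two threads ever wait
    # on each other in a cycle, so the classic a-then-b vs b-then-a deadlock
    # cannot occur no matter the textual acquisition order.
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

def release_all(locks):
    for lock in reversed(locks):
        lock.release()

def transfer():
    global counter
    held = acquire_in_order(lock_b, lock_a)   # caller's order doesn't matter
    try:
        counter += 1                          # critical section spans both invariants
    finally:
        release_all(held)

threads = [threading.Thread(target=transfer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 4
```

The `try/finally` also covers the checklist's spirit: a lock you might fail to release is a deadlock waiting to happen.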

The Python GIL — what it actually does

The CPython GIL serializes bytecode execution: only one thread runs Python at a time. Implications:

  • Threads are still useful for I/O-bound work — the GIL is released around blocking syscalls.
  • Threads are useless for CPU-bound Python. Use multiprocessing, subinterpreters (3.12+), or rewrite the hot path in C/Rust/numpy where the GIL is dropped.
  • Threads are not safe despite the GIL. The GIL guarantees bytecode atomicity, not statement atomicity. x += 1 is three bytecodes; you can lose updates.
  • The free-threaded build (3.13+, PEP 703) removes the GIL but introduces real data races where Python code previously got away without locking.
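Both halves of the bytecode-atomicity point can be checked directly: `dis` shows the separate load/add/store steps inside `x += 1`, and wrapping the read-modify-write in a `Lock` restores atomicity. A sketch (the `Counter` class is illustrative, not a stdlib type):

```python
import dis
import threading

x = 0

def bump():
    global x
    x += 1     # load x, add 1, store x — a thread can be preempted in between

ops = [ins.opname for ins in dis.get_instructions(bump)]
assert "LOAD_GLOBAL" in ops and "STORE_GLOBAL" in ops  # distinct load and store steps

# The fix: make the whole read-modify-write one critical section.
class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def incr(self):
        with self._lock:
            self.value += 1

c = Counter()
threads = [
    threading.Thread(target=lambda: [c.incr() for _ in range(10_000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert c.value == 40_000
```

Without the lock, the final count can come up short on some runs — which is exactly why "it passed my test" is not evidence of thread safety.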

Endianness — when it matters

Almost never inside one machine. It matters at exactly two boundaries: wire protocols (network byte order is big-endian by convention) and on-disk formats if you ship files between architectures. Use struct.pack/htonl etc. — never assume.
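The struct approach, sketched: declare the byte order explicitly at the boundary and the host's native order becomes irrelevant.

```python
import struct

value = 0x01020304
# ">" is big-endian (network byte order), "<" is little-endian; "I" is a 32-bit unsigned int.
assert struct.pack(">I", value) == b"\x01\x02\x03\x04"
assert struct.pack("<I", value) == b"\x04\x03\x02\x01"

# Round-trip: the reader must use the same declared order as the writer.
(decoded,) = struct.unpack(">I", b"\x01\x02\x03\x04")
assert decoded == value
```

The failure mode to avoid is `struct.pack("I", ...)` with no prefix — that silently uses native order and native padding, which is precisely the assumption that breaks at the wire or on-disk boundary.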