On lists, cache, algorithms, and microarchitecture

I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest were also the first ones I implemented. Once I considered the potential for container-specific optimisations by specialising generic algorithms like for_each, transform, and accumulate I’ve realised that there is more to it than just the fact that “lists are slow” and a closer look at them can be quite an educational experience. After all, there’s a lot of code that is not a vectorised loop working with a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.