Avoiding instruction cache misses

Modern processors are quite complex, with many parts having the potential to become a bottleneck. It is relatively easy to reason about the performance of short pieces of code, especially if memory effects are kept to a minimum. Both static analysis tools like LLVM MCA and microbenchmarks can provide a lot of information in such cases. However, the behaviour of the whole program is not just the sum of those small parts. As the code becomes larger and more complex other effects start appearing. One of such potential problems are excessive instruction cache misses.

On lists, cache, algorithms, and microarchitecture

I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest were also the first ones I implemented. Once I considered the potential for container-specific optimisations by specialising generic algorithms like for_each, transform, and accumulate I’ve realised that there is more to it than just the fact that “lists are slow” and a closer look at them can be quite an educational experience. After all, there’s a lot of code that is not a vectorised loop working with a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.