On GPUs, ranges, latency, and superoptimisers

Not long ago, I put together a small library that provides an interface similar to the range-v3 library and uses CUDA to execute all operations on GPUs. I’ve implemented the basics: map, filter, reduce operations, just enough to serve as a proof of concept. In essence, it is quite similar to Thrust; just the interface is different, based on ranges instead of iterators. As one would expect, the initial implementation wasn’t heavily optimised, but the time has come to improve the performance. This was an excellent opportunity to look more closely at the architecture of modern GPUs – Turing in this case.

Avoiding instruction cache misses

Modern processors are quite complex, with many parts having the potential to become a bottleneck. It is relatively easy to reason about the performance of short pieces of code, especially if memory effects are kept to a minimum. Both static analysis tools like LLVM MCA and microbenchmarks can provide a lot of information in such cases. However, the behaviour of the whole program is not just the sum of those small parts. As the code becomes larger and more complex other effects start appearing. One of such potential problems are excessive instruction cache misses.

On lists, cache, algorithms, and microarchitecture

I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest were also the first ones I implemented. Once I considered the potential for container-specific optimisations by specialising generic algorithms like for_each, transform, and accumulate I’ve realised that there is more to it than just the fact that “lists are slow” and a closer look at them can be quite an educational experience. After all, there’s a lot of code that is not a vectorised loop working with a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.