On lists, cache, algorithms, and microarchitecture


I’ve been writing a small library of intrusive data structures recently. Singly-linked and doubly-linked lists, being the simplest, were also the first ones I implemented. Once I considered the potential for container-specific optimisations by specialising generic algorithms like for_each, transform, and accumulate, I realised that there is more to it than just the fact that “lists are slow”, and that a closer look at them can be quite an educational experience. After all, there’s a lot of code out there that is not a vectorised loop working on a structure of arrays that fits entirely in the cache and achieves 4 instructions per cycle.

No matter what we do, lists will remain a hardware-unfriendly data structure. However, they do have their uses, and they share some of their properties with other node-based data structures such as trees. Lists, being simple, can provide an interesting insight into how the performance of the high-level algorithms we write is affected by the processor microarchitecture, what we can do to improve it, and what the cost of those optimisations can be.

Doubly-linked list

In this article, I’m going to use an intrusive doubly-linked list. We are going to focus on iterating over that list, so insertion, erasure, and other operations that modify it are of little importance. Each element of the list needs a hook that contains pointers to the previous and the next element.

class list_hook {
    list_hook* next_;
    list_hook* prev_;

private:
    list_hook(list_hook* prev, list_hook* next) noexcept
        : next_(next), prev_(prev) {}

    template<typename T, list_hook T::*> friend class list;

public:
    list_hook() = default;
    list_hook(list_hook const&) = delete;
    list_hook(list_hook&&) = delete;
};

Even though the public interface more or less follows std::list and the list is presented as a linear one, internally it is circular. This simplifies some parts of the implementation since list_hook::next_ and list_hook::prev_ are never null pointers. Inside the list object there is a root hook which marks the end of the list.

template<typename T, list_hook T::*Hook>
class list {
    list_hook root_ = {&root_, &root_};

public:
    class iterator {
        list_hook* current_;

    public:
        iterator& operator++() noexcept {
            current_ = current_->next_;
            return *this;
        }

        reference operator*() const noexcept {
            return container_of<value_type, list_hook>(Hook, *current_);
        }

        /* ... */
    };

    iterator begin() noexcept { return iterator(root_.next_); }
    iterator end() noexcept { return iterator(&root_); }
    /* ... */
};

Nothing unusual here. Most of the observations we are going to make will apply to all doubly-linked list implementations.
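
One detail worth spelling out is the container_of helper used by operator*: it recovers a reference to the containing object from a reference to its hook. A minimal sketch of one possible implementation, matching the signature used above and relying on the usual (formally non-standard) null-pointer offset trick, could look like this:

#include <cstdint>

template<typename T, typename Member>
T& container_of(Member T::*member, Member& m) noexcept {
    // Offset of the hook member within T, computed from a null object pointer.
    auto offset = reinterpret_cast<std::uintptr_t>(
        &(static_cast<T*>(nullptr)->*member));
    return *reinterpret_cast<T*>(
        reinterpret_cast<std::uintptr_t>(&m) - offset);
}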

Sequential reads

The simplest way to iterate over any range in C++ is to use a range-based for loop. With the iterators implemented as described earlier, GCC1 compiles the following code:

struct value {
    intrusive::list_hook hook_;
    int value_ = 0;
};

using list_type = intrusive::list<value, &value::hook_>;

void do_something(value const&);

void for_each(list_type const& list) {
    for (auto& element : list) {
        do_something(element);
    }
}

to:

.L3:
        mov     rdi, rbx
        call    do_something(value const&)
        mov     rbx, QWORD PTR [rbx]
        cmp     rbp, rbx
        jne     .L3

Clang does the same thing. There’s also no apparent difference between a range-based for loop and std::for_each. The latter, however, has the potential to be specialised with container-specific optimisations.

Let’s take a closer look at a more concrete example. The following code computes the sum of all value::value_ fields in the list.

using iterator = list_type::const_iterator;

int sum_list(iterator first, iterator last) {
    return std::accumulate(first, last, 0,
        [] (int a, value const& b) { return a + b.value_; });
}

The compiler doesn’t disappoint:

.L3:
        add     eax, DWORD PTR [rdi+16]
        mov     rdi, QWORD PTR [rdi]
        cmp     rdi, rsi
        jne     .L3

The generated assembly looks okay, so it is time to see how well such code performs. Before we run the test, let’s first find out what LLVM MCA can tell us about it.

Iterations:        10000
Instructions:      40000
Total Cycles:      50005
Total uOps:        50000

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad

[1]    [2]    [3]    [4]    Instructions:
 2      6     0.50    *     add   eax, dword ptr [rdi + 16]
 1      5     0.50    *     mov   rdi, qword ptr [rdi]
 1      1     0.25          cmp   rdi, rsi
 1      1     0.50          jne   .L3

Each iteration consists of 4 instructions which translate to 5 µops. The source of the additional µop is add eax, DWORD PTR [rdi+16], which involves both a memory load and an ALU operation. According to MCA, we can expect, on average, 5 cycles per iteration.

The timeline shows in more detail how instruction-level parallelism works in this case.

[0,0]     DeeeeeeER .    .    .   add eax, dword ptr [rdi + 16]
[0,1]     DeeeeeE-R .    .    .   mov rdi, qword ptr [rdi]
[0,2]     D=====eER .    .    .   cmp rdi, rsi
[0,3]     .D=====eER.    .    .   jne .L3
[1,0]     .D====eeeeeeER .    .   add eax, dword ptr [rdi + 16]
[1,1]     .D====eeeeeE-R .    .   mov rdi, qword ptr [rdi]
[1,2]     . D========eER .    .   cmp rdi, rsi
[1,3]     . D=========eER.    .   jne .L3
[2,0]     . D========eeeeeeER .   add eax, dword ptr [rdi + 16]
[2,1]     .  D=======eeeeeE-R .   mov rdi, qword ptr [rdi]
[2,2]     .  D============eER .   cmp rdi, rsi
[2,3]     .  D=============eER.   jne .L3

D : Instruction dispatched.
e : Instruction executing.
E : Instruction executed.
R : Instruction retired.
= : Instruction already dispatched, waiting to be executed.
- : Instruction executed, waiting to be retired.

Each iteration depends on the result of mov rdi, qword ptr [rdi] from the previous one. The mov and add in the same iteration are independent, and since the L1 cache supports two loads per cycle, the processor can execute them at the same time. The compare and conditional jump have minimal effect on performance.

IACA shows similar, though not identical, results:

D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred

 Num Of |           Ports pressure in cycles            |
  Uops  |  0  |  1  |  2  -  D  |  3  -  D  |  5  |  6  |
---------------------------------------------------------
   2^   | 0.2 | 0.3 | 1.0   1.0 |           | 0.3 | 0.3 | add eax, dword ptr [rdi+16]
   1    |     |     |           | 1.0   1.0 |     |     | mov rdi, qword ptr [rdi]
   1*   |     |     |           |           |     |     | cmp rdi, rsi
   0*F  |     |     |           |           |     |     | jnz 0xfffffffffffffff7
Total Num Of Uops: 4

What MCA didn’t tell us is that cmp and jnz get macro-fused, so the actual number of µops per iteration is 4.

 0| 0|add eax, dword ptr [rdi+0x10]    :          |         |         |
 0| 0|    TYPE_LOAD (1 uops)           :s---deeeew----R-------p       |
 0| 0|    TYPE_OP (1 uops)             :A--------sdw----R-------p     |
 0| 1|mov rdi, qword ptr [rdi]         :          |         |         |
 0| 1|    TYPE_LOAD (1 uops)           :s---deeeew------R-------p     |
 0| 2|cmp rdi, rsi                     :          |         |         |
 0| 2|    TYPE_OP (1 uops)             :A--------sdw----R-------p     |
 0| 3|jnz 0xfffffffffffffff7           :          |         |         |
 0| 3|    TYPE_OP (0 uops)             :w---------------R-------p     |
 1| 0|add eax, dword ptr [rdi+0x10]    :          |         |         |
 1| 0|    TYPE_LOAD (1 uops)           :A--------sdeeeew----R-------p |
 1| 0|    TYPE_OP (1 uops)             :A--------------sdw----R-------p
 1| 1|mov rdi, qword ptr [rdi]         :          |         |         |
 1| 1|    TYPE_LOAD (1 uops)           : A-------sdeeeew------R-------p
 1| 2|cmp rdi, rsi                     :          |         |         |
 1| 2|    TYPE_OP (1 uops)             : A-------------w------R-------p
 1| 3|jnz 0xfffffffffffffff7           :          |         |         |
 1| 3|    TYPE_OP (0 uops)             : w--------------------R-------p

[A] – Allocated
[s] – Sources ready
[d] – Dispatched for execution
[e] – Execute
[w] – Writeback
[R] – Retired
[p] – Post Retire
[-] – pending

It’s time now to see what the numbers are when the code is actually executed2 and not just analysed. In this test, all elements of the list are stored sequentially in a contiguous block of memory. That’s the best case and only a small part of the full picture, but it limits the memory effects and is closest to what the static analysers showed.

------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
sum_list/1024            1378 ns         1376 ns       508484 ops=744.098M/s
sum_list/4096            6494 ns         6487 ns       107988 ops=631.396M/s
sum_list/16384          27602 ns        27560 ns        25404 ops=594.483M/s
sum_list/32768          55285 ns        55199 ns        12686 ops=593.629M/s
sum_list/65536         110544 ns       110374 ns         6346 ops=593.762M/s
sum_list/131072        220977 ns       220640 ns         3173 ops=594.053M/s
sum_list/262144        448409 ns       447463 ns         1565 ops=585.845M/s
sum_list/524288        901703 ns       899810 ns          777 ops=582.665M/s
sum_list/1048576      2627253 ns      2618203 ns          266 ops=400.495M/s
sum_list/2097152      6143158 ns      6122264 ns          114 ops=342.545M/s
sum_list/4194304     12309066 ns     12266764 ns           57 ops=341.924M/s
sum_list/8388608     24549560 ns     24477472 ns           29 ops=342.707M/s
sum_list/16777216    49105582 ns     48961525 ns           14 ops=342.661M/s
sum_list/33554432    98183344 ns     97894047 ns            7 ops=342.763M/s

The test case parameter is the size of the list. When the dataset fits in the L1 cache, we get around 4 cycles per list element. Once the cache becomes too small, memory bandwidth comes into play. 342 million 32-bit integers per second is not much, but each list element also contains two 8-byte pointers, and its size needs to be 8-byte aligned. That brings the element size up to 24 bytes. Since the elements are in a contiguous array, the loop progresses through that array at a rate of around 8.2GB/s (24B × 342M/s). That’s still not the maximum memory bandwidth achievable in a single thread, but we are starting to get close to it.

Now that we know the best we can get from a straightforward list implementation, let’s see how bad that “best” is. We can do that by comparing the results we’ve got with a data structure that is much more hardware-friendly: an array. I’m using the following test code:

using iterator = std::vector<int>::const_iterator;

int sum_array(iterator first, iterator last) {
    return std::accumulate(first, last, 0, std::plus<>{});
}

The core loop that the compiler emits looks as follows:

.L4:
        vpaddd  ymm1, ymm1, YMMWORD PTR [rax]
        add     rax, 32
        cmp     rax, rdx
        jne     .L4

We can already see the first benefit of an array. The compiler can vectorise the loop. The LLVM MCA result shows how this compares to the same operation performed on the list.

Iterations:        10000
Instructions:      40000
Total Cycles:      13343
Total uOps:        50000

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad

[1]    [2]    [3]    [4]    Instructions:
 2      8     0.50    *     vpaddd ymm1, ymm1, ymmword ptr [rax]
 1      1     0.25          add    rax, 32
 1      1     0.25          cmp    rax, rdx
 1      1     0.50          jne    .L4

There are still 4 instructions and 5 µops per iteration. The average cost of a single iteration is just 1.33 cycles. Not only that, but each iteration processes eight elements of the array. In other words, instead of, on average, 4 cycles per integer, we now get 0.167.

Let’s see how this looks on the IACA timeline:

 0| 0|vpaddd ymm1, ymm1, ymmword ptr [rax]    :          |         |
 0| 0|    TYPE_LOAD (1 uops)                  :s---deeeeeew----R-------p
 0| 0|    TYPE_OP (1 uops)                    :A----------sdw----R-------p
 0| 1|add rax, 0x20                           :          |         |
 0| 1|    TYPE_OP (1 uops)                    :sdw---------------R-------p
 0| 2|cmp rax, rdx                            :          |         |
 0| 2|    TYPE_OP (1 uops)                    :Asdw--------------R-------p
 0| 3|jnz 0xfffffffffffffff5                  :          |         |
 0| 3|    TYPE_OP (0 uops)                    :w-----------------R-------p
 1| 0|vpaddd ymm1, ymm1, ymmword ptr [rax]    :          |         |
 1| 0|    TYPE_LOAD (1 uops)                  :As--deeeeeew------R-------p
 1| 0|    TYPE_OP (1 uops)                    :A-----------sdw----R-------p
 1| 1|add rax, 0x20                           :          |         |
 1| 1|    TYPE_OP (1 uops)                    : sdw---------------R-------p
 1| 2|cmp rax, rdx                            :          |         |
 1| 2|    TYPE_OP (1 uops)                    : Aw----------------R-------p
 1| 3|jnz 0xfffffffffffffff5                  :          |         |
 1| 3|    TYPE_OP (0 uops)                    : w-----------------R-------p

A significant difference is that the memory load in one iteration doesn’t depend on the memory load from the previous iteration. While there is a dependency between the vpaddd instructions, it only affects the µop doing the addition, not the memory load. That allows instruction-level parallelism to hide the latency of the load, which is by far the largest, and lets us achieve, on average, 1.33 cycles per iteration.

It is also worth noting that this loop could be further optimised by partially unrolling it and keeping multiple independent addition streams. Doing so would hide the latency of the address increment (rax) and the vector addition (ymm1). However, this is not what this article is about, so let’s stick to the simple example.
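
Just to illustrate what that would look like, here is a scalar sketch of the idea (the benchmarks below keep using the simple version, and the assumption that the element count is a multiple of four is mine):

int sum_array_unrolled(int const* first, int const* last) {
    // Four independent accumulators form four separate dependency chains,
    // so additions from consecutive iterations can execute in parallel.
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (; first != last; first += 4) {
        a0 += first[0];
        a1 += first[1];
        a2 += first[2];
        a3 += first[3];
    }
    return (a0 + a1) + (a2 + a3);
}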

-------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------
sum_array/1024           57.8 ns         57.7 ns     11883599 ops=17.7382G/s
sum_array/4096            245 ns          245 ns      3393065 ops=16.7102G/s
sum_array/8192            420 ns          419 ns      1769924 ops=19.5334G/s
sum_array/16384           878 ns          877 ns       796682 ops=18.6718G/s
sum_array/32768          1906 ns         1903 ns       377939 ops=17.2148G/s
sum_array/65536          5961 ns         5953 ns       116709 ops=11.0097G/s
sum_array/131072        14382 ns        14358 ns        48150 ops=9.12886G/s
sum_array/262144        28744 ns        28704 ns        24386 ops=9.13269G/s
sum_array/524288        57481 ns        57401 ns        12196 ops=9.13385G/s
sum_array/1048576      115200 ns       114983 ns         6088 ops=9.1194G/s
sum_array/2097152      232698 ns       232258 ns         3014 ops=9.02941G/s
sum_array/4194304      524663 ns       523159 ns         1329 ops=8.01726G/s
sum_array/8388608     2356814 ns      2348119 ns          301 ops=3.57248G/s
sum_array/16777216    4978426 ns      4957937 ns          141 ops=3.38391G/s
sum_array/33554432    9934662 ns      9899215 ns           71 ops=3.38961G/s

These results are much better than the list’s. Since the core does only a minimal amount of work and memory bandwidth is the bottleneck in most cases, we can rather easily infer the sizes of the cache levels from these results.

As long as we fit in the L1 cache, we get the performance that the static analysis promised. When memory bandwidth starts to become the limiting factor, the program processes over 3 billion elements per second, which translates to over 12GB/s, close to the maximum rate at which a single core of this CPU can read from DRAM. The list was able to achieve over 8GB/s, but since a significant portion of that was metadata, the actual element processing rate ended up much lower.

Random memory access

So far we have looked at a simple and, perhaps, not very realistic case – small objects stored sequentially. While we were already able to identify some performance problems caused by lists, we are yet to see what is perhaps the most important one: cache misses.

------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
sum_list/1024           1379 ns         1378 ns       507422 ops=743.156M/s
sum_list/4096          20155 ns        20134 ns        34726 ops=203.438M/s
sum_list/16384        211483 ns       211150 ns         3317 ops=77.594M/s
sum_list/32768        577084 ns       576205 ns         1214 ops=56.8686M/s
sum_list/65536       1301668 ns      1299602 ns          538 ops=50.4277M/s
sum_list/131072      2745135 ns      2740843 ns          256 ops=47.8218M/s
sum_list/262144      6003913 ns      5990938 ns          117 ops=43.7568M/s
sum_list/524288     13040197 ns     13012240 ns           54 ops=40.2919M/s
sum_list/1048576    49060483 ns     48940961 ns           13 ops=21.4253M/s
sum_list/2097152   152734103 ns    152379010 ns            4 ops=13.7627M/s
sum_list/4194304   368973310 ns    368132055 ns            2 ops=11.3935M/s
sum_list/8388608   808360638 ns    806535445 ns            1 ops=10.4008M/s
sum_list/16777216 1693254531 ns   1689396273 ns            1 ops=9.93089M/s
sum_list/33554432 3470687291 ns   3462788715 ns            1 ops=9.69M/s

These are the results of the benchmark with objects randomly placed in memory. Since each element is at an unpredictable location, the hardware prefetchers aren’t able to hide the memory latency. Consequently, as long as the data fits in the L1 cache, the results are very similar to the sequential dataset. However, as soon as the memory loads need to go to the L2 cache, there is a significant degradation. Once DRAM latency comes into play, the results get absolutely horrible.

Prefetch

The tests we have run until now all involved a straightforward arithmetic operation and focused entirely on the memory accesses. That’s hardly representative of all possible uses. Let’s see what we get if each iteration is much more expensive. To make reasoning easier, I’m going to introduce more arithmetic operations but no additional memory accesses, even though this may be an oversimplification.

struct value {
    intrusive::list_hook hook_;
    int value_ = 23;
};

using iterator = list_type::const_iterator;

int is_prime(value const& v) {
    for (auto i = 2; i < v.value_; i++) {
        if (v.value_ % i == 0) {
            return 0;
        }
    }
    return 1;
}

int count_primes(iterator first, iterator last) {
    return std::accumulate(first, last, 0,
        [] (int a, value const& b) { return a + is_prime(b); });
}

There isn’t much we will be able to do if the list elements are stored sequentially. In those cases, the bottleneck was either the CPU core, which is_prime() only makes worse, or memory bandwidth, which we can’t increase, nor can we reduce our needs since all elements have to be read.

Let’s have a look at the case when objects are at random locations in memory. Unsurprisingly, the results are much lower than before:

------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024          58141 ns        58079 ns        12042 ops=17.6313M/s
count_primes/4096         232530 ns       232275 ns         3013 ops=17.6342M/s
count_primes/16384        952761 ns       951288 ns          736 ops=17.223M/s
count_primes/32768       1988941 ns      1985711 ns          351 ops=16.5019M/s
count_primes/65536       4064617 ns      4058007 ns          173 ops=16.1498M/s
count_primes/131072      8226710 ns      8213428 ns           85 ops=15.9583M/s
count_primes/262144     16932561 ns     16896077 ns           41 ops=15.5151M/s
count_primes/524288     35003895 ns     34927001 ns           20 ops=15.011M/s
count_primes/1048576    98477086 ns     98256410 ns            7 ops=10.6718M/s
count_primes/2097152   265279864 ns    264683726 ns            3 ops=7.92324M/s
count_primes/4194304   602553289 ns    601206514 ns            1 ops=6.97648M/s
count_primes/8388608  1285006296 ns   1282135197 ns            1 ops=6.54269M/s
count_primes/16777216 2655921573 ns   2650113330 ns            1 ops=6.33075M/s
count_primes/33554432 5405048320 ns   5393033036 ns            1 ops=6.22181M/s

We can try to improve this. What we are dealing with is slow computation combined with slow memory loads. The memory load in one iteration doesn’t depend on the result of the computation from the previous one. As LLVM MCA showed us earlier, instruction-level parallelism attempts to take advantage of that, at least to some extent. However, because in the C++ code we increment the iterator after the body of the iteration, and the compiler doesn’t change that, the processor’s ability to reorder those operations is limited. What we can do is implement our own accumulate that performs those operations in the order we want.

template<typename Iterator, typename T, typename Function>
T accumulate(Iterator first, Iterator last, T init, Function fn) {
    while (first != last) {
        // Advance the iterator before invoking fn, so the pointer-chasing
        // load of the next node is issued before the computation starts.
        init = fn(init, *first++);
    }
    return init;
}

There is a small but measurable improvement.

------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024          57882 ns        57820 ns        12086 ops=17.7102M/s
count_primes/4096         232351 ns       232040 ns         3015 ops=17.6522M/s
count_primes/16384        933098 ns       931562 ns          751 ops=17.5877M/s
count_primes/32768       1891233 ns      1888107 ns          371 ops=17.3549M/s
count_primes/65536       3812290 ns      3806098 ns          184 ops=17.2187M/s
count_primes/131072      7660883 ns      7648319 ns           92 ops=17.1374M/s
count_primes/262144     15721946 ns     15687144 ns           45 ops=16.7108M/s
count_primes/524288     32515040 ns     32436334 ns           22 ops=16.1636M/s
count_primes/1048576    89039706 ns     88833419 ns            8 ops=11.8038M/s
count_primes/2097152   237240276 ns    236644783 ns            3 ops=8.86203M/s
count_primes/4194304   533267107 ns    532046120 ns            1 ops=7.88335M/s
count_primes/8388608  1130966829 ns   1128285050 ns            1 ops=7.43483M/s
count_primes/16777216 2333217583 ns   2327657784 ns            1 ops=7.20777M/s
count_primes/33554432 4740416511 ns   4729532527 ns            1 ops=7.09466M/s

Essentially, what we are trying to do here is prefetch the next element. The processor has four hardware prefetchers: two serving the L1 cache (a streaming prefetcher that fetches the next cache line on ascending accesses, and an IP-based prefetcher that tracks the strides of individual load instructions) and two serving the L2 cache (a spatial prefetcher that fetches the cache line completing a 128-byte aligned pair, and a streamer that follows ascending or descending sequences of accesses).

These prefetchers cover sequential accesses and large objects spanning multiple cache lines, but there’s nothing that would help with lists of small objects. We need software prefetching.

There is a family of prefetch instructions that cause a cache line to be fetched into the L1 cache. The variants allow specifying whether the actual access is going to be a load or a store and what the expected temporal locality is. The benchmarks we are running are not sophisticated enough to show differences between those variants, but we can see that a prefetch performs better than an explicit preload.
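
For reference, this is roughly how those variants are exposed by GCC and Clang (the mapping to particular instructions below is approximate):

#include <xmmintrin.h>

void prefetch_variants(void const* p) {
    // __builtin_prefetch(address, rw, locality): rw is 0 for a read and 1 for
    // a write, locality goes from 0 (no temporal locality) to 3 (high).
    __builtin_prefetch(p, 0, 3);  // roughly prefetcht0
    __builtin_prefetch(p, 0, 0);  // roughly prefetchnta

    // The same thing via the intrinsics header.
    _mm_prefetch(static_cast<char const*>(p), _MM_HINT_T0);
}

The accumulate below uses the plain single-argument form, which defaults to a read with high temporal locality, to fetch the next node while the current element is being processed.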

template<typename Iterator, typename T, typename Function>
T accumulate(Iterator first, Iterator last, T init, Function fn) {
    while (first != last) {
        // Prefetch the node that will be processed in the next iteration.
        __builtin_prefetch(first.current_->next_);
        init = fn(init, *first);
        first++;
    }
    return init;
}

The performance continues improving.

------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
count_primes/1024          58509 ns        58432 ns        11968 ops=17.5247M/s
count_primes/4096         233916 ns       233654 ns         2996 ops=17.5302M/s
count_primes/16384        936182 ns       934700 ns          749 ops=17.5286M/s
count_primes/32768       1872717 ns      1869688 ns          374 ops=17.5259M/s
count_primes/65536       3745794 ns      3739701 ns          187 ops=17.5244M/s
count_primes/131072      7491735 ns      7479747 ns           94 ops=17.5236M/s
count_primes/262144     14999112 ns     14966690 ns           47 ops=17.5152M/s
count_primes/524288     30253345 ns     30185697 ns           23 ops=17.3688M/s
count_primes/1048576    68201001 ns     68038354 ns           10 ops=15.4115M/s
count_primes/2097152   171015642 ns    170615830 ns            4 ops=12.2917M/s
count_primes/4194304   383902949 ns    383021166 ns            2 ops=10.9506M/s
count_primes/8388608   825511961 ns    823440345 ns            1 ops=10.1873M/s
count_primes/16777216 1716217939 ns   1712077835 ns            1 ops=9.79933M/s
count_primes/33554432 3509215001 ns   3500849170 ns            1 ops=9.58466M/s

Order guarantees

Everything we have considered and done so far was at a low, hardware level. It is time to take a look at the actual algorithm and see whether it can be changed to improve performance.

std::for_each and std::accumulate guarantee the order in which they iterate over the range. Often the user doesn’t need that, and the guarantee becomes an unnecessary limitation. In reasonably simple cases, the compiler can see that the binary operation is associative and commutative and optimise the code accordingly. We saw that happen in the integer array sum, where the compiler chose to use the vectorised vpaddd.

Unfortunately, we can’t rely on the optimiser to do similar optimisations for more complicated data structures. C++17 solved this problem by introducing std::reduce which behaves just like std::accumulate, but doesn’t specify the order in which it applies the provided binary operation.
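
For the array sum from earlier the change is mechanical; a sketch (std::reduce lives in <numeric>):

#include <functional>
#include <numeric>
#include <vector>

int sum_array(std::vector<int> const& v) {
    // std::reduce may apply the operation in any order and grouping, which is
    // why it requires the operation to be associative and commutative.
    return std::reduce(v.begin(), v.end(), 0, std::plus<>{});
}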

Let’s take advantage of that. For a doubly-linked list, the lack of any order guarantees means that we can visit the elements of the list in two entirely independent sequences: one starting from the beginning and one starting from the end. Since the processor and memory can handle more than one cache miss at a time, we can expect noticeably better benchmark results.

template<typename Iterator, typename T, typename Function>
T reduce(Iterator first, Iterator last, T init, Function fn) {
    T a = init;
    T b{}; // <- not quite right, but let's keep it simple
    while (first != last) {
        auto current = first;
        first++;
        __builtin_prefetch(first.current_->next_);
        a = fn(a, *current);
        if (first == last) {
            break;
        }
        --last;
        current = last;
        __builtin_prefetch(last.current_->prev_);
        b = fn(b, *current);
    }
    return a + b;
}

The results look promising. The additional complexity penalises the cases with a smaller list, however.

---------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024          61918 ns        61836 ns        11312 ops=16.5599M/s
count_primes/4096         247573 ns       247294 ns         2830 ops=16.5633M/s
count_primes/16384        992230 ns       990653 ns          707 ops=16.5386M/s
count_primes/32768       1986464 ns      1983268 ns          353 ops=16.5222M/s
count_primes/65536       3974927 ns      3967599 ns          176 ops=16.5178M/s
count_primes/131072      7948909 ns      7936133 ns           88 ops=16.5159M/s
count_primes/262144     15914174 ns     15880084 ns           44 ops=16.5077M/s
count_primes/524288     32182869 ns     32110430 ns           22 ops=16.3277M/s
count_primes/1048576    69216427 ns     69057303 ns           10 ops=15.1841M/s
count_primes/2097152   152311285 ns    151964885 ns            5 ops=13.8002M/s
count_primes/4194304   314740429 ns    314020894 ns            2 ops=13.3568M/s
count_primes/8388608   641017755 ns    639543661 ns            1 ops=13.1166M/s
count_primes/16777216 1293884116 ns   1290747747 ns            1 ops=12.9981M/s
count_primes/33554432 2599582839 ns   2593566559 ns            1 ops=12.9376M/s

It’s also worth noting that while prefetching helped only when there was an expensive compute operation as well, iterating over the list from both ends helps even if the binary operation is a simple addition.

The reason is that one of the performance problems we identified earlier was the dependency between iterations: the processor needs to read from the current object to get the address of the next one. With two independent chains of objects, instruction-level parallelism has more opportunities to hide latencies.

Since we have already started changing high-level details, we may also want to consider slight modifications to the data structure itself. If the performance of reduce-like operations is important, it may be worthwhile to maintain a separate array of pointers to the list elements, in an unspecified order. When an element is inserted into the list, a pointer to it is appended to the array. On erase, the pointer to the removed element is swapped with the last pointer in the array. There are several trade-offs involved here: an increased memory footprint, including a larger hook, and more expensive insert and erase.
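
A minimal sketch of that index bookkeeping, with hypothetical names (the element remembers its position in the array so that erase can swap-and-pop in constant time):

#include <cstddef>
#include <vector>

struct element {
    std::size_t index_position_;
    /* list hook and payload omitted */
};

void index_insert(std::vector<element*>& index, element& e) {
    e.index_position_ = index.size();
    index.push_back(&e);
}

void index_erase(std::vector<element*>& index, element& e) {
    auto pos = e.index_position_;
    index[pos] = index.back();           // move the last pointer into the gap
    index[pos]->index_position_ = pos;   // and update its stored position
    index.pop_back();                    // element order is not preserved
}

With such an index in place, we can implement reduce in the following way: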

template<typename T, typename Function>
T reduce(std::vector<value*> const& v, T init, Function&& fn) {
    auto current = v.data();
    auto prefetch = v.data() + 4;
    auto end = v.data() + v.size();
    while (current != end) {
        __builtin_prefetch(*prefetch++);
        init = fn(init, **current++);
    }
    return init;
}

If fn is integer addition, this gets compiled to:

.L14:
        mov     rcx, QWORD PTR [rdx+32]
        add     rdx, 8
        prefetcht0      [rcx]
        mov     rcx, QWORD PTR [rdx-8]
        add     eax, DWORD PTR [rcx+16]
        cmp     rsi, rdx
        jne     .L14

What we have gained is that the only dependency between iterations is rdx, and the only instruction that modifies its value is the cheap add rdx, 8. That could significantly improve performance by enabling much better instruction-level parallelism than any other list-based solution we’ve tried before, as the LLVM MCA report below shows. On the other hand, we get an additional memory load and a prefetch in each iteration, which is not free.

Iterations:        10000
Instructions:      70000
Total Cycles:      20013
Total uOps:        80000

[0,0]     DeeeeeER  .    .    .   mov       rcx, qword ptr [rdx + 32]
[0,1]     DeE----R  .    .    .   add       rdx, 8
[0,2]     D=====eeeeeER  .    .   prefetcht0        byte ptr [rcx]
[0,3]     D=eeeeeE----R  .    .   mov       rcx, qword ptr [rdx - 8]
[0,4]     .D=====eeeeeeER.    .   add       eax, dword ptr [rcx + 16]
[0,5]     .DeE----------R.    .   cmp       rsi, rdx
[0,6]     .D=eE---------R.    .   jne       .L14
[1,0]     . DeeeeeE-----R.    .   mov       rcx, qword ptr [rdx + 32]
[1,1]     . DeE---------R.    .   add       rdx, 8
[1,2]     . D=====eeeeeER.    .   prefetcht0        byte ptr [rcx]
[1,3]     . D=eeeeeE----R.    .   mov       rcx, qword ptr [rdx - 8]
[1,4]     .  D=====eeeeeeER   .   add       eax, dword ptr [rcx + 16]
[1,5]     .  DeE----------R   .   cmp       rsi, rdx
[1,6]     .  D=eE---------R   .   jne       .L14

The benchmark results are much better.

-----------------------------------------------------------------------------
Benchmark                  Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
sum_list/1024            753 ns          752 ns       931223 ops=1.36227G/s
sum_list/4096           4003 ns         3998 ns       176082 ops=1024.41M/s
sum_list/16384         33795 ns        33743 ns        20828 ops=485.547M/s
sum_list/32768         73549 ns        73446 ns         9034 ops=446.15M/s
sum_list/65536        158379 ns       158158 ns         4428 ops=414.37M/s
sum_list/131072       328278 ns       327774 ns         2136 ops=399.886M/s
sum_list/262144       906197 ns       904299 ns          783 ops=289.887M/s
sum_list/524288      3287541 ns      3279609 ns          214 ops=159.863M/s
sum_list/1048576    10340606 ns     10308163 ns           68 ops=101.723M/s
sum_list/2097152    28847559 ns     28760461 ns           24 ops=72.9179M/s
sum_list/4194304    76017404 ns     75794048 ns            9 ops=55.3382M/s
sum_list/8388608   184997683 ns    184460327 ns            4 ops=45.4765M/s
sum_list/16777216  409853275 ns    408730117 ns            2 ops=41.0472M/s
sum_list/33554432  867062224 ns    864723310 ns            1 ops=38.8037M/s

The array-of-pointers reduce algorithm was still using prefetching. The reason is the same as before: if fn performs expensive computation, prefetching helps ensure that this time is also spent loading subsequent elements into the L1 cache. It can harm cases where the binary operation is cheap, however, and that is what is happening now. Below are the results of the same benchmark, but without the explicit prefetch.

-----------------------------------------------------------------------------
Benchmark                  Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
sum_list/1024            508 ns          508 ns      1000000 ops=2.01644G/s
sum_list/4096           3170 ns         3167 ns       222224 ops=1.29346G/s
sum_list/16384         29529 ns        29485 ns        23731 ops=555.678M/s
sum_list/32768         64005 ns        63911 ns        10472 ops=512.709M/s
sum_list/65536        134276 ns       134086 ns         5220 ops=488.762M/s
sum_list/131072       276732 ns       276299 ns         2533 ops=474.385M/s
sum_list/262144       863967 ns       862067 ns          814 ops=304.088M/s
sum_list/524288      3259658 ns      3251631 ns          215 ops=161.238M/s
sum_list/1048576     9678259 ns      9646640 ns           72 ops=108.699M/s
sum_list/2097152    25978633 ns     25902456 ns           27 ops=80.9634M/s
sum_list/4194304    71612908 ns     71414814 ns           10 ops=58.7316M/s
sum_list/8388608   178783902 ns    178306598 ns            4 ops=47.046M/s
sum_list/16777216  400214167 ns    399145315 ns            2 ops=42.0329M/s
sum_list/33554432  848662556 ns    846467448 ns            1 ops=39.6405M/s

Now, if we go back to the prime-counting test, things look different. With prefetch:

---------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024          58373 ns        58308 ns        11982 ops=17.5621M/s
count_primes/4096         230787 ns       230508 ns         3047 ops=17.7695M/s
count_primes/16384        911047 ns       909573 ns          772 ops=18.0128M/s
count_primes/32768       1823311 ns      1820351 ns          385 ops=18.0009M/s
count_primes/65536       3653772 ns      3647734 ns          192 ops=17.9662M/s
count_primes/131072      7314532 ns      7302256 ns           96 ops=17.9495M/s
count_primes/262144     14650767 ns     14618345 ns           48 ops=17.9325M/s
count_primes/524288     29483920 ns     29417635 ns           24 ops=17.8222M/s
count_primes/1048576    65331597 ns     65176522 ns           11 ops=16.0882M/s
count_primes/2097152   141598740 ns    141261338 ns            5 ops=14.8459M/s
count_primes/4194304   293117615 ns    292433879 ns            2 ops=14.3427M/s
count_primes/8388608   595831596 ns    594449906 ns            1 ops=14.1115M/s
count_primes/16777216 1201265965 ns   1198475633 ns            1 ops=13.9988M/s
count_primes/33554432 2413779177 ns   2408014847 ns            1 ops=13.9345M/s

Without prefetch:

---------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
count_primes/1024          60592 ns        60511 ns        11626 ops=16.9225M/s
count_primes/4096         240139 ns       239875 ns         2919 ops=17.0755M/s
count_primes/16384        968054 ns       966523 ns          725 ops=16.9515M/s
count_primes/32768       1962921 ns      1959719 ns          357 ops=16.7208M/s
count_primes/65536       3958932 ns      3951584 ns          177 ops=16.5847M/s
count_primes/131072      7957449 ns      7944047 ns           88 ops=16.4994M/s
count_primes/262144     16322022 ns     16286870 ns           43 ops=16.0954M/s
count_primes/524288     33987005 ns     33912660 ns           21 ops=15.4599M/s
count_primes/1048576    99127792 ns     98909029 ns            7 ops=10.6014M/s
count_primes/2097152   244108086 ns    243574237 ns            3 ops=8.60991M/s
count_primes/4194304   541851480 ns    540651988 ns            1 ops=7.75786M/s
count_primes/8388608  1149760156 ns   1147266232 ns            1 ops=7.31182M/s
count_primes/16777216 2371160344 ns   2365748664 ns            1 ops=7.09172M/s
count_primes/33554432 4818974014 ns   4807875429 ns            1 ops=6.97906M/s

As we can see, explicit software prefetch can be beneficial, but it largely depends on the specifics of the code and the way memory accesses are balanced against computations.

Conclusion

This exploration of ways to optimise doubly-linked lists has shown us several things. The fact that lists perform poorly is hardly a surprise, but depending on the particularities of the dataset there were different reasons for it. When everything fitted in the L1 cache, the problem was the mere fact that std::accumulate on a list needs to do more work than the same operation on an array and can’t be vectorised. Once the dataset grew larger than the cache but the objects were still stored sequentially, memory bandwidth was the limiting factor. With randomly placed objects, we saw a significant drop in performance as soon as the processor had to load data from the L2 cache, and it only got worse as the size of the data grew.

Another thing we saw is that many of the things that can potentially improve performance are really trade-offs. Sometimes they help, even a lot, but for other workloads and datasets they do nothing at best and are harmful at worst. Since we have run only simple benchmarks, there were still many phenomena that didn’t affect us at all. There are also other aspects that need consideration, such as the readability and maintainability of the code. All this makes optimising generic algorithms harder and explains why more specific solutions can often offer better performance.

It’s still probably a good idea to avoid lists if possible, though.


  1. GCC 8.3.1 with -O3 -march=haswell -std=c++17 ↩︎

  2. Intel Core i7-5960X CPU, 3.00GHz, TurboBoost disabled, 32kB of L1d, 256kB of L2, 20MB of L3 ↩︎