Let’s say we want to store some data on multiple drives, so that we can recover from drive failures.

One obvious (and valid!) first attempt is to just store everything multiple times – usually called “mirroring”. The most common form of mirroring is to store things twice: if one drive fails, the data is still in the other.^{1}

Reed-Solomon coding gives us much more flexibility, allowing us to store our data over n = k + t drives, so that *any t* drives can fail while still not losing data.^{2}

For instance, with \mathrm{RS}(10, 4) we can store data over 14 drives, and any 4 drives can fail at once, without incurring data loss. The data is split over 10 equal blocks, and then 4 additional blocks (“parity” blocks) are generated in such a way that any 4 blocks can be recovered from the other 10.^{3}^{4}

The main idea behind Reed-Solomon is easy to visualize. We’ll focus on the task of durably storing k numbers y_1, ..., y_k. We’ll plot them as (1, y_1), ..., (k, y_k). For instance, with k = 4, we might have:

This post presents a specific way to perform Reed-Solomon coding, using Lagrange interpolation. See the Wikipedia article for more information.↩︎

RAID4 and RAID6 fit in this framework as \mathrm{RS}(3,1) and \mathrm{RS}(4,2) respectively.↩︎

The “parity” terminology comes from other error detecting or correcting codes which use parity bits.↩︎

For any k points with distinct x coordinates, there’s a unique polynomial of degree d < k passing through them.^{5} This fact should not be too surprising, and is apparent when k = 2: given two distinct points, there’s only one line passing through them.

Here’s the unique third-degree polynomial passing through our four points:

A polynomial of degree d is something of the form

a_0 + a_1 x + a_2 x^2 + ... + a_d x^d↩︎

Armed with this fact, we can pick an additional t points on the same polynomial. Again, with k = 4 and t = 2, we would have:

Sampled points are in gold. Since the interpolating polynomial is unique given any k points it passes through, we can drop any t points, and we’ll still be able to recover the polynomial. Once we’ve done that, we can just resample it to recover the numbers we’re storing.

To recap, our procedure to durably store k numbers is as follows:

- Compute the unique polynomial of degree d < k passing through the numbers, placed at predefined x coordinates on the XY plane;
- Sample the polynomial t times beyond the original points;
- Store the k original numbers alongside the t “parity” numbers;
- If some numbers are lost we can recompute them by resampling the polynomial.

At this point, you already understand a key idea powering Reed-Solomon coding.

The other key idea allows us to store bits, rather than numbers. I won’t properly explain it here, but the gist of it is to use *finite fields* of order 2^{\mathrm{bits}} as our number type.^{6}

Such finite fields are numeric types that work differently from the usual integers modulo 2^{\mathrm{bits}} we’re used to programming with, but they are still easy to implement in hardware, and, importantly, the considerations about polynomials in this blog post hold for them.^{7}

Once we have such a numeric type, all we need to do to durably store some blob of data is to split it into a series of numbers (each \mathrm{bits} wide), group them in sets of k, and then store them durably as described in this post, tuning t based on how redundant we want to be.

The final trick worth mentioning is that this kind of Reed-Solomon encoding can be implemented efficiently given that we have fixed the x coordinates, no matter what numbers we want to store. In other words, the definition of the Lagrange basis polynomials \ell_j which we use to generate the unique polynomial depends only on the x coordinates, which allows us to do the heavy lifting once for any k numbers we want to store.

Head over to Peter Cawley’s blog for details on how the finite field machinery works on modern CPUs.↩︎

As the name suggests, finite fields are fields, which is handy since generating polynomials passing through a set of points involves divisions.

Integers modulo 2^\mathrm{bits}, which is what we’re used to programming with, are *not* closed under division, which jams the maths required to perform the operations we described.↩︎

This setup is entirely due to Niklas Hambüchen; I’m sharing it here since it is tremendously useful.

Let’s say you have a server without a graphics card, and you want to use graphical programs directly on it. Here is a three-step procedure to get a remote desktop supporting OpenGL applications:

1. Install `turbovnc`, for example by putting it into `environment.systemPackages`. TurboVNC supports software rendering through LLVMpipe, a software rasterizer for Mesa, which in turn is the most popular open-source implementation of OpenGL and, in general, of the Linux graphics stack.

2. Set `hardware.opengl.enable = true` in `/etc/nixos/configuration.nix`. This will create `/run/opengl-driver`, containing the shared libraries that OpenGL applications will need to load.

3. Start the TurboVNC server with

   `server% Xvnc :30 -iglx -depth 24 -rfbwait 120000 -deferupdate 1 -localhost -verbose -securitytypes none`

   Note that the server will only listen on `localhost`, to defer security to SSH.

All we need to do on the client is to port-forward the TurboVNC port and connect:

```
client% ssh <server> -L 5930:localhost:5930
client% vncviewer -securitytypes none :30 -DesktopSize=2500x1350 -Scale=150&
```

Here I’m setting a size and a scale manually, but you get the idea.

The X server started by TurboVNC is very bare, but one can start a terminal manually and then go from there:

`server% DISPLAY=:30 xfce4-terminal&`

I use `openbox` for simple window management, by just starting it from the terminal spawned above.

If we have a heuristic to guess some value cheaply, we can remove a data dependency in a tight loop using the branch predictor. This allows the CPU to run more instructions in parallel, increasing performance. If this explanation does not make much sense to you, keep reading to learn about some of the magic making your CPU fast!

Per Vognsen’s twitter feed is full of neat low-level curiosities, usually leveraging CPU features for some performance benefit.

Recently he tweeted about a trick that I had never heard of – value speculation.^{1} The trick exploits the branch predictor to guess values, enabling more instruction parallelism and thereby removing a bottleneck on the L1 cache. Note that the bottleneck is *not* due to L1 cache misses, but to L1 cache *hits* introducing unwanted data dependencies.

Per, in turn, referenced a blog post by Paul Khuong with a real-world example deploying this trick. Paul, in turn, references side-channel attacks.↩︎

In this post I explain the machinery involved, including a primer on branch prediction and CPU caches, so that anybody with a passing knowledge of C and how code is executed on CPUs should be able to follow.

The code for the post is available here. All the numbers are from a Xeon E5-1650 v3, an Intel Haswell processor with L1 / L2 / L3 caches of 32kB, 256kB, and 15MB respectively. The code was compiled with `clang -O3`, and not with `gcc`, for reasons explained later.

Before starting, I’d like to stress that L1 cache *hits* are almost certainly *not* the bottleneck of your application! This is just a very neat trick that illuminates some CPU features, not a guide on how to improve the performance of your average piece of C code.

We have a simple linked list data type, and a function summing all the elements of a given linked list:

```
typedef struct Node {
    uint64_t value;
    struct Node *next; // NULL for the last node
} Node;

uint64_t sum1(Node *node) {
    uint64_t value = 0;
    while (node) {
        value += node->value;
        node = node->next;
    }
    return value;
}
```

So far so good. Our test case works as follows: build a linked list where the nodes live sequentially in contiguous memory, then see how long it takes to sum them all up:

```
// Allocate 5MB of linked list nodes, and link them sequentially, with
// random data in the `value`s.
uint64_t n = 312500llu; // 312500 * sizeof(Node) = 312500 * 16 bytes = 5000000 bytes
Node *nodes = malloc(n * sizeof(Node));
for (uint64_t i = 0; i < n - 1; i++) {
    nodes[i].value = random_uint64();
    nodes[i].next = &nodes[i+1];
}
nodes[n-1].value = random_uint64();
nodes[n-1].next = NULL;

// Now sum.
sum1(&nodes[0]);
```

On a server with a relatively old Xeon E5-1650 v3, running `sum1` with the sample data takes 0.36 milliseconds, which means that we’re processing our linked list at roughly 14GB/s. In the rest of the post we will identify the bottleneck and get around it with value speculation, bringing the throughput for this dataset to 30GB/s.

The impact of the fix varies depending on the size of the dataset. If it is already entirely in the CPU cache, the improvement is much more pronounced, since otherwise we are quickly constrained by how fast we can read data from RAM. This graph shows the performance improvement over differently sized datasets (higher is better):

The chart shows the performance of `sum1` together with the performance of two improved functions, `sum2` and `sum3`. We go from a throughput of 14GB/s in `sum1` to more than 45GB/s in `sum3` if the data fits entirely in the L1 cache (the 16kB dataset), with the performance decreasing slightly for datasets fitting in the L2 and L3 cache (128kB and 5MB datasets). If the dataset does not fit entirely in any CPU cache (~4GB dataset) we go from 10GB/s to 15GB/s, which is as fast as the RAM allows.^{2}

See remarks in the last section for more data on why I think 15GB/s is the limit without resorting to deeper changes.↩︎

Modern CPUs do not process instructions serially, but rather handle many at the same time. They read many instructions at once, break them down in stages, and then try to fill all the computation units they have with as many tasks from as many instructions as possible.^{3} For instance, modern Intel processors are designed for a throughput of 4 instructions per clock cycle, and AMD Zen processors for up to 5 or 6.^{4}

However, branches pose a challenge when wanting to execute instructions in parallel. Let’s go back to our function `sum1`:

To expand on this topic, you can start reading on out-of-order execution and pipelining.↩︎

Agner Fog’s microarchitecture document contains tons of details about the pipeline characteristics for Intel and AMD x86 processors. The numbers on throughput for each architecture are usually in the “Pipeline” section.↩︎

```
uint64_t sum1(Node *node) {
    uint64_t value = 0;
    while (node) {
        value += node->value;
        node = node->next;
    }
    return value;
}
```

and its very readable assembly version:

```
; rdi = node and rax = value.
; rax is the return value register (we're returning value)
sum1:
xor rax, rax ; value = 0
test rdi, rdi ; if node is NULL, exit, otherwise start loop
je end
loop:
add rax, qword ptr [rdi] ; value += node->value
mov rdi, qword ptr [rdi + 8] ; node = node->next
test rdi, rdi ; if node is not NULL, repeat loop,
jne loop ; otherwise exit
end:
ret
```

The loop body is made of 4 instructions, the last of which is a jump. Without special measures, every instruction up to the `jne` must be executed before proceeding to the next instruction, since we need to know if we’ll go back to the beginning of the loop or continue. In other words, the conditional jump would introduce a barrier in the instruction-level parallelism internal to the CPU.

However, executing many instructions at once is so important that dedicated hardware – the *branch predictor* – is present in all modern CPUs to make an educated guess on which way we’ll go at every conditional jump. The details of how this works are beyond the scope of this blog post, but conceptually your CPU observes your program as it runs and tries to predict which branch will be taken by remembering what happened in the past.^{5}

Apart from the ever useful Agner Fog (see Section 3 of the microarchitecture document), Dan Luu has a nice blogpost explaining less dryly various ways of performing branch prediction.↩︎

Even without knowing much about branch prediction, we expect the predictor to do a great job for our test case – we always go back to the beginning of the loop, apart from when we stop consuming the list. On Linux, we can verify that this is the case with `perf stat`:

```
$ perf stat ./value-speculation-linux
...
2,507,580 branch-misses # 0.04% of all branches
```

The branch predictor gets it right 99.96% of the time. So the CPU can parallelize our instructions with abandon, right? …right?

Let’s focus on the loop body of `sum1`:

```
; rdi = node and rax = value.
loop:
add rax, qword ptr [rdi] ; value += node->value
mov rdi, qword ptr [rdi + 8] ; node = node->next
test rdi, rdi ; if node is not NULL, repeat loop,
jne loop ; otherwise exit
```

To increment `value` (`rax`), we need to know the value of `node` (`rdi`), which depends on the `mov` in the previous iteration of the loop. The same is true for the `mov` itself – it also depends on the result of the previous `mov`. So there’s a *data dependency* between each iteration of the loop: we must have finished reading `node->next` (`[rdi + 8]`) at iteration n before we can start executing the `add` and `mov` at iteration n+1.

Moreover, reading `node->next` (`[rdi + 8]`) is slower than you might think.

Modern CPUs are a lot better at adding numbers than reading from memory. For this reason, a series of fast caches exist between the CPU and main memory. All reading and writing from main memory normally goes through the cache – if the data we are interested in is not already present, the CPU will load a chunk of memory (a “cache line”, 64 bytes on x86) which contains our desired data into the cache.^{6} The fastest cache is usually called L1 (successive caching layers being predictably called L2, L3, …).

Our setup is the best-case scenario when it comes to CPU caches – we read a bunch of memory sequentially, utilizing every byte along the way. However, even if the L1 cache is very fast, it is not free: it takes around 4 CPU cycles to read from it. This will make our `mov` and `add` take at least 4 cycles to complete. The other two instructions, `test` and `jne`, will take only one cycle.^{7}

So the number of cycles needed to go through a single loop iteration is bounded by the 4 cycles it takes to read from L1 cache. The data I get from the Xeon I tested the program with is roughly consistent with this:

I say “normally” because the cache can be avoided using streaming SIMD instructions, which can write or copy memory bypassing the cache. However these methods are opt-in, and by default all memory goes through the cache.↩︎

Again, Agner Fog’s page on performance is the best resource I could find to source these numbers. For example, if one wanted to find these numbers for a Haswell CPU:

- The L1 latency (4 cycles) is in section 10.11 of the microarchitecture guide;
- The number of cycles it takes to execute `mov`, `add`, `test`, and `jne` is in the Haswell section of the instruction tables.↩︎

```
16kB, 10000 iterations
sum1: 8465052097154389858, 1.12us, 14.25GB/s, 3.91 cycles/elem, 1.03 instrs/cycle, 3.48GHz, 4.01 instrs/elem
128kB, 10000 iterations
sum1: 6947699366156898439, 9.06us, 14.13GB/s, 3.95 cycles/elem, 1.01 instrs/cycle, 3.49GHz, 4.00 instrs/elem
5000kB, 100 iterations
sum1: 2134986631019855758, 0.36ms, 14.07GB/s, 3.96 cycles/elem, 1.01 instrs/cycle, 3.48GHz, 4.00 instrs/elem
4294MB, 1 iterations
sum1: 15446485409674718527, 0.43 s, 9.94GB/s, 5.60 cycles/elem, 0.71 instrs/cycle, 3.48GHz, 4.00 instrs/elem
```

The important numbers are `cycles/elem` and `instrs/cycle`. We spend roughly 4 cycles per list element (that is to say, per loop iteration), corresponding to a throughput of roughly 1 instruction per cycle. Given that the CPU in question is designed for a throughput of 4 instructions per cycle, we’re wasting a lot of the CPU magic at our disposal, because we’re stuck waiting on the L1 cache.

We finally get to the trick. As discussed, we are stuck waiting on reading what the next node address is. However, in our setup we allocate the list in a contiguous block of memory, and therefore the nodes are always next to each other.

So here’s the key idea: try to guess the next node by just bumping the previous value. If the guess is wrong, set the node to the “real” next value. In C, this is what it would look like:

```
uint64_t faster_sum(Node *node) {
    uint64_t value = 0;
    Node *next = NULL;
    while (node) {
        value += node->value;
        next = node->next;
        // Guess the next value
        node++;
        // But fix it up if we guessed wrong (in case the nodes are not
        // next to each other).
        if (node != next) {
            node = next;
        }
    }
    return value;
}
```

This looks quite bizarre. We are still reading `node->next` in the comparison `node != next` to make sure our guess is right. So at first glance this might not seem like an improvement.

This is where the branch predictor comes in. In the case of lists where most nodes *are* next to each other (as is the case in our test code), the branch predictor will guess that the `if (node != next) { ... }` branch is not taken, and therefore we’ll go through loop iterations without having to wait for the L1 read.

Note that when the branch predictor *is* wrong (for example when the list ends, or if we have non-contiguous nodes) the CPU will need to backtrack and re-run from the failed branch prediction, which is costly (15 to 20 cycles on our processor^{8}). However, if the list is mostly contiguous, the trick works and makes our function 50-200% faster.

See “Misprediction penalty” for Haswell processor in Agner Fog’s microarchitecture document.↩︎

However there is one last challenge remaining to reach the final code and show you numbers – convincing compilers that our code is worth compiling.

Let’s go back to the code we showed for value speculation in C:

```
uint64_t faster_sum(Node *node) {
    uint64_t value = 0;
    Node *next = NULL;
    while (node) {
        value += node->value;
        next = node->next;
        node++;
        if (node != next) {
            node = next;
        }
    }
    return value;
}
```

Both `gcc` and `clang` easily deduce that the guessing is semantically pointless, and compile our trick away, making the compiled version of `faster_sum` the same as `sum1`. This is an instance where compiler smartness undoes human knowledge about the underlying platform we’re compiling for.

Per Vognsen’s gist uses the following trick to get compilers to behave – this is the first improvement over `sum1`, which we’ll call `sum2`:

```
static uint64_t sum2(Node *node) {
    uint64_t value = 0;
    while (node) {
        value += node->value;
        Node *predicted_next = node + 1;
        Node *next = node->next;
        if (next == predicted_next) {
            // Prevent compilers optimizing this apparently meaningless branch away
            // by making them think we're changing predicted_next here.
            //
            // This trick, however, does not work with GCC, only with clang. GCC here
            // derives that `next` and `predicted_next` are the same, and therefore
            // merges them into the same variable, which re-introduces the data
            // dependency we wanted to get rid of.
            asm("" : "+r"(predicted_next));
            node = predicted_next;
        } else {
            node = next;
        }
    }
    return value;
}
```

However `gcc` still doesn’t fully fall for it, as explained in the comment.^{9} Moreover, `clang`’s generated loop is not as tight as it could be, taking 10 instructions per element. So I resorted to manually writing out a better loop, which we’ll call `sum3`:^{10}

This is why I stuck to `clang` for this post. I don’t know what compiler Per is using for his tests.↩︎

Here I show the assembly version in Intel syntax, but in the code I write inline assembly, using AT&T syntax since it is better supported.↩︎

```
; rax = value, rcx = next, rdi = node
; Note that rax is the return value register (we are returning the value)
sum3:
xor rax, rax ; value = 0
xor rcx, rcx ; next = NULL
test rdi, rdi ; if node is null, go to the end,
je end ; otherwise start loop
loop_body:
add rax, qword ptr [rdi] ; value += node->value
mov rcx, qword ptr [rdi + 8] ; next = node->next
add rdi, 16 ; node++
cmp rcx, rdi ; if node is equal to next,
je loop_body ; restart loop, otherwise fix up node
mov rdi, rcx ; node = next
test rdi, rdi ; if node is not NULL restart the loop,
jne loop_body ; otherwise exit.
end:
ret
```

The code relies on the fact that `node` can’t be `NULL` after we increment it if it is equal to `next`, avoiding an additional test, and taking only 5 instructions per element (from `loop_body` to `je loop_body` in the happy path).^{11}

The original version of `sum3` took 6 instructions per element, until Rihalto pointed out a needless jump.↩︎

These are the final numbers for our four functions:

```
16kB, 10000 iterations
sum1: 8465052097154389858, 1.12us, 14.25GB/s, 3.91 cycles/elem, 1.03 instrs/cycle, 3.48GHz, 4.01 instrs/elem
sum2: 8465052097154389858, 0.57us, 27.97GB/s, 1.99 cycles/elem, 5.02 instrs/cycle, 3.48GHz, 10.01 instrs/elem
sum3: 8465052097154389858, 0.36us, 44.96GB/s, 1.24 cycles/elem, 4.05 instrs/cycle, 3.48GHz, 5.01 instrs/elem
128kB, 10000 iterations
sum1: 6947699366156898439, 9.05us, 14.14GB/s, 3.95 cycles/elem, 1.01 instrs/cycle, 3.49GHz, 4.00 instrs/elem
sum2: 6947699366156898439, 4.51us, 28.38GB/s, 1.97 cycles/elem, 5.09 instrs/cycle, 3.49GHz, 10.00 instrs/elem
sum3: 6947699366156898439, 3.79us, 33.80GB/s, 1.65 cycles/elem, 3.03 instrs/cycle, 3.49GHz, 5.00 instrs/elem
5000kB, 100 iterations
sum1: 2134986631019855758, 0.35ms, 14.09GB/s, 3.95 cycles/elem, 1.01 instrs/cycle, 3.48GHz, 4.00 instrs/elem
sum2: 2134986631019855758, 0.19ms, 26.27GB/s, 2.12 cycles/elem, 4.72 instrs/cycle, 3.48GHz, 10.00 instrs/elem
sum3: 2134986631019855758, 0.17ms, 28.93GB/s, 1.93 cycles/elem, 2.60 instrs/cycle, 3.48GHz, 5.00 instrs/elem
4294MB, 1 iterations
sum1: 15446485409674718527, 0.44 s, 9.66GB/s, 5.76 cycles/elem, 0.69 instrs/cycle, 3.48GHz, 4.00 instrs/elem
sum2: 15446485409674718527, 0.33 s, 13.19GB/s, 4.22 cycles/elem, 2.37 instrs/cycle, 3.48GHz, 10.00 instrs/elem
sum3: 15446485409674718527, 0.30 s, 14.20GB/s, 3.91 cycles/elem, 1.28 instrs/cycle, 3.47GHz, 5.00 instrs/elem
```

The numbers are provided by the Linux `perf_event_open` syscall.

The first three datasets are meant to fit in the L1 / L2 / L3 cache. In those cases, the improvements are very pronounced, and `sum3` is crunching the data at around 4 instructions per cycle, which should be close to the limit on the processor I tested the code on. When the data does not fit in the cache, the bottleneck becomes filling it, and we process the data at roughly 15GB/s.

I believe that this is as fast as one can go with “simple” single-threaded reading from RAM, and it’s consistent with data from `sysbench`:

```
$ sysbench memory --memory-block-size=1G --memory-oper=read --threads=1 run
...
102400.00 MiB transferred (15089.75 MiB/sec)
...
```

The RAM-reading speed could probably be improved using SIMD streaming instructions or by reading from multiple threads, although the implementation would be significantly more complicated.

And so we complete our journey into this low-level trick! If you want more of this, I can’t recommend Per’s account enough – figuring out how his tricks work has been very educational.

Thanks to Alexandru Scvortov, Niklas Hambüchen, Alex Appetiti, and Carter T Schonwald for reading drafts of this post. Niklas also clarified some details regarding RAM speeds, and suggested `sysbench` to measure single-threaded RAM reading speed in particular. Also thanks to Per Vognsen and Jason Rohem for spotting a few typos, and to Rihalto for pointing out a better `sum3` and some misleading wording.

Alexander Monakov suggested a more robust C function which works well with both `gcc` and `clang`, performs as well as `sum3`, and does not resort to any assembly:

```
uint64_t sum5(Node *node) {
    uint64_t value = 0;
    for (; node; node = node->next) {
        for (;;) {
            value += node->value;
            if (node + 1 != node->next) {
                break;
            }
            node++;
        }
    }
    return value;
}
```