Branching vs cmov vs gcc vs clang
I recently solved a performance riddle in some binary tree code I’m working on. For fun, I’m writing all sorts of different binary tree variations. I’m primarily using gcc, but I occasionally spot-check things using clang. The performance differences between the implementations were unexpectedly large, and merely switching to clang produced inexplicable improvements: 20%, 30%, even nearly 50%, well beyond anything I was expecting. Why would clang be generating code that performed so much better than gcc’s?
TLDR: branches vs branchless code. In this project clang often generates code that uses cmov or is otherwise branchless, whereas gcc generates code that uses branches for the same source. Across my different tree implementations, ostensibly similar code can induce gcc to choose branchless or branched code, resulting in large performance differences. In the particular cases I’m looking at, branchless is better.
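To make the distinction concrete, here is a minimal conditional assignment (hypothetical, not from the tree code) that a compiler may lower either way:

// The compiler can lower the ternary below either to a compare-and-jump
// (a branch the CPU must predict) or to a cmov (no prediction needed,
// but a data dependency instead); the choice is up to its cost model.
int pick(bool take_a, int a, int b) {
  return take_a ? a : b;
}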
Example 1: An array by any other name
Consider a node definition:
struct Node {
  int key;
  Node* left;
  Node* right;
};
and a function using it:
Node* GetInsertEqualPos(Node* x, Node* y, int key) {
  while (x != nullptr) {
    y = x;
    x = (key < x->key) ? x->left : x->right;
  }
  return y;
}
This function is finding the correct parent under which to insert a new node holding a key. In this case both gcc and clang generate a tight “branchless” loop, which is to say that the conditional assignment on line 4 is performed without an explicit “jump” in the code.
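For context, here is a hypothetical caller (illustrative, not from the actual tree code) showing how the returned parent gets used:

void InsertEqual(Node*& root, Node* n) {
  // Find the parent, then hang the new node off it. The search walked
  // toward a null child, so the chosen slot is guaranteed to be empty.
  Node* parent = GetInsertEqualPos(root, nullptr, n->key);
  n->left = n->right = nullptr;
  if (parent == nullptr)
    root = n;                    // tree was empty
  else if (n->key < parent->key)
    parent->left = n;
  else
    parent->right = n;           // equal keys go to the right
}

A real red-black tree would rebalance after this, but the search loop is the hot part.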
The code gcc generates looks like this. The tight loop is labeled .L6. The function’s x variable is in the rcx register, y in rax. The code loads x->right into rcx unconditionally, then uses cmov to overwrite that with x->left based on the key comparison. The only branch the processor attempts to predict is jne .L6, which it will tend to predict successfully, as this code will usually loop multiple times.
GetInsertEqualPos(Node*, Node*, int):
        mov     rax, rsi
        test    rdi, rdi
        je      .L1
        mov     rcx, rdi
.L6:
        mov     rdi, QWORD PTR [rcx+8]
        mov     rax, rcx
        cmp     DWORD PTR [rcx], edx
        mov     rcx, QWORD PTR [rcx+16]
        cmovg   rcx, rdi
        test    rcx, rcx
        jne     .L6
.L1:
        ret
Clang generates code that is different and perhaps even clever. The register assignments change: x is in rdi and y is still in rax. The counter register, referred to here by its names ecx and cl, is used as an index into the Node struct, picking either the left or right member based on the comparison.
GetInsertEqualPos(Node*, Node*, int):
        mov     rax, rsi
        test    rdi, rdi
        je      .LBB0_3
.LBB0_1:
        mov     rax, rdi
        xor     ecx, ecx
        cmp     dword ptr [rdi], edx
        setle   cl
        mov     rdi, qword ptr [rdi + 8*rcx + 8]
        test    rdi, rdi
        jne     .LBB0_1
.LBB0_3:
        ret
This code can be seen on godbolt here.
When originally writing some of the binary tree code I had an idea similar to what clang is doing above, so my Node struct began life with a child[2] array like this:
struct Node {
  int key;
  Node* child[2];
};
And the GetInsertEqualPos function was written like this, relying on the fact that the bool expression !(key < x->key) converts to 0 or 1 and so can index the child array directly:
Node* GetInsertEqualPos(Node* x, Node* y, int key) {
  while (x != nullptr) {
    y = x;
    x = x->child[!(key < x->key)];
  }
  return y;
}
With this change both gcc and clang generate nearly identical code, similar to the clang example above, as can be seen on godbolt here.
At some point I wanted to make the code more readable, so I introduced helpers and then modified the code to use them:
struct Node {
  int key;
  Node* child[2];
};

Node* left(Node* x) { return x->child[0]; }
Node* right(Node* x) { return x->child[1]; }

Node* GetInsertEqualPos(Node* x, Node* y, int key) {
  while (x != nullptr) {
    y = x;
    x = (key < x->key) ? left(x) : right(x);
  }
  return y;
}
With this, clang generated essentially the same code as it has in every example so far, but gcc generated yet another variation that used explicit branches for everything. This can be seen on godbolt here, or below:
GetInsertEqualPos(Node*, Node*, int):
        mov     rax, rdi
        test    rdi, rdi
        jne     .L5
        jmp     .L9
.L11:
        mov     rcx, QWORD PTR [rax+8]
        test    rcx, rcx
        je      .L10
.L7:
        mov     rax, rcx
.L5:
        cmp     DWORD PTR [rax], edx
        jg      .L11
        mov     rcx, QWORD PTR [rax+16]
        test    rcx, rcx
        jne     .L7
.L10:
        ret
.L9:
        mov     rax, rsi
        ret
And it was this variation that was abysmally slow. Binary trees take the left or right path nearly randomly. The processor was unable to successfully predict them, and so the code pays the cost of a branch misprediction quite often. In my benchmarks this penalty amounted to about a 20-30% performance loss.
I switched back to explicit Node* left; and Node* right; fields and gcc now generated branchless code. Et voilà, my Red-Black tree was now on par with C++’s std::multimap when compiled with gcc.
Example 2: The lower bounds of sadness
This is an example of a function that clang seems to happily compile to branch free code, but gcc does not. I can’t figure out how to get gcc to use branch free code, hence the sadness. This function returns the “lower bound” for a key, that is, the first node with a key greater than or equal to a given key. The code is quite similar to GetInsertEqualPos except that there is another conditionally assigned variable in the loop. Using the same Node definition, the code looks like this:
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    if (!(x->key < key)) {
      lower = x;
      x = x->left;
    } else {
      x = x->right;
    }
  }
  return lower;
}
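To pin down the semantics with a hypothetical example, for a tree holding the keys {10, 20, 30}:

// LowerBound(root, 15) -> the node with key 20 (the first key >= 15)
// LowerBound(root, 20) -> the node with key 20 (an equal key counts)
// LowerBound(root, 31) -> nullptr              (no key >= 31)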
Now, if we compile this code with clang and gcc on godbolt, we get this for gcc:
LowerBound(Node*, int):
        xor     eax, eax
        test    rdi, rdi
        je      .L1
.L4:
        cmp     esi, DWORD PTR [rdi+16]
        jle     .L6
        mov     rdi, QWORD PTR [rdi+8]
        test    rdi, rdi
        jne     .L4
.L1:
        ret
.L6:
        mov     rax, rdi
        mov     rdi, QWORD PTR [rdi]
        test    rdi, rdi
        jne     .L4
        ret
Here, gcc is using comparison tests and branches for everything, placing the code for the lower = x branch under the .L6 label. The branch that the CPU simply can’t predict reliably is the jle .L6, since the binary tree search tends to zig zag arbitrarily down the tree.
Clang chooses cmov for both assignments:
LowerBound(Node*, int):                # @LowerBound(Node*, int)
        xor     eax, eax
        test    rdi, rdi
        je      .LBB0_3
.LBB0_1:                               # =>This Inner Loop Header: Depth=1
        lea     rcx, [rdi + 8]
        cmp     dword ptr [rdi + 16], esi
        cmovge  rax, rdi
        cmovge  rcx, rdi
        mov     rdi, qword ptr [rcx]
        test    rdi, rdi
        jne     .LBB0_1
.LBB0_3:
        ret
Like the GetInsertEqualPos function, LowerBound performs better with branch free code. So, let’s look at one way to force the issue:
#include <cstdint>

// Turn a bool into a mask that is all zeros or all ones, then use the
// mask to pick one of two pointers: b ^ ((a ^ b) & mask) yields a when
// the mask is all ones and b when the mask is zero.
template <typename T>
T* Select(bool cond, T* a, T* b) {
  const uintptr_t mask = -static_cast<uintptr_t>(cond);
  const uintptr_t av = reinterpret_cast<uintptr_t>(a);
  const uintptr_t bv = reinterpret_cast<uintptr_t>(b);
  return reinterpret_cast<T*>(bv ^ ((av ^ bv) & mask));
}

Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    const bool ge = !(x->key < key);
    lower = Select(ge, x, lower);
    x = Select(ge, x->left, x->right);
  }
  return lower;
}
What have we done here? Branch free code using bit hackery. The Select template contains all the trickery: the basic idea is to turn a boolean into a bitmask that is either zero or all ones, then use the bitmask to choose one of two values.
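Worked through for both values of the boolean on a 64-bit machine:

// cond == true:
//   mask = -uintptr_t(1) = 0xFFFFFFFFFFFFFFFF    (all ones)
//   b ^ ((a ^ b) & mask) = b ^ (a ^ b) = a
// cond == false:
//   mask = -uintptr_t(0) = 0x0000000000000000    (all zeros)
//   b ^ ((a ^ b) & 0)    = b ^ 0       = b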
Gcc generates branch free code (godbolt link):
LowerBound(Node*, int):
        xor     ecx, ecx
        test    rdi, rdi
        je      .L1
.L3:
        xor     eax, eax
        cmp     DWORD PTR [rdi+16], esi
        mov     rdx, rdi
        mov     r8, QWORD PTR [rdi]
        setge   al
        xor     rdx, rcx
        neg     rax
        and     rdx, rax
        xor     rcx, rdx
        mov     rdx, QWORD PTR [rdi+8]
        xor     r8, rdx
        mov     rdi, rdx
        and     rax, r8
        xor     rdi, rax
        cmp     rdx, rax
        jne     .L3
.L1:
        mov     rax, rcx
        ret
So, I guess that is good? We’ll see soon. First, the cool part. Clang has generated identical code for this source, using cmov, as it did with the straightforward source written with if statements. Clang essentially de-obfuscated the bit twiddling done in the new Select function, figured out its final effect, and rewrote semantically equivalent code using cmov.
LowerBound(Node*, int):                # @LowerBound(Node*, int)
        xor     eax, eax
        test    rdi, rdi
        je      .LBB0_3
.LBB0_1:                               # =>This Inner Loop Header: Depth=1
        lea     rcx, [rdi + 8]
        cmp     dword ptr [rdi + 16], esi
        cmovge  rax, rdi
        cmovge  rcx, rdi
        mov     rdi, qword ptr [rcx]
        test    rdi, rdi
        jne     .LBB0_1
.LBB0_3:
        ret
Now, as we saw, gcc didn’t do quite as well, but it did generate branch free code. How “bad” was the branchy code before, and how “good” is the new code? I wrote a benchmark and, well, it depends. Generally, if the binary tree fits entirely within the L1 cache, the branchy code does better, presumably because the misprediction penalty is fairly low when the data is already in L1. Once the tree is larger than the L1 cache, the branch free code begins to win big.
| Height | Nodes | Branchy GCC | Branchless GCC | Change | Notes |
|---|---|---|---|---|---|
| 8 | 255 | 7ns | 8.3ns | +18% | |
| 10 | 1023 | 8.5ns | 11ns | +33% | fits in L1 cache |
| 11 | 2047 | 9ns | 14ns | +49% | 2x bigger than L1 cache |
| 12 | 4095 | 40ns | 17ns | -56% | 4x bigger than L1 cache |
| 14 | 16383 | 60ns | 28ns | -54% | fits in L2 cache |
| 16 | 65535 | 107ns | 53ns | -49% | |
| 18 | 262143 | 114ns | 60ns | -47% | |
| 20 | 1048576 | 148ns | 85ns | -42% | |
How does this new “Branchless GCC” approach compare against clang, which uses cmov? Clang wins in all cases, with gcc being between 35% and 55% worse.
| Height | Nodes | Clang | Branchless GCC | Change | Notes |
|---|---|---|---|---|---|
| 8 | 255 | 5ns | 8.3ns | +55% | |
| 10 | 1023 | 7ns | 11ns | +55% | fits in L1 cache |
| 11 | 2047 | 9ns | 14ns | +49% | 2x bigger than L1 cache |
| 12 | 4095 | 11.5ns | 17ns | +52% | 4x bigger than L1 cache |
| 14 | 16383 | 20ns | 28ns | +44% | fits in L2 cache |
| 16 | 65535 | 37ns | 53ns | +45% | |
| 18 | 262143 | 42ns | 60ns | +41% | |
| 20 | 1048576 | 63ns | 85ns | +35% | |
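For reference, a minimal harness along these lines (a sketch, not the benchmark that produced the tables above) can reproduce the shape of these numbers: build a tree of a given height, then time LowerBound over a batch of pre-generated random keys.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Assumes the Node and LowerBound definitions from above.

// Build a perfectly balanced tree holding the keys [lo, hi].
Node* Build(int lo, int hi) {
  if (lo > hi) return nullptr;
  int mid = lo + (hi - lo) / 2;
  return new Node{mid, Build(lo, mid - 1), Build(mid + 1, hi)};
}

int main() {
  for (int height = 8; height <= 20; height += 2) {
    const int n = (1 << height) - 1;
    Node* root = Build(0, n - 1);

    // Pre-generate the keys so RNG cost isn't measured.
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, n - 1);
    std::vector<int> keys(1 << 20);
    for (int& k : keys) k = dist(rng);

    uintptr_t sink = 0;  // keep results live so the loop isn't optimized away
    auto start = std::chrono::steady_clock::now();
    for (int k : keys) sink += reinterpret_cast<uintptr_t>(LowerBound(root, k));
    std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;

    std::printf("height %2d: %5.1f ns/query (sink=%llu)\n", height,
                elapsed.count() / keys.size(), (unsigned long long)sink);
    // Nodes are deliberately leaked; this is a sketch, not production code.
  }
}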
Lessons
First, don’t listen to the internet without thinking. There are tons of resources on the internet telling you things like:
- The compiler is so good at optimization that you should just ignore what it does, write simple code, and hope for the best.
- “Branchless” code is usually worse because it introduces a “data dependency” and therefore defeats the processor’s ability to execute things out of order. Processors are now quite good at things like out of order execution, speculative branch prediction, etc.
- You’ll find people saying that clang is more likely to generate code with branches, whereas gcc is more likely to use cmov.
Lies, all lies! It is only through careful benchmarking and performance analysis that you can really know what is going on with your program. Learn to effectively use tools like Linux perf, objdump, and godbolt.org. Learn to write and run benchmarks that are unlikely to lie to you. Doing this is a black art; see LLVM’s benchmarking tips for just a taste.
Lies, all lies? Not really. What I’ve stumbled upon here is an edge case, where code is predictably not predictable, which makes the processor’s branch prediction predictably bad. Most branches in most code are fairly predictable, so the conventional wisdom is actually correct. The trick is learning to recognize when you’ve actually got an edge case on your hands, and knowing what to do about it.
Further musing
Krister Walfridsson writes about this same issue in his blog post “Branches/cmov and compiler optimizations”. He goes into more detail about the interaction between compiler optimizations and cmov. I thought this point of his was particularly interesting:
Many (most?) cases where GCC incorrectly decides to use a branch instead of emitting branchless code comes from optimization passes transforming the code to a form where the backend cannot change it to use conditional moves.
Check out Linus Torvalds’ explanation of why cmov is “generally a bad idea on an aggressively out-of-order CPU” (backup link, and another backup link). His explanation of why predictable branches can be effectively free is succinct and clear.
Some search terms that turn up useful related stuff on the web:
- branchless
- predicated instruction
- gcc’s command line flags: -fno-if-conversion, -fno-if-conversion2, -fno-tree-loop-if-convert (see docs).