Two mul or not two mul: how I found a 20% improvement in ed25519 in golang
Written by Egon Elbre
One of the heavy CPU uses in the Storj network is the Edwards 25519 signature scheme. Downloading and uploading data requires sending signed messages from the Satellite (server) to the Storage Node (a computer holding data), and the signing ensures that the messages aren't tampered with. Hence, the Satellite needs to do a lot of signing and verification; the same goes for the Uplink (client application) and the Storage Node. Full details of the protocol can be read in the whitepaper, but they're not that relevant to this optimization story.
Profiling
After profiling signing and verifying, I landed in edwards25519/field.
The code implements multiplication over field elements. For our optimization purposes, you can think of a field element as a large number split into five 51-bit limbs, represented roughly as:
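```go
// A sketch of the representation used by Go's
// crypto/internal/edwards25519/field package (lightly abridged).
// An element t represents the integer
//
//	t.l0 + t.l1*2^51 + t.l2*2^102 + t.l3*2^153 + t.l4*2^204
//
// over GF(2^255 - 19); each limb nominally holds 51 bits, with
// headroom in the uint64 for carries.
type Element struct {
	l0, l1, l2, l3, l4 uint64
}
```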
For the multiplication there are helpful diagrams directly in the source code:
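```
                       a4   a3   a2   a1   a0  ×
                       b4   b3   b2   b1   b0  =
                      -------------------------
                     a4b0 a3b0 a2b0 a1b0 a0b0  +
                a4b1 a3b1 a2b1 a1b1 a0b1       +
           a4b2 a3b2 a2b2 a1b2 a0b2            +
      a4b3 a3b3 a2b3 a1b3 a0b3                 +
 a4b4 a3b4 a2b4 a1b4 a0b4                      =
------------------------------------------------
   r8   r7   r6   r5   r4   r3   r2   r1   r0
```

That's a rough rendering of the comment in fe_generic.go. The key to the simplification is the reduction identity a*2^255 + b ≡ a*19 + b (mod 2^255 - 19), which lets the high columns r5…r8 be folded back into r0…r3 after multiplying them by 19.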
After simplifying these with the reduction identity, you end up with a computation along these lines:
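```
r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1)
r1 = a0×b1 + a1×b0 + 19×(a2×b4 + a3×b3 + a4×b2)
r2 = a0×b2 + a1×b1 + a2×b0 + 19×(a3×b4 + a4×b3)
r3 = a0×b3 + a1×b2 + a2×b1 + a3×b0 + 19×(a4×b4)
r4 = a0×b4 + a1×b3 + a2×b2 + a3×b1 + a4×b0
```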
So you would compute the new r0 value as (ignoring carry-over):
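```
r0 = a0×b0 + 19×a1×b4 + 19×a2×b3 + 19×a3×b2 + 19×a4×b1
```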
Which in code looks roughly like the following (a sketch based on fe_generic.go; the mulR0 wrapper and the inlined helpers are just for illustration):
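```go
package field

import "math/bits"

// uint128 is a 128-bit accumulator for the limb products.
type uint128 struct{ lo, hi uint64 }

// mul64 returns a × b as a 128-bit value.
func mul64(a, b uint64) uint128 {
	hi, lo := bits.Mul64(a, b)
	return uint128{lo, hi}
}

// addMul64 returns v + a×b.
func addMul64(v uint128, a, b uint64) uint128 {
	hi, lo := bits.Mul64(a, b)
	lo, carry := bits.Add64(lo, v.lo, 0)
	hi, _ = bits.Add64(hi, v.hi, carry)
	return uint128{lo, hi}
}

// mulR0 computes r0 = a0×b0 + 19×(a1×b4 + a2×b3 + a3×b2 + a4×b1).
func mulR0(a0, a1, a2, a3, a4, b0, b1, b2, b3, b4 uint64) uint128 {
	a1_19 := a1 * 19
	a2_19 := a2 * 19
	a3_19 := a3 * 19
	a4_19 := a4 * 19

	r0 := mul64(a0, b0)
	r0 = addMul64(r0, a1_19, b4)
	r0 = addMul64(r0, a2_19, b3)
	r0 = addMul64(r0, a3_19, b2)
	r0 = addMul64(r0, a4_19, b1)
	return r0
}
```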
Optimizing
There seemed to be some performance on the table: some inlining, some reordering of the operation chains. But one thing caught my eye while comparing the assembly between AMD64 and ARM64: the multiplication by 19.
Helpful tools for examining assembly are godbolt, lensm and objdump.
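For a single x * 19, the two backends generate roughly the following (a reconstruction, not the article's original listing):

```
// arm64: load the constant, then a plain multiply
MOVD $19, R1
MUL  R1, R0, R0

// amd64: no multiply at all; two LEAQ instructions instead
LEAQ (AX)(AX*8), CX   // CX = AX + AX*8 = 9*x
LEAQ (AX)(CX*2), AX   // AX = AX + CX*2 = 19*x
```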
For arm64, the code looks obvious: take two numbers and then multiply them. However, for amd64 it's less so. If you are wondering, the LEAQ instruction allows you to do a computation LEAQ (Base)(Offset*Scale), Result, which is equivalent to Result := Base + Offset * Scale. It's really helpful for indexing into arrays, and sometimes for doing an x + y * k operation.
If you decipher the amd64 code, you end up with:
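```go
// a Go rendering of what the two LEAQ instructions compute
// (the function name is just for illustration):
func mul19ViaLEA(x uint64) uint64 {
	nineX := x + x*8   // first LEAQ:  x + 8*x    = 9*x
	return x + nineX*2 // second LEAQ: x + 2*(9x) = 19*x
}
```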
At the time I was thinking, "Oh, just some optimization rules are missing from the arm64 backend." I'll test how this optimization affects things by manually replacing the x * 19 with mul19:
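```go
// A sketch of the replacement; the actual change may differ in detail.
// mul19 computes x * 19 with shifts and adds, so neither backend
// needs a multiply instruction.
func mul19(x uint64) uint64 {
	return x + (x+x<<3)<<1 // x + 2*(x + 8*x) = 19*x
}
```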
I also removed the initial caching of the a1 * 19 values to reduce register usage, which seemed to help too. The result: a 3-6% improvement to multiplication, squaring, and inverting on a Mac M1.
I did a bunch of other optimizations, which overall resulted in a ~20% improvement for arm64. Of course, one thing did bother me before trying to get things merged -- the compiler should have done the *19 optimization, not me.
Trying to optimize the compiler
I thought about the optimization a bit to generalize it: some general rules for converting multiplication into additions and shifts, roughly along these lines.
For example:
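```
// constants of the form 2^n + 1: one shift, one add
x * (1<<n + 1)  =>  x<<n + x      // e.g. x*9 = x<<3 + x
```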
There's another variant of it:
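```
// constants of the form 2^n - 1: one shift, one subtract
x * (1<<n - 1)  =>  x<<n - x      // e.g. x*7 = x<<3 - x
```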
And additionally:
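```
// even constants: factor out a final shift
x * ((1<<n ± 1) << m)  =>  (x<<n ± x) << m   // e.g. x*20 = (x<<2 + x) << 2
```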
So, I created an issue report and discussed things further in Go Issue 67575.
At this point it did seem like a trivial optimization; however, Erifan and Randall pointed out that they weren't really seeing the performance improvements, and that the cost of a shift operation can depend on the size of the shift.
I did a quick benchmark myself by testing these functions:
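```go
// A reconstruction for illustration; the exact benchmark functions
// are in the issue discussion. Chaining x through the loop measures
// the latency of the operation rather than its throughput.
package mulbench

import "testing"

var sink uint64

func mulConst(x uint64) uint64 { return x * 19 }

func shiftAdd(x uint64) uint64 { return x + (x+x<<3)<<1 }

func BenchmarkMulConst(b *testing.B) {
	x := uint64(1)
	for i := 0; i < b.N; i++ {
		x = mulConst(x)
	}
	sink = x
}

func BenchmarkShiftAdd(b *testing.B) {
	x := uint64(1)
	for i := 0; i < b.N; i++ {
		x = shiftAdd(x)
	}
	sink = x
}
```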
The results were annoying: on one ARM machine, multiplication was faster, and on the other, shifting and adding. And there isn't a really nice way to conditionalize per target CPU in the gc compiler.
After discussing this optimization, it seemed like a good idea to leave the multiplication as is for now. You can still do the replacement manually, when it matters for your specific platform and application.
But wait, it did help before!
The manual *19 optimization to ed25519 helped performance by a significant margin, on both platforms. What gives?
One thing to notice about the ed25519 code is that it's already quite heavy on multiplication. This got me thinking: how many instructions can you actually do in parallel?
One amazing collection of suggestions for low-level optimizations is Agner Fog's resources.
There are three main considerations for how many instructions we can do in parallel:

A. Are we waiting for data to arrive from RAM or the caches?
B. Are there dependency chains between the instructions?
C. Which instructions can be executed at the same time?
Waiting for data from RAM is the obvious one. If we don't have the data, we cannot calculate things.
The second consideration is dependency chains. When a calculation takes 3 cycles and another instruction needs to wait for its result, the two cannot run in parallel.
A trivial example would be the difference between:
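```go
// a sketch: the same product, with different dependency structure

func serial(a, b, c, d uint64) uint64 {
	// each multiply must wait for the previous one:
	// the dependency chain is three multiplies long
	return ((a * b) * c) * d
}

func balanced(a, b, c, d uint64) uint64 {
	// a*b and c*d are independent and can execute in parallel;
	// the chain is only two multiplies long
	return (a * b) * (c * d)
}
```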
In practice, of course, this kind of reordering may not help, because compilers are quite clever; and similarly, CPUs are clever and may reorder things on their own.
And the third is: how many ops can we do in parallel on a single core? Every modern core has multiple execution units. The execution units aren't equal either; there are two or more integer units, one or two floating-point addition units, and one or two floating-point multiplication units. So, you can perform an integer addition, a floating-point addition, and a floating-point multiplication at the same time. Of course, the internal details of cores can vary wildly. See more details in Optimizing C++ by Agner Fog.
I didn't measure exactly which execution units were overwhelmed and which were idle; either way -- we got a performance improvement.
Back to the original plan
After all of this, it turned out the original changes were already a good solution.
I redid my optimizations and benchmarked them on different platforms, roughly ending up with a ~20% total improvement on arm64 and ~8% on amd64. Of course, your results may vary depending on your CPU.
For future work, additional optimizations are possible; for example, there are also AVX2 and AVX-512 implementations that perform even better.
These improvements will be coming to you soon, in Go 1.25.