The results of our labour, for AMD64 can be found on the asm branch on github:
This is not a proper revision of bsdnt, but a heavily hacked version, just to demonstrate the principles.
In the nn.c file is some inline assembly code which implements the nn_add_mc function in x64 assembler. The format used is Intel format assembly code, instead of the usual AT&T syntax used by inline gcc normally. We made a change to the makefile so that gcc would use the Intel format assembly code pervasively. Note, this will cause gcc to miscompile bsdnt for many revisions of gcc. This is due to bugs in gcc. Specifically gcc versions 4.4.1 and 4.4.4 work, however.
At this point we haven't unrolled the loop (repeated the contents of the loop over and over to save some time on loop arithmetic). But the results are still pretty nice.
The straight C version we had before took about 12 cycles per word (with gcc 4.4.1 on an AMD Opteron K10 machine). With this new assembly code it takes about 3 cycles per word.
With loop unrolling we might hope to halve that again, but this messy optimisation will have to wait, otherwise work on the functionality of the library will slow right down.
We also wrote some assembly code for Core 2 machines. This runs in about 4.5 cycles per word. At the moment this is not committed anywhere, but you can find a copy here:
The important thing to realise about assembly language is it is highly architecture dependent. Our AMD64 code is actually slower than C on an Intel Core 2 machine! The assembly code needs to be prepared for the machine in question. Moreover, this code won't work at all on Windows 64 or on any 32 bit machine!
At some point we'll introduce a configuration script which will select the correct assembly code for the architecture being used.
To actually time this code, we simply hacked t-nn.c to only run one of the nn_add_m tests, but with the nn_add_m executed 1000 times and the actual test that the code works, turned off.
We set the number of words being added in each nn_add_m to 24, as the test machine we used was 2.4GHz. This means the number of seconds the test takes to run is approximately the number of cycles per word. This is of course a temporary hack, just to illustrate working assembly code (the test will pass if you remove the loop that calls nn_add_m a thousand times instead of once).