Reading, Writing and Arithmetic: September 2010

Saturday 25 September 2010

BSDNT - configure and assembly improvements

It's time we added architecture dependent assembly support to bsdnt.

Here is how we are going to do it. For each of the implementation files

we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,

e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc.

This file will be #included in the relevant implementation file, e.g.

nn_linear_arch.h will be #included in nn_linear.c.

These _arch.h files will be generated by a configure script, based on

the CPU architecture and operating system kernel. They will merely

include a list of architecture specific .h files in an arch directory.

For example we might have nn_linear_x86_64_core2.h in the arch directory,

which provides routines specific to core2 processors running in 64 bit

mode.

In these architecture specific files, we'll have inline assembly routines

designed to replace various routines in the pure C implementation files

that we have already written. They'll do this by defining flags, e.g.

HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific

version of nn_mul1_c is available. We'll then wrap the implementation of

nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,

the C version will not be compiled.

In order to make this work, the configure script has to work out whether

the machine is 32 or 64 bit and what the CPU type is. It will then link

in the correct architecture specific files.

At the present moment, we are only interested in x86 machines running on

*nix (or Windows, but the architecture will be determined in a different

way on Windows).

A standard way of determining whether the kernel is 64 bit or not is to

search for the string x86_64 in the output of uname -m. If something else

pops out then it is probably a 32 bit machine.

Once we know whether we have a 32 or 64 bit machine, we can determine the

exact processor by using the cpuid instruction. This is an assembly

instruction supported by x86 cpus which tells you the manufacturer, family

and model of the CPU.

We include a small C program cpuid.c with some inline assembly to call the

cpuid instruction. As this program will only ever be run on *nix, we can

make use of gcc's inline assembly feature.

When the parameter to the cpuid instruction is 0 we get the vendor ID,

which is a 12 character string. We are only interested in "AuthenticAMD"

and "GenuineIntel" at this point.

When we pass the parameter 1 to the cpuid instruction, we get the processor

model and family.

For Intel processors, table 2-3 in the following document gives information

about what the processor is:

http://www.intel.com/Assets/PDF/appnote/241618.pdf

However the information is out of date. Simply googling for Intel Family 6

Model XX reveals other models that are not listed in the Intel documentation.

The information for AMD processors is a little harder to come by. However,

one can essentially extract the information from the revision guides, though

it isn't spelled out clearly:

http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides

It seems AMD only list recent processors here, and they are all 64 bit.

Information on 32 bit processors can be found here:

http://www.sandpile.org/ia32/cpuid.htm

At this point we'd like to identify numerous different architectures. We

aren't interested in 32 bit architectures, such as the now ancient Pentium

4 or AMD's K7. Instead, we are interested when 32 bit operating system

kernels are running on 64 bit machines. Thus all 32 bit CPUs simply identify

as x86.

There are numerous 64 bit processors we are interested in:

64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4.

All the 64 bit ones support SSE2 and SSE3 and are based on the netburst

technology.

The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.

They were an enhanced version of the p6 architecture. They get branded as x86

for which only generic 32 bit assembly code will be available, if any.

Core 2's are very common. They will identify as core2. They all support SSE2,

SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we

don't distinguish these at this stage).

Atoms are a low voltage processor from Intel which support SSE2, SSE3 and are

mostly 64 bit. We identify them as atom.

More recently Intel has released Core i3, i5, i7 processors, based on the

Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,

SSSE3, SSE4.1 and SSE4.2.

AMD K8's are still available today. They support SSE2 and SSE3. We identify

them as k8.

AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a.

We identify these as k10. There are three streams of AMD K10 processors,

Phenom, Phenom-II and Athlon-II. We don't distinguish these at this point.

So in summary, our configure script first identifies whether we have a 32 or

64 bit *nix kernel. Then the CPU is identified as either x86, p4, core2,

nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.

Our configure script then links in architecture specific files as appropriate.

The only assembly code we include so far are the nn_add_mc functions we wrote

for 64 bit core2 and k10. As these are better than nothing on other 64 bit

processors from the same manufacturers, we include this code in the k8

specific files until we write versions for each processor. We also add an

nn_sub_mc assembly file for Intel and AMD 64 bit processors.

The configure script includes the architecture specific .h files starting from

the most recent processors so that code for earlier processors is not used

when something more recent is available.

The github branch for this revision is here: asm2

Previous article: v0.12 mul_classical

Next article: v0.13 - muladd1, muladd

Friday 24 September 2010

BSDNT - v0.12 mul_classical

I've been on holidays for a couple of days, hence the break in transmission.

(For academics: the definition of a holiday is a day where you actually

stop working and do nothing work related for that whole day. I realise the

definition is not widely known, nor is it well-defined.)

Note, due to accidentally including some files in today's release that I

intended to hold over till next time, you have to do ./configure before

make check. This detects your CPU and links in any assembly code

that is relevant for it. More on this when I do the actual blog post about it.

It's time to start implementing quadratic algorithms (i.e. those that

take time O(n^2) to run, such as classical multiplication and division).

Before we do this, we are going to reorganise slightly. The file which

is currently called nn.c we will rename nn_linear.c to indicate that it

contains our linear functions. We'll also rename t-nn.c to t-nn_linear.c.

The macros in that file will be moved into test.h.

We also modify the makefile to build all .c files in the current directory

and to run all tests when we do make check. A new directory "test" will

hold our test files.

Finally, we add an nn_quadratic.c and test/t-nn_quadratic.c to hold the

new quadratic routines and test code that we are about to write.

The first routine we want is nn_mul_classical. This is simply a mul1

followed by addmul1's. Of course this will be horrendously slow, but once

again we defer speeding it up until we start adding assembly routines,

which will start happening shortly.

In a departure from our linear functions, we do not allow zero lengths.

This dramatically simplifies the code and means we do not have to check

for special cases.

There does not appear to be any reason to allow a carry-in to our

multiplication routine. An argument can be made for it on the basis of

consistency with mul1. However, the main use for carry-in's and carry-out's

thus far has been for chaining.

For example, we chained mul1's s*c and t*c for s and t of length m and

c a single word in order to compute (s*B^m + t)*c.

But the analogue in the case of full multiplication would seem to be

chaining to compute (s*B^m + t)*c where s, t and c all have length m. But

it probably doesn't make sense to chain full multiplications in this way

as it would involve splitting the full product t*c say, into two separate

parts of length m, which amongst other things, would be inefficient.

It might actually make more sense to pass a whole array of carries to our

multiplication routine, one for every word of the multiplier. However it is

not clear what use this would be. So, for now at least, we pass no carry-ins

to our multiplication routine.

We settle on allowing a single word of carry out. This may be zero if the

leading words of the multiplicands are small enough. Rather than write this

extra word out, we simply return it as a carry so that the caller can decide

whether to write it.

The classical multiplication routine is quite straightforward, but that plus the

rearrangement we've done is more than enough for one day.

The github branch for this release is here: v0.12

Previous article: v0.11 - generic test code

Next article: configure and assembly

Monday 20 September 2010

BSDNT - v0.11 generic test code

It's time we improved our test framework for all the functions we are
adding, before things get out-of-control.

The best strategy is to actually write an interpreter for a small
expression parser which allows us to test arbitrary expressions.
However, that would be a distraction at this point, so we opt to
simply add some convenience functions.

In particular, we'll add some new random functions which generate
lists of random words and random garbage collected nn's. Our only aim
here is to simplify our test code and cut down the number of lines of
it.

We don't want to make it *more* complicated for people to use, and
C89 places some hard limits on us in that regard.

The first step is to add two new files, test.c and test.h which
will contain our garbage collection and random functions.

We begin by adding an enum type_t, which parameterises all the different
types our garbage collection can clean up. At the moment we only need
one type, NN.

Next we add a variadic function (variable number of arguments) which
creates a whole bunch of random words. Unfortunately C89 doesn't support
variadic macros, so we need to pass pointers to the words, so that the
actual words can be modified.

This random function takes a flag which specifies what type of random
number we want. Specifically we'll have flags for ODD, NONZERO, ANY.

C is also so stupid that we actually need to tell it explicitly, or mark
somehow, the number of arguments we are giving to the variadic function.
In C99 one can work around this by wrapping a variadic function with a
variadic macro. But we don't have that facility here. We choose to simply
pass NULL as the final argument to variadic functions.

We also introduce a global garbage stack which will keep a pointer to all
the objects allocated so far, so that a garbage collector can clean them
up later on, when called. The fact that this garbage is global makes it not
threadsafe, but we are only using the test support for test code at this
point and so we don't care about thread safety. Later we could pass a
context around as a parameter if we wanted to make it threadsafe.

We also add a gc _cleanup function, which cleans away all the garbage, so
that memory leaks don't occur. Later we could expand this function to do
redzone checking and various other automatic tests on the garbage it is
disposing of.

Now we can add some convenience functions for generating random
multiprecision integers.

The randoms_of_len function again takes a NULL terminated list of
*pointers* to nn_t's and both initialises them to the given length and
sets them to random limbs. We can specify various flags here, such as
ANY, FULL, the latter returning a multiprecision integer with exactly
the given number of limbs, with the top limb being nonzero (unless
the number of limbs requested is zero).

Now that we have all this, in our test file test.h, we define a macro
TEST_START which takes a number of iterations and simply sets up a loop
with that many iterations. A TEST_END function then calls garbage collection
to clean up at the end.

Finally, we add a test_generics function, which generates some random values
using our new generics and then cleans up.

By running a program called valgrind over our test code, we can see if all
the memory used by our generics actually got cleaned up after use.

We now roll out these changes to the rest of our test code. We note that the

entire test file is now about two-thirds of the length it was before we rewrote

it!

The new branch is on github: v0.11

Previous article: v0.10 - mod1_preinv

Next article: v0.12 - mul_classical

Sunday 19 September 2010

BSDNT - v0.10 mod1_preinv

We have one more linear function to implement before we move on to some

assembly language optimisation and then quadratic functions, namely mod1.

It returns the remainder after division by a single word.

One might think that this cannot be computed any faster than divrem1,

however the following algorithm allows us to perform the computation

slightly faster.

The algorithm is credited to Peter Montgomery. It is based on the following

observation. Suppose we have computed, in advance, the values

b2 = B^2 mod d and b3 = B^3 mod d.

Then a number of the form a0 + a1 * B + a2 * B^2 + a3 * B^3 can be

reduced mod d by computing s = a0 + a1 * B + a2 * b2 + a3 * b3, then

later reducing s mod d.

Note that as d can be at most B - 1, the values bi are at most B - 2.

The values ai are at most B - 1. Thus ai * bi is at most B^2 - 3*B + 2.

This means that when summing up s, we are adding at most B^2 - 3*B + 2

to B^2 - 1 at each step. The total is at most 2*B^2 - 3*B + 1. In

other words, there might be a carry of *1* into a third limb. But we

already know that b2 = B^2 mod d. Thus we can get rid of this 1 from

the third limb by adding b2 to our total in its place. The total will

then be at most B^2 - 2*B - 1, which fits easily into two limbs.

We may need to do this adjustment twice when summing up. It is an

unpredicted branch to do this correction, but with a deep enough pipeline

it is possible the processor will compute both paths, minimising the cost

of a misprediction.

Thus, with some precomputation, we can reduce four limbs of our dividend

to two limbs with 2 multiplications. Another advantage is that these

multiplications are independent. This gives the processor lots of

opportunity to pipeline them and compute them quickly.

One tricky implementation detail is in testing when s overflows two limbs.

We can accumulate into three limbs, but this is not quite efficient.

One way of testing for overflow of a double word addition is to check if

the result is smaller than one of the summands. If so, an overflow

occurred and we can make our adjustment.

In the case where d is at most B/3, we can make another optimisation. Also

compute b1 = B mod d and perform a third multiplication a1 * b1. We have

that bi is less than B/3. Thus the three products ai*bi are at most (B/3 - 1)*(B - 1)

= B^2/3 - 4*B/3 + 1. Clearly s is now at most B^2 - 3*B + 2 and no overflow

occurs. Therefore no tests or adjustments are required to compute s.

I will leave this optimised case as an exercise and just implement the

generic case for now.

At the end of the algorithm we must reduce a double word mod d. This could

be optimised with a precomputed inverse a la divrem1, but again I skip this

optimisation and leave it as an exercise.

At the start of the algorithm we have to set up differently depending on

whether there are an even or odd number of words (including the carry-in).

If there is an odd number in total, we need to do a single reduction of

the extra word so that an even number remain.

We add a new mod_preinv1_t type and a function precompute_mod_inverse1

to compute it. This computes B, B^2 and B^3 mod d. For now this precomputation

is not optimised to use a precomputed inverse. This would actually save

significant time, but again I leave it as an exercise to optimise this.

The end result looks a little messy compared to the algorithms implemented

so far. I wonder if anyone can find any simplifications of the code.

Here v0.10 is the github branch for this post.

Previous article: v0.9 - divrem1_hensel

Next article: v0.11 - generic test code

Saturday 18 September 2010

BSDNT - v0.9 divrem1_hensel

The next division function we'll introduce is so-called Hensel division, or

right-to-left division. This starts at the least significant word and applies

the usual division algorithm back towards the most significant word.

Of course, now any "remainder" will be at the most significant word end and

have a completely different meaning.

Hensel division is actually just division mod B^m. Thus if q is the Hensel

quotient, you have a = q * d mod B^m. In fact, if r is the Hensel remainder,

you have a = q*d + r*B^m.

If a fits in m words and the division is exact (i.e. a = q*d), the Hensel

division and ordinary (euclidean) division return the same quotient. If the

division is not exact, this is no longer the case.

However, it may be that chaining Hensel divisions together does produce an

exact division, even when division over the bottom part of the chain

wouldn't produce an exact division. Thus, the ability to chain Hensel

divisions, and thus the ability to return the Hensel remainder, is

important.

Well, now we come to the first issue if we want to implement this. What is

the C operator for Hensel division? Actually, there doesn't seem to be one.

We have to implement it ourselves.

If a0 is a word from our dividend, and d is our single word divisor, then

we require q0 such that d * q0 = a0 mod B. In other words, we want q0 and

r0 such that d * q0 = a0 + r0*B.

Suppose we can find a value dinv such that dinv * d = 1 mod B. Then

we have that dinv * d * q0 = dinv * a0 mod B, i.e. q0 = dinv * a0 mod B,

which is a single word-by-word mullow.

Once we have q0 we can compute d * q0, the high word of which is the Hensel

"remainder" r0. We need to subtract this from the next limb of our divisor

before continuing.

We'll make the convention that carry-ins and carry-outs from our Hensel

division are positive rather than negative. We'll just remember to subtract

them. This way, if q is our Hensel quotient and r our Hensel remainder,

then d * q = a + r * B^n.

So how do we compute dinv? Firstly, we better require that d is odd, else

dinv * d = 1 mod B is an impossibility. With this restriction, we have

gcd(d, B) = 1 and therefore the extended euclidean algorithm lets us find

s, t such that s*d + t*B = 1. Then modulo B we have s*d = 1 mod B, i.e.

dinv = s is the precomputed inverse that we are after.

Another way to solve dinv * d = 1 mod B is to use Hensel lifting. This

works by first solving the equation mod 2, then use that solution to

solve it mod 4 and so on. This can all be done with nothing more than

multiplications.

Firstly, we know a solution mod 2, namely 1 (as d is odd), i.e. if v1 = 1

then we have v1 * d = 1 mod 2. Now suppose we have a solution vk mod 2^k,

i.e. vk * d = 1 mod 2^k. We'd like to find a value wk such that

(vk + 2^k * wk) * d = 1 mod 2^2k. This is always possible -- I omit the

easy proof -- so we only need to solve this equation for wk, i.e.

(d * vk - 1) = -2^k * wk * d. We can get rid of the d on the right hand

side by multiplying by vk, i.e. 2^k * w = ((1 - d * vk) * vk) (mod 2^2k).

In other words, two multiplications will suffice to compute

2^k * wk (mod 2^2k) from vk. Then v{k+1} = vk + 2^k * wk is the inverse

of d mod 2^2k.

As a further optimisation of all this, one can just work mod 2^64 all

the way through the computation. In other words, start with v1 = 1

and compute v{k+1} = vk + ((1 - d * vk) * vk) (mod 2^64), six times.

To get from a solution mod 2 to a solution mod 2^64 clearly only takes

six Hensel lifting steps. Of course, with a lookup table mod 2^8 to start

from, instead of starting from a solution mod 2, one can do it in three

steps, requiring just six word-by-word multiplications (mullow only).

However, today I am feeling very lazy and I leave it as an exercise for

someone to implement the Hensel lifting lookup table method.

We add a simple type to hold our precomputed Hensel inverse, and a function

for computing it.

The actual nn_divrem_hensel_1_preinv function is again a straightforward

loop. We ensure our remainder is positive at each step and ensure that

we propagate any borrows from subtractions at each step.

The test code for the Hensel division is very similar to that for the

ordinary euclidean division except that d must be odd and the chaining

test will proceed right-to-left.

OK, that is enough brain strain for one day!

Here v0.9 is the github repository for today's post.

Previous post: v0.8 - divrem1_preinv

Next post: v0.10 - mod1_preinv

Thursday 16 September 2010

BSDNT - v0.8 divrem1_preinv

Today we'll implement a not-stupid division function. Since the earliest

days of computers, people realised that where possible, one should

replace division with multiplication. This is done by computing a

precomputed inverse v of the divisor d and multiplying by v instead of

dividing by d.

One particularly ingenious example of this, which was developed recently,

is the algorithm of Möller and Granlund. It can be found in this paper:

http://www.lysator.liu.se/~nisse/archive/draft-division-paper.pdf

If one has a precomputed inverse v as defined at the start of section 3

of that paper, then one can compute a quotient and remainder using

algorithm 4 of the paper using one multiplication and one mullow.

There is one caveat however. The divisor d must be "normalised". This

means that the divisor is a word whose most significant bit is a 1.

This is no problem for us as we can simply shift n to the left so that it is

normalised, and also shift the two input limbs to the left by the same

amount. The quotient will then be the one we want and the remainder

will be shifted to the left by the same amount as n. We can shift it back

to get the real remainder we are after.

We can use this algorithm in our divrem1 function. However, rather than

shift the remainder back each iteration, as it is the high limb of the

next input to the same algorithm we can simply pass it unmodified to the

next iteration of the divrem1 function.

The name of our function is nn_divrem1_preinv. We also introduce a type,

preinv1_t, to nn.h which contains the precomputed inverse (dinv) and the

number of bits n has been shifted by (norm). We add a function,

precompute_inverse1, which computes this.

In order to compute the number of leading zeroes of n, we simply use a gcc

intrinsic __builtin_clzl which does precisely what we are after. This gives

the number of bits we have to shift by.

We implement the Möller-Granlund algorithm as a macro to save as many cycles

as we can. This time we really have to be careful to escape the variables we

introduce inside the macro, to prevent clashes with parameters that might be

passed in. The standard way of doing this is to prepend some underscores to

the variables inside the macro. We call the macro divrem21_preinv1 to signify

that it is doing a division with dividend of 2 words and divisor of 1 word.

Once this is done, the nn_divrem1_preinv algorithm is very simple to implement.

It looks even simpler than our divrem1_simple function. The test code is very

similar as well. The new function runs much faster than the old one, as can

easily be seen by comparing the pauses for the different functions when

running the test code.

Later we'll have to introduce a timing and profiling function so that we can

be more explicit about how much of a speedup we are getting.

I haven't experimented with the divrem21_preinv macro to see if I can knock

any cycles off the computation. For example, careful use of const on one or

more of the variables inside the macro, or perhaps fewer word_t/dword_t

casts might speed things up by a cycle or two. Let me know if you find a

better combination.

Of course this is really just doing optimisations that the C compiler should

already know about. The best way to get real performance out of it is to add

some assembly language optimisations, which we'll eventually do.

The github branch for today's post is here: v0.8

Previous article: bsdnt assembly language

Next article: v0.9 - divrem1_hensel

Tuesday 14 September 2010

BSDNT - assembly language

Today a few of us (Antony Vennard, Brian Gladman, Gonzalo Tornaria and myself) had a go at writing some assembly language.

The results of our labour, for AMD64 can be found on the asm branch on github:

http://github.com/wbhart/bsdnt/tree/asm

This is not a proper revision of bsdnt, but a heavily hacked version, just to demonstrate the principles.

In the nn.c file is some inline assembly code which implements the nn_add_mc function in x64 assembler. The format used is Intel format assembly code, instead of the usual AT&T syntax used by inline gcc normally. We made a change to the makefile so that gcc would use the Intel format assembly code pervasively. Note, this will cause gcc to miscompile bsdnt for many revisions of gcc. This is due to bugs in gcc. Specifically gcc versions 4.4.1 and 4.4.4 work, however.

At this point we haven't unrolled the loop (repeated the contents of the loop over and over to save some time on loop arithmetic). But the results are still pretty nice.

The straight C version we had before took about 12 cycles per word (with gcc 4.4.1 on an AMD Opteron K10 machine). With this new assembly code it takes about 3 cycles per word.

With loop unrolling we might hope to halve that again, but this messy optimisation will have to wait, otherwise work on the functionality of the library will slow right down.

We also wrote some assembly code for Core 2 machines. This runs in about 4.5 cycles per word. At the moment this is not committed anywhere, but you can find a copy here:

http://groups.google.co.uk/group/bsdnt-devel/msg/15e2883d94014196?hl=en

The important thing to realise about assembly language is it is highly architecture dependent. Our AMD64 code is actually slower than C on an Intel Core 2 machine! The assembly code needs to be prepared for the machine in question. Moreover, this code won't work at all on Windows 64 or on any 32 bit machine!

At some point we'll introduce a configuration script which will select the correct assembly code for the architecture being used.

To actually time this code, we simply hacked t-nn.c to only run one of the nn_add_m tests, but with the nn_add_m executed 1000 times and the actual test that the code works, turned off.

We set the number of words being added in each nn_add_m to 24, as the test machine we used was 2.4GHz. This means the number of seconds the test takes to run is approximately the number of cycles per word. This is of course a temporary hack, just to illustrate working assembly code (the test will pass if you remove the loop that calls nn_add_m a thousand times instead of once).

Previous article: v0.7 - divrem1simple

Next article: v0.8 - divrem1_preinv

Monday 13 September 2010

BSDNT - v0.7 divrem1_simple

It's time for our first division function. This will again be a linear function

in that it divides a multiprecision integer by a single word and returns the

(single word) remainder.

This is the first time we can do substantially better than a simple for loop

(other than when we coded nn_cmp and other functions, which allowed early exit).

To get us going, we'll write a really dumb division function first. It'll just

use C's "/" and "%" operators in a for loop.

C is supposed to optimise simultaneous occurrences of / and % together in the

same piece of code to use a single processor instruction where available.

However, even with that optimisation, our function will be exceedingly slow.

We'll see why that is when we implement a not-stupid divrem1, next time around.

To emphasise its stupidity, we'll call our exceedingly dumb division function

nn_divrem1_simple. Perhaps we can use it in our test code to compare against

when we write the more sophisticated version.

Hopefully our naming conventions won't be defeated by this function, which

proceeds left to right (or most significant word first).

One doesn't usually need to write the remainder down immediately following

the quotient (though one could imagine a function operating that way).

However, it is convenient to be able to accept a "remainder-in", essentially

thought of as an extra limb of the dividend, so that divrem can be chained.

In this way, divrem1 is most similar to right shift in semantics (the latter

is just division by a power of two, after all).

Now, if we blindly try to implement this function, we quickly discover a

problem. Even if we work with double words, a division of a two limb quantity

by a single limb can be problematic.

We can illustrate this using decimal digits, instead of machine words. Suppose

we divided 94 by 7. The resulting quotient is 13 with remainder 3 (I hope!).

The point is, the quotient takes up two "words" 1 and 3, not just a single

word.

We can get around it by reducing the top word first. We have 9 divide 7 is 1

remainder 2. Thus the first word of our quotient is 1. Now we are left with

24 divide 7. This time the quotient is 3 with remainder 3 and everything is

fine. Note the remainder, 3, is less than 7.

The critical thing to note is that once we get things started, everything

works fine from then on. The top word in our double words are always reduced

modulo the divisor after the first iteration.

So the almost trivial algorithm almost works, except for the lead-in, where

we have to get things right. After that, we can divide our dividend word on

word until we reach the bottom, where we'll have a single limb remainder.

Now as mentioned, it is also convenient to supply a "carry-in". This acts like

an additional limb of the dividend. Assuming this carry-in was reduced mod

the divisor, we find that we end up with as many words in our quotient as we

had in our original dividend.

Clearly however, if the carry-in is not reduced, we will end up with even one

more quotient limb! However, there is an easy way to avoid this problem.

We simply decide to restrict the carry-in to be reduced modulo the divisor!

This restriction is not a problematic restriction because we will still be

able to chain our division functions together with this restriction (so long

as we use the same divisor throughout).

Anyhow, with this restriction, the quotient will have precisely the same number

of limbs as the dividend, and since we'll start with the carry-in, not the top

limb of the dividend, and that carry-in will already be reduced, there is no

bootstrapping iteration required, i.e. the problem of a non-reduced first limb

as explained above, simply doesn't exist.

In this way we avoid any complex feed-in and wind-down code, and as for all

the functions so far, the length m can be zero. The result is satisfyingly

symmetric, even if it is terribly slow.

In the test code, if q = a / c with remainder r, we can check that a = q*c + r.

We can use the functions we have already implemented, including our mul1_c code

to verify this.

We also check that passing in a reduced carry-in results in a correct result,

by chaining two divisions together.

This will do for today, as a more serious implementation of divrem1 will

require much more serious code. We'll take a look at that next time.

The github branch here v0.7 is for this post.

Previous article: bsdnt v0.6 - addmul1, submul1

Next article: bsdnt assembly language

Sunday 12 September 2010

BSDNT - v0.6 addmul1, submul1

At last we come to addmul1 and submul1. After implementing these, we would

already be able to implement a full classical multiplication routine, using the

naive O(n^2) multiplication algorithm.

It turns out that addmul1 and submul1 are quite similar to mul1. Instead of

just writing out the result, we have to add it to, or subtract it from the

first operand.

These are really combined operations, i.e. combined mul1 and add_m or sub_m

respectively.

The reason we introduce these is that the processor has multiple pipelines

and merging two operations like this gives us the chance to push more

through those pipelines. We'll add more combined operations later on.

The fact that we can add b[i]*c, a[i] and ci and not overflow a double word

needs some justification.

Clearly b[i] and c are at most B-1. Thus b[i]*c is at most B^2-2B+1. And

clearly a[i] and ci are at most B-1. Thus the total is at most

B^2-2B+1 + (B-1) + (B-1) = B^2-1, which just fits into a double word.

The dword_t type allows us to get away with adding all these in C as though

we don't care about any overflow (in fact we know there is none). Of course

that doesn't make it efficient. This function is still a candidate for

assembly optimisation and loop unrolling later on.

The submul function is essentially the same, so long as we have taken care to

perform operations in the right order.

The naming of our functions causes us slight difficulties here. We'd ideally

like versions of addmul1 and submul1 which write the carry out to the high

limb and versions which add/sub it from the high limb. We opt for the latter

for now. After all, if one were accumulating addmuls, this is what one would

require most often.

Maybe someone reading this has a better idea how to handle these.

We add a test which checks that doing two addmuls in a row is the same as

doing a single addmul with the multiply constant equal to the sum of the

original two. We also add a test to check that chaining addmuls works. We

add similar tests for submul.

We delay coding up a quadratic multiplication basecase as there are a few more

linear functions to work on, most notably the various kinds of division and

remainder functions. These are all interesting functions to write.

The github repo for this version is here: v0.6

Previous article: v0.5 add, sub, cmp

Next article: v0.7 divrem1_simple

Saturday 11 September 2010

BSDNT - v0.5 add, sub, cmp

It's time we added a full add and sub routine. These perform addition and
subtraction with possibly different length operands. For simplicity, we
assume the first operand is at least as long as the second. This saves us
a comparison and a swap or function call.

Add is simply an add_m followed by an add1 to propagate any carries right
through to the end of the longest operand. A similar thing holds for sub.

We'd like to code these as macros, however we'd also like to get the return
value. Thus, we use a static inline function, which the compiler has the
option of inlining if it wants, saving the function call overhead.

We add tests for (a + b) + c = (a + c) + b where b and c are at most the same
length as a. We also check that chaining an add with equal length operands,
followed by one with non-equal operands, works as expected. We add similar
tests for subtraction too.

The next function we add is comparison. It should return a positive value
if its first operand is greater than its second, a negative value if it is
the other way, and zero if the operands are equal.

As for equal_m and equal, we introduce different versions of the function
for equal length operands and possibly distinct operands.

The laziest way to do comparison is to subtract one value from the other
and see what sign the result is. However, this is extremely inefficient
in the case that the two operands are different. It's very likely that
they already differ in the most significant limb, so we start by checking
limb by limb, from the top, until we find a difference.

Various tests are added. We test that equal things are equal, that operands
with different *lengths* compare in the right order, and operands with
different *values* but the same length compare in the right order.

We also take the opportunity to do a copy-and-paste job from our cmp test
code and quickly generate some test code for the nn_equal function, which
didn't have test code up to this point.

From the next section onwards, we will only deal with one or two functions
at a time. All the really simple functions are now done, and we can move on
to more interesting (and useful) things.

The github branch for this post is here: v0.5

Previous article: 0.4 - add1, sub1, neg, not

Next article: v0.6 addmul1, submul1

Friday 10 September 2010

BSDNT - v0.4 add1, sub1, neg, not

Today we introduce some more convenience functions, neg and not, and we
also add the functions add1 and sub1.

All of these functions operate slightly differently to the ones we have
introduced so far.

The functions add1 and sub1 simply add a single word to a multiprecision
integer, propagating any carries/borrows all the way along.

The main loops of add1 and sub1 need to stop if the carry becomes zero. This
is for efficiency reasons. In most cases when adding a constant limb to a
multiprecision integer, only the first limb or two are affected. One doesn't
want to loop over the whole input and output if that is the case.

However, we must be careful, as in the case where the input and output are
not aliased (at the same location), we still need to copy the remaining
limbs of the input to the output location.

We add tests that (a + c1) + c2 = (a + c2) + c1 and do the same thing for
subtraction.

We also check that chaining of add1's and chaining of sub1's works. Until
we can generate more interesting random test integers this test doesn't
give our functions much of a workout. We eventually want to be able to
generate "sparse" integers, i.e. integers with only a few binary 1's or a
few binary 0's in their binary representation. The latter case would be
interesting here as it would test the propagation of carries in our add1
and sub1 functions. We'd also eventually like to explicitly test corner
cases such as multiprecision 0, ~0, 1, etc.

A final test of add1/sub1 that we add is a + c1 - c1 = a.

The not function is logical not. It complements each limb of the input. It
is a simple for loop.

The neg function is twos complement negation, i.e. negation modulo B^m. In
fact, twos complement negation is the same as taking the logical not of the
integer, then adding 1 to the whole thing. The implementation is similar to
add1, except that we complement each limb after reading it, but before adding
the carry.

One difference is that we still need to complement the remaining limbs after
the carry becomes zero, regardless of whether the input and output are
aliased.

The carry out from (neg a) is notionally what you would get if you were
computing 0 - a. In other words, the carry is always 1 unless a is 0. In
order to allow chaining, neg must notionally subtract the carry-in from the
total.

We test that (not (not a)) = a and that neg is the same as a combination of
not and add1 with constant 1. We can also test that adding -b to a is the
same computing as a - b. And finally we can test chaining of neg, as always.

I wonder what the most interesting program is that we could implement on top
of what we have so far. Tomorrow we add a few more convenience functions
before we start heading into the more interesting stuff.

I think I may have solved the test framework problem. More on that when we
get to v0.11 and v0.12.

There is a github branch here v0.4 for this article.

Previous article: v0.3 - copy, zero, normalise, mul1

Next article: v0.5 - add, sub, cmp

Thursday 9 September 2010

BSDNT - v0.3 copy, zero, normalise, mul1

In this section we add a few more trivial convenience functions, nn_copy,

nn_zero and nn_normalise.

The first two are simple for loops which we implement in nn.h.

A pair {c, m} will be considered normalised if either m is zero

(representing the bignum 0) or the limb c[m-1] is nonzero. In other words,

the most significant word of {c, m} will be non-zero.

The only tricky thing to do with nn_normalise is to make sure nothing goes

wrong if all the limbs are zero.

We add numerous tests of these new functions. We zero an integer and check

that it normalises to zero limbs. We copy an integer and make sure it

copies correctly. We also copy an integer, then modify a random limb and

then check that it is no longer equal. This provides a further check that

the nn_equal function is working correctly.

Finally, we do a more thorough test of nn_normalise by zeroing the top few

limbs of an integer then normalising it. We then copy just this many limbs

into a location which has been zeroed and check that the new integer is

still equal to the original *unnormalised* integer. This checks that

nn_normalise hasn't thrown away any nonzero limbs.

Next we add a mul_1 function. As for all of the linear functions we have

been adding, this will be very slow in C. This time we need to retrieve

the top limb of a word-by-word multiply. Again, ANSI C doesn't provide this

functionality, so we use the gcc extensions that allow us to upcast to a

double word before doing the multiply. Again the compiler is supposed to

"do the right thing" with this combination.

We add a test to check that a * (c1 + c2) = a * c1 + a * c2. We also add

a test for chaining of mul1's together, passing the carry-out of one to

the carry-in of the next.

The github branch for this post is here: v0.3

Previous article: v0.2 - subtraction, shifting

Next article: v0.4 - add1, sub1, neg, not

Wednesday 8 September 2010

BSDNT - v0.2 subtraction, shifting

In today's revision we add a subtraction function and left and right

shift functions.

Subtraction is almost trivial after addition. We copy what we did for

addition. The main change is that carry-ins become borrows. we have to

negate our borrows at each step, so that we are always dealing with

positive quantities. The returned borrow is zero or positive.

In the test code for subtraction, we can add a test which checks that

a + b - b = a. This is a further test of the addition function as well.

Left shift is straightforward. Again we make use of the dword_t type

which is twice as wide as a word_t. This allows us to do a single shift

(which the compiler has the option of converting to two shifts if required).

We shift each limb by the specified bits, then hang onto the bits that were

shifted out the top, to be added back in at the next iteration.

For right shift we only have three functions instead of four. It doesn't make

sense to write out the "carry-out" at the bottom of a right shift. Instead

we do a "read-in" nn_r_rsh version which is the exact opposite of nn_s_lsh. It

reads the "carry-in" from an (m + 1)-th limb (which it shifts appropriately).

The nn_r_rsh function, also returns the bits that are shifted out the bottom end.

We can provide an extra test for addition now, by checking that a + a = a << 1

We also check that (a << sh1) >> sh1 gets us back to a.

Finally, we test that we can chain our carry-in carry-out functions together. For

example, I should be able to shift the first few limbs of an integer, feed the

carry-out as a carry-in in to the next shift, which will then shift the rest of

the integer. This should give me the same result as if I had shifted the whole

integer in one go.

We also do similar carry-in/carry-out chaining with the addition and subtraction

functions.

We are going to need to set up a proper test framework at some point, but for

now, whilst we are still working on implementing "linear" functions, I'm simply

going to keep duplicating the test functions over and over and making the

minor modifications required to test the new functions we are adding.

Here is v0.2 on github for this post.

Previous article: v0.1 - basic types and addition

Next article: v0.3 - copy, zero, normalise, mul1

Tuesday 7 September 2010

BSDNT v0.1 - basic types and addition

We start with v0.1 of bsdnt. See the github branch:

http://github.com/wbhart/bsdnt/tree/v0.1

Firstly, we set up a bit of infrastructure before we add our first function,

the addition function.

We begin that with some basic types in nn.h:

word_t - represents a machine word (either 32 or 64 bit)

dword_t - twice the size of a machine word (to handle carries from addition)

nn_t - our basic multiprecision integer type, consisting of an array of words

nn_src_t - same as an nn_t, but declared const, so it can't be modified, used

for input/source parameters

len_t - the length in words of a multiprecision integer. We don't use a struct

as we'd have to pass it by reference, which would be less efficient due

to having to dereference the pointer all the time

rand_t - a random state, unused for now. This will ensure thread safety of our

random functions later on.

We also define WORD_BITS, the number of bits in a machine word. To simplify

things, we'll informally let B refer to the number 2^WORD_BITS in what follows.

Our basic bignum type is a pair {c, len} consisting of an nn_t c and a len_t len

counting the number of limbs of our bignum. At this stage we restrict len to

being non-negative, and we allow leading zero limbs in our representation of

our bignums. This allows for twos complement arithmetic with fixed length

bignums (arithmetic modulo B^len for some fixed len).

The first function we write is a simple addition function. This needs to be

pretty sophisticated. In particular, we want to be able to pass in a carry, so

that addition functions can be chained together, or on the end of other functions.

The first addition function will add two operands of the same length (number of

machine words), which we signify with an m at the end of the function name.

We signify that the function takes a carry-in by appending c to the function name.

We will also want the function to pass a carry-out. It'll return this carry out as a

word_t.

The function nn_add_mc is defined in nn.c. Notice how we cast each word to a

dword first, then do the addition, then cast back to a word for the low word of

the sum, and shift by WORD_BITS to get the high word of the result.

The compiler is *supposed* to optimise this combination. Actually, it makes poor

use of the processor carry flag, and screws up the loop, which is why C is not

the best language for a bignum library. But for now it is the best we can do.

Later on we'll add assembly optimisations to our library.

We could unroll the loop (write the contents four times in a row, say, to amortise

the cost of the loop counter arithmetic over four iterations). However, we are

after a cleanly coded library, so we resist this temptation until we write some

assembly code.

Now we introduce some extra macros in nn.h for our addition function. These

represent various permutations of allowing a carry in or not, and one more

interesting macro, nn_s_add_m. This function checks whether the result

and first input are aliased. If so (in other words, we have a = a + c), then the

carry-out gets *added* to the correct limb of a.

If a and b are not aliased (we have a = b + c), then the carry is *written*

and not added, to the correct limb of a. This macro will make coding cleaner

later on.

Notionally, the extra 's' in the function name stands for 'store'.

We use macros for these functions to stop code duplication, and because macros

prevent extra function call overhead, and sometimes offer an opportunity for the

compiler to optimise further.

Note that C macros are not hygienic. In fact they are just text replacement macros.

Thus we need to be careful about naming the macro variables so they are unlikely

to conflict with symbols passed in by the caller. In this case we don't introduce

any variables inside our macros, so this precaution is not necessary.

We also need to take care when using the macro parameters inside the macro. In

some cases it is necessary to put parentheses around the parameters in case

the caller passes in a complex expression which combines badly with expressions

inside our macro (e.g. due to operator precedence). Where there can be no

confusion when substituting a macro, we do not need to do this.

Next we will need to add some random functions, to generate random integers to add

in our test code. In nn.c, we add functions to generate a random word of data, a

random integer up to a limit and a random multiprecision integer.

The random functions take a random state as a parameter. This is essential to

ensure that we can make our random functions threadsafe at a later date. For now

the random state and associated init and clear functions do nothing.

The random word function generates two random integers slightly bigger than half

a word and stitches them together. This is done using an el cheapo pseudorandom

function which works modulo a prime slightly bigger than half a word. We take some

care not to always end up with even output, or output divisible by a given small

prime.

The test code is in t-nn.c. We add a comparison function nn_equal_m to nn.h, to

test equality of two mp integers, to be used in our test code, and some other

convenience functions. We generate random length mp integers, and check an

identity, in this case the associative law for addition. This is not a very

sophisticated test, and we'll have to improve out test code at a later date.

For now, it passes, and we move on to grander things.

Previous article: BSDNT-introduction

Next article: v0.2 - subtraction, shifting

BSDNT - Introduction

Over the next few weeks to months, I'll be blogging about a new project of mine, called bsdnt.

The aim of this project is to develop a cleanly coded bignum library with (eventually) reasonable performance and a BSD license. I'll be developing this code for a while in a blog/tutorial style.

First some organisational matters.

For those who wish to follow along with the code as it is written, it can be found at:

http://github.com/wbhart/bsdnt/

If you have a github account, you can clone my project on github.

To make a clone of my repository on your local machine, you can do:

git clone git://github.com/wbhart/bsdnt.git bsdnt

The various branches, as I add them, will be called v0.1, v0.2, etc.

To switch branches, simply do, for example:

git checkout origin/v0.1

To make your own branch of this, to mess around in, do:

git checkout -b mybranch

Antony Vennard, set up a google groups discussion list. You can access that

here:

http://groups.google.com/group/bsdnt-devel?hl=en

A word on licensing: if you substantially copy my code, then I appreciate you

retaining the copyright, however if you write substantially your own code,

merely being "inspired" by my code, I am fine with you not adding my copyright to your project. But still let me know about your code, because I'd be interested in seeing it.

To compile the project, you will need a recent version of gcc. I'm using

gcc 4.4.1, and some of the features may not work with earlier than gcc 4.4.

You can proceed to the first blog about the code here:

Next article: BSDNT v0.1 - basic types and addition

Reading, Writing and Arithmetic