Reading, Writing and Arithmetic: 2010

Saturday 20 November 2010

BSDNT - v0.24 nn_bitset/clear/test and nn_test_random

In today's update we make a long overdue change to bsdnt, again to improve our testing

of the library.

We are going to add a function for generating random bignums with sparse binary

representation. We'll also add some other random functions based on this primitive.

Using test integers with sparse binary representation in our test code will push our

functions harder because it will test lots of corner cases such as words that are all

zero, in the middle of routines, and so on. As it is currently, we'd be extremely

lucky for the random word generator we've been using to generate an all zero word, or

a word with all bits set to one for that matter.

The first step to generating such test randoms is to write a function for setting a

given bit in an integer. This will be an nn_linear function despite it not actually

taking linear time. In fact, it will take essentially constant time. However, it is an

nn type function, so it belongs in an nn module.

The routine is very straightforward. Given a bit index b, starting from 0 for the least

significant bit of a bignum, it will simply use a logical OR to set bit b of the bignum.

Rather than construct a bignum 2^b and OR that with our number, we simply determine

which word of the bignum needs altering and create an OR-mask for that word.

Computing which word to adjust is trivial, but depends on the number of bits in a word.

On a 64 bit machine we shift b to the right by 6 bits (as 2^6 = 64), and on a 32 bit

machine we shift b to the right by 5 bits (2^5 = 32). This has the effect of dividing

b by 64 or 32 respectively (discarding the remainder). This gives us the index of the

word we need to adjust.

Now we need to determine which bit of the word needs setting. This is given by the

remainder after dividing b by 64 or 32 respectively, and this can be determined by

logical AND'ing b with 2^6-1 or 2^5-1 respectively. This yields a value c between 0 and

63 (or 31) inclusive, which is a bit index. To turn that into our OR-mask, we simply

compute 2^c (by shifting 1 to the left by c bits).

Now that we have our OR-mask and the index of the word to OR it with, we can update the

required bit. We call this function nn_bit_set.

While we are at it we create two other functions, nn_bit_clear and nn_bit_test.

It's now straightforward to write test functions which randomly set, clear and test

bits in a random bignum.

Next we create a random bignum generator which sets random bits of a bignum. In order

to do this, we simply choose a random number of bits to set, from 0 to the number of words

in the bignum, then we set that many bits at random.

We call this function nn_test_random1.

We now add a second random bignum generator. It works by creating two bignums using the

function nn_test_random1 and subtracting one from the other. This results in test randoms

with long strings of 1's and 0's in its representation.

We call this function nn_test_random2.

Finally, we create a function nn_test_random which randomly switches between these two

algorithms and our original nn_random algorithm to generate random bignums. We switch all

our test code to use nn_test_random by changing the function randoms_of_len to use it.

At this point we can have greater confidence that our functions are all working as they

are supposed to be, as our test code has been suitably fortified at last! (Well, they are

working now, after I spent a day hunting down the bugs that these new randoms found - no,

I am not kidding. That's how good at finding bugs this trick is!)

Today's code is found here: v0.24

Previous article: v0.23 - sha1 and PRNG tests

Next article: BSDNT - interlude

Friday 12 November 2010

BSDNT - v0.23 sha1 and PRNG tests

In a recent update we added three PRNGs (pseudo random number

generators). What we are going to do today is add test code for

these.

But how do you test a pseudo random generator? It's producing

basically random values after all. So what does it matter if the

output is screwed up!?

Well, it does matter, as shown by the problems on 32 bit machines

which I wrote about in the PRNG blog post. It would also matter if

the PRNGs were broken on some platform such that they always output

0 every time!

There's a few ways of testing PRNGs. One is simply to run them for a

given number of iterations, write down the last value it produces and

check that it always does this.

The method we are going to use is slightly more sophisticated. We are

going to hash a long series of outputs from the PRNGs, using a hash

function, and check that the hash of the output is always the same.

Basically, our test code will take a long string of words from the

PRNGs, write them to an array of bytes, then compute the sha1 hash of

that array of bytes. It'll then compare the computed hash with a hash

we've computed previously to ensure it has the same value as always.

Moreover, we'll set it up so that the hash is the same regardless of

whether the machine is big or little endian.

The hash function we are going to use is called sha1. Specifically,

we'll be using an implementation of the same written by Brian Gladman

(he also supplied the new test code for the PRNGs for today's update).

The first step is to identify whether the machine is big or little

endian. This refers to the order of bytes within a word in physical

memory. On little endian machines (such as x86 machines) the least

significant byte of a word comes first. On big endian machines the

order is the other way around. Thus the number 0x0A0B0C0D would have

the byte 0x0D stored first on a little endian machine, but 0x0A stored

first on a big endian machine.

Endianness can be identified by architecture, or it can be identified

with a short program. We choose to use the latter method as it should

be hard to fool. At configure time a short C program will run that will

place bytes into a four byte array, then read that array as a single

32 bit number. We then compare the value to a 32 bit value that would

be stored in the given way on a little endian machine. If it compares

equal, then the machine must be little endian. If not we compare with

a number that would be stored in the given way on a big endian machine.

If the machine doesn't match either order, then it must be a very rare

machine with mixed endianness, which we don't support in bsdnt.

The configure script will write some defines out to config.h which

then allow bsdnt modules to identify whether the machine is little or

big endian at compile time, i.e. at zero runtime cost.

Now to discuss the sha1 hashing scheme.

A hashing scheme simply takes a piece of data and computes from it a

series of bits which serve to "identify" that piece of data. If

someone else has access to the same hashing algorithm and a piece of

data which purports to be an exact copy of the original, then they

can verify its identity by computing its hash and comparing.

Of course a hash is only valuable in this sense if it is much shorter

than the piece of data itself (otherwise you'd just compare the

actual data itself).

A very simple hashing scheme might simply add all the words in the

input to compute a hash consisting of a single word.

A secure hashing scheme has at least two other properties.

i) It shouldn't be possible to determine the original data from its

hash. (The data may be secret and one may wish to provide for the

independent verification of its authenticity by having the recipient

compare the hash of the secret data with a publicly published value.

Or, as is sometimes the case, the hash of secret data, such as a

password, might be transmitted publicly, to compare it with a

pre-recorded hash of the data.)

ii) It must be computationally infeasible to substitute fake data

for the original such that the computed hash of the fake data is the

same as that of the original data.

Of course, if the hash is short compared to the data being hashed,

then by definition many other pieces of data will have the same hash.

The only requirement is that it should be computationally infeasible

to find or construct such a piece of data.

The first step in the SHA1 algorithm is some bookkeeping. The

algorithm, as originally described, works with an input message which

is a multiple of 16 words in length. Moreover, the last 64 bits are

reserved for a value which gives the length of the original message in

bits.

In order to facilitate this, the original message is padded to a

multiple of 16 words in length, with at least enough padding to allow

the final 64 bits to be part of the padding, and to not overlap part

of the message.

The padding is done by first appending a single binary 1 bit, then

binary zeroes are appended until the message is the right length.

Then of course the length in bits of the original message is placed

in the final 64 bits of the message.

The hashing algorithm itself performs a whole load of prescribed

twists and massages of the padded message.

For this purpose some functions and constants are used.

Given 32 bit words B, C and D there are four functions:

1) f0(B, C, D) = (B AND C) OR ((NOT B) AND D)

2) f1(B, C, D) = B XOR C XOR D

3) f2(B, C, D) = (B AND C) OR (B AND D) OR (C AND D)

4) f3(B, C, D) = B XOR C XOR D

and four corresponding 32 bit constants:

1) C0 = 0x5A827999

2) C1 = 0x6ED9EBA1

3) C2 = 0x8F1BBCDC

4) C3 = 0xCA62C1D6

To begin the algorithm, we break the padded message up into 16 word

blocks M1, M2, M3, i.e. each Mi is 16 words of the padded message.

Each block is processed via a set of steps, and an "accumulated" hash

of 160 bits, consisting of five 32 bit words (the final hash we are

after) is computed: H0, H1, H2, H3, H4.

The algorithm starts by initialising the five "hash words" to the

following values:

H0 = 0x67452301, H1 = 0xEFCDAB89, H2 = 0x98BADCFE, H3 = 0x10325476

and H4 = 0xC3D2E1F0.

Each block of 16 words, Mi, of the padded message is then used in

turn to successively transform these five words, according to the

following steps:

a) Break Mi into 16 words W0, W1, ..., W15.

b) For j = 16 to 79, let Wj be the word given by

Wj = S^(-1)(W{j-3}) XOR W{j-8} XOR W{j-14} XOR W{j-16}),

where S^n(X) means to rotate the word X to the left through n bits

(a negative n means right rotation).

c) Make a copy of the hashing words:

A = H0, B = H1, C = H2, D = H3, E = H4

d) For j = 0 to 79 repeat the following set of transformations in

the order given:

TEMP = S^5(A) + f{j/20}(B, C, D) + E + Wj + C{j/20},

E = D, D = C, C = S^30(B), B = A, A = TEMP,

where j/20 signifies "floor division" by 20, and where f and C are

the above-defined functions and constants.

e) Update the hashing words according to the following:

H0 = H0 + A, H1 = H1 + B, H2 = H2 + C, H3 = H3 + D, H4 = H4 + E.

Note that steps a-e are repeated for each block of 16 words, Mi in

the padded message, further manipulating the five words with each run.

The resulting five words H0, H1, H2, H3, H4 after all the words of the

padded message have been consumed, constitutes the sha1 hash of the

original message.

The function to compute the sha1 hash of a message is given in the

files sha1.c and sha1.h in the top level directory of the source

tree.

A new test file t-rand.c is added in the test directory. It contains

the sha1 hash of a large number of words as output by our three

PRNGs. If a user of bsdnt has the same hash for the PRNGs when run

on their machine, then they can have a very high level of confidence

that they are working as expected.

Note that the sha1 algorithm is known as a secure hashing algorithm,

which means that in theory it can be used to hash very important

data so that the recipient can independently confirm the data hasn't

been tampered with (by computing the hash of the value and making

sure it matches some published value).

We don't explain how sha1 actually works. The mysterious constants

are not so mysterious. C0 is the square root of 2 in hexadecimal, C1 is

the square root of 3, C2 the square root of 5, C3 the square root of 10.

I don't know the meaning of the functions f0-f3.

What is worth noting is that in recent years, people have figured out

how to produce sha1 hash collisions (two messages with the same hash).

I don't pretend to be an expert in such things.

All we care about here is that a broken PRNG really can't pretend to

be working, and for that, sha1 works a treat.

Disclaimer: use the information in this post at your OWN RISK!! We

make no representations as to its correctness. The same goes for

bsdnt itself. Read the license agreement for details.

Warning: cryptography is restricted by law in many countries including

many of those where the citizens believe it couldn't possibly be so.

Please check your local laws before making assumptions about what you

may do with crypto.

The code for today's article is here: v0.23

Previous article: v0.22 - Windows support

Next article: v0.24 - nn_bitset/clear/test and nn_test_random

Thursday 11 November 2010

BSDNT - v0.22 Windows support

Today's update is a massive one, and comes courtesy of Brian Gladman. At last we add

support for MSVC 2010 on Windows.

In order to support different architectures we add architecture specific files in the arch

directory. There are three different ways that architectures might be supported:

* Inline assembly code

* Standalone assembly code (using an external assembler, e.g. nasm)

* Architecture specific C code

Windows requires both of the last two of these. On Windows 64 bit, MSVC does not support

inline assembly code, and thus it is necessary to supply standalone assembly code for this

architecture. This new assembly code now lives in the arch/asm directory.

On both Windows 32 and 64 bit there is also a need to override some of the C code from the base

bsdnt library with Windows specific code. This lives in the arch directory.

Finally, the inline assembly used by the base bsdnt library on most *nix platforms is now in the

arch/inline directory.

In each case, the os/abi combination is specified in the filenames of the relevant files. For

example on Windows 32, the files overriding code in nn_linear.c/h are in arch/nn_linear_win32.c/h.

(Note win32 and x64 are standard Windows names for 32 and 64 bit x86 architectures, respectively.)

If the code contains architecture specific code (e.g. assembly code) then the name of the file

contains an architecture specifier too, e.g. arch/inline/nn_linear_x86_64_k8.h for code specific

to the AMD k8 and above.

It's incomprehensible that Microsoft doesn't support inline assembly in their 64 bit compiler

making standalone assembly code necessary. It would be possible to use the Intel C compiler on

Windows 64, as this does support inline assembly. But this is very expensive for our volunteer

developers, so we are not currently supporting this. Thus, on Windows 64, the standalone

assembly is provided in the arch/asm directory as just mentioned.

Brian has also provided MSVC build solution files for Windows. These are in the top level source

directory as one might expect.

There are lots of differences on Windows that requires functions in our standard nn_linear.c,

nn_quadratic.c and helper.c files to be overridden on Windows.

The first difference is that on 64 bit Windows, the 64 bit type is a long long, not a long. This

is handled by #including a types_arch.h file in helper.h. On most platforms this file is empty.

However, on Windows it links to an architecture specific types.h file which contains the

requisite type definitions. So a word_t is a long long on Windows.

Also, when dealing with integer constants, we'd use constants like 123456789L when the word type

is a long, but it has to become 123456789LL when it is a long long, as on Windows 64. To get

around this, an architecture specific version of the macro WORD(x) can be defined. Thus, when

using a constant in the code, one merely writes WORD(123456789) and the macro adds the correct

ending to the number depending on what a word_t actually is.

Some other things are different on Windows too. The intrinsic for counting leading zeroes is

different to that used by gcc on other platforms. The same goes for the function for counting

trailing zeroes. We've made these into macros and given them the names high_zero_bits and

low_zero_bits respectively. The default definitions are overridden on Windows in the architecture

specific versions of helper.h in the arch directory.

Finally, on Windows 64, there is no suitable native type for a dword_t. The maximum sized

native type is 64 bits. Much of the nn_linear, and some of the nn_quadratic C code needs to

be overridden to get around this on Windows. We'll only be using dword_t in basecase algorithms

in bsdnt, so this won't propagate throughout the entire library. But it is necessary to

override functions which use dword_t on Windows.

Actually, if C++ is used, one can define a class called dword_t and much of the code can

remain unchanged. Brian has a C++ branch of bsdnt which does this. But for now we have C code

only on Windows (otherwise handling of name mangling in interfacing C++ and assembly code

becomes complex to deal with).

Brian has worked around this problem by defining special mul_64_by_64 and div_128_by_64

functions on 64 bit Windows. These are again defined in the architecture specific version of

helper.h for Windows 64.

Obviously some of the precomputed inverse macros need to be overridden to accomodate these

changes, and so these too have architecture specific versions in the Windows 64 specific version

of the helper.h file.

In today's release we also have a brand new configure file for *nix. This is modified to handle

all the changes we've made to make Windows support easy. But Antony Vennard has also done

some really extensive work on this in preparation for handling standalone assembly on arches

that won't handle our inline assembly (and for people who prefer to write standalone assembly

instead of inline assembly).

The new configure file has an interactive mode which searches for available C compilers (e.g.

gcc, clang, icc, nvcc) and assemblers (nasm, yasm) and allows the user to specify which to use.

This interactive feature is off by default and is only a skeleton at present (it doesn't actually

do anything). It will be the subject of a blog post later on when the configure file is finished.

The code for today is found at: v0.22

Previous article: v0.21 - PRNGs

Next article: v0.23 - sha1 and prng tests

Sunday 31 October 2010

In this update we are going to replace the el cheapo random generator in
bsdnt with something of higher quality.

Some time ago, Brian Gladman brought to my attention the fact that on 32
bit machines, the test code for bsdnt actually caused Windows to hang!

The issue required some sleuthing work on Brian's part to track down.
He eventually discovered the cause of the problem, and it was, oddly
enough, the pseudo-random number generator (PRNG) I had used.

Brian suspected the PRNG immediately because of his past experience as a
professional cryptographer. In fact, it turns out that PRNG's of the
kind that I had used, aren't particularly good even if they don't have
bugs!

The particular kind of PRNG I had used is called a linear congruential
PRNG. They start with the PRNG initialised to some random seed value,
n = seed. Then each time they are called, they apply the transformation
n = (n*c1 + c2) % p for some explicit constants c1, c2 and a large
enough "prime" p.

One can take c2 = 0 in the transformation and it is also common to see
p = 2^b for some b (e.g. b = WORD_BITS, and yes, I know p = 2^b is not
usually prime). When p = 2^b it is usually the case that the top half of
the bits output have reasonable random properties, but the bottom half
do not. In this case it is acceptable to stitch two LC PRNG's together to
give the full number of bits.

When p is an actual prime, these PRNG's are called prime modulus linear
congruential PRNG's and they aren't too bad when implemented properly.
They still fail to meet some standards of random quality.

To return a whole word of random bits, one either needs to use a prime
p that is larger than a word, which is usually impractical, or again
stitch the output of two prime modulus LC PRNG's together.

However, one needs to be careful, as the period of the generator is p-1,
so if one is on a 32 bit machine, it doesn't do to use a prime p just
over half the size of a word (the first mistake I made), otherwise the
period is just over 65536. That isn't too good for generating random
values for test code!

But how was my LC PRNG causing Windows to crash!? The reason was that
in some of the test functions we required bignums that didn't overflow
when summed together. This of course depends almost entirely on the top
few bits of the bignums being added together.

The problem was that in the expression n = (n*c1 + c2) % p, I was using
values of c1 and c2 which were not reduced mod p. It turns out that this
is crucial to correct operation. It might seem that the result ends up
being reduced mod p anyway, and indeed it would be if n*c1 fit in a word.
However, because it doesn't you actually get ((n*c1 + c2) % 2^32) % p
which causes the binary representation of the value generated to be quite
regular.

Anyhow, on Windows (and probably on other 32 bit machines) the test code
generates length 90 bignums over and over at some point, looking in vain
to find pairs of such bignums which when added do not overflow. As these
are garbage collected at the end of the test function, memory starts
filling up with the orphaned bignums that are discarded by the test code
as it looks for appropriate values. This eventually overwhelms the heap
allocator on Windows and causes the entire OS to crash!

The problem of writing decent PRNG's has been studied extensively, and one
expert in the subject is George Marsaglia. He famously turned up on a
usenet forum in January of 1999 and dumped not one, but piles of fast, high
quality PRNG's which do not suffer from the problems that other PRNG's do.

Amazingly, many of the PRNG's in common usage today are either written by
George, or based on ones he wrote. So he's some kind of legend!

Anyhow, we will make use of two of his PRNG's, Keep It Simple Stupid (KISS)
and Super KISS (SKISS) and a third PRNG called Mersenne Twister, due to
Makoto Matsumoto and Takuji Nishimura in 1997.

George's generators are in turn based on some simpler PRNG's. He begins by
defining a linear congruential generator, with c1 = 69069 and c2 = 1234567.
This is taken p = mod 2^32 (yes, it's not prime). This has good properties
on its top 16 bits, but not on its bottom 16 bits, and for this reason had
been widely used before George came along. This generator has period 2^32.

Next he defines a pair of multiply with carry (MWC) generators. These are
of the form n = c1*lo(n) + hi(n) where lo(n) is the low 16 bits of n, hi(n)
is the high 16 bits and c1 is an appropriately chosen constant.

He stitches together a pair of these MWC PRNG's mod 2^16 to give 32 random
bits. For simplicity we'll refer to this combined random generator as MWC.
This has a period of about 2^60.

Thirdly he defines a (3-)shift-register generator (SHR3). This views the
value n as a binary vector of 32 bits and applies linear transformations
generated from 32 x 32 matrices L and R = L^T according to
n = n(I + L^17)(I + R^13)(I + L^5) where I is the 32 x 32 identity matrix.

In order to speed things up, special transformations are chosen that can
be efficiently implemented in terms of XOR and shifts. This is called an
Xorshift PRNG. We'll just refer to it as SHR3.

Now given appropriate seed values for each of these PRNG's Marsaglia's
KISS PRNG is defined as MWC ^ CONG + SHR3. This generator passes a whole
slew of tests and has a period of 2^123. In this update we make it the
default random generator for bsdnt.

Super KISS is a random generator defined by George later in 2009. It gives
immense periods by adding together the output of three PRNG's, one with a
massive order. It is basically defined by SKISS = SUPR + CONG + SHR3.

Here, the new generator SUPR is based on a prime p = a*b^r + 1 such that
the order of b mod p has magnitude quite near to p - 1.

It starts with a seeded vector z of length r, all of whose entries are
less than b and an additional value c which is less than a.

One then updates the pair (z, c) by shifting the vector z to the left by
one place and setting the right-most entry to (b - 1) - ((a*z1 + c) mod b)
where z1 is the entry shifted out at the left of z. Then c is set to t/b.

Naturally in practice one uses b = 2^32 so that all the intermediate
reductions mod b are trivial.

As with most generators which have massive periods the "state" held by this
generator is large. It requires data mod p for a multiprecision p.

Note the similarity with the MWC generator except for the "complement" mod
b that occurs. This is called a CMWC (Complemented-Multiply-With-Carry)
generator.

George proposed using the prime p = 640*b^41265+1, where the order of b is
5*2^1320481. The period of the CMWC generator is then greater than
2^1300000.

Of course, at each iteration of the algorithm, 41265 random words are
generated in the vector. Once these are exhausted, the next iteration of
the algorithm is made.

The algorithm SUPR in the definition of SKISS is thus just a simple
array lookup to return one of the words of the vector z. Each time SKISS
is run, the index into the array is increased until all words of the array
are exhausted, at which point the CMWC algorithm is iterated to refill the
array.

We now come to describing the Mersenne twister.

It is based on the concept of a feedback shift register (FSR). An FSR shifts
its value left by 1 bit, feeding at the right some linear combination of the
bits in its original value. The Mersenne twister is conceptually a
generalisation of this.

The difference with the Mersenne twister is that the "feedback" is effected
by a certain "twist". This is effected by applying a "linear transformation"
A of a certain specific form, with multiplication by A having addition
replaced by xor in the matrix multiplication. The twist can be described
more straightforwardly, and we give the more straightforward description
below.

One sets up a Mersenne twister by picking a recurrence degree n, a "middle
word" 1 <= m <= n and a number of bits for a bitmask, 0 <= r <= 32. One

picks these values so that p = 2^(n*w - r) - 1 is a Mersenne prime (hence

the name of this PRNG). Given a vector of bits a = [a0, a1, ..., a{w-1}] of length

w and a sequence x of words of w bits, the Mersenne twister is defined by a

recurrence relation x[k+n] = x[k+m] ^ ((upper(x[k]) | lower(x[k+1])) A)

where upper and lower return the upper w - r and lower r bits of their

operands, and where A is the "twist" spoken of and defined below, in terms of

a. Of course ^ here is the xor operator, not exponentiation. For a vector X of w

bits, XA is given by X>>1 if X[0] == 0 otherwise it is given by (X>>1) ^ a.

Some theory is required to find an A such that the Mersenne twister will have
maximum theoretical period 2^(n*w - r) - 1.

To finish off, the Mersenne twister is usually "tempered". This tempering
simply mangles the bits in a well understood way to iron out some of the
known wrinkles in the MT algorithm.

Only a couple of sets of parameters are in common use for Mersenne twisters.
These are referred to as MT19937 for 32 bit words and MT19937-64 for 64 bit
words.

As with all PRNG's, there is a whole industry around "cracking" these things.
This involves starting with a short sequence of values from a PRNG and
attempting to find the starting constants and seed values.

Obviously, in crytographic applications, there is not much point generating
"secure" keys with a PRNG with a single source of entropy. Even if your key
is generated by multiplying primes of many words in length, if those words
were generated from a PRNG seeded from the current time, it may only take
a few iterations and a good guess as to which PRNG you used, to determine
the constants used in the PRNG and thus your entire key. And that's
irrespective of which constants you chose in your PRNG!

So if you are doing crypto, you need to take additional precautions to
generate secure keys. Just seeding a PRNG from the time probably won't cut
it!

Some PRNG's are more "secure" than others, meaning that knowing a
few output values in a row doesn't give terribly much information about
which values may follow. But if you rely on a PRNG to be secure, you
are essentially betting that because you don't know how to get the
next few values and nor does anyone else that has written about the
subject, then no one at all knows. Of course one needs to ask oneself
if they would tell you if they did.

Another assumption one should never make is that no one has the computing
power to brute force your PRNG.

Some PRNG's are designed for cryptographic applications, and maybe one can
believe that these are "safe" to use, for some definition of safe.

Anyhow, we only care about random testing at this point. In today's update
32 and 64 bit KISS, SKISS and MT PRNG's are added in the directory rand.
Our randword, randinit, and randclear functions are all replaced with
appropriate calls to KISS functions.

There is also an option to change the default PRNG used by bsdnt. Is it my
imagination or does the test code now run faster, even on a 64 bit machine!

At some point we will add some tests of the new PRNG's. These will compare
the outputs with known or published values to check that they are working as
designed for a large number of iterations.

Brian Gladman contributed to this article and also did most of the work
in implementing Marsaglia's PRNG's in bsdnt. The Mersenne twisters were
originally written by Takuji Nishimura and Makoto Matsumoto and made available
under a BSD license. Brian did most of the work in adapting these for bsdnt.

The code for today's update is here: v0.21

Previous article: v0.20 - redzones

Next article: v0.22 - Windows support

Saturday 30 October 2010

BSDNT - v0.20 redzones

In this update we implement another improvement to the test code in bsdnt. I don't know

what the correct name is, but I call them redzones.

The basic idea is this: suppose you have a function nn_blah say, and it writes to an nn_t

b say. If it writes well beyond the allocated space for b, then almost certainly a

segfault will occur. But what if it only writes a word or two before the beginning or

after the end of the allocated space? Very likely this will cause a segfault only on

some systems, depending on the granularity of the heap allocator and depending

on what other bsdnt data might be in the overwritten space!

So what if we could detect this kind of error? Well, that is what redzones hope to do.

Essentially if an nn_t b is allocated with m words of space, when redzones are turned on

it allocates m + 2C words of space for some small constant C. It then fills the first

and last C words of b with known words of data (usually some recognisable pattern of bits).

When the garbage collector cleans up, it examines the redzones to ensure that they have

not been altered. If they have, they raise an error.

The nn_t b is set to point just after the first C words, which contain the redzone, and in

every other respect act like a normal nn_t. The user needn't know that an extra C words

of data were allocated immediately before and after the length m nn_t they requested.

Nor do they need to be aware of the checking that goes on when the nn_t is finally cleaned

up, that the redzones haven't been touched.

Of course it's nice to be able to turn redzones off sometimes, when testing the library.

Therefore I've added a configure option -noredzones which turns off redzones if they are

not required. This works by setting a #define WANT_REDZONES 0 in config.h. The

memory allocator for nn_t's and the garbage collector both operate differently if redzones

are turned on.

At present, the only way to allocate memory for nn_t's in test code is to use

randoms_of_len, so it is convenient to rewrite this to call a function alloc_redzoned_nn

instead of malloc, and for the garbage collector to call free_redzoned_nn. These new

functions are defined in test.c.

The only difference when WANT_REDZONES is set in config.h is that REDZONE_WORDS, which is defined in test.h is changed from 0 to 4 words (meaning 4 redzone words are to be

allocated at each end of a redzoned nn_t). Having redzones of length 0 is the same as not

having them at all. So this makes the functions easy to write.

Also in test.h REDZONE_BYTE is defined to the hexaecimal byte 0xA which has binary bit

pattern 1010, i.e. alternating one's and zeroes. This is the value that is placed into the

redzones byte-by-byte before the nn_t is used. At the end, when they are cleaned up, the

garbage collector examines the redzones to ensure they are still filled with these bytes.

Fortunately checking redzones does not dramatically slow down our test code, and no new

test failures result. This means it is highly likely that our nn_t functions do not overwrite

their bounds. To check that the new redzones code works, it is a simple matter of mocking

up a broken function which overwrites its bounds. The new code complains loudly as it

should, unless redzones are switched off at configure time.

The code for today's update is here: v0.20

Previous article: v0.19 - asserts

Next article: v0.21 - prngs

Monday 25 October 2010

BSDNT - v0.19 asserts

About a week ago I got enthused to work on another coding project I've been

wanting to experiment with for a long time. I discovered that it was highly

addictive and I just couldn't put it down. It's also given me some interesting

ideas for a higher level interface to bsdnt. But more on that later when we

start working on it.

Unfortunately in that week of time there have been no bsdnt updates.

Moreover, on the weekend my main computer died (physical hard drive failure).

I pulled out my backup machine and Windows wanted to install 3 months of

"important updates". Naturally this caused the machine to crash, I was unable

to recover to the restore point it set, the startup repair didn't work and

the only solution was a format and reload.

*Fifteen and a half hours* later I had reinstalled Windows and it had finally

finished putting the approximately 165 "important updates" back on!!

Unfortunately all the bsdnt blog articles I had meticulously prepared in

advance were lost. Thus I am regenerating what I can from the diff between

revisions of bsdnt. Sorry if they end up being shorter than past updates.

Fortunately I did not lose any of the code I wrote, as that was backed up in

a git repo on an external server!

Anyhow, in this update we make a very simple change to bsdnt, again in an

attempt to improve the test quality of the library. We add asserts to the

code.

An assert is a check that is made at runtime in live code to test if some

predefined condition holds. If the assert fails, an error message is printed

specifying the line of code where the assert is located and what the

condition was that failed.

Now, I am not personally a great believer in asserts. As they are runtime

checks, they require computing cycles, which is just a no-no for a bignum

library. The other option is to turn them off when not testing code. However,

this simply leads to the asserts rarely being run when they are needed.

The other problem with asserts is that they pollute the code, making the

source files longer and appear more complex.

However, there is one situation where I believe they can be very helpful,

and that is in checking the interface of functions within a library and that

it is being respected both in intra-library calls and by the test code for

the library.

Specifically, assert are useful for checking that valid inputs have been

passed to the functions, e.g. you might have a restriction that a Hensel

modulus be odd. Adding an assert allows you to test that all the moduli

you pass to the function in your test runs are in fact odd.

The main advantage in putting asserts into the code is that it forces you

to think through what all the conditions should be that you assert. In

adding asserts to the code in bsdnt I discovered one function in which the

test code was pushing the code to do things I didn't write it to cover.

This forced me to either rewrite the test, or drop that as a condition (I

think I chose the former for consistency with other related functions in

bsdnt).

Of course we do not want to consume cycles when the library is run by the

end user, and so we make asserts optional. This is done using a configure

switch. By default the macro WANT_ASSERT is set to 0 in a file config.h by

configure. However, if the user passes the option -assert to configure, it

sets the value of this define to 1.

A macro ASSERT is then defined in helper.h which is either an empty macro

in the default case or is an alias for the C assert function if WANT_ASSERT

is set to 1 in config.h.

Of course we have to remember to turn asserts on to run the test code, and

this really highlights their main weakness. As I mentioned, the asserts I

added did clarify the interface, but I don't believe they showed up any

bugs in bsdnt. With this expectation, asserts can be a useful tool.

The code for today's update is here: v0.19

Previous article: v0.18 - printx_word, nn_printx

Next article: v0.20 - redzones

Saturday 16 October 2010

BSDNT - v0.18 printx_word, nn_printx

It is time we improved our test code again. We'll spend a few days updating

things to make improvements in the way we test.

Today's update is quite straightforward. We currently have no way of printing

nn_t's. This is quite inconvenient when it comes to the test code, where

little to no diagnostic information is printed at all. In particular, we

aren't printing out any of the multiple precision integers for examination

when a test fails.

Now, it is actually quite a difficult job to print bignum integers in decimal.

In fact, as far as I can see, one requires a function which allocates temporary

space to efficiently print integers. This is an interesting challenge:

is there an algorithm to convert from binary to decimal and print the result,

with just O(1) temporary space, with any complexity.

I think it may be possible if one allows the input to be destroyed. If so, a

subsidiary question would be to do the same thing without destroying the

input. I doubt that is possible, but I do not have a proof. Of course, to be

practical, we'd require an algorithm which doesn't destroy the input.

To get around this issue, we'll start with a simple nn_printx algorithm,

which will print a bignum in hexadecimal. We also add an nn_printx_short

function which prints the first couple of words of a bignum, an ellipsis and

then the final couple of words. This is useful for large bignums that would

print for screens and screens due to their size. We'll use this in our test

code to prevent printing too much output upon test failure.

Another function we add is an nn_printx_diff function. It accepts two nn_t's

and prints information about the range of words where they differ and prints

the first and last differing word in each case.

There is one tricky aspect to our printing functions however. A word is often

an unsigned long, but on some platforms it will be an unsigned long long. For

this reason, when printing a word, we need to use %lx as the format specifier

on some platforms and %llx on others.

So we need to add a routine which will print a word and abstract away the

format specifier so the caller doesn't have to think about it. The function

we include to do this is caled printx_word. It prints a word without needing

to specify a format specifier.

We add files helper.c/h to bsdnt which will contain routines like this one

which aren't specific to our nn module. A few existing functions and macros

also get moved there. The configure system will automatically look for

architecture specific versions of helper.c, allowing us to override the

definition of the functions in that file on a per architecture basis.

We add the printx_word function to helper.c which can be overridden with

an architecture specific version. On a platform where %llx is required, an

architecture specific version will simply replace the generic version which

uses %lx.

In test.h we add some macros, print_debug and print_debug_diff which use the

stringizing operator to print the names of the variables and then print their

values. The stringizing operator (#) is a preprocessor macro which turns a

macro parameter into a string. In our case, we pass the variable name to the

macro and turn it into a string so that we can print the variable name.

A few modifications to the TEST_START and TEST_END macros in test.h also

allow us to give a unique name to each test which is then printed along with

the iteration at which the test failed. This also uses the stringizing

operator so that the caller of TEST_START can specify the unique name for the

test. It seems difficult to come up with an automatic way of generating

unique test names, so this will have to do.

It would also be a useful thing to have it print the value of the random seed

at the start of a failing iteration too. After we have improved the random

generation code in bsdnt v0.21, perhaps someone would like to try adding this

feature.

We roll out our new diagnostic printing routines to our test code. Of course,

to see any of this new code in action, one has to introduce a bug in one of

the tests so that the new diagnostic code is actually run. I leave it to you

to fiddle around introducing bugs to see that the new test code does actually

print useful diagnostic information.

Later on we'll add a bsdnt_printf function which will be variadic and accept

a format specifier like the C printf function and which will have a

consistent %w for printing a word. This will also make things easier on

Windows, where currently the format specifier will be wrong in many places.

We'll fix this problem in a later update.

The code for today's post is here: v0.18

Previous article: v0.17 - div_hensel

Next article: v0.19 - asserts

Friday 15 October 2010

BSDNT - v0.17 div_hensel

Now that we have nn_mullow_classical, we can add nn_div_hensel. As explained,

this will take an integer a of length n and divide by an integer d of length

m modulo B^n, returning a quotient q and an "overflow".

The overflow will be two words which agree with the overflow from mullow(q*d).

As per the euclidean division, the dividend a will be destroyed.

The algorithm is somewhat simpler than the euclidean algorithm. If d1 is the

least significant word of d then we use an inverse mod B of d1 (dinv say) and

multiply it by the current word of the dividend being considered (working from

right to left) to get a quotient word q1.

We then subtract q1*d (appropriately shifted) from the dividend. There is no

adjustment to do as the inverse mod B is unique (so long as d is odd).

Any borrows and overflows from the subtractions are accumulated in the two

overflow and returned.

In our test code, we check a few things. Firstly, for an exact division, we

want that the quotient is really the exact quotient of a by d. As the

quotient returned is m words, which may be larger than the actual quotient,

we check that any additional words of q are zero. We do this by normalising q.

The second test we do is for an inexact division. We check that the the

overflow words turn out to be the same as the overflow from mullow(q*d).

Note that if we wish to make Hensel division efficient for doing an exact

division, say of a 2m - 1 by m division, we merely pass m words of the

dividend in instead of all 2m - 1 words, so that the quotient is also m words.

Once again we don't allow an overflow-in to Hensel division. This wouldn't

give us any kind of chaining property anyway.

Instead, we'd have to do a mulhigh(q, d) and subtract that from the high

part of the chain before continuing, and the mulhigh will accept our overflow

words from the low Hensel division.

In fact, we add a chaining test that does precisely this. We do a Hensel

division on the low n words of a chain, subtract a mulhigh from the high

m words of the chain, then compute the high m words of the quotient using

a second Hensel division.

To make our test code work, we add an ODD flag to randoms_of_len so that only

odd divisors are used with Hensel division.

It isn't clear if it makes sense to allow a carry-in at the top of div_hensel

or not. On the one hand, it might seem logical to allow a carry-in on account

of the way mul_classical works. On the other hand, divrem_hensel1 took a

carry-in, but at the bottom. This was for chaining rather than a read-in of

an extra word. We choose the latter convention, as it seems to make more

sense here and stops the code from becoming horribly complex.

The code for today's post is here: v0.17

Previous article: v0.16 - mullow_classical

Next article: v0.18 - printx_word, nn_printx

Thursday 14 October 2010

BSDNT - v0.16 mullow_classical, mulhigh_classical

The most logical routine to implement next would be Hensel division. Its
main application is in doing exact division.

For that reason, we might want to focus on Hensel division without
remainder first.

This would take an m word integer and divide it by an n word integer,
division proceeding from right to left. Essentially it gives us division
modulo B^m.

However, before we implement this, it makes sense to think about what its
"inverse" operation might be.

If q is the Hensel quotient of a by d mod B^m, then the inverse operation
can be thought of as multiplication of q by d mod B^m.

Hensel division gives an m word quotient q, so this would imply that its
inverse should be a multiplication of {d, n} by {q, m} mod B^m. We might call
this inverse operation mullow, as it returns the low m words of the product
d*q.

However, we need to be careful with this kind of multiplication. We'd also
like to have a mulhigh which returns the high part of the multiplication,
and we'd like the sum of mullow and mulhigh to be the same as a full mul.

However, there is a problem if mullow merely returns the product mod B^m. Any
carries out of mullow will have been lost. Also, all the word by word

multiplications that contribute to the high word of the product mod B^m
will be thrown away.

To rectify the problem we accumulate an "overflow" out of the mullow
corresponding to the sum of all these high words and carries. As this
overflow is essentially the sum of arbitrary words it may take up two words.

Thus, instead of mullow yielding m words it will yield m + 2 words. We'd
like to pass the extra two words as an "overflow-in" to mulhigh, thus the
logical thing is to return these two words from mullow separately from the
product mod B^m itself.

Hensel division will also return two overflow words. After all, what it
essentially does to the dividend is subtract a mullow of the quotient by the
divisor. So, the overflow from Hensel division will be defined as precisely
the overflow from mullow(q*d).

We manage the overflow by accumulating it in an dword_t. However, as we don't
wish the user to have to deal with dword_t's (these are used in our internal
implementations only), we split this dword_t into two separate words at the
end and return them as an array of two words representing the "overflow".

Today we shall only implement mullow and mulhigh. The first of these is a lot
like a full multiplication except that the addmul1's become shorter as the
algorithm proceeds and the carry-out's have to be accumulated in two words, as
explained.

At the same time we implement mulhigh. This takes two "overflow-in" words and
computes the rest of the product, again in a fashion similar to a full
multiplication.

Our test code simply stitches a mullow and mulhigh together to see that the
chain is the same as a full multiplication.

we have to be careful in that if one does an n by n mullow, the mulhigh that
we wish to chain with this must start with an n-1 by 1 multiplication,
not an n by 1, otherwise the sum of the mullow and mulhigh would contain the
cross product of certain terms twice.

We also have to be careful in the case where the full multiplication is only
a single word by a single word. Here the overflow out of the mullow part is
only a single word and there is no mulhigh to speak of. It merely passes
the overflow from the mullow straight through.

Both mullow and mulhigh must accept all nonzero lengths, as per full
multiplication. This causes a few cases to deal with in mulhigh. This doesn't
seem particularly efficient or elegant, but there seems to be little we
can do about that.

An interesting question is what the inverse of mulhigh is. Essentially,
this is our euclidean divapprox.

There's something slightly unsatisfying here though. Recall that the divapprox
algorithm proceeds by subtracting values q1*d' where q1 is the current quotient
word and d' is what is currently left of the divisor. We throw away one word of
the divisor at each iteration until finally we are left with just two words.

It would be wholly more satisfying if we didn't require this extra word of the
divisor throughout. We'd then be working right down to a single word in the
divisor so that we will really have subtracted a mulhigh by the time the
algorithm completes.

Any modification I can think of making to the euclidean division to make this
seem more natural also makes it much less efficient.

Perhaps some further thought will lead to a more satisfying way of thinking about
these things, which isn't also less efficient in practice.

The code for today is here: v0.16

Previous article: v0.15 - divapprox_classical

Next article: v0.17 - div_hensel

Wednesday 13 October 2010

BSDNT - v0.15 divapprox_classical

During the past few weeks, Brian Gladman has been doing

some tremendous updates, including some random number generators and

making bsdnt work on Windows (MSVC).

We discuss all matters related to bsdnt on our development list:

http://groups.google.co.uk/group/bsdnt-devel?hl=en

Anyhow, now for today's update.

We'd now like to implement a variant of our divrem_classical algorithm. This
time we'd like to just return a quotient, with no remainder. The question is,
can this be done in less time than a full divrem?

At first sight, the answer seems to be no. As we saw in the post on the divrem
algorithm, every word of both the dividend and divisor counts, and we need
to keep the dividend completely updated as the algorithm proceeds, otherwise
we will get the wrong quotient.

So, with the exception of perhaps the final update (which is only needed to
determine the remainder), there doesn't seem to be much we can save.

But what if we allowed the quotient to be approximate, say within 1 of the actual

quotient? In fact, let's demand that if q' is our approximate quotient, that |a -

d*q'| < d. In other words, we allow the quotient to be too large by 1, but not

too small by 1.

Ideally, what we would like is to be doing about half the updating work.
Specifically, we'd like to be truncating both the dividend and divisor as we
go.

Ordinarily we start with something like

AAAAAAAAAAAAAAA /
DDDDDDDDD

To get the first quotient word we shift the divisor and pad with zeroes, thus

AAAAAAAAAAAAAAA /
DDDDDDDD000000

After one iteration, a word of the dividend has been removed, and we then
shift the divisor again.

AAAAAAAAAAAAAA /
DDDDDDDD00000

We continue until we have

AAAAAAAAA /
DDDDDDDD

Now there is only one quotient word left. But we notice that we don't use
most of the remaining dividend or the divisor to determine this quotient
word. In fact, we could almost truncate to

AA /
D

What if we had truncated at this same point all along?

In fact, if we truncate so that the final division is a two word by one
word division (here we have to be careful, in that we are talking about the
number of words *after* normalisation), then clearly our quotient could be
out by as much as two on that final division, by what we have said in an
earlier post. That is of course ignoring any accumulated error along the way.

As we don't wish to multiply the entire thing out to see what we have to do
to correct it, it is clear that this amount of truncation is too great.

So let's truncate one further word to the right in both the dividend and
divisor, so that the final division (to get the final quotient word) is a
three word by two word division.

In fact, in the example above, as there will be five quotient words, there
will be five iterations of the algorithm, after which we want two words of
the divisor remaining. So, we will start with

AAAAAAA.A /
DDDDDD.D

(The decimal points I have inserted are arbitrary, and only for notational
purposes in what follows.)

After five iterations, throwing away one more word of the divisor each time,
we'll be left with

AA.A /
D.D

The first thing to notice is that our previous divrem algorithms, with the
adjustments they made as they went, gave the precise quotient given the
data they started with.

The second thing to notice is that truncating both the dividend and the
divisor at the same point, as above, will not yield a quotient that is too
small. In fact, the quotient we end up with will be the same as what we would
have obtained if we had not truncated the dividend at all, and only truncated
the divisor. Additional places in the dividend can't affect the algorithm.

Truncating the divisor, on the other hand, may result in a different quotient
than we would have obtained without truncation. In fact, as we end up
subtracting less at each update than we would if all those words were still
there, we may end up with a quotient which is too large. The divisor may also
divide more times, because of the truncation, than it would have if it had not
been truncated.

However, it is not enough to merely consider how the quotient changes with
truncation in order to see how far we can be out. We'll likely end up with a
very pessimistic estimate if we do this, because we may suppose that the
quotient can be one too large at each iteration, which is not true.

Instead, the quantity to keep track of is the original dividend minus the
product of the full divisor d and the computed quotient q. At the end of the
algorithm, this is the actual remainder we'll end up with, and we'd like to
keep track of how much our *computed* remainder (what's left of the dividend
after the algorithm completes) differs from this actual remainder.

Essentially, we accumulate an error in the computed remainder due to the
truncation.

Clearly, at each iteration, the word after the decimal point in what remains
of the dividend may be completely incorrect. And we may miss a borrow out of
this place into the place before the decimal point. So after n iterations of
the algorithm, the dividend may become too large by n. Of course n.0 is much
smaller than our original (normalised) divisor d (also considered as a decimal
D.DD...).

At the final step of the algorithm, we will have a dividend which is too large
by at most this amount, and we'll be using a divisor which is truncated to
just two words. However, the latter affects the computed remainder by an
amount much less than the original d (if Q is the final quotient word, it is
as though we added q*0.0DDDDDD to our divisor, so that the full divisor would
go Q times where it otherwise would only go Q-1 times).

So these two sources of error only increase the computed value q'*d + r (where
q' is the computed quotient and r is the computed remainder) by an amount less
than d. Thus, the computed quotient q' can be at most one larger than the
actual quotient q.

This is equivalent to the required |a - d*q'| < d.

So it seems that if we truncate out dividend at the start of the algorithm,
and our divisor after each iteration, we can get an approximate quotient q'
within the required bounds.

We'll leave it to another time to describe the usefulness of an algorithm
which computes a quotient which may be out by one. What we will note is that
we've done computations with much smaller integers. It therefore costs us
significantly less time than a full divrem.

In this week's branch we implement this algorithm. In the test code, we check
the required quotient is the same as the one returned by divrem, or at most
one too large.

The trickiest part is ensuring we truncate at the right point. We want to
finish on the last iteration with two words *after* normalisation of the
divisor.

Actually, if we are really smart, we realise that if d does not need
to be shifted by much to normalise it, we can get away with finishing with
just two *unnormalised* words in the divisor. The error will still be much
less than d.

To be safe, if the number of quotient words is to be n, I check if the
leading word of the unnormalised divisor is more than 2*n. If not, too much
normalisation may be required, and I set up the algorithm to finish with
three unnormalised words instead of two. Otherwise it is safe to finish
with two words in the unnormalised divisor.

The algorithm begins by computing the number of words s the divisor needs to
be to start. This is two more than the number of iterations required to
get all but the final quotient word, since we should have two words in the
divisor at this point. If normalisation is going to be a problem, we add one
to this so that we compute the final quotient word with three unnormalised
words in the divisor.

Now the number of words required to start depends only on the size of the
quotient, and thus it may be more than the number of words d actually has.
Thus we begin with the ordinary divrem algorithm until the number of words
required is less than the number of words d actually has.

Now we truncate d to the required number of words and the dividend to one
more than that. The remaining part of the algorithm proceeds in the same
manner, throwing away a divisor word every iteration.

Only one thing can go wrong, and that is the following: because we are

truncating the divisor at each point, we may end up subtracting too little from

the dividend. In fact, what can happen is that the top word of the dividend

does not become zero after we subtract q*d' (where d' is the truncated divisor).

When this happens, the top word of the dividend may be 1 after the subtraction.

We know that the least significant word of the dividend could be completely

wrong, and the next word may be too large by about as many iterations as

we've completed so far.

Thus, in order to fix our overflow, we subtract from the dividend as much as we

need to in order for the overflow to disappear. We don't mind our dividend

being too large, as we adjust for that in the algorithm. But we cannot allow it to

become too small. Thus we must only subtract from the dividend precisely the

amount required to make the overflow vanish.

We can safely subtract away whatever is in the bottom two words of the

dividend, as this is not even enough to remove the overflow. And then we can

subtract 1 from the whole dividend. This must remove the overflow and is

clearly the least we can get away with subtracting to do so.

The code for today's post is here: v0.15

Previous article: v0.14 - divrem_classical

Next article: v0.16 - mullow_classical

Wednesday 6 October 2010

BSDNT - v0.14 divrem_classical

It's time to implement our schoolboy division routine. I prefer the name

divrem_classical, in line with the multiplication routines.

This function will take a dividend a, divisor d, carry-in ci and returns

a quotient q and remainder r. We'll also need to pass in a precomputed

inverse to speed up the dword_t by word_t divisions that need to occur.

We are going to implement the first algorithm spoken of in the previous

post, namely the one which uses a single "normalised" word of the divisor.

We need to be careful not to just shift the top word of the divisor so

that it is normalised, but if it has more than one word, shift any high

bits of the second word as well. We want the top WORD_BITS bits of the

divisor, starting with the first nonzero bit.

We can use our macro for doing a dword_t by word_t division to get each

new word of the quotient. We start with the carry-in and the most

significant word of a. The macro will automatically shift these by the

same amount as we shifted the leading words of the divisor.

As per the divrem1 function, we require that the carry-in be such that

the quotient won't overflow. In other words, we assume that if the

divisor d is m words, then the top m words of the dividend including the

carry-in, are reduced mod d.

First we need to check that we are not in the special case where truncating

the dividend and divisor to two and one words respectively would cause

an overflow of the quotient word to be computed. This only happens

when the top word of the dividend equals the top word of the divisor, as

explained in the previous post.

If the truncation would cause an overflow in the quotient, we collect a

quotient word of ~0, as discussed in the previous post. If not, we compute

the quotient using our macro.

After this point, the remainder is computed. We allow the function to

destroy the input a for this purpose. We leave it up to the caller to make

a copy of a and pass it to the function, if this is not desired

behaviour.

We do our adjustment so that the quotient is correct. We need a while loop

for this, as mentioned in the previous article.

Finally we write out the quotient word and read in the next word of the

dividend that remains.

In the test code we use our mul_classical and muladd_classical functions to

check that divrem_classical is indeed the inverse of these functions, with

zero remainder and nonzero remainder respectively.

The code for today's post is here: v0.14

Previous article: divrem discussion

Next article: v0.15 - divapprox_classical

Saturday 2 October 2010

BSDNT - divrem discussion

Today's post is a discussion only, without accompanying code. This is because

the topic is divrem using the "school boy" algorithm, and this is not as

straightforward as one might imagine. The discussion below is informal, and

may contain errors. Please let me know if so and I can make some adjustments

before releasing the code for the next article, where we implement this.

As a consolation, I have released a failed attempt at using C89 macros to make

even more simplifications to our test code. In particular, I tried to make the

chain tests more "automatic". Unfortunately, this didn't work out. It ended up

making the test code too complicated to use and it was going to be way too much

work to convert all of it over to the new test system.

This test code "simplification" was completely abandoned and will not be merged.

But if you are curious you can see it here: gentest

In particular, take a look at generic.c/h and the test code in t-nn.c.

Anyhow, back to today's discussion:

Firstly, what exactly do we mean by the school boy division algorithm?

Certainly I leaned how to do 39 divide 7 at school, and even 339 divide 7.

But how about 10615 divide 1769?

The first step is to divide 1769 into 1061. It doesn't go, so we grab another

digit. So it becomes 1769 into 10615. I have no idea how many times it goes!

Either I missed something at school when I studied "long division", or I only

think I learned an algorithm suitable for problems like this.

In fact, the only way I know to do this division is trial and error, i.e.

to go through all the likely quotient candidates from one to nine.

Using a calculator I find it goes 6 times.

I think I intuitively know how to proceed from here. We place the digit 6

into our quotient, multiply 1769 by 6 and subtract to form a remainder.

But now we have a problem. What if we are dividing 17699949 into 106150000.

Does it still go 6 times? In fact it does not. But how would I know this

without being able to divide 17699949 into 106150000 in the first place!

So I have two problems. Firstly, if I simply truncate the numerator and

denominator to a certain number of digits, even the first digit of my

quotient may be wrong, and secondly, if I use more digits in the first place

my intiuitive "algorithm" doesn't help. It basically says, guess and refine.

That's all very well, but when my digits are 64 bit words, I may not be so

lucky with the guessing thing.

Now an interesting thing to note in the example above is that no matter how

many digits we add to the dividend, 1769 will always go into the first 5

digits 10615 of the dividend, 6 times (with some remainder), no more, no

less.

To see this, we note that 6*1769 is less than 10615 and 7*1765 is greater.

So the only way we could get 1765 to go in 7 times would be to increase

those first five digits of the dividend. Any following digits are irrelevant

to that first digit, 6, of the quotient.

This is a general feature of integer division, and also applies for the

word based (base B) integer division problem.

However, adding digits to the divisor is not the same, as the example above

shows. In fact 17699949 only goes 5 times into 106150000.

Now the question is, how far off can it be if I truncate the dividend and

divisor? How bad can it get?

We can answer this as follows. Note that 17699949 < 17700000. So 10615xxxx

divided by 1769yyyy is greater than or equal to 10615 divided by 1770, which

happens to be 5. So the answer has to be either 5 or 6 times.

What I have done here is simply add 1 to the first four digits of the

divisor, to get a lower bound on the first digit of the quotient. In this

way, I can get a small range of possible values for the first digit of the

quotient. Then I can simply search through the possibilities to find the

correct quotient digit. This matches my intuition of the long division

algorithm at least.

It seems that *perhaps* a quotient digit obtained in this way will be at

most out by 1. In other words, roughly speaking, if X is the first few digits

of the dividend and Y the appropriate number of digits of the divisor (in the

example above, X = 10615 and Y = 1769), then perhaps X / Y >= X / (Y + 1)

>= (X / Y) - 1, where by division here, I mean integer division. Let's call

this condition R.

Let's suppose now that we are working on integers written as binary words

instead of decimal digits. Let's also suppose that X / Y < B. Let's call

this condition S.

Well, it is easy to generate a counterexample to condition R. Let Y = 2,

X = B. Clearly R is not satisfied, even though S is.

But this seems to only be a problem because Y is so small. So, what if we

constrain Y to be at least B/2?

It turns out that condition R still doesn't hold. A counterexample is

Y=B/2, X = B*Y - 2.

However, I claim that X / (Y + 1) >= (X / Y) - 2 does hold, so long as

condition S holds. Let us call this inequality T.

Clearly there does not exist a counterexample to T with X <= Y.

Let us suppose that for a certain Y >= B/2, there is a counterexample to T.

Clearly if X0 is the first such counterexample, then X0 is a multiple of Y,

but not of Y + 1. Nor is X0 - 1 a multiple of Y + 1 (else X0 - 2 would have

been a counterexample).

It is also clear that every multiple of Y from X0 on is a counterexample, as

the left hand side can never "catch up" again.

Thus we have the following result: if there exists a counterexample to T for

some Y >= B/2 and X < BY, then X = (B - 1)*Y is a counterexample.

Substituting this value of X into T yields that T holds if and only if

(B - 1)*Y / (Y + 1) >= B - 3 for all Y >= B/2. Thus, if we can show that this

last inequality holds for all Y >= B/2 then we have proven T.

Note that this inequality is equivalent to (B - 1)*Y >= (Y + 1)*(B - 3), i.e.

2*Y >= B - 3. However, as Y >= B/2, this holds under our hypotheses.

So putting all of the above together, suppose we are dividing A by D and that

Y >= B/2 is the leading word of the divisor D and B*Y > X >= Y is the first

word or two words of the numerator A (whichever is appropriate). Then if Q0

is the leading word of the quotient Q = A / D, then we have shown that

X / Y >= Q0 >= (X / Y) - 2.

In other words, X / Y can be no more than 2 away from the first word of the

quotient Q = A / D that we are after.

This leads us immediately to an algorithm for computing a quotient of two

multiprecision integers.

(i) Let Y be the first WORD_BITS bits of the divisor D, so that

B > Y >= B/2.

(ii) Let X be the first word or two words (appropriately shifted as per Y) of

A such that B*Y > X >= Y.

(iii) Let Q0 = X/Y.

(iv) Set R = A - D*Q0 * B^m (for the appropriate m), to get a "remainder".

While R < 0, subtract 1 from Q0 and add D * B^m to R.

(v) Write down Q0 as the first word of the quotient A / D and continue on by

replacing A by R and returning to step (i) to compute the next word of the

quotient, etc.

Another algorithm can be derived as follows. Instead of using Y >= B/2 in the

above algorithm, let's choose Y >= B^2/2. A similar argument to the above

shows that X / (Y + 1) >= (X / Y) - 1 for Y >= B^2/2 and B*Y > X >= Y. It boils

down to showing that (B - 1)*Y / (Y + 1) >= B - 2 for Y >= B^2/2, i.e. that

Y >= B - 2, which is clearly satisfied under our hypotheses.

The algorithm is precisely the same, but at step (iv) we can replace the while

statement with an if statement and perform at most one adjustment to our

quotient Q0 and to R.

Now we return to step (v), in which we said that we could just continue

on from step (i) to compute the next word of the quotient Q. If we do this

and set A = R then compute the new X, what we would like is something like

the divrem1 algorithm we implemented, where, (after possibly some kind of

initial iteration that we handled specially), it is always true that the new

X is two words and has its first word reduced mod Y.

However, this does not follow from 0 <= R < D*B^m, and it may be that the

first word of the remainder is equal to Y! This is due to the truncation of

R to get Y.

It is clear from 0 <= R < D*B^m that if X is the first two words of the

remainder that X <= B*Y. So to make the algorithm continue in the way we'd

like, we only have to deal with the special case where X = B*Y.

We know that in the first algorithm above, the next word of the quotient may

be B - 1 or B - 2, since we know already that it is not B. We must multiply

out and check which it is.

In the second algorithm above, where Y >= B^2/2, the only possibility for

the next quotient word is B - 1, as we know it is not B.

In the next article we will look at implementing the first of the two

algorithms above, leveraging the code we already have for computing and using

a precomputed inverse.

Previous article: v0.13 - muladd1, muladd

Next article: v0.14 - divrem_classical

Friday 1 October 2010

BSDNT - v0.13 muladd1, muladd

We noted in the last article that multiplication need not take a carry-in

and that it doesn't seem related to the linear functions we have been

implementing.

Instead, we think of multiplication in a different way, as an inverse of

division. We'll soon implement division with remainder, i.e. given a and

d, find q and r such that a = d*q + r, where 0 <= r < d

If d and q are of lengths m and n respectively, then r is of length at most

m and a is either of length m + n or m + n - 1.

Multiplication corresponds to the case where r = 0.

As we will certainly not wish to have to zero pad our division out to

m + n limbs if a is in fact only m + n - 1 limbs, it makes sense that our

division will take a carry-in. For this reason, it made sense for our

multiplication to yield a carry-out, i.e. it will write m + n - 1 limbs

and return an m + n - th limb, which may or may not be 0.

When r is not zero in our division, the inverse operation would take a

value r of n limbs and add d*q to it, returning the result as a value a of

m + n - 1 limbs (and a carry which may or may not be zero). We call this

routine muladd. In the case where r and a happen to be aliased, the result

will be written out, overwriting a.

We first implement nn_muladd1 which is identical to addmul1 except that

muladd1 writes its result out to a location not necessarily coincident with

any of the inputs. In other words, nn_addmul1 is an in-place operation

whereas nn_muladd1 is not.

Next we implement nn_muladd_classical. It takes an argument a of length m

to which it adds the product of b of length m and c of length n. The result

may or may not alias with a.

We also implement a version of the linear and quadratic muladds which writes

out the carry, naming the functions with the nn_s_ prefix, as per the

convention initiated in our nn_linear module.

As for multiplication, we don't allow zero lengths. This dramatically

simplifies the code.

An important application of the muladd function will be in chaining full

multiplication. We'll discuss this when we get to dealing with multiplication

routines with better than quadratic complexity.

Our test code simply checks that muladd is the same as a multiplication and

and an addition.

The github repo for this post is here: v0.13

Previous article: configure and assembly

Next article: divrem discussion

Saturday 25 September 2010

BSDNT - configure and assembly improvements

It's time we added architecture dependent assembly support to bsdnt.

Here is how we are going to do it. For each of the implementation files

we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,

e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc.

This file will be #included in the relevant implementation file, e.g.

nn_linear_arch.h will be #included in nn_linear.c.

These _arch.h files will be generated by a configure script, based on

the CPU architecture and operating system kernel. They will merely

include a list of architecture specific .h files in an arch directory.

For example we might have nn_linear_x86_64_core2.h in the arch directory,

which provides routines specific to core2 processors running in 64 bit

mode.

In these architecture specific files, we'll have inline assembly routines

designed to replace various routines in the pure C implementation files

that we have already written. They'll do this by defining flags, e.g.

HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific

version of nn_mul1_c is available. We'll then wrap the implementation of

nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,

the C version will not be compiled.

In order to make this work, the configure script has to work out whether

the machine is 32 or 64 bit and what the CPU type is. It will then link

in the correct architecture specific files.

At the present moment, we are only interested in x86 machines running on

*nix (or Windows, but the architecture will be determined in a different

way on Windows).

A standard way of determining whether the kernel is 64 bit or not is to

search for the string x86_64 in the output of uname -m. If something else

pops out then it is probably a 32 bit machine.

Once we know whether we have a 32 or 64 bit machine, we can determine the

exact processor by using the cpuid instruction. This is an assembly

instruction supported by x86 cpus which tells you the manufacturer, family

and model of the CPU.

We include a small C program cpuid.c with some inline assembly to call the

cpuid instruction. As this program will only ever be run on *nix, we can

make use of gcc's inline assembly feature.

When the parameter to the cpuid instruction is 0 we get the vendor ID,

which is a 12 character string. We are only interested in "AuthenticAMD"

and "GenuineIntel" at this point.

When we pass the parameter 1 to the cpuid instruction, we get the processor

model and family.

For Intel processors, table 2-3 in the following document gives information

about what the processor is:

http://www.intel.com/Assets/PDF/appnote/241618.pdf

However the information is out of date. Simply googling for Intel Family 6

Model XX reveals other models that are not listed in the Intel documentation.

The information for AMD processors is a little harder to come by. However,

one can essentially extract the information from the revision guides, though

it isn't spelled out clearly:

http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides

It seems AMD only list recent processors here, and they are all 64 bit.

Information on 32 bit processors can be found here:

http://www.sandpile.org/ia32/cpuid.htm

At this point we'd like to identify numerous different architectures. We

aren't interested in 32 bit architectures, such as the now ancient Pentium

4 or AMD's K7. Instead, we are interested when 32 bit operating system

kernels are running on 64 bit machines. Thus all 32 bit CPUs simply identify

as x86.

There are numerous 64 bit processors we are interested in:

64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4.

All the 64 bit ones support SSE2 and SSE3 and are based on the netburst

technology.

The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.

They were an enhanced version of the p6 architecture. They get branded as x86

for which only generic 32 bit assembly code will be available, if any.

Core 2's are very common. They will identify as core2. They all support SSE2,

SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we

don't distinguish these at this stage).

Atoms are a low voltage processor from Intel which support SSE2, SSE3 and are

mostly 64 bit. We identify them as atom.

More recently Intel has released Core i3, i5, i7 processors, based on the

Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,

SSSE3, SSE4.1 and SSE4.2.

AMD K8's are still available today. They support SSE2 and SSE3. We identify

them as k8.

AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a.

We identify these as k10. There are three streams of AMD K10 processors,

Phenom, Phenom-II and Athlon-II. We don't distinguish these at this point.

So in summary, our configure script first identifies whether we have a 32 or

64 bit *nix kernel. Then the CPU is identified as either x86, p4, core2,

nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.

Our configure script then links in architecture specific files as appropriate.

The only assembly code we include so far are the nn_add_mc functions we wrote

for 64 bit core2 and k10. As these are better than nothing on other 64 bit

processors from the same manufacturers, we include this code in the k8

specific files until we write versions for each processor. We also add an

nn_sub_mc assembly file for Intel and AMD 64 bit processors.

The configure script includes the architecture specific .h files starting from

the most recent processors so that code for earlier processors is not used

when something more recent is available.

The github branch for this revision is here: asm2

Previous article: v0.12 mul_classical

Next article: v0.13 - muladd1, muladd

Friday 24 September 2010

BSDNT - v0.12 mul_classical

I've been on holidays for a couple of days, hence the break in transmission.

(For academics: the definition of a holiday is a day where you actually

stop working and do nothing work related for that whole day. I realise the

definition is not widely known, nor is it well-defined.)

Note, due to accidentally including some files in today's release that I

intended to hold over till next time, you have to do ./configure before

make check. This detects your CPU and links in any assembly code

that is relevant for it. More on this when I do the actual blog post about it.

It's time to start implementing quadratic algorithms (i.e. those that

take time O(n^2) to run, such as classical multiplication and division).

Before we do this, we are going to reorganise slightly. The file which

is currently called nn.c we will rename nn_linear.c to indicate that it

contains our linear functions. We'll also rename t-nn.c to t-nn_linear.c.

The macros in that file will be moved into test.h.

We also modify the makefile to build all .c files in the current directory

and to run all tests when we do make check. A new directory "test" will

hold our test files.

Finally, we add an nn_quadratic.c and test/t-nn_quadratic.c to hold the

new quadratic routines and test code that we are about to write.

The first routine we want is nn_mul_classical. This is simply a mul1

followed by addmul1's. Of course this will be horrendously slow, but once

again we defer speeding it up until we start adding assembly routines,

which will start happening shortly.

In a departure from our linear functions, we do not allow zero lengths.

This dramatically simplifies the code and means we do not have to check

for special cases.

There does not appear to be any reason to allow a carry-in to our

multiplication routine. An argument can be made for it on the basis of

consistency with mul1. However, the main use for carry-in's and carry-out's

thus far has been for chaining.

For example, we chained mul1's s*c and t*c for s and t of length m and

c a single word in order to compute (s*B^m + t)*c.

But the analogue in the case of full multiplication would seem to be

chaining to compute (s*B^m + t)*c where s, t and c all have length m. But

it probably doesn't make sense to chain full multiplications in this way

as it would involve splitting the full product t*c say, into two separate

parts of length m, which amongst other things, would be inefficient.

It might actually make more sense to pass a whole array of carries to our

multiplication routine, one for every word of the multiplier. However it is

not clear what use this would be. So, for now at least, we pass no carry-ins

to our multiplication routine.

We settle on allowing a single word of carry out. This may be zero if the

leading words of the multiplicands are small enough. Rather than write this

extra word out, we simply return it as a carry so that the caller can decide

whether to write it.

The classical multiplication routine is quite straightforward, but that plus the

rearrangement we've done is more than enough for one day.

The github branch for this release is here: v0.12

Previous article: v0.11 - generic test code

Next article: configure and assembly

Reading, Writing and Arithmetic