Tuesday, 28 February 2017

Parallelising integer and polynomial multiplication in Flint

As I explained in my last two blogs, over the last eight months I have been working on the OpenDreamKit project, funded by the EU in the Horizon2020 initiative. In this blog post I want to discuss the work I've been doing to speed up integer and polynomial multiplication as part of the OpenDreamKit high performance computing and optimisation component.

Flint makes use of a variety of algorithms for integer and polynomial multiplication. For integers, GMP/MPIR is used, which provides the schoolboy algorithm along with the Karatsuba and Toom-Cook algorithms. For polynomials the schoolboy algorithm is used, along with the Karatsuba algorithm and various variants of Kronecker segmentation.

For large integer and polynomial multiplications, however, Flint uses an FFT convolution.

Actually, integer multiplication is reduced to polynomial multiplication. The large integers are broken down into equal sized chunks of bits, which are interpreted as coefficients of a polynomial. The polynomials are multiplied, and the result is again interpreted as a large integer.

Polynomial multiplications of large degree polynomials with small coefficients are also reduced to integer multiplication, by a method known as Kronecker Segmentation, where the polynomials are evaluated at a sufficiently large power of two, the resulting large integers are multiplied, and then the coefficients of the product polynomial are identified in the zero padded output.
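
To illustrate the idea, here is a toy sketch in Julia (not the Flint code; the function name and the restriction to small non-negative coefficients are mine). It evaluates both polynomials at x = 2^b, multiplies the resulting big integers, and reads the product coefficients back out of the zero padded output:

    # Toy Kronecker segmentation: b must be large enough that every coefficient
    # of the product polynomial fits in b bits.
    function ks_mul(a::Vector{Int}, c::Vector{Int}, b::Int)
        A = sum(big(a[i]) << (b*(i - 1)) for i in 1:length(a))   # evaluate at 2^b
        C = sum(big(c[i]) << (b*(i - 1)) for i in 1:length(c))
        P = A*C                                   # one big integer multiplication
        len = length(a) + length(c) - 1
        [Int((P >> (b*(i - 1))) & (big(1) << b - 1)) for i in 1:len]
    end

    # (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    ks_mul([1, 2, 3], [4, 5], 16)                 # == [4, 13, 22, 15]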

The FFT convolution

The FFT convolution has much better asymptotic complexity than any of these methods. It can multiply polynomials in O(d log d) steps, where d is the degree of the polynomial. Compare this with the classical schoolboy method, which takes O(d^2) steps.

At its core, the FFT convolution efficiently multiplies two large polynomials modulo x^(2^n) - 1. This is like polynomial multiplication with wraparound. Every 2^n coefficients, the result is wrapped around to the start. But we can use it to multiply polynomials of degree d if 2d < 2^n so that wraparound of the result doesn't interfere.

To multiply modulo x^(2^n) - 1 the FFT convolution uses an evaluation, pointwise multiplication, interpolation strategy. The polynomials to be multiplied are first evaluated at 2^n-th roots of unity. These 2^n values for each polynomial are multiplied with each of their counterparts from the other polynomial, to yield 2^n "pointwise products" and then the result polynomial is interpolated from these 2^n values.

Notice that the number of roots of unity is always a power of two in the FFT algorithm, so you are always doing multiplication modulo x^(2^n) - 1 for some n. In other words, convolutions always have length a power of two.

The evaluation phase of the FFT convolution is called the forward FFT. In the middle are the pointwise multiplications, i.e. a total of 2^n coefficient multiplications, and then the interpolation phase is called the inverse FFT or IFFT, for short.

Both the FFT and IFFT take O(N log N) operations for a convolution of length N = 2^n, all of which are additions, subtractions and multiplications by roots of unity. The pointwise multiplications are simply coefficient multiplications. For example, if the polynomials being multiplied were over the complex numbers, the pointwise multiplications would be products of complex numbers.
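
As a concrete illustration of this evaluate/pointwise multiply/interpolate structure, here is a minimal Julia sketch over the complex numbers (so floating point, which is precisely what Flint avoids, as explained next; the function names are mine):

    # Minimal radix-2 FFT convolution over the complex numbers.
    function fft(a::Vector{ComplexF64})              # length must be a power of 2
        n = length(a)
        n == 1 && return a
        even = fft(a[1:2:end])
        odd  = fft(a[2:2:end])
        w = [exp(-2im*pi*k/n) for k in 0:div(n, 2)-1]
        vcat(even .+ w .* odd, even .- w .* odd)
    end

    ifft(a) = conj.(fft(conj.(a))) ./ length(a)

    # multiply two integer polynomials modulo x^n - 1 (cyclic convolution)
    function convolve(a::Vector{Int}, b::Vector{Int})
        fa = fft(ComplexF64.(a))                     # forward FFT (evaluation)
        fb = fft(ComplexF64.(b))
        fc = fa .* fb                                # pointwise multiplications
        round.(Int, real.(ifft(fc)))                 # inverse FFT (interpolation)
    end

    # (1 + x)^2 = 1 + 2x + x^2; no wraparound since the result has degree 2 < 4
    convolve([1, 1, 0, 0], [1, 1, 0, 0])             # == [1, 2, 1, 0]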

The Schoenhage-Strassen algorithm

Performing FFTs and IFFTs over the complex numbers is problematic. When implemented on a computer, we have to use floating point arithmetic for this, which can result in rounding errors which can't be effectively controlled.

Instead of working over the complex numbers, the Schoenhage-Strassen algorithm works over another ring with roots of unity, namely Z/pZ where p = 2^(2^k) + 1. Note that in this ring, 2 is a 2^(k+1)-th root of unity. This allows for convolutions up to that length. The advantage is that elements of this ring can be represented exactly, using (multiprecision) integer arithmetic.

Pointwise multiplications become multiplications modulo p = 2^(2^k) + 1, which can also be performed exactly. Thus, the result of such an algorithm is always exactly correct if the inputs are polynomials over the integers, so long as the output coefficients are smaller than p.

The Flint FFT convolution code uses the Schoenhage-Strassen algorithm to multiply large polynomials and integers. However, it incorporates many improvements, which I will now describe.

Efficient FFT butterflies

The basic low level component of the FFT and IFFT algorithms is called a butterfly. The FFT butterfly transforms pairs of inputs as follows

[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]

where z is some power of the 2^(k+1)-th root of unity discussed above. As this root of unity is a power of 2 in the Schoenhage-Strassen algorithm, multiplication by z is just a binary shift to the left or right.

The IFFT butterflies are similar. They take the form

[a{i}, b{i}] => [a{i}+z^(-i)*b{i}, a{i}-z^(-i)*b{i}]

Since the FFT coefficients are multiprecision integers (modulo p), the butterfly operations amount to additions, subtractions and shifts of multiprecision integers of a fixed precision.

To speed these up, we make use of MPIR's low level assembly optimised functions. The sumdiff function performs an addition and a subtraction at the same time, and of course we make use of code for shifting multiprecision integers. In addition, the subtraction modulo p requires complementing the integers in two's complement format.
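
Here is a toy model of the butterflies in Julia, working with big integers rather than the mpn-level limb arrays the real code uses (the parameter choices are mine). The point is that with root of unity 2, multiplication by any power of the root is a shift followed by a cheap reduction using 2^R = -1 mod p:

    const R = 64                       # R = 2^k, so p = 2^64 + 1
    const p = big(2)^R + 1

    # reduce using 2^R ≡ -1 (mod p): split off the low R bits and subtract the rest
    redmod(x) = mod((x & (big(1) << R - 1)) - (x >> R), p)

    mulpow2(x, i) = redmod(x << mod(i, 2R))          # x * 2^i mod p: a shift plus a reduction

    # FFT butterfly:  (a, b) -> (a + b, 2^i * (a - b)), everything mod p
    fft_butterfly(a, b, i) = (mod(a + b, p), mulpow2(mod(a - b, p), i))

    # IFFT butterfly: (a, b) -> (a + 2^(-i) * b, a - 2^(-i) * b); note 2^(2R) ≡ 1 (mod p)
    function ifft_butterfly(a, b, i)
        t = mulpow2(b, mod(-i, 2R))
        (mod(a + t, p), mod(a - t, p))
    end

    a, b = big(123456789), big(987654321)
    x, y = fft_butterfly(a, b, 3)
    ifft_butterfly(x, y, 3)            # == (mod(2a, p), mod(2b, p)): each level contributes a factor of 2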

One of the OpenDreamKit projects we worked on was to speed up these basic assembly operations on modern CPUs, especially those that support AVX instructions.

You can read about the work that was done on this by Alex Kruppa [1] and the various improvements he got over the old MPIR assembly code.

Extended Fermat numbers

Numbers of the form p = 2^(2^k) + 1 are called Fermat numbers. But they limit the coefficients of both our input polynomials and the output polynomial to be no larger than p. To give us more flexibility, we can work with numbers of the form p = 2^(a*2^k) + 1 for some number a. Now 2^a is a 2^(k+1)-th root of unity, but we get to work with much larger coefficients at very little extra cost. Of course the pointwise multiplications are now more expensive, but sometimes this extra cost is worth paying for the extra flexibility we get.

Truncated Fourier Transform

As we noticed above, the length of a convolution is always a power of 2. We are always working modulo x^(2^n) - 1.

This means that if we are multiplying polynomials of degree d, everything is fine until 2d = 2^n. At this point, the output polynomial will become too large and we have to increase n. This leads to ugly steps in performance at power of two lengths.

To get around this, we make use of an algorithm known as the Truncated Fourier Transform (TFT). Truncating the FFT to non-power-of-2 lengths is relatively straightforward. We simply omit the operations which deal with coefficients beyond the desired length.

The pointwise multiplications are also easy: we simply do fewer of them.

But now we have less information going into the inverse FFT than we normally would. We use some clever linear algebra to reconstruct the missing information as the algorithm proceeds. 

By truncating the entire convolution in this way, it is possible to save almost all the work that would be required to do a full, length 2^n transform.

Negacyclic transform

When the pointwise multiplications get very large, there is a trick we can use to perform them using an FFT of half the usual length.

Notice that the pointwise multiplications are done modulo p = 2^(a*2^k) + 1. If we split such a number up into chunks, multiplying modulo p looks almost like a polynomial multiplication modulo x^(2^n) + 1 for some value of x, whereas the FFT convolution naturally computes products modulo x^(2^n) - 1, i.e. with that pesky -1 instead of +1.

To get around this, we multiply x by a suitable root of unity and rescale so that the FFT convolution effectively becomes multiplication modulo x^(2^n) + 1. In other words, we use the FFT wraparound feature to our advantage.

This trick is due to Nussbaumer and saves time whenever the FFT coefficients are themselves large enough to be multiplied using an FFT, or perhaps slightly smaller than that.

The main problem with this approach is the size of the FFT coefficients of that small FFT convolution. It usually isn't quite the right size.

To get around this, it is possible to do a naive convolution where the coefficients are just one or two words in size and use Chinese remaindering to deal with coefficients that are just slightly larger than this Nussbaumer convolution can handle.

Algebraic identities

Another trick can be used when the number a is a multiple of 3 in p = 2^(a*2^k) + 1, specifically when the pointwise multiplications are small.

In such a case, we can use the identity x^{3a} + 1 = (x^a + 1)(x^{2a} - x^a + 1) to break the pointwise multiplications into smaller multiplications modulo x^a + 1 and x^{2a} - x^a + 1 respectively. If those multiplications are small enough to be quadratic time, splitting them up and using Chinese remaindering is actually faster than doing the one large multiplication.

Multiplication modulo x^a + 1 takes time some constant multiple of a^2 and multiplication modulo x^{2a} - x^a + 1 takes four times as long again. The total is some constant multiple of 5a^2.

Doing a single multiplication modulo x^{3a} + 1, on the other hand, would take time equal to some multiple of 9a^2. So splitting things up using identities can be a big win.

The sqrt2 trick

In the ring Z/pZ where p = 2^S + 1, the value w = 2^(3S/4) - 2^(S/4) is a square root of 2. And recall that 2 was a root of unity in the Schoenhage-Strassen algorithm. Thus w is also a root of unity.

Making use of the root of unity w allows us to perform convolutions of twice the length using the Schoenhage-Strassen algorithm. And multiplication by powers of w is not so difficult. Any even power of w is just a power of 2, and any odd power is just the difference of two such powers.
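
A quick sanity check of this identity in Julia (with S = 16 picked arbitrarily; any S divisible by 4 works):

    S = 16
    p = big(2)^S + 1
    w = big(2)^div(3S, 4) - big(2)^div(S, 4)
    powermod(w, 2, p)                  # == 2, so w really is a square root of 2 mod p
    powermod(w, 4S, p)                 # == 1: w is a root of unity of twice the order of 2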

Ordinarily, making the length of an FFT convolution twice as big multiplies the time taken by a factor of four, since the coefficients required to support the larger convolution also have to be twice as large. But the sqrt2 trick allows us to do it for just over twice the cost.

The matrix Fourier algorithm

One of the big problems with the Schoenhage-Strassen algorithm as described above is that it works with a lot of data. There are 2^n multiprecision FFT coefficients being dealt with at any one time. The pattern of access to these coefficients is also not very regular throughout the algorithm, and so once the entire convolution falls out of cache, performance can be very bad.

The performance degradation can be reduced to some degree by switching to a recursive implementation of the FFT, which quickly gets down to blocks of coefficients that fit in cache. But it doesn't mitigate it entirely.

One way of resolving this is the matrix Fourier algorithm. Instead of viewing the input as a linear array of coefficients, it breaks the single large FFT down into many smaller FFTs, as though the input coefficients were the rows and columns of a matrix.

Two passes over the data are required, in order to handle lots of row FFTs and lots of column FFTs. But the overall saving is dramatic.

Of course, combining this trick with all the other tricks above is really a lot of work. But we do this in Flint.

In particular, we break the coefficients of the truncated Fourier transform up in such a way that each of the row and column FFTs has no wasted space.


Now we have an obvious target for parallelisation. As we have broken the single large FFT into many smaller independent ones, these can be sent to different threads to perform.

Of course, nothing is ever that simple. We must ensure that the temporary space allocated throughout the algorithm is not overwritten by different threads. The algorithm is also more efficient if some of the inward column FFT transforms, for example, are combined with the relevant pointwise multiplications and outward IFFT column transforms.

All of this also requires all the coefficients to be in memory in the right order, so that memory access is not complicated. Fortunately there are two kinds of FFT and IFFT, namely decimation in time and decimation in frequency. Picking the correct kind at the right time allows things to be managed more or less efficiently.

The matrix Fourier algorithm also requires multiplying by some additional roots of unity at the right points, and these need to be merged into one layer of the FFTs and IFFTs so that they don't become an additional cost.

But all complications aside, the matrix Fourier algorithm is both cache efficient and a good target for parallelisation.

As part of my OpenDreamKit work, I parallelised the Flint Schoenhage-Strassen code in precisely this way, maintaining all of the above optimisations. I made use of OpenMP for this.

The results were quite good, with up to a 5.5x speedup on 8 cores for large integer multiplications (I also parallelised splitting the large integer up into FFT coefficients).

You can see timings and a high level summary of the work that I did in the writeup I did for the final deliverable for this task of the OpenDreamKit project [1].

[1] https://github.com/OpenDreamKit/OpenDreamKit/issues/120

Friday, 24 February 2017

Assembly superoptimisation in MPIR

As I mentioned in my previous blog, I have been involved in the OpenDreamKit project, funded by the EU through their Horizon2020 programme.

Today I want to blog about a project I was involved with here at the University of Kaiserslautern, but which was carried out by two people we hired, namely Alex Best and Alex Kruppa (we made the joke that we are only hiring Alex's on H2020 and referred to them as Alex v1.0 and Alex v2.0).

About MPIR

MPIR stands for Multiple Precision Integers and Rationals. It is a fork of the GMP library [1] for multiple precision arithmetic for integers, rationals and floating point numbers.

GMP and MPIR consist of three main components: (i) assembly optimised bignum arithmetic, (ii) highly optimised bignum algorithms implemented in C, (iii) high level interfaces.

The way MPIR works is to provide assembly implementations of low level arithmetic primitives, such as addition, subtraction, shifting, multiplication and many other similar things, for each new microprocessor architecture that comes out.

You may have heard of Intel's tick-tock cycle. They bring out a completely new microarchitecture in their tock cycle, and they shrink it and optimise it in their tick cycle. Every year or so, there is a new tick or tock.

Starting in 2014, they introduced a new phase called their refresh cycle. So it's now tick-tock-refresh, since it is getting to be too complicated to do new microarchitectures every two or three years.

What this means for MPIR is that every year or so there is a new tick, tock or refresh for Intel, and similar for AMD, that needs support at the assembly level.

Over the years there have been many new instruction set technologies that have had to be supported, such as X86_64, MMX, 3DNOW!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512, BMI, BMI2.

It's surprising to many people that the difference between bignum code output by an optimising compiler like gcc and handwritten assembly code can be a factor of 4-12 in performance. Of course, if you already have assembly code for a prior, related architecture, the improvement you can get with handwritten assembly code is much less than this. However, the difference accumulates with time, as you forgo more and more potential improvements and support for more and more new instruction sets and technologies.

We made the case in the OpenDreamKit grant that this ongoing maintenance to support new instruction sets requires investment to keep up with the latest processor technology. Each new microarchitecture requires as much as 3-6 months full time work to support!


In addition to writing new assembly code to support each new microprocessor iteration, one can use superoptimisation to get up to another factor of two difference in performance (though often much less).

Superoptimisation takes already written assembly code and explores all valid reorderings of the assembly instructions, i.e. those that don't change the behaviour of the code, looking for the fastest reordering.

As typical assembly sequences can be dozens of lines long, this cannot be done by hand. There can be billions of valid reorderings.
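
To get a feel for the combinatorial explosion, here is a toy Julia sketch that enumerates every ordering of a (made up) list of instructions consistent with its data dependencies. A real superoptimiser such as AJS additionally has to work out the dependencies from the code itself and time each candidate:

    # deps[i] lists the instructions that must come before instruction i
    function orderings(deps::Vector{Vector{Int}})
        n = length(deps)
        results = Vector{Vector{Int}}()
        placed = falses(n)
        current = Int[]
        function rec()
            if length(current) == n
                push!(results, copy(current))
                return
            end
            for i in 1:n
                if !placed[i] && all(placed[j] for j in deps[i])
                    placed[i] = true; push!(current, i)
                    rec()
                    placed[i] = false; pop!(current)
                end
            end
        end
        rec()
        results
    end

    # four instructions where 1 must precede 3 and 2 must precede 4
    length(orderings([Int[], Int[], [1], [2]]))      # == 6 valid reorderings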

The reason this kind of rearrangement can make a difference is because the CPU uses a very simple algorithm to determine which instructions to put in which pipeline. There are also limitations on how many of certain types of instructions can be executed at the same time, e.g. because of a limited number of integer ports, etc.

By rearranging the assembly instructions, we can sometimes cause the CPU scheduler to put the instructions in the pipeline in just the right order that resources are used optimally.

If an assembly function, like a multiplication routine, is going to be used quadrillions of times, it is certainly well worth trying to get an optimal ordering, since this will save a large amount of CPU time for a lot of people.

The AJS Superoptimiser

Alex Best was the first of the two Alex's to work on writing a superoptimiser for MPIR that supported the recent AVX and BMI instruction sets.

He began with an assembly JIT (Just-In-Time) library [2] written by Petr Kobalicek and improved it for use with MPIR, e.g. by supporting yasm syntax and removing numerous limitations.

On top of this, he wrote a superoptimiser called AJS [3, 6] which cleverly works out which reorderings of a segment of assembly code will be valid and times them all to find the optimal one.

AJS is tailored to work with MPIR assembly functions, but it could be adapted by a sufficiently talented individual to work with other projects if desired.

AJS takes a large number of command line options which control things such as which lines of assembly code should be reordered, how the output is to be written, how timing is to be done, etc.

After six months of work, Alex Kruppa took over AJS development. The basic superoptimiser worked, modulo some bugs, but it still couldn't be used because of an unexpected problem we encountered.

In the good ole days, getting cycle accurate timing was trivial. x86 architectures had an instruction for this. But over time, CPUs have become more and more complex, and the demands on them have become greater. We don't know whether gamers are to blame, or patent trolls, or Intel engineers, but cycle accurate timings these days can only be obtained with a great deal of trouble.

It literally took 3 or so months to solve the problem of getting cycle accurate timings on Intel processors. Some combination of fiddling with hyperthreading, address space layout randomisation, kernel options, frequency scaling, performance counters, kernel modules, stack address manipulation, SSE to AVX switching and various other tricks later, we finally got more or less cycle accurate timing on some machines we were superoptimising for.

After this, Alex Kruppa was able to superoptimise for two recent Intel microarchitectures, namely Haswell and Skylake.

He also did some optimisation (though not superoptimisation) for an older AMD architecture called Bulldozer.

As this was all more work than could be accomplished in the 12 months total that we had available, we also relied heavily on outside volunteer effort. We are very thankful to Jens Nurmann in particular who was able to supply handwritten Skylake assembly for many functions which were easy to convert to the MPIR interface.

Brian Gladman also helped to modify these functions so they would work on Windows (which uses a different ABI, i.e. functions receive their arguments in different registers on Windows).

In some cases, GMP already had very fast or optimal assembly code for these processors, and where our internal interface is the same as theirs, we were able to use some of their functions in MPIR.


The result of all this effort is a new version of MPIR which will be released in a couple of days, with much faster assembly optimised functions for Haswell, Skylake and Bulldozer architectures.

We are also in the process of doing some optimisation for Broadwell, the tick to Haswell's tock.

You can see tables of all the performance improvements that were obtained for Haswell, Skylake and Bulldozer on the final writeup for the OpenDreamKit project here [4].

As can be seen, even over the previous assembly code that was being used for these architectures (which had been written for earlier, but related microarchitectures), we obtain as much as 20 or 30 percent improvement. This represents a real-world speedup that one can expect to see in most calls to MPIR on those architectures.

Future work

Of course, we'd like to do much more for the three architectures we optimised for. There wasn't time to speed up division functions, for example, or Montgomery REDC, which are all assembly optimised in MPIR.

And there are the Excavator, Steamroller, Broadwell, Kaby Lake, Zen, Cannon Lake and other architectures still to go.

Volunteers are of course welcome. We need assembly experts who are interested in writing new assembly code, superoptimising it and maintaining support in MPIR for new architectures as they arise. If you can help, please volunteer on our development list, which you can find on our website [5].

[1] https://gmplib.org/
[2] https://github.com/asmjit
[3] https://github.com/alexjbest/ajs
[4] https://github.com/OpenDreamKit/OpenDreamKit/issues/118
[5] http://mpir.org/
[6] https://github.com/akruppa/ajs

Thursday, 23 February 2017

Integer factorisation in Flint (OpenDreamKit)

For the past eight months or so, I've been working part time as a contributor to the OpenDreamKit project [1], funded by the EU through their Horizon2020 initiative.

One of the deliverables we had for that project was a parallel version of the quadratic sieve for factoring large integers (typically in the 20-90 decimal digit range).

The executive summary is that this work has now been committed to the Flint repository [2] and is usable (see the qsieve_factor function in the qsieve module).

The starting point was an old implementation of the quadratic sieve in Flint which only worked for small integers, and which was very fragile, written mostly by me.

A Google Summer of Code student, Nitin Kumar, did some preliminary work on it. I've finished off that implementation and parallelised it, tuned it and made it more robust, as part of the OpenDreamKit project.

Factoring integers

There are many practical algorithms for factoring integers:
  • trial division
  • Euler's method
  • Fermat's method
  • One line factor
  • Lehman's algorithm
  • p-1 method
  • p+1 method
  • Elliptic Curve Method (ECM)
  • Pollard's rho
  • Quadratic sieve
  • General number field sieve
  • Special number field sieve
In the range 20-90 decimal digits, there are two methods that one wishes to use. If the number has small factors that were not found by trial division or Pollard's rho, there is the ECM method. But if one is fairly sure that all that remains are two large prime factors, i.e. n = pq, then one usually makes use of the quadratic sieve method.

Difference of squares

As for many factoring methods, the quadratic sieve ultimately expresses the number n to be factored
as a difference of squares n = a^2 - b^2 = (a - b)(a + b).

As can be seen from the difference of squares formula, this factors n if a - b is not 1.

The quadratic sieve makes use of a variant of this idea, namely to find two integers a and b such that a^2 = b^2 mod n.

If a and b are not congruent modulo n (and a is not congruent to -b modulo n), then we have (a - b)(a + b) = wn for some integer w, and neither a - b nor a + b is itself divisible by n, so taking greatest common divisors with n yields a proper factor.

This doesn't guarantee we factor n, but about half the time it will do so.

Dixon's method

Of course, trying random values a and b is impractical, but there is a way to make it work.

Instead of looking for values a such that a^2 is a square of another integer b modulo n, we look for many values of a such that a^2 has lots of small prime factors. We then multiply these together to build up a square.

For example, if

a1^2 = 2*3*7^2 mod n
a2^2 = 2*7 mod n
a3^2 = 3*7 mod n

then (a1*a2*a3)^2 = (2*3*7^2)^2 mod n, and we can take a = a1*a2*a3 and b = 2*3*7^2. Then we will have a^2 = b^2 mod n as required.

Expressions such as the ones above are called relations, and the small primes that appear on the right side of those relations make up what is called a factor base.

The idea is to pick a range, say [2, 10000] which the factor base primes are allowed to be in, and then try random values ai^2 modulo n to see if they factorise fully over the factor base.

Note that if all the values a1, a2, etc are taken in the range [sqrt(n), sqrt(2n)) then their square will be in the range [n, 2n), which means that their squares can be reduced modulo n simply by subtracting n.

In other words, we consider the values a1^2 - n, a2^2 - n and so on.
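
Here is a toy sketch in Julia of this relation hunting done by straightforward trial division (the numbers and names are mine, and a real implementation sieves rather than trial dividing everything, as described next):

    # Keep values a just above sqrt(n) for which a^2 - n factors completely
    # over the factor base.
    function find_relations(n, factor_base, how_many)
        relations = Tuple{BigInt, Dict{Int,Int}}[]
        a = isqrt(n) + 1
        while length(relations) < how_many
            v = a^2 - n
            fac = Dict{Int,Int}()
            for p in factor_base
                while v % p == 0
                    fac[p] = get(fac, p, 0) + 1
                    v = div(v, p)
                end
            end
            v == 1 && push!(relations, (a, fac))     # a^2 - n is smooth: a relation
            a += 1
        end
        relations
    end

    # 2041 = 13 * 157; the first three relations are found at a = 46, 47 and 49,
    # where a^2 - n = 75, 168 and 360 respectively
    find_relations(big(2041), [2, 3, 5, 7, 11, 13], 3)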

The quadratic sieve

The quadratic sieve is a practical improvement on Dixon's method.

Firstly, we note that if a1^2 - n is divisible by the factor base prime p, then so is (a1 + p)^2 - n. Secondly, we note that there is only one other value i less than p for which (a1 + i)^2 - n is divisible by p.

In short, if I want to find all the values i in some range [-M, M] say for which i^2 - n is divisible by a factor base prime p, I just need to find the two values i in the range [0, p) which work, and all the others will be a multiple of p away from those two.

This leads to a useful algorithm that can be implemented on a computer. First find the two square roots of n modulo p, to get the initial two values of i, then set an array up in memory, with all the values in the range [-M, M]. From the two initial values of i, stride off by steps of p in each direction and mark off all the other values of i for which i^2 - n will be divisible by p.

If we do this for some big array V = [-M, M] for each of the factor base primes p in the factor base range, F = [2, N] say, then we can just look at all the entries in the array V and see which ones correspond to values that are highly divisible by factor base primes.

In other words, we can find lots of values a1, a2, a3, etc, in V = [-M, M] such that a1^2 - n, a2^2 - n, a3^2 - n are fully divisible by factor base primes.

This process of striding off by steps of p in a big array V = [-M, M] to cheaply mark off all entries corresponding to values divisible by p is known as sieving.
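
A toy version of this sieving might look as follows in Julia (illustration only: a real implementation finds the square roots with Tonelli-Shanks rather than brute force, also sieves with prime powers, works with rough byte-sized logarithms, and trial divides the surviving candidates):

    function sieve_candidates(n::Int, factor_base, M::Int)
        x0 = isqrt(n) + 1                  # sieve x = x0, x0 + 1, ..., x0 + M - 1
        sieve = zeros(Float64, M)
        for p in factor_base
            for r in 0:p-1
                mod(r^2 - n, p) == 0 || continue     # r is a root of x^2 ≡ n (mod p)
                i = mod(r - x0, p) + 1               # first entry of the array hit by p
                while i <= M
                    sieve[i] += log(p)               # stride through the array in steps of p
                    i += p
                end
            end
        end
        tol = 2.6    # crude slack for the prime powers this toy never sieves with
        [x0 + i - 1 for i in 1:M if sieve[i] >= log((x0 + i - 1)^2 - n) - tol]
    end

    # includes x = 46, 47, 49 and 51, whose values x^2 - n = 75, 168, 360 and 560
    # are smooth; candidates are then confirmed by trial division
    sieve_candidates(2041, [2, 3, 5, 7, 11, 13], 50)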


There are many, many improvements that can be made to the naive quadratic sieve above. We list some of them here:

Multiple polynomials

The naive sieve essentially looks for solutions to x^2 - n = 0 mod p, where p is a factor base prime. Note that if x is just a little larger than sqrt(n), the value x^2 - n will be very small. But as soon as x gets too big, x^2 - n will also be large. The probability that it completely factors over the factor base F = [2, N] diminishes, and it becomes harder and harder to find relations.

To get around this, we use polynomials other than x^2.

There's a few different methods of doing this. The first method is to choose only those values of x for which x^2 - n is divisible by a known prime q, say. These lie in an arithmetic progression.

We choose q to be larger than the factor base primes. Although this makes the corresponding values of x^2 larger, we know in advance that q^2 will be a factor, which we can remove, bringing things down to about the same size as for the naive x^2 - n case.

By switching the value of q, we get to look for relations amongst fairly small values, for each different value of q.

A second technique is to use polynomials (Ax + B)^2 - n for different values of A and B.

If we choose the values A and B such that n = B^2 - AC for some integer C, then the polynomials become (Ax + B)^2 - n = A^2x^2 + 2ABx + B^2 - B^2 + AC = A(Ax^2 + 2Bx + C). In other words, we are guaranteed a factor of A that we can remove.

The game then becomes finding relations amongst small values of Ax^2 + 2Bx + C.
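
A quick numeric check of this identity in Julia (the particular values of A, B and n below are arbitrary, subject only to A dividing B^2 - n):

    A, B, n = 20, 9, 10001
    C = div(B^2 - n, A)            # n = B^2 - A*C by construction
    all((A*x + B)^2 - n == A*(A*x^2 + 2*B*x + C) for x in -10:10)    # true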

There are heuristics for exactly how large the coefficient A should be to give the optimal speed in the algorithm.

Knuth-Schroeppel method

Sometimes it turns out to be more efficient to find a factor of kn for some small multiple k than for the original number n we are trying to factor. The Knuth-Schroeppel algorithm helps find such a multiplier k.

To avoid just finding a bad factorisation, we always take the greatest common divisor with n instead of kn in the final step of the algorithm. For example, if we have kn = a^2 - b^2, we take the greatest common divisor of a - b with n to avoid getting an extraneous factor of k in our final factorisation.

Self initialisation (hypercube method)

In switching the polynomials Ax^2 + 2Bx + C mentioned above, we have to find solutions to Ax^2 + 2Bx + C = 0 modulo p for each factor base prime. This can be quite expensive every time we switch polynomial.

The self initialisation method saves additional time by giving us many cheap switches of B for each value of A that is used.

The way this is done is to make the value A a product of small factor base primes. If there are s such factor base primes, then it turns out that there are 2^s values of B that can be used and which are cheap to compute. Moreover, the solutions modulo p that we need to compute for each factor base prime p can be computed from one another very cheaply, using some precomputed information.

Large prime variant

We can allow partial relations which are a product of factor base primes and one larger prime. We compute many of these partial relations, and whenever two are found with the same large prime, we multiply the partial relations together, which gives us the equivalent of one full relation.

With some graph theoretic techniques, one can also allow two large primes per partial relation. However, this technique is only useful for very large factorisations. It is even possible to use three large primes, but this is typically only useful on very large clusters for extremely large factorisations.

Small prime variant

It turns out that a nontrivial proportion of the time spent sieving is used to sieve with very small factor base primes p. These hit the sieve much more often than larger primes and so take a disproportionate amount of time for the value they contribute to the sieving process.

The small prime variant skips sieving with these primes and only checks divisibility by these primes for factorisations that are very close to being relations.

Block Lanczos linear algebra

The problem of finding a combination of relations to multiply together to give a square can be solved by representing the problem as a linear algebra problem over GF(2). If a prime occurs to an odd power in a relation, the appropriate location in the GF(2) matrix is set to 1, otherwise it is set to 0. The problem then becomes one of looking for kernel vectors of this matrix over GF(2).

The Block Lanczos algorithm can be used to find such kernel vectors quite efficiently.
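
The following Julia sketch shows the idea using dense Gaussian elimination over GF(2) instead of Block Lanczos (fine for a toy, hopeless for the large sparse matrices that occur in practice); each relation's exponent vector mod 2 is packed into the bits of a word:

    # rows[i] has bit j set if the j-th factor base prime occurs to an odd power
    # in relation i; returns a bitmask of relations whose product is a square.
    function square_combination(rows::Vector{UInt64})
        n = length(rows)
        combos = [UInt64(1) << (i - 1) for i in 1:n]   # which relations were added in
        for i in 1:n
            rows[i] == 0 && return combos[i]           # all exponents even: a square
            pivot = trailing_zeros(rows[i])
            for j in i+1:n
                if (rows[j] >> pivot) & 1 == 1
                    rows[j] ⊻= rows[i]                 # add row i to row j (mod 2)
                    combos[j] ⊻= combos[i]
                end
            end
        end
        nothing                                        # no dependency; collect more relations
    end

    # the relations a1^2 = 2*3*7^2, a2^2 = 2*7, a3^2 = 3*7 from the earlier example,
    # with bit 0 <-> prime 2, bit 1 <-> 3, bit 2 <-> 7
    square_combination(UInt64[0b011, 0b101, 0b110])    # == 0b111: multiply all three relations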

Bradford-Monagan-Percival method

This method is similar to the self-initialisation method, except that the coefficient A of the polynomial is chosen to have one large factor q, which is increased every time the polynomial is switched. This helps avoid duplicate relations in an easy way and lets one use the same product of small primes over and over in the coefficient A.

Carrier-Wagstaff method

This is another technique designed to help with the selection of the coefficients A of the polynomials. The possible factor base primes that could make up the factors of the coefficient A are divided into two sets, those with odd indices and those with even indices in the array of factor base primes.

All but one of the factors is taken from the first set, then the final prime is chosen from the other set such that the product gives the most optimal value of A.

Other tricks

There are many other tricks that can be used to cut the time used to test whether something that looks like a relation really is a relation. There are also lots of strategies for combining large prime partial relations into full relations.

There are also numerous tricks associated with the processor cache. Rather a lot of precomputed information is stored for each factor base prime. The sieve interval itself may also be larger than the processor cache. Things can be dramatically sped up by breaking such large blocks of information into smaller blocks and processing small blocks at a time. Each small sieve block is processed against each small block of factor base primes, and so on.

These days, there are also many opportunities to use SIMD intrinsics, e.g. to search the sieve interval for highly composite values which might be relations.

We also make use of precomputed inverses to do divisions and reductions modulo small values, e.g. by factor base primes.

Parallelising the relation sieving

The relation sieving in the quadratic sieve can be relatively easily parallelised. As the sieving per polynomial is largely independent of the sieving for any other polynomial (with the exception of the file handling for the large prime partial relations), it is quite easy to parallelise the sieving portion of the quadratic sieve.

In Flint we use OpenMP to parallelise the relation sieving. We combine the Carrier-Wagstaff and Bradford-Monagan-Percival (and of course the self-initialisation) methods to create polynomials that can be switched easily without too many duplicate relations. We then sieve for different polynomials in different threads.
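
Schematically, the parallel relation sieving looks something like the following Julia sketch (the real code is C with OpenMP, and sieve_one here stands in for the per-polynomial sieving routine):

    using Base.Threads

    # sieve_one(poly) returns the relations found for a single polynomial
    function sieve_all(polynomials, sieve_one)
        results = Vector{Any}(undef, length(polynomials))
        @threads for i in eachindex(polynomials)
            results[i] = sieve_one(polynomials[i])   # each polynomial is sieved independently
        end
        reduce(vcat, results)                        # merge all the relations at the end
    end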


There are many points at which the quadratic sieve can either abort early, or fail, especially if tuning values aren't set quite right. In our Flint implementation, if we fail to find a factor of a number n, we restart the sieve with a larger factor base. This shouldn't ever happen in practice if the tuning values are set just right, but it is possible in theory.

We also increase the size of the factor base if something goes wrong, for example, we ran out of polynomials before generating enough relations, and so on. Again, these problems should not happen in practice, but they do happen if wildly incorrect tuning values are supplied.


The code to implement all of the above is now merged into Flint and has been thoroughly tested and tuned for numbers up to about 90 digits.

In summary, the implementation is relatively competitive from about 130 bits onward.

Some tables giving timings for one and multiple threads are given in our final report for the EU [3] which is included at the start of the GitHub issue for that project.

Future work

It takes about 10 years to really implement a world class quadratic sieve implementation. We would have a long way to go if our aim was to have world beating performance. But for a general purpose library, what we have right now is actually pretty reasonable. There are lots of things we could improve. These include:

  • Further cache efficient handling of large arrays in the sieve
  • Implementing double and triple large prime variants
  • Parallelising the file handling for large relations
  • Implementing a quadratic sieve which handles small factorisations (50-130 bits) efficiently

[1] http://opendreamkit.org/
[2] https://github.com/wbhart/flint2
[3] https://github.com/OpenDreamKit/OpenDreamKit/issues/119

Monday, 9 January 2017

Singular and Julia

In my previous blog, I mentioned the new computer algebra system, Oscar that is being developed [1].

Later this week, a dozen or more Oscar people are converging in Kaiserslautern to discuss the next steps in the part of Oscar that is making use of the programming language Julia. Specifically, we will be discussing some of the technical issues in integrating Polymake, Singular and GAP with the Antic "cornerstone", which is written entirely in Julia and C.

One of the agenda items is a demonstration of a new Julia package which Oleksandr Motsak and I have written, called Singular.jl [2]. It's part of a two pronged strategy with regard to the integration of Singular into Oscar.

The first part of that strategy will be to write an interpreter for the Singular programming language in Julia, so as to take advantage of JIT compilation in Singular. The second part of the strategy is Singular.jl, which I'll describe in detail here.

Singular.jl consists of two parts. The first is a low level interface to Singular [3]. It makes use of Julia's ability to interface to C++ (even templated code) in real time [4]. Julia has the ability to mix C++ code with Julia code and have it all compile down to LLVM instructions for execution by the JIT. It makes use of the Clang library to compile arbitrary C++ code.

This low-level interface to Singular preserves Singular type and function names, and does not make use of garbage collection at the Julia level. Calling a Singular kernel function in this way should have no significant overhead as compared with calling it in a C++ program. You can see this interface layer in the libsingular directory of Singular.jl.

On top of this low-level interface is a garbage collected high level interface, which is compatible with the Nemo generics system. In this way, we make Singular functionality, including the Groebner basis engine, directly accessible to Nemo [5] and the other parts of the Julia/Antic ecosystem.

Nemo provides various generic mathematical structures and algorithms, e.g. dense linear algebra, univariate polynomials (and more recently multivariate), residue rings, fraction fields, power series and so on. By implementing the interface required by Nemo, other packages can make use of all this generic functionality over ring implementations provided by those packages (hardly a novel concept for computer algebra systems, but an important one nonetheless).

Eventually we will also allow Singular to build polynomial rings over rings and fields provided by Nemo and other systems. Singular has this capacity, due to work of Hannes Schoenemann and others, and Oleksandr Motsak already prototyped making Nemo rings available as coefficient rings to Singular as part of the due diligence we did for the big grant we applied for.

So far in Singular.jl we have high level Julia wrappers for Singular integers, rationals, prime fields, integer residue rings, finite fields, polynomials, ideals, modules, vectors, matrices and resolutions.

It's time to see Singular.jl in action. Here is how it looks when Singular.jl is started up. It has a dependency on Nemo, so you see the Nemo.jl welcome message.

We start with creating the Singular integer ring with a Nemo residue ring over it. This is specialised by Singular.jl to actually make use of a Singular residue ring behind the scenes.

Now let's create a Singular polynomial ring, some polynomials and do a computation with them. Because I didn't specify an ordering, it defaults to the Singular dp ordering (degrevlex) and the ordering, C, for module generators.

One can specify an optional parameter (ordering) to change the polynomial ordering. (Later on, we may separate the module ordering so that it is not needed unless creating module elements. At present, a suitable default is selected automatically.)

Next, we create some ideals over this polynomial ring, count the number of generators and manipulate the generators, using Julia array syntax (which can be overloaded for any purpose).

Next, let's create a Singular ideal and take its standard basis (Groebner basis), using the Singular std command. Singular can compute Groebner bases over all the basic coefficient rings that it supports internally. (The functionality is there to compute them over externally supplied coefficient fields, though it is considered less well tested at this point.)
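
For readers without the screenshots, a session along these lines looks roughly as follows (typed from memory, so the exact names and printing of the released Singular.jl may differ slightly):

    using Singular

    R, (x, y) = PolynomialRing(Singular.QQ, ["x", "y"])   # dp (degrevlex) by default

    I = Ideal(R, x^2 + x*y + 1, 2*y^2 + 3)
    G = std(I)                        # standard (Groebner) basis computed by Singular
    ngens(G), G[1]                    # inspect the generators using Julia array syntax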

The next most logical thing to do is to compute the syzygy matrix of an ideal. This displays by default as a linear combination of module generators, but it is also possible to convert to a matrix representation.

One of the things we need to improve in the near future is the formatting for the matrix printing. For now, we just did the easiest thing in Singular.jl.

Of course, one can do basic ideal arithmetic.

The next thing one might wish to do is compute resolutions. At present, only the Schreyer resolution is available in Singular.jl. We'll add La Scala and minimal resolutions soon.

One can of course access all the modules in the resolution, precisely as one would expect.

And of course, doing arithmetic on module elements works just as expected.

And naturally, one can compute standard bases and resolutions of modules.

Finally, let's have a demonstration of a computation using the new generic multivariate polynomial code in Nemo (only available in the mpoly development branch at present), over a Singular coefficient ring. Here Singular is not computing the GCD, but Nemo is. And of course it is faster than the native functionality in Singular itself.

That's all for now. In coming blogs I'll talk more about the Singular interpreter and make updates about Oscar development as information comes to hand.

[1] https://wbhart.blogspot.de/2016/11/new-computer-algebra-system-oscar_20.html
[2] https://github.com/wbhart/Singular.jl
[3] https://www.singular.uni-kl.de/
[4] https://github.com/Keno/Cxx.jl
[5] https://github.com/wbhart/Nemo.jl

Sunday, 20 November 2016

New Computer Algebra System : OSCAR

On Friday we found out the exciting news that the German funding agency (DFG) has approved a massive transregional grant that the computer algebra group here at the University of Kaiserslautern has been heavily involved with. There was much celebration in the Department of Mathematics where I work when we received the news!

The official announcement was made on the DFG website on Monday. If you can read German, it is here.

The grant is called "Symbolic Tools in Mathematics and their Application" and is a collaboration between multiple universities here in Germany: Aachen, Kaiserslautern and Saarbrücken, with additional external people at Siegen, Berlin, Stuttgart and Tübingen, with a total of about twenty extremely experienced PIs. The grant is initially for four years, but can be renewed up to twice, for a total of 12 years (assuming we get renewed).

The exciting news from my point of view is project B1 of the grant, which we have internally been calling project B1 : Central Software Project. This is a collaboration to produce a visionary new Open Source computer algebra system. It will be called OSCAR : Open Source Computer Algebra Resource and is to be built on four "cornerstones", namely the Gap System, Singular, Polymake and ANTIC (the internal name for a new project I've been involved with, under Claus Fieker; see below).

The Principal Investigators

Before I get to any details, in no particular order, let me introduce the PI's for the Central Software Project.

Prof. Mohamed Barakat is behind the HomAlg project. HomAlg is a GAP package that allows one to do homological algebra from a categorical point of view. (You may also know Mohamed Barakat's students Sebastian Gutsche and Sebastian Posur, who worked on HomAlg/CAP.)

Dr. Frank Lübeck is one of the main authors of the Chevie project, which is a computer algebra project for computing with generic character tables of Lie groups, Coxeter groups, Iwahori-Hecke algebras and others. It makes use of the GAP system. Over the years, Frank has been a major contributor to GAP.

Prof. Michael Joswig is one of the main authors of the Polymake system. This is an Open Source project for research into polyhedral geometry. Michael Joswig is the Einstein Professor for Discrete Mathematics/Geometry at TU Berlin.

Prof. Claus Fieker is one of the original authors of the KASH/KANT system, for algebraic number theory, which became part of Magma. He is also one of the main authors of Hecke, a new Open Source algebraic number theory project based on Nemo. (Some people may also know his postdoc, Tommy Hofmann, who is a coauthor with him of Hecke and of Nemo.)

Prof. Wolfram Decker is the coordinator of a prior Priority Project funded by the DFG on Computer Algebra, which is just concluding. Wolfram Decker is one of the people who is directly responsible for directing the Singular project and is a key member of the Algebra, Geometry and Computer Algebra group in Kaiserslautern.

In this blog, I'm only going to speak about the Central Software Project. But it is important to realise that the grant is much, much larger than this one project (it is just one of about 23 funded, interrelated projects in the grant). The Central Software Project should be thought of as providing the technical infrastructure for a very large mathematics grant in the area of computer algebra, although there are other projects within the grant that will contribute in a significant way to OSCAR. I'll maybe talk about some of them in later blogs.

The entire grant application had more than 100 pages in the preproposal and 400 pages in the full proposal, and took years to put together! At the final referee stage, the grant had 10 eminent referees, most from overseas.

What is OSCAR?

So what is OSCAR all about? Again, I can only speak about it in terms of what I understand from the PI's, and my knowledge gets greater the closer you get to the bit I will be directly involved with.

The basic idea is to develop a new Computer Algebra System, covering group theory, algebraic geometry/commutative algebra, polyhedral geometry and number theory. Our aim is to directly compete with the closed source system, Magma, hence the focus on many areas where Magma has been particularly influential.

All of the cornerstone systems (except ANTIC, which is quite new) have developed significant projects and communities in their own right, over decades, and it is our hope to combine the strengths of all these systems into a single system, which is highly integrated and easy to use. And we hope to do all this without significant disruption to those existing ecosystems. OSCAR will be a single system, which builds on all four of those existing systems and which tightly integrates those systems.

One major motivation for OSCAR is mathematical applications where it is necessary to do cross-disciplinary research, for example, where one might need a multigraded, equivariant Cox ring of a toric variety over a number field, or graphs of groups in division algebras, or matrix groups over a polynomial ring over a number field. To perform such exotic calculations, it is quite often not enough to simply interface systems such as GAP, Singular and ANTIC which separately provide either matrix groups, polynomials rings or number fields, say. One needs a very tight integration, especially if one wants such computations to be practical.

Some of us believe that it is the monolithic integration of Magma that has allowed it to be successful in the areas we hope to impact. Therefore, we in the Open Source community need to get really smart about integration in order to compete.

I personally will be working on both low and mid-level integration of the four cornerstone systems, using the programming language Julia. But there will be other efforts too, for example, there will be a coordinated serialisation effort, a parallelism project, new packages, interfaces and technologies in the cornerstone systems, and so on.

There will be a number of people other than myself hired to work on OSCAR. Let me mention two that will be associated with the Central Software Project specifically.

Dr. Reimer Behrends, who shares an office with me, and who is one of the main architects and authors of HPC-GAP and HPC-Singular will be working on aspects of parallelisation in OSCAR.

As mentioned already, Sebastian Gutsche will be working on both GAP and Polymake integration with the other cornerstones. He will be doing this using Julia, and in other ways.

How am I involved in this?

So, the rest of what I'm going to write is just about the part I have been involved with. The OSCAR website when it goes live will be the best source of information on the direction the overall project is going.

Some of you may have heard of my work on the Nemo project (joint work with Claus Fieker, Tommy Hofmann, Fredrik Johansson and others). This is a computer algebra package written in the Julia programming language. The main idea behind this, from my perspective, is to demonstrate two things:

1) It is possible to integrate multiple packages, such as Flint, Arb, Singular, Hecke, Antic (the C library -- not the ANTIC cornerstone mentioned above) and so on, in an efficient way, using the Julia programming language. And we've had some early success; already, we have demonstrated that it is possible to build Singular polynomial rings over coefficient rings provided by Nemo itself, or by Antic or Hecke, etc.

2) It is possible to develop extremely efficient implementations of generic algorithms over generic rings (rather than the specific rings that we have highly optimised implementations for, e.g. in the Flint system). This is possible in the Julia language, due to its innovative type system and JIT (Just-In-Time) compiler.

(It should be noted that Polymake has a very sophisticated form of runtime specialisation, and what may be thought of as an early kind of JIT compiler.)

I personally believe that high level algorithms implemented in an old-style interpreter are sometimes not enough to get a practical implementation for some research questions. Sometimes, the difference between a practical implementation and failure, is not bottlenecks in low level C code, but generic algorithms implemented in a high level, interpreted language. But unfortunately, we don't have enough skilled, low-level developers to reimplement such things in a low-level language every time they appear!

So one of the things I am particularly interested in, is applications where a "mid-level" language like Julia can be used to speed up generic algorithms. This has been one of the ideas behind my work on Nemo, so far, and the reason why I chose to develop it in the Julia programming language. Julia facilitates high-level constructs, with low-level performance.

Already, on top of Nemo, Prof. Claus Fieker and Dr. Tommy Hofmann, have built the Hecke system, also written in the Julia programming language, for doing computations in orders of number fields and class group computations. This is a culmination of efforts of Claus Fieker's project under the Priority Project, which I mentioned above.

What is the ANTIC cornerstone?

So with all of that background in mind, I can now describe the ANTIC cornerstone of the Central Software Project. ANTIC is the internal name we have had for the combination of GMP/MPIR, MPFR, Flint, Antic, Arb, Nemo and Hecke, that we have been busy with integrating and in some cases, developing under the Priority Programme. It will now become one of the four cornerstones of the Central Software Project, and will essentially be responsible for providing algebraic number fields and related things to the new Computer Algebra System, OSCAR.

We now begin the task of doing our part to help with the integration of ANTIC with the other cornerstones of the OSCAR system, namely Singular, Polymake and GAP.

There are already preexisting and independent projects underway to integrate Flint and Antic (the C library) into GAP as packages, which we aim to support in whatever way we can.

And within our collaboration, Michael Joswig has already suggested projects to me that will require integration of Antic with Polymake, and work is already quite advanced in the direction of integrating Singular and Nemo/Hecke. And there have already been extensive discussions about future integration of ANTIC with GAP, probably via Nemo and/or Hecke.

The ANTIC project achieves its integration in, and some of its generic algorithms are directly implemented in, the Julia programming language. We aim to contribute the expertise we have been developing here in Kaiserslautern, in the Julia language, to the wider integration of the four cornerstones in OSCAR.

Will you join us?

Obviously OSCAR is a massive undertaking. We hope that the mathematics community and Computer Algebra communities at large will support our efforts.

OSCAR is an Open Source project, and we hope that it will attract many contributors over the years and that it will support many Masters and PhD projects!

The really important thing is that OSCAR is motivated by a large number of serious mathematical projects across computer algebra. These applications will drive development of the project and I believe, lead to a high quality, well-tested project, leading to many substantial contributions and publications for a long time to come.

Join us if you are interested. In coming days we will very quickly be setting up websites and infrastructure. In the meantime, feel free to contact me or any of the other people you know who are involved with this project.

And join us in celebrating what is a fantastic investment in Open Source computer algebra, that we hope will be of great benefit for the mathematical community at large!

Further Information

I hope to keep people up-to-date with what I'm personally working on, through this blog. I'll also link to other press releases, blogs and articles about OSCAR and our grant, as I become aware of them. So follow this blog if you are interested in hearing more as things develop.

Here is the University of Kaiserslautern press release (German).

Tuesday, 22 May 2012

Reduced Binary Quadratic Forms

Over the past few weeks I've been writing some code for computing reduced binary quadratic forms:

(a, b, c) := ax^2 + bxy + cy^2

The discriminant of a form is d = b^2 - 4ac. The code I wrote works for d < 0.

Such a form is reduced if |b| <= a <= c and b >= 0 if a = c or a = |b|.

You can see the code here:

There are two versions of the function qfb_reduced_forms.

Version 1

The first version works by iterating over all possible values "a" (it's a theorem that |b| <= a <= sqrt(|d|/3) and c <= (1 - d)/4) and searching for valid triples (a, b, c).

To find valid triples, we note that if d = b^2 - 4ac then d = b^2 mod 4a. So we simply solve this quadratic congruence and search for all roots "b" in the correct range.

To speed things up, we factorise all the different possible values "a" by sieving. This makes the square root code much faster, as it doesn't have to factorise 4a before computing square roots.

This approach requires softly O(|d|^{1/4}) primes. There are about O(|d|^{1/2}) forms on average, and indeed the run time is softly O(|d|^{1/2}) (subject to the Generalised Riemann Hypothesis -- needed to guarantee we can find quadratic nonresidues quickly enough for the Tonelli-Shanks modular square root algorithm).
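
For reference, a naive enumeration of the reduced forms (far simpler and slower than either of the two versions described in this post, since it does no sieving at all and just tests each candidate triple directly) looks like this in Julia:

    function reduced_forms(d::Int)                   # d < 0, d ≡ 0 or 1 (mod 4)
        forms = Tuple{Int,Int,Int}[]
        for a in 1:isqrt(div(-d, 3))
            for b in -a:a
                (b^2 - d) % (4a) == 0 || continue
                c = div(b^2 - d, 4a)
                a <= c || continue                              # need |b| <= a <= c
                (a == c || a == abs(b)) && b < 0 && continue    # b >= 0 in the boundary cases
                push!(forms, (a, b, c))
            end
        end
        forms
    end

    reduced_forms(-23)     # [(1, 1, 6), (2, -1, 3), (2, 1, 3)]; the class number h(-23) is 3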

Version 2

The second approach iterates over all possible values |b| (the same bound applies as for "a").

This time we factorise (b^2 - d)/4 for each "b" into products "ac". Again, we do this by sieving (this time quadratic sieving, which first finds square roots of "d" mod various primes "p").

However, we must sieve with primes "p" dividing values (b^2 - d)/4. In other words we need softly O(|d|^{1/2}) primes. On a 64 bit machine, this exhausts the precomputed primes at about |d| = 100000000, so this technique is a bit limited.

The advantage of this technique is that it is about 50% faster as implemented. It still takes time softly O(|d|^{1/2}) though (again subject to GRH).


Magma computes reduced binary quadratic forms, so I did two comparisons:

Comparison A) Compute all reduced forms for discriminants 0 > d > -1000000.

Comparison B) Compute all reduced forms for discriminants -1000000000 > d > -1000000100.

Here are the timings:

Comparison    flint/qfb    Magma
A             118s         947s
B             1s           120s

Sunday, 23 October 2011

BSDNT - interlude

You will notice that I have not worked on BSDNT for about a year now. Well, I'm thinking of restarting the project soon. I did complete two new revisions v0.25 and v0.26 since I stopped blogging. The first of these added random functions for a single word. They generate single words which have an increased probability of triggering corner cases, e.g. by a sparse binary representation. The second of these updates is a bsdnt_printf function. This is like printf but adds a %w format specifier for printing single words. There is also a %m for printing a len_t and %b for printing a bits_t.

I likely won't get much more done on BSDNT until early next year, but here is what I am planning:

1) I am tremendously grateful to Dr. Brian Gladman for his work on an MSVC version of the library. However, I started to struggle to keep up with this side of things more than I expected. Microsoft's MSVC doesn't support inline assembler in 64 bit x86. This means the entire plan of the MSVC version has to be different. It seems like far too much effort to combine both sets of code into a single library. I've therefore decided (sorry Brian) to ditch the MSVC code from my copy of BSDNT.

2) I wasn't happy with the interface of the division code. There are a few issues to consider.

The first issue is chaining. Obviously carry-in occurs at the left rather than the right. But for general division functions, should the carry-in be a single limb or multiple limbs? It seems like the remainder after division is going to be m limbs and so the carry-in should be also. It is not clear what is better here. Internally, the algorithms deal with just a single carry-in limb because they use 2x1 divisions to compute the quotient digits. Perhaps chaining just means that we consider the first digit of the remainder to be carry-in for the next division.

Another problem associated with this is that when reading the carry-in from the array, if the carry-in happens to be zero then the array entry may not exist in memory. This means the code has to always check if the carry-in should be zero or not before proceeding.

The other issue to consider is shifting for normalisation. One assumes that the precomputed inverse is computed from a normalised value (the first couple of limbs of the divisor). Now, it is not necessary to shift either dividend or divisor ahead of time. One can still perform the subtractions that occur in division, on unshifted values. One does need to shift limbs of the dividend in turn however, as the algorithm proceeds, in order to compute the limbs of the quotient. But this shifting can occur in registers and need not be written out anywhere. This is implemented, but currently every limb gets shifted twice. This can be cut down to a single shift.

3) The PRNGs are currently quite hard to read. They have numerous macros to access their context objects. They are extremely flexible, but possibly overengineered. I'd like to simplify their implementations somewhat.

4) The configure script is a little overengineered. The idea of supporting lots of compilers is nice. But in reality GCC should exist almost everywhere. The original concept of BSDNT was to use inline assembly for architecture support. This gets around issues with global symbol prefixes and wotnot. It also makes the library really simple to read. Even on Windows 64 there is MinGW64 and this is the only setup I aim to target in that direction.

I hope to deal with all of these issues before proceeding with development of BSDNT. Give me some time as I am busy until about the end of the year. However, I do plan to continue development of BSDNT after sorting out these issues, because I think that fundamentally what we have is very solid.