MPIR is an open source project based on the GNU Multi Precision (GMP, see http://www.gmplib.org/) library, but still licensed under version 2 of the LGPL.
About a month ago version GMP 4.3.0 was released, which they had been preparing for a LONG time. We expected some nice features, and found some, which we have been subsequently catching up with.
In particular we needed:
* Faster assembly code for multiplication basecase
* Faster unbalanced integer multiplication (where you are multiplying integers of different sizes)
* Improvements to the speed of multiplying medium sized integers (50-2000 words where 1 word = 2^64 on a 64 bit machine)
* Asymptotically fast extended GCD
* Faster integer multiplication for large integers (> 2000 limbs)
* Faster integer squaring
* Other assembly improvements
Jason Moxham, a brilliant MPIR developer from the UK decided to take on the mul_basecase challenge. He's been writing an assembly optimiser for quite a few months. It takes hand written assembly code and reorganises the instructions over and over, within permitted bounds, to try and find an optimal sequence.
The results over the past year are pretty impressive to see:
|Multiplications per second|
|128 x 128 bits||512 x 512 bits||8192 x 8192 bits|
Note that the number of multiplications that can be done per second has more than doubled in most cases since last year, and all this just from assembly language improvements.
The timings for this table were made on an Opteron (AMD K8) server, running at 2.8 GHz.
Toom multiplication can be described in terms of a polynomial multiplication problem. Firstly the large integers to be multiplied are split apart into pieces, which form the coefficients of polynomials. Then the original integer multiplication problem becomes one of polynomial multiplication, then a reconstruction phase at the end, where the polynomial coefficients of the product are stitched together to make the product integer.
This can be thought of in terms of writing the original integer in terms of some large base, 2^M, where M is usually a multiple of the machine word size (e.g. M = 64B on a 64 bit machine).
For example, Toom-3 breaks the two integers into 3 pieces each, corresponding to the digits in base 2^M:
A = a0 + a1 * 2^M + a2 * 2^2M
B = b0 + b1 * 2^M + b2 * 2^2M
So we construct two polynomials:
f(x) = a0 + a1 * x + a2 * x^2
g(x) = b0 + b1 * x + b2 * x^2
Then A * B = f(2^M) * g(2^M). So we first compute h(x) = f(x) * g(x), and then A * B = h(2^M).
To multiply the polynomials f(x) and g(x), we note that their product will be degree 4, and so we can determine it fully if we know its value at 5 independent points. We choose, for convenience, the points 0, 1, -1, 2, infinity. Finally we note that h(0) = f(0) * g(0), h(1) = f(1) * g(1), etc.
Thus to compute the product we only need to find the values of f(x) and g(x) at the chosen points, do five small multiplications to get the value of h(x) at those points, then interpolate to get the coefficients of h(x).
So we have replaced the original large multiplication with five small ones. Note that each of the small multiplications involves integers at most one third of the size. If we just used schoolboy multiplication to multiply the "digits" of our large integers, we'd do 3 x 3 = 9 small "digit multiplications". Instead, through the magic of evaluation/interpolation we only have 5 small "digit multiplications" to do (and a few additions and subtractions, etc., for the evaluation and interpolation phases).
The basecase multiplication code already handles unbalanced multiplication, so I decided to focus on unbalanced variants of Toom multiplication algorithms.
In MPIR we use Toom-2 (also known as Karatsuba multiplication - though we use a variant called Knuth multiplication where evaluation happens at 0, -1 and infinity), Toom-3, Toom-4 and Toom-7.
I spent a fair bit of time optimising Toom-3. As we have a lot of new assembly instructions available in MPIR, I was actually able to get the interpolation sequence used down from about 11 to 8 steps in MPIR 1.2. I also discovered that it was possible to use the output integer for temporary space. If one is careful, one can even set it up so that the temporaries stored in the output space don't have to be moved at the end, i.e. they are in precisely the right place as part of the output integer at the end of the algorithm.
I made similar optimisations for Toom-4 and Toom-7 and I also switched the interpolation phase of the algorithms over to twos complement.
Here are the results of this work in the Toom range:
The differences in timing in the karatsuba region for MPIR give an indication of how much of a speedup is occurring on account of better assembly code. Anything beyond that indicates a speedup in the Toom algorithms themselves.
Note GMP 4.2.1 had Karatsuba and Toom-3 only, GMP 4.3.1 does not have Toom-7.
Now unbalanced multiplication proceeds in the same way, except that we have integers, and thus polynomials, of different length:
f(x) = a0 + a1 * x + a2 * x^2 + a3 * x^3
g(x) = b0 + b1 * x
Again the product is degree four and so we need to interpolate at five points, which we can choose to be precisely the same points that we used for Toom-3, even reusing the interpolation code, if we wish.
We call this Toom-42, denoting that we split our integers into 4 and 2 "digits" respectively, in our base 2^M.
The result is 5 small multiplications instead of 4 x 2 = 8; still a substantial saving.
Finally I implemented some of the unbalanced variants. In particular we now have Toom-42, an unbalanced version of Toom-33 (where the top coefficients are not necessarily exactly the same size) and Toom-32.
The results for unbalanced multiplications in the Toom range are now quite good:
|15000 x 10000 bits||20000 x 10000 bits||30000 x 10000 bits|
Timings are on the 2.4GHz Core 2 (eno).
Squaring using the Toom algorithms is much the same, except that there is no need to evaluate the same polynomial twice. We also save because the pointwise multiplications are now also squares and we can recurse, right down to a basecase squaring case.
All of the same optimisations can be applied for Toom squaring as for ordinary Toom multiplication. Here are the comparisons:
|Kara (5120 bits)||Toom3 (12800 bits)||Toom4 (51200 bits)||Toom7 (131072 bits)|
The performance is now comparable to GMP until the Toom7 region, where we pull ahead considerably. These timings were done on the 2.8GHz Opteron (K8) server.
For multiplication of large integers, MPIR uses a Fast Fourier Transform (FFT) method. Instead of working over the complex numbers, one can work over a ring Z/pZ where p is a special prime, e.g. p = 2^M + 1 where M is some power of 2. This trick is due to Schonhage and Strassen.
For MPIR 1.2 we decided to switch over to the new FFT of Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann. An implementation of ideas they presented at ISAAC 2007 was available to download from Paul and Alex's websites.
Most of the work in merging this FFT was removing bugs, making the code work efficiently on Windows and writing and running tuning code for all the platforms we support.
I wrote the tuning code, Brian Gladman discovered some primitives to use on Windows which replace inline assembler available on Linux and Jason Moxham, Brian Gladman, Jeff Gilchrist and myself tuned the FFT on a range of systems.
Here are some examples of the speedup:
|2000000 bits||6000000 bits||20000000 bits||64000000 bits|
Timings are again on the 2.8GHz Opteron server.
A longstanding issue with MPIR has been the lack of fast extended GCD. We had merged Niels Mollers asymptotically fast GCD code, but unfortunately no suitable extended GCD implementation was available to us.
I began by reading Niels' paper to understand the half-GCD algorithm (ngcd) that he invented. Essentially half-GCD algorithms get their asymptotic improvement over the ordinary Euclidean algorithm as follows:
Begin by noting that the first few steps only depend on the uppermost bits of the original integers. Thus instead of working to full precision, one can split off the topmost bits and compute the first few steps of the Euclidean GCD on those bits. One keeps track of the steps taken in matrix form. One then has an exact representation for the steps taken so far in the algorithm.
Next one applies the matrix to the original integers. The integers that result are smaller than the original integers. One then finishes off the algorithm by applying the usual steps of the GCD algorithm to the smaller integers.
Of course to get an asymptotic improvement one needs to take care how to apply the matrix and to recurse the whole algorithm down to some fast basecase.
Once I understood that Niels' implementation explicitly computed the matrix at each step, it occurred to me that in order to turn it into an extended GCD algorithm, all I had to do was apply the matrix to the cofactors and keep track of any other changes that would affect the cofactors, throughout the algorithm.
As an example of the speedup, for 1048576 bit integers extended GCD benches as follows:
MPIR 1.1.2: 0.453
MPIR 1.2.0: 4.06
Again the benchmarks are for the 2.8GHz Opteron.
Finally this mammoth MPIR release also contained numerous sped up assembly routines due to Jason Moxham for AMD Opterons (which usually also translate to improvements for other 64 bit x86 platforms).
Below I include timings (smaller is better) at each point where improvements have been made in the assembly routines in MPIR.
|MPIR 0.9||MPIR 1.0||MPIR 1.1.2||MPIR 1.2|
This time timings are taken on a 2.6GHz AMD K10 box.
Numerous new assembly functions have also been added for 64 bit x86 machines since MPIR began, including: mpn_addmul_2, mpn_addadd_n, mpn_sublsh1_n, mpn_divexact_byff, mpn_rsh1add_n, mpn_lshift1, mpn_rshift1, mpn_rsh1sub_n, mpn_mul_2, mpn_lshift2, mpn_rshift2.