Sunday, 9 July 2017

Update on fast multivariate polynomial arithmetic in Flint

Fast multivariate polynomial arithmetic in Flint


It's time for me to update everyone on the implementation of fast multivariates in Flint. We are working on this as part of the OpenDreamKit project, funded by the EU. The eventual aim is to provide a robust, extremely fast implementation of multivariate polynomial arithmetic and GCD that:

  • is parallelised (hopefully scaling linearly with number of cores)
  • supports basic monomial orderings
  • supports individual exponents up to 64 bits and exponent vectors of arbitrary size
  • provides multiplication, exact division, divisibility testing, division with remainder and GCD
  • is open source
  • is highly optimised for Z, Q and Z/pZ

What else already exists?


But this is a solved problem, right? There's heaps of libraries that already do this, aren't there?

Indeed there are lots of cool libraries and proprietary software. Let's summarise each of them:

Singular

Singular is a Computer Algebra System for polynomial systems, developed here in Kaiserslautern. It has sparse multivariate polynomial arithmetic over various fields, supporting exponents from 2 to 64 bits (no smooth promotion for the user doing polynomial arithmetic, but internally in Groebner bases promotion happens optimally and automatically) and in practice, any number of variables (up to about 2^15).

But Singular's polynomial representation is optimised for Groebner bases (GBs). In particular, GBs require that one be able to get the leading monomial of a polynomial very fast and to be able to multiply polynomials by monomials extremely fast. Whilst a basic Buchberger GB implementation can benefit from fast basic arithmetic, most of the really efficient GB algorithms cannot. And it is not enough to be able to return the value of the leading monomial. One really needs to be able to return it without copying any data. For this purpose, Singular uses a linked list implementation of (sparse distributed) multivariate polynomials. Of course this makes Singular very slow for ordinary multivariate arithmetic.

Naturally, Singular's multivariate arithmetic is not parallelised.

Another issue with Singular is that its multivariate arithmetic is only "fast" if you don't supply a module ordering. But if you don't supply one, you can't use the polynomials you generate in most of Singular's higher level functions. In other words, the implementation is very highly tied to Singular's domain of specialty.

Some of Singular's multivariate functionality is supplemented by Factory, which we'll describe next.

Factory

Factory is a library distributed with Singular, for multivariate polynomial arithmetic, GCD and factorisation. Martin Lee put a lot of time into Factory over the past few years, to speed it up, in particular switching to Flint for univariate polynomial arithmetic and speeding up factorisation routines.

Factory still uses a sparse representation for polynomials, but optimised more for arithmetic. I believe it uses a sparse recursive representation, which is a logical choice for such a library.

Some operations in Factory are parallelised, basically because those operations are parallelised in Flint. But basic multivariate arithmetic is not parallel.

In the long run, the work that we are currently doing for ODK will possibly replace the core of Factory. I'm sorry that I don't have explicit timings for Factory below. I may include them in future.

Magma

This is a closed source Computer Algebra System. Based on my timings, I now believe that Magma does certain critical operations in a heuristic way, so I no longer trust it. But still, this is the system to beat if possible. It has for a long time provided exceptionally fast basic arithmetic. Allan Steel in particular has done an amazing job with core arithmetic in Magma. Unfortunately his ideas are often not published, and he doesn't seem to find time to reply to emails asking about his algorithms. An open code base (even if not Open Source) would be highly preferable!

Magma supports exponents up to about 32 bits, but its basic multivariate arithmetic is not parallelised. Some operations seem to be missing (or I couldn't find them documented). But it generally has very good coverage. Magma is good at both GBs and basic arithmetic.

Giac

Giac is part of the Xcas computer algebra system, which is widely used in schools in France. Moreover, it is Open Source!

Bernard Parisse has done a magnificent job speeding up multivariate arithmetic in Giac. It has also been really fantastic to have email exchanges with Bernard and exchange ideas.

Giac supports exponents up to 16 bits, but as far as I'm aware, only supports lexical orderings. The front end is symbolic and a conversion is done to a raw, internal polynomial format for doing polynomial arithmetic. One can change the priority of variables, but degree orderings and reverse orderings can only be supported artificially.

Since Flint currently only allows the order of variables to be reversed and not the order of comparisons (as required for true reverse lexical orderings), I decided to blog about our timings now, since comparison with Giac is actually fair. Properly supporting reverse lexical orderings (work in progress right now), will very slightly slow down our benchmarks.

Giac is parallelised and gives a nice speedup with multiple threads. It also has really fast multivariate GCD and factorisation. 

Piranha

Piranha is another Open Source library, written in C++. It makes heavy use of templating and is insanely fast. It is also parallelised and scales nearly linearly with the number of cores. Its maintainer, Francesco Biscani is also really talkative by email and has lots of great ideas.

Piranha has a reputation for being fast, and it is well-deserved. However, I don't include it in timings below since Piranha uses a hashed representation for polynomials, which means that it doesn't support things like division of multivariate polynomials (it's really hard to get the leading term of a polynomial from a hash table!)

For those interested to know, I think we are beating Piranha on multiplication of dense polynomials, but it beats us by a factor of 3 or so on the sparse benchmarks. This is to be expected given their representation. We are thinking of also adding a "hash" ordering for cases where division is not required, so that we can compete with Piranha on that benchmark.

Piranha is an amazing project. It supports arithmetic and symbolic operations. It even supports Poisson and divisor series, which are presumably important in applications in Celestial mechanics. 

Piranha also has very limited range for its packed monomial representation, which is fine for some applications, but not useful for our purposes. It is still under development though, and I expect some of its limitations will eventually be lifted.

Sdmp

Sdmp is the library developed by Michael Monagan and Roman Pearce for fast polynomial arithmetic in Maple. It's closed source, but its authors regularly publish papers describing their (revolutionary) ideas. Unfortunately, I only have access to the versions of these papers on their website and have had trouble implementing some of their algorithms as given in their papers. The fixes seem to be nontrivial and have consumed a lot of my time. An open code base (even if not Open Source) would be highly preferable!

I do not provide timings of Sdmp below, since I do not have access to it. I believe we are probably beating Sdmp on the dense benchmarks, but maybe 10-15% slower on the sparse benchmarks, but this is purely conjecture based on various bits of information I have.

Of course, Sdmp is parallelised, and scales well with the number of cores up to about 16 cores.

Trip

Trip has a reputation for being extremely fast, represents decades of work, and should be hard to compete with especially on multiple cores, as it scales extremely well. It is closed source, though occasionally papers are published describing the algorithms in very high level detail, and giving timings.

I don't have a full copy of Trip, but the authors confirm that the academic version is full speed and can be used for timings for academic purposes. I should be able to include timings for functionality available in the academic version of Trip at some point in the future. I give just a couple of data points in the comments below, but want to get familiar with the system before putting up official timings.

Pari/GP

Pari/GP is Open Source and implements something like a recursive dense or even recursive sparse format, depending on how you define it. Unfortunately I can't time it, since it isn't capable of the benchmarks I'm doing, due to the memory usage of their representation.

What Pari/GP is doing is usually fine for Number Theory, where things tend to be pretty dense. But it isn't trying to solve the same problem we are trying to solve, which is a really fast, flexible system for multivariate arithmetic in both the sparse and dense cases.

SageMath

Initially I thought it was pointless including SageMath since I assumed it just used Singular. However, on my machine, Sage is actually slower than Singular, and so I decided to include timings for that, at least for the cases that finished in under two hours.

Presumably the Sage developers didn't know about certain operations in Factory (or they weren't available at the time), since some Sage timings are really bad. I don't know what they do instead. Sage is Open Source, but there'd be more motivation to look up the source code if it was faster than everyone else, instead of slower.

As the work we are doing is part of OpenDreamKit, it is anticipated that Sage will eventually benefit greatly from the work we are doing.

Update on Flint multivariate polynomial arithmetic


Over the past year, I've spent a lot of time focusing on an implementation of multivariate arithmetic over the integers. Of course, one needs to be a little careful to describe what one actually means by division with remainder of multivariate polynomials over Z. But the idea has been to get a decent implementation over Z that can be easily modified for multivariate polynomials over Z/pZ and Q.

We currently have single core code only, for multiplication, divisibility testing, quotient only, quotient with remainder and reduction by an ideal (specified as an array of generators), plus a bunch of simpler operations, of course.

The code supports exponents up to 64 bits (actually 63 bits, since we use the top bit to detect overflows), with automatic packing of monomials where possible. The number of words in an exponent vector is essentially unlimited.

We support lexical and degree lexical orderings, with degree reverse lexical ordering not far away (we currently support reversing the order of variables, just not the order of comparisons).

The big news for this blog post is that it all now passes very strong test code, and I've documented the interface. It should be merged into flint/trunk soon.

The other exciting news is we've hired Daniel Schultz to work on the project for OpenDreamKit, for two and a half years. He will primarily be focusing on parallelising the code, though he'll no doubt work on many other things as well, such as a hash ordering, reverse lexical orderings, GCD and so on.

One major thing we are still missing is quasidivision. There are two kinds required: one that allows division over Q using only integer arithmetic, and the other that allows division over the fraction field of a multivariate polynomial ring using only polynomial arithmetic over the integers. Neither of these should be difficult to implement now that we have the basics working.

A puzzle


So far we can't get down to the same time as Magma for divisibility testing of multivariate polynomials. This is a critical operation for various fast multivariate GCD algorithms. Magma seems to be able to do this 200 times faster than it can do exact quotient!

I've asked a lot of smart people about this, and no one has an idea how it is possible to do this so fast. Moreover, Magma curiously takes about the same time as a heuristic test. There's really not enough time to do anything other than a few evaluations at some small values, and even then you have to do some tricks to pull that off!

I've written to Allan Steel to ask how this is done, but he hasn't replied. At this stage I strongly suspect Magma is doing a heuristic test only. Since the source code is not open for me to examine, and no one published this amazing algorithm, I am free to assume what I want about it.

Because I don't believe Magma is doing the right thing here, I've asked it to provide me with the exact quotient when doing divisibility testing (which just happens to be what our code does anyway). This makes the Magma times more realistic.

Benchmarks


And now for the benchmarks of all the systems I have on one machine (it's nearly 7 years old), with the exceptions noted above.

In each case there are "dense" and "sparse" benchmarks. Note that we don't include really dense or really sparse benchmarks. In the really dense case, you'd probably want to use an FFT based algorithm. Our code is probably more suited to the really sparse case. But it does well in the sparse case anyway (if we don't compare against a hash based approach), so that's what we've chosen to use for the benchmarks.

Timings that took longer than 2 hours are marked with a dash.

In some cases, systems don't provide the exact operation required, but I've allowed division with remainder to be substituted when that is the case.

The coefficient ring is Z for all operations, though Giac uses Q and internally optimises when working over Z. The benchmarks are picked to not introduce denominators.

Multiplication (dense Fateman):

  • f = (1 + x + y + z + t)^n
  • time p = f*(f + 1)

Multiplication (sparse Pearce):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • time p = f*g

Quotient only division (dense):

  • f = (1 + x + y + z + t)^n
  • p = f*(f + 1) + x^(n - 3)
  • time q = p/f discard remainder

Quotient only division (sparse):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • p = f*g + x^(n - 3)
  • time q = p/f discard remainder

Yes, exact quotient can be much faster than multiplication!!

N.B: due to an accident of interfacing, previously published Flint times for this benchmark were using reflected lexicograhical ordering, instead of lexicographical ordering. This has now been corrected.

Divisibility test with quotient (dense):

  • f = (1 + x + y + z + t)^n
  • p = f*(f + 1)
  • time flag, q = divides(p, f)
  • flag indicates whether division was exact, in which case q is the quotient
  • (division with remainder can be substituted for this operation)

Divisibility test with quotient (sparse):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • p = f*g
  • time flag, q = divides(p, f)
  • flag indicates whether division was exact, in which case q is the quotient
  • (division with remainder can be substituted for this operation)

12 comments:

  1. Hello Bill, it is likely you used the symbolic ring in Sage, not the dedicated polynomial rings. For example with the "Divisibility test with quotient (sparse)" and n=6 I get:

    sage: R.=PolynomialRing(ZZ)
    sage: f = (1 + x + y + 2*z^2 + 3*t^3 + 5*u^5)^6
    sage: g = (1 + u + t + 2*z^2 + 3*y^3 + 5*x^5)^6
    sage: p = f*g
    sage: %timeit f.divides(p)
    10 loops, best of 3: 110 ms per loop

    or multiplication:

    sage: R.=PolynomialRing(ZZ)
    sage: f = (1 + x + y + 2*z^2 + 3*t^3 + 5*u^5)^6
    sage: g = (1 + u + t + 2*z^2 + 3*y^3 + 5*x^5)^6
    sage: %timeit _ = f*g
    10 loops, best of 3: 90.3 ms per loop

    ReplyDelete
  2. I used R. = PolynomialRing(ZZ, 4)

    Is this not the dedicated polynomial rings?

    ReplyDelete
  3. Ah, I see this blog doesn't like the Sage code format. Of course I had angled brackets and four variables in the brackets. But divides doesn't compute the quotient, so I can't use that for the benchmark I have given. The other timings you give are consistent with what I have. Your machine is a little faster than mine.

    ReplyDelete
  4. Impressive timings!

    I'd be curious to see a similar comparison over Z/pZ, say with p=2,p=3, p~2^30, p~2^60. This would presumably also let you test with larger values of n in a reasonable amount of time.

    It would also be interesting to see tests involving different numbers of variables (e.g. 1,2,4,8,16).

    ReplyDelete
  5. We'll keep that in mind for the next iteration, Andrew. We currently haven't got code over Z/pZ, and it might be interesting to do comparisons with especially larger numbers of variables.

    ReplyDelete
  6. Oh, I should add that the coefficients don't get very large in these examples. It's the number of terms in the result that is the main reason for the long periods of time. For example, in the sparse case there are over 28 million terms in the product polynomials.

    ReplyDelete
  7. it would be good to include links to packages you are talking about, at least more exotic ones. E.g. Trip is very hard to google...

    ReplyDelete
  8. Here you go:

    http://www.imcce.fr/fr/presentation/equipes/ASD/trip/trip.html

    Note that some limited version of Trip is available for download. But I don't know what "limited" means. I'd hesitate to publish timings based on that. Also, I don't know if publishing benchmarks is considered "academic research".

    ReplyDelete
    Replies
    1. I have written to one of the authors of Trip to ask permission to use the academic version for timings, and to see if this is the latest version. But I don't feel comfortable publishing timings for it at the moment, since I don't know how to use it. Perhaps in a later blog I will do so. It's not even clear if it will run on my machine, so we will see.

      Delete
    2. Mickael Gastineau said it is fine to use Trip for timings. Trip 1.4.0 works fine on my machine. I already made some silly mistakes trying to time things, so I will add full timing results at a later date when I am more familiar with the system.

      For now, here are the Trip-1.4.0 timing results on the same machine as the above timings:

      Dense Multiplication: n = 30: 50s
      Sparse Multiplication: n = 16: 37s

      However, Trip has multiple different representations and the right one should be selected for the particular problem, so these timings are unofficial until I've had time to check everything carefully.

      Note that Trip scales very well with the number of cores, so will be quite hard to beat on parallel timings I believe.

      Delete
  9. So, representation matters. Using Trip-1.4.0's POLPV (one of their dense representations) the dense timing becomes:

    Dense Multiplication: n = 30: 41s

    The best representation for the sparse case seems to be the POLYV representation which I originally used, so that timing remains:

    Sparse Multiplication: n = 16: 37s

    ReplyDelete
  10. Bernard Parisse helped me track down why the sparse quotient only times for Flint looked so ridiculously low. In fact, due to an interfacing issue between Nemo and Flint, I was inadvertently using reflected lexicographical ordering instead of true lexicographical ordering, which affected this benchmark. The timings have been corrected above, and indeed, the Flint timings now look much more reasonable.

    ReplyDelete