Sunday 21 June 2009

Stacks and Elliptic Cohomology

This week I became interested in two different topics due to conversations that I overheard. The first is the topic of stacks and the second is elliptic cohomology.

Stacks

Apparently there are numerous different kinds of stacks - Deligne-Mumford stacks, Artin stacks and, for the die-hard, apparently more general kinds of stacks still.

The following is probably completely wrong, but is my understanding of what stacks are about.

Consider elliptic curves defined over the complex numbers K = C say. It is a classical result that up to isomorphism, these can be parameterised by points in the complex upper half plane, modulo the action of the modular group.

Now the upper half plane, modulo the modular group can be compactified by adding a point at the cusp, and made into a Riemann surface (of genus zero in this case). We can put coordinates on this Riemann surface (the complex j-function as it happens) and turn it into an algebraic curve of genus 0.

In fact, two elliptic curves are isomorphic iff their j-invariants are equal. In other words the Riemann sphere, or j-line as it is often called, can be thought of as classifying all elliptic curves. In fact we call C the coarse moduli space for elliptic curves.
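For concreteness, the standard formula: if E is given in short Weierstrass form E : y^2 = x^3 + ax + b, then

j(E) = 1728 * 4a^3 / (4a^3 + 27b^2)

so the j-invariant really is directly computable from a defining equation.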

Now suppose we try to construct a universal space for families of elliptic curves over this base space. The problem is that an elliptic curve can have extra automorphisms, and it is possible for a family of elliptic curves to contain isomorphic elliptic curves for this reason. That prevents us from having a universal space for families of elliptic curves.

The way we get around this is using stacks. We define the stack of elliptic curves to be a category whose objects are families of elliptic curves, each over a base space (fixed for that family). A morphism is a map between families of elliptic curves together with a map between the corresponding base spaces, such that the map between families is compatible with the map between base spaces.

Furthermore, for it to be a morphism (X'->B') -> (X->B), we require that if we pull the family of elliptic curves X back along the map B'->B of base spaces, we get a family of elliptic curves isomorphic to the family X'.
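In pictures, a morphism is a commutative square

X' ----> X
 |       |
 v       v
B' ----> B

in which the pullback condition says precisely that X' \cong B' \times_B X as families over B'.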

We can restrict to the subcategory of families of elliptic curves over a fixed base space B if we want. We call this subcategory the fibre over B.

Now note that the fibre over a base space B is a groupoid (a category whose only morphisms are isomorphisms). We say that the stack of elliptic curves is a category fibred in groupoids.

Now it is clear that there is a universal family of elliptic curves with respect to this construction.

There's more to stacks than this (some of the critical components of the definition are omitted above).

A stack is of Deligne-Mumford type (formerly just called an algebraic stack) if it satisfies some additional conditions, in particular that there is an etale surjective morphism (called an atlas) from a scheme U to the stack F.

An Artin stack (nowadays what is referred to as an algebraic stack) simply replaces etale with smooth in the previous definition.

Anyhow, what was interesting to me is that nowadays stacks are replacing schemes as the ultimate objects of interest. A lot of work has been done to popularise them. Hey, if you want to know more, there's only 1000 pages to read: The Stacks Project.

Elliptic cohomology

The second topic which piqued my interest this week was elliptic cohomology. I thought maybe this might be related to parabolic cohomology, which is defined in terms of parabolic cusps (fixed points of parabolic elements of SL_2(R)). But I don't know that this is the origin of the term.

Instead I found an enormous survey on the web, which is helpfully written.
That's a long document to be called a summary, so I'll give a summary of the summary.

To a topological space X we can associate the singular cohomology groups A^n(X) = H^n(X; Z), which can be characterised by the Eilenberg-Steenrod axioms: any collection of functors and connecting maps satisfying these axioms necessarily gives you the usual relative cohomology functors (Y \subseteq X) -> H^n(X, Y; Z). More generally we can replace Z with any abelian group M.

Now if we drop the last of the E-S axioms:

* If X is a point then A^n(X) = { 0 if n \neq 0, Z if n = 0 }

then we get something more general, called a cohomology theory.

Interestingly, complex K theory is an example of such a cohomology theory!

Complex K-theory is a so-called multiplicative cohomology theory, because A^*(X) is a graded commutative ring. Another nice feature is that it is periodic:

* According to the Bott periodicity theorem for complex K-theory, there is an element \beta in K^2(*) such that multiplication by \beta induces an isomorphism \beta : K^n(X) -> K^{n+2}(X)

and even:

* K^i(*) = 0 if i is odd.

Ordinary cohomology H^*(X; R), for R a commutative ring, is even, but to make it periodic we need to take products over every second cohomology group: A^n(X) = \prod_k H^{n+2k}(X; R)

Now when A is an even periodic cohomology theory, it turns out that A(CP^\infty), the cohomology of infinite dimensional complex projective space, is isomorphic to a formal power series ring A(*)[[t]] over the commutative ring A(*).

We can view the element t as the first Chern class of the universal line bundle O(1). The space CP^\infty is a classifying space for complex line bundles, i.e. for any complex line bundle L on a space X there is a (classifying) map \phi : X -> CP^\infty and an isomorphism L <-> \phi*(O(1)).

We then define c1(L) = \phi*(t) \in A(X), the first Chern class of L with respect to the cohomology theory A.

In ordinary cohomology the first Chern class of a tensor product of line bundles is simply the sum of the first Chern classes of the line bundles.

In the case of complex K-theory line bundles L can be thought of as representatives of elements of K(X) itself. We write such an element [L]. Then c1(L) = [L] - 1.

Now c1(L1 \tensor L2) = c1(L1) + c1(L2) + c1(L1)c1(L2).
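This is a one-line computation from the previous paragraph, writing [L] = 1 + c1(L) and using [L1 \tensor L2] = [L1][L2] in K-theory:

c1(L1 \tensor L2) = [L1 \tensor L2] - 1 = [L1][L2] - 1 = (1 + c1(L1))(1 + c1(L2)) - 1 = c1(L1) + c1(L2) + c1(L1)c1(L2)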

In the general case we have:

c1(L1 \tensor L2) = f(c1(L1), c1(L2)) for some f \in A(*)[[t1, t2]].

It turns out that the following properties hold:

* f(0, t) = f(t, 0) = t
* f(u, v) = f(v, u)
* f(a, f(b, c)) = f(f(a, b), c)

A power series with these properties is called a commutative 1-dimensional formal group law over the commutative ring A(*). It gives a group structure on the formal scheme Spf A(*)[[t]] = Spf A(CP^\infty).

We call the first of the group laws above, f(a, b) = a + b, the additive formal group law, denoted \hat{G_a}, and the other, f(a, b) = a + b + a*b, the multiplicative formal group law, denoted \hat{G_m}.

Now one wonders if there are other possible formal group laws. It turns out that there is a ring classifying formal group laws, the Lazard ring. Quillen proved that it comes from a cohomology theory called periodic complex cobordism, denoted MP. There is a canonical isomorphism MP(CP^\infty) <-> MP(*)[[t]], and the coefficient ring MP(*) is the Lazard ring.

One may construct the moduli stack of all formal group laws M_{FGL} so that for a commutative ring R, the set of homomorphisms Hom(Spec R, M_{FGL}) can be identified with the set of power series f(u, v) \in R[[u, v]] satisfying the three conditions above. In fact M_{FGL} is an affine scheme: M_{FGL} = Spec MP(*).

Well, it gets a bit more complicated than that. One must mod out by the action on M_{FGL} of the group of automorphisms of the formal affine line Spf Z[[x]]. This yields the moduli stack of formal groups M_{FG}.

Now what is an elliptic cohomology?

Well we need to relax one thing slightly. Instead of demanding that we have a periodic cohomology, we'll just require our multiplicative cohomology A to be weakly-periodic:

* The natural map A^2(*) \tensor_{A(*)} A^n(*) -> A^{n+2}(*) is an isomorphism for all n \in Z.

Now we can define an elliptic cohomology A:

* R is a commutative ring

* E is an elliptic curve over R

* A is a multiplicative cohomology theory which is even and weakly-periodic

* There are isomorphisms A(*) <-> R of rings and \hat{E} <-> Spf A(CP^\infty) of formal groups over R <-> A(*)

Here \hat{E} represents the formal completion of E along its identity section. It is a commutative 1-dimensional formal group over R. It is classified by a map \phi : Spec R -> M_{FG}.

For finite cell complexes X we may interpret the complex cobordism groups MP^n(X) as quasi-coherent sheaves on the moduli stack M_{FG}, and given a formal group over R, classified by \phi, we can define A^n(X) to be the pullback of these sheaves along \phi.

When \phi is flat we call the formal group Landweber-exact. Landweber gave a criterion for determining when a formal group is Landweber-exact.

Anyhow it turns out that in the elliptic cohomology case, when the formal group is Landweber-exact, the elliptic curve and the isomorphisms in the definition of elliptic cohomology are uniquely determined. In this case, giving the elliptic cohomology theory is exactly the same thing as giving an elliptic curve.

Now for formal group laws there is a universal example. But in the case of elliptic cohomology, there is no such thing as a universal elliptic curve over a commutative ring. The moduli stack of elliptic curves M_{1,1} is not an affine variety, and not even a scheme. However, it is a Deligne-Mumford stack.

For each etale morphism \phi : Spec R -> M_{1,1} there is an elliptic curve E_\phi and, happily, its associated formal group \hat{E_\phi} is then Landweber-exact.

Well, this allows us to create on M_{1,1} a presheaf taking values in the category of cohomology theories. But this is a nasty construction, and so we try instead to represent our cohomology theories by representing objects (spectra).

Roughly speaking this is where the theory of E-\infty rings or E-\infty spectra comes in. Basically a cohomology theory A_\phi can be represented by a spectrum. That eventually allows one to develop a universal elliptic cohomology theory.

We define a presheaf O_{M}^{Der} of E_\infty rings representing elliptic cohomology theories on the category {\phi : Spec R -> M_{1,1}} mentioned above.

Now we extract our universal cohomology theory by taking a "homotopy limit" of the functor O_{M}^{Der}. This gives us the E-\infty ring of topological modular forms tmf[\Delta^{-1}].

(Note tmf is what you get when you replace M_{1,1} by its Deligne-Mumford compactification. After inverting 2 and 3 [to do with the extra automorphisms on elliptic curves] there is an isomorphism from the homotopy groups of tmf to the ring of integral holomorphic modular forms.)

OK, that summarises the first 9 pages of the long summary above. It probably isn't much of an improvement on the original, but it helps the memory to write it all down somewhere.

Sunday 7 June 2009

MPIR - version 1.2

Finally version 1.2 of MPIR (Multiple Precision Integers and Rationals), of which I am a lead developer, is released: http://www.mpir.org/ 

MPIR is an open source project based on the GNU Multiple Precision (GMP, see http://www.gmplib.org/) library, but still licensed under version 2 of the LGPL.

About a month ago GMP version 4.3.0 was released, which they had been preparing for a LONG time. We expected some nice features, and found some, which we have subsequently been catching up with.

In particular we needed:

* Faster assembly code for multiplication basecase 

* Faster unbalanced integer multiplication (where you are multiplying integers of different sizes)

* Improvements to the speed of multiplying medium sized integers (50-2000 words, where 1 word = 64 bits on a 64 bit machine)

* Asymptotically fast extended GCD

* Faster integer multiplication for large integers (> 2000 limbs)

* Faster integer squaring

* Other assembly improvements

Multiplication Basecase

Jason Moxham, a brilliant MPIR developer from the UK, decided to take on the mul_basecase challenge. He has been writing an assembly optimiser for quite a few months. It takes hand written assembly code and reorganises the instructions over and over, within permitted bounds, to try to find an optimal sequence.

The results over the past year are pretty impressive to see:

Multiplications per second
                128 x 128 bits   512 x 512 bits   8192 x 8192 bits
MPIR 1.2.0          53794646         12488043           117404
GMP 4.3.0           52766506         10879150           114927
MPIR 1.1.2          51802252         11802334           111772
MPIR 1.0.0          35856598         10928085           111641
MPIR 0.9.0          37299412          8122452            86301
GMP 4.2.1           25896136          6383542            60819

Note that the number of multiplications that can be done per second has more than doubled in most cases since last year, and all this just from assembly language improvements.

The timings for this table were made on an Opteron (AMD K8) server, running at 2.8 GHz. 

Toom Multiplication

Toom multiplication can be described in terms of a polynomial multiplication problem. First the large integers to be multiplied are split apart into pieces, which form the coefficients of polynomials. The original integer multiplication problem then becomes one of polynomial multiplication, followed by a reconstruction phase at the end, where the polynomial coefficients of the product are stitched together to make the product integer.

This can be thought of in terms of writing the original integers in some large base 2^M, where M is usually a multiple of the machine word size (e.g. a multiple of 64 on a 64 bit machine).

For example, Toom-3 breaks the two integers into 3 pieces each, corresponding to the digits in base 2^M:

A = a0 + a1 * 2^M + a2 * 2^2M
B = b0 + b1 * 2^M + b2 * 2^2M

So we construct two polynomials:

f(x) = a0 + a1 * x + a2 * x^2
g(x) = b0 + b1 * x + b2 * x^2

Then A * B = f(2^M) * g(2^M). So we first compute h(x) = f(x) * g(x), and then A * B = h(2^M).

To multiply the polynomials f(x) and g(x), we note that their product will be degree 4, and so we can determine it fully if we know its value at 5 independent points. We choose, for convenience, the points 0, 1, -1, 2, infinity. Finally we note that h(0) = f(0) * g(0), h(1) = f(1) * g(1), etc.

Thus to compute the product we only need to find the values of f(x) and g(x) at the chosen points, do five small multiplications to get the value of h(x) at those points, then interpolate to get the coefficients of h(x).

So we have replaced the original large multiplication with five small ones. Note that each of the small multiplications involves integers at most one third of the size. If we just used schoolboy multiplication to multiply the "digits" of our large integers, we'd do 3 x 3 = 9 small "digit multiplications".  Instead, through the magic of evaluation/interpolation we only have 5 small "digit multiplications" to do (and a few additions and subtractions, etc., for the evaluation and interpolation phases).
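To make all this concrete, here is a toy Toom-3 in Python (which has native big integers, so the five pointwise products are just *). The function name and layout are mine, not MPIR's; it assumes A and B each fit in three base-2^M digits, and a real implementation recurses on the pointwise multiplications and is far more careful about the evaluation and interpolation phases:

def toom3(A, B, M):
    # Split A and B into 3 "digits" in base 2^M (assumes A, B < 2^(3M)).
    base = 1 << M
    a0, a1, a2 = A % base, (A >> M) % base, A >> (2*M)
    b0, b1, b2 = B % base, (B >> M) % base, B >> (2*M)

    # Evaluate f and g at the five points 0, 1, -1, 2, infinity.
    f0, g0     = a0, b0
    f1, g1     = a0 + a1 + a2, b0 + b1 + b2
    fm1, gm1   = a0 - a1 + a2, b0 - b1 + b2
    f2, g2     = a0 + 2*a1 + 4*a2, b0 + 2*b1 + 4*b2
    finf, ginf = a2, b2

    # Five small multiplications.
    h0, h1, hm1, h2, hinf = f0*g0, f1*g1, fm1*gm1, f2*g2, finf*ginf

    # Interpolate h(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4.
    c0 = h0
    c4 = hinf
    c2 = (h1 + hm1)//2 - c0 - c4
    t1 = (h1 - hm1)//2                   # = c1 + c3
    t2 = (h2 - c0 - 4*c2 - 16*c4)//2     # = c1 + 4*c3
    c3 = (t2 - t1)//3
    c1 = t1 - c3

    # Reconstruct: A * B = h(2^M).
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))

For example toom3(5, 7, 1) splits 5 and 7 into three 1-bit digits each and correctly returns 35.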

The basecase multiplication code already handles unbalanced multiplication, so I decided to focus on unbalanced variants of Toom multiplication algorithms.

In MPIR we use Toom-2 (also known as Karatsuba multiplication - though we use a variant called Knuth multiplication where evaluation happens at 0, -1 and infinity), Toom-3, Toom-4 and Toom-7. 

I spent a fair bit of time optimising Toom-3. As we have a lot of new assembly instructions available in MPIR, I was actually able to get the interpolation sequence down from about 11 steps to 8 in MPIR 1.2. I also discovered that it was possible to use the output integer for temporary space. If one is careful, one can even set it up so that the temporaries stored in the output space don't have to be moved at the end, i.e. they are in precisely the right place as part of the output integer when the algorithm finishes.

I made similar optimisations for Toom-4 and Toom-7, and I also switched the interpolation phase of the algorithms over to two's complement.

Here are the results of this work in the Toom range:

Toom Multiplication (multiplications per second)
                Kara (3200 bits)   Toom3 (7680 bits)   Toom4 (25600 bits)   Toom7 (131072 bits)
MPIR 1.2.0            300414              74337               11428                 1153
GMP 4.3.1             274738              68498               11263                 1042
MPIR 1.1.2            260501              60039                9890                  993
MPIR 1.0.0            261599              62275                9900                  826
GMP 4.2.1             163273              33460                4980                  408

Timings are on a 2.4GHz Core 2 (eno).

The differences in timing in the Karatsuba region for MPIR give an indication of how much of the speedup is due to better assembly code. Anything beyond that indicates a speedup in the Toom algorithms themselves.

Note GMP 4.2.1 had Karatsuba and Toom-3 only, and GMP 4.3.1 does not have Toom-7.

Unbalanced Multiplication

Now unbalanced multiplication proceeds in the same way, except that we have integers, and thus polynomials, of different lengths:

f(x) = a0 + a1 * x + a2 * x^2 + a3 * x^3
g(x) = b0 + b1 * x

Again the product is degree four and so we need to interpolate at five points, which we can choose to be precisely the same points that we used for Toom-3, even reusing the interpolation code, if we wish.

We call this Toom-42, denoting that we split our integers into 4 and 2 "digits" respectively, in our base 2^M.

The result is 5 small multiplications instead of 4 x 2 = 8; still a substantial saving.
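As a rough Python sketch (with the same caveats as the Toom-3 code above), only the splitting and evaluation change; the five points and the interpolation are exactly those of Toom-3:

def toom42(A, B, M):
    # Split A into 4 digits and B into 2 digits in base 2^M.
    base = 1 << M
    a = [(A >> (i*M)) % base for i in range(3)] + [A >> (3*M)]
    b = [B % base, B >> M]

    def f(x): return a[0] + a[1]*x + a[2]*x**2 + a[3]*x**3
    def g(x): return b[0] + b[1]*x

    # h = f*g has degree 4 again, so the same five points suffice.
    h0, h1, hm1 = a[0]*b[0], f(1)*g(1), f(-1)*g(-1)
    h2, hinf = f(2)*g(2), a[3]*b[1]

    # Interpolation identical to Toom-3.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1)//2 - c0 - c4
    t1 = (h1 - hm1)//2
    t2 = (h2 - c0 - 4*c2 - 16*c4)//2
    c3 = (t2 - t1)//3
    c1 = t1 - c3
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))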

Finally I implemented some of the unbalanced variants. In particular we now have Toom-42, an unbalanced version of Toom-33 (where the top coefficients are not necessarily exactly the same size) and Toom-32.

The results for unbalanced multiplications in the Toom range are now quite good:

Unbalanced Multiplication (multiplications per second)
                15000 x 10000 bits   20000 x 10000 bits   30000 x 10000 bits
MPIR 1.2.0              32975                25239                14995
GMP 4.3.1               31201                24099                14104
MPIR 1.1.2              23190                19970                13289
GMP 4.2.1               13235                10855                 7235

Timings are on the 2.4GHz Core 2 (eno).

Toom Squaring

Squaring using the Toom algorithms is much the same, except that there is no need to evaluate the same polynomial twice. We also save because the pointwise multiplications are now also squarings and we can recurse, right down to a basecase squaring routine.

All of the same optimisations can be applied for Toom squaring as for ordinary Toom multiplication. Here are the comparisons:

Toom Squaring (squarings per second)
                Kara (5120 bits)   Toom3 (12800 bits)   Toom4 (51200 bits)   Toom7 (131072 bits)
MPIR 1.2.0            358656              82274               10514                 2762
GMP 4.3.0             357031              83676               10430                 2564
MPIR 1.1.2            349686              78510                9917                 2347
GMP 4.2.1             185201              42256                5112                 1237

The performance is now comparable to GMP until the Toom7 region, where we pull ahead considerably. These timings were done on the 2.8GHz Opteron (K8) server.

FFT Multiplication

For multiplication of large integers, MPIR uses a Fast Fourier Transform (FFT) method. Instead of working over the complex numbers, one can work over a ring Z/pZ where p = 2^M + 1 for M some power of 2 (such p need not be prime; the ring Z/pZ suffices). This trick is due to Schönhage and Strassen.
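The point of choosing p = 2^M + 1 is that 2^M = -1 (mod p), so 2 is a 2M-th root of unity and the root-of-unity multiplications inside the transform become simple bit shifts. A tiny Python check of that property (illustration only, nothing like the real FFT code):

M = 64
p = (1 << M) + 1
assert pow(2, M, p) == p - 1     # 2^M = -1 (mod p)
assert pow(2, 2*M, p) == 1       # so 2 is a 2M-th root of unity mod p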

For MPIR 1.2 we decided to switch over to the new FFT of Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann. An implementation of ideas they presented at ISSAC 2007 was available to download from Paul and Alex's websites.

Most of the work in merging this FFT was removing bugs, making the code work efficiently on Windows and writing and running tuning code for all the platforms we support.

I wrote the tuning code, Brian Gladman discovered some primitives to use on Windows to replace the inline assembler available on Linux, and Jason Moxham, Brian Gladman, Jeff Gilchrist and I tuned the FFT on a range of systems.

Here are some examples of the speedup:

FFT Multiplication (multiplications per second)
                2000000 bits   6000000 bits   20000000 bits   64000000 bits
MPIR 1.2.0          74.9           12.8            5.20            1.47
GMP 4.3.0           52.4           13.4            3.66            0.813
MPIR 1.1.2          47.2           12.5            3.09            0.742
GMP 4.2.1           32.8            8.66           2.11            0.528

Timings are again on the 2.8GHz Opteron server.

Extended GCD

A longstanding issue with MPIR has been the lack of a fast extended GCD. We had merged Niels Möller's asymptotically fast GCD code, but unfortunately no suitable extended GCD implementation was available to us.

I began by reading Niels' paper to understand the half-GCD algorithm (ngcd) that he invented. Essentially half-GCD algorithms get their asymptotic improvement over the ordinary Euclidean algorithm as follows:

Begin by noting that the first few steps only depend on the uppermost bits of the original integers. Thus instead of working to full precision, one can split off the topmost bits and compute the first few steps of the Euclidean GCD on those bits. One keeps track of the steps taken in matrix form. One then has an exact representation for the steps taken so far in the algorithm.

Next one applies the matrix to the original integers. The integers that result are smaller than the original integers. One then finishes off the algorithm by applying the usual steps of the GCD algorithm to the smaller integers.

Of course to get an asymptotic improvement one needs to take care how to apply the matrix and to recurse the whole algorithm down to some fast basecase.

Once I understood that Niels' implementation explicitly computed the matrix at each step, it occurred to me that in order to turn it into an extended GCD algorithm, all I had to do was apply the matrix to the cofactors and keep track of any other changes that would affect the cofactors, throughout the algorithm.
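To illustrate just the matrix bookkeeping, here is a toy Python version (nothing like MPIR's actual code). Each Euclidean step (a, b) -> (b, a - q*b) is the 2x2 matrix [[0, 1], [1, -q]], and the product of these matrices is an exact record of the steps taken. Run to completion, the top row of the accumulated matrix consists of precisely the Bezout cofactors, which is why applying the matrices to the cofactors as well yields an extended GCD:

def euclid_matrix(a, b):
    # Maintain the invariant (current a, current b) = m * (original a, original b).
    m = [[1, 0], [0, 1]]
    while b != 0:
        q = a // b
        # (a, b) -> (b, a - q*b): left-multiply m by [[0, 1], [1, -q]].
        m = [[m[1][0], m[1][1]],
             [m[0][0] - q*m[1][0], m[0][1] - q*m[1][1]]]
        a, b = b, a - q*b
    return a, m    # a is now the gcd

g, m = euclid_matrix(12345, 6789)
assert m[0][0]*12345 + m[0][1]*6789 == g    # cofactors read off from the matrix

The half-GCD algorithm computes such a matrix from just the top bits of the operands, applies it to the full integers, and recurses.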

As an example of the speedup, for 1048576 bit integers extended GCD benches as follows (operations per second, larger is better):

MPIR 1.1.2: 0.453
MPIR 1.2.0: 4.06

Again the benchmarks are for the 2.8GHz Opteron.

Assembly Improvements

Finally this mammoth MPIR release also contained numerous sped up assembly routines due to Jason Moxham for AMD Opterons (which usually also translate to improvements for other 64 bit x86 platforms).

Below I include timings (smaller is better) at each point where improvements have been made in the assembly routines in MPIR.

Assembly Improvements
                    MPIR 0.9   MPIR 1.0   MPIR 1.1.2   MPIR 1.2
mpn_add_n               1598       1524
mpn_sub_n               1598       1524
mpn_xor_n               3015       2271
mpn_and_n               3014                    1772       1523
mpn_ior_n               3014                    1771       1525
mpn_nand_n              3013                    2035       1775
mpn_nior_n              3013       1773
mpn_andn_n              3013       2273
mpn_iorn_n              3015       2271
mpn_addadd_n                       2525                    2190
mpn_addsub_n                       2526                    2189
mpn_subadd_n                                    2526       2190
mpn_addmul_1            3094       2524
mpn_submul_1            3092       2524
mpn_mul_1               3024       2522
mpn_sublsh1_n                      2527        2400        2190
mpn_com_n               3014       1271
mpn_divexact_by3       12016       2278
mpn_lshift              2531       1701
mpn_rshift              2532       1617
mpn_hamdist             8281       1791
mpn_popcount            7281       1527
MPN_COPY                3012                    2017       1021
mpn_divrem_1           23649                   15119
MPN_ZERO                2013                                783

(Blank entries indicate no change in that release.)

This time timings are taken on a 2.6GHz AMD K10 box.

Numerous new assembly functions have also been added for 64 bit x86 machines since MPIR began, including: mpn_addmul_2, mpn_addadd_n, mpn_sublsh1_n, mpn_divexact_byff, mpn_rsh1add_n, mpn_lshift1, mpn_rshift1, mpn_rsh1sub_n, mpn_mul_2, mpn_lshift2, mpn_rshift2.