
502 Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC

There is one more important aspect of the DGT convolution: All mod operations are with respect to Mersenne primes, and so an implementation can enjoy the considerable speed advantage we have previously encountered for such special cases of the modulus.

9.5.6 Schönhage method

The pioneering work in [Schönhage and Strassen 1971], [Schönhage 1982], based on Strassen's ideas for using FFTs in large-integer multiplication, focuses on the fact that a certain number-theoretical transform is possible—using exclusively integer arithmetic—in the ring Z_{2^m+1}. This is sometimes called a Fermat number transform (FNT) (see Exercise 9.52) and can be used within a certain negacyclic convolution approach as follows (we are grateful to P. Zimmermann for providing a clear exposition of the method, from which description we adapted our rendition here):

Algorithm 9.5.23 (Fast multiplication (mod 2^n + 1) [Schönhage]). Given two integers 0 ≤ x, y < 2^n + 1, this algorithm returns the product xy mod (2^n + 1).

1. [Initialize]
   Choose FFT size D = 2^k dividing n;
   Writing n = DM, set a recursion length n′ ≥ 2M + k such that D divides n′, i.e., n′ = DM′;
2. [Decomposition]
   Split x and y into D parts of M bits each, and store these parts, considered as residues modulo (2^{n′} + 1), in two respective arrays A_0, ..., A_{D−1} and B_0, ..., B_{D−1}, taking care that an array element could in principle have n′ + 1 bits later on;
3. [Prepare DWT by weighting the A, B signals]
   for(0 ≤ j < D) {
      A_j = (2^{jM′} A_j) mod (2^{n′} + 1);
      B_j = (2^{jM′} B_j) mod (2^{n′} + 1);
   }
4. [In-place, symbolic FFTs]                 // Use 2^{2M′} as D-th root mod (2^{n′} + 1).
   A = DFT(A);
   B = DFT(B);
5. [Dyadic stage]
   for(0 ≤ j < D) A_j = A_j B_j mod (2^{n′} + 1);
6. [Inverse FFT]
   A = DFT(A);                               // Inverse via index reversal, next loop.
7. [Normalization]
   for(0 ≤ j < D) {
      C_j = A_{D−j}/2^{k+jM′} mod (2^{n′} + 1);   // A_D defined as A_0. Reverse and twist.
8. [Adjust signs]
      if(C_j > (j + 1)2^{2M}) C_j = C_j − (2^{n′} + 1);   // C_j now possibly negative.
   }
9. [Composition]
   Perform carry operations as in steps [Adjust carry in base B] for B = 2^M (the original decomposition base) and [Final modular adjustment] of Algorithm 9.5.17 to return the desired sum:

   xy mod (2^n + 1) = Σ_{j=0}^{D−1} C_j 2^{jM} mod (2^n + 1);

Note that in the [Decomposition] step, A_{D−1} or B_{D−1} may equal 2^M and have M + 1 bits in the case where x or y equal 2^n. In Step [Prepare DWT . . .], each multiply can be done using shifts and subtractions only, as 2^{n′} ≡ −1 (mod 2^{n′} + 1). In Step [Dyadic stage], one can use any multiplication algorithm, for example a grammar-school stage, the Karatsuba algorithm, or this very Schönhage algorithm recursively. In Step [Normalization], the divisions by a power of two again can be done using shifts and subtractions only. Thus the only multiplication per se is in Step [Dyadic stage], and this is why the method can attain, in principle, such low complexity. Note also that the two FFTs required for the negacyclic result signal C can be performed in the order DIF, DIT, for example by using parts of Algorithm 9.5.5 in proper order, thus obviating the need for any bit-scrambling procedure.
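The shift-and-subtract reductions mentioned above are easy to make concrete. The following Python helper (our own illustration, not part of the text) computes (2^s x) mod (2^n + 1) using only shifts, masks, and additions/subtractions, relying on 2^n ≡ −1 and 2^{2n} ≡ 1 modulo 2^n + 1:

```python
def mulmod_pow2(x, s, n):
    """(x * 2**s) mod (2**n + 1), via shifts and add/subtract only.

    Key facts: 2**n == -1 and 2**(2*n) == 1 (mod 2**n + 1), so the
    exponent reduces mod 2n, and a value y = hi*2**n + lo folds to lo - hi."""
    m = (1 << n) + 1
    y = (x % m) << (s % (2 * n))        # reduce the exponent mod 2n
    neg = False
    while y >= m:
        lo = y & ((1 << n) - 1)         # low n bits
        hi = y >> n                     # each 2**n contributes a factor of -1
        if lo >= hi:
            y = lo - hi
        else:
            y, neg = hi - lo, not neg   # track the sign separately
    if neg and y:
        y = m - y
    return y
```

For example, with n = 4 (modulus 17), mulmod_pow2(3, 5, 4) folds 96 down to 11 without ever multiplying.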

As it stands, Algorithm 9.5.23 will multiply two integers modulo any Fermat number, and such application is an important one, as explained in other sections of this book. For general multiplication of two integers x and y, one may call the Schönhage algorithm with n ≥ ⌈lg x⌉ + ⌈lg y⌉, zero-padding x, y accordingly, whence the product xy mod (2^n + 1) equals the integer product. (In a word, the negacyclic convolution of appropriately zero-padded sequences is the acyclic convolution—the product in essence.) In practice, Schönhage suggests using what he calls "suitable numbers," i.e., n = ν2^k with k − 1 ≤ ν ≤ 2k − 1. For example, 688128 = 21 · 2^15 is a suitable number. Such numbers enjoy the property that if k′ = ⌊k/2⌋ + 1, then n′ = ⌈(ν + 1)/2⌉ 2^{k′} is also a suitable number; here we get indeed n′ = 11 · 2^8 = 2816. Of course, one loses a factor of two initially with respect to modular multiplication, but in the recursive calls all computations are performed modulo some 2^M + 1, so the asymptotic complexity is still that reported in Section 9.5.8.
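The parenthetical claim—that the negacyclic convolution of zero-padded sequences is the acyclic convolution—is easy to check numerically. A small Python sketch of ours, with grammar-school routines:

```python
def negacyclic(x, y):
    # grammar-school negacyclic convolution: wrapped products pick up a minus sign
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - D] -= x[i] * y[j]
    return z

def acyclic(x, y):
    # ordinary polynomial (acyclic) convolution
    z = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            z[i + j] += xi * yj
    return z

x, y = [1, 2, 3], [4, 5, 6]
padded = negacyclic(x + [0] * 3, y + [0] * 3)   # pad each signal to length 6
# no index ever wraps, so no minus signs appear: the acyclic result emerges
assert padded[:5] == acyclic(x, y)
```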

9.5.7 Nussbaumer method

It is an important observation that a cyclic convolution of some even length D can be cast in terms of a pair of convolutions, a cyclic and a negacyclic, each of length D/2. The relevant identity is

2(x × y) = [(u₊ × v₊) + (u₋ ×₋ v₋)] ∪ [(u₊ × v₊) − (u₋ ×₋ v₋)],   (9.36)

where ∪ denotes concatenation of signals, and the u, v signals depend in turn on half-signals:

u± = L(x) ± H(x),

v± = L(y) ± H(y).
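Identity (9.36) can be verified directly with a short script (an illustration of ours; cyclic and negacyclic here are naive grammar-school routines, and the final division by 2 reflects the cancelability requirement discussed below):

```python
def cyclic(x, y):
    # grammar-school cyclic convolution of length D
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            z[(i + j) % D] += x[i] * y[j]
    return z

def negacyclic(x, y):
    # grammar-school negacyclic convolution: wrapped products change sign
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - D] -= x[i] * y[j]
    return z

def cyclic_via_936(x, y):
    # length-D cyclic from one length-D/2 cyclic and one length-D/2 negacyclic
    h = len(x) // 2
    up = [a + b for a, b in zip(x[:h], x[h:])]   # u+ = L(x) + H(x)
    um = [a - b for a, b in zip(x[:h], x[h:])]   # u- = L(x) - H(x)
    vp = [a + b for a, b in zip(y[:h], y[h:])]
    vm = [a - b for a, b in zip(y[:h], y[h:])]
    c, nc = cyclic(up, vp), negacyclic(um, vm)
    # 2(x × y) = [c + nc] ∪ [c - nc], so halve and concatenate
    return [(a + b) // 2 for a, b in zip(c, nc)] + [(a - b) // 2 for a, b in zip(c, nc)]

assert cyclic_via_936([1, 2, 3, 4], [5, 6, 7, 8]) == cyclic([1, 2, 3, 4], [5, 6, 7, 8])
```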


This recursion formula for cyclic convolution can be proved via polynomial algebra (see Exercise 9.42). The recursion relation together with some astute algebraic observations led [Nussbaumer 1981] to an efficient convolution scheme devoid of floating-point transforms. The algorithm is thus devoid of rounding-error problems, and often, therefore, is the method of choice for rigorous machine proofs involving large-integer arithmetic.

Looking longingly at the previous recursion, it is clear that if only we had a fast negacyclic algorithm, then a cyclic convolution could be done directly, much like that which an FFT performs via decimation of signal lengths. To this end, let R denote a ring in which 2 is cancelable; i.e., x = y whenever 2x = 2y. (It is intriguing that this is all that is required to "ignite" Nussbaumer convolution.) Assume a length D = 2^k for negacyclic convolution, and that D factors as D = mr, with m | r. Now, negacyclic convolution is equivalent to polynomial multiplication (mod t^D + 1) (see Exercises), and as an operation can in a certain sense be "factored" as specified in the following:

Theorem 9.5.24 (Nussbaumer). Let D = 2^k = mr, m | r. Then negacyclic convolution of length-D signals whose elements belong to a ring R is equivalent, in the sense that polynomial coefficients correspond to signal elements, to multiplication in the polynomial ring

S = R[t]/(t^D + 1).

Furthermore, this ring is isomorphic to

T[t]/(z − t^m),

where T is the polynomial ring

T = R[z]/(z^r + 1).

Finally, z^{r/m} is an m-th root of −1 in T.

Nussbaumer's idea is to use this root of −1 in a manner reminiscent of our DWT, to perform a negacyclic convolution.

Let us exhibit explicit polynomial manipulations to clarify the import of Theorem 9.5.24. Let

x(t) = x_0 + x_1 t + · · · + x_{D−1} t^{D−1},

and similarly for signal y, with the x_j, y_j in R. Note that x ×₋ y is equivalent to the multiplication x(t)y(t) in the ring S. Now decompose

x(t) = Σ_{j=0}^{m−1} X_j(t^m) t^j,

and similarly for y(t), and interpret each of the polynomials X_j, Y_j as an element of ring T; thus

X_j(z) = x_j + x_{j+m} z + · · · + x_{j+m(r−1)} z^{r−1},


and similarly for the Yj . It is evident that the (total of) 2m X, Y polynomials can be stored in two arrays that are (r, m)-transpositions of x, y arrays respectively. Next we multiply x(t)y(t) by performing the cyclic convolution

Z = (X0, X1, . . . , Xm−1, 0, . . . , 0) × (Y0, Y1, . . . , Ym−1, 0, . . . , 0) ,

where each operand signal here has been zero-padded to total length 2m. The key point here is that Z can be evaluated by a symbolic DFT, using what we know to be a primitive 2m-th root of unity, namely zr/m. What this means is that the usual FFT butterfly operations now involve mere shuttling around of polynomials, because multiplications by powers of the primitive root just translate coe cient polynomials. In other words the polynomial arithmetic now proceeds along the lines of Theorem 9.2.12, in that multiplication by a power of the relevant root is equivalent to a kind of shift operation.
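The "shuttling" can be made quite concrete: in T = R[z]/(z^r + 1), multiplying an element by a power of z is just a rotation of its coefficient array, with a sign flip on each wraparound because z^r = −1. A small Python sketch of ours:

```python
def zshift(p, e):
    # multiply p (coefficients of an element of R[z]/(z^r + 1)) by z^e;
    # z^r = -1, so z has order 2r and each wraparound flips the sign
    r = len(p)
    e %= 2 * r                       # exponents are periodic with period 2r
    out = [0] * r
    for i, c in enumerate(p):
        wraps, k = divmod(i + e, r)
        out[k] += -c if wraps & 1 else c
    return out

# r = 4: (1 + 2z) * z^3 = z^3 + 2z^4 = -2 + z^3
assert zshift([1, 2, 0, 0], 3) == [-2, 0, 0, 1]
# multiplying by z^(2r) is the identity
assert zshift([1, 2, 3, 4], 8) == [1, 2, 3, 4]
```

No ring multiplications occur; this is why the symbolic DFT costs only additions and data movement.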

At a key juncture of the usual DFT-based convolution method, namely the dyadic (elementwise) multiply step, the dyadic operations can be seen to be themselves length-r negacyclic convolutions. This is evident on the observation that each of the polynomials X_j, Y_j has degree (r − 1) in the variable z = t^m, and so z^r = t^D = −1. To complete the Z convolution, a final, inverse DFT, with root z^{−r/m}, is to be used. The result of this zero-padded convolution is seen to be a product in the ring S:

 

x(t)y(t) = Σ_{j=0}^{2m−2} Z_j(t^m) t^j,   (9.37)

from which we extract the negacyclic elements of x ×₋ y as the coefficients of the powers of t.

Algorithm 9.5.25 (Nussbaumer convolution, cyclic and negacyclic).

Assume length-(D = 2^k) signals x, y whose elements belong to a ring R, which ring also admits of cancellation-by-2. This algorithm returns either the cyclic (x × y) or negacyclic (x ×₋ y) convolution. Inside the negacyclic function neg is a "small" negacyclic routine smallneg, for example a grammar-school or Karatsuba version, which is called below a certain length threshold.

1. [Initialize]
   r = 2^⌈k/2⌉; m = D/r;       // Now m divides r.
   blen = 16;                  // Tune this small-negacyclic breakover length to taste.
2. [Cyclic convolution function cyc, recursive]
   cyc(x, y) {
      By calling half-length cyclic and negacyclic convolutions, return the desired cyclic, via identity (9.36);
   }
3. [Negacyclic convolution function neg, recursive]
   neg(x, y) {
      if(len(x) ≤ blen) return smallneg(x, y);


4. [Transposition step]
   Create a total of 2m arrays X_j, Y_j each of length r;
   Zero-pad the X, Y collections so each collection has 2m polynomials;
   Using root g = z^{r/m}, perform (symbolically) two length-2m DFTs to get the transforms X̂, Ŷ;
5. [Recursive dyadic operation]
   for(0 ≤ h < 2m) Ẑ_h = neg(X̂_h, Ŷ_h);
6. [Inverse transform]
   Using root g = z^{−r/m}, perform (symbolically) a length-(2m) inverse DFT on Ẑ to get Z;
7. [Untranspose and adjust]
   Working in the ring S (i.e., reduce polynomials according to t^D = −1), find the coefficients z_n of t^n, n ∈ [0, D − 1], from equation (9.37);
   return (z_n);               // Return the negacyclic of x, y.
   }

 

 

 

 

Detailed implementation of Nussbaumer's remarkable algorithm can be found in [Crandall 1996a], where enhancements are discussed. One such enhancement is to obviate the zero-padding of the X, Y collections (see Exercise 9.66). Another is to recognize that the very formation of the X_j, Y_j amounts to a transposition of a two-dimensional array, and memory can be reduced significantly by effective such transposition "in place." [Knuth 1981] has algorithms for in-place transposition. Also of interest is the algorithm of [Fraser 1976] mentioned in connection with Algorithm 9.5.7.
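To make the structure of Algorithm 9.5.25 concrete, here is a compact Python sketch of ours of the recursive negacyclic function over R = Z. The names smallneg and neg follow the text; zshift, the breakover length of 4, and all other details are our own illustrative choices, with naive O((2m)^2) symbolic DFTs rather than FFT butterflies:

```python
def smallneg(x, y):
    # grammar-school negacyclic convolution, used below the breakover length
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - D] -= x[i] * y[j]
    return z

def zshift(p, e):
    # multiply an element of Z[z]/(z^r + 1) by z^e (z has order 2r)
    r = len(p)
    e %= 2 * r
    out = [0] * r
    for i, c in enumerate(p):
        wraps, k = divmod(i + e, r)
        out[k] += -c if wraps & 1 else c
    return out

def neg(x, y, blen=4):
    # Nussbaumer negacyclic convolution of length D = 2^k over Z
    D = len(x)
    if D <= blen:
        return smallneg(x, y)
    k = D.bit_length() - 1
    r = 1 << ((k + 1) // 2)            # r = 2^ceil(k/2), so m = D/r divides r
    m = D // r
    w = r // m                         # g = z^w is a primitive 2m-th root of unity in T
    # transposition: X_j(z) = sum_i x_{j+mi} z^i, zero-padded to 2m polynomials
    X = [[x[j + m * i] for i in range(r)] for j in range(m)] + [[0] * r for _ in range(m)]
    Y = [[y[j + m * i] for i in range(r)] for j in range(m)] + [[0] * r for _ in range(m)]
    def dft(A, sgn):                   # symbolic length-2m DFT: only shifts and adds
        out = []
        for a in range(2 * m):
            acc = [0] * r
            for b in range(2 * m):
                acc = [u + v for u, v in zip(acc, zshift(A[b], sgn * w * a * b))]
            out.append(acc)
        return out
    Xh, Yh = dft(X, 1), dft(Y, 1)
    Zh = [neg(Xh[h], Yh[h], blen) for h in range(2 * m)]   # recursive dyadic stage
    Z = [[c // (2 * m) for c in p] for p in dft(Zh, -1)]   # inverse DFT; /2m is exact
    out = [0] * D                      # untranspose: t^{j+mi}, reduced by t^D = -1
    for j in range(2 * m):
        for i in range(r):
            wraps, e = divmod(j + m * i, D)
            out[e] += -Z[j][i] if wraps & 1 else Z[j][i]
    return out
```

The only multiplications of ring elements occur inside smallneg at the recursion floor; everything else is additions and coefficient shuttling, mirroring the discussion above.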

9.5.8 Complexity of multiplication algorithms

In order to summarize the complexities of the aforementioned fast multiplication methods, let us clarify the nomenclature. In general, we speak of operands (say x, y) to be multiplied, of size N = 2^n, or n binary bits, or D digits, all equivalently in what follows. Thus for example, if the digits are in base B = 2^b, we have

Db ≈ n,

signifying that the n bits of an operand are split into D signal elements. This symbolism is useful because we need to distinguish between bit- and operation-complexity bounds.

Recall that the complexities of grammar-school, Karatsuba, and Toom–Cook multiplication schemes all have the form O(n^α) = O(ln^α N) bit operations for all the involved multiplications. (We state things this way because in the Toom–Cook case one must take care to count bit operations due to the possibly significant addition count.) So for example, α = 2 for grammar-school methods, Karatsuba and Toom–Cook methods lower this α somewhat, and so on.

Then we have the basic Schönhage–Strassen FFT multiplication Algorithm 9.5.12. Suddenly, the natural description has a different flavor, for we know that the complexity must be

O(D ln D)

operations, and as we have said, these are usually, in practice, floating-point operations (both adds and multiplies are bounded in this fashion). Now the bit complexity is not O((n/b) ln(n/b))—that is, we cannot just substitute D = n/b in the operation-complexity estimate—because floating-point arithmetic on larger digits must, of course, be more expensive. When these notions are properly analyzed we obtain the Strassen bound of

O(n(C ln n)(C ln ln n)(C ln ln ln n) · · ·)

bit operations for the basic FFT multiply, where C is a constant and the ln ln · · · chain is understood to terminate when it falls below 1. Before we move ahead with other estimates, we must point out that even though this bit complexity is not asymptotically optimal, some of the greatest achievements in the general domain of large-integer arithmetic have been achieved with this basic Schönhage–Strassen FFT, and yes, using floating-point operations.

Now, the Schönhage Algorithm 9.5.23 gets neatly around the problem that for a fixed number of signal digits D, the digit operations (small multiplications) must get more complex for larger operands. Analysis of the recursion within the algorithm starts with the observation that at top recursion level, there are two DFTs (but very simple ones—only shifting and adding occur) and the dyadic multiply. Detailed analysis yields the best-known complexity bound of

O(n(ln n)(ln ln n))

bit operations, although the Nussbaumer method’s complexity, which we discuss next, is asymptotically equivalent.

Next, one can see (as in Exercise 9.67) that the complexity of Nussbaumer convolution is

O(D ln D)

operations in the R ring. This is equivalent to the complexity of floating-point FFT methods, if ring operations are thought of as equivalent to floating-point operations. However, with the Nussbaumer method there is a difference: One may choose the digit base B with impunity. Consider a base B ≈ n, so that b ≈ ln n, in which case one is effectively using D = n/ln n digits. It turns out that the Nussbaumer method for integer multiplication then takes O(n ln ln n) additions and O(n) multiplications of numbers each having O(ln n) bits. It follows that the complexity of the Nussbaumer method is asymptotically that of the Schönhage method, i.e., O(n ln n ln ln n) bit operations. Such complexity issues for both Nussbaumer and the original Schönhage–Strassen algorithm are discussed in [Bernstein 1997].


Algorithm                   optimal B     complexity
----------------------------------------------------------------------
Basic FFT, fixed-base       . . .         O_op(D ln D)
Basic FFT, variable-base    O(ln n)       O(n(C ln n)(C ln ln n) . . .)
Schönhage                   O(n^{1/2})    O(n ln n ln ln n)
Nussbaumer                  O(n/ln n)     O(n ln n ln ln n)

Table 9.1 Complexities for fast multiplication algorithms. Operands to be multiplied have n bits each, which during top recursion level are split into D = n/b digits of b bits each, so the digit size (the base) is B = 2^b. All bounds are for bit complexity, except that O_op means operation complexity.

9.5.9 Application to the Chinese remainder theorem

We described the Chinese remainder theorem in Section 2.1.3, and there gave a method, Algorithm 2.1.7, for reassembling CRT data given some precomputation. We now describe a method that not only takes advantage of preconditioning, but also fast multiplication methods.

Algorithm 9.5.26 (Fast CRT reconstruction with preconditioning).

Using the nomenclature of Theorem 2.1.6, we assume fixed moduli m_0, . . . , m_{r−1} whose product is M, but with r = 2^k for computational convenience. The goal of the algorithm is to reconstruct n from its given residues (n_i). Along the way, tableaux (q_{ij}) of partial products and (n_{ij}) of partial residues are calculated. The algorithm may be reentered with a new n if the m_i remain fixed.

1. [Precomputation]
   for(0 ≤ i < r) {                   // Generate the M_i and inverses.
      M_i = M/m_i;
      v_i = M_i^{−1} mod m_i;
   }
   for(0 ≤ j ≤ k) {                   // Generate partial products.
      for(0 ≤ i ≤ r − 2^j) q_{ij} = ∏_{a=i}^{i+2^j−1} m_a;
   }
2. [Reentry point for given input residues (n_i)]
   for(0 ≤ i < r) n_{i0} = v_i n_i;
3. [Reconstruction loop]
   for(1 ≤ j ≤ k) {
      for(i = 0; i < r; i = i + 2^j) n_{ij} = n_{i,j−1} q_{i+2^{j−1},j−1} + n_{i+2^{j−1},j−1} q_{i,j−1};
   }
4. [Return the unique n in [0, M − 1]]
   return n_{0k} mod q_{0k};

Note that the first, precomputation, phase of the algorithm can be done just once, with a particular input of residues (n_i) used for the first time at the initialization phase. Note also that the precomputation of the (q_{ij}) can itself be performed with a fast divide-and-conquer algorithm of the type discussed in Chapter 8.8 (for example, Exercise 9.74). As an example of the operation of Algorithm 9.5.26, let us take r = 8 = 2^3 and eight moduli: (m_1, . . . , m_8) = (3, 5, 7, 11, 13, 17, 19, 23). Then we use these moduli along with the product M = ∏_{i=1}^{8} m_i = 111546435, to obtain at the [Precomputation] phase M_1, . . . , M_8, which are, respectively,

37182145, 22309287, 15935205, 10140585, 8580495, 6561555, 5870865, 4849845,

(v_1, . . . , v_8) = (1, 3, 6, 3, 1, 11, 9, 17),

and the tableau

(q_{00}, . . . , q_{70}) = (3, 5, 7, 11, 13, 17, 19, 23),
(q_{01}, . . . , q_{61}) = (15, 35, 77, 143, 221, 323, 437),
(q_{02}, . . . , q_{42}) = (1155, 5005, 17017, 46189, 96577),
q_{03} = 111546435,

where we recall that for fixed j there exist q_{ij} for i ∈ [0, r − 2^j]. It is important to note that all of the computation up through the establishment of the q tableau can be done just once—as long as the CRT moduli m_i are not going to change in future runs of the algorithm. Now, when specific residues n_i of some mystery n are to be processed, let us say

(n_1, . . . , n_8) = (1, 1, 1, 1, 3, 3, 3, 3),

we have after the [Reconstruction loop] step, the value

n_{0k} = 878271241,

which when reduced mod q_{03} is the correct result n = 97446196. Indeed, a quick check shows that

97446196 mod (3, 5, 7, 11, 13, 17, 19, 23) = (1, 1, 1, 1, 3, 3, 3, 3).
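The worked example can be reproduced with a direct Python transcription of Algorithm 9.5.26 (our own code; the loops are naive rather than using fast multiplication, so only the structure is illustrated):

```python
def crt_precompute(m):
    # m: list of r = 2^k pairwise coprime moduli
    r = len(m)
    k = r.bit_length() - 1
    M = 1
    for mi in m:
        M *= mi
    v = [pow(M // mi, -1, mi) for mi in m]     # v_i = M_i^(-1) mod m_i (Python 3.8+)
    q = {}                                     # q[i, j] = m_i * ... * m_{i+2^j-1}
    for j in range(k + 1):
        for i in range(r - (1 << j) + 1):
            p = 1
            for a in range(i, i + (1 << j)):
                p *= m[a]
            q[i, j] = p
    return v, q, k

def crt_reconstruct(res, m, v, q, k):
    r = len(m)
    n = {i: v[i] * res[i] for i in range(r)}   # n_{i0} = v_i n_i
    for j in range(1, k + 1):                  # reconstruction loop
        half = 1 << (j - 1)
        n = {i: n[i] * q[i + half, j - 1] + n[i + half] * q[i, j - 1]
             for i in range(0, r, 1 << j)}
    return n[0] % q[0, k]                      # unique n in [0, M - 1]

m = [3, 5, 7, 11, 13, 17, 19, 23]
v, q, k = crt_precompute(m)
n = crt_reconstruct([1, 1, 1, 1, 3, 3, 3, 3], m, v, q, k)
assert n == 97446196
```

Before the final reduction, n[0] here is exactly the value 878271241 quoted above.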

The computational complexity of Algorithm 9.5.26 is known in the following form [Aho et al. 1974, pp. 294–298], assuming that fast multiplication is used. If each of the r moduli m_i has b bits, then the complexity is

O(br ln r ln(br) ln ln(br))

bit operations, on the assumption that all of the precomputation for the algorithm is in hand.

9.6 Polynomial arithmetic

It is an important observation that polynomial multiplication/division is not quite the same as large-integer multiplication/division. However, ideas discussed in the previous sections can be applied, in a somewhat different manner, in the domain of arithmetic of univariate polynomials.


9.6.1 Polynomial multiplication

We have seen that polynomial multiplication is equivalent to acyclic convolution. Therefore, the product of two polynomials can be effected via a cyclic and a negacyclic. One simply constructs respective signals having the polynomial coefficients, and invokes Theorem 9.5.10. An alternative is simply to zero-pad the signals to twice their lengths and perform a single cyclic (or single negacyclic).

But there exist interesting—and often quite efficient—means of multiplying polynomials if one has a general integer multiply algorithm. The method amounts to placing polynomial coefficients strategically within certain large integers, and doing all the arithmetic with one high-precision integer multiply. We give the algorithm for the case that all polynomial coefficients are nonnegative, although this constraint is irrelevant for multiplication in polynomial rings (mod p):

Algorithm 9.6.1 (Fast polynomial multiplication: binary segmentation).

Given two polynomials x(t) = Σ_{j=0}^{D−1} x_j t^j and y(t) = Σ_{k=0}^{E−1} y_k t^k with all coefficients integral and nonnegative, this algorithm returns the polynomial product z(t) = x(t)y(t) in the form of a signal having the coefficients of z.

1. [Initialize]
   Choose b such that 2^b > max{D, E} max{x_j} max{y_k};
2. [Create binary segmentation integers]
   X = x(2^b);
   Y = y(2^b);
   // These X, Y can be constructed by arranging a binary array of sufficiently many 0's, then writing in the bits of each coefficient, justified appropriately.
3. [Multiply]
   u = XY;                            // Integer multiplication.
4. [Reassemble coefficients into signal]
   for(0 ≤ l < D + E − 1) {
      z_l = ⌊u/2^{bl}⌋ mod 2^b;       // Extract next b bits.
   }
   return z(t) = Σ_{l=0}^{D+E−2} z_l t^l;   // Base-2^b digits of u are desired coefficients.
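In Python, where big-integer multiplication is built in, the algorithm is only a few lines (our own transcription; the bit_length choice of b satisfies the strict inequality of the [Initialize] step):

```python
def poly_mul_binary_segmentation(x, y):
    # x, y: coefficient lists (low degree first), all entries nonnegative integers
    D, E = len(x), len(y)
    b = (max(D, E) * max(x) * max(y)).bit_length()   # 2^b > max{D,E} max x_j max y_k
    X = sum(c << (b * j) for j, c in enumerate(x))   # X = x(2^b)
    Y = sum(c << (b * k) for k, c in enumerate(y))   # Y = y(2^b)
    u = X * Y                                        # one big-integer multiply
    mask = (1 << b) - 1
    return [(u >> (b * l)) & mask for l in range(D + E - 1)]   # base-2^b digits of u

# (1 + 2t + 3t^2)(4 + 5t) = 4 + 13t + 22t^2 + 15t^3
assert poly_mul_binary_segmentation([1, 2, 3], [4, 5]) == [4, 13, 22, 15]
```

The choice of b guarantees that no product coefficient can exceed 2^b − 1, so the base-2^b digits of u never interfere with one another.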

The method is a good one in the sense that if a large-integer multiply is at hand, there is not very much extra work required to establish a polynomial multiply. It is not hard to show that the bit-complexity of multiplying two degree-D polynomials in Z_m[X], that is, with all coefficients reduced modulo m, is

O(M(D ln(Dm^2))),

where M(n) is as elsewhere the bit-complexity for multiplying two integers of n bits each.


Incidentally, if polynomial multiplication in rings is done via fast integer convolution (recall that acyclic convolution is sufficient, and so zero-padded cyclic will do), then one may obtain a different expression for the complexity bound. For the Nussbaumer Algorithm 9.5.25 one requires O(M(ln m)D ln D) bit operations, where M is the usual integer-multiplication complexity. It is interesting to compare these various estimates for polynomial multiplication (see Exercise 9.70).

9.6.2 Fast polynomial inversion and remaindering

Let x(t) = Σ_{j=0}^{D−1} x_j t^j be a polynomial. If x_0 ≠ 0, there is a formal inversion

1/x(t) = 1/x_0 − (x_1/x_0^2) t + (x_1^2/x_0^3 − x_2/x_0^2) t^2 + · · ·

that admits of rapid evaluation, by way of a scheme we have already invoked for reciprocation, the celebrated Newton method. We describe the scheme in the case that x_0 = 1, from which case generalizations are easily inferred. In what follows, the notation

z(t) mod t^k

is a polynomial remainder (which we cover later), but in this setting it is simple truncation: The result of the mod operation is a polynomial consisting of the terms of polynomial z(t) through order t^{k−1} inclusive. Let us define, then, a truncated reciprocal,

R[x, N] = x(t)^{−1} mod t^{N+1},

as the series of 1/x(t) through degree t^N, inclusive.

Algorithm 9.6.2 (Fast polynomial inversion). Let x(t) be a polynomial with first coefficient x_0 = 1. This algorithm returns the truncated reciprocal R[x, N] through a desired degree N.

1. [Initialize]
   g(t) = 1;                          // Degree-zero polynomial.
   n = 1;                             // Working degree precision.
2. [Newton loop]
   while(n < N + 1) {
      n = 2n;                         // Double the working degree precision.
      if(n > N + 1) n = N + 1;
      h(t) = x(t) mod t^n;            // Simple truncation.
      h(t) = h(t)g(t) mod t^n;
      g(t) = g(t)(2 − h(t)) mod t^n;  // Newton iteration.
   }
   return g(t);
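A direct Python transcription of Algorithm 9.6.2 (ours; the truncated product pmul_trunc is a naive stand-in for whatever fast polynomial multiply is available):

```python
def pmul_trunc(a, b, n):
    # a(t) b(t) mod t^n: truncated product, coefficients low degree first
    out = [0] * n
    for i, ai in enumerate(a[:n]):
        for j, bj in enumerate(b[:n - i]):
            out[i + j] += ai * bj
    return out

def poly_inverse(x, N):
    # truncated reciprocal R[x, N]; requires constant coefficient x[0] == 1
    g = [1]                              # degree-zero polynomial
    n = 1                                # working degree precision
    while n < N + 1:
        n = min(2 * n, N + 1)            # double the working precision
        h = pmul_trunc(x[:n], g, n)      # h = (x mod t^n) g mod t^n
        h = [2 - h[0]] + [-c for c in h[1:]]
        g = pmul_trunc(g, h, n)          # g = g (2 - h) mod t^n: Newton iteration
    return g

# 1/(1 + t) = 1 - t + t^2 - t^3 + ..., truncated at degree 4
assert poly_inverse([1, 1], 4) == [1, -1, 1, -1, 1]
```

Note how the precision n doubles each pass, the hallmark of Newton iteration: each loop roughly doubles the number of correct series coefficients.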

One point that should be stressed right off is that in principle, an operation f(t)g(t) mod t^n is simple truncation of a product (the operands usually themselves being approximately of degree n). This means that within
