
Prime Numbers
Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
There is one more important aspect of the DGT convolution: All mod operations are with respect to Mersenne primes, and so an implementation can enjoy the considerable speed advantage we have previously encountered for such special cases of the modulus.
9.5.6 Schönhage method
The pioneering work in [Schönhage and Strassen 1971], [Schönhage 1982], based on Strassen's ideas for using FFTs in large-integer multiplication, focuses on the fact that a certain number-theoretical transform is possible—using exclusively integer arithmetic—in the ring Z_{2^m+1}. This is sometimes called a Fermat number transform (FNT) (see Exercise 9.52) and can be used within a certain negacyclic convolution approach as follows (we are grateful to P. Zimmermann for providing a clear exposition of the method, from which description we adapted our rendition here):
Algorithm 9.5.23. [Fast multiplication (mod 2^n + 1) (Schönhage)] Given two integers 0 ≤ x, y < 2^n + 1, this algorithm returns the product xy mod (2^n + 1).
1. [Initialize]
Choose FFT size D = 2^k dividing n;
Writing n = DM, set a recursion length n′ ≥ 2M + k such that D divides n′, i.e., n′ = DM′;
2. [Decomposition]
Split x and y into D parts of M bits each, and store these parts, considered as residues modulo (2^{n′} + 1), in two respective arrays A_0, . . . , A_{D−1} and B_0, . . . , B_{D−1}, taking care that an array element could in principle have n′ + 1 bits later on;
3. [Prepare DWT by weighting the A, B signals]
for(0 ≤ j < D) {
A_j = (2^{jM′} A_j) mod (2^{n′} + 1);
B_j = (2^{jM′} B_j) mod (2^{n′} + 1);
}
4. [In-place, symbolic FFTs]   // Use 2^{2M′} as D-th root of unity mod (2^{n′} + 1).
A = DFT(A);
B = DFT(B);
5. [Dyadic stage]
for(0 ≤ j < D) A_j = A_j B_j mod (2^{n′} + 1);
6. [Inverse FFT]
A = DFT(A);   // Inverse via index reversal, next loop.
7. [Normalization]
for(0 ≤ j < D) {
C_j = A_{D−j}/2^{k+jM′} mod (2^{n′} + 1);   // A_D defined as A_0. Reverse and twist.
8. [Adjust signs]
if(C_j > (j + 1)2^{2M}) C_j = C_j − (2^{n′} + 1);   // C_j now possibly negative.
}
9. [Composition]
Perform carry operations as in steps [Adjust carry in base B] for B = 2^M (the original decomposition base) and [Final modular adjustment] of Algorithm 9.5.17 to return the desired sum:
xy mod (2^n + 1) = (Σ_{j=0}^{D−1} C_j 2^{jM}) mod (2^n + 1);
Note that in the [Decomposition] step, A_{D−1} or B_{D−1} may equal 2^M and have M + 1 bits in the case where x or y equals 2^n. In Step [Prepare DWT . . .], each multiply can be done using shifts and subtractions only, as 2^{n′} ≡ −1 (mod 2^{n′} + 1). In Step [Dyadic stage], one can use any multiplication algorithm, for example a grammar-school stage, the Karatsuba algorithm, or this very Schönhage algorithm recursively. In Step [Normalization], the divisions by a power of two again can be done using shifts and subtractions only. Thus the only multiplication per se is in Step [Dyadic stage], and this is why the method can attain, in principle, such low complexity. Note also that the two FFTs required for the negacyclic result signal C can be performed in the order DIF, DIT, for example by using parts of Algorithm 9.5.5 in proper order, thus obviating the need for any bit-scrambling procedure.
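To see why the weighting and normalization steps need no true multiplications, one can sketch the shift-and-subtract reduction in a few lines. The following Python function (our own naming and decomposition, not from the algorithm text) computes (a · 2^e) mod (2^n + 1) using only shifts and one subtraction, exploiting 2^n ≡ −1:

```python
def shift_mod_fermat(a, e, n):
    """Compute (a * 2^e) mod (2^n + 1) with shifts and subtractions only,
    using 2^n == -1 (mod 2^n + 1)."""
    F = (1 << n) + 1
    e %= 2 * n                       # 2^(2n) == 1 (mod 2^n + 1)
    if e >= n:                       # fold in one factor 2^n == -1
        return (F - shift_mod_fermat(a, e - n, n)) % F
    x = a << e
    lo = x & ((1 << n) - 1)          # low n bits of the shifted value
    hi = x >> n                      # high part, weighted by 2^n == -1
    return (lo - hi) % F             # x == lo - hi (mod 2^n + 1)
```

Division by 2^e, as required in the [Normalization] step, is then just multiplication by 2^{2n−e}, since 2^{2n} ≡ 1 (mod 2^n + 1).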
As it stands, Algorithm 9.5.23 will multiply two integers modulo any Fermat number, and such application is an important one, as explained in other sections of this book. For general multiplication of two integers x and y, one may call the Schönhage algorithm with n ≥ ⌈lg x⌉ + ⌈lg y⌉, zero-padding x, y accordingly, whence the product xy mod (2^n + 1) equals the integer product. (In a word, the negacyclic convolution of appropriately zero-padded sequences is the acyclic convolution—the product in essence.) In practice, Schönhage suggests using what he calls "suitable numbers," i.e., n = ν2^k with k − 1 ≤ ν ≤ 2k − 1. For example, 688128 = 21 · 2^15 is a suitable number. Such numbers enjoy the property that if k′ = ⌊k/2⌋ + 1, then n′ = ⌈(ν + 1)/2⌉ 2^{k′} is also a suitable number; here we get indeed n′ = 11 · 2^8 = 2816. Of course, one loses a factor of two initially with respect to modular multiplication, but in the recursive calls all computations are performed modulo some 2^{n′} + 1, so the asymptotic complexity is still that reported in Section 9.5.8.
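The zero-padding remark is easy to check numerically: once n is at least the sum of the operands' bit lengths, the product cannot wrap around mod 2^n + 1, so the modular product is the integer product. A quick sanity check (the particular operands are illustrative only):

```python
x, y = 123456789, 987654321
n = x.bit_length() + y.bit_length()   # n >= lg x + lg y
F = (1 << n) + 1
assert x * y < (1 << n)               # the product fits below 2^n ...
assert (x * y) % F == x * y           # ... so mod-F multiplication is exact
```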
9.5.7 Nussbaumer method
It is an important observation that a cyclic convolution of some even length D can be cast in terms of a pair of convolutions, a cyclic and a negacyclic, each of length D/2. The relevant identity is
2(x × y) = [(u+ × v+) + (u− ×− v−)] ∪ [(u+ × v+) − (u− ×− v−)], (9.36)
where ∪ denotes concatenation of half-signals and the u, v signals depend in turn on half-signals:
u± = L(x) ± H(x),
v± = L(y) ± H(y).
504 Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
This recursion formula for cyclic convolution can be proved via polynomial algebra (see Exercise 9.42). The recursion relation together with some astute algebraic observations led [Nussbaumer 1981] to an efficient convolution scheme devoid of floating-point transforms. The algorithm is thus devoid of rounding-error problems, and often, therefore, is the method of choice for rigorous machine proofs involving large-integer arithmetic.
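The identity (9.36) can also be verified by brute force. In the sketch below (our own naming), cyc and neg are direct O(D^2) cyclic and negacyclic convolutions, and the half-signal recursion reproduces the full cyclic convolution, after the required cancellation of the factor 2:

```python
def cyc(x, y):
    """Length-D cyclic convolution, directly from the definition."""
    D = len(x)
    return [sum(x[i] * y[(k - i) % D] for i in range(D)) for k in range(D)]

def neg(x, y):
    """Length-D negacyclic convolution: indices wrap with a sign flip."""
    D = len(x)
    z = [0] * D
    for i in range(D):
        for j in range(D):
            if i + j < D:
                z[i + j] += x[i] * y[j]
            else:
                z[i + j - D] -= x[i] * y[j]
    return z

def cyc_via_halves(x, y):
    """Cyclic convolution via (9.36): one cyclic plus one negacyclic of half length."""
    h = len(x) // 2
    up = [x[i] + x[i + h] for i in range(h)]   # u+ = L(x) + H(x)
    um = [x[i] - x[i + h] for i in range(h)]   # u- = L(x) - H(x)
    vp = [y[i] + y[i + h] for i in range(h)]
    vm = [y[i] - y[i + h] for i in range(h)]
    c, n = cyc(up, vp), neg(um, vm)
    doubled = [c[i] + n[i] for i in range(h)] + [c[i] - n[i] for i in range(h)]
    return [t // 2 for t in doubled]           # the ring must allow cancelling 2
```

For example, cyc_via_halves([1, 2, 3, 4], [5, 6, 7, 8]) agrees with cyc on the same inputs.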
Looking longingly at the previous recursion, it is clear that if only we had a fast negacyclic algorithm, then a cyclic convolution could be done directly, much like that which an FFT performs via decimation of signal lengths. To this end, let R denote a ring in which 2 is cancelable; i.e., x = y whenever 2x = 2y. (It is intriguing that this is all that is required to "ignite" Nussbaumer convolution.) Assume a length D = 2^k for negacyclic convolution, and that D factors as D = mr, with m | r. Now, negacyclic convolution is equivalent to polynomial multiplication (mod t^D + 1) (see Exercises), and as an operation can in a certain sense be "factored" as specified in the following:
Theorem 9.5.24 (Nussbaumer). Let D = 2^k = mr, with m | r. Then negacyclic convolution of length-D signals whose elements belong to a ring R is equivalent, in the sense that polynomial coefficients correspond to signal elements, to multiplication in the polynomial ring
S = R[t]/(t^D + 1).
Furthermore, this ring is isomorphic to
T[t]/(z − t^m),
where T is the polynomial ring
T = R[z]/(z^r + 1).
Finally, z^{r/m} is an m-th root of −1 in T.
Nussbaumer’s idea is to use the root of −1 in a manner reminiscent of our DWT, to perform a negacyclic convolution.
Let us exhibit explicit polynomial manipulations to clarify the import of Theorem 9.5.24. Let
x(t) = x_0 + x_1 t + · · · + x_{D−1} t^{D−1},
and similarly for signal y, with the x_j, y_j in R. Note that x ×− y is equivalent to multiplication x(t)y(t) in the ring S. Now decompose
x(t) = Σ_{j=0}^{m−1} X_j(t^m) t^j,
and similarly for y(t), and interpret each of the polynomials X_j, Y_j as an element of ring T; thus
X_j(z) = x_j + x_{j+m} z + · · · + x_{j+m(r−1)} z^{r−1},
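The decomposition into the X_j is just a reading of the length-D coefficient signal by strides of m. A small numerical check, with illustrative values m = 2, r = 4 of our own choosing, confirms x(t) = Σ_j X_j(t^m) t^j at an integer point:

```python
m, r = 2, 4                       # D = mr = 8, with m dividing r
D = m * r
x = list(range(1, D + 1))         # coefficients x_0, ..., x_{D-1}
X = [x[j::m] for j in range(m)]   # X_j holds x_j, x_{j+m}, ..., x_{j+m(r-1)}

t = 3                             # evaluate both sides at t = 3
lhs = sum(c * t**i for i, c in enumerate(x))
rhs = sum(sum(c * (t**m)**i for i, c in enumerate(X[j])) * t**j for j in range(m))
assert lhs == rhs
```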
4. [Transposition step]
Create a total of 2m arrays X_j, Y_j each of length r;
Zero-pad the X, Y collections so each collection has 2m polynomials;
Using root g = z^{r/m}, perform (symbolically) two length-2m DFTs to get the transforms X̂, Ŷ;
5. [Recursive dyadic operation]
for(0 ≤ h < 2m) Ẑ_h = neg(X̂_h, Ŷ_h);
6. [Inverse transform]
Using root g = z^{−r/m}, perform (symbolically) a length-(2m) inverse DWT on Ẑ to get Z;
7. [Untranspose and adjust]
Working in the ring S (i.e., reduce polynomials according to t^D = −1), find the coefficients z_n of t^n, n ∈ [0, D − 1], from equation (9.37);
return (z_n);   // Return the negacyclic of x, y.
}
Detailed implementation of Nussbaumer's remarkable algorithm can be found in [Crandall 1996a], where enhancements are discussed. One such enhancement is to obviate the zero-padding of the X, Y collections (see Exercise 9.66). Another is to recognize that the very formation of the X_j, Y_j amounts to a transposition of a two-dimensional array, and memory can be reduced significantly by effecting such transposition "in place." [Knuth 1981] has algorithms for in-place transposition. Also of interest is the algorithm [Fraser 1976] mentioned in connection with Algorithm 9.5.7.
9.5.8 Complexity of multiplication algorithms
In order to summarize the complexities of the aforementioned fast multiplication methods, let us clarify the nomenclature. In general, we speak of operands (say x, y) to be multiplied, of size N = 2^n, or n binary bits, or D digits, all equivalently in what follows. Thus for example, if the digits are in base B = 2^b, we have
Db ≈ n,
signifying that the n bits of an operand are split into D signal elements. This symbolism is useful because we need to distinguish between bit- and operation-complexity bounds.
Recall that the complexities of grammar-school, Karatsuba, and Toom–Cook multiplication schemes all have the form O(n^α) = O(ln^α N) bit operations for all the involved multiplications. (We state things this way because in the Toom–Cook case one must take care to count bit operations due to the possibly significant addition count.) So for example, α = 2 for grammar-school methods; Karatsuba and Toom–Cook methods lower this α somewhat, and so on.
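For concreteness, a compact Python rendition of the Karatsuba scheme (the splitting threshold and variable names are ours): each size-n multiply costs three multiplies of size about n/2, giving α = lg 3 ≈ 1.585:

```python
def karatsuba(x, y):
    """Multiply nonnegative integers via three half-size products per level."""
    if x < (1 << 64) or y < (1 << 64):
        return x * y                         # base case: small enough to multiply directly
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    xh, xl = x >> h, x & mask                # x = xh*2^h + xl
    yh, yl = y >> h, y & mask                # y = yh*2^h + yl
    a = karatsuba(xh, yh)                    # high product
    c = karatsuba(xl, yl)                    # low product
    b = karatsuba(xh + xl, yh + yl) - a - c  # middle terms from one extra multiply
    return (a << (2 * h)) + (b << h) + c
```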
Then we have the basic Schönhage–Strassen FFT multiplication Algorithm 9.5.12. Suddenly, the natural description has a different flavor, for we know that the complexity must be
O(D ln D)
operations, and as we have said, these are usually, in practice, floating-point operations (both adds and multiplies are bounded in this fashion). Now the bit complexity is not O((n/b) ln(n/b))—that is, we cannot just substitute D = n/b in the operation-complexity estimate—because floating-point arithmetic on larger digits must, of course, be more expensive. When these notions are properly analyzed we obtain the Strassen bound of
O(n(C ln n)(C ln ln n)(C ln ln ln n) · · ·)
bit operations for the basic FFT multiply, where C is a constant and the ln ln · · · chain is understood to terminate when it falls below 1. Before we move ahead with other estimates, we must point out that even though this bit complexity is not asymptotically optimal, some of the greatest achievements in the general domain of large-integer arithmetic have been attained with this basic Schönhage–Strassen FFT, and yes, using floating-point operations.
Now, the Schönhage Algorithm 9.5.23 gets neatly around the problem that for a fixed number of signal digits D, the digit operations (small multiplications) must get more complex for larger operands. Analysis of the recursion within the algorithm starts with the observation that at top recursion level, there are two DFTs (but very simple ones—only shifting and adding occur) and the dyadic multiply. Detailed analysis yields the best-known complexity bound of
O(n(ln n)(ln ln n))
bit operations, although the Nussbaumer method’s complexity, which we discuss next, is asymptotically equivalent.
Next, one can see (as in Exercise 9.67) that the complexity of Nussbaumer convolution is
O(D ln D)
operations in the ring R. This is equivalent to the complexity of floating-point FFT methods, if ring operations are thought of as equivalent to floating-point operations. However, with the Nussbaumer method there is a difference: One may choose the digit base B with impunity. Consider a base B ≈ n, so that b ≈ ln n, in which case one is effectively using D = n/ln n digits. It turns out that the Nussbaumer method for integer multiplication then takes O(n ln ln n) additions and O(n) multiplications of numbers each having O(ln n) bits. It follows that the complexity of the Nussbaumer method is asymptotically that of the Schönhage method, i.e., O(n ln n ln ln n) bit operations. Such complexity issues for both Nussbaumer and the original Schönhage–Strassen algorithm are discussed in [Bernstein 1997].
can itself be performed with a fast divide-and-conquer algorithm of the type discussed in Chapter 8.8 (for example, Exercise 9.74). As an example of the operation of Algorithm 9.5.26, let us take r = 8 = 2^3 and eight moduli: (m_1, . . . , m_8) = (3, 5, 7, 11, 13, 17, 19, 23). Then we use these moduli along with the product M = Π_{i=1}^{8} m_i = 111546435, to obtain at the [Precomputation] phase M_1, . . . , M_8, which are, respectively,
37182145, 22309287, 15935205, 10140585, 8580495, 6561555, 5870865, 4849845,
(v_1, . . . , v_8) = (1, 3, 6, 3, 1, 11, 9, 17),
and the tableau
(q_{00}, . . . , q_{70}) = (3, 5, 7, 11, 13, 17, 19, 23),
(q_{01}, . . . , q_{61}) = (15, 35, 77, 143, 221, 323, 437),
(q_{02}, . . . , q_{42}) = (1155, 5005, 17017, 46189, 96577),
q_{03} = 111546435,
where we recall that for fixed j there exist q_{ij} for i ∈ [0, r − 2^j]. It is important to note that all of the computation up through the establishment of the q tableau can be done just once—as long as the CRT moduli m_i are not going to change in future runs of the algorithm. Now, when specific residues n_i of some mystery n are to be processed, let us say
(n_1, . . . , n_8) = (1, 1, 1, 1, 3, 3, 3, 3),
we have after the [Reconstruction loop] step the value
n_{03} = 878271241,
which when reduced mod q_{03} is the correct result n = 97446196. Indeed, a quick check shows that
97446196 mod (3, 5, 7, 11, 13, 17, 19, 23) = (1, 1, 1, 1, 3, 3, 3, 3).
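The example can be replayed with a direct CRT summation (this is the plain one-level formula, not the divide-and-conquer tableau of Algorithm 9.5.26): with v_i = M_i^{-1} (mod m_i), the sum Σ n_i v_i M_i gives exactly the pre-reduction value above:

```python
m = [3, 5, 7, 11, 13, 17, 19, 23]
n_res = [1, 1, 1, 1, 3, 3, 3, 3]               # the residues n_i
M = 1
for mi in m:
    M *= mi                                    # M = 111546435
Mi = [M // mi for mi in m]                     # the M_i of the [Precomputation] phase
v = [pow(Mi[i], -1, m[i]) for i in range(8)]   # v_i = M_i^{-1} mod m_i (Python >= 3.8)

assert v == [1, 3, 6, 3, 1, 11, 9, 17]
total = sum(n_res[i] * v[i] * Mi[i] for i in range(8))
assert total == 878271241                      # value after the reconstruction loop
assert total % M == 97446196                   # the recovered n
```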
The computational complexity of Algorithm 9.5.26 is known in the following form [Aho et al. 1974, pp. 294–298], assuming that fast multiplication is used. If each of the r moduli m_i has b bits, then the complexity is
O(br ln r ln(br) ln ln(br))
bit operations, on the assumption that all of the precomputation for the algorithm is in hand.
9.6 Polynomial arithmetic
It is an important observation that polynomial multiplication/division is not quite the same as large-integer multiplication/division. However, ideas discussed in the previous sections can be applied, in a somewhat different manner, in the domain of arithmetic of univariate polynomials.
9.6.1 Polynomial multiplication
We have seen that polynomial multiplication is equivalent to acyclic convolution. Therefore, the product of two polynomials can be effected via a cyclic and a negacyclic. One simply constructs respective signals having the polynomial coefficients, and invokes Theorem 9.5.10. An alternative is simply to zero-pad the signals to twice their lengths and perform a single cyclic (or single negacyclic).
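The zero-padding alternative can be illustrated with direct O(D^2) loops (for clarity only, not speed): padding both signals so the total length leaves no room for wraparound makes the cyclic convolution agree with the acyclic one:

```python
def cyclic(x, y):
    """Direct cyclic convolution of equal-length signals."""
    D = len(x)
    return [sum(x[i] * y[(k - i) % D] for i in range(D)) for k in range(D)]

def acyclic_via_cyclic(x, y):
    """Acyclic (polynomial) convolution from one zero-padded cyclic."""
    N = len(x) + len(y)              # pad enough that no index wraps around
    xp = x + [0] * (N - len(x))
    yp = y + [0] * (N - len(y))
    return cyclic(xp, yp)[:len(x) + len(y) - 1]
```

For example, acyclic_via_cyclic([1, 2, 3], [4, 5]) yields [4, 13, 22, 15], the coefficients of (1 + 2t + 3t^2)(4 + 5t).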
But there exist interesting—and often quite efficient—means of multiplying polynomials if one has a general integer multiply algorithm. The method amounts to placing polynomial coefficients strategically within certain large integers, and doing all the arithmetic with one high-precision integer multiply. We give the algorithm for the case that all polynomial coefficients are nonnegative, although this constraint is irrelevant for multiplication in polynomial rings (mod p):
Algorithm 9.6.1 (Fast polynomial multiplication: binary segmentation). Given two polynomials x(t) = Σ_{j=0}^{D−1} x_j t^j and y(t) = Σ_{k=0}^{E−1} y_k t^k with all coefficients integral and nonnegative, this algorithm returns the polynomial product z(t) = x(t)y(t) in the form of a signal having the coefficients of z.
1. [Initialize]
Choose b such that 2^b > max{D, E} max{x_j} max{y_k};
2. [Create binary segmentation integers]
X = x(2^b);
Y = y(2^b);
// These X, Y can be constructed by arranging a binary array of sufficiently many 0's, then writing in the bits of each coefficient, justified appropriately.
3. [Multiply]
u = XY;   // Integer multiplication.
4. [Reassemble coefficients into signal]
for(0 ≤ l < D + E − 1) {
z_l = ⌊u/2^{bl}⌋ mod 2^b;   // Extract next b bits.
}
return z = Σ_{l=0}^{D+E−2} z_l t^l;   // Base-2^b digits of u are desired coefficients.
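A direct Python transcription of Algorithm 9.6.1 (assuming, as the algorithm does, integral nonnegative coefficients, with at least one positive coefficient in each input so the choice of b is well defined):

```python
def poly_mul_binary_segmentation(x, y):
    """Multiply coefficient signals x, y via one big-integer product."""
    D, E = len(x), len(y)
    bound = max(D, E) * max(x) * max(y)              # 2^b must exceed this
    b = bound.bit_length()                           # guarantees 2^b > bound
    X = sum(c << (b * j) for j, c in enumerate(x))   # X = x(2^b)
    Y = sum(c << (b * k) for k, c in enumerate(y))   # Y = y(2^b)
    u = X * Y                                        # the single integer multiplication
    mask = (1 << b) - 1
    return [(u >> (b * l)) & mask for l in range(D + E - 1)]   # base-2^b digits of u
```

For example, poly_mul_binary_segmentation([1, 2, 3], [4, 5]) returns [4, 13, 22, 15], the coefficients of (1 + 2t + 3t^2)(4 + 5t); the choice of b ensures no carry ever crosses between adjacent b-bit fields of u.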
The method is a good one in the sense that if a large-integer multiply is at hand, there is not very much extra work required to establish a polynomial multiply. It is not hard to show that the bit-complexity of multiplying two degree-D polynomials in Z_m[X], that is, all coefficients are reduced modulo m, is
O(M(D ln(Dm^2))),
where M(n) is as elsewhere the bit-complexity for multiplying two integers of n bits each.