
q = 0;
while(v2(r) ≤ v2(y)) {
    q = q − 2^{v2(r)−v2(x)};  r = r − 2^{v2(r)−v2(y)} y;
}
q = q cmod 2^{v2(y)−v2(x)+1};  r = x + q y/2^{v2(y)−v2(x)};  return (q, r);
}
5. [Half-binary gcd function (recursive)]
hbingcd(k, x, y) { // Matrix returned; G, u, v, k1, k2, k3, q, r are local.
    G = [[1, 0], [0, 1]];                  // 2 × 2 identity matrix.
    if(v2(y) > k) return G;
    k1 = ⌊k/2⌋;
    k3 = 2^{2k1+1};
    u = x mod k3;
    v = y mod k3;
    G = hbingcd(k1, u, v);                 // Recurse.
    (u, v)^T = G (x, y)^T;
    k2 = k − v2(v);  if(k2 < 0) return G;
    (q, r) = divbin(u, v);
    k3 = 2^{v2(v)−v2(u)};
    G = [[0, k3], [k3, q]] G;
    k3 = 2^{2k2+1};
    u = v·2^{−v2(v)} mod k3;
    v = r·2^{−v2(v)} mod k3;
    G = hbingcd(k2, u, v) G;  return G;
}
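For concreteness, here is one possible Python rendering of the binary-divide function divbin above. It is a sketch under our reading of the layout: r is initialized to x on entry, v2(x) < v2(y) is assumed, v2 denotes the 2-adic valuation (with v2(0) = ∞ by convention), and cmod denotes the centered residue in (−m/2, m/2]; the helper names are ours.

# A sketch (ours) of the binary-divide function divbin above, assuming
# v2(x) < v2(y); v2 is the 2-adic valuation, cmod the centered residue.

def v2(n):
    """2-adic valuation: the largest e with 2^e dividing n (n != 0)."""
    return (n & -n).bit_length() - 1

def cmod(x, m):
    """Centered residue of x modulo m, lying in (-m/2, m/2]."""
    r = x % m
    return r - m if r > m // 2 else r

def divbin(x, y):
    """Return (q, r) with r = x + q*y/2^(v2(y)-v2(x)) and v2(r) > v2(y)."""
    q, r = 0, x
    while r != 0 and v2(r) <= v2(y):   # v2(0) = infinity, so r = 0 exits
        q -= 1 << (v2(r) - v2(x))
        r -= y >> (v2(y) - v2(r))      # this term equals 2^(v2(r)-v2(y)) * y
    q = cmod(q, 1 << (v2(y) - v2(x) + 1))
    r = x + q * (y >> (v2(y) - v2(x)))
    return (q, r)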
See the last part of Exercise 9.20 for some implicit advice on what operations in Algorithm 9.4.7 are relevant to good performance. In addition, note that to achieve the remarkably low complexity of either of these recursive gcds, the implementor should make sure to have an efficient large-integer multiply. Whether the multiply occurs in a matrix multiplication, or anywhere else, the use of breakover techniques should be in force. That is, for small operands one uses grammar-school multiply, then for larger operands one may employ a Karatsuba or Toom–Cook approach, but use one of the optimal, FFT-based options for very large operands. In other words, the multiplication complexity M(N) appearing in the complexity formula atop the present section needs to be taken seriously upon implementation. These various fast multiplication algorithms are discussed later in the chapter (Sections 9.5.1 and 9.5.2).

It is natural to ask whether there exist extended forms of such recursive-gcd algorithms, along the lines, say, of Algorithm 2.1.4 or Algorithm 9.4.3, to effect asymptotically fast modular inversion. The answer is yes, as explained in [Stehlé and Zimmermann 2004] and [Cesari 1998].
9.5 Large-integer multiplication
When numbers have, say, hundreds or thousands (even millions) of decimal digits, there are modern methods for multiplication. In practice, one finds that the classical “grammar-school” methods just cannot effect multiplication in certain desired ranges. This is because, of course, the bit complexity of grammar-school multiply of two size-N numbers is O(ln² N). It turns out that by virtue of modern transform and convolution techniques, this complexity can be brought down to
O(ln N (ln ln N )(ln ln ln N )),
as we discuss in more detail later in this section.
The art of large-integer arithmetic has, especially in modern times, sustained many revisions. Just as with the fast Fourier transform (FFT) engineering literature itself, there seems to be no end to the publication of new approaches, new optimizations, and new applications for computational number theory. The forest is sufficiently thick that we have endeavored in this section to render an overview rather than an encyclopedic account of this rich and exotic field. An interesting account of multiplication methods from a theoretical point of view is [Bernstein 1997], and modern implementations are discussed, with historical references, in [Crandall 1994b, 1996a].
9.5.1 Karatsuba and Toom–Cook methods
The classical multiplication methods can be applied on parts of integers to speed up large-integer multiplication, as observed by Karatsuba. His recursive scheme assumes that numbers be represented in split form
x = x0 + x1W,
with x0, x1 ∈ [0, W − 1], which is equivalent to base-W representation, except that here the base will be about half the size of x itself. Note that x is therefore a “size-W²” integer. For two integers x, y of this approximate size, the Karatsuba relation is
xy = (t + u)/2 − v + ((t − u)/2) W + v W²,        (9.17)

where

t = (x0 + x1)(y0 + y1),
u = (x0 − x1)(y0 − y1),
v = x1 y1,
and we obtain xy, which is originally a size-W² multiply, for the price of only three size-W multiplies (and some final carry adjustments, to achieve base-W representation of the final result). This is in principle an advantage, because if grammar-school multiply is invoked throughout, a size-W² multiply should be four, not three, times as expensive as a size-W one. It can be shown that if one applies the Karatsuba relation to t, u, v themselves, and so on recursively, the asymptotic complexity for a size-N multiply is
O((ln N)^{ln 3/ln 2})
bit operations, a theoretical improvement over grammar-school methods. We say “theoretical improvement” because computer implementations will harbor so-called overhead, and the time to arrange memory and recombine subproducts and so on might rule out the Karatsuba method as a viable alternative. Still, it is often the case in practice that the Karatsuba approach does, in fact, outperform the grammar-school approach over a machine- and implementation-dependent range of operands.
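By way of illustration, the following Python fragment (ours) realizes relation (9.17) recursively; the cutoff value is an arbitrary stand-in for a machine-tuned breakover point below which the native (grammar-school) multiply is used.

# A sketch of recursive Karatsuba multiplication via relation (9.17).
CUTOFF = 64    # arbitrary breakover: operand bit size for native multiply

def karatsuba(x, y):
    if min(x.bit_length(), y.bit_length()) <= CUTOFF:
        return x * y                 # breakover to the native multiply
    b = max(x.bit_length(), y.bit_length()) // 2
    W = 1 << b                       # split base, so x is "size-W^2"
    x0, x1 = x % W, x >> b           # x = x0 + x1*W
    y0, y1 = y % W, y >> b           # y = y0 + y1*W
    t = karatsuba(x0 + x1, y0 + y1)
    u = karatsuba(x0 - x1, y0 - y1)
    v = karatsuba(x1, y1)            # three half-size multiplies in all
    return (t + u)//2 - v + ((t - u)//2)*W + v*W*W

Note that the divisions by 2 are exact, since t + u and t − u are always even.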
But a related method, the Toom–Cook method, reaches the theoretical boundary of O(ln^{1+ε} N) bit operations for the multiplicative part of size-N multiplication—that is, ignoring all the additions inherent in the method. However, there are several reasons why the method is not the final word in the art of large-integer multiply. First, for large N the number of additions is considerable. Second, the complexity estimate presupposes that multiplications by constants (such as 1/2, which is a binary shift, and so on) are inexpensive. Certainly multiplications by small constants are so, but the Toom–Cook coefficients grow radically as N increases. Still, the method is of theoretical interest and does have its practical applications, such as fast multiplication on machines whose fundamental word multiply is especially sluggish with respect to addition. The Toom–Cook method hinges on the idea that given two polynomials
x(t) = x0 + x1 t + · · · + x_{D−1} t^{D−1},        (9.18)
y(t) = y0 + y1 t + · · · + y_{D−1} t^{D−1},        (9.19)
the polynomial product z(t) = x(t)y(t) is completely determined by its values at 2D − 1 separate t values, for example by the sequence of evaluations (z(j)), j ∈ [1 − D, D − 1]:
Algorithm 9.5.1 (Symbolic Toom–Cook multiplication). Given D, this algorithm generates the (symbolic) Toom–Cook scheme for multiplication of (D-digit)-by-(D-digit) integers.
1. [Initialize]
Form two symbolic polynomials x(t), y(t) each of degree (D − 1), as in equation (9.18);
2. [Evaluation]
Evaluate symbolically z(j) = x(j)y(j) for each j ∈ [1 − D, D − 1], so that each z(j) is cast in terms of the original coefficients of the x and y polynomials;
3. [Reconstruction]
Solve symbolically for the coefficients z_j in the following linear system of (2D − 1) equations:

z(t) = Σ_{k=0}^{2D−2} z_k t^k,   t ∈ [1 − D, D − 1];

4. [Report scheme]
Report a list of the (2D − 1) relations, each relation casting z_j in terms of the original x, y coefficients;
The output of this algorithm will be a set of formulae that give the coefficients of the polynomial product z(t) = x(t)y(t) in terms of the coefficients of the original polynomials. But this is precisely what is meant by integer multiplication, if each polynomial corresponds to a D-digit representation in a fixed base B.
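Algorithm 9.5.1 is easily driven by a computer-algebra system. The following sketch (ours; it relies on the third-party sympy package, and the symbols T_j, which stand for the atomic evaluation products x(j)y(j), are our own naming) performs the [Evaluation] and [Reconstruction] steps and, for D = 3, reproduces the relations that appear in Algorithm 9.5.2 below:

# A sketch of Algorithm 9.5.1 using the third-party sympy package.
# T_j stands for the atomic product z(j) = x(j)y(j); solving the linear
# system z(j) = T_j yields the reconstruction relations symbolically.
import sympy as sp

def toom_cook_scheme(D):
    t = sp.Symbol('t')
    xs = sp.symbols(f'x0:{D}')
    x = sum(xs[j] * t**j for j in range(D))          # as in (9.18)
    # [Evaluation]: the multiplicands x(j) (and likewise for y(j)).
    evals = [sp.expand(x.subs(t, j)) for j in range(1 - D, D)]
    # [Reconstruction]: solve z(j) = T_j for the coefficients z_k.
    T = sp.symbols(f'T0:{2*D-1}')
    zs = sp.symbols(f'z0:{2*D-1}')
    z = sum(zs[k] * t**k for k in range(2*D - 1))
    recon = sp.solve([sp.Eq(z.subs(t, j), T[j + D - 1])
                      for j in range(1 - D, D)], zs)
    return evals, recon

# [Report scheme] for D = 3 reproduces the relations of Algorithm 9.5.2.
evals, recon = toom_cook_scheme(3)
print(evals)                      # [x0 - 2*x1 + 4*x2, x0 - x1 + x2, x0, ...]
print(recon[sp.Symbol('z1')])     # T0/12 - 2*T1/3 + 2*T3/3 - T4/12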
To underscore the Toom–Cook idea, we note that all of the Toom–Cook multiplies occur in the [Evaluation] step of Algorithm 9.5.1. We give next a specific multiplication algorithm that requires five such multiplies. The previous, symbolic, algorithm was used to generate the actual relations of this next algorithm:
Algorithm 9.5.2 (Explicit D = 3 Toom–Cook integer multiplication).
For integers x, y given in base B as
x = x0 + x1 B + x2 B²,   y = y0 + y1 B + y2 B²,
this algorithm returns the base-B digits of the product z = xy, using the theoretical minimum of 2D − 1 = 5 multiplications for acyclic convolution of length-3 sequences.
1. [Initialize]
r0 = x0 − 2x1 + 4x2;  r1 = x0 − x1 + x2;  r2 = x0;
r3 = x0 + x1 + x2;    r4 = x0 + 2x1 + 4x2;
s0 = y0 − 2y1 + 4y2;  s1 = y0 − y1 + y2;  s2 = y0;
s3 = y0 + y1 + y2;    s4 = y0 + 2y1 + 4y2;
2. [Toom–Cook multiplies]
for(0 ≤ j < 5) tj = rj sj;
3. [Reconstruction]
476 Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
z0 = t2;
z1 = t0/12 − 2t1/3 + 2t3/3 − t4/12;
z2 = −t0/24 + 2t1/3 − 5t2/4 + 2t3/3 − t4/24;
z3 = −t0/12 + t1/6 − t3/6 + t4/12;
z4 = t0/24 − t1/6 + t2/4 − t3/6 + t4/24;
4. [Adjust carry]
carry = 0;
for(0 ≤ n < 5) {
    v = zn + carry;  zn = v mod B;  carry = ⌊v/B⌋;
}
return (z0, z1, z2, z3, z4, carry);
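Algorithm 9.5.2 transcribes directly into code. In the following Python rendition (ours), the fractions of the [Reconstruction] step are folded over common denominators so that the divisions become exact integer divisions:

# Python transcription of Algorithm 9.5.2 (explicit D = 3 Toom-Cook).
# The // divisions below are exact: the reconstruction fractions have
# been combined over the denominators 12 and 24.

def toom3(xdigits, ydigits, B):
    """Multiply x = x0 + x1*B + x2*B^2 by y likewise; both given as
    3-digit lists [d0, d1, d2] in base B. Returns 6 base-B digits."""
    x0, x1, x2 = xdigits
    y0, y1, y2 = ydigits
    # [Initialize]: evaluations at t = -2, -1, 0, 1, 2.
    r = [x0 - 2*x1 + 4*x2, x0 - x1 + x2, x0, x0 + x1 + x2, x0 + 2*x1 + 4*x2]
    s = [y0 - 2*y1 + 4*y2, y0 - y1 + y2, y0, y0 + y1 + y2, y0 + 2*y1 + 4*y2]
    # [Toom-Cook multiplies]: the theoretical minimum of 2D - 1 = 5.
    t0, t1, t2, t3, t4 = (rj * sj for rj, sj in zip(r, s))
    # [Reconstruction]
    z = [t2,
         (t0 - 8*t1 + 8*t3 - t4) // 12,
         (-t0 + 16*t1 - 30*t2 + 16*t3 - t4) // 24,
         (-t0 + 2*t1 - 2*t3 + t4) // 12,
         (t0 - 4*t1 + 6*t2 - 4*t3 + t4) // 24]
    # [Adjust carry]
    carry = 0
    for n in range(5):
        v = z[n] + carry
        z[n] = v % B
        carry = v // B
    return z + [carry]

# Example: 987654 * 123456 in base B = 100.
digits = toom3([54, 76, 98], [56, 34, 12], 100)
assert sum(d * 100**k for k, d in enumerate(digits)) == 987654 * 123456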
Now, as opposed to the Karatsuba method, in which a size-B² multiply is brought down to that of three size-B ones for, let us say, a “gain” of 4/3, Algorithm 9.5.2 does a size-B³ multiply in the form of five size-B ones, for a gain of 9/5. When either algorithm is used in a recursive fashion (for example, the Step [Toom–Cook multiplies] is done by calling the same, or another, Toom–Cook algorithm recursively), the complexity of multiplication of two size-N integers comes down to
O((ln N)^{ln(2D−1)/ln D}),
small multiplies (meaning of a fixed size independent of N), which complexity can, with sufficiently high Toom–Cook degree d = D − 1, be brought down below any given complexity estimate of O(ln^{1+ε} N) small multiplies. However, it is to be noted forcefully that this complexity ignores the addition count, as well as the constant-coefficient multiplies (see Exercises 9.37, 9.78 and Section 9.5.8).
The Toom–Cook method can be recognized as a scheme for acyclic convolution, which, together with other types of convolutions, we address later in this chapter. For more details on Karatsuba and Toom–Cook methods, the reader may consult [Knuth 1981], [Crandall 1996a], [Bernstein 1997].
9.5.2 Fourier transform algorithms
Having discussed multiplication methods that enjoy complexities as low as O(ln^{1+ε} N) small fixed multiplications (but perhaps unfortunate addition counts), we shall focus our attention on a class of multiplication schemes that enjoy low counts of all operation types. These schemes are based on the notion of the discrete Fourier transform (DFT), a topic that we now cover in enough detail to render the subsequent multiply algorithms accessible.
At this juncture we can think of a “signal” simply as a sequence of elements, in order to forge a connection between transform theory and the field of signal processing. Throughout the remainder of this chapter, signals
might be sequences of polynomial coefficients, or sequences in general, and will be denoted by x = (xn), n ∈ [0, D − 1], for some “signal length” D.
The first essential notion is that multiplication is a kind of convolution. We shall make that connection quite precise later, observing for the moment that the DFT is a natural transform to employ in convolution problems. For the DFT has the unique property of converting convolution to a less expensive dyadic product. We start with a definition:
Definition 9.5.3 (The discrete Fourier transform (DFT)). Let x be a signal of length D consisting of elements belonging to some algebraic domain in which D^{−1} exists, and let g be a primitive D-th root of unity in that domain; that is, g^k = 1 if and only if k ≡ 0 (mod D). Then the discrete Fourier transform of x is that signal X = DFT(x) whose elements are
X_k = Σ_{j=0}^{D−1} x_j g^{−jk},        (9.20)

with the inverse DFT^{−1}(X) = x given by

x_j = (1/D) Σ_{k=0}^{D−1} X_k g^{jk}.        (9.21)
That the transform DFT^{−1} is well-defined as the correct inverse is left as an exercise. There are several important manifestations of the DFT:
Complex-field DFT: x, X ∈ C^D, g a primitive D-th root of 1 such as e^{2πi/D};
Finite-field DFT: x, X ∈ F_{p^k}^D, g a primitive D-th root of 1 in the same field;
Integer-ring DFT: x, X ∈ Z_N^D, g a primitive D-th root of 1 in the ring, D^{−1}, g^{−1} exist.
It should be pointed out that the above are common examples, yet there are many more possible scenarios. As just one extra example, one may define a DFT over quadratic fields (see Exercise 9.50).
In the first instance of complex fields, the practical implementations involve floating-point arithmetic to handle complex numbers (though when the signal has only real elements, significant optimizations apply, as we shall see). In the second, finite-field, cases one uses field arithmetic with all terms reduced (mod p). The third instance, the ring-based DFT, is sometimes applied simultaneously for N = 2^n − 1 and N′ = 2^n + 1, in which cases the assignments g = 2 and D = n, D′ = 2n, respectively, can be made when n is coprime to both N, N′.
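As a small worked instance of the integer-ring case (the parameter choices are ours): with n = 5 we have N = 2^5 − 1 = 31, g = 2, D = 5, and D^{−1} = 25 (mod 31); Definition 9.5.3 can then be checked directly in Python:

# A toy integer-ring DFT in Z_31 with g = 2, D = 5 (2^5 = 32 = 1 mod 31).
N, g, D = 31, 2, 5
Dinv = pow(D, -1, N)                 # 25, since 5 * 25 = 125 = 1 (mod 31)

def dft(x):                          # equation (9.20)
    return [sum(xj * pow(g, -j*k, N) for j, xj in enumerate(x)) % N
            for k in range(D)]

def idft(X):                         # equation (9.21)
    return [Dinv * sum(Xk * pow(g, j*k, N) for k, Xk in enumerate(X)) % N
            for j in range(D)]

x = [3, 1, 4, 1, 5]
assert idft(dft(x)) == x             # the round trip recovers the signal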
It should be said that there exists a veritable menagerie of alternative transforms, many of them depending on basis functions other than the complex exponential basis functions of the traditional DFT; and often, such alternatives admit of fast algorithms, or assume real signals, and so on. Though such transforms lie beyond the scope of the present book, we observe
that some of them are also suited for the goal of convolution, so we name a few: The Walsh–Hadamard transform, for which one needs no multiplication, only addition; the discrete cosine transform (DCT), which is a real-signal, real-multiplication analogue to the DFT; various wavelet transforms, which sometimes admit of very fast (O(N) rather than O(N ln N)) algorithms; real-valued FFT, which uses either cos or sin in real-only summands; the real-signal Hartley transform, and so on. Various of these options are discussed in [Crandall 1994b, 1996a].
Just to clear the air, we hereby make explicit the almost trivial difference between the DFT and the celebrated fast Fourier transform (FFT). The FFT is an operation belonging to the general class of divide-and-conquer algorithms, and which calculates the DFT of Definition 9.5.3. The FFT will typically appear in our algorithm layouts in the form X = FFT(x), where it is understood that the DFT is being calculated. Similarly, an operation FFT^{−1}(x) returns the inverse DFT. We make the distinction explicit because “FFT” is in some sense a misnomer: The DFT is a certain sum—an algebraic quantity—yet the FFT is an algorithm. Here is a heuristic analogy to the distinction: In this book, the equivalence classes x (mod N) are theoretical entities, whereas the operation of reducing x modulo p we have chosen to write a little differently, as x mod p. By the same token, within an algorithm the notation X = FFT(x) means that we are performing an FFT operation on the signal x; and this operation gives, of course, the result DFT(x). (Yet another reason to make the almost trivial distinction is that we have known students who incorrectly infer that an FFT is some kind of “approximation” to the DFT, when in fact, the FFT is sometimes more accurate than a literal DFT summation, in the sense of roundoff error, mainly because of reduced operation count for the FFT.)
The basic FFT algorithm notion has been traced all the way back to some observations of Gauss, yet some authors ascribe the birth of the modern theory to the Danielson–Lanczos identity, applicable when the signal length D is even:
DFT(x) = Σ_{j=0}^{D−1} x_j g^{−jk} = Σ_{j=0}^{D/2−1} x_{2j} (g²)^{−jk} + g^{−k} Σ_{j=0}^{D/2−1} x_{2j+1} (g²)^{−jk}.        (9.22)

A beautiful identity indeed: A DFT sum for signal length D is split into two sums, each of length D/2. In this way the Danielson–Lanczos identity ignites a recursive method for calculating the transform. Note the so-called twiddle factors g^{−k}, which figure naturally into the following recursive form of FFT. In this and subsequent algorithm layouts we denote by len(x) the length of a signal x. In addition, when we perform element concatenations of the form (a_j)_{j∈J} we mean the result to be a natural, left-to-right, element concatenation as the increasing index j runs through a given set J. Similarly, U ∪ V is a signal having the elements of V appended to the right of the elements of U.
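A minimal recursive sketch of such an FFT (ours, over the complex field, with len(x) a power of two) reads:

# Recursive FFT driven by the Danielson-Lanczos identity (9.22):
# transform the even- and odd-indexed halves, then recombine with
# the twiddle factors g^(-k), here g = e^(2*pi*i/D).
import cmath

def fft(x):
    D = len(x)                       # assumed a power of 2
    if D == 1:
        return x[:]
    E = fft(x[0::2])                 # length-(D/2) DFT of even terms
    O = fft(x[1::2])                 # length-(D/2) DFT of odd terms
    X = [0] * D
    for k in range(D // 2):
        a = cmath.exp(-2j * cmath.pi * k / D) * O[k]    # g^(-k) * O_k
        X[k] = E[k] + a
        X[k + D // 2] = E[k] - a     # since g^(-(k + D/2)) = -g^(-k)
    return X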
The decimation-in-time FFT is said to be of the Cooley–Tukey form, the phrase meaning that, as in the Danielson–Lanczos splitting identity (9.22), we cut up (decimate) the time domain—the index on the original signal. The Gentleman–Sande FFT falls into the “decimation in frequency” class, for which a similar game is played on the k index of the transform elements Xk.
Algorithm 9.5.5 (FFT, in-place, in-order loop forms with bit-scramble).
Given a (D = 2^d)-element signal x, the functions herein perform an FFT via nested loops. The two essential FFTs are laid out as decimation-in-time (Cooley–Tukey) and decimation-in-frequency (Gentleman–Sande) forms. Note that these forms can be applied symbolically, or in number-theoretical transform mode, by identifying properly the root of unity and the ring or field operations.
1. [Cooley–Tukey, decimation-in-time FFT]
FFT(x) {
    scramble(x);
    n = len(x);
    for(m = 1; m < n; m = 2m) {             // m ascends over 2-powers.
        for(0 ≤ j < m) {
            a = g^{−jn/(2m)};
            for(i = j; i < n; i = i + 2m)
                (x_i, x_{i+m}) = (x_i + a x_{i+m}, x_i − a x_{i+m});
        }
    }
    return x;
}
2. [Gentleman–Sande, decimation-in-frequency FFT]
FFT(x) {
    n = len(x);
    for(m = n/2; m ≥ 1; m = m/2) {          // m descends over 2-powers.
        for(0 ≤ j < m) {
            a = g^{−jn/(2m)};
            for(i = j; i < n; i = i + 2m)
                (x_i, x_{i+m}) = (x_i + x_{i+m}, a(x_i − x_{i+m}));
        }
    }
    scramble(x);
    return x;
}
3. [In-place scramble procedure]
scramble(x) {                               // In-place, reverse-binary element scrambling.
    n = len(x);
    j = 0;
    for(0 ≤ i < n − 1) {
        if(i < j) (x_i, x_j) = (x_j, x_i);  // Swap elements.
        k = ⌊n/2⌋;
        while(k ≤ j) {
            j = j − k;
            k = ⌊k/2⌋;
        }
        j = j + k;
    }
    return;
}
It is to be noted that when one performs a convolution in the manner we shall exhibit later, the scrambling procedures are not needed, provided that one performs required FFTs in a specific order.
The correct order is the Gentleman–Sande form first (with its final scrambling procedure omitted), followed by the Cooley–Tukey form (with its initial scrambling omitted). This works out because, of course, scrambling is an operation of order two.
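To see this ordering in action, here is a small Python check (ours, over the complex field): the Gentleman–Sande loops are run without the final scramble, then inverted by the Cooley–Tukey loops—with conjugated twiddle factors and a final division by the length—without the initial scramble; the original signal returns even though the intermediate spectrum sits in bit-reversed order.

import cmath

def fft_dif(x):                      # Gentleman-Sande, final scramble omitted
    n = len(x)
    m = n // 2
    while m >= 1:
        for j in range(m):
            a = cmath.exp(-2j * cmath.pi * j / (2 * m))   # g^(-j n/(2m))
            for i in range(j, n, 2 * m):
                x[i], x[i + m] = x[i] + x[i + m], a * (x[i] - x[i + m])
        m //= 2
    return x

def ifft_dit(x):                     # Cooley-Tukey, initial scramble omitted;
    n = len(x)                       # conjugated twiddles and 1/n scaling
    m = 1
    while m < n:
        for j in range(m):
            a = cmath.exp(2j * cmath.pi * j / (2 * m))
            for i in range(j, n, 2 * m):
                t = a * x[i + m]
                x[i], x[i + m] = x[i] + t, x[i] - t
        m *= 2
    return [v / n for v in x]

x = [complex(v) for v in (1, 2, 3, 4, 5, 6, 7, 8)]
y = ifft_dit(fft_dif(x[:]))          # spectrum stays bit-scrambled in between
assert all(abs(u - v) < 1e-9 for u, v in zip(x, y))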
Happily, in cases where scrambling is not desired, or when contiguous memory access is important (e.g., on vector computers), there is the Stockham FFT, which avoids bit-scrambling and also has an innermost loop that runs essentially consecutively through data memory. The cost of all this is that one must use an extra copy of the data. The typical implementations of the Stockham FFT are elegant [Van Loan 1992], but there is a particular variant that has proved quite useful on modern vector machinery. This special variant is called the “ping-pong” FFT, because one goes back and forth between the original data and a separate copy. The following algorithm display is based on a suggested design of [Papadopoulos 1999]:
Algorithm 9.5.6 (FFT, “ping-pong” variant, in-order, no bit-scramble).
Given a (D = 2^d)-element signal x, a Stockham FFT is performed, but with the original x and external data copy y used in alternating fashion. We interpret X, Y below as pointers to the (complex) signals x, y, respectively, but operating under the usual rules of pointer arithmetic; e.g., X[0] is the first complex datum of x initially, but if 4 is added to pointer X, then X[0] = x_4, and so on. If exponent d is even, pointer X has the FFT result, else pointer Y has it.
1. [Initialize]
J = 1;
X = x; Y = y;                        // Assign memory pointers.
2. [Outer loop]
for(d ≥ i > 0) {
    m = 0;
    while(m < D/2) {
        a = e^{−2πim/D};
        for(J ≥ j > 0) {
            Y[0] = X[0] + X[D/2];
            Y[J] = a(X[0] − X[D/2]);
            X = X + 1;
            Y = Y + 1;