
Prime Numbers
512 Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
multiplication loops, one need not handle terms of degree higher than indicated. In convolution-theory language, we are therefore doing “half-cyclic” convolutions, so when transform methods are used, there is also gain to be realized because of the truncation.
As is typical of Newton methods, the dynamical precision degree n essentially doubles on each pass of the Newton loop. Let us give an example of the workings of the algorithm. Take
x(t) = 1 + t + t^2 + 4t^3
and call the algorithm to output R[x, 8]. Then the values of g(t) at the end of each pass of the Newton loop come out as
1 − t,
1 − t − 3t^3,
1 − t − 3t^3 + 7t^4 − 4t^5 + 9t^6 − 33t^7,
1 − t − 3t^3 + 7t^4 − 4t^5 + 9t^6 − 33t^7 + 40t^8,
and indeed, this last output of g(t) multiplied by the original x(t) is 1 + 43t^9 − 92t^10 + 160t^11, showing that the last output g(t) is correct through O(t^8).
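The Newton loop above is easy to experiment with. The following Python sketch (the function names are ours, not from the text) reproduces the g(t) iterates for x(t) = 1 + t + t^2 + 4t^3, using plain schoolbook truncated multiplication rather than transform methods:

```python
def poly_mul_trunc(a, b, n):
    # Multiply polynomials (coefficient lists, low degree first),
    # discarding all terms of degree >= n ("half-cyclic" truncation).
    out = [0] * min(n, len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j >= n:
                break
            out[i + j] += ai * bj
    return out

def reciprocal(x, N):
    # Newton iteration g <- g*(2 - x*g), doubling the working precision
    # each pass.  Returns g with x(t)*g(t) = 1 + O(t^(N+1)); assumes x[0] == 1.
    g = [1]
    prec = 1
    while prec < N + 1:
        prec = min(2 * prec, N + 1)
        xg = poly_mul_trunc(x, g, prec)
        t = [-c for c in xg]
        t[0] += 2                      # form 2 - x*g
        g = poly_mul_trunc(g, t, prec)
    return g
```

Calling `reciprocal([1, 1, 1, 4], 8)` runs four passes of the loop, and the successive values of `g` match the four iterates displayed above.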
Polynomial remaindering (polynomial mod operation) can be performed in much the same way as some of our mod algorithms for integers used a “reciprocal.” However, it is not always possible to divide one polynomial by another and get a unique and legitimate remainder: This can depend on the ring of coefficients for the polynomials. On the other hand, if the divisor polynomial has its high coefficient invertible in the ring, then there is no problem with divide and remainder; see the discussion in Section 2.2.1. For simplicity, we shall restrict to the case that the divisor polynomial is monic, that is, the high coefficient is 1, since generalizing is straightforward. Assume that x(t), y(t) are polynomials and that y(t) is monic. Then there are unique polynomials q(t), r(t) such that
x(t) = q(t)y(t) + r(t), and r = 0 or deg(r) < deg(y). We shall write
r(t) = x(t) mod y(t),
and view q(t) as the quotient and r(t) as the remainder. Incidentally, for some polynomial operations one demands that coefficients lie in a field, for example in the evaluation of polynomial gcd’s, but many polynomial operations do not require field coefficients. Before exhibiting a fast polynomial remaindering algorithm, we establish some nomenclature:
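For reference, the classical remaindering just described is straightforward when the divisor is monic; the fast scheme of this section replaces the quadratic-time elimination loop below with reciprocal-based multiplication. A minimal sketch (our own helper; coefficient lists are low degree first):

```python
def poly_mod(x, y):
    # Remainder of x(t) by a monic y(t); coefficients listed low degree first.
    r = list(x)
    dy = len(y) - 1                       # degree of the divisor
    for k in range(len(r) - 1, dy - 1, -1):
        c = r[k]                          # leading coefficient to eliminate
        if c:
            for i in range(dy + 1):       # subtract c * t^(k-dy) * y(t)
                r[k - dy + i] -= c * y[i]
    return r[:dy]
```

For example, `poly_mod([5, 2, 0, 1], [-1, 0, 1])` reduces t^3 + 2t + 5 modulo t^2 − 1, giving the remainder 3t + 5.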
Definition 9.6.3 (Polynomial operations). Let x(t) = Σ_{j=0}^{D−1} x_j t^j be a polynomial. We define the reversal of x by degree d as the polynomial

rev(x, d) = Σ_{j=0}^{d} x_{d−j} t^j.
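In code, the reversal is just an index flip with zero-fill; a one-line Python rendering (coefficient lists, low degree first; the name `rev` mirrors the definition):

```python
def rev(x, d):
    # rev(x, d): coefficient j of the result is x_(d-j), taken as 0
    # whenever d-j falls outside the coefficient list.
    return [x[d - j] if 0 <= d - j < len(x) else 0 for j in range(d + 1)]
```

Thus `rev([1, 2, 3], 2)` simply reverses the coefficients, while a larger d, as in `rev([1, 2, 3], 4)`, shifts in zeros at the low end.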
Whatever method is used for polynomial gcd, the fast polynomial remaindering scheme of this section can be applied as desired for the internal polynomial mod operations.
9.6.3 Polynomial evaluation
We next discuss polynomial evaluation techniques. The essential problem is
to evaluate a polynomial x(t) = Σ_{j=0}^{D−1} x_j t^j at, say, each of n field values t_0, . . . , t_{n−1}. It turns out that the entire sequence (x(t_0), x(t_1), . . . , x(t_{n−1})) can be evaluated in

O(n ln^2 min{n, D})
field operations. We shall split the problem into three basic cases:
(1) The arguments t_0, . . . , t_{n−1} lie in arithmetic progression.
(2) The arguments t_0, . . . , t_{n−1} lie in geometric progression.
(3) The arguments t_0, . . . , t_{n−1} are arbitrary.
Of course, case (3) covers the other two, but in (1), (2) it can happen that special enhancements apply.
Algorithm 9.6.5 (Evaluation of polynomial on arithmetic progression). Let x(t) = Σ_{j=0}^{D−1} x_j t^j. This algorithm returns the n evaluations x(a), x(a + d), x(a + 2d), . . . , x(a + (n − 1)d). (The method attains its best efficiency when n is much greater than D.)
1. [Evaluate at first D points]
   for(0 ≤ j < D) e_j = x(a + jd);
2. [Create difference tableau]
   for(1 ≤ q < D) {
      for(D − 1 ≥ k ≥ q) e_k = e_k − e_{k−1};
   }
3. [Operate over tableau]
   E_0 = e_0;
   for(1 ≤ q < n) {
      E_q = E_{q−1} + e_1;
      for(1 ≤ k < D − 1) e_k = e_k + e_{k+1};
   }
   return (E_q), q ∈ [0, n − 1];
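A direct Python transcription of Algorithm 9.6.5 (with a simple Horner evaluation standing in for Step 1, and a guard for the degenerate constant case, both our own additions):

```python
def eval_arith_progression(coeffs, a, d, n):
    # Algorithm 9.6.5: evaluate x(t) at a, a+d, ..., a+(n-1)d using a
    # difference tableau; after setup, each new point costs D-1 additions.
    D = len(coeffs)
    if D == 1:
        return [coeffs[0]] * n
    horner = lambda t: sum(c * t ** j for j, c in enumerate(coeffs))
    e = [horner(a + j * d) for j in range(D)]        # [Evaluate at first D points]
    for q in range(1, D):                            # [Create difference tableau]
        for k in range(D - 1, q - 1, -1):
            e[k] -= e[k - 1]
    E = [e[0]]                                       # [Operate over tableau]
    for q in range(1, n):
        E.append(E[-1] + e[1])
        for k in range(1, D - 1):
            e[k] += e[k + 1]
    return E
```

For instance, `eval_arith_progression([0, 0, 1], 0, 1, 5)` produces the first five squares 0, 1, 4, 9, 16 with only additions after the tableau is built.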
A variant of this algorithm has been used in searches for Wilson primes (see Exercise 9.73, where computational complexity issues are also discussed).
Next, assume that evaluation points lie in geometric progression, say t_k = T^k for some constant T, so we need to evaluate every sum Σ_j x_j T^{kj} for k ∈ [0, D − 1]. There is a so-called Bluestein trick, by which one transforms such sums according to
Σ_j x_j T^{kj} = T^{−k^2/2} Σ_j (x_j T^{−j^2/2}) T^{(k+j)^2/2},
and thus calculates the left-hand sum via the convolution implicit in the right-hand sum. However, in certain settings it is somewhat more convenient to avoid halving the squares in the exponents, relying instead on properties of the triangular numbers ∆_n = n(n + 1)/2. Two relevant algebraic properties of these numbers are
∆_{α+β} = ∆_α + ∆_β + αβ,    ∆_α = ∆_{−α−1}.
A variant of the Bluestein trick can accordingly be derived as

Σ_j x_j T^{kj} = T^{∆_{−k}} Σ_j (x_j T^{∆_j}) T^{−∆_{−(k−j)}}.
Now the implicit convolution can be performed using only integral powers of the constant T. Moreover, we can employ an efficient, cyclic convolution by carefully embedding the x signal in a longer, zero-padded signal and reindexing, as in the following algorithm.
Algorithm 9.6.6 (Evaluation of polynomial on geometric progression). Let x(t) = Σ_{j=0}^{D−1} x_j t^j, and let T have an inverse in the arithmetic domain. This algorithm returns the sequence of values (x(T^k)), k ∈ [0, D − 1].

1. [Initialize]
   Choose N = 2^n such that N ≥ 2D;
   for(0 ≤ j < D) x_j = x_j T^{∆_j};        // Weight the signal x.
   Zero-pad x = (x_j) to have length N;
   y = (T^{−∆_{N/2−j−1}}), j ∈ [0, N − 1];  // Create symmetrical signal y.
2. [Length-N cyclic convolution]
   z = x × y;
3. [Final assembly of evaluation results]
   return (x(T^k)) = (T^{∆_{k−1}} z_{N/2+k−1}), k ∈ [0, D − 1];
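Here is a small Python rendering of Algorithm 9.6.6, using exact `Fraction` arithmetic and a naive O(N^2) cyclic convolution in place of an FFT (the helper names are ours):

```python
from fractions import Fraction

def eval_geometric(coeffs, T):
    # Algorithm 9.6.6: return [x(T^0), x(T^1), ..., x(T^(D-1))] for
    # x(t) = sum_j coeffs[j] t^j, via one length-N cyclic convolution.
    D = len(coeffs)
    T = Fraction(T)
    tri = lambda m: m * (m + 1) // 2           # triangular numbers; exact for m < 0 too
    N = 1
    while N < 2 * D:                           # [Initialize] N = 2^n with N >= 2D
        N *= 2
    x = [coeffs[j] * T ** tri(j) for j in range(D)] + [0] * (N - D)  # weight, zero-pad
    y = [T ** (-tri(N // 2 - j - 1)) for j in range(N)]              # symmetrical signal
    # [Length-N cyclic convolution] z_k = sum_j x_j * y_((k-j) mod N)
    z = [sum(x[j] * y[(k - j) % N] for j in range(N)) for k in range(N)]
    # [Final assembly of evaluation results]
    return [T ** tri(k - 1) * z[N // 2 + k - 1] for k in range(D)]
```

For x(t) = 1 + 2t + 3t^2 and T = 2 this returns the exact values x(1), x(2), x(4) = 6, 17, 57; in a production setting the convolution would of course be done by a transform method.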
We see that a single convolution serves to evaluate all of the values x(T^k) at once. It is clear that the complexity of the entire evaluation is O(D ln D) field operations. One important observation is that an actual DFT is just such an evaluation over a geometric progression; namely, the DFT of (x_j) is the sequence (x(g^{−k})), where g is the appropriate root of unity for the transform. So Algorithm 9.6.6 is telling us that evaluations over geometric progressions are, except perhaps for the minor penalty of zero-padding and so on, essentially of FFT complexity given only that g is invertible. It is likewise clear that any FFT can be embedded in a convolution of power-of-two length,
Note that in the calculations of w(t), z(t) the intent is that the product must be expanded, to render w, z as signals of coefficients. The operations to expand these products must be taken into account in any proper complexity estimate for this evaluation algorithm (see Exercise 9.75). Along such lines, note that an especially efficient way to implement Algorithm 9.6.7 is to preconstruct a polynomial remainder tree; that is, to exploit the fact that the polynomials in Step [Assemble half-polynomials] have been calculated from their own respective halves, and so on.
To lend support to the reader who desires to try this general evaluation Algorithm 9.6.7, let us give an example of its workings. Consider the task of calculating the number 64! not by the usual, sequential multiplication of successive integers but by evaluating the polynomial
x(t) = t(1 + t)(2 + t)(3 + t)(4 + t)(5 + t)(6 + t)(7 + t)
     = 5040t + 13068t^2 + 13132t^3 + 6769t^4 + 1960t^5 + 322t^6 + 28t^7 + t^8
at the 8 points
T = (1, 9, 17, 25, 33, 41, 49, 57)
and then taking the product of the eight evaluations to get the factorial. Since the algorithm is fully recursive, tracing is nontrivial. However, if we assign b = 2, say, in Step [Set breakover] and print out the half-polynomials w, z and polynomial-mod results a, b right after these entities are established, then our output should look as follows. On the first pass of eval we obtain
w(t) = 3825 − 4628t + 854t^2 − 52t^3,
z(t) = 3778929 − 350100t + 11990t^2 − 180t^3 + t^4,
a(t) = x(t) mod w(t)
     = −14821569000 + 17447650500t − 2735641440t^2 + 109600260t^3,
b(t) = x(t) mod z(t)
     = −791762564494440 + 63916714435140t − 1735304951520t^2 + 16010208900t^3,
and for each of a, b there will be further recursive passes of eval. If we keep tracing in this way, the subsequent passes reveal
w(t) = 9 − 10t + t^2,  z(t) = 425 − 42t + t^2,
a(t) = −64819440 + 64859760t,
b(t) = −808538598000 + 49305458160t,
and, continuing in recursive order,
w(t) = 1353 − 74t + t^2,  z(t) = 2793 − 106t + t^2,
a(t) = −46869100573680 + 1514239317360t,
b(t) = −685006261415280 + 15148583316720t.
There are no more recursive levels (for our example choice b = 2) because the eval function will break over to some classical method such as an easy instance of Horner’s rule and evaluate these last a(t), b(t) values directly, each one at four t = t_i values. The final returned entity from eval turns out to be the sequence
(x(t0), . . . , x(t7)) = (40320, 518918400, 29654190720, 424097856000,
3100796899200, 15214711438080, 57274321104000, 178462987637760).
Indeed, the product of these eight values is exactly 64!, as expected. One should note that in such a “product” operation—where evaluations are eventually all multiplied together—the last phase of the eval function need not return a union of two signals, but may instead return the product eval(a, u) · eval(b, v). If that is the designer’s choice, then the step [Check breakover threshold . . .] must also return the product of the indicated x(t_i).
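The example above can be checked end to end with a few lines of Python (a direct Horner evaluation stands in for the recursive eval; the helper names are ours):

```python
from math import factorial, prod

# Coefficients of x(t) = t(1+t)(2+t)...(7+t), listed low degree first.
coeffs = [0, 5040, 13068, 13132, 6769, 1960, 322, 28, 1]

def horner(c, t):
    # Evaluate the polynomial with coefficient list c (low degree first) at t.
    acc = 0
    for a in reversed(c):
        acc = acc * t + a
    return acc

points = [1, 9, 17, 25, 33, 41, 49, 57]
vals = [horner(coeffs, t) for t in points]
assert vals[0] == 40320             # x(1) = 8!
assert prod(vals) == factorial(64)  # the eight evaluations multiply to 64!
```

Each evaluation x(8m + 1) is the product (8m + 1)(8m + 2) · · · (8m + 8), so the eight values telescope to 64! exactly as the text describes.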
Incidentally, polynomial coefficients do not necessarily grow large as the above example seems to suggest. For one thing, when working on, say, a factoring problem, one will typically be reducing all coefficients modulo some N at every level. And there is a clean way to handle the problem of evaluating x(t) of degree D at some smaller number of points, say at t_0, . . . , t_{n−1} with n < D. One can simply calculate a new polynomial s as the remainder
s(t) = x(t) mod Π_{j=0}^{n−1} (t − t_j),
whence evaluation of s (whose degree is now about n) at the n given points t_i will suffice.
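This reduction is simple to demonstrate; a self-contained sketch (our own helpers, coefficient lists low degree first) builds the monic product Π(t − t_j), takes the remainder, and confirms agreement at the chosen points:

```python
def horner(c, t):
    # Evaluate polynomial with coefficient list c (low degree first) at t.
    acc = 0
    for a in reversed(c):
        acc = acc * t + a
    return acc

def reduce_to_points(x, points):
    # Build y(t) = prod_j (t - t_j), monic of degree n, then return
    # s(t) = x(t) mod y(t): a polynomial of degree < n that agrees with
    # x at every chosen point.
    y = [1]
    for t0 in points:                          # multiply y by (t - t0)
        y = [0] + y
        for i in range(len(y) - 1):
            y[i] -= t0 * y[i + 1]
    r = list(x)
    dy = len(y) - 1
    for k in range(len(r) - 1, dy - 1, -1):    # classical remaindering (y is monic)
        c = r[k]
        if c:
            for i in range(dy + 1):
                r[k - dy + i] -= c * y[i]
    return r[:dy]
```

For a degree-4 polynomial and three points, the remainder has degree at most 2 yet evaluates identically at those points.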
9.7 Exercises
9.1. Show that both the base-B and balanced base-B representations are unique. That is, for any nonnegative integer x, there is one and only one collection of digits corresponding to each definition.
9.2. Although this chapter has started with multiplication, it is worthwhile to look at least once at simple addition and subtraction, especially in view of signed arithmetic.
(1) Assuming a base-B representation for each of two nonnegative integers x, y, give an explicit algorithm for calculating the sum x + y, digit by digit, so that this sum ends up also in base-B representation.
(2) Invoke the notion of signed-integer arithmetic, by arguing that to get general sums and differences of integers of any signs, all one needs is the summation algorithm of (1), and one other algorithm, namely, to calculate the difference x − y when x ≥ y ≥ 0. (That is, every add/subtract problem can be put into one of two forms, with an overall sign decision on the result.)
(3) Write out complete algorithms for addition and subtraction of integers in base B, with signs arbitrary.
9.3. Assume that each of two nonnegative integers x, y is given in balanced base-B representation. Give an explicit algorithm for calculating the sum x + y, digit by digit, but always staying entirely within the balanced base-B representation for the sum. Then write out such a self-consistent multiply algorithm for balanced representations.
9.4. It is known to children that multiplication can be effected via addition alone, as in 3 · 5 = 5 + 5 + 5. This simple notion can actually have practical import in some scenarios (for some machines, especially older ones, where word multiply is especially costly), as seen in the following tasks, where we study how to use storage tricks to reduce the amount of calculation during a large-integer multiply. Consider the multiplication of D-digit, base-(B = 2^b) integers of size 2^n, so that n ≈ bD. For the tasks below, define a “word” operation (word multiply or word add) as one involving two size-B operands (each having b bits).
(1) Argue first that standard grammar-school multiply, whereby one constructs via word multiplies a parallelogram and then adds up the columns via word adds, requires O(D^2) word multiplies and O(D^2) word adds.
(2) Noting that there can be at most B possible rows of the parallelogram, argue that all possible rows can be precomputed in such a way that the full multiply requires O(BD) word multiplies and O(D^2) word adds.
(3) Now argue that the precomputation of all possible rows of the parallelogram can be done with successive additions and no multiplies of any kind, so that the overall multiply can be done in O(D^2 + BD) word adds.
(4) Argue that the grammar-school paradigm of task (1) above can be done with O(n) bits of temporary memory. What, then, are the respective memory requirements for tasks (2), (3)?
If one desires to create an example program, here is a possible task: Express large integers in base B = 256 = 2^8 and implement task (2) above, using a 256-integer precomputed lookup table of possible rows to create the usual parallelogram. Such a scheme may well be slower than other large-integer methods, but as we have intimated, a machine with especially slow word multiply can benefit from these ideas.
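As a sanity model for tasks (2) and (3), the following Python sketch (our own names; a model, not machine-tuned code) precomputes all B rows by repeated addition and then assembles the product with word adds only:

```python
B = 256

def add_digits(a, b):
    # Word-level addition of little-endian base-B digit lists.
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % B)
        carry = s // B
    if carry:
        out.append(carry)
    return out

def mul_by_table(x, y):
    # Multiply digit lists x, y.  The rows v*x for v = 0..B-1 are built by
    # successive additions (no word multiplies), then the usual parallelogram
    # is summed row by row, shifted per digit of y.
    rows = [[0]]
    for v in range(1, B):
        rows.append(add_digits(rows[-1], x))
    acc = [0]
    for k, d in enumerate(y):
        acc = add_digits(acc, [0] * k + rows[d])   # shift by k digits, then add
    return acc

def to_digits(v):
    # Conversion helper (not part of the multiply itself).
    ds = []
    while v:
        ds.append(v % B)
        v //= B
    return ds or [0]

def to_int(ds):
    return sum(d * B ** i for i, d in enumerate(ds))
```

The multiply path touches only `add_digits`, illustrating the O(D^2 + BD) word-add bound of task (3).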
9.5. Write out an explicit algorithm (or an actual program) that uses the w_n relation (9.3) to effect multiple-precision squaring in about half a multiple-precision multiply time. Note that you do not need to subtract out the term δ_n explicitly, if you elect instead to modify slightly the i sum. The basic point is that the grammar-school rhombus is effectively cut (about) in half. This exercise is not as trivial as it may sound: there are precision considerations attendant on the possibility of huge column sums.
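A plain-Python model of the halved rhombus (our own sketch; it sidesteps the word-size precision concerns the exercise warns about, since Python integers are unbounded):

```python
def square_digits(x, B=256):
    # Square a little-endian base-B digit list using only the upper half of
    # the multiplication rhombus: cross terms x_i*x_j with i < j count twice,
    # diagonal terms x_i^2 once.
    D = len(x)
    cols = [0] * (2 * D)
    for i in range(D):
        cols[2 * i] += x[i] * x[i]
        for j in range(i + 1, D):
            cols[i + j] += 2 * x[i] * x[j]
    out, carry = [], 0          # propagate the (possibly huge) column sums
    for c in cols:
        c += carry
        out.append(c % B)
        carry = c // B
    while carry:
        out.append(carry % B)
        carry //= B
    return out
```

Roughly half the inner-loop products of a general multiply are performed; the column sums can exceed a word, which is precisely the precision issue the exercise raises.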

where “do” simply means one repeats what is in the braces for some appropriate total iteration count. Note that the duplication of the y iteration
is intentional! Show that this scheme formally generates the binomial series of √(1 + a) via the variable x. How many correct terms obtain after k iterations of the do loop?
Next, calculate some real-valued square roots in this way, noting the important restriction that |a| cannot be too large, lest divergence occur (the formal correctness of the resulting series in powers of a does not, of course, automatically guarantee convergence).
Then, consider this question: Can one use these ideas to create an algorithm for extracting integer square roots? This could be a replacement for Algorithm 9.2.11; the latter, we note, does involve explicit division. On this question it may be helpful to consider, for given n to be square-rooted, something such as √(n/4^q) = 2^{−q}√n or a similar construct, to keep convergence under control.
Incidentally, it is of interest that the standard, real-domain, Newton iteration for the inverse square root automatically has division-free form, yet we appear to be compelled to invoke such as the above coupled-variable expedient for a positive fractional power.
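For reference, that division-free iteration is y ← y(3 − ay^2)/2 (the halving is by a fixed constant, not a general divide); a quick floating-point sketch, with names of our own choosing:

```python
def inv_sqrt(a, y0, iters=6):
    # Newton iteration for 1/sqrt(a): y <- y * (3 - a*y*y) / 2.
    # Quadratic convergence once y0 is reasonably close to the root.
    y = y0
    for _ in range(iters):
        y = y * (3 - a * y * y) / 2
    return y
```

For example, `inv_sqrt(2.0, 0.7)` agrees with 1/√2 to full double precision after a handful of passes.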
9.15. The Cullen numbers are C_n = n·2^n + 1. Write a Montgomery powering program specifically tailored to find composite Cullen numbers, via relations such as 2^{C_n − 1} ≡ 1 (mod C_n). For example, within the powering algorithm for modulus N = C_245 you would be taking, say, R = 2^253 so that R > N. You could observe, for example, that C_141 is a base-2 pseudoprime in this way (it is actually a prime). A much larger example of a Cullen prime is Wilfrid Keller’s C_18496. For more on Cullen numbers see Exercise 1.83.
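The Fermat check itself (without the Montgomery machinery the exercise asks for) is a one-liner via Python’s built-in modular `pow`, and it confirms the C_141 observation:

```python
def cullen(n):
    # Cullen number C_n = n * 2^n + 1.
    return n * (1 << n) + 1

def is_base2_psp(N):
    # Fermat test to base 2: a necessary (but not sufficient) condition
    # for primality of odd N.
    return pow(2, N - 1, N) == 1
```

Here `is_base2_psp(cullen(141))` returns True, consistent with C_141 being prime; composite Cullen numbers would be expected to fail this test, which is the exercise’s point.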
9.16. Say that we wish to evaluate 1/3 using the Newton reciprocation of the text (among real numbers, so that the result will be 0.3333 . . .). For initial guess x_0 = 1/2, prove that for positive n the n-th iterate x_n is in fact

x_n = (2^{2^n} − 1)/(3 · 2^{2^n}),
in this way revealing the quadratic-convergence property of a successful Newton loop. The fact that a closed-form expression can even be given for the Newton iterates is interesting in itself. Such closed forms are rare—can you find any others?
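The claimed closed form is easy to confirm with exact rational arithmetic; a short sketch of the iteration x ← x(2 − 3x), the standard Newton reciprocal step for 1/3:

```python
from fractions import Fraction

# Newton reciprocal iteration for 1/3, started at x_0 = 1/2; each pass
# is compared against the closed form (2^(2^n) - 1) / (3 * 2^(2^n)).
x = Fraction(1, 2)
for n in range(1, 6):
    x = x * (2 - 3 * x)
    assert x == Fraction(2 ** 2 ** n - 1, 3 * 2 ** 2 ** n)
```

The number of correct bits doubles each pass, which is exactly the quadratic convergence the closed form exposes.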
9.17. Work out the asymptotic complexity of Algorithm 9.2.8, in terms of a size-N multiply, and assuming all the shifting enhancements discussed in the text. Then give the asymptotic complexity of the composite operation (xy) mod N , for 0 ≤ x, y < N , in the case that the generalized reciprocal is not yet known. What is the complexity for (xy) mod N if the reciprocal is known? (This should be asymptotically the same as the composite Montgomery operation (xy) mod N if one ignores the precomputations attendant to the latter.) Incidentally, in actual programs that invoke the Newton–Barrett ideas,