
Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
9.63. If the Mersenne prime p = 2^89 − 1 is used in the DGT integer convolution Algorithm 9.5.22 for zero-padded, large-integer multiply, and the elements of signals x, y are interpreted as digits in base B = 2^16, how large can x, y be? What if balanced digit representation (with each digit in [−2^15, 2^15 − 1]) is used?
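As a rough quantitative feel for this exercise (a back-of-envelope Python sketch under the usual exact-recovery assumption that every acyclic-convolution element must be unambiguously recoverable mod p; the exact bound is the point of the exercise, so treat these as illustrative):

```python
p = 2**89 - 1      # Mersenne prime used in the DGT
B = 2**16          # digit base

# Standard digits in [0, B-1]: each convolution element is at most
# D*(B-1)^2, and exact recovery mod p needs D*(B-1)^2 < p.
D_standard = (p - 1) // (B - 1)**2

# Balanced digits in [-2^15, 2^15 - 1]: elements lie in roughly
# [-D*2^30, D*2^30], so a range of size 2*D*2^30 must fit below p,
# giving roughly a factor-of-two gain in digit count.
D_balanced = (p - 1) // (2 * 2**30)
```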
9.64. Describe how to use Algorithm 9.5.22 with a set of Mersenne primes to effect integer convolution via CRT reconstruction, including the precise manner of reconstruction. (Incidentally, CRT reconstruction for a Mersenne prime set is especially straightforward.)
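For orientation, the reconstruction step can be sketched as generic CRT in Python (the Mersenne-specific shortcut alluded to, reducing mod 2^q − 1 by folding high bits onto low bits, is left to the exercise; the small prime set below is only a hypothetical example):

```python
from math import prod

def crt(residues, moduli):
    # Standard CRT: recover x with x ≡ r_i (mod m_i), m_i pairwise coprime.
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x = (x + r * Mi * pow(Mi, -1, m)) % M
    return x

# Distinct Mersenne primes 2^q - 1 are automatically pairwise coprime.
mers = [2**13 - 1, 2**17 - 1, 2**19 - 1]
n = 123456789012345                       # fits below the product of moduli
recovered = crt([n % m for m in mers], mers)
```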
9.65. Analyze the complexity of Algorithm 9.5.22, with a view to the type of recursion seen in the Schönhage Algorithm 9.5.23, and explain how this compares to the entries of Table 9.1.
9.66. Describe how DWT ideas can be used to obviate the need for zero-padding in Algorithm 9.5.25. Specifically, show how to use not a length-2m cyclic convolution, but rather a length-m cyclic and a length-m negacyclic. This is possible because we have a primitive m-th root of −1, so a DWT can be used for the negacyclic. Note that this does not significantly change the complexity, but in practice it reduces memory requirements.
9.67. Prove the complexity claim following the Nussbaumer Algorithm 9.5.25 for the O(D ln D) operation bound. Then analyze the somewhat intricate problem of bit-complexity for the algorithm. One way to start on such bit-complexity analysis is to decide upon the optimal base B, as intimated in the complexity table of Section 9.5.8.
9.68. For odd primes p, the Nussbaumer Algorithm 9.5.25 will serve to evaluate cyclic or negacyclic convolutions (mod p); that is, for ring R identified with Fp. All that is required is to perform all R-element operations (mod p), so the structure of the algorithm as given does not change. Use such a Nussbaumer implementation to establish Fermat’s last theorem for some large exponents p, by invoking a convolution to effect the Shokrollahi DFT. There are various means for converting DFTs into convolutions. One method is to invoke the Bluestein reindexing trick, another is to consider the DFT to be a polynomial evaluation problem, and yet another is Rader’s trick (in the case that signal length is a prime power). Furthermore, convolutions of non-power-of-two length can be embedded in larger, more convenient convolutions (see [Crandall 1996a] for a discussion of such interplay between transforms and convolutions). You would use Theorem 9.5.14, noting first that the DFT length can be brought down to (p − 1)/2. Then evaluate the DFT via a cyclic convolution of power-of-two length by invoking the Nussbaumer method (mod p). Aside from the recent and spectacular theoretical success of A. Wiles in proving the “last theorem,” numerical studies have settled all exponents p < 12000000 [Buhler et al. 2000]. Incidentally, the largest prime to have been shown regular via the Shokrollahi criterion is p = 671008859 [Crandall 1996a].
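To make the Bluestein reindexing concrete, here is a small floating-point Python sketch (the convolution is done naively here; in the exercise it would instead be a power-of-two Nussbaumer convolution mod p). The identity jk = (j^2 + k^2 − (k − j)^2)/2 turns the DFT into a convolution against a "chirp":

```python
import cmath

def dft_naive(x):
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

def dft_bluestein(x):
    # With c(t) = exp(-i*pi*t^2/N) = w^{t^2/2}, the reindexing
    # jk = (j^2 + k^2 - (k-j)^2)/2 gives X_k = c(k) * sum_j a_j / c(k-j),
    # i.e., a convolution of a against the chirp 1/c.
    N = len(x)
    c = lambda t: cmath.exp(-1j * cmath.pi * t * t / N)
    a = [x[j] * c(j) for j in range(N)]
    return [c(k) * sum(a[j] / c(k - j) for j in range(N)) for k in range(N)]

X1 = dft_naive([1, 2, 3, 4, 5])
X2 = dft_bluestein([1, 2, 3, 4, 5])
```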

9.7 Exercises
9.69. Implement Algorithm 9.6.1 for multiplication of polynomials with coefficients (mod p). Such an implementation is useful in, say, the Schoof algorithm for counting points on elliptic curves, for in that method, one has not only to multiply large polynomials, but also to create powering ladders that rely on the large-degree polynomial multiplies.
9.70. Prove both complexity claims in the text following Algorithm 9.6.1. Describe under what conditions (e.g., what D, p ranges, what memory constraints, and so on) which of the methods indicated—Nussbaumer convolution or the binary-segmentation method—would be the more practical.
For further analysis, you might consider the Shoup method for polynomial multiplication [Shoup 1995], a CRT-convolution-based method with its own complexity formula. To which of the two above methods does the Shoup method compare most closely, in complexity terms?
9.71. Say that polynomials x(t), y(t) have coefficients (mod p) and degrees ≈ N. For Algorithm 9.6.4, which calls Algorithm 9.6.2, what is the asymptotic bit complexity of the polynomial mod operation x mod y, in terms of p and N? (You need to make an assumption about the complexity of the integer multiplication for products of coefficients.) What if one is, as in many integer mod scenarios, doing many polynomial mods with the same modulus polynomial y(t), so that one has only to evaluate the truncated inverse R[y, ] once?
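The reusable piece is the truncated inverse, obtained by Newton iteration. A toy Python sketch (naive coefficient multiplication stands in for the fast multiplies of Algorithm 9.6.2; the small prime and the polynomial are illustrative only):

```python
p = 97  # hypothetical small prime modulus

def pmul(a, b):
    # schoolbook product of coefficient lists (low degree first), mod p
    r = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            r[i + j] = (r[i + j] + ai * bj) % p
    return r

def pinv_trunc(y, n):
    # Newton iteration g <- 2g - y g^2 (mod t^k), doubling k up to n;
    # y[0] must be invertible mod p.  Result satisfies y*g ≡ 1 (mod t^n).
    g = [pow(y[0], -1, p)]
    k = 1
    while k < n:
        k = min(2 * k, n)
        t = pmul(y[:k], pmul(g, g))[:k]
        g = [(2 * a - b) % p for a, b in zip(g + [0] * k, t)][:k]
    return g[:n]

# sanity: y * pinv_trunc(y, n) ≡ 1 (mod t^n)
y = [3, 1, 4, 1, 5]
sanity = pmul(y, pinv_trunc(y, 8))[:8]
```

The point of the exercise's last question is that `pinv_trunc(y, n)` is computed once, after which each x mod y costs only multiplications.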
9.72. Here we explore another relation for Bernoulli numbers (mod p). Prove the theorem that if p ≥ 5 is prime, a is coprime to p, and we define d = −p^{−1} mod a, then for even m in [2, p − 3],

  (B_m/m)(a^m − 1) ≡ Σ_{j=0}^{p−1} j^{m−1} (dj mod a)  (mod p).
Then establish the corollary that

  (B_m/m)(2^{−m} − 1) ≡ (1/2) Σ_{j=1}^{(p−1)/2} j^{m−1}  (mod p).

Now achieve the interesting conclusion that if p ≡ 3 (mod 4), then B_{(p+1)/2} cannot vanish (mod p).
Such summation formulae have some practical value, but more computationally efficient forms exist, in which summation indices need cover only a fraction of the integers in the interval [0, p − 1]; see [Wagstaff 1978], [Tanner and Wagstaff 1987].
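The corollary is easy to spot-check numerically; the following Python snippet computes exact rational Bernoulli numbers via the defining recurrence and then verifies the congruence for p = 11 and every admissible even m:

```python
from fractions import Fraction
from math import comb

def bernoulli_list(n):
    # B_0..B_n from sum_{k=0}^{m} C(m+1, k) B_k = 0  (so B_1 = -1/2)
    B = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, k) * B[k] for k in range(m))
        B.append(Fraction(-s, m + 1))
    return B

def frac_mod(f, p):
    # image of a rational in Z_p (denominator assumed prime to p)
    return f.numerator * pow(f.denominator, -1, p) % p

p = 11
B = bernoulli_list(p - 3)
for m in range(2, p - 2, 2):                    # even m in [2, p-3]
    lhs = frac_mod(B[m] / m * (Fraction(1, 2**m) - 1), p)
    rhs = sum(j ** (m - 1) for j in range(1, (p - 1) // 2 + 1)) * pow(2, -1, p) % p
    assert lhs == rhs
```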
9.73. Prove that Algorithm 9.6.5 works. Then modify the algorithm for a somewhat different problem, which is to evaluate a polynomial given in product form
x(t) = t(t + d)(t + 2d) · · · (t + (n − 1)d),
at a single given point t0. The idea is to choose some optimal G < n, and start with a loop

  for(0 ≤ j < G) a_j = ∏_{q=0}^{G−1} (t0 + (q + Gj)d);

Arrive in this way at an algorithm that requires O(G^2 + n/G) multiplies and O(n + G^2) adds to find x(t0). Show that by recursion on the partial product in the for() loop above (which partial product is again of the type handled by the overall algorithm), one can find x(t0) in O(n^{φ+ε}) multiplies, where φ = (√5 − 1)/2 is the golden mean. In this scenario, what is the total count of adds?
Finally, use this sort of algorithm to evaluate large factorials, for example to verify primality of some large p by testing whether (p − 1)! ≡ −1 (mod p). The basic idea is that the evaluations of
(t + 1)(t + 2) · · · (t + m)

at the points {0, m, 2m, . . . , (m − 1)m} do yield, when multiplied all together, (m^2)!. Searches for Wilson primes have used this technique with all arithmetic performed (mod p^2) [Crandall et al. 1997].
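In miniature, the blocked-factorial idea looks like this in Python (each block is accumulated by schoolbook multiplication here; the whole point of the exercise is that fast polynomial evaluation produces all the block values far more cheaply):

```python
def wilson_test(p):
    # (p-1)! mod p in blocks of m consecutive factors: each block is
    # (t+1)...(t+m) evaluated at t = 0, m, 2m, ...; leftovers done singly.
    m = int((p - 1) ** 0.5)
    f = 1
    t = 0
    while t + m <= p - 1:
        block = 1
        for i in range(1, m + 1):
            block = block * (t + i) % p
        f = f * block % p
        t += m
    for i in range(t + 1, p):          # leftover factors above the blocks
        f = f * i % p
    return f == p - 1                  # Wilson: (p-1)! ≡ -1 (mod p) iff p prime

# e.g. wilson_test(13) is True, wilson_test(15) is False
```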
9.74. Say that a polynomial x(t) is known in product form, that is,

  x(t) = ∏_{k=0}^{D−1} (t − t_k),

with the field elements t_k given. By considering the accumulation of pairwise products, show that x can be expressed in coefficient form x(t) = x_0 + x_1 t + · · · + x_{D−1} t^{D−1} in O(D ln^2 D) field operations.
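A Python sketch of the pairwise (product-tree) accumulation; with the naive poly_mul below each level costs quadratic work, but substituting a fast multiply makes each of the ~ln D levels cost O(D ln D), whence the O(D ln^2 D) claim:

```python
def poly_mul(a, b):
    # schoolbook product of coefficient lists, low degree first
    r = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            r[i + j] += ai * bj
    return r

def expand_roots(roots):
    # level 0: linear factors (t - t_k); each pass multiplies neighbors
    level = [[-t, 1] for t in roots]
    while len(level) > 1:
        if len(level) % 2:
            level.append([1])          # pad with the constant polynomial 1
        level = [poly_mul(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# (t-1)(t-2) = t^2 - 3t + 2  ->  [2, -3, 1]
```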
9.75. Prove that Algorithm 9.6.7 works, and establish a complexity estimate (expressed in terms of ring operations) if the partial polynomials and also the polynomial mods are all effected in “grammar-school” fashion. Then, what is a complexity estimate if the partial polynomials are generated via fast multiplication (see Exercise 9.74) but the mods are still classical? Then, what complexity accrues if fast polynomial multiply and mod (as in Algorithm 9.6.4) are both in force?
As an extension, investigate some striking new results in regard to remainder trees (see Section 3.3) and—due to D. Bernstein—scaled remainder trees for polynomials [Bernstein 2004a]. Such methods with appropriate recasting of Algorithm 9.6.7 can result in complexity certainly as good as O(D ln^{2+o(1)} D), with reductions in the implied big-O constant obtainable via such means as storage of key FFT operands during large-integer multiplication.
9.76. Investigate ways to relax the restriction that D be a power of two in Algorithm 9.6.7. One way, of course, is just to assume that the original polynomial has a flock of zero coefficients (and perforce, that the evaluation point set T has power-of-two length), and pretend the degree of x is thus one less than a power of two. But another is to change the Step [Check breakover threshold . . .] to test just whether len(T) is odd. These kinds of approaches will ensure that halving of signals can proceed during recursion.
9.8 Research problems
9.77. As we have intimated, the enhancements to power ladders can be intricate, in many respects unresolved. In this exercise we tour some of the interesting problems attendant on such enhancements.
When an inverse is in hand (alternatively, when point negations are available in elliptic algebra), the add/subtract ladder options make the situation more interesting. The add/subtract ladder Algorithm 7.2.4, for example, has an interesting “stochastic” interpretation, as follows. Let x denote a real number in (0, 1) and let y be the fractional part of 3x; i.e., y = 3x − ⌊3x⌋. Then denote the exclusive-or of x, y by
z = x ⊕ y,
meaning z is obtained by an exclusive-or of the bit streams of x and y together. Now investigate this conjecture: If x, y are chosen at random, then with probability 1, one-third of the binary bits of z are ones. If true, this conjecture means that if you have a squaring operation that takes time S, and a multiply operation that takes time M , then Algorithm 7.2.4 takes about time (S + M/3)b, when the relevant operands have b binary bits. How does this compare with the standard binary ladders of Algorithms 9.3.1, 9.3.2? How does it compare with a base-(B = 3) case of the general windowing ladder Algorithm 9.3.3? (In answering this you should be able to determine whether the add/subtract ladder is equivalent or not to some windowing ladder.)
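One can probe the conjecture numerically with a crude finite-precision proxy: treat a random b-bit integer x as leading bits of the real number, so that z corresponds to x XOR 3x (edge effects at the top bits are ignored, and the connection to the real-number statement is an assumption of this sketch):

```python
import random

random.seed(12345)   # fixed seed for reproducibility

def ones_fraction(b, trials=20):
    # fraction of one bits in z = x XOR 3x for random b-bit x;
    # the conjecture suggests this should hover near 1/3
    total = 0.0
    for _ in range(trials):
        x = random.getrandbits(b)
        total += bin(x ^ (3 * x)).count('1') / b
    return total / trials

frac = ones_fraction(4096)
```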
Next, work out a theory of precise squaring and addition counts for practical ladders. For example, a more precise complexity estimate for the left-right binary ladder is
C ≈ (b(y) − 1)S + (o(y) − 1)M,
where the exponent y has b(y) total bits, of which o(y) are 1’s. Such a theory should be extended to the windowing ladders, with precomputation overhead not ignored. In this way, describe quantitatively what sort of ladder would be best for a typical cryptography application; namely, x, y have say 192 bits each and x^y is to be computed modulo some 192-bit prime.
Next, implement an elliptic multiplication ladder in base B = 16, which means as in Algorithm 9.3.3 that four bits at a time of the exponent are processed. Note that, as explained in the text following the windowing ladder algorithm, you would need only the following point multiples: P, 3P, 5P, 7P. Of course, one should be precomputing these small multiples also in an efficient manner.
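As a multiplicative-group warm-up (integers standing in for curve points, so squarings play the role of doublings; since no subtractions are available here, odd powers up to x^15 are precomputed rather than just the four multiples of the signed elliptic case), a width-4 sliding-window ladder might look like:

```python
def window_pow(x, y, n, w=4):
    # left-to-right sliding-window modular exponentiation;
    # precompute odd powers x, x^3, ..., x^(2^w - 1)
    odd = {1: x % n}
    x2 = x * x % n
    for k in range(3, 1 << w, 2):
        odd[k] = odd[k - 2] * x2 % n
    result = 1
    bits = bin(y)[2:]
    i = 0
    while i < len(bits):
        if bits[i] == '0':
            result = result * result % n
            i += 1
        else:
            j = min(i + w, len(bits))
            while bits[j - 1] == '0':      # window must end in a 1 bit
                j -= 1
            for _ in range(j - i):
                result = result * result % n
            result = result * odd[int(bits[i:j], 2)] % n
            i = j
    return result
```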
Next, study yet other ladder options (and this kind of extension to the exercise reveals just how convoluted is this field of study) as described in
[Müller 1997], [De Win et al. 1998], [Crandall 1999b] and references therein. As just one example of attempted refinements, some investigators have considered exponent expansions in which there is some guaranteed number of 0’s interposed between other digits. Then, too, there is the special advantage inherent in highly compressible exponents [Yacobi 1999], such study being further confounded by the possibility of base-dependent compressibility. It is an interesting research matter to ascertain the precise relation between the compressibility of an exponent and the optimal efficiency of powering to said exponent.
9.78. In view of complexity results such as in Exercise 9.37, it would seem that a large-D version of Toom–Cook could, with recursion, be brought down to what is essentially an ideal bit complexity O(ln^{1+ε} N). However, as we have intimated, the additions grow rapidly. Work out a theory of Toom–Cook addition counts, and discuss the tradeoffs between very low multiplication complexity and overwhelming complexity of additions. Note also the existence of addition optimizations, as intimated in Exercise 9.38.
This is a difficult study, but of obvious practical value. For example, there is nothing a priori preventing us from employing different, alternating Toom–Cook schemes within a single, large recursive multiply. Clearly, to optimize such a mixed scheme one should know something about the interplay of the multiply and add counts, as well as other aspects of overhead. Yet another such aspect is the shifting and data shuttling one must do to break up an integer into its Toom–Cook coefficients.
9.79. How far should one be able to test numerically the Goldbach conjecture by considering the acyclic convolution of the signal
G = (1, 1, 1, 0, 1, 1, 0, 1, 1, 0, . . .)
with itself? (Here, as in the text, the signal element G_n equals 1 if and only if 2n + 3 is prime.) What is the computational complexity for this convolution-based approach to the settling of Goldbach’s conjecture for all even numbers not exceeding x? Note that the conjecture has been settled for all even numbers up to x = 4 · 10^14 [Richstein 2001]. We note that explicit FFT-based computations up to 10^8 or so have indeed been performed [Lavenier and Saouter 1998]. Here is an interesting question: Can one resolve Goldbach representations via pure-integer convolution on arrays of b-bit integers (say b = 16 or 32), with prime locations signified by 1 bits, knowing in advance that two prime bits lying in one integer is a relatively rare occurrence?
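A small-scale Python illustration of the counting (naive quadratic convolution here, where a real experiment would use an FFT): conv[m] counts the ordered representations of the even number 2m + 6 as a sum of two odd primes.

```python
def prime_sieve(n):
    s = [True] * (n + 1)
    s[0] = s[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if s[i]:
            s[i * i::i] = [False] * len(s[i * i::i])
    return s

N = 200
isp = prime_sieve(2 * N + 3)
G = [1 if isp[2 * n + 3] else 0 for n in range(N)]   # G_n = 1 iff 2n+3 prime
# acyclic self-convolution: conv[m] = sum_j G_j G_{m-j}, and the pair
# (j, m-j) corresponds to (2j+3) + (2(m-j)+3) = 2m + 6
conv = [sum(G[j] * G[m - j] for j in range(m + 1)) for m in range(N)]
# conv[0] = 1 is the single representation 6 = 3 + 3; 8 = 3+5 = 5+3 counts twice
```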
9.80. One can employ convolution ideas to analyze certain higher-order additive problems in rings ZN , and perhaps in more complicated settings leading into interesting research areas. Note that Exercise 9.41 deals with sums of squares. But when higher powers are involved, the convolution and spectral manipulations are problematic.
To embark on the research path intended herein, start by considering a k-th powers exponential sum (the square and cubic versions appear in Exercise 1.66), namely

  U_k(a) = Σ_{x=0}^{N−1} e^{2πi a x^k / N}.
Denote by r_s(n) the number of representations of n as a sum of s k-th powers in Z_N. Prove that whereas
  Σ_{n=0}^{N−1} r_s(n) = N^s,

it also happens that

  Σ_{n=0}^{N−1} r_s(n)^2 = (1/N) Σ_{a=0}^{N−1} |U_k(a)|^{2s}.
It is this last relation that allows some interesting bounds and conclusions. In fact, the spectral sum of powers |U|^{2s}, if bounded above, will allow lower bounds to be placed on the number of representable elements of Z_N. In other words, upper bounds on the spectral amplitude |U| effectively “control” the representation counts across the ring, to analytic advantage.
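Both displayed identities are quickly checked in Python for small parameters (brute-force representation counts against a floating-point spectral sum, so the second comparison is approximate):

```python
import cmath
from itertools import product

def check(N, k, s):
    # r_s(n): number of ways n is a sum of s k-th powers in Z_N
    r = [0] * N
    for xs in product(range(N), repeat=s):
        r[sum(x ** k for x in xs) % N] += 1
    # the exponential sums U_k(a)
    U = [sum(cmath.exp(2j * cmath.pi * a * (x ** k % N) / N) for x in range(N))
         for a in range(N)]
    lhs = sum(v * v for v in r)                     # sum of r_s(n)^2
    rhs = sum(abs(u) ** (2 * s) for u in U) / N     # (1/N) sum |U|^{2s}
    return sum(r), lhs, rhs

tot, lhs, rhs = check(7, 3, 2)
```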
Next, as an initial foray into the many research options, use the ideas and results of Exercises 1.44, 1.66 to show that a positive constant c exists such that for p prime, more than a fraction c of the elements of Z_p are sums of two cubes. Admittedly, we have seen that the theory of elliptic curves completely settles the two-cube question—even for rings Z_N with N composite—in the manner of Exercise 7.20, but the idea of the present exercise is to use the convolution and spectral notions alone. How high can you force c for, say, sufficiently large primes p? One way to proceed is first to show from the “p^{3/4}” bound of Exercise 1.66 that every element of Z_p is a sum of 5 cubes, then to obtain sharper results by employing the best-possible “p^{1/2}” bound. And what about this spectral approach for composite N? In this case one may employ, for appropriate Fourier indices a, an “N^{2/3}” bound (see for example [Vaughan 1997, Theorem 4.2]).
Now try to find a simple proof of the theorem: If N is prime, then for every k there exist positive constants c_k, ε_k such that for a ≢ 0 (mod N) we have

  |U_k(a)| < c_k N^{1−ε_k}.

Then, show from this that for any k there is a fixed s (independent of everything except k) such that every element of Z_N, prime N, is a sum of s k-th powers. Such bounds as the above on |U| are not too hard to establish, using recursion on the Weyl expedient as used for the cubic case in Exercise 1.66. (Some of the references below explain how to do more work, to achieve ε_k ≈ 1/k, in fact.)
Can you show the existence of the fixed s for composite N? Can you establish explicit values for s for various k (recall the “4,5” dichotomy for the cubic case)? In such research, you would have to find upper bounds on general U sums, and indeed these can be obtained; see [Vinogradov 1985], [Ellison and Ellison 1985], [Nathanson 1996], [Vaughan 1997]. However, the hard part is to establish explicit s, which means explicit bounding constants need to be tracked; and many references, for theoretical and historical reasons, do not bother with such detailed tracking.
One of the most fascinating aspects of this research area is the fusion of theory and computation. That is, if you have bounding parameters c_k, ε_k for k-th power problems as above, then you will likely find yourself in a situation where theory is handling the “sufficiently large” N, yet you need computation to handle all the cases of N from the ground up to that theory threshold. Computation looms especially important, in fact, when the constant c_k is large or, to a lesser extent, when ε_k is small. In this light, the great efforts of 20th-century analysts to establish general bounds on exponential sums can now be viewed from a computational perspective.
These studies are, of course, reminiscent of the literature on the celebrated Waring conjecture, which conjecture claims representability by a fixed number s of k-th powers, but among the nonnegative integers (e.g., the Lagrange four-square theorem of Exercise 9.41 amounts to proof of the k = 2, s = 4 subcase of the general Waring conjecture). The issues in this full Waring scenario are different, because for one thing the exponential sums are to be taken not over all ring elements but only up to index x ≈ N^{1/k} or thereabouts, and the bounding procedures are accordingly more intricate. In spite of such obstacles, a good research extension would be to establish the classical Waring estimates on s for given k—which estimates historically involve continuous integrals—using discrete convolution methods alone. (In 1909 D. Hilbert proved the Waring conjecture via an ingenious combinatorial approach, while the incisive and powerful continuum methods appear in many references, e.g., [Hardy 1966], [Nathanson 1996], [Vaughan 1997].) Incidentally, many Waring-type questions for finite fields have been completely resolved; see for example [Winterhof 1998].
9.81. Is there a way to handle large convolutions without DFT, by using the kind of matrix idea that underlies Algorithm 9.5.7? That is, you would be calculating a convolution in small pieces, with the usual idea in force: The signals to be convolved can be stored on massive (say disk) media, while the computations proceed in relatively small memory (i.e., about the size of some matrix row/column).
Along these lines, design a standard three-FFT convolution for arbitrary signals, except do it in matrix form reminiscent of Algorithm 9.5.7, yet do not do unnecessary transposes. Hint: Arrange for the first FFT to leave the data in such a state that after the usual dyadic (spectral) product, the inverse FFT can start right o with row FFTs.
Incidentally, E. Mayer has worked out FFT schemes that do no transposes of any kind; rather, his ideas involve columnwise FFTs that avoid common memory problems. See [Crandall et al. 1999] for Mayer’s discussion.
9.82. A certain prime suggested in [Craig-Wood 1998], namely

  p = 2^64 − 2^32 + 1,
has advantageous properties in regard to CRT-based convolution. Investigate some of these advantages, for example by stating the possible signal lengths for number-theoretical transforms modulo p, exhibiting a small-magnitude element of order 64 (such elements might figure well into certain FFT structures), and so on.
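Two of these advantages are one-liners to verify in Python. (For the order-64 element: 2^64 ≡ 2^32 − 1 gives 2^96 ≡ −1 (mod p), so 2 has order 192 and 8 = 2^3 has order exactly 64.)

```python
p = 2**64 - 2**32 + 1

# power-of-two transform lengths: p - 1 is divisible by 2^32
assert (p - 1) % 2**32 == 0

# 8 is a small-magnitude element of order 64: 8^32 = 2^96 ≡ -1 (mod p)
assert pow(8, 32, p) == p - 1
assert pow(8, 64, p) == 1
```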
9.83. Here is a surprising result: Length-8 cyclic convolution modulo a Mersenne prime can be done via only eleven multiplies. It is surprising because the Winograd bound would be 2 · 8 − 4 = 12 multiplies, as in Exercise 9.39. Of course, the resolution of this paradox is that the Mersenne mod changes the problem slightly.
To reveal the phenomenon, first establish the existence of an 8-th root of unity in F_{p^2}, with p being a Mersenne prime and the root being symbolically simple enough that DGTs can be performed without explicit integer multiplications. Then consider the length-8 DGT, used to cyclically convolve two integer signals x, y. Next, argue that the transforms X, Y have sufficient symmetry that the dyadic product X Y requires two real multiplies and three complex multiplies. This is the requisite count of 11 muls.
An open question is: Are there similar “violations” of the Winograd bound for lengths greater than eight?
9.84. Study the interesting observations of [Yagle 1995], who notes that matrix multiplication involving n×n matrices can be effected via a convolution of length n^3. This is not especially surprising, since we cannot do an arbitrary length-n convolution faster than O(n ln n) operations. However, Yagle saw that the indicated convolution is sparse, and this leads to interesting developments, touching, even, on number-theoretical transforms.

Appendix
BOOK PSEUDOCODE
All algorithms in this book are written in a particular pseudocode form describable, perhaps, as a “fusion of English and C languages.” The motivations for our particular pseudocode design have been summarized in the Preface, where we have indicated our hope that this “mix” will enable all readers to understand, and programmers to code, the algorithms. Also in the Preface we indicated a network source for Mathematica implementations of the book algorithms.
That having been said, the purpose of this Appendix is not to provide a rigorous compendium of instruction definitions, for that would require something like an entire treatise on syntactical rules such as would be expected to appear in an off-the-shelf C reference. Instead, we give below some explicit examples of how certain pseudocode statements are to be interpreted.
English, and comments
For the more complicated mathematical manipulations within pseudocode, we elect for English description. Our basic technical motivation for allowing “English” pseudocode at certain junctures is evident in the following example. A statement in the C language
if((n== floor(n)) && (j == floor(sqrt(j))*floor(sqrt(j)))) . . .,
which really means “if n is an integer and j is a square,” we might have cast in this book as
if(n, √j ∈ Z) . . .
That is, we have endeavored to put “chalkboard mathematics” within conditionals. We have also adopted a particular indentation paradigm. If we had allowed (which we have not) such English as:
For all pseudoprimes in S, apply equation (X); Apply equation (Y);
then, to the aspiring programmer, it might be ambiguous whether equation
(Y) were to be applied for all pseudoprimes, or just once, after the loop on equation (X). So the way we wish such English to appear, assuming the case that equation (Y) is indeed applied only once after looping, is like so:
For all pseudoprimes in S, apply equation (X);
Apply equation (Y);

Because of this relegation of English statements to their own lines, the interpretation that equation (Y) is to be invoked once, after the pseudoprime loop, is immediate. Accordingly, when an English statement is sufficiently long that it wraps around, we have adopted reverse indentation, like so:
Find a random t ∈ [0, p − 1] such that t^2 − a is a quadratic nonresidue
(mod p), via Algorithm 2.3.5;
x = (t + √(t^2 − a))^((p+1)/2);
. . .;
In this last example, one continually chooses random integers t in the stated range until one is found with the required condition, and then one goes to the next step, which calls for a single calculation and the assignment of letter x to the result of the calculation.
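This fragment is, in fact, the heart of Cipolla's square-root method, and a direct Python rendering (representing F_{p^2} elements u + v√w as pairs; an illustrative sketch, not the book's own implementation) might read:

```python
import random

def cipolla_sqrt(a, p):
    # square root of a mod an odd prime p, or None if a is a nonresidue
    if pow(a, (p - 1) // 2, p) != 1:
        return None
    # find t with w = t^2 - a a quadratic nonresidue (mod p)
    while True:
        t = random.randrange(p)
        w = (t * t - a) % p
        if pow(w, (p - 1) // 2, p) == p - 1:
            break
    # F_{p^2} arithmetic on pairs (u, v) meaning u + v*sqrt(w)
    def mul(x, y):
        return ((x[0] * y[0] + x[1] * y[1] * w) % p,
                (x[0] * y[1] + x[1] * y[0]) % p)
    # compute (t + sqrt(w))^((p+1)/2) by binary powering
    r, base, e = (1, 0), (t, 1), (p + 1) // 2
    while e:
        if e & 1:
            r = mul(r, base)
        base = mul(base, base)
        e >>= 1
    return r[0]
```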
Also in English will be comments throughout the book pseudocode. These take the following form (and are right-justified, unlike pseudocode itself):
x = (t + √(t^2 − a))^((p+1)/2);    // Use F_{p^2} arithmetic.
The point is, a comment prefaced with “//” is not to be executed as pseudocode. For example, the above comment is given as a helpful hint, indicating perhaps that to execute the instruction one would first want to have a subroutine to do Fp2 arithmetic. Other comments clarify the pseudocode’s nomenclature, or provide further information on how actually to carry out the executable statement.
Assignment of variables, and conditionals
We have elected not to use the somewhat popular assignment syntax x := y, rather, we set x equal to y via the simple expedient x = y. (Note that in this notation for assignment used in our pseudocode, the symbol “=” does not signify a symmetric relation: The assignment x = y is not the same instruction as the assignment y = x.) Because assignment appears on the face of it like equality, the conditional equality x == y means we are not assigning, merely testing whether x and y are equal. (In this case of testing conditional equality, the symbol “==” is indeed symmetric.) Here are some examples of our typical assignments:
x = 2;                   // Variable x gets the value 2.
x = y = 2;               // Both x and y get the value 2.
F = { };                 // F becomes the empty set.
(a, b, c) = (3, 4, 5);   // Variable a becomes 3, b becomes 4, c becomes 5.
Note the important rule that simultaneous (vector) assignment assumes first the full evaluation of the vector on the right side of the equation, followed by the forcing of values on the left-hand side. For example, the assignment
(x, y) = (y^2, 2x);
means that the right-hand vector is evaluated for all components, then the left-hand vector is forced in all components. That is, the example is equivalent to the chain (technically, we assume neither of x, y invokes hidden functions)