
Prime Numbers
Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
9.7 Exercises
one can place within the general mod routine a check to see whether the reciprocal is known, and if it is not, then the generalized reciprocal algorithm is invoked, and so on.
9.18. Work out the asymptotic complexity of Algorithm 9.2.13 for given x, N in terms of a count of multiplications by integers c of various sizes. For example, assuming some grammar-school variant for multiplication, the bit-complexity of an operation yc would be O(ln y ln c). Answer the interesting question: At what size of |c| (compared to $N = 2^q + c$) is the special form reduction under discussion about as wasteful as some other prevailing schemes (such as long division, or the Newton–Barrett variants) for the mod operation? Incidentally, the most useful domain of applicability of the method is the case that c is one machine word in size.
9.19. Simplify Algorithm 9.4.2 in the case that one does not need an extended solution ax + by = g, but needs only the inverse itself. (That is, not all the machinations of the algorithm are really required.)
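As a hedged illustration of the kind of simplification Exercise 9.19 has in mind, the sketch below runs the Euclid loop while tracking only the coefficient of x. It assumes the standard extended Euclid recurrence, which may be arranged differently from Algorithm 9.4.2, and the name `inverse_only` is hypothetical.

```python
def inverse_only(x, n):
    """Return x^(-1) mod n, or None if gcd(x, n) != 1.

    Only the coefficient of x is tracked. The coefficient of n in
    a*x + b*n = g is never needed for the inverse, so it is dropped.
    """
    a0, a1 = 1, 0          # invariant: r0 = a0*x (mod n), r1 = a1*x (mod n)
    r0, r1 = x % n, n      # current pair of remainders
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        a0, a1 = a1, a0 - q * a1
    if r0 != 1:
        return None        # x is not invertible modulo n
    return a0 % n
```

For example, `inverse_only(3, 7)` returns 5, since 3 · 5 = 15 ≡ 1 (mod 7).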
9.20. Implement the recursive gcd Algorithm 9.4.6. (Or, implement the newer Algorithm 9.4.7; see next paragraph.) Optimize the breakover parameters lim and prec for maximum speed in the calculation of rgcd(x, y) for each of x, y of various (approximately equal) sizes. You should be able to see rgcd() outperforming cgcd() in the region of, very roughly speaking, thousands of bits. (Note: Our display of Algorithm 9.4.6 is done in such a way that if the usual rules of global variables, such as matrix G, and variables local to procedures, such as the variables x, y in hgcd() and so on, are followed in the computer language, then transcription from our notation to a working program should not be too tedious.)
As for Algorithm 9.4.7, the reader should find that different optimization issues accrue. For example, we found that Algorithm 9.4.6 typically runs faster if there is no good way to do such things as trailing-zero detection and bit-shifting on huge numbers. On the other hand, when such expedients are efficient for the programmer, the newer Algorithm 9.4.7 should dominate.
9.21. Prove that Algorithm 9.2.10 works. Furthermore, work out a version that uses the shift-splitting idea embodied in the relation (9.12) and comments following. A good source for loop constructs in this regard is [Menezes et al. 1997].
Also, investigate the conjecture in [Oki 2003] that one may more tightly assign s = 2B(N − 1) in Algorithm 9.2.10.
9.22. Prove that Algorithm 9.2.11 works. It helps to observe that x is definitely decreasing during the iteration loop. Then prove the O(ln ln N) estimate for the number of steps to terminate. Then invoke the idea of changing precision at every step, to show that the bit-complexity of a properly tuned algorithm can be brought down to $O(\ln^2 N)$. Many of these ideas date back to the treatment in [Alt 1979].
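The behavior described in Exercise 9.22 (a strictly decreasing iterate and a doubly logarithmic step count) is that of a Newton iteration for the integer part of a square root. The sketch below is offered under that assumption; whether it matches Algorithm 9.2.11 line for line is not confirmed here, and the name `isqrt_newton` is hypothetical. It does not attempt the precision-changing refinement the exercise asks for.

```python
def isqrt_newton(n):
    """Return floor(sqrt(n)) for an integer n >= 0 by Newton iteration."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    if n == 0:
        return 0
    # Start at a power of two that is at least sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) // 2
        if y >= x:          # x has stopped decreasing: it is the answer
            return x
        x = y
```

Starting above the root guarantees that the iterates decrease monotonically, which is the observation the exercise suggests using in the correctness proof.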
algorithm that inputs x, y and requires only four internal variables to calculate the inverse of x.
9.32. Can Algorithm 9.4.4 be generalized to composite p?
9.33. Prove that Algorithms 9.4.4 and 9.4.5 work. For the latter algorithm, it may help to observe how one inverts a pure power of two modulo a Mersenne prime.
9.34. In the spirit of the special-case mod Algorithm 9.2.13, which relied heavily on bit shifting, recast Algorithm 9.4.5 to indicate the actual shifts required in the various steps. In particular, not only the mod operation but multiplication by a power of two is especially simple for Mersenne prime moduli, so use these simplifications to rewrite the algorithm.
9.35. Can one perform a gcd on two numbers each of size N in polynomial time (i.e., time proportional to some power $\lg^{\alpha} N$), using a polynomial number of parallelized processors (i.e., $\lg^{\beta} N$ of them)? An interesting reference is [Cesari 1998], where it is explained that it is currently unknown whether such a scheme is possible.
9.36. Write out a clear algorithm for a full integer multiplication using the Karatsuba method. Make sure to show the recursive nature of the method, and also to handle properly the problem of carry, which must be addressed when any final digits overflow the base size.
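To make the recursive structure in Exercise 9.36 concrete, here is a minimal sketch of a Karatsuba multiply on Python integers. It is an illustration under simplifying assumptions, not the digit-array algorithm the exercise asks for: the splitting is by bits, the name `karatsuba` and the parameter `cutoff_bits` are hypothetical, and carry handling is delegated to Python's built-in big-integer addition. A fixed-base digit implementation would need an explicit carry normalization pass after the final recombination.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply nonnegative integers x and y by Karatsuba recursion."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                       # base case: ordinary multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask           # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask           # y = y1 * 2^half + y0
    lo = karatsuba(x0, y0, cutoff_bits)
    hi = karatsuba(x1, y1, cutoff_bits)
    # Third and last recursive multiply gives the cross term x0*y1 + x1*y0.
    mid = karatsuba(x0 + x1, y0 + y1, cutoff_bits) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

Each level replaces four half-size products by three, which is the source of the exponent ln 3/ln 2 in the operation count.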
9.37. Show that a Karatsuba-like recursion on the (D = 3) Toom–Cook method (i.e., recursion on Algorithm 9.5.2) yields integer multiplication of two size-N numbers in what is claimed in the text, namely, $O((\ln N)^{\ln 5/\ln 3})$ word multiplies. (All of this assumes that we count neither additions nor the constant multiplies as they would arise in every recursive [Reconstruction] step of Algorithm 9.5.2.)
9.38. Recast the [Initialize] step of Algorithm 9.5.2 so that the $r_i$, $s_i$ can be most efficiently calculated.
9.39. We have seen that an acyclic convolution of length N can be effected in 2N − 1 multiplies (aside from multiplications by constants; e.g., a term such as 4x can be done with left-shift alone, no explicit multiply). It turns out that a cyclic convolution can be effected in 2N − d(N) multiplies, where d is the standard divisor function (the number of divisors of N), while a negacyclic can be effected in 2N − 1 multiplies. (These wonderful results are due chiefly to S. Winograd; see the older but superb collection [McClellan and Rader 1979].) Here are some explicit related problems:
(1) Show that two complex numbers a + bi, c + di may be multiplied via only three real multiplies.
(2) Work out an algorithm that performs a length-4 negacyclic in nine multiplies, but with all constant mul or div operations being by powers of two (and thus, mere shifts). The theoretical minimum is, of course, seven multiplies, but such a nine-mul version has its advantages.
(3) Use Toom–Cook ideas to develop an explicit length-4 negacyclic scheme that does require only seven multiplies.
(4) Can one use a length-(D > 2) negacyclic to develop a Karatsuba-like multiply that is asymptotically better than $O((\ln D)^{\ln 3/\ln 2})$?
(5) Show how to use a Walsh–Hadamard transform to effect a length-16 cyclic convolution in 43 multiplies [Crandall 1996a]. Though the theoretical minimum multiply count for this length is 27, the Walsh–Hadamard scheme has no troublesome constant coefficients. The scheme also appears to be a kind of bridge between Winograd complexities (linear in N) and transform-based complexities (N ln N). Indeed, 43 is not even as large as 16 lg 16. Incidentally, the true complexity of the Walsh–Hadamard scheme is still unknown.
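For item (1) of Exercise 9.39, one well-known arrangement of the three real multiplies can be sketched as follows. The function name `complex_mul3` is hypothetical, and other groupings of the three products work equally well.

```python
def complex_mul3(a, b, c, d):
    """Return (real, imag) of (a + bi)(c + di) using three real multiplies."""
    m1 = c * (a + b)        # ac + bc
    m2 = a * (d - c)        # ad - ac
    m3 = b * (c + d)        # bc + bd
    real = m1 - m3          # ac - bd
    imag = m1 + m2          # ad + bc
    return real, imag
```

The saving of one multiply is paid for with three extra additions, which is the same trade that underlies the Karatsuba method.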
9.40. Prove Theorem 9.5.13 by way of convolution ideas, along the following lines. Let $N = 2 \cdot 3 \cdot 5 \cdots p_m$ be a consecutive prime product, and define
$$r_N(n) = \#\{(a, b) : a + b = n;\ \gcd(a, N) = \gcd(b, N) = 1;\ a, b \in [1, N - 1]\},$$
that is, $r_N(n)$ is the number of representations we wish to bound below. Now define a length-N signal y by $y_n = 1$ if gcd(n, N) = 1, else $y_n = 0$. Define the cyclic convolution
$$R_N(n) = (y \times y)_n,$$
and argue that for $n \in [0, N - 1]$,
$$R_N(n) = r_N(n) + r_N(N + n).$$
In other words, the cyclic convolution gives us the combined representations of n and N + n. Next, observe that the Ramanujan sum Y (9.26) is the DFT of y, so that
$$R_N(n) = \frac{1}{N} \sum_{k=0}^{N-1} Y_k^2 \, e^{2\pi i k n/N}.$$
Now prove that R is multiplicative, in the sense that if $N = N_1 N_2$ with $N_1$, $N_2$ coprime, then $R_N(n) = R_{N_1}(n) R_{N_2}(n)$. Conclude that
$$R_N(n) = \varphi_2(N, n),$$
where $\varphi_2$ is defined in the text after Theorem 9.5.13. So now we have a closed form for $r_N(n) + r_N(N + n)$. Note that $\varphi_2$ is positive if n is even. Next, argue that if a + b = n (i.e., n is representable) then 2N − n is also representable. Conclude that if $r_N(n) > 0$ for all even $n \in [N/2 + 1, N - 1]$, then all sufficiently large even integers are representable. This means that all we have to show is that for n even in [N/2 + 1, N − 1], $r_N(n + N)$ is suitably small compared to

$\varphi_2(N, n)$. To this end, observe that a + b = N + n implies b > n, and consider the count
$$\#\{b \in [n, N] : b \not\equiv 0, n \pmod{N}\}.$$
By estimating that count, conclude that for a suitable absolute constant C and even $n \in [N/2 + 1, N - 1]$
$$r_N(n) \ge C \frac{n}{(\ln \ln N)^2} - 2^{m+1}.$$
This settles Theorem 9.5.13 for large enough products N, and the smaller cases one may require such as N = 2, 6, 30 can be handled by inspecting the finite number of cases n < 2N.
We note that the theorem can be demonstrated via direct sieving techniques. Another alternative is to use the Chinese remainder theorem with some combinatorics, to get $R_N$ as the $\varphi_2$ function. An interesting question is: Can the argument above (for bounding $r_N(N + n)$), which is admittedly a sieve argument of sorts, be completely avoided, by doing instead algebraic manipulations on the negacyclic convolution $y \times_{-} y$? As we intimated in the text, this would involve the analysis of some interesting exponential sums. We are unaware of any convenient closed form for the negacyclic, but if one could be obtained, then the precise number of representations n = a + b would likewise be cast in closed form.
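Before attempting the proof in Exercise 9.40, it can be reassuring to confirm the cyclic convolution identity numerically for a small prime product. The sketch below does so by brute force; it only checks the identity and proves nothing, and the function name `check_cyclic_identity` is hypothetical.

```python
from math import gcd

def check_cyclic_identity(N):
    """Check R_N(n) = r_N(n) + r_N(N + n) for all n in [0, N - 1]."""
    # Indicator signal: y[n] = 1 exactly when n is coprime to N.
    y = [1 if gcd(n, N) == 1 else 0 for n in range(N)]

    def r(n):
        # Count pairs (a, b) with a + b = n, both in [1, N - 1], both coprime to N.
        return sum(1 for a in range(1, N)
                   if 1 <= n - a <= N - 1 and y[a] and y[n - a])

    # Direct (quadratic-time) cyclic convolution of y with itself.
    R = [sum(y[a] * y[(n - a) % N] for a in range(N)) for n in range(N)]
    return all(R[n] == r(n) + r(N + n) for n in range(N))
```

Calling `check_cyclic_identity(30)` or `check_cyclic_identity(210)` exercises the cases N = 2 · 3 · 5 and N = 2 · 3 · 5 · 7.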
9.41. Interesting exact results involving sums of squares can be achieved elegantly through careful application of convolution principles. The essential idea is to consider a signal whose elements $x_{n^2}$ are 1’s, with all other elements 0’s. Let p be an odd prime, and start with the definition
$$\hat{x}_k = \sum_{j=0}^{(p-1)/2} \left(1 - \frac{\delta_{0j}}{2}\right) e^{-2\pi i j^2 k/p},$$
where $\delta_{ij} = 1$ if i = j and is otherwise 0. Show that $\hat{x}_0 = p/2$, while for $k \in [1, p - 1]$ we have
$$\hat{x}_k = \frac{\omega_k}{2} \sqrt{p},$$
where $\omega_k = \left(\frac{k}{p}\right)$, $-i\left(\frac{k}{p}\right)$, respectively, as p ≡ 1, 3 (mod 4). The idea is to show all of this as a corollary to Theorem 2.3.7. (Note that the theory of
more general Gauss character sums connects with primality testing, as in our Lemma 4.4.1 and developments thereafter.) Now for $n \in [0, p - 1]$ define $R_m(n)$ to be the count of m-squares representations
$$a_1^2 + a_2^2 + \cdots + a_m^2 \equiv n \pmod{p}$$
in integers $a_j \in [0, (p - 1)/2]$, except that a representation is given a weight factor of 1/2 for every zero component $a_j$. For example, a representation

9.44. Implement Algorithm 9.5.19, with a view to proving that $p = 2^{521} - 1$ is prime via the Lucas–Lehmer test. The idea is to maintain the peculiar, variable-base representation for everything, all through the primality test. (In other words, the output of Algorithm 9.5.19 is ready-made as input for a subsequent call to the algorithm.) For larger primes, such as the gargantuan new Mersenne prime discoveries, investigators have used run lengths such that q/D, the typical bit size of a variable-base digit, is roughly 16 bits or less. Again, this is to suppress as much as possible the floating-point errors.
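As a reference against which an implementation for Exercise 9.44 can be checked, here is a plain Lucas–Lehmer test using Python's built-in integers. It deliberately does not use the variable-base, transform-based squaring of Algorithm 9.5.19; it is only a slow but exact oracle, and the name `lucas_lehmer` is hypothetical.

```python
def lucas_lehmer(q):
    """Return True if 2^q - 1 is prime, for an odd prime exponent q."""
    p = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        s = (s * s - 2) % p     # the squaring is what a fast algorithm replaces
    return s == 0
```

With this oracle, `lucas_lehmer(521)` should agree with whatever a floating-point, transform-based version reports for the same exponent.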
9.45. Implement Algorithm 9.5.17 to establish the character of various Fermat numbers, using the Pepin test, that $F_n$ is prime if and only if $3^{(F_n - 1)/2} \equiv -1 \pmod{F_n}$. Alternatively, the same algorithm can be used in factorization studies [Brent et al. 2000]. (Note: The balanced representation error reduction scheme mentioned in Exercise 9.55 also applies to this algorithm for arithmetic with Fermat numbers.) This method has been employed for the resolution of $F_{22}$ in 1993 [Crandall et al. 1995] and $F_{24}$ [Crandall et al. 1999].
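The Pepin criterion in Exercise 9.45 is short enough to state directly in code. The sketch below uses Python's built-in modular exponentiation rather than Algorithm 9.5.17, so it is practical only for small n; the name `pepin` is hypothetical, and the criterion is applied for n ≥ 1.

```python
def pepin(n):
    """Return True if the Fermat number F_n = 2^(2^n) + 1 is prime (n >= 1)."""
    F = (1 << (1 << n)) + 1
    # F_n is prime exactly when 3^((F_n - 1)/2) is congruent to -1 modulo F_n.
    return pow(3, (F - 1) // 2, F) == F - 1
```

Running this for n = 1, ..., 8 reports the first four as prime and the next four as composite, in agreement with the known character of those Fermat numbers.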
9.46. Implement Algorithm 9.5.20 to perform large-integer multiplication via cyclic convolution of zero-padded signals. Can the DWT methods be applied to do negacyclic integer convolution via an appropriate CRT prime set?
9.47. Show that if the arithmetic field is equipped with a cube root of unity, then for $D = 3 \cdot 2^k$ one can perform a length-D cyclic convolution by recombining three separate length-$2^k$ convolutions. (See Exercise 9.43 and consider the symbolic factorization of $t^D - 1$ for such D.) This technique has actually been used by G. Woltman in the discovery of new Mersenne primes (he has employed IBDWTs of length $3 \cdot 2^k$).
9.48. Implement the ideas in [Percival 2003], where Algorithm 9.5.19 is generalized for arithmetic modulo Proth numbers $k \cdot 2^n \pm 1$. The essential idea is that working modulo a number a ± b can be done with good error control, as long as the prime product $\prod_{p \mid ab} p$ is sufficiently small. In the Percival approach, one generalizes the variable-base representation of Theorem 9.5.18 to involve products over prime powers in the form
$$x = \sum_{j=0}^{D-1} x_j \prod_{p^k \,\|\, a} p^{\lceil kj/D \rceil} \prod_{q^m \,\|\, b} q^{\lceil -mj/D \rceil + mj/D},$$
for fast arithmetic modulo a − b.
Note that the marriage of such ideas with the fast mod operation of Algorithm 9.2.14 would result in an efficient union for computations that need to move away from the restricted theme of Mersenne/Fermat numbers. Indeed, as evidenced in the generalized Fermat number searches described in [Dubner and Gallot 2002], wedding bells have already sounded.

9.49. In the FFT literature there exists an especially efficient real-signal transform called the Sorenson FFT. This is a split-radix transform that uses $\sqrt{2}$ and a special decimation scheme to achieve essentially the lowest-complexity FFT known for real signals; although in modern times the issues of memory, machine cache, and processor features are so overwhelming that sheer complexity counts have fallen to a lesser status. Now, for the ring $\mathbf{Z}_n$ with $n = 2^m + 1$ and m a multiple of 4, show that a square root of 2 is given by
$$\sqrt{2} = 2^{3m/4} - 2^{m/4}.$$
Then, determine whether a Sorenson transform modulo n can be done simply by using what is now the standard Sorenson routine but with $\sqrt{2}$ interpreted as above. (Detailed coding for a Sorenson real-signal FFT is found in [Crandall 1994b].)
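The square-root identity in Exercise 9.49 is easy to spot-check before proving it. The sketch below computes the claimed root modulo n = 2^m + 1 and leaves the squaring check to the caller; the function name `sqrt2_mod` is hypothetical.

```python
def sqrt2_mod(m):
    """Return (r, n) with n = 2^m + 1 and r = 2^(3m/4) - 2^(m/4) mod n."""
    if m <= 0 or m % 4 != 0:
        raise ValueError("m must be a positive multiple of 4")
    n = (1 << m) + 1
    r = ((1 << (3 * m // 4)) - (1 << (m // 4))) % n
    return r, n
```

For m = 4 this gives r = 6 and n = 17, and indeed 6 · 6 = 36 ≡ 2 (mod 17). The general proof follows the same pattern: expand the square and use $2^m \equiv -1 \pmod{n}$.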
9.50. Study the transform that has the usual DFT form
$$X_k = \sum_{j=0}^{N-1} x_j h^{-jk},$$
except that the signal elements $x_j$ and the root h of order N exist in the field $\mathbf{Q}(\sqrt{5})$. This has been called a number-theoretical transform (NTT) over the “golden section quadratic field,” because the golden mean $\varphi = (\sqrt{5} - 1)/2$ is in the field. Assume that we restrict further to the ring $\mathbf{Z}[\varphi]$ so that the signal elements and the root are of the form a + bφ with a, b integers. Argue first that a multiplication in the domain takes three integer multiplies. Then consider the field $\mathbf{F}_p(\sqrt{5})$ and work out a theory for the possible length of such transforms over that field, when the root is taken to be a power of the golden mean φ. Then, consider the transform (N is even)
$$\mathbf{X}_k = \sum_{j=0}^{N/2-1} H^{-jk} \mathbf{x}_j,$$
where the new signal vector is $\mathbf{x}_j = (a_j, b_j)$ and where the original signal component was $x_j = a_j + b_j \varphi$ in the field. Here, the matrix H is
$$H = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}.$$
Describe in what sense this matrix transform is equivalent to the DFT definition preceding, that the powers of H are given conveniently in terms of Fibonacci numbers
$$H^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix},$$
and that this n-th power can be computed in divide-and-conquer fashion in O(ln n) matrix multiplications. In conclusion, derive the complexity of this matrix-based number-theoretical transform.
This example of exotic transforms, being reminiscent of the discrete Galois transform (DGT) of the text, appears in [Dimitrov et al. 1995], [Dimitrov et al. 1998], and has actually been proposed as an idea for obtaining meaningful spectra—in a discrete field, no less—of real-valued, real-world data.
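The divide-and-conquer powering of H mentioned in Exercise 9.50 can be sketched with ordinary binary exponentiation on 2 × 2 matrices. The helper names `mat_mul` and `mat_pow` and the optional `mod` parameter are hypothetical; the reduction modulo a prime is included only because the exercise also considers the transform over a finite field.

```python
def mat_mul(A, B, mod=None):
    """Multiply two 2x2 integer matrices, optionally reducing modulo mod."""
    C = [[A[0][0] * B[0][0] + A[0][1] * B[1][0],
          A[0][0] * B[0][1] + A[0][1] * B[1][1]],
         [A[1][0] * B[0][0] + A[1][1] * B[1][0],
          A[1][0] * B[0][1] + A[1][1] * B[1][1]]]
    if mod is not None:
        C = [[v % mod for v in row] for row in C]
    return C

def mat_pow(H, n, mod=None):
    """Return H^n for n >= 0 using O(log n) matrix multiplications."""
    result = [[1, 0], [0, 1]]           # identity matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, H, mod)
        H = mat_mul(H, H, mod)          # repeated squaring
        n >>= 1
    return result
```

For instance, `mat_pow([[1, 1], [1, 0]], 10)` yields `[[89, 55], [55, 34]]`, that is, the Fibonacci numbers $F_{11}$, $F_{10}$, $F_9$ in the pattern displayed in the exercise.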
9.51. Pursuant to Algorithm 9.5.22 for cyclic convolution, work out a similar algorithm for negacyclic integer convolution via a combined DGT/DWT method, with halved run length, meaning you want to convolve two real integer sequences each of length D, via a complex DGT of length D/2. You would need to establish, for a relevant weighted convolution of length D/2, a (D/2)-th root of i in a field $\mathbf{F}_{p^2}$ with p a Mersenne prime. Details that may help in such an implementation can be found in [Crandall 1997b].
9.52. Study the so-called Fermat number transform (FNT) defined by
$$X_k = \sum_{j=0}^{D-1} x_j g^{-jk} \pmod{f_n},$$
where $f_n = 2^n + 1$ and g has multiplicative order D in $\mathbf{Z}_{f_n}$. A useful choice is g a power of two, in which case, what are the allowed signal lengths D? The FNT has the advantage that the internal butterflies of a fast implementation involve multiply-free arithmetic, but the distinct disadvantage of restricted signal lengths. A particular question is: Are there useful applications of the FNT in computational number theory, other than the appearance in the Schönhage Algorithm 9.5.23?
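To experiment with the transform of Exercise 9.52, a direct (quadratic-time) evaluation of the defining sum is enough. The sketch below assumes the modulus f is prime (for example f = 257 = 2^8 + 1 with g = 2 of order D = 16) so that the inverse transform exists; the names `fnt` and `cyclic_convolution_fnt` are hypothetical. It makes no use of the multiply-free butterflies that give the FNT its appeal, and the modular inverse via `pow(D, -1, f)` requires Python 3.8 or later.

```python
def fnt(x, g, f, inverse=False):
    """Direct O(D^2) number-theoretic transform of x modulo f with root g.

    g is assumed to have multiplicative order D = len(x) modulo f.
    """
    D = len(x)
    sign = 1 if inverse else -1
    # Exponents are reduced modulo D, which is valid because g^D = 1 (mod f).
    X = [sum(x[j] * pow(g, (sign * j * k) % D, f) for j in range(D)) % f
         for k in range(D)]
    if inverse:
        D_inv = pow(D, -1, f)
        X = [(D_inv * v) % f for v in X]
    return X

def cyclic_convolution_fnt(x, y, g, f):
    """Cyclic convolution of x and y modulo f via the transform."""
    X, Y = fnt(x, g, f), fnt(y, g, f)
    return fnt([(a * b) % f for a, b in zip(X, Y)], g, f, inverse=True)
```

With g = 2 every twiddle factor is a power of two, so in a fast implementation each multiplication by a power of g becomes a shift followed by a reduction modulo $2^n + 1$.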
9.53. In an algorithm such as Algorithm 9.5.7 one may wish to invoke an efficient transpose. This is not hard to do if the matrix is square, but otherwise, the problem is nontrivial. Note that the problem is again trivial, for any matrix, if one is allowed to copy the original matrix, then write it back in transpose order. However, this can involve long memory jumps, which are not necessary, as well as all the memory for the copy.
So, work out an algorithm for a general in-place transpose, that is, no matrix copy allowed, trying to keep everything as “local” as possible, meaning you want in some sense minimal memory jumps. Some references are [Van Loan 1992], [Bailey 1990].
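One baseline for Exercise 9.53 is the classical cycle-following transpose, sketched below for a row-major matrix stored in a flat list. It uses no matrix copy and no bookkeeping array, but it makes no attempt at locality: the cycles jump all over memory, and the leader test makes the running time superlinear in the worst case. It is therefore a starting point to improve on, not an answer to the exercise. The function name `transpose_in_place` is hypothetical.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list a."""
    n = rows * cols
    if n <= 1:
        return
    m = n - 1
    # The element at index i (0 <= i < m) must move to index (i * rows) % m;
    # the first and last elements stay where they are.
    for start in range(1, m):
        # Process each cycle once, from its smallest index (the "leader").
        j = (start * rows) % m
        while j > start:
            j = (j * rows) % m
        if j < start:
            continue                     # cycle already handled
        # Rotate the cycle, carrying one element at a time.
        val = a[start]
        j = (start * rows) % m
        while j != start:
            a[j], val = val, a[j]
            j = (j * rows) % m
        a[start] = val
```

For a 2 × 3 matrix stored as `[0, 1, 2, 3, 4, 5]`, the call `transpose_in_place(a, 2, 3)` leaves `a` equal to `[0, 3, 1, 4, 2, 5]`, which is the 3 × 2 transpose in row-major order.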
9.54. By analyzing the respective complexities of the steps of Algorithm 9.5.8, (1) show that the complexity claim of the text holds for calculating $X_k$; (2) give more precise information about the implied big-O constant in the bound (9.24); (3) prove the inequality (9.25); and (4) explain how the inequality leads to the claimed complexity estimate for the algorithm.
The interested reader might investigate/improve on the clever aspect of the original Dutt–Rokhlin method, which was to expand an oscillation $e^{icz}$ in a Gaussian series [Dutt and Rokhlin 1993]. There have
9.55. Rewrite Algorithm 9.5.12 to employ balanced-digit representation (Definition 9.1.2). Note that the important changes center on the carry adjustment step. Study the phenomenon raised in the text after the algorithm, namely, that of reduced error in the balanced option. There exist some numerical studies of this, together with some theoretical conjectures (see [Crandall and Fagin 1994], [Crandall et al. 1999] and references therein), but very little is known in the way of error bounds that are both rigorous and pragmatic.
9.56. Show that if $p = 2^q - 1$ with q odd and $x \in \{0, \ldots, p - 1\}$, then $x^2 \bmod p$ can be calculated using two size-(q/2) multiplies. Hint: Represent $x = a + b\,2^{(q+1)/2}$ and relate the result of squaring x to the numbers
(a + b)(a + 2b) and (a − b)(a − 2b).
This interesting procedure gives nothing really new (because we already know that squaring, in the grammar-school range, is about half as complex as multiplication), but the method here is a different way to get the speed doubling, and furthermore does not involve microscopic intervention into the squaring loops as discussed for equation (9.3).
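The hint in Exercise 9.56 can be turned into a short sketch. Writing S = (a + b)(a + 2b) and T = (a − b)(a − 2b), one has S + T = 2a² + 4b² and S − T = 6ab, while $2^{q+1} \equiv 2 \pmod{p}$, so that x² ≡ (S + T)/2 + 2^{(q+3)/2}(S − T)/6. The function name `mersenne_square` is hypothetical, and for brevity the final reduction uses Python's `%` rather than the shift-and-add reduction one would use in practice.

```python
def mersenne_square(x, q):
    """Return x^2 mod (2^q - 1), q odd, using two half-size multiplies."""
    p = (1 << q) - 1
    h = (q + 1) // 2
    a = x & ((1 << h) - 1)          # low part:  x = a + b * 2^h
    b = x >> h                      # high part
    S = (a + b) * (a + 2 * b)       # first half-size multiply
    T = (a - b) * (a - 2 * b)       # second half-size multiply
    squares = (S + T) // 2          # equals a^2 + 2 b^2 exactly
    cross = (S - T) // 6            # equals a * b exactly
    return (squares + (cross << ((q + 3) // 2))) % p
```

The divisions by 2 and 6 are exact, so no modular inverses are needed.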
9.57. Do there always exist primes $p_1, \ldots, p_r$ required in Algorithm 9.5.20, and how does one find them?
9.58. Prove, as suggested by the statement of Algorithm 9.5.20, that any convolution element of x × y in that algorithm is indeed bounded by $NM^2$. For application to large-integer multiplication, can one invoke balanced representation ideas, that is, considering any integer (mod p) as lying in [−(p + 1)/2, (p − 1)/2], to lower the bounding requirements, hence possibly reducing the set of CRT primes?
9.59. For the discrete, prime-based transform (9.33) in cases where g has a square root, $h^2 = g$, answer precisely: What is a closed form for the transform element $X_k$ if the input signal is defined $x = (h^{j^2})$, j = 0, . . . , p − 1? Noting the peculiar simplicity of the $X_k$, find an analogous signal x having N elements in the complex domain, for which the usual, complex-valued FFT has a convenient property for the magnitudes $|X_k|$. (Such a signal is called a “chirp” signal and has high value in testing FFT routines, which must, of course, exhibit a numerical manifestation of the special magnitude property.)
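As a numerical companion to Exercise 9.59, the sketch below takes one natural candidate for the complex chirp, namely $x_m = e^{i\pi m^2/N}$ with N even (the direct analogue of $h^{m^2}$ with $h^2 = e^{2\pi i/N}$), and evaluates the magnitudes of its DFT by the defining sum. The choice of signal and the function names are assumptions made for illustration; the exercise itself asks the reader to find and justify the property.

```python
import cmath

def chirp(N):
    """Return the length-N chirp x_m = exp(i*pi*m^2/N); N is assumed even."""
    return [cmath.exp(1j * cmath.pi * m * m / N) for m in range(N)]

def dft(x):
    """Direct O(N^2) DFT, X_k = sum over m of x_m * exp(-2*pi*i*m*k/N)."""
    N = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / N) for m in range(N))
            for k in range(N)]

def chirp_dft_magnitudes(N):
    """Return the list of |X_k| for the chirp of length N."""
    return [abs(X) for X in dft(chirp(N))]
```

For N = 16 every magnitude comes out equal to 4 up to rounding, that is, $\sqrt{N}$; an FFT routine under test should reproduce the same flat spectrum.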
9.60. For the Mersenne prime $p = 2^{127} - 1$, exhibit an explicit primitive 64-th root of unity a + bi in $\mathbf{F}_{p^2}$.
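One way to approach Exercise 9.60 computationally is to raise a trial element a + bi to the power $(p^2 - 1)/64$ and test whether the result has exact order 64. The sketch below does this with hand-rolled Gaussian arithmetic modulo p; all names (`P`, `gmul`, `gpow`, `primitive_root_of_unity`) and the search bounds are hypothetical. The order test used, $w^{\text{order}/2} = -1$, is sufficient only because the order is a power of two.

```python
P = (1 << 127) - 1                      # the Mersenne prime 2^127 - 1

def gmul(u, v, p=P):
    """Multiply u = (a, b) and v = (c, d), meaning a + bi and c + di, mod p."""
    a, b = u
    c, d = v
    return ((a * c - b * d) % p, (a * d + b * c) % p)

def gpow(u, e, p=P):
    """Raise u to the nonnegative power e by binary exponentiation."""
    result = (1, 0)
    while e:
        if e & 1:
            result = gmul(result, u, p)
        u = gmul(u, u, p)
        e >>= 1
    return result

def primitive_root_of_unity(order=64, p=P):
    """Search small a + bi for a primitive root of unity of power-of-two order."""
    assert (p * p - 1) % order == 0
    e = (p * p - 1) // order
    for a in range(1, 50):
        for b in range(1, 50):
            w = gpow((a, b), e, p)
            if gpow(w, order // 2, p) == (p - 1, 0):   # w^(order/2) = -1
                return w
    return None
```

The returned pair (a, b) represents a + bi; the trial element succeeds exactly when it is a nonsquare in $\mathbf{F}_{p^2}$.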
9.61. Show that if a + bi is a primitive root of maximum order $p^2 - 1$ in $\mathbf{F}_{p^2}$ (with p ≡ 3 (mod 4), so that “i” exists), then $a^2 + b^2$ must be a primitive root of maximum order p − 1 in $\mathbf{F}_p$. Is the converse true?
Give some Mersenne primes $p = 2^q - 1$ for which 6 + i is a primitive root in $\mathbf{F}_{p^2}$.
9.62. Prove that the DGT integer convolution Algorithm 9.5.22 works.