
Prime Numbers
Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC
Thus, the sequence of iterates 2^b, f(2^b), f(f(2^b)), . . . that the algorithm produces is strictly increasing until a value s is reached with c ≤ s < c + 2. The number r sent to Step [Adjust result] is r = f(s). If c ≥ 4, we also have c ≤ r < c + 2. But c ≥ 4 unless N = 1 or 2. In these cases, in fact whenever N is a power of 2, the algorithm terminates immediately with the value r = N. Thus, the algorithm always terminates with the number c, as claimed.
We remark that the number of steps through the Newton iteration in Algorithm 9.2.8 is O(ln(b + 1)) = O(ln ln(N + 2)). In addition, the number of iterations for the while loop in step [Adjust result] is at most 2.
Armed with the iteration for the generalized reciprocal, we can proceed to develop a mod operation that itself involves only multiplies, adds, subtracts, and binary shifts.
Algorithm 9.2.10 (Division-free mod). This algorithm returns x mod N and ⌊x/N⌋, for any nonnegative integer x. The only precalculation is to have established the generalized reciprocal R = R(N). This precalculation may be done via Algorithm 9.2.8.
1. [Initialize]
   s = 2(B(R) − 1); div = 0;
2. [Perform reduction loop]
   d = ⌊xR/2^s⌋;
   x = x − Nd;
   if(x ≥ N) {
      x = x − N;
      d = d + 1;
   }
   div = div + d;
   if(x < N) return (x, div);   // x is the mod, div is the div.
   goto [Perform reduction loop];
This algorithm is essentially the Barrett method [Barrett 1987], although it is usually stated for a commonly encountered range on x, namely 0 ≤ x < N^2. But we have lifted this restriction, by recursively using the basic formula

x mod N ≈ x − N⌊xR/2^s⌋,   (9.11)
where by “≈” we mean that for appropriate choice of s, the error in this relation is a small multiple of N. There are many enhancements possible to Algorithm 9.2.10, where we have chosen a specific number of bits s by which one is to right-shift. There are other interesting choices for s; indeed, it has been observed [Bosselaers et al. 1994] that there are certain advantages to “splitting up” the right-shifts like so:

x mod N ≈ x − N⌊R⌊x/2^(b−1)⌋/2^(b+1)⌋,   (9.12)
9.2 Enhancements to modular arithmetic
where b = B(R) − 1. In particular, such splitting can render the relevant multiplications somewhat simpler. In fact, one sees that

⌊R⌊x/2^(b−1)⌋/2^(b+1)⌋ = ⌊x/N⌋ − j   (9.13)
for j = 0, 1, or 2. Thus using the left-hand side for d in Algorithm 9.2.10 involves at most two passes through the while loop. And there is an apparent savings in time, since the length of x can be about 2b, and the length of R about b. Thus the multiplication xR in Algorithm 9.2.10 is about 2b × b bits, while the multiplication inherent in (9.12) is only about b × b bits. Because a certain number of the bits of xR are destined to be shifted into oblivion (a shift completely obscures the relevant number of lower-order bits), one can intervene in the usual grammar-school multiply loop, effectively cutting the aforementioned rhombus into a smaller tableau of values. With considerations like this, it can be shown that for 0 ≤ x < N^2, the complexity of the x mod N operation is asymptotically (large N) the same as a size-N multiply. Alternatively, the complexity of the common operation (xy) mod N, where 0 ≤ x, y < N, is that of two size-N multiplies.
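The estimate (9.13) is easy to check numerically. In the hedged Python sketch below (function name invented, R again formed by one ordinary division for illustration), the left-hand side of (9.13) is computed with the split shifts of (9.12); the identification b = B(R) − 1 = B(N) assumed here holds when N is not a power of 2.

```python
def split_shift_quotient(x, N):
    """Left-hand side of (9.13): floor(R*floor(x/2^(b-1))/2^(b+1)),
    where R = floor(4^B(N)/N) and b = B(R) - 1."""
    R = 4**N.bit_length() // N     # generalized reciprocal, for illustration
    b = R.bit_length() - 1
    return (R * (x >> (b - 1))) >> (b + 1)
```

For 0 ≤ x < N^2 the true quotient ⌊x/N⌋ should exceed this value by j = 0, 1, or 2, as (9.13) asserts.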
Studies have been carried out for the classical long divide (Algorithm D [Knuth 1981]), Montgomery and Barrett methods [Bosselaers et al. 1994], [Montgomery 1985], [Arazi 1994], [Koç et al. 1996]. There would seem to be no end to new div-mod algorithms; for example, there is a sign estimation technique of [Koç and Hung 1997], suitable for cryptographic operations (such as exponentiation) when operands are large. While both the Montgomery and (properly refined) Barrett methods are asymptotically of the same complexity, specific implementations of the methods reveal ranges of operands for which a particular approach is superior. In cryptographic applications, the Montgomery method is sometimes reported to be slightly superior to the Barrett method. One reason for this is that reaching the asymptotically best complexity for the Montgomery method is easier than for the Barrett method, the latter requiring intervention into the loop detail. However, there are exceptions; for example, [De Win et al. 1998] ended up adopting the Barrett method for their research purposes, presumably because of its ease of implementation (at the slightly suboptimal level), and its essential competitive equality with the Montgomery method. It is also the case that the inverses required in the Montgomery method can be problematic for very large operands. There is also the fact that if one wants just one mod operation (as opposed to a long exponentiation ladder), the Montgomery method is contraindicated. It would appear that a very good choice for general, large-integer arithmetic is the symbiotic combination of our Algorithms 9.2.8 and 9.2.10. In factorization, for example, one usually performs (xy) mod N so very often for a stable N, that a single calculation of the generalized reciprocal R(N) is all that is required to set up the division-free mod operations.
We mention briefly some new ideas in the world of divide/mod algorithms. One idea is due to G. Woltman, who found ways to enhance the Barrett divide Algorithm 9.2.10 in the (practically speaking) tough case when x is much greater than a relatively small N . One of his enhancements is to change

precision modes in such cases. Another new development is an interesting Karatsuba-like recursive divide, in [Burnikel and Ziegler 1998]. The method has the interesting property that the complexities of finding the div or just a mod result are not quite the same.
Newton methods apply beyond the division problem. Just one example is the important computation of ⌊√N⌋. One may employ a (real domain) Newton iteration for √a in the form

x_(n+1) = x_n/2 + a/(2x_n),   (9.14)
to forge an algorithm for integer parts of square roots:
Algorithm 9.2.11 (Integer part of square root). This algorithm returns ⌊√N⌋ for positive integer N.
1. [Initialize]
   x = 2^⌈B(N)/2⌉;
2. [Perform Newton iteration]
   y = ⌊(x + ⌊N/x⌋)/2⌋;
   if(y ≥ x) return x;
   x = y;
   goto [Perform Newton iteration];
We may use Algorithm 9.2.11 to test whether a given positive integer N is a square. After x = ⌊√N⌋ is computed, we do one more step and check whether x^2 = N. This equation holds if and only if N is a square. Of course, there are other ways to rule out very quickly whether N is a perfect square, for example to test instances of the Legendre symbol (N/p) for various small primes p, or the residue of N modulo 8.
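For instance, the modulo-8 filter rejects most non-squares immediately, since squares are congruent to 0, 1, or 4 (mod 8). A hedged sketch, using Python's built-in math.isqrt in place of Algorithm 9.2.11:

```python
import math

def is_square(N):
    """Test whether positive integer N is a perfect square."""
    if N % 8 not in (0, 1, 4):
        return False              # squares are 0, 1, or 4 (mod 8)
    x = math.isqrt(N)             # or the isqrt of Algorithm 9.2.11
    return x * x == N
```

The filter disposes of five of the eight residue classes before any square root is attempted.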
It can be argued that Algorithm 9.2.11 requires O(ln ln N ) iterations to terminate. There are many interesting complexity issues with this and other Newton method applications. Specifically, it is often lucrative to change dynamically the working precision as the Newton iteration progresses, or to modify the very Newton loops (see Exercises 9.14 and 4.11).
9.2.3 Moduli of special form
Considerable efficiency in the mod operation can be achieved when the modulus N is of special form. The Barrett method of the previous section is fast because it exploits mod 2^q arithmetic. In this section we shall see that if the modulus N is close to a power of 2, one can exploit the binary nature of modern computers and carry out the arithmetic very efficiently. In particular,
forms

N = 2^q + c,

where |c| is in some sense “small” (but c is allowed to be negative), admit efficient mod N operations. These enhancements are especially important in the studies of Mersenne primes p = 2^q − 1 and Fermat numbers F_n = 2^(2^n) + 1,
although the techniques we shall describe apply equally well to general moduli 2^q ± 1, any q. That is, whether or not the modulus N has additional properties of primality or special structure is of no consequence for the mod algorithm of this section. A relevant result is the following:
Theorem 9.2.12 (Special-form modular arithmetic). For N = 2^q + c, c an integer, q a positive integer, and for any integer x,

x ≡ (x mod 2^q) − c⌊x/2^q⌋ (mod N).   (9.15)
Furthermore, in the Mersenne case c = −1, multiplication by 2^k modulo N is equivalent to left-circular shift by k bits (so if k < 0, this is right-circular shift). For the Fermat case c = +1, multiplication by 2^k, k positive, is equivalent to (−1)^⌊k/q⌋ times the left-circular shift by k bits, except that the excess shifted bits are to be negated and carry-adjusted.
As they are easiest to analyze, let us discuss the final statements of the theorem first. Since

2^k = 2^(k mod q) (2^q)^⌊k/q⌋,

and also 2^q ≡ −c (mod N), the statements are really about k ∈ [1, q − 1] and negatives of such k. As examples, take N = 2^17 − 1 = 131071 = 11111111111111111₂, x = 8977 = 10001100010001₂, and consider the product 2^5 x (mod N). This will be the left-circular shift of x by 5 bits, or 110001000100010₂ = 25122, which is the correct result. Incidentally, these results on multiplication by powers of 2 are relevant for certain number-theoretical transforms and other algorithms. In particular, discrete Fourier transform arithmetic in the ring Z_n with n = 2^m + 1 can proceed—on the basis of shifting rather than explicit multiplication—when the root in question is a power of 2.
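The Mersenne shift rule is easily rendered in code. The following Python sketch (names ours) treats x as a q-bit register and performs a left-circular shift of k mod q bits:

```python
def mersenne_mul_pow2(x, k, q):
    """(x * 2^k) mod (2^q - 1), computed as a left-circular shift of the
    q-bit register holding x (Mersenne case of Theorem 9.2.12).
    Assumes 0 <= x < 2^q - 1."""
    N = (1 << q) - 1          # the Mersenne modulus, also the q-bit mask
    k %= q                    # 2^q == 1 (mod N), so only k mod q matters
    return ((x << k) | (x >> (q - k))) & N
```

No multiplication occurs at all; the product is two shifts, an or, and a mask.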
The first result of Theorem 9.2.12 allows us to calculate x mod N very rapidly, on the basis of the “smallness” of c. Let us first give an example of the computation of x = 13000 modulo the Mersenne prime N = 2^7 − 1 = 127. It is illuminating to cast in binary: 13000 = 11001011001000₂, then proceed via the theorem to split up x easily into two parts whenever it exceeds N (all congruences here are with respect to modulus N):

x ≡ 11001011001000 mod 10000000 + ⌊11001011001000/10000000⌋
  ≡ 1001000 + 1100101 ≡ 10101101 ≡ 101101 + 1 ≡ 101110.
As the result 101110₂ = 46 < N, we have achieved the desired value of 13000 mod 127 = 46. The procedure is thus especially simple for the Mersenne cases N = 2^q − 1; namely, one takes the “upper” bits of x (meaning the bits from the 2^q position and up, inclusive) and adds these to the “lower” bits (meaning the lower q bits of x). The general procedure runs as follows, where we adopt for convenience the bitwise “and” operator & and right-shift >>, left-shift << operators:
Algorithm 9.2.13 (Fast mod operation for special-form moduli). Assume modulus N = 2^q + c, with B(|c|) < q. This algorithm returns x mod N for x > 0. The method is generally more efficient for smaller |c|.
1. [Perform reduction]
   while(B(x) > q) {
      y = x >> q;          // Right-shift does ⌊x/2^q⌋.
      x = x − (y << q);    // Or x = x & (2^q − 1), or x = x mod 2^q.
      x = x − cy;
   }
   if(x == 0) return x;
2. [Adjust]
   s = sgn(x);             // Defined as −1, 0, 1 as x <, =, > 0.
   x = |x|;
   if(x ≥ N) x = x − N;
   if(s < 0) x = N − x;
   return x;
It is not hard to show that this algorithm terminates and gives the result x mod N .
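In Python the whole of Algorithm 9.2.13 is only a few lines (a sketch; Python's arbitrary-precision integers stand in for the multiprecision arithmetic, and B(x) is the bit length of |x|):

```python
def special_mod(x, N, q, c):
    """x mod N for N = 2^q + c with B(|c|) < q and x > 0 (Algorithm 9.2.13):
    the loop uses only shifts, a mask, and multiplies by the small c."""
    assert N == (1 << q) + c
    mask = (1 << q) - 1
    while abs(x).bit_length() > q:   # while B(x) > q
        y = x >> q                   # floor(x / 2^q) (arithmetic shift)
        x = (x & mask) - c * y       # (x mod 2^q) - c*y
    if x == 0:
        return 0
    s = 1 if x > 0 else -1           # s = sgn(x)
    x = abs(x)
    if x >= N:
        x -= N
    if s < 0:
        x = N - x
    return x
```

Note that intermediate x may go negative when c > 0; the adjust step restores the canonical residue.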
Because the method involves nothing but “small” multiplications (by c), applications are widespread. Modern discoveries of new Mersenne primes have used this mod method in the course of extensive Lucas–Lehmer primality testing. There is even a patented encryption scheme based on elliptic curves over fields F_(p^k), where p = 2^q + c, and if extra efficiency is desired, p ≡ −1 (mod 4) (for example, p can be any Mersenne prime, or a prime 2^q + 7, and so on), with elliptic algebra performed on the basis of essentially negligible mod operations [Crandall 1994a]. Such fields have been called optimal extension fields (OEFs), and further refinements can be achieved by adroit choice of the exponent k and irreducible polynomial for the F_(p^k) arithmetic. It is also true of such elliptic curves that curve order can be assessed more quickly by virtue of the fast mod operation. Yet another application of the special mod reduction is in the factorization of Fermat numbers. The method has been used in the recent discoveries of new factors of the F_n for n = 13, 15, 16, 18 [Brent et al. 2000]. For such large Fermat numbers, machine time is so extensive that any algorithmic enhancements, whether for mod or other operations, are always welcome. In recent times the character of even larger F_n has been assessed in this way, where now the Pepin primality test involves a great many (mod F_n) operations. The proofs that F_22, F_24 are composite used the special-form mod of this section [Crandall et al. 1995], [Crandall et al. 1999], together with fast multiplication discussed later in the chapter.
It is interesting that one may generalize the special-form fast arithmetic yet further. Consider numbers of the Proth form:

N = k · 2^q + c.
We next give a fast modular reduction technique from [Gallot 1999], which is suitable in cases where k and c are low-precision (e.g., single-word) parameters:
Algorithm 9.2.14 (Fast mod operation for Proth moduli). Assume modulus N = k · 2^q + c, with bit length B(|c|) < q (and c can be negative or zero). This algorithm returns x mod N for 0 < x < N^2. The method is generally more efficient for smaller k, |c|.
1. [Define a useful shift-add function n]
   n(y) {
      return Ny;   // But calculate rapidly, as: Ny = ((ky) << q) + cy.
   }
2. [Approximate the quotient]
   y = ⌊(x >> q)/k⌋;
   t = n(y);
   if(c < 0) goto [Polarity switch];
   while(t > x) {
      y = y − 1;
      t = n(y);
   }
   return x − t;
3. [Polarity switch]
   while(t ≤ x) {
      y = y + 1;
      t = n(y);
   }
   y = y − 1;
   t = n(y);
   return x − t;
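A Python rendering of the Proth reduction (a sketch with invented names; for small k, |c| the while loops correct the initial quotient estimate in only a few steps):

```python
def proth_mod(x, N, k, q, c):
    """x mod N for N = k*2^q + c, 0 <= x < N^2 (spirit of Algorithm 9.2.14).
    The shift-add helper n(y) avoids any full-size division by N."""
    assert N == k * (1 << q) + c and 0 <= x < N * N

    def n(y):
        return ((k * y) << q) + c * y   # = N*y, via a shift and small multiplies

    y = (x >> q) // k                   # approximate quotient floor(x/N)
    t = n(y)
    if c < 0:                           # y underestimates: step upward
        while t <= x:
            y += 1
            t = n(y)
        y -= 1
        t = n(y)
        return x - t
    while t > x:                        # y may overestimate: step downward
        y -= 1
        t = n(y)
    return x - t
```

The only division is the small one by k in forming the first estimate; everything else is shifts and multiplies by the low-precision parameters k and c.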
This kind of clever reduction is now deployed in software that has achieved significant success in the discoveries of, as just two examples, new factors of Fermat numbers, and primality proofs for Proth primes.
9.3 Exponentiation
Exponentiation, or powering, is especially important in prime number and factorization studies, for the simple reason that so many known theorems involve the operation x^y, or most commonly x^y (mod N). In what follows, we give various algorithms that efficiently exploit the structure of the exponent y, and sometimes the structure of x. We have glimpsed already in Section 2.1.2, Algorithm 2.1.5, an important fact: While it is certainly true that something like (x^y) mod N can be evaluated with (y − 1) successive multiplications (mod N), there is generally a much better way to compute powers. This is to use what is now a commonplace computational technique, the powering ladder, which can be thought of as a nonrecursive (or “unrolled”) realization of equivalent, recursive algorithms. But one can do more, via such means as preprocessing the bits of the exponent, using alternative base expansions for
the exponent, and so on. Let us first summarize the categories of powering ladders:
(1) Recursive powering ladder (Algorithm 2.1.5).
(2) Left-right and right-left “unrolled” binary ladders.
(3) Windowing ladders, to take advantage of certain bit patterns or of alternative base expansions, a simple example of which being what is essentially a ternary method in Algorithm 7.2.7, step [Loop over bits . . .], although one can generally do somewhat better [Müller 1997], [De Win et al. 1998], [Crandall 1999b].
(4) Fixed-x ladders, to compute x^y for various y but fixed x.
(5) Addition chains and Lucas ladders, as in Algorithm 3.6.7, interesting references being such as [Montgomery 1992b], [Müller 1998].
(6) Modern methods based on actual compression of exponent bit-streams, as in [Yacobi 1999].
The current section starts with basic binary ladders (and even for these, various options exist); then we turn to the windowing, alternative-base, and fixed-x ladders.
9.3.1 Basic binary ladders
We next give two forms of explicit binary ladders. The first, a “left-right” form (equivalent to Algorithm 2.1.5), is comparable in complexity (except when arguments are constrained in certain ways) to a second, “right-left” form.
Algorithm 9.3.1 (Binary ladder exponentiation (left-right form)). This algorithm computes x^y. We assume the binary expansion (y_0, . . . , y_(D−1)) of y > 0, where y_(D−1) = 1 is the high bit.
1. [Initialize]
   z = x;
2. [Loop over bits of y, starting with next-to-highest]
   for(D − 2 ≥ j ≥ 0) {
      z = z^2;               // For modular arithmetic, do mod N here.
      if(y_j == 1) z = zx;   // For modular arithmetic, do mod N here.
   }
   return z;
This algorithm constructs the power x^y by running through the bits of the exponent y. Indeed, the number of squarings is (D − 1), and the number of operations z = zx is clearly one less than the number of 1 bits in the exponent y. Note that the operations turn out to be those of Algorithm 2.1.5. A mnemonic for remembering which of the left-right or right-left ladder forms is equivalent to the recursive form is to note that both Algorithms 9.3.1 and 2.1.5 involve multiplications exclusively by the steady multiplier x.
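In Python the left-right ladder might be sketched as follows (names ours; the optional modulus argument realizes the “do mod N here” comments):

```python
def pow_left_right(x, y, N=None):
    """Left-right binary ladder (Algorithm 9.3.1): (D-1) squarings plus
    one multiply by the fixed base x for each 1-bit below the top bit."""
    assert y > 0
    z = x
    for j in range(y.bit_length() - 2, -1, -1):   # bits D-2 down to 0
        z = z * z
        if (y >> j) & 1:
            z = z * x
        if N is not None:
            z %= N
    return z
```

Because the multiplier is always the fixed base x, a small base (such as x = 2) makes the multiply step especially cheap.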
But there is a kind of complementary way to effect the powering. This alternative is exemplified in the relation

x^13 = x (x^2)^2 (x^4)^2,

where there are again 2 multiplications and 3 squarings (because x^4 was actually obtained as the middle term (x^2)^2). In fact, in this example we see more directly the binary expansion of the exponent. The general formula would be

x^y = x^(Σ y_j 2^j) = x^(y_0) (x^2)^(y_1) (x^4)^(y_2) · · · ,   (9.16)
where the yj are the bits of y. The corresponding algorithm is a “right-left” ladder in which we keep track of successive squarings of x:
Algorithm 9.3.2 (Binary ladder exponentiation (right-left form)). This algorithm computes x^y. We assume the binary expansion (y_0, . . . , y_(D−1)) of y > 0, where y_(D−1) = 1 is the high bit.
1. [Initialize]
   z = x; a = 1;
2. [Loop over bits of y, starting with lowest]
   for(0 ≤ j < D − 1) {
      if(y_j == 1) a = za;   // For modular arithmetic, do mod N here.
      z = z^2;               // For modular arithmetic, do mod N here.
   }
   return az;                // For modular arithmetic, do mod N here.
This scheme can be seen to involve also (D − 1) squarings, and (except for the trivial multiply when a = z · 1 is first invoked) has the same number of multiplies as did the previous algorithm.
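The right-left scheme admits a similarly short sketch (again, names ours; the final multiply az supplies the high bit's contribution):

```python
def pow_right_left(x, y, N=None):
    """Right-left binary ladder (Algorithm 9.3.2): track successive
    squarings z = x^(2^j), folding z into a when bit j of y is set."""
    assert y > 0
    z, a = x, 1
    for j in range(y.bit_length() - 1):    # bits 0 .. D-2
        if (y >> j) & 1:
            a = a * z
        z = z * z
        if N is not None:
            a %= N
            z %= N
    r = a * z                              # high bit y_(D-1) = 1 contributes z
    return r % N if N is not None else r
```

Unlike the left-right form, the multiplicand z changes each round, so there is no advantage for a small fixed base here.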
Even though the operation counts agree on the face of it, there is a certain advantage to the first form given, Algorithm 9.3.1, for the reason that the operation z = zx involves a fixed multiplicand, x. Thus for example, if x = 2 or some other small integer, as might be the case in a primality test where we raise a small integer to a high power (mod N ), the multiply step can be fast. In fact, for x = 2 we can substitute the operation z = z + z, avoiding multiplication entirely for that step of the algorithm. Such an advantage is most telling when the exponent y is replete with binary 1’s.
These observations lead in turn to the issue of asymptotic complexity for ladders. This is a fascinating—and in many ways open—field of study. Happily, though, most questions about the fundamental binary ladders above can be answered. Let us adopt the heuristic notation that S is the complexity of squaring (in the relevant algebraic domain for exponentiation) and M is the complexity of multiplication. Evidently, the complexity C of one of the above ladders is asymptotically
C ≈ (lg y)S + HM,
where H denotes the number of 1’s in the exponent y. Since we expect about “half 1’s” in a random exponent, the average-case complexity is thus
C ≈ (lg y)S + ((1/2) lg y)M.
Note that using (9.4) one can often achieve S ≈ M/2, so reducing the expression for the average-case complexity of the above ladders to C ≈ (lg y)M. The estimate S ≈ M/2 is not a universal truth. For one thing, such an estimate assumes that modular arithmetic is not involved, just straight nonmodular squaring and multiplication. But even in the nonmodular world, there are issues. For example, with FFT multiplication (for very large operands, as described later in this chapter), the S/M ratio can be more like 2/3. With some practical (modular, grammar-school) implementations, the ratio S/M is about 0.8, as reported in [Cohen et al. 1998]. Whatever subroutines one uses, it is of course desirable to have fewer arithmetic operations to perform. As we shall see in the following section, it is possible to achieve further operation reduction.
9.3.2 Enhancements to ladders
In factorization studies and cryptography it is a rule of thumb that power ladders are used much of the time. In factorization, the so-called stage one of many methods involves almost nothing but exponentiation (in the case of ECM, elliptic multiplication is the analogue to exponentiation). In cryptography, the generation of public keys from private ones involves exponentiation, as do digital signatures and so on. It is therefore important to optimize powering ladders as much as possible, as these ladder operations dominate the computational effort in the respective technologies.
One interesting method for ladder enhancement is sometimes referred to as “windowing.” Observe that if we expand not in binary but in base 4, and we precompute powers x^2, x^3, then every time we encounter two bits of the exponent y, we can multiply by one of 1 = x^0, x^1, x^2, x^3 and then square twice to shift the current register to the left by two bits. Consider for example the task of calculating x^79, knowing that 79 = 1001111₂ = 1033₄. If we express the exponent y = 79 in base 4, we can do the power as
x^79 = ((x^4)^4 x^3)^4 x^3,
which takes 6S + 2M (recall nomenclature S, M for square and multiply). On the other hand, the left-right ladder Algorithm 9.3.1 does the power this way:
x^79 = (((x^(2^3) x)^2 x)^2 x)^2 x,
for a total effort of 6S + 4M, more than the effort for the base-4 method. We have not counted the time to precompute x^2, x^3 in the latter method, and so
the benefit is not so readily apparent. But a benefit would be seen in most cases if the exponent 79 were larger, as in many cryptographic applications.
There are many detailed considerations not yet discussed, but before we touch upon those let us give a fairly general windowing ladder that contains most of the applicable ideas:
|
|
|
|
|
||||
Algorithm 9.3.3 |
(Windowing ladder). |
This |
algorithm |
computes xy . We |
|||||
assume a base-(B |
= |
|
2b) expansion (as |
in |
Definition |
9.1.1), denoted by |
|||
(y |
0, . . . , yD−1) of y > |
0, with high digit |
|
= 0, |
so |
each digit satisfies |
|||
|
yD−1 d |
: |
1 |
< d < B; d odd} |
|||||
0 ≤ yi < B. We also assume that the values {x |
|||||||||
have been precomputed. |
|
|
|
|
|
|
|
||
1. [Initialize] |
|
|
|
|
|
|
|
|
|
|
z = 1; |
|
|
|
|
|
|
|
|
2. [Loop over digits] |
|
|
|
|
|
|
|
|
|
|
|
) |
{ |
|
|
|
|
|
|
|
for(D − 1 ≥ i ≥ 0c |
|
|
|
|
|
|
||
|
Express yi = 2 d, where d is odd or zero; |
|
|
|
|||||
|
z = z(xd)2c ; |
|
|
|
|
|
|
// xd from storage. |
|
|
if(i > 0) z = z2b ; |
|
|
|
|
|
|||
|
} |
|
|
|
|
|
|
|
|
|
return z; |
|
|
|
|
|
|
|
|
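The windowing ladder translates to Python as below (a sketch with invented names; only the odd powers x^3, x^5, . . . , x^(B−1) are tabled, and a digit y_i = 2^c d costs c extra squarings of the stored x^d):

```python
def pow_window(x, y, b=3):
    """Windowing ladder in the manner of Algorithm 9.3.3, base B = 2^b."""
    assert y > 0 and b >= 1
    B = 1 << b
    odd = {1: x}
    xx = x * x
    for d in range(3, B, 2):
        odd[d] = odd[d - 2] * xx               # precompute odd powers x^d
    digits = []                                 # base-B digits, low to high
    t = y
    while t:
        digits.append(t & (B - 1))
        t >>= b
    z = 1
    for i in range(len(digits) - 1, -1, -1):    # high digit down to low
        yi = digits[i]
        if yi:
            c = (yi & -yi).bit_length() - 1     # y_i = 2^c * d, d odd
            w = odd[yi >> c]
            for _ in range(c):
                w = w * w                       # (x^d)^(2^c)
            z = z * w
        if i > 0:
            for _ in range(b):
                z = z * z                       # z = z^(2^b)
    return z
```

For modular work one would reduce mod N after each multiply and square, exactly as in the binary ladders.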
To give an example of why only odd powers of x need to be precomputed, let us take the example of y = 262 = 406₈. Looking at this base-8 representation, we see that

x^262 = ((x^4)^8)^8 x^6,

but if x^3 has been precomputed, we can insert that x^3 at the proper juncture, and Algorithm 9.3.3 tells us to exponentiate like so:

x^262 = (((x^4)^8)^4 x^3)^2.
Thus, the precomputation is relegated to odd powers only. Another way to exemplify the advantage is in base 16, say, for which each of the 4-bit sequences 1100, 0110, 0011 in any exponent can be handled via the use of x^3 and the proper sequencing of squarings.
Now, as to further detail, it is possible to allow the “window”—essentially the base B—to change as we go along. That is, one can look ahead during processing of the exponent y, trying to find special strings for a little extra e ciency. One “sliding-window” method is presented in [Menezes et al. 1997]. It is also possible to use our balanced-base representation, Definition 9.1.2, to advantage. If we constrain the digits of exponent y to be
−⌊B/2⌋ ≤ y_i ≤ ⌈(B − 1)/2⌉,
and precompute odd powers xd where d is restricted within the range of these digit values, then significant advantages accrue, provided that the inverse