
462 Chapter 9 FAST ALGORITHMS FOR LARGE-INTEGER ARITHMETIC

powers are available. In the case of elliptic multiplication, let us say we desire “exponentiation” [k]P , where P is a point, k the exponent. We need to precompute, then, only the multiples

{[d]P : 1 < d < B/2 ; d odd},

because negations [−d]P are immediate, by the rules of elliptic algebra. In this way, one can fashion highly efficient windowing schemes for elliptic multiplication. See Exercise 9.77 for yet more considerations.

Ignoring precomputation, it can be inferred that in Algorithm 9.3.3 with base B = 2^b the asymptotic (large-y) requirement is Db ∼ lg y squarings (i.e., one squaring for each binary bit of y). This is, of course, no gain over the squarings required in the basic binary ladders. But the difference lies in the multiplication count. Whereas in the basic binary ladders the (asymptotic) number of multiplications is the number of 1's, we now only need at most one multiplication per b bits; in fact, we only need 1 − 2^{−b} of these on average, because of the chance of a zero digit in random base-B expansions. Thus, the average-case asymptotic complexity for the windowing algorithm is

C ∼ (lg y)S + (1 − 2^{−b}) ((lg y)/b) M,

which when b = 1 is equivalent to the previous estimate C ∼ (lg y)S + ((1/2) lg y)M for the basic binary ladders. Note though as the window size b increases, the burden of multiplications becomes negligible. It is true that precomputation considerations are paramount, but in practice, a choice of b = 3 or b = 4 will indeed reduce noticeably the ladder computations.
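For concreteness, here is a minimal Python sketch of such a base-2^b windowed ladder (our illustration, not the book's Algorithm 9.3.3; for simplicity it precomputes the full digit table rather than only the odd powers discussed above):

```python
def windowed_pow(x, y, b, n):
    """Left-to-right windowed powering: x**y mod n using base-B digits
    of y with B = 2**b, at most one table multiply per digit."""
    B = 1 << b
    # Precompute table[d] = x**d mod n for d in [0, B-1].
    table = [1] * B
    for d in range(1, B):
        table[d] = table[d - 1] * x % n
    # Base-B digits of y, least significant first.
    digits = []
    while y:
        digits.append(y % B)
        y //= B
    z = 1
    for d in reversed(digits):
        for _ in range(b):          # b squarings per digit, lg y in total
            z = z * z % n
        if d:                       # at most one multiply per digit
            z = z * table[d] % n
    return z
```

One can check the sketch against Python's built-in three-argument pow.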

Along the lines of the previous remarks concerning precomputation, an interesting ladder enhancement obtains in the case that the number x is to be reused. That is, say we wish to compute x^y for many different y values, with x fixed. We can compute and store fixed powers of the fixed x, and use them to advantage.

Algorithm 9.3.4 (Fixed-x ladder for x^y). This algorithm computes x^y. We assume a base-B (not necessarily binary) expansion (y_0, . . . , y_{D−1}) of y > 0, with high digit y_{D−1} > 0. We also assume that the (total of (B − 1)(D − 1)) values

{x^{iB^j} : i ∈ [1, B − 1]; j ∈ [1, D − 1]}

have been precomputed.

1. [Initialize]
   z = 1;
2. [Loop over digits]
   for(0 ≤ j < D) z = z · x^{y_j B^j};
   return z;
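A direct Python rendering of this fixed-x ladder might look as follows (a sketch; for simplicity the table here covers i ∈ [0, B − 1] and j ∈ [0, D − 1], slightly more than the text's precomputed set):

```python
def precompute_powers(x, B, D, n):
    """table[j][i] = x**(i * B**j) mod n, for the fixed base x."""
    table = []
    p = x % n
    for j in range(D):
        row = [1]
        for i in range(1, B):
            row.append(row[-1] * p % n)
        table.append(row)
        p = row[-1] * p % n         # now p = x**(B**(j+1)) mod n
    return table

def fixed_x_pow(y, B, table, n):
    """Algorithm 9.3.4: multiply the precomputed x**(y_j * B**j)
    over the base-B digits y_j of y; no squarings at all."""
    z, j = 1, 0
    while y:
        z = z * table[j][y % B] % n  # one multiply per digit
        y //= B
        j += 1
    return z
```

The table is built once per fixed x and then amortized over many exponents y.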

This algorithm clearly requires, beyond precomputation, an operation count

C ∼ DM ∼ ((lg y)/(lg B)) M,

so the fact of a "stable" value for x really can yield high efficiency, because of the (lg B)^{−1} factor. Depending on precise practical setting and requirements, there exist yet further enhancements, including the use of less extensive lookup tables (i.e., using only the stored powers such as x^{B^j}), loosening of the restrictions on the ranges of the for() loops depending on the range of values of the y digits in base B (in some situations not every possible digit will occur), and so on. Note that if we do store only the reduced set of powers x^{B^j}, the Step [Loop over digits] will have nested for() loops. There also exist fixed-y algorithms using so-called addition chains, so that when the exponent is stable some enhancements are possible. Both fixed-x and fixed-y forms find applications in cryptography. If public keys are generated as fixed x values raised to secret y values, for example, the fixed-x enhancements can be beneficial. Similarly, if a public key (as x = g^h) is to be raised often to a key power y, then the fixed-y methods may be invoked for extra efficiency.

9.4 Enhancements for gcd and inverse

In Section 2.1.1 we discussed the great classical algorithms for gcd and inverse. Here we explore more modern methods, especially methods that apply when the relevant integers are very large, or when some operations (such as shifts) are relatively efficient.

9.4.1 Binary gcd algorithms

There is a genuine enhancement of the Euclid algorithm worked out by D. Lehmer in the 1930s. The method exploits the fact that not every implied division in the Euclid loop requires full precision, and statistically speaking there will be many single-precision (i.e., small operand) div operations. We do not lay out the Lehmer method here (for details see [Knuth 1981]), but observe that Lehmer showed how to enhance an old algorithm to advantage in such tasks as factorization.

In the 1960s it was observed by R. Silver and J. Terzian [Knuth 1981], and independently in [Stein 1967], that a gcd algorithm can be effected in a certain binary fashion. The following relations indeed suggest an elegant algorithm:

Theorem 9.4.1 (Silver, Terzian, and Stein). For integers x, y:

If x, y are both even, then gcd(x, y) = 2 gcd(x/2, y/2);
If x is even and y is not, then gcd(x, y) = gcd(x/2, y);
(As per Euclid) gcd(x, y) = gcd(x − y, y);
If u, v are both odd, then |u − v| is even and less than max{u, v}.

These observations give rise to the following algorithm:

Algorithm 9.4.2 (Binary gcd). The following algorithm returns the greatest common divisor of two positive integers x, y. For any positive integer m, let v2(m) be the number of low-order 0's in the binary representation of m; that is, we have 2^{v2(m)} ∥ m. (Note that m/2^{v2(m)} is the largest odd divisor of m, and can be computed with a shift into oblivion of the low-order zeros; note also for theoretical convenience we may as well take v2(0) = ∞.)

1. [2's power in gcd]
   β = min{v2(x), v2(y)};   // 2^β ∥ gcd(x, y).
   x = x/2^{v2(x)};
   y = y/2^{v2(y)};
2. [Binary gcd]
   while(x ≠ y) {
      (x, y) = (min{x, y}, |y − x|/2^{v2(|y−x|)});
   }
   return 2^β x;
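In Python, the algorithm can be sketched like so (for positive x, y; the helper v2 counts trailing zero bits with a standard bit trick):

```python
def v2(m):
    """Number of low-order zero bits of m != 0."""
    return (m & -m).bit_length() - 1

def binary_gcd(x, y):
    """Algorithm 9.4.2 sketch: gcd of positive x, y via shifts
    and subtractions only (no division)."""
    beta = min(v2(x), v2(y))     # 2**beta exactly divides gcd(x, y)
    x >>= v2(x)                  # largest odd divisor of x
    y >>= v2(y)
    while x != y:
        x, y = min(x, y), abs(y - x)
        y >>= v2(y)              # strip the factor 2**v2(|y-x|)
    return x << beta
```

The only arithmetic used is comparison, subtraction, and shifting, which is why the method fares well on binary machinery.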

In actual practice on most machinery, the binary algorithm is often faster than the Euclid algorithm; and as we have said, Lehmer’s enhancements may also be applied to this binary scheme.

But there are other, more modern, enhancements; in fact, gcd enhancements seem to keep appearing in the literature. There is a “k-ary” method due to Sorenson, in which reductions involving k > 2 as a modulus are performed. There is also a newer extension of the Sorenson method that is claimed to be, on a typical modern machine that possesses hardware multiply, more than 5 times faster than the binary gcd we just displayed [Weber 1995]. The Weber method is rather intricate, involving several special functions for nonstandard modular reduction, yet the method should be considered seriously in any project for which the gcd happens to be a bottleneck. Most recently, [Weber et al. 2005] introduced a new modular GCD algorithm that could be an ideal choice for certain ranges of operands.

It is of interest that the Sorenson method has variants for which the complexity of the gcd is O(n^2/ln n) as opposed to the Euclidean O(n^2) [Sorenson 1994]. In addition, the Sorenson method has an extended form for obtaining not just gcd but inverse as well.

One wonders whether this efficient binary technique can be extended in the way that the classical Euclid algorithm can. Indeed, there is also an extended binary gcd that provides inverses. [Knuth 1981] attributes the method to M. Penk:

Algorithm 9.4.3 (Binary gcd, extended for inverses). For positive integers x, y, this algorithm returns an integer triple (a, b, g) such that ax + by = g = gcd(x, y). We assume the binary representations of x, y, and use the exponent β as in Algorithm 9.4.2.

1. [Initialize]
   x = x/2^β; y = y/2^β;
   (a, b, h) = (1, 0, x);
   (v1, v2, v3) = (y, 1 − x, y);
   if(x even) (t1, t2, t3) = (1, 0, x);
   else {
      (t1, t2, t3) = (0, −1, −y);
      goto [Check even];
   }
2. [Halve t3]
   if(t1, t2 both even) (t1, t2, t3) = (t1, t2, t3)/2;
   else (t1, t2, t3) = (t1 + y, t2 − x, t3)/2;
3. [Check even]
   if(t3 even) goto [Halve t3];
4. [Reset max]
   if(t3 > 0) (a, b, h) = (t1, t2, t3);
   else (v1, v2, v3) = (y − t1, −x − t2, −t3);
5. [Subtract]
   (t1, t2, t3) = (a, b, h) − (v1, v2, v3);
   if(t1 < 0) (t1, t2) = (t1 + y, t2 − x);
   if(t3 ≠ 0) goto [Halve t3];
   return (a, b, 2^β h);

Like the basic binary gcd algorithm, this one tends to be efficient in actual machine implementations. When something is known as to the character of either operand (for example, say y is prime) this and related algorithms can be enhanced (see Exercises).

9.4.2 Special inversion algorithms

Variants on the inverse-finding, extended gcd algorithms have appeared over the years, in some cases depending on the character of the operands x, y. One example is the inversion scheme in [Thomas et al. 1986] for x^{−1} mod p, for primes p. Actually, the algorithm works for unrestricted moduli (returning either a proper inverse or zero if the inverse does not exist), but the authors were concentrating on moduli p for which a key quantity ⌊p/z⌋ within the algorithm can be easily computed.

Algorithm 9.4.4 (Modular inversion). For modulus p (not necessarily prime) and x ≢ 0 (mod p), this algorithm returns x^{−1} mod p.

1. [Initialize]
   z = x mod p;
   a = 1;
2. [Loop]
   while(z ≠ 1) {
      q = −⌊p/z⌋;        // Algorithm is best when this is fast.
      z = p + qz;
      a = (qa) mod p;
   }
   return a;             // a = x^{−1} mod p.
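A Python sketch of the loop (for prime p, which guarantees the iteration z ← p mod z reaches 1 rather than 0):

```python
def mod_inverse(x, p):
    """Algorithm 9.4.4 sketch: x**-1 mod p for prime p, x not = 0 mod p."""
    z = x % p
    a = 1
    while z != 1:
        q = -(p // z)        # the key quantity -floor(p/z)
        z = p + q * z        # equivalently z = p mod z
        a = (q * a) % p
    return a
```

The speed of the method hinges entirely on how fast ⌊p/z⌋ can be produced, which motivates the Mersenne-prime specialization below.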

This algorithm is conveniently simple to implement, and furthermore (for some ranges of primes), is claimed to be somewhat faster than the extended Algorithm 2.1.4. Incidentally, the authors of this algorithm also give an interesting method for rapid calculation of ⌊p/z⌋ when p = 2^q − 1 is specifically a Mersenne prime.

Yet other inversion methods focus on the specific case that p is a Mersenne prime. The following is an interesting attempt to exploit the special form of the modulus:

Algorithm 9.4.5 (Inversion modulo a Mersenne prime). For p = 2^q − 1 prime and x ≢ 0 (mod p), this algorithm returns x^{−1} mod p.

1. [Initialize]
   (a, b, y, z) = (1, 0, x, p);
2. [Relational reduction]
   Find e such that 2^e ∥ y;
   y = y/2^e;                 // Shift off trailing zeros.
   a = (2^{q−e} a) mod p;     // Circular shift, by Theorem 9.2.12.
   if(y == 1) return a;
   (a, b, y, z) = (a + b, a, y + z, y);
   goto [Relational reduction];
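A Python sketch follows; here the multiplication by 2^{q−e} mod p is done with pow() rather than an actual q-bit circular shift, and the exponent is reduced mod q, which is legitimate since 2^q ≡ 1 (mod p):

```python
def mersenne_inverse(x, q):
    """Algorithm 9.4.5 sketch: x**-1 modulo the Mersenne prime
    p = 2**q - 1 (x not divisible by p)."""
    p = (1 << q) - 1
    a, b, y, z = 1, 0, x % p, p
    while True:
        e = (y & -y).bit_length() - 1           # 2**e exactly divides y
        y >>= e                                 # shift off trailing zeros
        a = (pow(2, (q - e) % q, p) * a) % p    # circular-shift step
        if y == 1:
            return a
        a, b, y, z = a + b, a, y + z, y
```

On hardware with a genuine rotate instruction the pow() call collapses to a shift, which is the point of the special form of p.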

9.4.3 Recursive-gcd schemes for very large operands

It turns out that the classical bit-complexity O(ln^2 N) for evaluating the gcd of two numbers, each of size N, can be genuinely reduced via recursive reduction techniques, as first observed in [Knuth 1971]. Later it was established that such recursive approaches can be brought down to complexity

O(M(ln N) ln ln N),

where M(b) denotes the bit-complexity for multiplication of two b-bit integers. With the best-known bound for M(b), as discussed later in this chapter, the complexity for these recursive gcd algorithms is thus

O(ln N (ln ln N)^2 ln ln ln N).

Studies on the recursive approach span several decades; references include [Schönhage 1971], [Aho et al. 1974, pp. 300–310], [Bürgisser et al. 1997, p. 98], [Cesari 1998], [Stehlé and Zimmermann 2004]. For the moment, we observe that like various other algorithms we have encountered—such as preconditioned CRT—the recursive-gcd approach cannot really use grammar-school multiplication to advantage.

We shall present in this section two recursive-gcd algorithms: the original one from the 1970s that, for convenience, we call the Knuth–Schönhage gcd (or KSgcd), and a very new, pure-binary one by Stehlé–Zimmermann (called the SZgcd). Both variants turn out to have the same asymptotic complexity, but differ markedly in regard to implementation details.

One finds in practice that recursive-gcd schemes outperform all known alternatives (such as the binary gcd forms with or without Lehmer enhancements) when the input arguments x, y are sufficiently large, say in the region of tens of thousands of bits (although this "breakover" threshold depends strongly on machinery and on various options such as the choice of an alternative classical gcd algorithm at recursion bottom). As an example application, recall that for inversionless ECM, Algorithm 7.4.4, we require a gcd. If one is attempting to find a factor of the Fermat number F24 (nobody has yet been successful in that) there will be gcd arguments of about 16 million bits, a region where recursive gcds with the above complexity radically dominate, performance-wise, all other alternatives. Later in this section we give some specific timing estimates.

The basic idea of the KSgcd scheme is that the remainder and quotient sequences of a classical gcd algorithm differ radically in the following sense. Let x, y each be of size N. Referring to the Euclid Algorithm 2.1.2, denote by (r_j, r_{j+1}) for j ≥ 0 the pairs that arise after j passes of the loop. So a remainder sequence is defined as (r_0 = x, r_1 = y, r_2, r_3, . . .). Similarly there is an implicit quotient sequence (q_1, q_2, . . .) defined by

r_j = q_{j+1} r_{j+1} + r_{j+2}.

In performing the classical gcd one is essentially iterating such a quotient-remainder relation until some r_k is zero, in which case the previous remainder r_{k−1} is the gcd. Now for the radical difference between the q and r sequences: As enunciated elegantly by [Cesari 1998], the total number of bits in the remainder sequence is expected to be O(ln^2 N), and so naturally any gcd algorithm that refers to every r_j is bound to admit, at best, of quadratic complexity. On the other hand, the quotient sequence (q_1, . . . , q_{k−1}) tends to have relatively small elements. The recursive notion stems from the fact that knowing the q_j yields any one of the r_j in nearly linear time [Cesari 1998].

Let us try an example of remainder-quotient sequences. (We choose moderately large inputs x, y here for later illustration of the recursive idea.) Take

(r_0, r_1) = (x, y) = (31416, 27183),

whence

r_0 = q_1 r_1 + r_2 = 1 · r_1 + 4233,
r_1 = q_2 r_2 + r_3 = 6 · r_2 + 1785,
r_2 = q_3 r_3 + r_4 = 2 · r_3 + 663,
r_3 = q_4 r_4 + r_5 = 2 · r_4 + 459,
r_4 = q_5 r_5 + r_6 = 1 · r_5 + 204,
r_5 = q_6 r_6 + r_7 = 2 · r_6 + 51,
r_6 = q_7 r_7 + r_8 = 4 · r_7 + 0.

Evidently, gcd(x, y) = r_7 = 51, but notice the quotient sequence goes (1, 6, 2, 2, 1, 2, 4); in fact these are the elements of the simple continued fraction for the rational x/y. The trend is typical: Most quotient elements are expected to be small.


To formalize how remainder terms can be gotten from known quotient terms, we can use the matrix-vector identity, valid for i < j,

( r_j     )   ( 0    1  )         ( 0      1    ) ( r_i     )
( r_{j+1} ) = ( 1  −q_j )  · · ·  ( 1  −q_{i+1} ) ( r_{i+1} ).

Now the idea is to use the typically small q values to compute a matrix G such that the vector G(x, y)^T is some column vector (r_j, r_{j+1})^T where the bit-length of r_j is roughly half that of x. Then one recurses on this theme, until the relevant operands can be dealt with swiftly, via a classical gcd. In the algorithm to follow, when the main function rgcd() is called, there is eventually a call (in Step [Reduce arguments]) to a procedure hgcd() that updates a matrix G so that the resulting product G(u, v)^T is a column vector with significantly smaller components. To illustrate in our baby example above, if we go about half-way through the development, we have that

 

G = ( 0   1 ) ( 0   1 ) ( 0   1 ) ( 0   1 )  =  (  13  −15 )
    ( 1  −2 ) ( 1  −2 ) ( 1  −6 ) ( 1  −1 )     ( −32   37 )

and

G ( r_0 )   (  13  −15 ) ( 31416 )   ( 663 )   ( r_4 )
  ( r_1 ) = ( −32   37 ) ( 27183 ) = ( 459 ) = ( r_5 ).

 

In this way we jump significantly down the remainder chain with just one call to the hgcd() procedure. For the particular example, we might then go to a classical gcd with the smaller operands r_4 and r_5. For very large initial operands, it would take some number of recursive passes to move sufficiently down the remainder chain, with the basic bit-length of an r_j being roughly halved on each pass.
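The jump can be checked numerically; the following sketch (ours, for illustration) accumulates the product of the 2-by-2 quotient matrices for the example above and verifies that applying it to the original pair lands directly on (r_4, r_5):

```python
def quotient_matrices_jump(r0, r1, steps):
    """Accumulate G = M(q_steps) ... M(q_1), where M(q) = [0 1; 1 -q],
    alongside the remainder pair, and return both."""
    G = [[1, 0], [0, 1]]
    u, v = r0, r1
    for _ in range(steps):
        q = u // v
        u, v = v, u - q * v
        # G = [[0, 1], [1, -q]] * G  (2x2 matrix product)
        G = [[G[1][0], G[1][1]],
             [G[0][0] - q * G[1][0], G[0][1] - q * G[1][1]]]
    return G, (u, v)

G, (r4, r5) = quotient_matrices_jump(31416, 27183, 4)
# G applied to the original pair reproduces (r4, r5) directly:
assert [G[0][0] * 31416 + G[0][1] * 27183,
        G[1][0] * 31416 + G[1][1] * 27183] == [r4, r5]
```

Here the matrix entries stay small (of the order of the quotients' products), which is exactly why shipping G around is cheaper than touching every remainder.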

For the next pseudocode display, we have drawn on an implementation in [Buhler 1991]. (Note: As with various modern software packages, we denote gcd(0, 0) = 0 for convenience.)

Algorithm 9.4.6 (Recursive gcd). For nonnegative integers x, y this algorithm returns gcd(x, y). The top-level function rgcd() calls a recursive hgcd() which in turn calls a “small-gcd” function shgcd(), with a classical (such as a Euclid or binary) function cgcd() invoked at recursion bottom. There is a global matrix G, other interior variables being local (in the usual sense for recursive procedures).

1. [Initialize]
   lim = 2^{256};   // Breakover threshold for cgcd(); adjust for efficiency.
   prec = 32;       // Breakover bit length for shgcd(); adjust for efficiency.
2. [Set up small-gcd function shgcd to return a matrix]
   shgcd(x, y) {    // Short gcd, with variables u, v, q, A local.
      A = ( 1 0 ; 0 1 );   // 2-by-2 identity; rows separated by ";".
      (u, v) = (x, y);
      while(v^2 > x) {
         q = ⌊u/v⌋;
         (u, v) = (v, u mod v);
         A = ( 0 1 ; 1 −q ) A;
      }
      return A;
   }
3. [Set up recursive procedure hgcd to modify global matrix G]
   hgcd(b, x, y) {  // Variables u, v, q, m, C are local.
      if(y == 0) return;
      u = ⌊x/2^b⌋;
      v = ⌊y/2^b⌋;
      m = B(u);     // B is as usual the bit-length function.
      if(m < prec) {
         G = shgcd(u, v);
         return;
      }
      m = ⌊m/2⌋;
      hgcd(m, u, v);               // Recurse.
      (u, v)^T = G(u, v)^T;        // Matrix-vector multiply.
      if(u < 0) (u, G11, G12) = (−u, −G11, −G12);
      if(v < 0) (v, G21, G22) = (−v, −G21, −G22);
      if(u < v) (u, v, G11, G12, G21, G22) = (v, u, G21, G22, G11, G12);
      if(v ≠ 0) {
         (u, v) = (v, u);
         q = ⌊v/u⌋;
         G = ( 0 1 ; 1 −q ) G;     // Matrix-matrix multiply.
         v = v − qu;
         m = ⌊m/2⌋;
         C = G;
         hgcd(m, u, v);
         G = GC;
      }
      return;
   }
4. [Establish the top-level function rgcd]
   rgcd(x, y) {     // Top-level function, with variables u, v local.
      (u, v) = (x, y);
5. [Reduce arguments]
      (u, v) = (|u|, |v|);         // Absolute-value each component.
      if(u < v) (u, v) = (v, u);
      if(v < lim) goto [Branch];
      G = ( 1 0 ; 0 1 );
      hgcd(0, u, v);
      (u, v)^T = G(u, v)^T;
      (u, v) = (|u|, |v|);
      if(u < v) (u, v) = (v, u);
      if(v < lim) goto [Branch];
      (u, v) = (v, u mod v);
      goto [Reduce arguments];
6. [Branch]
      return cgcd(u, v);           // Recursion done, branch to alternative gcd.
   }

To clarify the practical application of the algorithm, one chooses the "breakover" parameters lim and prec, whence the greatest common divisor of x, y is to be calculated by calling the overall function rgcd(x, y). We remark that G. Woltman has managed to implement Algorithm 9.4.6 in a highly memory-efficient way, essentially by reusing certain storage and carrying out other careful bookkeeping. He reported in year 2000 the ability to effect a random gcd with respect to the Fermat number F24 in under an hour on a modern PC, while a classical gcd of such magnitude would consume days of machine time. This was at the time one of the very first practical successes of the recursive approach. So the algorithm, though intricate, certainly has its rewards, especially in the search for factors of very large numbers, say arguments as large as some interesting "genuine composites" like the Fermat number F20 and beyond.

An alternative recursive approach—the SZgcd—is a very new development. It is a binary-recursive gcd involving little more than binary shifts and large-integer multiplies. This spectacular discovery has the same theoretical complexity as Algorithm 9.4.6, yet [Zimmermann 2004] reports that a GNU MP implementation of the algorithm below performs a gcd of two numbers of 2^{24} bits each in about 45 seconds, on a modern PC. The year 2000 timing for Algorithm 9.4.6 comes down, via modern (2004) machinery, to more like several minutes, so this new SZgcd is quite a performer. We remind ourselves, however, that the theoretical complexity as enunciated at the beginning of this section applies to both algorithms—the fact of simple, rapid binary operations for the newer algorithm yields a smaller effective big-O constant. (There is also the observation [Stehlé and Zimmermann 2004] that it is much easier to be rigorous with the complexity theory for the SZgcd.)

The basic idea of the SZgcd is to expand a rational number in a continued fraction whose elements are taken not from the usual positive integers, but rather from the set

(±1/2, ±1/4, ±3/4, ±1/8, ±3/8, ±5/8, ±7/8, ±1/16, ±3/16, . . .).

So a typical fraction development is exemplified like so for the rational 525/266 (an example publicized by D. Bernstein):

525 = (1/2)266 + 392;
266 = (−3/4)392 + 560;
392 = (−1/2)560 + 672;
560 = (−1/2)672 + 896;
672 = (3/4)896 + 0.

Now gcd(525, 266) is seen to be the odd part of 896, namely 7. At each step we choose the fractional “quotient” so that the 2-power in the remainder increases. Thus the algorithm below is entirely 2-adic, and is especially suited for machinery with fast binary operations, such as vector-shift and so on. Note that the divbin procedure in Algorithm 9.4.7 is merely a single iteration of the above type, and that one always arranges to apply it when the first integer is not divisible by as high a power of 2 as the second integer.
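The 2-adic division step can be illustrated with a brute-force search (a sketch only; the efficient divbin of Algorithm 9.4.7 computes the quotient directly via signed modular reduction):

```python
def v2(m):
    """Number of trailing zero bits of m != 0."""
    return (m & -m).bit_length() - 1

def binary_divstep(x, y):
    """Illustrative stand-in for one divbin step: find q = n/2**j with
    n odd, |n| < 2**j, j = v2(y) - v2(x), such that r = x - q*y is
    zero or has v2(r) > v2(y). Returns (n, j, r)."""
    j = v2(y) - v2(x)
    for n in range(-2**j + 1, 2**j, 2):      # odd numerators only
        r = x - n * y // 2**j                # exact: 2**j divides n*y
        if r == 0 or v2(r) > v2(y):
            return n, j, r
    raise ValueError("no suitable quotient found")

# Reproduce the 525/266 chain from the text:
x, y = 525, 266
chain = []
while True:
    n, j, r = binary_divstep(x, y)
    chain.append((n, 2**j, r))               # quotient n/2**j, remainder r
    if r == 0:
        break
    x, y = y, r
odd_part = y >> v2(y)                        # gcd(525, 266)
```

Each step strictly increases the 2-power of the remainder, which is the sense in which the whole development is 2-adic.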

Following Stehlé–Zimmermann, we employ a signed modular reduction x cmod m, defined as the unique residue of x modulo m that lies in the interval [⌊−m/2⌋ + 1, ⌊m/2⌋]. The function v2, returning the number of trailing zero bits, is as in Algorithm 9.4.2. As with previous algorithms, B(n) denotes the number of bits in the binary representation of a nonnegative integer n.

Algorithm 9.4.7 (Stehlé–Zimmermann binary-recursive gcd). For nonnegative integers x, y this algorithm returns gcd(x, y). The top-level function SZgcd() calls a recursive, half-binary function hbingcd(), with a classical binary gcd invoked when operands have sufficiently decreased.

1. [Initialize]
   thresh = 10000;   // Tunable breakover threshold for binary gcd.
2. [Set up top-level function that returns the gcd]
   SZgcd(x, y) {     // Variables u, v, k, q, r, G are local.
      (u, v) = (x, y);
      if(v2(v) < v2(u)) (u, v) = (v, u);
      if(v2(v) == v2(u)) (u, v) = (u, u + v);
      if(v == 0) return u;
      k = v2(u);
      (u, v) = (u/2^k, v/2^k);
3. [Reduce]
      if(B(u) < thresh or B(v) < thresh) return 2^k gcd(u, v);   // Algorithm 9.4.2.
      G = hbingcd(⌊B(u)/2⌋, u, v);    // G is a 2-by-2 matrix.
      (u, v)^T = G(u, v)^T;           // Matrix-vector multiplication.
      if(v == 0) return 2^{k−v2(u)} u;
      (q, r) = divbin(u, v);
      (u, v) = (v/2^{v2(v)}, r/2^{v2(v)});
      if(v == 0) return 2^k u;
      goto [Reduce];
   }
4. [Half-binary divide function]
   divbin(x, y) {    // A 2-vector (q, r) is returned. Variables q, r are local.
      r = x;
