 •Guide to Elliptic Curve Cryptography
 •Contents
 •List of Algorithms
 •List of Tables
 •List of Figures
 •Acronyms
 •Preface
 •1 Introduction and Overview
 •1.1 Cryptography basics
 •1.2.3 Elliptic curve systems
 •1.3 Why elliptic curve cryptography?
 •1.4 Roadmap
 •2 Finite Field Arithmetic
 •2.2.1 Addition and subtraction
 •2.2.2 Integer multiplication
 •2.2.3 Integer squaring
 •2.2.4 Reduction
 •2.2.5 Inversion
 •2.3.1 Addition
 •2.3.2 Multiplication
 •2.3.3 Polynomial multiplication
 •2.3.4 Polynomial squaring
 •2.3.5 Reduction
 •2.4.1 Addition and subtraction
 •2.4.2 Multiplication and reduction
 •2.4.3 Inversion
 •3 Elliptic Curve Arithmetic
 •3.1 Introduction to elliptic curves
 •3.1.2 Group law
 •3.1.3 Group order
 •3.1.4 Group structure
 •3.2.1 Projective coordinates
 •3.3 Point multiplication
 •3.3.1 Unknown point
 •3.3.2 Fixed point
 •3.3.3 Multiple point multiplication
 •3.4 Koblitz curves
 •3.4.1 The Frobenius map and the ring Z[τ ]
 •3.4.2 Point multiplication
 •3.6 Point multiplication using halving
 •3.6.1 Point halving
 •3.6.3 Point multiplication
 •3.7 Point multiplication costs
 •4 Cryptographic Protocols
 •4.1 The elliptic curve discrete logarithm problem
 •4.2.3 Determining the number of points on an elliptic curve
 •4.4 Signature schemes
 •4.4.1 ECDSA
 •4.4.2 ECKCDSA
 •4.5.1 ECIES
 •4.5.2 PSEC
 •4.6.1 Stationtostation
 •4.6.2 ECMQV
 •5 Implementation Issues
 •5.1 Software implementation
 •5.1.1 Integer arithmetic
 •5.1.5 Timings
 •5.2 Hardware implementation
 •5.3 Secure implementation
 •5.3.1 Power analysis attacks
 •5.3.2 Electromagnetic analysis attacks
 •5.3.4 Fault analysis attacks
 •5.3.5 Timing attacks
 •A.1 Irreducible polynomials
 •A.2 Elliptic curves
 •A.2.2 Random elliptic curves over F2m
 •A.2.3 Koblitz elliptic curves over F2m
 •C.1 Generalpurpose tools
 •C.2 Libraries
 •Bibliography
 •Index
Какую работу нужно написать?
5.1. Software implementation 
219 
for inline assembler to cooperate with surrounding code is a design weakness compared with that of GNU C or the Intel compilers, which have the additional advantage that they can be used on Unixlike systems.
5.1.5Timings
Selected ﬁeld operation timings are presented for Intel Pentium family processors and the Sun UltraSPARC, commonly used in workstations. The NIST recommended binary and prime ﬁelds (§A.2) are the focus, although some data for an OEF (§2.4) is presented for comparison.
It is acknowledged that timings can be misleading, and are heavily inﬂuenced by the programmer’s talent and effort (or lack thereof), compiler selection, and the precise method of obtaining the data. The timings presented here should be viewed with the same healthy dose of skepticism prescribed for all such data. Nonetheless, timings are essential for algorithm analysis, since rough operation counts are often insufﬁcient to capture platform characteristics. For the particular timings presented here, there has generally been independent “sanity check” data available from other implementations.
Tables 5.3–5.5 give basic comparisons for the NIST recommended binary and prime ﬁelds, along with a selected OEF. Inversion and multiplication times for binary ﬁelds on two platforms appear in Table 5.6, comparing compilers, inversion algorithms, and 32bit vs 64bit code. The 64bit code on the Intel Pentium III is via specialpurpose registers. These capabilities were extended in the Pentium 4, and Table 5.7 includes timings for prime ﬁeld multiplication via these registers along with an approach using ﬂoatingpoint registers.
Field arithmetic comparisons
Timings for the smallest of the NIST recommended binary and prime ﬁelds, along with an OEF, are presented in Table 5.3. Speciﬁcally, these are the binary ﬁeld F2163
with reduction polynomial f (z) = z163 + z7 + z6 + z3 + 1, the prime ﬁeld F p192 with p192 = 2192 − 264 − 1, and the OEF F p6 with prime p = 231 − 1 and reduction poly
nomial f (z) = z6 − 7. Realistic branch misprediction penalties are obtained using a sequence of pseudo randomly generated ﬁeld elements, and the timings include framework overhead such as function calls. The Intel compiler version 6 along with the Netwide Assembler (NASM) were used on an Intel Pentium III running the Linux 2.2 operating system.
Algorithms for binary ﬁelds were coded entirely in C except for a oneline assembler fragment used in polynomial degree calculations in inversion. Assembly coding may be required in prime ﬁelds and OEFs in order to use hardware multipliers producing a 64bit product from 32bit input, and to directly access the carry bit, both of which are essential to performance in conventional methods. The ﬁrst of the F p192 columns in Table 5.3 gives timings for code written primarily in C. For most entries, a signiﬁ
220 
5. Implementation Issues 












F2163 
a 
F p192 
F(231−1)6 


F p192 








Addition 
0.04 
0.18 
0.07 
0.06 

Reduction 
0.11b 
0.25c 
0.11c 


Fast reduction 
N/A 


Barrett reduction (Algorithm 2.14) 
N/A 
1.55d 
0.49 
N/A 

Multiplication (including fast reduction) 
1.30e 
0.57d,f 
0.42f 
0.40g 

Squaring (including fast reduction) 
0.20h 
— 
0.36i 
0.32g 

Inversion 
10.5j 
58.3k 
25.2k 
2.9l 

I/M 
8.1 
102.3 
60.0 
7.3 




aCoded primarily in C. bAlgorithm 2.41. cAlgorithm 2.27. dUses a 32×32 multiplyandadd. 


eAlgorithm 2.36. fAlgorithm 2.10. gExample 2.56. hAlgorithm 2.39. 




iAlgorithm 2.13. jAlgorithm 2.48. kAlgorithm 2.22. lAlgorithm 2.59. 


Table 5.3. Timings (in µs) for ﬁeld arithmetic on an 800 MHz Intel Pentium III. The binary ﬁeld F2163 = F2[z]/(z163 + z7 + z6 + z3 + 1) and the prime ﬁeld F p192 for p192 = 2192 − 264 − 1 are from the NIST recommendations (§A.2). The rightmost column is the optimal extension ﬁeld F p6 = F p[z]/(z6 − 7) for prime p = 231 − 1.
cant penalty is seen relative to the timings with assembly. However, the multiplication routine uses an inline assembly fragment for a 32×32 multiply with a threeword accumulation. If reduction is excluded, the time is very close to that obtained with the assembly language version, an indication that the Intel compiler handles insertion of short inline assembly fragments well.
Reduction Barrett reduction does not exploit the special form of the NIST prime, and the entries can be interpreted as rough cost estimates of reduction with a random 192bit prime. In contrast to special primes, this estimate shows that reduction is now a very signiﬁcant part of ﬁeld multiplication timings, encouraging the use of Montgomery (§2.2.4) and other multiplication methods. Signiﬁcant performance degradation in the C version of the fast reduction algorithm is largely explained by the many conditionals in the clumsy handling of carry.
OEF The OEF F(231−1)6 in the rightmost column of Table 5.3 is roughly the same size as F p192 . The multiplication is accomplished with an accumulation method (Example 2.56) resembling the method used in F p192 , and the resulting times are comparable. As expected, inversion is signiﬁcantly faster for the OEF.
NIST ﬁelds Tables 5.4 and 5.5 provide timings for the NIST recommended binary and prime ﬁelds. Note that optimizations in the larger ﬁelds were limited to techniques employed for F2163 and F p192 . In particular, KaratsubaOfman methods were not competitive in our tests on this platform for the smaller ﬁelds, but were not examined carefully in the larger ﬁelds.


5.1. 
Software implementation 
221 










F2163 
F2233 
F2283 
F2409 
F2571 










Addition 
0.04 
0.04 
0.04 
0.06 
0.07 


Reduction (Algorithms 2.41–2.45) 
0.11 
0.13 
0.19 
0.14 
0.33 


Multiplication (Algorithm 2.36) 
1.30 
2.27 
2.92 
5.53 
10.23 


Squaring (Algorithm 2.39) 
0.20 
0.23 
0.32 
0.31 
0.56 










Inversion (Algorithm 2.48) 
10.5 
18.6 
28.2 
53.9 
96.4 


I/M 
8.1 
8.2 
9.7 
9.8 
9.4 










Table 5.4. Timings (in µs) for binary ﬁeld arithmetic on an 800 MHz Intel Pentium III, including
reduction to canonical form. The ﬁelds are from the NIST recommendations (§A.2) with reduction polynomials z163 + z7 + z6 + z3 + 1, z233 + z74 + 1, z283 + z12 + z7 + z5 + 1, z409 + z87 + 1, and z571 + z10 + z5 + z2 + 1, respectively.

F p192 
F p224 
F p256 
F p384 
F p521 








Addition 
0.07 
0.07 
0.08 
0.10 
0.10 

Reduction (Algorithms 2.27–2.31) 
0.11 
0.12 
0.30 
0.38 
0.20 








Multiplication (Algorithm 2.10) 
0.42 
0.52 
0.81 
1.47 
2.32 

Squaring (Algorithm 2.13) 
0.36 
0.44 
0.71 
1.23 
1.87 








Inversion (Algorithm 2.22) 
25.2 
34.3 
44.3 
96.3 
163.8 

I/M 
60.0 
70.0 
54.7 
65.5 
70.6 








Table 5.5. Timings (in µs) for prime ﬁeld arithmetic on an 800 MHz Intel Pentium III, in
cluding reduction to canonical form. The ﬁelds are from the NIST recommendations (§A.2)
with p192 = 2192 − 264 − 1, p224 = 2224 − 296 + 1, p256 = 2256 − 2224 + 2192 + 296 − 1, p384 = 2384 − 2128 − 296 + 232 − 1, and p521 = 2521 − 1.
Multiplication and inversion in binary ﬁelds
In point multiplication on elliptic curves (§3.3), the cost of ﬁeld inversion relative to ﬁeld multiplication is of particular interest. This section presents estimates of the ratio for the NIST binary ﬁelds (where the ratio is expected to be relatively small) for two platforms. The three inversion methods discussed in §2.3.6 are compared, along with timings for 32bit and 64bit code. The results also show signiﬁcant differences among the compilers used.
Table 5.6 gives comparative timings on two popular platforms, the Intel Pentium III and Sun UltraSPARC IIe. Both processors are capable of 32and 64bit operations, although only the UltraSPARC is 64bit. The 64bit operations on the Pentium III are via the singleinstruction multipledata (SIMD) registers, introduced on the Pentium MMX (see Table 5.1). The inversion methods are the extended Euclidean algorithm (EEA) in Algorithm 2.48, the binary Euclidean algorithm (BEA) in Algorithm 2.49, and the almost inverse algorithm (AIA) in Algorithm 2.50. The example ﬁelds are
taken from the NIST recommendations, with reduction polynomials f (z) = z163 + z7 + z6 + z3 + 1 and f (z) = z233 + z74 + 1. Both allow fast reduction, but only the latter
is favourable to the almost inverse algorithm. Field multiplication based on the comb
222 
5. Implementation Issues 













Pentium III (800 MHz) 
SPARC (500 MHz) 



32bit 
64bit 
32bit 
64bit 


Algorithm 
gcc 
icc 
mmx 
gcc 
cc 
cc 

















Arithmetic in F2163 







multiplication 
1.8 
1.3 
.7 
1.9 
1.8 
.9 

Euclidean algorithm 
10.9 
10.5 
7.1 
21.4 
14.8 
— 

binary Euclidean algorithm 
20.7 
16.0 
— 
16.8 
14.9 
10.6 

almost inverse 
16.4 
15.2 
— 
22.6 
15.2 
— 

I/M 
6.1 
8.1 
9.8 
8.8 
8.2 
12.1 









Arithmetic in F2233 







multiplication 
3.0 
2.3 
— 
4.0 
2.9 
1.7 

Euclidean algorithm 
18.3 
18.8 
— 
45.5 
25.7 
— 

binary Euclidean algorithm 
36.2 
28.9 
— 
42.0 
34.0 
16.9 

almost inverse 
22.7 
20.1 
— 
36.8 
24.7 
— 

I/M 
6.1 
8.2 
— 
9.2 
8.5 
9.9 








Table 5.6. Multiplication and inversion times for the Intel Pentium III and Sun UltraSPARC IIe. The compilers are GNU C 2.95 (gcc), Intel 6 (icc), and Sun Workshop 6U2 (cc). The 64bit “multimedia” registers were employed for the entries under “mmx.” Inversion to multiplication (I/M) uses the best inversion time.
method (Algorithm 2.36) appears to be fastest on these platforms. A width4 comb was used, and the times include reduction. Other than the MMX code and a oneline assembler fragment for EEA, algorithms were coded entirely in C.
Some table entries are as expected, for example, the relatively good times for almost inverse in F2233 . Other entries illustrate the signiﬁcant differences between platforms or compilers on a single platform. Apparent inconsistencies remain in Table 5.6, but we believe that the fastest times provide meaningful estimates of inversion and multiplication costs on these platforms.
Division The timings do not make a very strong case for division using a modiﬁcation of the BEA (§2.3.6). For the 32bit code, unless EEA or AIA can be converted to efﬁciently perform division, then only the entry for F2163 on the SPARC supports use of BEAlike division. Furthermore, the ratio I/M is at least 8 in most cases, and hence the savings from use of a division algorithm would be less than 10%. With such a ratio, elliptic curve methods will be chosen to reduce the number of inversions, so the savings on a point multiplication k P would be signiﬁcantly less than 10%.
On the other hand, if afﬁneonly arithmetic is in use in a point multiplication method based on doubleandadd, then a fast division would be especially welcomed even if I/M is signiﬁcantly larger than 5. If BEA is the algorithm of choice, then division has essentially the same cost as inversion.
Implementation notes General programming considerations for the implementations used here are covered in §5.1.4. In particular, to obtain acceptable multiplication times
5.1. Software implementation 
223 
with gcc on the Sun SPARC, code was tuned to be more “gccfriendly.” Limited tuning for gcc was also performed on the inversion code. Optimizing the inversion code is tedious, in part because rough operation counts at this level often fail to capture processor or compiler characteristics adequately.
Multimedia registers The Intel Pentium family (all but the original and the Pentium Pro) and AMD processors possess eight 64bit “multimedia” registers that were employed for the timings in the column marked “mmx.” Use of these capabilities for ﬁeld arithmetic is discussed in §5.1.3.
EEA Algorithm 2.48 requires polynomial degree calculations. On the SPARC, degree was found by binary search and table lookup, once the nonzero word of interest is located. On the Pentium, a bit scan instruction (bsr) that ﬁnds the position of the most signiﬁcant bit in a word was employed via inline assembly, resulting in an improvement of approximately 15% in inversion times.
The code tracks the lengths of u and v using t fragments of similar code, each fragment corresponding to the current “top” of u and v. Here, t was chosen to be the number of words required to represent ﬁeld elements.
BEA Algorithm 2.49 was implemented with a tfragment split to track the lengths of u and v efﬁciently. Rather than the degree calculation indicated in step 3.3, a simpler comparison on the appropriate words was used.
AIA Algorithm 2.50 allows efﬁcient tracking of the lengths of g1 and g2 (in addition to the lengths of u and v). A total of t2 similar fragments of code were used, a significant amount of code expansion unless t is small. As with BEA, a simple comparison replaces the degree calculations. Note that only the reduction polynomial for F2233 is favourable to the almost inverse algorithm.
Prime ﬁeld multiplication methods
For prime ﬁelds, traditional approaches for ﬁeld multiplication are often throttled by limitations of hardware integer multipliers and carry propagation. Both the UltraSPARC and the Pentium family processors suffer from such limitations. The Intel Pentium 4 is in fact much slower (in terms of processor cycles) in some operations than the preceding generation of Pentium processors. As an example, ﬁeld multiplication in F p224 using Algorithm 2.10 with code targeted at the Pentium II/III appears in Table 5.5 (from a Pentium III) and Table 5.7 (from a Pentium 4). Despite a factor 2 clock speed advantage for the Pentium 4, the timing is in fact slower than obtained on the Pentium III.
KaratsubaOfman Methods based on KaratsubaOfman do not appear to be competitive with classical methods on the Pentium II/III for ﬁelds of this size. Table 5.7 includes times on the Pentium 4 using a depth2 approach outlined in Example 2.12.