~ubuntu-branches/ubuntu/utopic/dropbear/utopic-proposed

For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long''

1695

data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma = 32$.

1675

data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$.

1696

1676

1697

1677

\newpage\begin{figure}[!here]

1698

1678

\begin{center}

1759

1739

1760

1740

EXAM,bn_s_mp_sub.c

1761

1741

1762

Line @24,min@ and @25,max@ perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only

1763

used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines @42,tmpa@, @43,tmpb@ and @44,tmpc@ initialize the aliases for

1764

$a$, $b$ and $c$ respectively.

1765

1766

The first subtraction loop occurs on lines @47,u = 0@ through @61,}@. The theory behind the subtraction loop is exactly the same as that for

1767

the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry

1768

(\textit{see line @57, >>@}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND

1769

the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry

1770

occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply

1771

shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on

1772

twos compliment machines which is a safe assumption to make.

1773

1774

If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines @64,for@ through @73,}@}) is required to propagate the carry through

1775

$a$ and copy the result to $c$.

1742

Like low level addition we ``sort'' the inputs. Except in this case the sorting is hardcoded

1743

(lines @24,min@ and @25,max@). In reality the $min$ and $max$ variables are only aliases and are only

1744

used to make the source code easier to read. Again the pointer alias optimization is used

1745

within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized

1746

(lines @42,tmpa@, @43,tmpb@ and @44,tmpc@) for $a$, $b$ and $c$ respectively.

1747

1748

The first subtraction loop (lines @47,u = 0@ through @61,}@) subtract digits from both inputs until the smaller of

1749

the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward''

1750

method of extracting the carry (line @57, >>@). The traditional method for extracting the carry would be to shift

1751

by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of

1752

the bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction. This carry

1753

extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the

1754

most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This

1755

optimization only works on twos compliment machines which is a safe assumption to make.

1756

1757

If $a$ has a larger magnitude than $b$ an additional loop (lines @64,for@ through @73,}@) is required to propagate

1758

the carry through $a$ and copy the result to $c$.

1776

1759

1777

1760

\subsection{High Level Addition}

1778

1761

Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be

2098

2081

2099

2082

EXAM,bn_mp_lshd.c

2100

2083

2101

The if statement on line @24,if@ ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before

2102

the copy loop begins. This elminates the need for an additional variable in the for loop. The variable $top$ on line @42,top@ is an alias

2103

for the leading digit while $bottom$ on line @45,bottom@ is an alias for the trailing edge. The aliases form a window of exactly $b$ digits

2104

over the input.

2084

The if statement (line @24,if@) ensures that the $b$ variable is greater than zero since we do not interpret negative

2085

shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This elminates

2086

the need for an additional variable in the for loop. The variable $top$ (line @42,top@) is an alias

2087

for the leading digit while $bottom$ (line @45,bottom@) is an alias for the trailing edge. The aliases form a

2088

window of exactly $b$ digits over the input.

2105

2089

2106

2090

\subsection{Division by $x$}

2107

2091

2151

2135

2152

2136

EXAM,bn_mp_rshd.c

2153

2137

2154

The only noteworthy element of this routine is the lack of a return type.

2155

2138

The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we

2139

form a sliding window except we copy in the other direction. After the window (line @59,for (;@) we then zero

2140

the upper digits of the input to make sure the result is correct.

2156

2141

2157

2142

\section{Powers of Two}

2158

2143

2214

2198

2215

2199

EXAM,bn_mp_mul_2d.c

2216

2200

2217

Notes to be revised when code is updated. -- Tom

2201

The shifting is performed in--place which means the first step (line @24,a != c@) is to copy the input to the

2202

destination. We avoid calling mp\_copy() by making sure the mp\_ints are different. The destination then

2203

has to be grown (line @31,grow@) to accomodate the result.

2204

2205

If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples

2206

of $lg(\beta)$. Leaving only a remaining shift of $lg(\beta) - 1$ or fewer bits left. Inside the actual shift

2207

loop (lines @45,if@ to @76,}@) we make use of pre--computed values $shift$ and $mask$. These are used to

2208

extract the carry bit(s) to pass into the next iteration of the loop. The $r$ and $rr$ variables form a

2209

chain between consecutive iterations to propagate the carry.

2218

2210

2219

2211

\subsection{Division by Power of Two}

2220

2212

2263

2255

result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before

2264

2256

the quotient is obtained.

2265

2257

2266

The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom).

2258

The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is

2259

the direction of the shifts.

2267

2260

2268

2261

\subsection{Remainder of Division by Power of Two}

2269

2262

2306

2299

2307

2300

EXAM,bn_mp_mod_2d.c

2308

2301

2302

We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases. Next if $2^b$ is larger

2303

than the input we just mp\_copy() the input and return right away. After this point we know we must actually

2304

perform some work to produce the remainder.

2305

2306

Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent we can quickly reduce

2307

the number. First we zero any digits above the last digit in $2^b$ (line @41,for@). Next we reduce the

2308

leading digit of both (line @45,&=@) and then mp\_clamp().

2309

2310

\section*{Exercises}

2311

\begin{tabular}{cl}

2464

2463

2465

2464

EXAM,bn_s_mp_mul_digs.c

2466

2465

2467

Lines @31,if@ to @35,}@ determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and

2468

the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control

2469

the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium.

2470

2471

Of particular importance is the calculation of the $ix+iy$'th column on lines @64,mp_word@, @65,mp_word@ and @66,mp_word@. Note how all of the

2472

variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations

2473

are used instead of single precision. The multiplication on line @65,) * (@ makes use of a specific GCC optimizer behaviour. On the outset it looks like

2474

the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most

2475

processors and drag this to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For

2476

example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result.

2466

First we determine (line @30,if@) if the Comba method can be used first since it's faster. The conditions for

2467

sing the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than

2468

\textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is

2469

set to $\delta$ but can be reduced when memory is at a premium.

2470

2471

If we cannot use the Comba method we proceed to setup the baseline routine. We allocate the the destination mp\_int

2472

$t$ (line @36,init@) to the exact size of the output to avoid further re--allocations. At this point we now

2473

begin the $O(n^2)$ loop.

2474

2475

This implementation of multiplication has the caveat that it can be trimmed to only produce a variable number of

2476

digits as output. In each iteration of the outer loop the $pb$ variable is set (line @48,MIN@) to the maximum

2477

number of inner loop iterations.

2478

2479

Inside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the

2480

carry from the previous iteration. A particularly important observation is that most modern optimizing

2481

C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that

2482

is required for the product. In x86 terms for example, this means using the MUL instruction.

2483

2484

Each digit of the product is stored in turn (line @68,tmpt@) and the carry propagated (line @71,>>@) to the

2485

next iteration.

2477

2486

2478

2487

\subsection{Faster Multiplication by the ``Comba'' Method}

2479

2488

MARK,COMBA

2480

2489

2481

One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This

2482

makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after little known

2483

(\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested

2484

carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in

2485

his 1986 paper \cite{BARRETT} written five years before.

2486

2487

At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how

2488

the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the

2489

final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously.

2490

2491

In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a

2492

simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount

2493

of work requiored. Succintly the first step of the algorithm is to compute the product vector $\vec x$ as follows.

2490

One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be

2491

computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement

2492

in parallel. The ``Comba'' \cite{COMBA} method is named after little known (\textit{in cryptographic venues}) Paul G.

2493

Comba who described a method of implementing fast multipliers that do not require nested carry fixup operations. As an

2494

interesting aside it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT} written

2495

five years before.

2496

2497

At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight

2498

twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products

2499

are produced then added together to form the final result. In the baseline algorithm the columns are added together

2500

after each iteration to get the result instantaneously.

2501

2502

In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at

2503

the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated

2504

after the nested loop to reduce the amount of work requiored. Succintly the first step of the algorithm is to compute

2505

the product vector $\vec x$ as follows.

2494

2506

2495

2507

\begin{equation}

2496

2508

\vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace

2584

2596

\textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\

2585

2597

\textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\

2586

2598

\hline \\

2587

Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\

2599

Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\

2588

2600

1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\

2589

2601

2. If step 1 failed return(\textit{MP\_MEM}).\\

2590

2602

2591

Zero the temporary array $\hat W$. \\

2592

3. for $n$ from $0$ to $digs - 1$ do \\

2593

\hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\

2594

2595

Compute the columns. \\

2596

4. for $ix$ from $0$ to $a.used - 1$ do \\

2597

\hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\

2598

\hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\

2599

\hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\

2600

\hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\

2601

2602

Propagate the carries upwards. \\

2603

5. $oldused \leftarrow c.used$ \\

2604

6. $c.used \leftarrow digs$ \\

2605

7. If $digs > 1$ then do \\

2606

\hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\

2607

\hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\

2608

\hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\

2609

8. else do \\

2610

\hspace{3mm}8.1 $ix \leftarrow 0$ \\

2611

9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\

2612

2613

Zero excess digits. \\

2614

10. If $digs < oldused$ then do \\

2615

\hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\

2616

\hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\

2617

11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\

2618

12. Return(\textit{MP\_OKAY}). \\

2603

3. $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\

2604

2605

4. $\_ \hat W \leftarrow 0$ \\

2606

5. for $ix$ from 0 to $pa - 1$ do \\

2607

\hspace{3mm}5.1 $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\

2608

\hspace{3mm}5.2 $tx \leftarrow ix - ty$ \\

2609

\hspace{3mm}5.3 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\

2610

\hspace{3mm}5.4 for $iz$ from 0 to $iy - 1$ do \\

2611

\hspace{6mm}5.4.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx+iy}b_{ty-iy}$ \\

2612

\hspace{3mm}5.5 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\

2613

\hspace{3mm}5.6 $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\

2614

6. $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\

2615

2616

7. $oldused \leftarrow c.used$ \\

2617

8. $c.used \leftarrow digs$ \\

2618

9. for $ix$ from $0$ to $pa$ do \\

2619

\hspace{3mm}9.1 $c_{ix} \leftarrow W_{ix}$ \\

2620

10. for $ix$ from $pa + 1$ to $oldused - 1$ do \\

2621

\hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\

2622

2623

11. Clamp $c$. \\

2624

12. Return MP\_OKAY. \\

2619

2625

\hline

2620

2626

\end{tabular}

2621

2627

\end{center}

2625

2631

\end{figure}

2626

2632

2627

2633

\textbf{Algorithm fast\_s\_mp\_mul\_digs.}

2628

This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm

2629

essentially peforms the same calculation as algorithm s\_mp\_mul\_digs, just much faster.

2630

2631

The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that

2632

unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$.

2633

2634

The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of

2635

a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each

2636

iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism.

2634

This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.

2635

2636

The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the

2637

loop we want to produce one column per pass. This allows the accumulator $\_ \hat W$ to be placed in CPU registers and

2638

reduce the memory bandwidth to two \textbf{mp\_digit} reads per iteration.

2639

2640

The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than

2641

$b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the to the distance past $b.used$ the variable

2642

$ix$ is. This is used for the immediately subsequent statement where we find $iy$.

2643

2644

The variable $iy$ is the minimum digits we can read from either $a$ or $b$ before running out. Computing one column at a time

2645

means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each

2646

pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards we have to

2647

move $ty$ downards so the equality remains valid. The $iy$ variable is the number of iterations until

2648

$tx \ge a.used$ or $ty < 0$ occurs.

2649

2650

After every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator

2651

into the next round by dividing $\_ \hat W$ by $\beta$.

2637

2652

2638

2653

To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the

2639

2654

cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require

2643

2658

2644

2659

EXAM,bn_fast_s_mp_mul_digs.c

2645

2660

2646

The memset on line @47,memset@ clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication

2647

implementation a series of aliases (\textit{lines @67, tmpx@, @70, tmpy@ and @75,_W@}) are used to simplify the inner $O(n^2)$ loop.

2648

In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass.

2649

2650

The inner loop on lines @83,for@, @84,mp_word@ and @85,}@ is where the algorithm will spend the majority of the time, which is why it has been

2651

stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the

2652

very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three

2653

(\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler performs a good job at unrolling the loop

2654

and scheduling the instructions so there are very few dependency stalls.

2655

2656

In theory the difference between the baseline and comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the

2657

baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next

2658

digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can

2659

be simultaneously used.

2661

As per the pseudo--code we first calculate $pa$ (line @47,MIN@) as the number of digits to output. Next we begin the outer loop

2662

to produce the individual columns of the product. We use the two aliases $tmpx$ and $tmpy$ (lines @61,tmpx@, @62,tmpy@) to point

2663

inside the two multiplicands quickly.

2664

2665

The inner loop (lines @70,for@ to @72,}@) of this implementation is where the tradeoff come into play. Originally this comba

2666

implementation was ``row--major'' which means it adds to each of the columns in each pass. After the outer loop it would then fix

2667

the carries. This was very fast except it had an annoying drawback. You had to read a mp\_word and two mp\_digits and write

2668

one mp\_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth

2669

is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded cpus where cache is often

2670

slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the

2671

compiler has aliased $\_ \hat W$ to a CPU register.

2672

2673

After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines @75,W[ix]@, @78,>>@) to forward it as

2674

a carry for the next pass. After the outer loop we use the final carry (line @82,W[ix]@) as the last digit of the product.

2660

2675

2661

2676

\subsection{Polynomial Basis Multiplication}

2662

2677

To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms

2976

2991

2977

2992

EXAM,bn_mp_toom_mul.c

2978

2993

2994

The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very

2995

large numbers. For example, a 10,000 digit multiplication takes approximaly 99,282,205 fewer single precision multiplications with

2996

Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this

2997

algorithm is not practical as Karatsuba has a much lower cutoff point.

2998

2999

First we split $a$ and $b$ into three roughly equal portions. This has been accomplished (lines @40,mod@ to @69,rshd@) with

3000

combinations of mp\_rshd() and mp\_mod\_2d() function calls. At this point $a = a2 \cdot \beta^2 + a1 \cdot \beta + a0$ and similiarly

3001

for $b$.

3002

3003

Next we compute the five points $w0, w1, w2, w3$ and $w4$. Recall that $w0$ and $w4$ can be computed directly from the portions so

3004

we get those out of the way first (lines @72,mul@ and @77,mul@). Next we compute $w1, w2$ and $w3$ using Horners method.

3005

3006

After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively

3007

straight forward.

2979

3008

2980

3009

\subsection{Signed Multiplication}

2981

3010

Now that algorithms to handle multiplications of every useful dimensions have been developed, a rather simple finishing touch is required. So far all

2982

3011

of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established.

2983

3012

2984

\newpage\begin{figure}[!here]

3013

\begin{figure}[!here]

2985

3014

\begin{small}

2986

3015

\begin{center}

2987

3016

\begin{tabular}{l}

3065

3093

The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines

3066

3094

will not handle.

3067

3095

3068

\newpage\begin{figure}[!here]

3096

\begin{figure}[!here]

3069

3097

\begin{small}

3070

3098

\begin{center}

3071

3099

\begin{tabular}{l}

3121

3149

3122

3150

EXAM,bn_s_mp_sqr.c

3123

3151

3124

Inside the outer loop (\textit{see line @32,for@}) the square term is calculated on line @35,r =@. Line @42,>>@ extracts the carry from the square

3125

term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines @45,tmpx@ and @48,tmpt@ respectively. The doubling is performed using two

3126

additions (\textit{see line @57,r + r@}) since it is usually faster than shifting,if not at least as fast.

3152

Inside the outer loop (line @32,for@) the square term is calculated on line @35,r =@. The carry (line @42,>>@) has been

3153

extracted from the mp\_word accumulator using a right shift. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized

3154

(lines @45,tmpx@ and @48,tmpt@) to simplify the inner loop. The doubling is performed using two

3155

additions (line @57,r + r@) since it is usually faster than shifting, if not at least as fast.

3156

3157

The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops

3158

get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to

3159

square a number.

3127

3160

3128

3161

\subsection{Faster Squaring by the ``Comba'' Method}

3129

3162

A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional

3135

3168

that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example,

3136

3169

$ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.

3137

3170

3138

However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word

3139

arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be

3140

moved to a $O(n)$ work level outside the $O(n^2)$ level.

3171

However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two

3172

mp\_word arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and

3173

carry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind.

3141

3174

3142

3175

\newpage\begin{figure}[!here]

3143

3176

\begin{small}

3147

3180

\textbf{Input}. mp\_int $a$ \\

3148

3181

\textbf{Output}. $b \leftarrow a^2$ \\

3149

3182

\hline \\

3150

Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\

3183

Place an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\

3151

3184

1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\

3152

3185

2. If step 1 failed return(\textit{MP\_MEM}). \\

3153

3. for $ix$ from $0$ to $2a.used + 1$ do \\

3154

\hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\

3155

\hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\

3156

4. for $ix$ from $0$ to $a.used - 1$ do \\

3157

\hspace{3mm}Compute the square.\\

3158

\hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\

3159

3160

\hspace{3mm}Compute the double products.\\

3161

\hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\

3162

\hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\

3163

5. $oldused \leftarrow b.used$ \\

3164

6. $b.used \leftarrow 2a.used + 1$ \\

3165

3166

Double the products and propagate the carries simultaneously. \\

3167

7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\

3168

8. for $ix$ from $1$ to $2a.used$ do \\

3169

\hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\

3170

\hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\

3171

\hspace{3mm}8.3 $b_{ix-1} \leftarrow W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\

3172

9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\

3173

10. if $2a.used + 1 < oldused$ then do \\

3174

\hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused$ do \\

3175

\hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\

3176

11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\

3177

12. Return(\textit{MP\_OKAY}). \\

3186

3187

3. $pa \leftarrow 2 \cdot a.used$ \\

3188

4. $\hat W1 \leftarrow 0$ \\

3189

5. for $ix$ from $0$ to $pa - 1$ do \\

3190

\hspace{3mm}5.1 $\_ \hat W \leftarrow 0$ \\

3191

\hspace{3mm}5.2 $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\

3192

\hspace{3mm}5.3 $tx \leftarrow ix - ty$ \\

3193

\hspace{3mm}5.4 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\

3194

\hspace{3mm}5.5 $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\

3195

\hspace{3mm}5.6 for $iz$ from $0$ to $iz - 1$ do \\

3196

\hspace{6mm}5.6.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\

3197

\hspace{3mm}5.7 $\_ \hat W \leftarrow 2 \cdot \_ \hat W + \hat W1$ \\

3198

\hspace{3mm}5.8 if $ix$ is even then \\

3199

\hspace{6mm}5.8.1 $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\

3200

\hspace{3mm}5.9 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\

3201

\hspace{3mm}5.10 $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\

3202

3203

6. $oldused \leftarrow b.used$ \\

3204

7. $b.used \leftarrow 2 \cdot a.used$ \\

3205

8. for $ix$ from $0$ to $pa - 1$ do \\

3206

\hspace{3mm}8.1 $b_{ix} \leftarrow W_{ix}$ \\

3207

9. for $ix$ from $pa$ to $oldused - 1$ do \\

3208

\hspace{3mm}9.1 $b_{ix} \leftarrow 0$ \\

3209

10. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\

3210

11. Return(\textit{MP\_OKAY}). \\

3178

3211

\hline

3179

3212

\end{tabular}

3180

3213

\end{center}

3183

3216

\end{figure}

3184

3217

3185

3218

\textbf{Algorithm fast\_s\_mp\_sqr.}

3186

This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when

3187

the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.

3188

3189

This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second

3190

array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most

3191

processors to simply make it a full size array.

3192

3193

The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the squares of the product. Note how

3194

it simply assigns the value into the $\hat X$ array. The nested loop on step 4.2 computes the doubles of the products. This loop

3195

computes the sum of the products for each column. They are not doubled until later.

3196

3197

After the squaring loop, the products stored in $\hat W$ musted be doubled and the carries propagated forwards. It makes sense to do both

3198

operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double product and the

3199

squares in place.

3219

This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm

3220

s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.

3221

This algorithm is very similar to the Comba multiplier except with a few key differences we shall make note of.

3222

3223

First, we have an accumulator and carry variables $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop

3224

products are to be doubled. If we had added the previous carry in we would be doubling too much. Next we perform an

3225

addition MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal

3226

$a_5 \cdot a_3$. Whereas in the multiplication case we would have $5 < a.used$ and $3 \ge 0$ is maintained since we double the sum

3227

of the products just outside the inner loop we have to avoid doing this. This is also a good thing since we perform

3228

fewer multiplications and the routine ends up being faster.

3229

3230

Finally the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). We add in the square

3231

only to even outputs and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position.

3200

3232

3201

3233

EXAM,bn_fast_s_mp_sqr.c

3202

3234

3235

This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for

3236

the special case of squaring.

3203

3237

3204

3238

\subsection{Polynomial Basis Squaring}

3205

3239

The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception

3312

3345

is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4

3313

3346

it is actually below the Comba limit (\textit{at 110 digits}).

3314

3347

3315

This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are redirected to

3316

the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and mp\_clears are executed normally.

3317

3318

\textit{Last paragraph sucks. re-write! -- Tom}

3348

This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are

3349

redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and

3350

mp\_clears are executed normally.

3319

3351

3320

3352

\subsection{Toom-Cook Squaring}

3321

3353

The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used

3322

instead of multiplication to find the five relations.. The reader is encouraged to read the description of the latter algorithm and try to

3354

instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to

3323

3355

derive their own Toom-Cook squaring algorithm.

3324

3356

3325

3357

\subsection{High Level Squaring}

3362

3394

$\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\

3363

3395

& that have different number of digits in Karatsuba multiplication. \\

3364

3396

& \\

3365

$\left [ 3 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\

3397

$\left [ 2 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\

3366

3398

& of double products and at most one square is stated. Prove this statement. \\

3367

3399

& \\

3368

$\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\

3369

& Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\

3370

& \\

3371

3400

$\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\

3372

3401

& \\

3373

3402

$\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\

3375

3404

$\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\

3376

3405

& required for equation $6.7$ to be true. \\

3377

3406

& \\

3407

$\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\

3408

& compute subsets of the columns in each thread. Determine a cutoff point where \\

3409

& it is effective and add the logic to mp\_mul() and mp\_sqr(). \\

3410

&\\

3411

$\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook. You must \\

3412

& increase the throughput of mp\_exptmod() for random odd moduli in the range \\

3413

& $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\

3414

& \\

3378

3415

\end{tabular}

3379

3416

3380

3417

\chapter{Modular Reduction}

3394

3431

Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions

3395

3432

is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the

3396

3433

RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appears as a fundamental operation in

3397

Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular

3434

elliptic curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular

3398

3435

exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the

3399

3436

range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check

3400

3437

algorithms known as CRCs, error correction codes such as Reed-Solomon and solve a variety of number theoeretic problems.

3610

3647

In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for

3611

3648

future use so that the Barrett algorithm can be used without delay.

3612

3649

3613

\begin{figure}[!here]

3650

\newpage\begin{figure}[!here]

3614

3651

\begin{small}

3615

3652

\begin{center}

3616

3653

\begin{tabular}{l}

5818

5855

defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is

5819

5856

equivalent to equation \ref{eqn:legendre}.

5820

5857

5858

\textit{-- Tom, don't be an ass, cite your source here...!}

5859

5821

5860

\begin{equation}

5822

5861

a^{(p-1)/2} \equiv \begin{array}{rl}

5823

5862

-1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\

Older »