Copyright 1997, 1999, 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA

This directory contains mpn functions for 64-bit V9 SPARC

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle.  The 64-bit integer multiply instruction mulx takes from 5 to 35
cycles, depending on the position of the most significant bit of the first
source operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction and use floating-point operations instead.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
some similarly bizarre restriction.)  We don't use these instructions.

Integer branches can issue with two integer arithmetic instructions.  Likewise
for integer loads.  Four instructions may issue (iop, iop, ld/st/fop,
branch/fop) but only if a branch or fop is last.

Timings on UltraSPARC-1/2:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: The current code runs at 4 cycles/limb.

* mul_1, addmul_1, submul_1: The current code runs at about 33 cycles/limb.  By
  splitting the invariant operand into 16-bit chunks and the other operand into
  32-bit chunks, we could reach 14 cycles/limb.
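
The chunk sizes suggested for mul_1 can be illustrated in portable C.  A 16-bit
chunk times a 32-bit chunk is below 2^48, so it is exact in an IEEE double
(53-bit significand) and converts back to an integer without rounding.  The
sketch below is only an illustration of that idea, not the code in this
directory: the names mul_1_fp_sketch and add_shifted are hypothetical, and the
integer accumulation is written for clarity rather than for the Ultra pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add (p << shift) into the 128-bit accumulator hi:lo.
   Assumes p < 2^48 and shift <= 80, so the shifted value fits in 128 bits. */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned shift)
{
    uint64_t add_lo, add_hi;
    if (shift == 0) {
        add_lo = p;
        add_hi = 0;
    } else if (shift < 64) {
        add_lo = p << shift;
        add_hi = p >> (64 - shift);
    } else {
        add_lo = 0;
        add_hi = p << (shift - 64);
    }
    uint64_t old = *lo;
    *lo += add_lo;
    *hi += add_hi + (*lo < old);   /* propagate the carry out of the low word */
}

/* Hypothetical sketch: multiply {up, n} by the invariant limb v, store the
   low n limbs at rp, and return the most significant (carry-out) limb. */
uint64_t mul_1_fp_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    assert(n >= 1);

    /* Split the invariant operand into four 16-bit chunks, once. */
    double vd[4];
    for (int j = 0; j < 4; j++)
        vd[j] = (double)((v >> (16 * j)) & 0xFFFF);

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Split the other operand into two 32-bit chunks. */
        double u_lo = (double)(up[i] & 0xFFFFFFFFu);
        double u_hi = (double)(up[i] >> 32);

        uint64_t lo = carry, hi = 0;
        for (int j = 0; j < 4; j++) {
            /* Each product is < 2^48, hence exact in double precision. */
            uint64_t p_lo = (uint64_t)(u_lo * vd[j]);
            uint64_t p_hi = (uint64_t)(u_hi * vd[j]);
            add_shifted(&hi, &lo, p_lo, 16 * j);        /* weight 2^(16j)    */
            add_shifted(&hi, &lo, p_hi, 32 + 16 * j);   /* weight 2^(32+16j) */
        }
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Each limb needs eight floating-point multiplies in this scheme.  Calling
mul_1_fp_sketch(rp, up, 1, 5) with up[0] = 3 stores 15 in rp[0] and returns 0.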