| Copyright 2001 Free Software Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify |
| it under the terms of the GNU Lesser General Public License as published by |
| the Free Software Foundation; either version 2.1 of the License, or (at your |
| option) any later version. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public |
| License for more details. |
| |
| You should have received a copy of the GNU Lesser General Public License |
| along with the GNU MP Library; see the file COPYING.LIB. If not, write to |
| the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA |
| 02110-1301, USA. |
| |
| |
| |
| |
| INTEL PENTIUM-4 MPN SUBROUTINES |
| |
| |
| This directory contains mpn functions optimized for Intel Pentium-4. |
| |
| The mmx subdirectory has routines using MMX instructions, the sse2 |
| subdirectory has routines using SSE2 instructions. All P4s have these, the |
| separate directories are just so configure can omit that code if the |
| assembler doesn't support it. |
| |
| |
| STATUS |
| |
| cycles/limb |
| |
| mpn_add_n/sub_n 4 normal, 6 in-place |
| |
| mpn_mul_1 4 normal, 6 in-place |
| mpn_addmul_1 6 |
| mpn_submul_1 7 |
| |
| mpn_mul_basecase 6 cycles/crossproduct (approx) |
| |
| mpn_sqr_basecase 3.5 cycles/crossproduct (approx) |
| or 7.0 cycles/triangleproduct (approx) |
| |
| mpn_l/rshift 1.75 |
| |
| |
| |
| The shifts ought to be able to go at 1.5 c/l, but not much effort has been |
| applied to them yet. |
| |
| In-place operations, and all addmul, submul, mul_basecase and sqr_basecase |
| calls, suffer from pipeline anomalies associated with write combining and |
| movd reads and writes to the same or nearby locations. The movq |
| instructions do not trigger the same hardware problems. Unfortunately, |
| using movq and splitting/combining seems to require too many extra |
| instructions to help. Perhaps future chip steppings will be better. |
| |
| |
| |
| NOTES |
| |
| The Pentium-4 pipeline "Netburst", provides for quite a number of surprises. |
| Many traditional x86 instructions run very slowly, requiring use of |
| alterative instructions for acceptable performance. |
| |
| adcl and sbbl are quite slow at 8 cycles for reg->reg. paddq of 32-bits |
| within a 64-bit mmx register seems better, though the combination |
| paddq/psrlq when propagating a carry is still a 4 cycle latency. |
| |
| incl and decl should be avoided, instead use add $1 and sub $1. Apparently |
| the carry flag is not separately renamed, so incl and decl depend on all |
| previous flags-setting instructions. |
| |
| shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest |
| integer instructions (addl, subl, orl, andl, and some more). shldl and |
| shrdl seem to have 13 and 15 cycles latency, respectively. Bizarre. |
| |
| movq mmx -> mmx does have 6 cycle latency, as noted in the documentation. |
| pxor/por or similar combination at 2 cycles latency can be used instead. |
| The movq however executes in the float unit, thereby saving MMX execution |
| resources. With the right juggling, data moves shouldn't be on a dependent |
| chain. |
| |
| L1 is write-through, but the write-combining sounds like it does enough to |
| not require explicit destination prefetching. |
| |
| xmm registers so far haven't found a use, but not much effort has been |
| expended. A configure test for whether the operating system knows |
| fxsave/fxrestor will be needed if they're used. |
| |
| |
| |
| REFERENCES |
| |
| Intel Pentium-4 processor manuals, |
| |
| http://developer.intel.com/design/pentium4/manuals |
| |
| "Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001, |
| order number 248966. Available on-line: |
| |
| http://developer.intel.com/design/pentium4/manuals/248966.htm |
| |
| |
| |
| ---------------- |
| Local variables: |
| mode: text |
| fill-column: 76 |
| End: |