| Copyright 2000, 2001 Free Software Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify |
| it under the terms of either: |
| |
| * the GNU Lesser General Public License as published by the Free |
| Software Foundation; either version 3 of the License, or (at your |
| option) any later version. |
| |
| or |
| |
| * the GNU General Public License as published by the Free Software |
| Foundation; either version 2 of the License, or (at your option) any |
| later version. |
| |
| or both in parallel, as here. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License |
| for more details. |
| |
| You should have received copies of the GNU General Public License and the |
| GNU Lesser General Public License along with the GNU MP Library. If not, |
| see https://www.gnu.org/licenses/. |
| |
| |
| |
| |
| AMD K6 MPN SUBROUTINES |
| |
| |
| |
| This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and |
| K6-3. |
| |
| The mmx subdirectory has MMX code suiting plain K6, and the k62mmx |
| subdirectory has MMX code suiting K6-2 and K6-3. All chips in the K6 family |
| have MMX; the separate directories exist only so that ./configure can omit |
| them if the assembler doesn't support MMX. |
| |
| |
| |
| |
| STATUS |
| |
| Times for the loops, with all code and data in L1 cache, are as follows. |
| |
|                        cycles/limb |
| |
| mpn_add_n/sub_n        3.25 normal, 2.75 in-place |
| |
| mpn_mul_1              6.25 |
| mpn_add/submul_1       7.65-8.4 (varying with data values) |
| |
| mpn_mul_basecase       9.25 cycles/crossproduct (approx) |
| mpn_sqr_basecase       4.7 cycles/crossproduct (approx) |
|                        or 9.2 cycles/triangleproduct (approx) |
| |
| mpn_l/rshift           3.0 |
| |
| mpn_divrem_1           20.0 |
| mpn_mod_1              20.0 |
| mpn_divexact_by3       11.0 |
| |
| mpn_copyi              1.0 |
| mpn_copyd              1.0 |
| |
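| As a point of reference for the cycles/limb figures, the following is a |
| minimal portable C sketch of what mpn_add_n computes. It is not the K6 |
| assembly in this directory. The name ref_add_n and the fixed 32-bit limb |
| type are assumptions made for the example. |
| |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not the GMP source.  Adds two n-limb
   numbers stored least significant limb first, writes the n-limb sum to
   dst and returns the carry out (0 or 1).  A 32-bit limb is assumed. */
uint32_t ref_add_n(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                   size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;   /* wraps to 0 if a[i] is all ones */
        uint32_t c1 = s < carry;     /* carry from adding the carry in */
        uint32_t r = s + b[i];
        uint32_t c2 = r < b[i];      /* carry from adding b[i] */
        dst[i] = r;                  /* a[i], b[i] already read, so
                                        dst may be the same as a or b */
        carry = c1 | c2;             /* at most one of c1, c2 is set */
    }
    return carry;
}
```
| |
| At 3.25 cycles/limb, a 100 limb addition would spend roughly 325 cycles in |
| the loop, not counting call and setup overhead. |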
| |
| K6-2 and K6-3 have dual-issue MMX and get the following improvements. |
| |
| mpn_l/rshift           1.75 |
| |
| |
| Prefetching of sources hasn't yet shown any benefit. With the 3DNow |
| "prefetch" instruction code seems to run slower, and with plain "mov" loads |
| it doesn't seem any faster. Results so far are inconsistent. The K6 does a |
| hardware prefetch of the second cache line in a sector, so the penalty for |
| not prefetching in software is reduced. |
| |
| |
| |
| |
| NOTES |
| |
| All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow. |
| |
| Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can |
| execute them in both X and Y (and in both together). |
| |
| Branch misprediction penalty is 1 to 4 cycles (Optimization Manual |
| chapter 6 table 12). |
| |
| Write-allocate L1 data cache means prefetching of destinations is unnecessary. |
| Store queue is 7 entries of 64 bits each. |
| |
| Floating point multiplications can be done in parallel with integer |
| multiplications, but there doesn't seem to be any way to make use of this. |
| |
| |
| |
| OPTIMIZATIONS |
| |
| Unrolled loops are used to reduce looping overhead. The unrolling is |
| configurable up to 32 limbs/loop for most routines, up to 64 for some. |
| |
| Sometimes computed jumps into the unrolling are used to handle sizes not a |
| multiple of the unrolling. An attractive feature of this is that times |
| increase smoothly with operand size. But an indirect jump costs about 6 |
| cycles and the setup about another 6, so whether a computed jump is |
| worthwhile depends on how much faster the unrolled code is than a simple |
| loop. |
| |
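| In C, the closest analogue of a computed jump into an unrolled loop is a |
| switch that enters the loop body part-way through (commonly known as Duff's |
| device). The sketch below is illustrative only. The name copy_unrolled4 |
| and the 4-way unrolling are assumptions for the example, not code from this |
| directory. |
| |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy n limbs, low to high, with a 4-way unrolled
   loop.  The switch jumps into the middle of the unrolled body on the
   first pass so that sizes not a multiple of 4 need no separate loop. */
void copy_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;

    size_t passes = (n + 3) / 4;   /* trips round the loop, first one
                                      possibly partial */
    switch (n % 4) {               /* C analogue of the computed jump */
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```
| |
| The entry point costs one indirect jump however many limbs are left over, |
| which is why times increase smoothly with size. |
| |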
| Position independent code for computed jumps uses a call to get eip. The |
| called code always returns with a ret, rather than discarding the return |
| address with an addl $4,%esp or a popl, so that the CPU return address |
| branch prediction stack stays synchronised with the actual stack in memory. |
| Such a call still costs 4 to 7 cycles, however. |
| |
| Branch prediction, in absence of any history, will guess forward jumps are |
| not taken and backward jumps are taken. Where possible it's arranged that |
| the less likely or less important case is under a taken forward jump. |
| |
| |
| |
| MMX |
| |
| Putting emms or femms as late as possible in a routine seems to be fastest. |
| Perhaps an emms or femms stalls until all outstanding MMX instructions have |
| completed, so putting it later gives them a chance to complete on their own, |
| in parallel with other operations (like register popping). |
| |
| The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3 |
| at the start of a routine, in case it's been preceded by x87 floating point |
| operations. This isn't done because in gmp programs it's expected that x87 |
| floating point won't be much used and that chances are an mpn routine won't |
| have been preceded by any x87 code. |
| |
| |
| |
| CODING |
| |
| Instructions in general code are shown paired if they can decode and execute |
| together, meaning two short decode instructions with the second not |
| depending on the first, only the first using the shifter, no more than one |
| load, and no more than one store. |
| |
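| The pairing conditions above can be written down as a predicate. This is |
| only a sketch of the rule as stated. The struct insn type, its fields and |
| the name can_pair are hypothetical, and a real check would have to derive |
| these properties from the instruction encodings. |
| |
```c
/* Hypothetical description of one instruction, for illustration only. */
struct insn {
    int short_decode;     /* 1 if it is a short decode instruction */
    int uses_shifter;     /* 1 if it needs the shifter */
    int loads;            /* number of loads it does, 0 or 1 */
    int stores;           /* number of stores it does, 0 or 1 */
    int depends_on_prev;  /* 1 if it depends on the preceding instruction */
};

/* Return 1 if first and second could be shown as a pair: both short
   decode, second independent of first, only the first using the shifter,
   and no more than one load and one store between them. */
int can_pair(const struct insn *first, const struct insn *second)
{
    return first->short_decode && second->short_decode
        && !second->depends_on_prev
        && !second->uses_shifter
        && first->loads + second->loads <= 1
        && first->stores + second->stores <= 1;
}
```
| |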
| K6 does some out of order execution, so the pairings aren't essential; they |
| just show what slots might be available. When decoding is the limiting |
| factor, things can be scheduled that might not execute until later. |
| |
| |
| |
| NOTES |
| |
| Code alignment |
| |
| - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary, |
| short decode is inhibited. The cross.pl script detects this. |
| |
| - loops and branch targets should be aligned to 16 bytes, or placed so that |
| at least 2 instructions come before a 32 byte boundary. This makes use of |
| the 16 byte cache in the BTB. |
| |
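| The two alignment conditions above are simple offset arithmetic. The |
| following sketch assumes a 32 byte cache line, matching the 32 byte boundary |
| mentioned above. The names CACHE_LINE, crosses_line and pad_to_16 are |
| hypothetical, and this is not how cross.pl is implemented. |
| |
```c
#include <stddef.h>

/* Assumed cache line size in bytes, for illustration. */
#define CACHE_LINE 32

/* Return 1 if the len bytes starting at byte offset off lie in more than
   one cache line.  For the rule above, len would be 2 for opcode/modrm
   and 3 for 0Fh/opcode/modrm. */
int crosses_line(size_t off, size_t len)
{
    if (len == 0)
        return 0;
    return off / CACHE_LINE != (off + len - 1) / CACHE_LINE;
}

/* Return the number of padding bytes needed to bring offset off up to
   the next multiple of 16, or 0 if it is already aligned. */
size_t pad_to_16(size_t off)
{
    return (16 - off % 16) % 16;
}
```
| |
| For example, a 3 byte 0Fh/opcode/modrm sequence starting at offset 30 |
| crosses into the next line, while the same sequence at offset 32 does not. |
| |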
| Addressing modes |
| |
| - (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this |
| problem and can be used as an equivalent, or more simply a different |
| register can be used, like %ebx. |
| |
| - K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3 |
| have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F). |
| |
| If more than 3 bytes are needed to determine instruction length then |
| decoding degrades from direct to long, or from long to vector. This |
| happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since |
| with mod=00 the sib determines whether there's a displacement. |
| |
| This affects all MMX and 3DNow instructions, and others with an 0F prefix, |
| like movzbl. The modes affected are anything with an index and no |
| displacement, or an index but no base, and this includes (%esp) which is |
| really (,%esp,1). |
| |
| The cross.pl script detects problem cases. The workaround is to always |
| use a displacement, and to do this with Zdisp if it's zero so the |
| assembler doesn't discard it. |
| |
| See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages |
| 13-14 and 36-37. |
| |
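| The mod/rm=00-xxx-100 condition described above is a mask and compare on |
| the mod/rm byte. The sketch below shows just that test. The function name |
| is hypothetical, and the caller is assumed to have already established that |
| the instruction has an 0F prefix and a mod/rm byte. |
| |
```c
#include <stdint.h>

/* Return 1 if modrm has mod=00 and r/m=100, meaning a SIB byte follows
   and it is the SIB that determines whether there's a displacement.
   Bit layout of mod/rm is mod (bits 7-6), reg (bits 5-3), r/m (bits 2-0),
   so the reg field is masked out with 0xC7. */
int modrm_length_needs_sib(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x04;
}
```
| |
| With a displacement forced, mod becomes 01 or 10 and the test no longer |
| matches, which is the effect of the Zdisp workaround. |
| |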
| Calls |
| |
| - indirect jumps and calls are not branch predicted and measure about 6 |
| cycles. |
| |
| Various |
| |
| - adcl            2 cycles of decode, maybe 2 cycles executing in the X pipe |
| - bsf             12-27 cycles |
| - emms            5 cycles |
| - femms           3 cycles |
| - jecxz           2 cycles taken, 13 not taken (optimization manual says 7 not taken) |
| - divl            20 cycles back-to-back |
| - imull           2 decode, 3 execute |
| - mull            2 decode, 3 execute (optimization manual decoding sample) |
| - prefetch        2 cycles |
| - rcll/rcrl       implicit by one bit: 2 cycles |
|                   immediate or %cl count: 11 + 2 per bit for dword |
|                                           13 + 4 per bit for byte |
| - setCC           2 cycles |
| - xchgl           %eax,reg  1.5 cycles, back-to-back (strange) |
|                   reg,reg   2 cycles, back-to-back |
| |
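| As a worked example of the rcll/rcrl figures above, the following sketch |
| compares one rotate by a count against a sequence of implicit one bit |
| rotates. The function names are hypothetical, and it is an assumption that |
| the 2 cycle figure simply adds up over a back-to-back sequence. |
| |
```c
/* Estimated cycles for one rcl/rcr with an immediate or %cl count,
   using the figures measured above. */
unsigned rc_count_cycles(unsigned bits, int is_byte)
{
    return is_byte ? 13 + 4 * bits : 11 + 2 * bits;
}

/* Estimated cycles for the same rotation done as a sequence of implicit
   one bit rotates, assuming 2 cycles each with no overlap. */
unsigned rc_single_cycles(unsigned bits)
{
    return 2 * bits;
}
```
| |
| For a 4 bit dword rotate this gives 11 + 2*4 = 19 cycles for the count form |
| against 2*4 = 8 cycles for four one bit rotates. |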
| |
| |
| |
| REFERENCES |
| |
| "AMD-K6 Processor Code Optimization Application Note", AMD publication |
| number 21924, revision D amendment 0, January 2000. This describes K6-2 and |
| K6-3. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf |
| |
| "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD |
| publication number 21828, revision A amendment 0, August 1997. This is an |
| older edition of the above document, describing plain K6. Available |
| on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf |
| |
| "3DNow Technology Manual", AMD publication number 21928G/0-March 2000. |
| This describes the femms and prefetch instructions, but nothing else from |
| 3DNow has been used. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf |
| |
| "3DNow Instruction Porting Guide", AMD publication number 22621, revision B, |
| August 1999. This has some notes on general K6 optimizations as well as |
| 3DNow. Available on-line, |
| |
| http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf |
| |
| |
| |
| ---------------- |
| Local variables: |
| mode: text |
| fill-column: 76 |
| End: |