src/gmp-6.1.0/mpn/x86/k6/README - stadia-controller/gcc-arm-none-eabi - Git at Google

 Copyright 2000, 2001 Free Software Foundation, Inc.

 This file is part of the GNU MP Library.

 The GNU MP Library is free software; you can redistribute it and/or modify
 it under the terms of either:

   * the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 3 of the License, or (at your
     option) any later version.

 or

   * the GNU General Public License as published by the Free Software
     Foundation; either version 2 of the License, or (at your option) any
     later version.

 or both in parallel, as here.

 The GNU MP Library is distributed in the hope that it will be useful, but
 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 for more details.

 You should have received copies of the GNU General Public License and the
 GNU Lesser General Public License along with the GNU MP Library.  If not,
 see https://www.gnu.org/licenses/.


 			AMD K6 MPN SUBROUTINES


 This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
 K6-3.

 The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
 has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX,
 the separate directories are just so that ./configure can omit them if the
 assembler doesn't support MMX.


 STATUS

 Times for the loops, with all code and data in L1 cache, are as follows.

                                  cycles/limb

 	mpn_add_n/sub_n            3.25 normal, 2.75 in-place

 	mpn_mul_1                  6.25
 	mpn_add/submul_1           7.65-8.4  (varying with data values)

 	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
 	mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                    or 9.2 cycles/triangleproduct (approx)

 	mpn_l/rshift               3.0

 	mpn_divrem_1              20.0
 	mpn_mod_1                 20.0
 	mpn_divexact_by3          11.0

 	mpn_copyi                  1.0
 	mpn_copyd                  1.0


 K6-2 and K6-3 have dual-issue MMX and get the following improvements.

 	mpn_l/rshift               1.75


 Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
 instruction, code seems to run slower, and with just "mov" loads it doesn't
 seem faster.  Results so far are inconsistent.  The K6 does a hardware
 prefetch of the second cache line in a sector, so the penalty for not
 prefetching in software is reduced.


 NOTES

 All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

 Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
 execute them in both X and Y (and in both together).

 Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
 chapter 6 table 12).

 Write-allocate L1 data cache means prefetching of destinations is unnecessary.
 Store queue is 7 entries of 64 bits each.

 Floating point multiplications can be done in parallel with integer
 multiplications, but there doesn't seem to be any way to make use of this.


 OPTIMIZATIONS

 Unrolled loops are used to reduce looping overhead.  The unrolling is
 configurable up to 32 limbs/loop for most routines, up to 64 for some.

 Sometimes computed jumps into the unrolling are used to handle sizes not a
 multiple of the unrolling.  An attractive feature of this is that times
 smoothly increase with operand size, but an indirect jump is about 6 cycles
 and the setups about another 6, so it depends on how much the unrolled code
 is faster than a simple loop as to whether a computed jump ought to be used.

 Position independent code is implemented using a call to get eip for
 computed jumps and a ret is always done, rather than an addl $4,%esp or a
 popl, so the CPU return address branch prediction stack stays synchronised
 with the actual stack in memory.  Such a call however still costs 4 to 7
 cycles.

 Branch prediction, in absence of any history, will guess forward jumps are
 not taken and backward jumps are taken.  Where possible it's arranged that
 the less likely or less important case is under a taken forward jump.


 MMX

 Putting emms or femms as late as possible in a routine seems to be fastest.
 Perhaps an emms or femms stalls until all outstanding MMX instructions have
 completed, so putting it later gives them a chance to complete on their own,
 in parallel with other operations (like register popping).

 The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
 at the start of a routine, in case it's been preceded by x87 floating point
 operations.  This isn't done because in gmp programs it's expected that x87
 floating point won't be much used and that chances are an mpn routine won't
 have been preceded by any x87 code.


 CODING

 Instructions in general code are shown paired if they can decode and execute
 together, meaning two short decode instructions with the second not
 depending on the first, only the first using the shifter, no more than one
 load, and no more than one store.

 K6 does some out of order execution so the pairings aren't essential, they
 just show what slots might be available.  When decoding is the limiting
 factor things can be scheduled that might not execute until later.


 NOTES

 Code alignment

 - if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
   short decode is inhibited.  The cross.pl script detects this.

 - loops and branch targets should be aligned to 16 bytes, or ensure at least
   2 instructions before a 32 byte boundary.  This makes use of the 16 byte
   cache in the BTB.

 Addressing modes

 - (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
   problem, and can be used as an equivalent, or easier is just to use a
   different register, like %ebx.

 - K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
   have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

   If more than 3 bytes are needed to determine instruction length then
   decoding degrades from direct to long, or from long to vector.  This
   happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
   with mod=00 the sib determines whether there's a displacement.

   This affects all MMX and 3DNow instructions, and others with an 0F prefix,
   like movzbl.  The modes affected are anything with an index and no
   displacement, or an index but no base, and this includes (%esp) which is
   really (,%esp,1).

   The cross.pl script detects problem cases.  The workaround is to always
   use a displacement, and to do this with Zdisp if it's zero so the
   assembler doesn't discard it.

   See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
   13-14 and 36-37.

 Calls

 - indirect jumps and calls are not branch predicted, they measure about 6
   cycles.

 Various

 - adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
 - bsf       12-27 cycles
 - emms      5 cycles
 - femms     3 cycles
 - jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
 - divl      20 cycles back-to-back
 - imull     2 decode, 3 execute
 - mull      2 decode, 3 execute (optimization manual decoding sample)
 - prefetch  2 cycles
 - rcll/rcrl implicit by one bit: 2 cycles
             immediate or %cl count: 11 + 2 per bit for dword
                                     13 + 4 per bit for byte
 - setCC	    2 cycles
 - xchgl	%eax,reg  1.5 cycles, back-to-back (strange)
         reg,reg   2 cycles, back-to-back


 REFERENCES

 "AMD-K6 Processor Code Optimization Application Note", AMD publication
 number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
 K6-3.  Available on-line,

 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

 "AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
 publication number 21828, revision A amendment 0, August 1997.  This is an
 older edition of the above document, describing plain K6.  Available
 on-line,

 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

 "3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
 This describes the femms and prefetch instructions, but nothing else from
 3DNow has been used.  Available on-line,

 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

 "3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
 August 1999.  This has some notes on general K6 optimizations as well as
 3DNow.  Available on-line,

 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf


 ----------------
 Local variables:
 mode: text
 fill-column: 76
 End:
	Copyright 2000, 2001 Free Software Foundation, Inc.

	This file is part of the GNU MP Library.

	The GNU MP Library is free software; you can redistribute it and/or modify
	it under the terms of either:

	* the GNU Lesser General Public License as published by the Free
	Software Foundation; either version 3 of the License, or (at your
	option) any later version.

	or

	* the GNU General Public License as published by the Free Software
	Foundation; either version 2 of the License, or (at your option) any
	later version.

	or both in parallel, as here.

	The GNU MP Library is distributed in the hope that it will be useful, but
	WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
	or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
	for more details.

	You should have received copies of the GNU General Public License and the
	GNU Lesser General Public License along with the GNU MP Library. If not,
	see https://www.gnu.org/licenses/.




	AMD K6 MPN SUBROUTINES



	This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
	K6-3.

	The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
	has MMX code suiting K6-2 and K6-3. All chips in the K6 family have MMX,
	the separate directories are just so that ./configure can omit them if the
	assembler doesn't support MMX.




	STATUS

	Times for the loops, with all code and data in L1 cache, are as follows.

	cycles/limb

	mpn_add_n/sub_n 3.25 normal, 2.75 in-place

	mpn_mul_1 6.25
	mpn_add/submul_1 7.65-8.4 (varying with data values)

	mpn_mul_basecase 9.25 cycles/crossproduct (approx)
	mpn_sqr_basecase 4.7 cycles/crossproduct (approx)
	or 9.2 cycles/triangleproduct (approx)

	mpn_l/rshift 3.0

	mpn_divrem_1 20.0
	mpn_mod_1 20.0
	mpn_divexact_by3 11.0

	mpn_copyi 1.0
	mpn_copyd 1.0


	K6-2 and K6-3 have dual-issue MMX and get the following improvements.

	mpn_l/rshift 1.75


	Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch"
	instruction, code seems to run slower, and with just "mov" loads it doesn't
	seem faster. Results so far are inconsistent. The K6 does a hardware
	prefetch of the second cache line in a sector, so the penalty for not
	prefetching in software is reduced.




	NOTES

	All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

	Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
	execute them in both X and Y (and in both together).

	Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
	chapter 6 table 12).

	Write-allocate L1 data cache means prefetching of destinations is unnecessary.
	Store queue is 7 entries of 64 bits each.

	Floating point multiplications can be done in parallel with integer
	multiplications, but there doesn't seem to be any way to make use of this.



	OPTIMIZATIONS

	Unrolled loops are used to reduce looping overhead. The unrolling is
	configurable up to 32 limbs/loop for most routines, up to 64 for some.

	Sometimes computed jumps into the unrolling are used to handle sizes not a
	multiple of the unrolling. An attractive feature of this is that times
	smoothly increase with operand size, but an indirect jump is about 6 cycles
	and the setups about another 6, so it depends on how much the unrolled code
	is faster than a simple loop as to whether a computed jump ought to be used.

	Position independent code is implemented using a call to get eip for
	computed jumps and a ret is always done, rather than an addl $4,%esp or a
	popl, so the CPU return address branch prediction stack stays synchronised
	with the actual stack in memory. Such a call however still costs 4 to 7
	cycles.

	Branch prediction, in absence of any history, will guess forward jumps are
	not taken and backward jumps are taken. Where possible it's arranged that
	the less likely or less important case is under a taken forward jump.



	MMX

	Putting emms or femms as late as possible in a routine seems to be fastest.
	Perhaps an emms or femms stalls until all outstanding MMX instructions have
	completed, so putting it later gives them a chance to complete on their own,
	in parallel with other operations (like register popping).

	The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
	at the start of a routine, in case it's been preceded by x87 floating point
	operations. This isn't done because in gmp programs it's expected that x87
	floating point won't be much used and that chances are an mpn routine won't
	have been preceded by any x87 code.



	CODING

	Instructions in general code are shown paired if they can decode and execute
	together, meaning two short decode instructions with the second not
	depending on the first, only the first using the shifter, no more than one
	load, and no more than one store.

	K6 does some out of order execution so the pairings aren't essential, they
	just show what slots might be available. When decoding is the limiting
	factor things can be scheduled that might not execute until later.



	NOTES

	Code alignment

	- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
	short decode is inhibited. The cross.pl script detects this.

	- loops and branch targets should be aligned to 16 bytes, or ensure at least
	2 instructions before a 32 byte boundary. This makes use of the 16 byte
	cache in the BTB.

	Addressing modes

	- (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this
	problem, and can be used as an equivalent, or easier is just to use a
	different register, like %ebx.

	- K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3
	have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

	If more than 3 bytes are needed to determine instruction length then
	decoding degrades from direct to long, or from long to vector. This
	happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
	with mod=00 the sib determines whether there's a displacement.

	This affects all MMX and 3DNow instructions, and others with an 0F prefix,
	like movzbl. The modes affected are anything with an index and no
	displacement, or an index but no base, and this includes (%esp) which is
	really (,%esp,1).

	The cross.pl script detects problem cases. The workaround is to always
	use a displacement, and to do this with Zdisp if it's zero so the
	assembler doesn't discard it.

	See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
	13-14 and 36-37.

	Calls

	- indirect jumps and calls are not branch predicted, they measure about 6
	cycles.

	Various

	- adcl 2 cycles of decode, maybe 2 cycles executing in the X pipe
	- bsf 12-27 cycles
	- emms 5 cycles
	- femms 3 cycles
	- jecxz 2 cycles taken, 13 not taken (optimization manual says 7 not taken)
	- divl 20 cycles back-to-back
	- imull 2 decode, 3 execute
	- mull 2 decode, 3 execute (optimization manual decoding sample)
	- prefetch 2 cycles
	- rcll/rcrl implicit by one bit: 2 cycles
	immediate or %cl count: 11 + 2 per bit for dword
	13 + 4 per bit for byte
	- setCC 2 cycles
	- xchgl %eax,reg 1.5 cycles, back-to-back (strange)
	reg,reg 2 cycles, back-to-back




	REFERENCES

	"AMD-K6 Processor Code Optimization Application Note", AMD publication
	number 21924, revision D amendment 0, January 2000. This describes K6-2 and
	K6-3. Available on-line,

	http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

	"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
	publication number 21828, revision A amendment 0, August 1997. This is an
	older edition of the above document, describing plain K6. Available
	on-line,

	http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

	"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
	This describes the femms and prefetch instructions, but nothing else from
	3DNow has been used. Available on-line,

	http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

	"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
	August 1999. This has some notes on general K6 optimizations as well as
	3DNow. Available on-line,

	http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



	----------------
	Local variables:
	mode: text
	fill-column: 76
	End: