src/gmp-6.1.0/mpn/cray/README - stadia-controller/gcc-arm-none-eabi - Git at Google

 Copyright 2000-2002 Free Software Foundation, Inc.

 This file is part of the GNU MP Library.

 The GNU MP Library is free software; you can redistribute it and/or modify
 it under the terms of either:

   * the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 3 of the License, or (at your
     option) any later version.

 or

   * the GNU General Public License as published by the Free Software
     Foundation; either version 2 of the License, or (at your option) any
     later version.

 or both in parallel, as here.

 The GNU MP Library is distributed in the hope that it will be useful, but
 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 for more details.

 You should have received copies of the GNU General Public License and the
 GNU Lesser General Public License along with the GNU MP Library.  If not,
 see https://www.gnu.org/licenses/.


 The code in this directory works for Cray vector systems such as C90,
 J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
 the T3E and T3D systems, see the `alpha' subdirectory at the same
 level as the directory containing this file.)

 The cfp subdirectory is for systems utilizing the traditional Cray
 floating-point format, and the ieee subdirectory is for the newer
 systems that use the IEEE floating-point format.

 There are several issues that reduces speed on Cray systems.  For
 systems with cfp floating point, the main obstacle is the forming of
 128-bit products.  For IEEE systems, adding, and in particular
 computing carry is the main issue.  There are no vectorizing
 unsigned-less-than instructions, and the sequence that implement that
 operation is very long.

 Shifting is the only operation that is simple to make fast.  All Cray
 systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
 should be really useful.

 For best speed for cfp systems, we need a mul_basecase, since that
 reduces the need for carry propagation to a minimum.  Depending on the
 size (vn) of the smaller of the two operands (V), we should split U and V
 in different chunk sizes:

 U split in 2 32-bit parts
 V split according to the table:
 parts			4	5	6	7	8
 bits/part		16	13	11	10	8
 max allowed vn		1	8	32	64	256
 number of multiplies	8	10	12	14	16
 peak cycles/limb	4	5	6	7	8

 U split in 3 22-bit parts
 V split according to the table:
 parts			3	4	5
 bits/part		22	16	13
 max allowed vn		16	1024	8192
 number of multiplies	9	12	15
 peak cycles/limb	4.5	6	7.5

 U split in 4 16-bit parts
 V split according to the table:
 parts			4
 bits/part		16
 max allowed vn		65536
 number of multiplies	16
 peak cycles/limb	8

 (A T90 CPU can accumulate two products per cycle.)

 IDEA:
 * Rewrite mpn_add_n:
     short cy[n + 1];
     #pragma _CRI ivdep
       for (i = 0; i < n; i++)
 	{ s = up[i] + vp[i];
 	  rp[i] = s;
 	  cy[i + 1] = s < up[i]; }
       more_carries = 0;
     #pragma _CRI ivdep
       for (i = 1; i < n; i++)
 	{ s = rp[i] + cy[i];
 	  rp[i] = s;
 	  more_carries += s < cy[i]; }
       cys = 0;
       if (more_carries)
 	{
 	  cys = rp[1] < cy[1];
 	  for (i = 2; i < n; i++)
 	    { rp[i] += cys;
 	      cys = rp[i] < cys; }
 	}
       return cys + cy[n];

 * Write mpn_add3_n for adding three operands.  First add operands 1
   and 2, and generate cy[].  Then add operand 3 to the partial result,
   and accumulate carry into cy[].  Finally propagate carry just like
   in the new mpn_add_n.

 IDEA:

 Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
 down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb.  By
 storing even fewer bits per limb, perhaps 56, it would be possible to
 write a mul_mul_basecase that would run at effectively 1 cycle/limb.
 (Use VM here to better handle the romb-shaped multiply area, perhaps
 rounding operand sizes up to the next power of 2.)
	Copyright 2000-2002 Free Software Foundation, Inc.

	This file is part of the GNU MP Library.

	The GNU MP Library is free software; you can redistribute it and/or modify
	it under the terms of either:

	* the GNU Lesser General Public License as published by the Free
	Software Foundation; either version 3 of the License, or (at your
	option) any later version.

	or

	* the GNU General Public License as published by the Free Software
	Foundation; either version 2 of the License, or (at your option) any
	later version.

	or both in parallel, as here.

	The GNU MP Library is distributed in the hope that it will be useful, but
	WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
	or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
	for more details.

	You should have received copies of the GNU General Public License and the
	GNU Lesser General Public License along with the GNU MP Library. If not,
	see https://www.gnu.org/licenses/.






	The code in this directory works for Cray vector systems such as C90,
	J90, T90 (both the CFP variant and the IEEE variant) and SV1. (For
	the T3E and T3D systems, see the `alpha' subdirectory at the same
	level as the directory containing this file.)

	The cfp subdirectory is for systems utilizing the traditional Cray
	floating-point format, and the ieee subdirectory is for the newer
	systems that use the IEEE floating-point format.

	There are several issues that reduces speed on Cray systems. For
	systems with cfp floating point, the main obstacle is the forming of
	128-bit products. For IEEE systems, adding, and in particular
	computing carry is the main issue. There are no vectorizing
	unsigned-less-than instructions, and the sequence that implement that
	operation is very long.

	Shifting is the only operation that is simple to make fast. All Cray
	systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
	should be really useful.

	For best speed for cfp systems, we need a mul_basecase, since that
	reduces the need for carry propagation to a minimum. Depending on the
	size (vn) of the smaller of the two operands (V), we should split U and V
	in different chunk sizes:

	U split in 2 32-bit parts
	V split according to the table:
	parts 4 5 6 7 8
	bits/part 16 13 11 10 8
	max allowed vn 1 8 32 64 256
	number of multiplies 8 10 12 14 16
	peak cycles/limb 4 5 6 7 8

	U split in 3 22-bit parts
	V split according to the table:
	parts 3 4 5
	bits/part 22 16 13
	max allowed vn 16 1024 8192
	number of multiplies 9 12 15
	peak cycles/limb 4.5 6 7.5

	U split in 4 16-bit parts
	V split according to the table:
	parts 4
	bits/part 16
	max allowed vn 65536
	number of multiplies 16
	peak cycles/limb 8

	(A T90 CPU can accumulate two products per cycle.)

	IDEA:
	* Rewrite mpn_add_n:
	short cy[n + 1];
	#pragma _CRI ivdep
	for (i = 0; i < n; i++)
	{ s = up[i] + vp[i];
	rp[i] = s;
	cy[i + 1] = s < up[i]; }
	more_carries = 0;
	#pragma _CRI ivdep
	for (i = 1; i < n; i++)
	{ s = rp[i] + cy[i];
	rp[i] = s;
	more_carries += s < cy[i]; }
	cys = 0;
	if (more_carries)
	{
	cys = rp[1] < cy[1];
	for (i = 2; i < n; i++)
	{ rp[i] += cys;
	cys = rp[i] < cys; }
	}
	return cys + cy[n];

	* Write mpn_add3_n for adding three operands. First add operands 1
	and 2, and generate cy[]. Then add operand 3 to the partial result,
	and accumulate carry into cy[]. Finally propagate carry just like
	in the new mpn_add_n.

	IDEA:

	Store fewer bits, perhaps 62, per limb. That brings mpn_add_n time
	down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb. By
	storing even fewer bits per limb, perhaps 56, it would be possible to
	write a mul_mul_basecase that would run at effectively 1 cycle/limb.
	(Use VM here to better handle the romb-shaped multiply area, perhaps
	rounding operand sizes up to the next power of 2.)