libgcrypt-1.4.6/mpi/i586/README - nest-learning-thermostat/5.1.5/libgcrypt - Git at Google

 This directory contains mpn functions optimized for Intel Pentium
 processors.

 RELEVANT OPTIMIZATION ISSUES

 1. Pentium doesn't allocate cache lines on writes, unlike most other modern
 processors.  Since the functions in the mpn class do array writes, we have to
 handle allocating the destination cache lines by reading a word from it in the
 loops, to achieve the best performance.

 2. Pairing of memory operations requires that the two issued operations refer
 to different cache banks.  The simplest way to insure this is to read/write
 two words from the same object.  If we make operations on different objects,
 they might or might not be to the same cache bank.

 STATUS

 1. mpn_lshift and mpn_rshift run at about 6 cycles/limb, but the Pentium
 documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
 or 5 cycles/limb asymptotically.

 2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
 overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.

 3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
 should...
	This directory contains mpn functions optimized for Intel Pentium
	processors.

	RELEVANT OPTIMIZATION ISSUES

	1. Pentium doesn't allocate cache lines on writes, unlike most other modern
	processors. Since the functions in the mpn class do array writes, we have to
	handle allocating the destination cache lines by reading a word from it in the
	loops, to achieve the best performance.

	2. Pairing of memory operations requires that the two issued operations refer
	to different cache banks. The simplest way to insure this is to read/write
	two words from the same object. If we make operations on different objects,
	they might or might not be to the same cache bank.

	STATUS

	1. mpn_lshift and mpn_rshift run at about 6 cycles/limb, but the Pentium
	documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
	or 5 cycles/limb asymptotically.

	2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop
	overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.

	3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
	should...