| This directory contains mpn functions optimized for DEC Alpha processors. |
| |
| RELEVANT OPTIMIZATION ISSUES |
| |
| EV4 |
| |
| 1. This chip has very limited store bandwidth. The on-chip L1 cache is |
| write-through, and a cache line is transferred from the store buffer to the |
| off-chip L2 in as much as 15 cycles on most systems. This delay hurts |
| mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift. |
| |
| 2. Pairing is possible between memory instructions and integer arithmetic |
| instructions. |
| |
| 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of |
| these cycles are pipelined. Thus, multiply instructions can be issued at a |
| rate of one every 21st cycle. |
| |
| EV5 |
| |
| 1. The memory bandwidth of this chip seems excellent, both for loads and |
| stores. Even when the working set is larger than the on-chip L1 and L2 |
| caches, the performance remains almost unaffected. |
| |
| 2. mulq has a measured latency of 13 cycles and an issue rate of one every |
| 8th cycle. umulh has a measured latency of 15 cycles and an issue rate of |
| one every 10th cycle. But the exact timing is somewhat confusing. |
| |
| 3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, of which 12 |
| are memory operations. This will take at least |
|     ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles |
| We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data |
| cache cycles, which should be completely hidden in the 20 issue cycles. |
| The computation is inherently serial, with these dependencies: |
|           addq |
|          /    \ |
|       addq   cmpult |
|         |       | |
|      cmpult     | |
|           \    / |
|             or |
| I.e., there is a 4-cycle path for each limb, making 16 cycles the absolute |
| minimum. We could replace the `or' with a cmoveq/cmovne, which would save |
| a cycle on EV5, but that might waste a cycle on EV4. Also, cmov takes 2 |
| cycles. |
|           addq |
|          /    \ |
|       addq   cmpult |
|         |        \ |
|      cmpult -> cmovne |
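
For readers who prefer C to dependency graphs, the per-limb carry chain
described in the mpn_add_n item can be sketched roughly as follows. This is
only an illustration, not the assembly code in this directory: the name
add_n_sketch, the use of uint64_t as the limb type, and the exact ordering of
the operations are assumptions made for the example. Each statement is
annotated with the Alpha instruction it is meant to correspond to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mpn_add_n (name and limb type are assumptions).
   Adds the n-limb numbers at up and vp, stores the n-limb sum at rp, and
   returns the carry out of the most significant limb (0 or 1). */
uint64_t add_n_sketch(uint64_t *rp, const uint64_t *up,
                      const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;                    /* carry into the current limb */

    for (size_t i = 0; i < n; i++) {
        uint64_t s  = up[i] + vp[i];    /* addq:   limb sum, may wrap      */
        uint64_t c1 = s < vp[i];        /* cmpult: carry out of up + vp    */
        uint64_t r  = s + cy;           /* addq:   add the incoming carry  */
        uint64_t c2 = r < cy;           /* cmpult: carry out of s + cy     */

        rp[i] = r;
        cy = c1 | c2;                   /* or: at most one of c1, c2 is 1  */
    }
    return cy;
}
```

The two carries can never both be set for the same limb, which is why a plain
`or' (or a conditional move, as discussed above) is enough to combine them.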
| |
| STATUS |
| |