| Copyright 2000-2002, 2004 Free Software Foundation, Inc. |
| |
| This file is part of the GNU MP Library. |
| |
| The GNU MP Library is free software; you can redistribute it and/or modify |
| it under the terms of either: |
| |
| * the GNU Lesser General Public License as published by the Free |
| Software Foundation; either version 3 of the License, or (at your |
| option) any later version. |
| |
| or |
| |
| * the GNU General Public License as published by the Free Software |
| Foundation; either version 2 of the License, or (at your option) any |
| later version. |
| |
| or both in parallel, as here. |
| |
| The GNU MP Library is distributed in the hope that it will be useful, but |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License |
| for more details. |
| |
| You should have received copies of the GNU General Public License and the |
| GNU Lesser General Public License along with the GNU MP Library. If not, |
| see https://www.gnu.org/licenses/. |
| |
| |
| |
| |
| |
| GMP SPEED MEASURING AND PARAMETER TUNING |
| |
| |
| The programs in this directory are for knowledgeable users who want to |
| measure GMP routines on their machine, and perhaps tweak some settings or |
| identify things that can be improved. |
| |
| The programs here are tools, not ready to run solutions. Nothing is built |
| in a normal "make all", but various Makefile targets described below exist. |
| |
| Relatively few systems and CPUs have been tested, so be sure to verify that |
| results are sensible before relying on them. |
| |
| |
| |
| |
| MISCELLANEOUS NOTES |
| |
| --enable-assert |
| |
| Don't configure with --enable-assert, since the extra code added by |
| assertion checking may influence measurements. |
| |
| Direct mapped caches |
| |
| Some effort has been made to accommodate CPUs with direct mapped caches, |
| by putting data blocks more or less contiguously on the stack. But this |
| will depend on TMP_ALLOC using alloca, and even then it may or may not |
| be enough. |
| |
| FreeBSD 4.2 i486 getrusage |
| |
| This getrusage seems to be a bit doubtful, it looks like it's |
| microsecond accurate, but sometimes ru_utime remains unchanged after a |
| time of many microseconds has elapsed. It'd be good to detect this in |
| the time.c initializations, but for now the suggestion is to pretend it |
| doesn't exist. |
| |
| ./configure ac_cv_func_getrusage=no |
| |
| NetBSD 1.4.1 m68k macintosh time base |
| |
| On this system it's been found getrusage often goes backwards, making it |
| unusable (time.c getrusage_backwards_p detects this). gettimeofday |
| sometimes doesn't update atomically when it crosses a 1 second boundary. |
| Not sure what to do about this. Expect possible intermittent failures. |
| |
| SCO OpenUNIX 8 /etc/hw |
| |
| /etc/hw takes about a second to return the cpu frequency, which suggests |
| perhaps it's measuring each time it runs. If this is annoying when |
| running the speed program repeatedly then set a GMP_CPU_FREQUENCY |
| environment variable (see TIME BASE section below). |
| |
| Timing on GNU/Linux |
| |
| On Linux, timing currently uses the cycle counter. This is unreliable, |
| since the counter is not saved and restored at context switches (unlike |
| FreeBSD and Solaris where the cycle counter is "virtualized"). |
| |
| Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or |
| CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime |
| with glibc, one has to link with -lrt (which also drags in the pthreads |
| threading library). configure.in must be hacked to detect this and |
| arrange proper linking. Something like |
| |
| old_LIBS="$LIBS" |
| AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)]) |
| TUNE_LIBS="$LIBS" |
| LIBS="$old_LIBS" |
| |
| AC_SUBST(TUNE_LIBS) |
| |
| might work. |
| |
| Low resolution timebase |
| |
| Parameter tuning can be very time consuming if the only timebase |
| available is a 10 millisecond clock tick, to the point of being |
| unusable. This is currently the case on VAX and ARM systems. |
| |
| |
| |
| |
| PARAMETER TUNING |
| |
| The "tuneup" program runs some tests designed to find the best settings for |
| various thresholds, like MUL_TOOM22_THRESHOLD. Its output can be put |
| into gmp-mparam.h. The program is built and run with |
| |
| make tune |
| |
| If the thresholds indicated are grossly different from the values in the |
| selected gmp-mparam.h then there may be a performance boost in applicable |
| size ranges by changing gmp-mparam.h accordingly. |
| |
| Be sure to do a full reconfigure and rebuild to get any newly set thresholds |
| to take effect. A partial rebuild is enough sometimes, but a fresh |
| configure and make is certain to be correct. |
| |
| If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of |
| the mpn subdirectories then the values from "make tune" should be similar. |
| But check that the configured CPU is right and there are no machine specific |
| effects causing a difference. |
| |
| It's hoped the compiler and options used won't have too much effect on |
| thresholds, since for most CPUs they ultimately come down to comparisons |
| between assembler subroutines. Missing out on the longlong.h macros by not |
| using gcc will probably have an effect. |
| |
| Some thresholds produced by the tune program are merely single values chosen |
| from what's a range of sizes where two algorithms are pretty much the same |
| speed. When this happens the program is likely to give somewhat different |
| values on successive runs. This is noticeable on the toom3 thresholds for |
| instance. |
| |
| |
| |
| |
| SPEED PROGRAM |
| |
| The "speed" program can be used for measuring and comparing various |
| routines, and producing tables of data or gnuplot graphs. Compile it with |
| |
| make speed |
| |
| (Or on DOS systems "make speed.exe".) |
| |
| Here are some examples of how to use it. Check the code for all the |
| options. |
| |
| Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05 |
| (whichever is greater). |
| |
| ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n |
| gnuplot foo.gnuplot |
| |
| Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and |
| showing under mpn_lshift the difference between it and mpn_add_n. |
| |
| ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 |
| |
| Using option -c for times in cycles is interesting but normally only |
| necessary when looking carefully at assembler subroutines. You might think |
| it would always give an integer value, but this doesn't happen in practice, |
| probably due to overheads in the time measurements. |
| |
| In the free-form output the "#" symbol against a measurement means the |
| corresponding routine is fastest at that size. This is a convenient visual |
| cue when comparing different routines. The graph data files <name>.data |
| don't get this since it would upset gnuplot or other data viewers. |
| |
| |
| |
| |
| TIME BASE |
| |
| The time measuring method is determined in time.c, based on what the |
| configured host has available. A cycle counter is preferred, possibly |
| supplemented by another method if the counter has a limited range. A |
| microsecond accurate getrusage() or gettimeofday() will work quite well too. |
| |
| The cycle counters (except possibly on alpha) and gettimeofday() will depend |
| on the machine being otherwise idle, or rather on other jobs not stealing |
| CPU time from the measuring program. Short routines (those that complete |
| within a timeslice) should work even on a busy machine. |
| |
| Some trouble is taken by speed_measure() in common.c to avoid ill effects |
| from sporadic interrupts, or other intermittent things (like cron waking up |
| every minute). But generally an idle machine will be necessary to be |
| certain of consistent results. |
| |
| The CPU frequency is needed to convert between cycles and seconds, or for |
| when a cycle counter is supplemented by getrusage() etc. The speed program |
| will convert as necessary according to the output format requested. The |
| tune program will work with either cycles or seconds. |
| |
| freq.c knows how to get the frequency on some systems, or can measure a |
| cycle counter against gettimeofday() or getrusage(), but when that fails, or |
| needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be |
| used (in Hertz). For example in "bash" on a 650 MHz machine, |
| |
| export GMP_CPU_FREQUENCY=650e6 |
| |
| A high precision time base makes it possible to get accurate measurements in |
| a shorter time. |
| |
| |
| |
| |
| EXAMPLE COMPARISONS - VARIOUS |
| |
| Here are some ideas for things that can be done with the speed program. |
| |
| There's always going to be a certain amount of overhead in the time |
| measurements, due to reading the time base, and in the loop that runs a |
| routine enough times to get a reading of the desired precision. Noop |
| functions taking various arguments are available to measure this. The |
| "overhead" printed by the speed program each time in its intro is the "noop" |
| routine, but note that this is just for information, it isn't deducted from |
| the times printed or anything. |
| |
| ./speed -s 1 noop noop_wxs noop_wxys |
| |
| To see how many cycles per limb a routine is taking, look at the time |
| increase when the size increments, using option -D. This avoids fixed |
| overheads in the measuring. Also, remember many of the assembler routines |
| have unrolled loops, so it might be necessary to compare times at, say, 16, |
| 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any |
| finishing off. |
| |
| ./speed -s 16-64 -t 16 -C -D mpn_add_n |
| |
| The -C option on its own gives cycles per limb, but is really only useful at |
| big sizes where fixed overheads are small compared to the code doing the |
| real work. Remember of course memory caching and/or page swapping will |
| affect results at large sizes. |
| |
| ./speed -s 500000 -C mpn_add_n |
| |
| Once a calculation stops fitting in the CPU data cache, it's going to start |
| taking longer. Exactly where this happens depends on the cache priming in |
| the measuring routines, and on what sort of "least recently used" the |
| hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and |
| 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000 |
| limbs. |
| |
| ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n |
| gnuplot foo.gnuplot |
| |
| When a routine has an unrolled loop for, say, multiples of 8 limbs and then |
| an ordinary loop for the remainder, it can happen that it's actually faster |
| to do an operation on, say, 8 limbs than it is on 7 limbs. The following |
| draws a graph of mpn_sub_n, to see whether times smoothly increase with |
| size. |
| |
| ./speed -s 1-100 -c -P foo mpn_sub_n |
| gnuplot foo.gnuplot |
| |
| If mpn_lshift and mpn_rshift have special case code for shifts by 1, it |
| ought to be faster (or at least not slower) than shifting by, say, 2 bits. |
| |
| ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 |
| |
| An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and |
| if the lshift isn't faster there's an obvious improvement that's possible. |
| |
| ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self |
| |
| On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the |
| destination is one of the sources is faster than a separate destination. |
| Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and |
| mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. |
| |
| ./speed -s 1-200 -c mpn_add_n mpn_add_n.1 |
| |
| The gmp manual points out that divisions by powers of two should be done |
| using a right shift because it'll be significantly faster than an actual |
| division. The following shows by what factor mpn_rshift is faster than |
| mpn_divrem_1, using division by 32 as an example. |
| |
| ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 |
| |
| |
| |
| |
| EXAMPLE COMPARISONS - MULTIPLICATION |
| |
| mul_basecase takes a ".<r>" parameter. If positive, it gives the second |
| (smaller) operand size. For example to show speeds for 3x3 up to 20x3 in |
| cycles, |
| |
| ./speed -s 3-20 -c mpn_mul_basecase.3 |
| |
| A negative ".<-r>" parameter fixes the size of the product to the absolute |
| value r. For example to show speeds for 10x10 up to 19x1 in cycles, |
| |
| ./speed -s 10-19 -c mpn_mul_basecase.-20 |
| |
| mul_basecase with no parameter does an NxN multiply, so for example to show |
| speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles, |
| |
| ./speed -s 1-20 -c mpn_mul_basecase |
| |
| sqr_basecase is implemented by a "triangular" method on most CPUs, making it |
| up to twice as fast as mul_basecase. In practice loop overheads and the |
| products on the diagonal mean it falls short of this. Here's an example |
| running the two and showing by what factor an NxN mul_basecase is slower |
| than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes |
| below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.) |
| |
| ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase |
| |
| The technique described above with -CD for showing the time difference in |
| cycles per limb between two size operations can be done on an NxN |
| mul_basecase using -E to change the basis for the size increment to N*N. |
| For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16 |
| doing 256 limbs. The following therefore shows the per crossproduct speed |
| of mul_basecase and sqr_basecase at around 20x20 limbs. |
| |
| ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase |
| |
| Of course sqr_basecase isn't really doing NxN crossproducts, but it can be |
| interesting to compare it to mul_basecase as if it was. For sqr_basecase |
| the -F option can be used to base the deltas on N*(N+1)/2 operations, which |
| is the triangular products sqr_basecase does. For example, |
| |
| ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase |
| |
| Both -E and -F are preliminary and might change. A consistent approach to |
| using them when claiming certain per crossproduct or per triangularproduct |
| speeds hasn't really been established, but the increment between speeds in |
| the range karatsuba will call seems sensible, that being k to k/2. For |
| instance, if the karatsuba threshold was 20 for the multiply and 30 for the |
| square, |
| |
| ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase |
| ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase |
| |
| |
| |
| EXAMPLE COMPARISONS - MALLOC |
| |
| The gmp manual recommends application programs avoid excessive initializing |
| and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new |
| variable will at a minimum go through an init, a realloc for its first |
| store, and finally a clear. Quite how long that takes depends on the C |
| library. The following compares an mpz_init/realloc/clear to a 10 limb |
| mpz_add. Don't be surprised if the mallocing is quite slow. |
| |
| ./speed -s 10 -c mpz_init_realloc_clear mpz_add |
| |
| On some systems malloc and free are much slower when dynamic linked. The |
| speed-dynamic program can be used to see this. For example the following |
| measures malloc/free, first static then dynamic. |
| |
| ./speed -s 10 -c malloc_free |
| ./speed-dynamic -s 10 -c malloc_free |
| |
| Of course a real world program has big problems if it's doing so many |
| mallocs and frees that it gets slowed down by a dynamic linked malloc. |
| |
| |
| |
| |
| |
| EXAMPLE COMPARISONS - STRING CONVERSIONS |
| |
| mpn_get_str does a binary to string conversion. The base is specified with |
| a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster |
| than general bases. The following compares decimal and hex for instance. |
| |
| ./speed -s 1-20 -c mpn_get_str mpn_get_str.16 |
| |
| Smaller bases need more divisions to split a given size number, and so are |
| slower. The following compares base 3 and base 9. On small operands 9 will |
| be nearly twice as fast, though at bigger sizes this reduces since in the |
| current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit |
| limbs) and those divisions come to dominate. |
| |
| ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 |
| |
| mpn_set_str does a string to binary conversion. The base is specified with |
| a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than |
| general bases on large conversions. |
| |
| ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 |
| |
| mpn_set_str also has some special case code for decimal which is a bit |
| faster than the general case, basically by giving the compiler a chance to |
| optimize some multiplications by 10. |
| |
| ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 |
| |
| |
| |
| |
| EXAMPLE COMPARISONS - GCDs |
| |
| mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both |
| x and y are single limbs. This isn't tuned currently, but a value can be |
| established by a measurement like |
| |
| ./speed -s 10-32 mpn_gcd_1.10 |
| |
| This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div |
| threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd |
| is done by nibbling away at the 32-bit operands bit-by-bit. When the |
| threshold is small, say 1 bit, then an initial x%y is done to reduce it to a |
| 10x10 bit operation. |
| |
| The threshold in mpn/generic/gcd_1.c or the various assembler |
| implementations can be tweaked up or down until there's no more speedups on |
| interesting combinations of sizes. Note that this affects only a 1x1 limb |
| operation and so isn't very important. (An Nx1 limb operation always does |
| an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) |
| |
| |
| |
| |
| SPEED PROGRAM EXTENSIONS |
| |
| Potentially lots of things could be made available in the program, but it's |
| been left at only the things that have actually been wanted and are likely |
| to be reasonably useful in the future. |
| |
| Extensions should be fairly easy to make though. speed-ext.c is an example, |
| in a style that should suit one-off tests, or new code fragments under |
| development. |
| |
| many.pl is a script for generating a new speed program supplemented with |
| alternate versions of the standard routines. It can be used for measuring |
| experimental code, or for comparing different implementations that exist |
| within a CPU family. |
| |
| |
| |
| |
| THRESHOLD EXAMINING |
| |
| The speed program can be used to examine the speeds of different algorithms |
| to check the tune program has done the right thing. For example to examine |
| the karatsuba multiply threshold, |
| |
| ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n |
| |
| When examining the toom3 threshold, remember it depends on the karatsuba |
| threshold, so the right karatsuba threshold needs to be compiled into the |
| library first. The tune program uses specially recompiled versions of |
| mpn/mul_n.c etc for this reason, but the speed program simply uses the |
| normal libgmp.la. |
| |
| Note further that the various routines may recurse into themselves on sizes |
| far enough above applicable thresholds. For example, mpn_kara_mul_n will |
| recurse into itself on sizes greater than twice the compiled-in |
| MUL_TOOM22_THRESHOLD. |
| |
| When doing the above comparison between mul_basecase and kara_mul_n what's |
| probably of interest is mul_basecase versus a kara_mul_n that does one level |
| of Karatsuba then calls to mul_basecase, but this only happens on sizes less |
| than twice the compiled MUL_TOOM22_THRESHOLD. A larger value for that |
| setting can be compiled-in to avoid the problem if necessary. The same |
| applies to toom3 and DC, though in a trickier fashion. |
| |
| There are some upper limits on some of the thresholds, arising from arrays |
| dimensioned according to a threshold (mpn_mul_n), or asm code with certain |
| sized displacements (some x86 versions of sqr_basecase). So putting huge |
| values for the thresholds, even just for testing, may fail. |
| |
| |
| |
| |
| FUTURE |
| |
| Make a program to check the time base is working properly, for small and |
| large measurements. Make it able to test each available method, including |
| perhaps the apparent resolution of each. |
| |
| Make a general mechanism for specifying operand overlap, and a syntax like |
| maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this |
| sort of thing with the "r" parameter currently. |
| |
| |
| |
| ---------------- |
| Local variables: |
| mode: text |
| fill-column: 76 |
| End: |