by Steve Ramey
The use of the -r8 flag or the REAL*8 type does not affect machine performance; it simply makes use of the available number representation. The REAL*16 type, however, does decrease performance, just as the DOUBLE PRECISION type does on other machines. Confusion can arise regarding the -32, -n32, and -64 flags, which refer only to the memory address size. The default is -n32, which should be sufficient for most applications.
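As a sketch (assuming the IRIX MIPSpro f77 driver and a source file named program.f), a compile line selecting 64-bit reals with the default address mode might look like this:

```shell
# -n32 is the default address-size mode; -r8 promotes REAL to REAL*8.
f77 -n32 -r8 -o program program.f
```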
An important concern in numerical computation is a parameter called the machine epsilon. By definition, this is the largest number epsilon such that the computer calculates 1 + epsilon = 1. It arises from the way computers add numbers of different orders of magnitude, and it can lead to a loss of information in a calculation. The increased precision of the Origin 2000 means that its machine epsilon is smaller than that of the Cray Y-MP. For reference, the machine epsilons for the different FORTRAN types on the Origin 2000 are:
REAL*4:   epsilon = 5.9604645E-08
REAL*8:   epsilon = 1.1102230246251565E-16
REAL*16:  epsilon = 6.1629758220391547297791294162718E-33
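These values can be checked empirically. As a quick sketch using awk (whose arithmetic is 64-bit double, matching REAL*8), repeated halving finds the smallest value that still perturbs 1.0; note that this is exactly twice the REAL*8 entry in the table, since the table lists the largest epsilon that still vanishes in 1 + epsilon:

```shell
# Find the smallest eps (by repeated halving) for which 1 + eps is
# still distinguishable from 1 in 64-bit floating point.  awk computes
# in doubles, so this mirrors the REAL*8 case.
awk 'BEGIN {
    eps = 1.0
    while (1.0 + eps / 2.0 != 1.0)
        eps /= 2.0
    printf "%.16e\n", eps   # prints 2.2204460492503131e-16
}'
```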
To take full advantage of the Origin 2000's parallel processing capability, a program must instruct the compiler where to run the code concurrently. These instructions come in the form of compiler directives, which can be added by hand or by a utility called the Power FORTRAN Accelerator (pfa). This utility can run as a stand-alone source code preprocessor or, more conveniently, it can be invoked as an option on the compilation command line (-pfa). The stand-alone version resides in the /usr/lib/ directory and is useful for determining the nature of the parallel directives added. It produces a program.l file, which describes the optimizations implemented, and a program.m file, which gives the altered source code. The command line option writes these files only if explicitly told to do so.

There are many fine-tuning options for pfa, but the most important concerns the level-one cache size. It defaults to 64 KB, but our machine has a 32 KB cache, so the -chs=32 option should be given. To pass an option to the Power FORTRAN Accelerator from the compilation command line, the -WK option must be included as follows:
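The original command line did not survive in this copy; assuming the IRIX f77 driver, it would be along these lines:

```shell
# Invoke pfa from the compile line and hand it the cache-size option
# through -WK (options after -WK are comma-separated, with no spaces):
f77 -pfa -WK,-chs=32 -o program program.f
```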
The Power FORTRAN Accelerator is not perfect, and if performance is a concern, the altered source code should serve only as a guideline for manually added optimizations. (An important note on pfa: though this version of FORTRAN 77 supports pointers, they produce unpredictable results and floating point exceptions when concurrentized, so their use is discouraged. Studies are underway here and at SGI to remedy this problem.)
The compiler has many performance-enhancing options as well, many of which concern memory management techniques. For small jobs the memory management defaults are satisfactory, but for larger jobs some custom tuning will probably be necessary. All jobs benefit from optimization options such as -Ofast. This is an option group that includes SGI's choices of the best optimization routines and is a good place to start. With a little experience, some of the other options can be explored that may enhance performance even more. A typical compilation line now reads:
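A sketch of what such a line would look like at this point (again assuming the f77 driver; the exact line was lost from this copy):

```shell
# -Ofast turns on SGI's recommended optimization group; -pfa and
# -WK,-chs=32 carry over the parallelization settings from above.
f77 -n32 -Ofast -pfa -WK,-chs=32 -o program program.f
```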
Another issue, especially for users coming from the Cray Y-MP environment, is the handling of floating point exceptions. The Origin 2000 is IEEE compliant in this area, meaning that by default it ignores floating point exceptions, so your end results could be meaningless. To make the machine trap floating point exceptions and dump core, a floating point exception library must be linked in and the environment must be set to handle the traps as follows:
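A sketch of the lost example, assuming the IRIX libfpe conventions (the -lfpe library and the TRAP_FPE environment variable; the particular trap settings shown are one plausible choice, not the author's):

```shell
# Link the floating point exception library ...
f77 -o program program.f -lfpe
# ... and tell it which exceptions to trap (csh syntax):
setenv TRAP_FPE "OVERFL=ABORT,TRACE; DIVZERO=ABORT,TRACE; INVALID=ABORT,TRACE"
```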
(There are more specific exception handling features as well; consult the man pages.) The command line, including the floating point exception library, now reads:
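Putting the pieces together, the full line would look something like this (f77 driver assumed; the original example did not survive):

```shell
# Optimization, parallelization, and exception trapping combined:
f77 -n32 -Ofast -pfa -WK,-chs=32 -o program program.f -lfpe
```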
As this command line tends to get longer and longer, I suggest putting it (as an alias) in your .cshrc.local file, along with the environment settings for your choice of exception handling routines.
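For example, the following could go in ~/.cshrc.local (csh syntax; the alias name f77opt and the trap settings are illustrative, not prescribed by the original):

```shell
# Keep the long compile line behind an alias, next to the
# exception-handling environment setting:
alias f77opt 'f77 -n32 -Ofast -pfa -WK,-chs=32 -lfpe'
setenv TRAP_FPE "OVERFL=ABORT,TRACE; DIVZERO=ABORT,TRACE; INVALID=ABORT,TRACE"
```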
The results you see for MFLOPS are an average per processor; to get a total performance estimate, take the total number of graduated floating point operations and divide by the run time. The run time can be estimated with the timex utility, as in "timex a.out", which gives real (wall clock) run time and user time. The user time is the sum of the CPU time from each processor, so to get an average CPU time, divide this number by the number of processors used (the default is 8). Better yet, call DTIME from within the program. A large discrepancy between the wall clock time and the DTIME result usually means there are multiple users on the machine. To check this, you can use the `top' utility to examine individual processor usage by different users while your program is running.
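As a worked example of that division, with made-up numbers (4.8 billion graduated floating point operations over 60 seconds of wall clock time):

```shell
# Hypothetical counts for illustration only:
awk 'BEGIN {
    ops  = 4.8e9   # graduated floating point operations
    secs = 60      # wall clock run time in seconds
    printf "%.0f MFLOPS\n", ops / secs / 1e6   # prints 80 MFLOPS
}'
```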
For more detailed information, see the man pages for the topics mentioned, or refer to the online help library "insight" (/usr/sbin/insight), available on SGI machines. For more advanced performance tuning, the SpeedShop tools can help in locating bottlenecks and other performance problems. More in-depth information on these and other topics is available on the NSCEE web pages and the SGI web pages.