by Steve Ramey


Unlike the Cray Y-MP, the Origin 2000 adheres to the IEEE standard for its internal number representation. This means that the machine can manipulate a smaller range of numbers, but at increased precision. By default, the F77 compiler stores a REAL in the IEEE 32-bit single-precision format, so a REAL can range in magnitude from about 3.402E38 down to about 1.175E-38. The DOUBLE PRECISION type defaults to the IEEE 64-bit format, which has both a wider range and more bits in the fraction.

To make REAL variables take advantage of the machine's 64-bit architecture, the compiler flag -r8 must be added to the F77 command line. This flag makes REALs default to REAL*8, which allows a REAL to range from 1.797E308 down to 2.225E-308. A DOUBLE PRECISION number then behaves identically to a REAL in both range and number of significant digits. To increase the precision of a DOUBLE PRECISION declaration, the -d16 flag must be added. The DOUBLE PRECISION type then has the same range as REAL*8 but greater precision.

(NOTE: both the -r8 and -d16 flags must be used to get this range and precision from the DOUBLE PRECISION type. To simplify the compilation command line, I don't use the DOUBLE PRECISION type at all. Instead, I declare REAL*16 where I would otherwise use DOUBLE PRECISION and REAL*8 for ordinary real numbers, then compile with the -r8 flag.)
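As a cross-check on the ranges quoted above, the limits can be rebuilt from the IEEE field widths alone. The following is a minimal sketch in Python rather than FORTRAN, chosen only so that it can be run on any IEEE machine. The function name ieee_limits and its arguments are illustrative and do not come from any SGI tool.

```python
# Sketch: rebuild the REAL*4 and REAL*8 limits from the IEEE 754 field widths.
# ieee_limits is a hypothetical helper written for this illustration.
import sys


def ieee_limits(fraction_bits, max_exponent, min_exponent):
    """Return (largest finite value, smallest positive normalized value)."""
    # The largest value has an all-ones fraction and the largest exponent.
    largest = (2.0 - 2.0 ** -fraction_bits) * 2.0 ** max_exponent
    # The smallest normalized value is 1.0 scaled by the smallest exponent.
    smallest_normal = 2.0 ** min_exponent
    return largest, smallest_normal


# IEEE single precision (REAL*4): 23 fraction bits, exponents -126..127.
single_max, single_min = ieee_limits(23, 127, -126)
# IEEE double precision (REAL*8): 52 fraction bits, exponents -1022..1023.
double_max, double_min = ieee_limits(52, 1023, -1022)

if __name__ == "__main__":
    print("REAL*4: %.7E down to %.7E" % (single_max, single_min))
    print("REAL*8: %.7E down to %.7E" % (double_max, double_min))
```

The figures in the text are these values cut to four significant digits. The sketch says nothing about REAL*16, whose layout is specific to the SGI compiler.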

The use of the -r8 flag or the REAL*8 type does not affect machine performance; it only makes use of the available number representation. However, the use of REAL*16 types does decrease performance, just like any other machine's use of the DOUBLE PRECISION type. These flags are sometimes confused with the -32, -n32, and -64 flags, which refer only to the memory address size. The default is -n32, which should be sufficient for most applications.

An important concern in numerical computation is a parameter called the machine epsilon. This number is, by definition, the largest number for which the computer calculates 1+epsilon=1. It arises from the way computers add numbers of different orders of magnitude, which can result in the loss of information in a calculation. The increased precision of the Origin 2000 means that its machine epsilon is smaller than that of the Cray Y-MP. For reference, the following are the machine epsilons for the different FORTRAN types on the Origin 2000:

REAL*4: epsilon = 5.9604645E-08
REAL*8: epsilon = 1.1102230246251565E-16
REAL*16: epsilon = 6.1629758220391547297791294162718E-33
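
The REAL*4 and REAL*8 values above can be reproduced with a simple halving loop. The following is a minimal sketch in Python rather than FORTRAN, assuming IEEE round-to-nearest arithmetic. The helper names to_single and machine_epsilon are illustrative only, and REAL*4 is modelled by rounding each sum to 32 bits with the standard struct module.

```python
# Sketch: find the largest power of two eps for which 1 + eps is stored as 1.
# to_single and machine_epsilon are hypothetical helpers for this example.
import struct


def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def machine_epsilon(round_to=float):
    """Halve eps until 1 + eps rounds to exactly 1, then return that eps."""
    eps = 1.0
    while round_to(1.0 + eps) != 1.0:
        eps = eps / 2.0
    return eps


eps_single = machine_epsilon(to_single)  # models REAL*4
eps_double = machine_epsilon()           # models REAL*8

if __name__ == "__main__":
    print("REAL*4: epsilon = %.7E" % eps_single)
    print("REAL*8: epsilon = %.16E" % eps_double)
```

The loop stops at 2**-24 for single precision and 2**-53 for double precision, which match the first two entries in the list. REAL*16 is not modelled here, since Python has no built-in 128-bit floating point type.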


To take full advantage of the Origin 2000's parallel processing capability, a program must tell the compiler which parts of the code to run concurrently. These instructions take the form of compiler directives, which can be added by hand or by a utility called the Power FORTRAN Accelerator (pfa). This utility can run as a stand-alone source code preprocessor or, more conveniently, it can be invoked as an option on the compilation command line (-pfa). The stand-alone version resides in the /usr/lib/ directory and is useful for determining the nature of the parallel directives added. It produces program.l and program.m files, which describe the optimizations implemented and give the altered source code, respectively. The command line option writes these files only if explicitly told to do so. There are many fine-tuning options for pfa, but the most important concerns the level one cache size. It defaults to 64 KB, but our machine has a 32 KB cache, so the -chs=32 option should be given. To give an option to the Power FORTRAN Accelerator from the compilation command line, the -WK option must be included as follows:

The Power FORTRAN Accelerator is not perfect, and if performance is a concern, the altered source code should serve only as a guideline for manually added optimizations. (An important note on pfa: though this version of FORTRAN 77 supports pointers, they produce unpredictable results and floating point exceptions when concurrentized, so their use is discouraged. Studies are underway here and at SGI to remedy this problem.)

The compiler has many performance-enhancing options as well, many of which concern memory management techniques. For small jobs, the memory management defaults are satisfactory, but larger jobs will probably need some custom tuning. All jobs benefit from optimization options such as -Ofast. This is an option group that includes SGI's choice of the best optimization routines and is a good place to start. With a little experience, you can explore other options that may enhance performance even more. A typical compilation line now reads:

Another issue, especially for users coming from the Cray Y-MP environment, is the handling of floating point exceptions. The Origin 2000 is IEEE compliant in this area, meaning that by default it ignores floating point exceptions, so your end results could be meaningless. To make the machine trap floating point exceptions and dump core, a floating point exception library must be linked and the environment must be set to handle the traps as follows:

(There are more specific exception handling features as well; consult the man pages.) The command line, including the floating point exception library, now reads:

As this command line tends to get longer and longer, I suggest putting it in your .cshrc.local file, along with setting the environment to your choice of exception handling routines.
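
To see why untrapped exceptions can leave you with meaningless results, consider how IEEE arithmetic carries on after an overflow. The following is a minimal sketch in Python rather than FORTRAN. The values and the mean helper are made up for illustration, and the behaviour shown is the IEEE default of returning Infinity or NaN instead of stopping.

```python
# Sketch: an untrapped overflow quietly poisons a later result.
# The numbers and the mean() helper are hypothetical.
import math


def mean(values):
    """Plain arithmetic mean of a list of floats."""
    return sum(values) / len(values)


big = 1.0e308
overflowed = big * 10.0            # overflow: becomes +Infinity, no error
invalid = overflowed - overflowed  # invalid operation: becomes NaN, no error
result = mean([1.0, 2.0, invalid, 4.0])

if __name__ == "__main__":
    print("overflowed =", overflowed)
    print("invalid    =", invalid)
    print("mean       =", result)
```

Nothing in this run stops the program, yet the final mean is NaN. With trapping enabled, the run would instead halt at the multiplication where the overflow first occurs.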


Finally, to examine the performance of your code's execution, a utility called `perfex' reports the hardware counter results for a variety of CPU statistics. To get started, add the -a flag to examine all of the counter functions and the -y flag to give performance statistics such as MFLOPS. A typical use of perfex would read as follows:

The MFLOPS figure you see is an average for each processor. To get a total performance estimate, look at the total number of graduated floating point operations and divide by the run time. The run time can be estimated with the timex utility, as in "timex a.out", which gives the real (wall clock) run time and the user time. The user time is the sum of the CPU time from each processor, so to get an average CPU time, divide this number by the number of processors used (the default is 8). Better yet, call the DTIME function from within the program. A large discrepancy between the wall clock time and the DTIME result usually means that multiple users are on the machine. To check this, you can use the `top' utility to examine individual processor usage by different users while your program is running.
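
The arithmetic described above can be sketched as follows. This is Python rather than FORTRAN, and the counter value and timings are made-up numbers, not measurements from our machine. The function names are illustrative, and the sketch assumes the run time used for the total is the wall clock time reported by timex.

```python
# Sketch: turn perfex and timex figures into a total MFLOPS estimate.
# All input numbers below are hypothetical.


def total_mflops(graduated_flops, wall_seconds):
    """Total rate: graduated floating point operations per second, in millions."""
    return graduated_flops / wall_seconds / 1.0e6


def average_cpu_seconds(user_seconds, processors=8):
    """Average CPU time per processor, given the summed user time."""
    return user_seconds / processors


flops = 4.8e10  # graduated floating point operations (perfex counter)
wall = 60.0     # real (wall clock) seconds reported by timex
user = 440.0    # user seconds reported by timex, summed over processors

if __name__ == "__main__":
    print("total MFLOPS:", total_mflops(flops, wall))
    print("average CPU seconds per processor:", average_cpu_seconds(user))
```

With these inputs the total comes to 800 MFLOPS and the average CPU time to 55 seconds per processor, which is close to the 60 second wall clock time and so suggests the machine was not heavily shared during the run.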

For more detailed information, see the man pages for the topics mentioned or refer to the online help library "insight" (/usr/sbin/insight), available on SGI machines. For more advanced performance tuning, the speedshop performance tools can help in locating bottlenecks and other performance problems. More in-depth information on these and other topics is available on the NSCEE web pages and the SGI web pages.
