Programming on the Origin2000

For the most part, programming on the Origin2000 is no different from programming on traditional shared memory multiprocessors such as the Power Challenge 10000. This is, of course, largely due to the hardware, which makes the Origin's physically distributed memory function as shared memory with fairly uniform access times. However, it is also due to the operating system, which contains some significant new capabilities designed to keep the system running as efficiently as possible. While most of these added capabilities function transparently to the user, there are some new tools available for fine-tuning performance, and some new terminology to go along with them.

In this section, our goal is to describe some of the capabilities of IRIX 6.4 (also known as Cellular IRIX), explain what the operating system is trying to accomplish with them, and familiarize you with the terminology.

Achieving Good Performance

Even though the Origin is a highly parallel system, the vast majority of programs run on it will use only a single processor. Nothing new or different needs to be done to achieve good performance on such applications: tune them just as you would any single-processor program.

We can break tuning into five steps:

Step 1: Get the right answers: While this may seem obvious, it is quite easy to forget to check the answers along the way as one is making performance improvements. A lot of frustration can be avoided by making only one change at a time and verifying that the program is still correct after each change.
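One way to make the check in Step 1 routine is to keep a set of reference results from the untuned program and compare against them after every change. The following is a minimal sketch, not part of the original text: the function name results_match and the two tolerance parameters are hypothetical, and suitable tolerances depend on the application. A tolerance is used rather than exact equality because optimization can legitimately reorder floating-point operations.

```c
#include <stddef.h>

/* Absolute value without needing the math library. */
static double abs_val(double x)
{
    return x < 0.0 ? -x : x;
}

/* Hypothetical helper: returns 1 if every element of `result` agrees
   with `reference` to within rel_tol (relative to the reference value),
   using abs_tol as a floor for values near zero. Returns 0 otherwise. */
int results_match(const double *reference, const double *result,
                  size_t n, double rel_tol, double abs_tol)
{
    size_t i;

    for (i = 0; i < n; i++) {
        double diff = abs_val(result[i] - reference[i]);
        double tol  = rel_tol * abs_val(reference[i]);

        if (tol < abs_tol)
            tol = abs_tol;

        /* Written as !(diff <= tol) so that a NaN in either array
           is reported as a mismatch. */
        if (!(diff <= tol))
            return 0;
    }
    return 1;
}
```

Running a check like this after each individual change makes it clear which change, if any, altered the answers.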

Step 2: Use existing tuned code: The quickest and easiest way to improve the performance of a program is to link it with libraries already tuned for the target hardware. In addition to the standard libraries, such as libc, whose hardware-specific versions are automatically linked in with programs compiled and run on the Origin2000 system, there are other libraries which may provide substantial performance benefits. They include: complib, complib sgimath and sgimath.

Step 3: Find out where to tune: When confronted with a program composed of hundreds of modules and thousands of lines of code, it would require a heroic and very inefficient effort to tune the entire program. Tuning needs to be concentrated on those few sections of the code where the work will pay off with the biggest gains in performance. These sections of code are identified with the help of a profiler. The hardware counters in the R10000 CPU make it possible to profile the behavior of a program in many ways without modifying the code. Profiling tools include:

Step 4: Let the compiler do the work: Once the amount of code to be tuned has been narrowed down, it's time to actually do the work of tuning. The most important tool available for accomplishing this is the compiler. Ideally, the compiler should be able to automatically make your program run its fastest. But this simply isn't possible because the compiler does not have enough information to make all optimal decisions.

Given that the various optimizations the compiler will perform can interact with each other, it is impossible to provide a simple formula specifying which optimizations should be attempted in which order to achieve the best results in all cases. Nevertheless, we can make some recommendations which work well in general. A good set of compiler flags to use is the following:

-n32 -mips4 -Ofast=ip27 -OPT:IEEE_arithmetic=3

In addition, when linking in math routines, be sure to use the fast math library:

-lfastm -lm

Step 5: Modify the code for better cache utilization: For a cache-based system, such as the Origin2000, the optimizations which have the greatest potential for significant performance gains are those which improve the program's utilization of the cache hierarchy. In the MIPSpro 7.x compilers, this class of optimizations is known as loop nest optimizations, and they are performed by the loop nest optimizer, or LNO.

It should be pointed out that if you are using the flags recommended above, you are already using the LNO since it is enabled whenever the highest level of optimization, -O3 or -Ofast, is used. For the majority of programs and users, this is a great benefit since the LNO is capable of automatically solving many cache use problems.
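To make the idea of cache utilization concrete, here is a small sketch of loop interchange, one transformation of the kind discussed above. It is an illustration only, not output from the LNO; the function names are hypothetical, and the array is assumed to be stored row by row as in C. Both functions compute the same sum, but the first walks memory with a stride of cols elements while the second touches consecutive elements, so each cache line it loads is fully used.

```c
#include <stddef.h>

/* Inner loop runs down a column: successive accesses are `cols`
   doubles apart, so most of each fetched cache line goes unused
   before it is evicted (for large arrays). */
double sum_strided(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    size_t i, j;

    for (j = 0; j < cols; j++)
        for (i = 0; i < rows; i++)
            total += a[i * cols + j];
    return total;
}

/* Same computation with the two loops interchanged: the inner loop
   now runs along a row, so accesses are to consecutive addresses. */
double sum_unit_stride(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    size_t i, j;

    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            total += a[i * cols + j];
    return total;
}
```

In Fortran, where arrays are stored column by column, the cache-friendly loop order is the reverse. Note also that the two versions add the elements in a different order, so for general floating-point data the results can differ in the last few bits, which is one reason to re-verify answers after such a change.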

Ranges and Precision

Unlike the Cray Y-MP, the Origin2000 adheres to the IEEE 754 floating-point standard for its internal number representation. This means that the machine can manipulate a smaller range of numbers, but at increased precision.

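The IEEE single-precision float representation uses a base of 2. There is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total precision is 24 base-2 digits), and an 8-bit exponent that can represent values in the range -125 to 128, inclusive. (These exponent limits follow the convention of C's float.h, which treats the mantissa as a fraction between 0.5 and 1; they correspond to FLT_MIN_EXP and FLT_MAX_EXP.)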

So, for an implementation that uses this representation for the float data type, appropriate values for the corresponding parameters are:

FLT_MIN        1.17549435E-38F
FLT_MAX        3.40282347E+38F
FLT_EPSILON    1.19209290E-07F

Here are the values for the double data type:

DBL_MAX        1.7976931348623157E+308
DBL_MIN        2.2250738585072014E-308
DBL_EPSILON    2.2204460492503131E-016
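The parameters listed above come from the standard C header float.h, so they can be checked on any given machine rather than taken on faith. The sketch below is an illustration under that assumption; the function names are hypothetical. The first function prints the limits, and the second measures the float epsilon directly by halving a value until adding half of it to 1 no longer changes the result. With a 24-bit mantissa this should stop at 2 to the power -23, which is the FLT_EPSILON value listed above.

```c
#include <float.h>
#include <stdio.h>

/* Print the float and double limits defined by <float.h>. */
void print_float_limits(void)
{
    printf("FLT_MIN     = %.8E\n", FLT_MIN);
    printf("FLT_MAX     = %.8E\n", FLT_MAX);
    printf("FLT_EPSILON = %.8E\n", FLT_EPSILON);
    printf("DBL_MIN     = %.16E\n", DBL_MIN);
    printf("DBL_MAX     = %.16E\n", DBL_MAX);
    printf("DBL_EPSILON = %.16E\n", DBL_EPSILON);
}

/* Measure the gap between 1.0f and the next larger float.
   The volatile store forces the sum to be rounded to float even on
   hardware that evaluates expressions in a wider format. */
float measured_float_epsilon(void)
{
    volatile float one_plus;
    float eps = 1.0f;

    for (;;) {
        one_plus = 1.0f + eps / 2.0f;
        if (one_plus == 1.0f)
            break;          /* half of eps is too small to register */
        eps /= 2.0f;
    }
    return eps;
}
```

Calling print_float_limits() from a small main program and comparing measured_float_epsilon() with FLT_EPSILON gives a quick confirmation that the compiler and hardware in use follow the IEEE single-precision layout described here.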

For more detailed information, see the man pages for the topics mentioned or refer to the online help library "insight," available on SGI machines. For more advanced performance tuning, the SpeedShop profiling tools can help in locating bottlenecks and other performance problems. More in-depth information on these and other topics is available on the NSCEE web pages and the SGI web pages.
