You have probably noticed a new line of output when logging in to clark recently:
Active label set to : level0,none
This line is an indication that MLS, the UNICOS Multilevel Security feature, has been enabled.
"Trusted" UNICOS is a specifically-defined configuration of the UNICOS 8.0 MLS system, which was evaluated by the National Computer Security Center (NCSC) to meet the B1 MDIA criteria of the Trusted Network Interpretation (TNI) of the Trusted Computer System Evaluation Criteria (TCSEC). This evaluation was completed in March of this year.
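MDIA is an acronym defined in the TNI. It is formed by combining the first letter of the following four security policies that network components can support in order to obtain an evaluated rating: Mandatory access control (M), Discretionary access control (D), Identification and authentication (I), and Audit (A).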
A security "label" is the combination of level and compartment (MAC) in force at the time a security action is taken. Upon login, a user's current label is displayed, which explains the new line of output that was mentioned at the beginning of this article. For now, all our users have been assigned level 0 security and no compartments. Because of this, our current use of MLS should be largely invisible to our users.
For a more complete description of the features of Multilevel Security, please see the UNICOS Multilevel Security (MLS) Feature User's Guide, Cray Document SG-2111 8.0, available at the NSCEE.
Recently a set of codes written by scientists at REECO has been reworked with parallel constructs. The codes, developed by David Cawlfield, Thomas Lindstrom, and George Miel for the ground water characterization program at the Nevada Test Site, showed a tremendous speedup when run using PVM.
PVM (Parallel Virtual Machine) from ORNL (Oak Ridge National Laboratory) is a set of routines that allow a network of heterogeneous computers running UNIX to function as a large parallel computer. PVM is mainly a message passing scheme using sockets for data communication between processes; it does not directly address problem decomposition, in other words, how to transform code that runs in a sequential environment into code that runs in a parallel environment.
REECO's problem was remarkably well tailored to a parallel implementation. They needed to solve a rather complex integral equation. Through a clever observation they found that they could solve the integral equation by solving the matrix equation
A X = B
where B is a trivial vector. Matrix A is nice in that it is a symmetric positive definite matrix. The problem is that computing the coefficients for A is very costly. In fact, the code that computes the A matrix takes about three days on a Sun SparcStation 2. On a Convex C220 it takes five hours and 40 minutes.
Their method for computing the coefficients of A takes as input the row and column of the matrix element and a constant called gamma. Given this information, it then computes the coefficient for that matrix element completely independently of any other element. An added bonus is that, since A is symmetric, only the upper triangle of the matrix needs to be computed.
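Because each coefficient depends only on its row, its column, and gamma, the work can be handed out as independent tasks. The following is a minimal Python sketch of that idea, not REECO's code: `coefficient` is a hypothetical stand-in for the costly routine, a thread pool stands in for the PVM worker processes, and handing out the elements farthest from the diagonal first is simply one common scheduling heuristic assumed here.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def coefficient(i, j, gamma):
    """Hypothetical stand-in for the costly coefficient routine."""
    return math.exp(-gamma * abs(i - j)) + (1.0 if i == j else 0.0)


def upper_triangle_tasks(n):
    """Index pairs (i, j) with j >= i, farthest from the diagonal first."""
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    return sorted(pairs, key=lambda p: p[1] - p[0], reverse=True)


def build_matrix(n, gamma, workers=4):
    """Compute the upper triangle in parallel and mirror it by symmetry."""
    a = [[0.0] * n for _ in range(n)]
    tasks = upper_triangle_tasks(n)

    def run(pair):
        return coefficient(pair[0], pair[1], gamma)

    # The pool stands in for PVM workers; each task is one matrix element.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for (i, j), value in zip(tasks, pool.map(run, tasks)):
            a[i][j] = value
            a[j][i] = value  # A is symmetric, so the lower half is a copy
    return a


if __name__ == "__main__":
    a = build_matrix(50, gamma=0.5)
    print(len(upper_triangle_tasks(50)), "elements computed instead of", 50 * 50)
```

For a 50x50 matrix this computes 1275 elements rather than 2500, which is where the saving from symmetry comes from.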
Through experimentation, it was found that it takes more time to compute elements further away from the main diagonal than it takes to compute elements closer to the main diagonal. Thus decomposition of the REECO problem was quite straightforward--assign a matrix element to a processor and collect the answer when the processor is done.
It is simple to solve A X = B (which solves the integral equation), once matrix A is known. In fact, the integral equation was solved in this manner on a personal computer (486 DX/66).
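To illustrate why the final solve is cheap once A is known, here is a small Python sketch that solves A X = B for a symmetric positive definite A using a Cholesky factorization. This is an assumption made for illustration: the article does not say which solvers REECO tested, and the 2x2 system below is made up.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T = a (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_spd(a, b):
    """Solve a x = b by forward and back substitution on the factor."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


if __name__ == "__main__":
    # Made-up example system, not REECO data.
    print(solve_spd([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0]))
```

For a 50x50 matrix the factorization needs only on the order of n^3/3 operations, which is small work even for a personal computer.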
It took about three weeks to convert the sequential code to a parallel implementation using PVM (this includes testing two different solvers). The parallel version of the code now solves the same problems yielding the same results in approximately 50 minutes! Going from 72 hours on a SparcStation, to 5 hours on a "mini" supercomputer, down to less than an hour is an improvement worth the investment.
The initial runs of the PVM solver generated ten 50x50 matrices (one for each of 10 different gammas) in 8 hours and 45 minutes. It would have taken a SparcStation 2 one month of continuous operation to compute the same results. Taking into consideration the additional three weeks of work needed before the runs could take place, the PVM code certainly improved the turnaround time on this project. However, it would have taken the Convex just two days to complete the same amount of work. Even this time could be costly when REECO decides to increase the number of matrices or increase the size of the matrices (to 100x100).
REECO is now analyzing the data from the first run of the program. Whether they need more or larger matrices is unknown. With this new scheme for computing a 50x50 upper triangular matrix, they can now produce more matrices faster than ever.
Table 1 (above) shows the execution time in hours to compute the coefficients for a 50x50 upper triangular matrix with the same sequential code running on a SUN SparcStation 2 and a Convex C220, and with that sequential code converted to PVM and run on a virtual machine consisting of a variety of machines at the NSCEE.
Table 2 (above) shows the amount of speedup achieved. Speedup is the execution time of the fastest sequential code divided by the execution time of the parallel implementation. Table 2 shows a variety of speedup ratios. For example, the C220 / PVM entry shows that the PVM version of the code is 5.71 times faster than the sequential code running on the C220.
* The machines in the PVM tests ran two PVM processes each. They are: 1 Cray Y/MP 2/216, 1 Convex C220, 1 SGI Crimson, 1 SGI Indigo, 1 SUN 6/690, 4 SUN Sparc 1+ stations, and 1 SUN IPC.
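The speedup definition is easy to turn into arithmetic. The short Python sketch below uses only figures quoted in the prose of this article (five hours and 40 minutes on the C220, about three days on the SparcStation 2, and the 5.71 ratio); the function names are made up for illustration, and the derived values are estimates, not entries from the tables.

```python
def speedup(sequential_time, parallel_time):
    """Sequential execution time divided by parallel execution time."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return sequential_time / parallel_time


def implied_parallel_time(sequential_time, speedup_ratio):
    """Parallel time implied by a known sequential time and speedup."""
    return sequential_time / speedup_ratio


c220_minutes = 5 * 60 + 40   # "five hours and 40 minutes"
sparc_minutes = 72 * 60      # "about three days"

# PVM run time implied by the quoted C220 / PVM ratio of 5.71
pvm_minutes = implied_parallel_time(c220_minutes, 5.71)

if __name__ == "__main__":
    print(f"implied PVM time: {pvm_minutes:.1f} minutes")
    print(f"SparcStation 2 / PVM: {speedup(sparc_minutes, pvm_minutes):.1f}")
```

With these inputs the implied PVM time comes out at roughly 59.5 minutes, in line with the "less than an hour" quoted earlier.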
Ringo Ling joined NSCEE on September 1, 1994 as a Research Associate in the Research Division. His responsibilities include participation in current research activities at the NSCEE and development of new research programs in computational science and high performance computation. He is currently working on intelligent systems for monitoring and control of supercomputer clusters and heuristic classification of remote sensing data in collaboration with other NSCEE staff and scientists.
Ringo has been involved in artificial intelligence research in high performance computation for over six years. He has published papers and chaired a workshop session in AI in modeling and engineering design. In his Ph.D. dissertation, he developed a computational approach to automate the process of modeling heat transfer problems involving partial differential equations.
His research interests include building computing tools and integrated environments for scientific computing and engineering design, automated reasoning about physical systems and artificial intelligence in science and engineering.
Ringo received a B.E.Sc. degree in mechanical engineering and an M.Sc. degree in computer science. He completed his Ph.D. in Computer Science at Rutgers University. He has extensive experience in engineering and computer science. He worked as an air-conditioning and control engineer for over six years in design, installation, and service of central air conditioning systems. He also worked at Shell Canada and Alberta Research Council, Canada, where he conducted AI research and developed expert systems. Ringo can be reached at email@example.com
The Center's modem bank has undergone some changes recently. As some of our users have already noticed, the 895-4155 rotary has been reserved for SLIP/PPP use. SLIP and PPP are protocols for passing internet protocol over serial lines. This service will soon be available to special NSCEE users and will be charged for on a cost recovery basis. The 895-4154 rotary is still available to NSCEE users. The current modem configuration is:
|895-4154||-||9600 baud modems|
|895-4155||-||14400 baud modems|
Benchmarking a machine can be a very tricky operation. Benchmark programs come in different shapes, different names, different languages, and are very often known to be only meaningful for certain classes of machines! To complicate things, as new hardware platforms appear, new benchmarks that are better suited emerge, leaving old benchmarks in the dust.
In an attempt to capture the main benchmarking characteristics, we would like to discuss the standard concepts--Kernels, Mips, Mflops, Peak Theoretical Performance, and Sustained Performance--and conclude with a list of past and present benchmarking programs you might encounter.
This is the first article in a series on benchmarking. Future articles will focus on the specifics of popular benchmarks such as LINPACK, SPEC, NAS, and Parkbench.
MFlops stands for millions of floating point operations per second. The conversion between Mips and MFlops depends on the machine type. If, for example, we consider that performing a floating point operation on a scalar machine may require an average of three instructions, then one MFlops may imply three Mips.
For example, the CRAY Y-MP has a cycle time of 6 nanoseconds. If, during a cycle, the results of both an addition and a multiplication can be completed, then the MFlops/s rate is:
(2 operations / 1 cycle) * (1 cycle / 6 nsec) = 333 MFlops/s
This is generally a peak rate, and operations are done in full precision.
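The same calculation can be written as a small Python function. This is only a sketch of the arithmetic above; the function name and parameters are made up for illustration.

```python
def peak_mflops(ops_per_cycle, cycle_time_ns):
    """Peak rate in MFlops/s from operations per cycle and cycle time.

    Operations per nanosecond is a rate in GFlops/s, so multiply by
    1000 to express it in MFlops/s.
    """
    if cycle_time_ns <= 0:
        raise ValueError("cycle_time_ns must be positive")
    return ops_per_cycle / cycle_time_ns * 1000.0


if __name__ == "__main__":
    # One add and one multiply per 6 ns cycle, as in the example above
    print(f"{peak_mflops(2, 6):.0f} MFlops/s")
```

Halving the cycle time or doubling the number of results per cycle doubles the peak rate, which is why the figure says nothing about how often a real program keeps both units busy.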
Peak Theoretical Performance is typically used by the manufacturer to denote the number of floating point operations that can be completed during a period of time (usually the cycle time for the machine). In other words, the manufacturer guarantees that programs will not exceed these rates, which amount to a kind of speed of light for a given computer.
Peak theoretical performance for a single cpu 6 nanosecond CRAY Y-MP was computed in the previous paragraph: 333 MFlops/s. The evolution of peak performance rates in time can be extracted from Jack Dongarra's latest LINPACK report: from 1 to 10 MFlops/s (CDC 6600), from 10 to 100 MFlops/s with later systems (IBM 370/195 and Illiac IV), from 100 to 1000 MFlops/s with more recent systems (CRAY 1, CRAY X-MP, CRAY Y-MP), and from 1000 to 100000 MFlops/s with current parallel machines (16 processor CRAY C90 is 15238 MFlops/s, 32 processor CRAY J90 is roughly 6000 MFlops/s or 6 GFlops/s, pronounced gigaflops).
Peak Theoretical Performance for newer architectures is now usually given not on a per processor basis, but across all processors as with the 32 processor CRAY J90, which has a peak rate of over 6 GFlops but only about 200 MFlops/CPU. More important than peak performance for the newer massively parallel machines is scalability.
Sustained performance is average performance. We will speak of vector performance here, even though the discussion could be broadened. Typical vector rates attained from high-level compiled source code have been found to be significantly lower than the peak theoretical rates. For example, 100 MFlops is often cited as a good vector rate for real world Fortran code on the CRAY Y-MP, and is substantially less than the 333 MFlops peak rate. The reason for this performance decline is that in realistic environments the average speed is determined by the processing times of a large number of mixed jobs that include both CPU and I/O operations.
Sustained performance is difficult to determine, since it varies with such factors as the level of vectorization achieved in the program, vector length, memory contention, and I/O latency. It is fairly typical for sustained rates to approach 50 percent of the peak rate. For example, the CRAY J90 computer with 32 processors reports a sustainable application performance of 2 to 3 GFlops.
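One way to compare these figures is to express sustained performance as a fraction of peak. The Python sketch below uses only numbers quoted in this article (100 of 333 MFlops for the Y-MP, 2 to 3 GFlops of roughly 6 GFlops for the 32 processor J90); the function name is made up for illustration.

```python
def fraction_of_peak(sustained_mflops, peak_mflops):
    """Sustained rate divided by peak theoretical rate."""
    if peak_mflops <= 0:
        raise ValueError("peak_mflops must be positive")
    return sustained_mflops / peak_mflops


if __name__ == "__main__":
    cases = [
        ("CRAY Y-MP, good Fortran code", 100.0, 333.0),
        ("CRAY J90 (32 CPUs), low estimate", 2000.0, 6000.0),
        ("CRAY J90 (32 CPUs), high estimate", 3000.0, 6000.0),
    ]
    for label, sustained, peak in cases:
        print(f"{label}: {fraction_of_peak(sustained, peak):.0%} of peak")
```

On these numbers the quoted rates fall between about 30 and 50 percent of peak.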
Stay tuned for more detailed information on specific benchmark suites!
For questions or suggestions on benchmarking please contact Richard Marciano at (702) 895-4000 or firstname.lastname@example.org