In Proceedings of the Joint International Conference on Mathematical Methods and Supercomputing for Nuclear Applications, vol. 2, 1454-1466 (Saratoga Springs, New York October 5-9, 1997).

DEVELOPMENT OF PORTABLE PROGRAM SYSTEM

FOR MODELING NEUTRON-PHYSICAL PROCESSES IN

NUCLEAR REACTOR

ON PARALLEL COMPUTERS

V. I. Arzhanov, A. V. Voronkov, A. S. Golubev,
N. A. Konovalov, V. A. Krukov, and E. P. Sychugova

Keldysh Institute of Applied Mathematics, Russian Academy of Sciences,

Miusskaya Sq. 4, Moscow, 125047, Russia
(095) 972-3646

Abstract

Main approaches for elaborating portable parallel codes are considered. The REACTOR-P parallel program system is described. Traditional ways for creating portable parallel codes are examined. Main shortcomings of the message passing approach are discussed. A new data parallel language named C-DVM for implementing portable codes on multiprocessor computers has been developed. Some parallel numerical algorithms for 3D neutron diffusion problems are suggested. Results of numerical experiments have confirmed high efficiency of these algorithms and this new parallel language.

Table of contents

1. Introduction

2. Portability of the SEQUENTIAL program system

3. Using traditional approaches TO elaborate parallel codes

4. Development of Methods and Tools for Creating Portable and Efficient Parallel Programs

1. Introduction

Since 1985 we have been developing all neutron-physical codes within the framework of the unified package of application programs REACTOR¹, which has been used in the Keldysh Institute of Applied Mathematics (KIAM) and other organizations for some years. The REACTOR package allows to carry out numerical calculation of stationary, quasistationary (calculation of burnup) and non-stationary (kinetic problems) processes in nuclear reactors in one-, two-, and three dimensional geometries. To solve the transport equation there can be used Diffusion, P1-approximation or the Method of Discrete Ordinates.

Up to now the REACTOR package has been used only on serial computers that narrowed areas of application. When new parallel systems became available the problem arose to create appropriate software effectively using the potentiality of parallel systems. Primarily it is necessary to elaborate software for modeling dynamic processes in nuclear reactors and especially to simulate emergency situations. These problems require a considerable increase in solution precision. High accuracy could be a compensation for practically a complete absence of full scale emergency experiments. Deep parallelization allows to use more precise and efficient solution methods. At present the REACTOR system is enhanced with new parallel computational modules. This new version of the package (Reactor-P) is intended to solve the most sophisticated problems requiring high performance and enlarged memory size. REACTOR-P runs on the distributed-memory multicomputer MBC-100² based on i860-XP processors.

2. Portability of the SEQUENTIAL program system

The package REACTOR-P is split into big separate codes (or sections according to this package terminology). Each section performs a considerable logically closed body of job, for example imputing reactor geometry or zone composition, computing equation coefficients and so on. Their interaction is performed through the archive (specialized data base) that resides into external storage. As a rule each section runs in the following manner: interacts with the program environment, inputs initial information from the data file, reads information from the archive, processes inputted information and writes results back to the archive. On this approach one can isolate those parts of the package (in the form of executable module) that being small as source code consume the major portion of the total CPU time. It is precisely these parts of the package that can be effectively parallelized.

To create portable software one must first of all use standard programming languages. A major portion of our reactor codes is written in Fortran-77. Additional potentialities that are absent in Fortran can be provided by means of the C language. At the present moment it is decided to use only ANSI C and Fortran. The joint use of Fortran and C poses a problem of language mixing. Theoretically there are two ways:

executable modules are written in only one language, data exchange is based on standard OS files (weak interface);
separate executable modules use language mixture (strong interface).

It is decided to use only the weak interface where possible.

To exclude code dependence on specific parameters and computer platform limitations a concept of parameter independence has been developed. Basic requirements are as follows:

to locate all specific declarations and dimensions in one place;
to use systematic declarative notations convenient for the user;
to free all subprograms of specific declarations;
to pass maximum dimensions into subprograms to control memory usage.

3. Using traditional approaches TO elaborate parallel codes

To solve sophisticated scientific problems on distributed memory multicomputers we use the well-known approach based on describing the algorithm as a set of sequential processes interacting through the message passing. Under this approach two basic methods can be used to implement parallel algorithms. One of them is to use new parallel languages that are standard languages extended with new language constructions. The other method uses special libraries extending standard languages (for example Fortran or C) with the message passing means.

The Message Passing Interface library (MPI³) is an essential step in standardization of development tools. Being implemented on different hardware platforms this library principally gives a potentiality to write portable parallel codes. Comparatively low level of these library tools close to hardware structure of distributed memory systems allows to write effective parallel codes for such systems.

At present an MPI subset is implemented in KIAM using the program system Router⁴. Its functions are available by explicitly calling library routines. In addition, the message-passing language Fortran-GNS⁵ is implemented on MBC-100 in KIAM. This language is an extension of Fortran-77 in accordance to C. A. Thole’s suggestions⁶. These suggestions are based on the concept of the Abstract GENESIS Machine. In line with this concept the user’s task is represented as a set of sub-tasks which can be dynamically started on given processors of the distributed system. They interact with each other through message passing in one of three modes: synchronous, asynchronous, and no-wait. At present the concept of GENESIS computer is implemented as the Lib-GNS library elaborated to model basic features and statements of the Fortran-GNS. Codes written in Fortran-GNS may be ported onto any platform, that has a Fortran-77 compiler and Lib-GNS.

Some results about efficiency of these approaches shown by the package REACTOR-P are presented for a benchmark problem based on experiments performed at the United Kingdom ZEBRA facility⁷. This benchmark problem is a system of diffusion equations consisting of 26 energy groups. The finite-difference grid contains 16 planes each having 1519 points to give totally 24304 unknowns in every energy group. As it was mentioned above a parallel code is a set of interacting processes each running on a separate processor with local memory. On the other hand every section of the package (including a CPU-intensive section) must have access to the archive in the external storage. A problem arises how to organize simultaneous access of different processes to one archive under the necessary condition to keep information consistency. A number of alternatives have been analyzed to choose the most simple and logically clear scheme: one process is assigned to be the master, the others are slaves. Only the master process may interact with the “outer world”: (input/output from/to the archive, displaying messages on the monitor, receiving information from the keyboard and so on). Besides the master process organizes coherent running of slave processes: sends out initial information, gives permission to start/continue, gathers information on the achieved local accuracy in separate processes, decides on finishing the job, and at last performs necessary actions to complete running.

Calculations have been carried out in the multigroup neutron diffusion approximation in three-dimensional X-Y-Z geometry. This test problem is solved by using Fortran-77 together with MPI, implemented in the system Router. Solution parallelization of the diffusion equations is implemented by means of decomposition of numerical domain into layers along the vertical axis. Each layer consists of a number of grid planes and is processed by one processor. The information interaction involves exchange of values of neutron fluxes at the boundaries of layers.

According to our results the most effective splitting is to break down the whole grid into computational sub-areas with equal numbers of grid planes so that each sub-area corresponds to one processor. At the inner iterations all the processors start working simultaneously. The planes in each sub-area are ordered from the periphery to the center of reactor.

When solving the equation in each energy group the least parallelizable part is a fragment of code controlling inner iterative convergence. This algorithm runs on the processor that receives information on local convergence, analyses it, and decides whether to continue inner iterations or not. There are two possibilities to organize these processes on a set of processors. In the first case one of the local processors acts as the host processor. Performance results in this case are presented in Table 1

Table 1

Master and slave processes placed on single processor

N	1	2	4	8	16
E	100%	97.41%	96.75%	85.87%	49.89%

In this Table E is the parallelization efficiency:

where stands for the time shown by the single processor code and is the time reported by the N processor code.

Analysis has shown that the main cause decreasing the parallelization efficiency is a fragment of code searching the overall convergence of inner iterations. The fact is that the time spent on receiving the global criterion from the host processor increases considerably faster than linear function when the total number of processors grows. Because of this another parallelization method has been considered when a separate processor acts as the host. The basic idea is to superpose inner iterations and convergence analysis of the previous iteration. This admits one extra inner iteration in every group. This approach does decrease parallelization efficiency in case of few processors. But this approach is promising in case of massively parallel systems.

The following are the results obtained in case of 3, 5, 9, and 17 processors.

Table 2

Master and slave processes placed on different processors

N	3	5	9	17
E	60.32%	78.10%	80.28%	59.73%

Comparative analysis shows that in case of 17 and more processor code parallelization efficiency is greater than in the preceding case. Nevertheless the parallel efficiency drops as earlier when the number of processors grows.

4. Development of Methods and Tools for Creating Portable and Efficient Parallel Programs

The experience of development of the REACTOR-P system in Fortran-77 language with MPI library confirms the well-known opinion: the message passing approach has essential shortcomings.

First, the level of language facilities supporting program parallelism is too low. The lack of the global address and name space requires serious efforts for writing parallel programs. The debugging of message passing programs is still an open problem and a serious disadvantage for industrial development of parallel codes. The parallel program development needs using or creating the appropriate computational methods. To achieve the required performance it is necessary to modify or even to rewrite parallel programs many times. And this modification is much more complicated for the message passing programs than it is for sequential ones.

Second, the efficient program execution on massively parallel computers requires load balancing which is extremely difficult to provide in the message passing model. Workstation networks have become the most common parallel architectures available, but porting message-passing programs onto heterogeneous computer networks is a very difficult problem. It is necessary to develop new methods and tools for creating portable and efficient parallel programs for distributed memory computers.

The majority of experts believe that using data parallel languages (Fortran D⁸, Pandore C⁹, Vienna Fortran¹⁰, HPF¹¹) is the most promising approach. These languages are versions of standard Fortran or C languages enhanced with data decomposition. The languages relieve the programmer from the burden of program partitioning and communication generation whenever a processor needs data from other processors. Based on the user-specified partitioning of data, the compilers must generate the corresponding parallel program for distributed-memory computers. It is unknown today whether they will be able to achieve performance comparable to message passing languages.

Research on data parallel languages, which was started in KIAM in 1993, allows us to arrive at the conclusion: the main weakness of data parallel languages is not only complexity of compilers, but also absence of explicit and clear model of parallel execution. This makes impossible to provide the user with convenient tools for analyzing execution and improving performance of his program.

A new model of parallel program execution was proposed in 1994¹². This model (DVM-model) combines the major advantages of PCF Fortran¹³ and HPF models and allows one to express functional and data parallelism in scientific and engineering applications on massively parallel computers.

The DVM name originates from two main notions - Distributed Virtual Memory and Distributed Virtual Machine. The former reflects the global address space, and the latter reflects the use of an intermediate abstract machine for the two-step mapping of the parallel program onto a real parallel computer. This distributed virtual machine consists of one or several multidimensional arrays of virtual processors, and defines the parallelism of the program. The programmer creates a suitable virtual machine for his program, then specifies the mapping of the program and data onto his virtual machine, and describes the rules of mapping this virtual machine onto the real parallel computer. This mapping is performed during run-time and does not require to recompile the source code.

The C-DVM language was developed on the basis of the DVM-model. The main principle of language development is: the programmer must have a possibility to specify exactly parallel execution of his program, but the intelligent compiler does not require the user to do it.

In fact, this principle determines directions of the language progress: from the complicated languages to simpler ones, from simple compilers to more complex ones. The direction of HPF development is different: from automatic parallelization to HPF1, then - to HPF2. The main effect of the difference in approaches is that at the cost of certain additional efforts the user of C-DVM is always able to make his program as efficient as it could be when conventional means for parallelization and message passing is used. Yet this does not involve any decrease in portability.

The C-DVM language has been developed concurrently and jointly with the REACTOR-P system. As a result, the DVM-model was improved considerably. Two experimental versions of the compiler were implemented and the language specification was revised twice. As a result, the C-DVM programming system was created. A short description of this programming system is presented below.

The system is based on the new language model and new methodology, it is aimed at simplifying development, debugging, testing, performance evaluation and maintenance of portable and efficient parallel programs.

The C-DVM language was developed via enhancing the C language in correspondence with the DVM-model. The parallel program is a code in the standard C language, in which the DVM directives specifying parallel execution of the program are inserted. The directives are transparent for standard C compilers, therefore DVM programs can be executed on workstations as usual sequential programs.

The language includes the following main features for parallelism description:

distribution of loop iterations over processors;
distribution of array elements over processors;
organization of efficient access to remote data (located on other processors) with high level array-based specification;
organization of efficient execution of global operations with data, located on different processors (for example, accumulation of the values of distributed array elements).

The compiler translates the parallel C-DVM code into a sequential C code including Lib-DVM library calls. The Lib-DVM is written in the C language and it uses the features of MPI or PVM¹⁴ for providing interprocessor communications. Therefore C-DVM can be used on any platform that has a C compiler and MPI or PVM installation.

For debugging a DVM program the following approach has been suggested. In the first stage, the program is debugged on a workstation as a sequential program using appropriate debugging methods and tools. In the second stage, a special execution mode for checking the DVM directives is used on the same workstation too. In the third stage, the program may be executed on a parallel computer (or on a workstation, simulating the MPI or PVM parallel machines) in a special mode, when intermediate results of its execution are compared with reference results (for example, the results of sequential execution). This mode is used for program testing as well.

There are tools for analyzing statistic data about program execution on a parallel computer (obtained in time of real execution or simulation). The DVM-program performance browser shows information about execution time of the DVM program and its fragments on every processor. Besides, the browser shows integral characteristics of the DVM-program and its fragments performance, for example, the parallelization efficiency.

Research and experimental programming of the REACTOR-P system in the C-DVM language have shown the following peculiarities of the main REACTOR-P code:

low ratio computation/communication;
every iteration requires exchange of boundaries (exported array elements, which are used by neighboring processors) of tens of arrays;
test of several conditions for iteration completion require to execute several reduction operations.

As a result, the following features for overlapping calculation and communication were included in the language:

asynchronous execution of group of reduction operations (SUM,MAX,MAXLOC, etc.);
asynchronous exchange of group of arrays boundaries;
potentiality to change loop iteration order to provide, for example, early calculating and sending array boundaries.

The above mentioned computational code for solving multigroup diffusion neutron problems in 3D X-Y-Z geometry have been rewritten in the C-DVM language. The following modifications in codes have been made to increase parallelization efficiency:

the whole computational area are embedded into a 3D parallelepiped and all main arrays are declared as regular 3D arrays, it allows to perform parallelization in three directions X, Y, and Z;
the “red-black” ordering of grid nodes are introduced instead of the natural one giving computational independence on the number of processors, besides loop iterations become independent, this allows to change the order of loop iterations - first processing all exported elements (boundary values used by neighboring processes), then the sending procedure is started concurrently with processing inner elements, note that the potentiality to change order of loop iteration and to overlap computing and message passing are advantages of the C-DVM language;
the loop structure is modified to decrease the number of inter-processor messages, the number of data passing operations is reduced by the factor of G (G is a number of energy groups) through integrating separate passings into one operation, this modification considerably decreased the passing time overhead;
to minimize the cost of determining the convergence criterion of inner iterations the asynchronous reduction operation MAX is used, this operation for the K-th iteration is performed as a background for the (K+1)-th iteration, one can define other delay number, note that the potentiality to overlap reduction and computing is a C-DVM feature.

Table 3 presents performance results shown by a code written in C-DVM for the same test problem. In this test problem there were 16*31*61 grid points in Z, Y, and X directions correspondingly. Grid points were distributed over processors either in Z-direction (denoted in the table, for example, by 8*1) or in Z- and Y-directions (denoted by 8*4 meaning 32 processor grid of 8 rows and 4 columns in Z and Y directions correspondingly, arrays were divided into 8 equal portions over Z-axis and approximately into 4 equal portions over Y-axis).

Table 3

Parallel efficiency of C-DVM code

N	1	2*1	4*1	8*1	2*8	2*16
	693.8	353.5	183.0	99.0	54.3	29.9
t_opt	693.8	346.9	173.5	86.7	43.4	21.7
t_wst	0.0	6.6	9.5	12.3	10.9	8.2
E	100%	98%	95%	88%	80%	73%

where t_opt = / N, t_wst = - t_opt

As one can see from Table 3 the C-DVM program has high performance for distributed systems containing tens of nodes.

Table 4

Parallel efficiency of C-DVM code on 32 processor configuration

(as function of processor topology)

N	16*2	1*32	8*4	4*8	2*16
	36.4	33.6	30.8	30.8	29.9
t_opt	21.7	21.7	21.7	21.7	21.7
t_wst	14.7	11.9	9.1	9.1	8.2
E	60%	65%	70%	70%	73%

The indicated above results were obtained on the basis of statistical information produced by the special C-DVM performance browser.

The following Table 5 displays an example of statistics on C-DVM program performance, where:

N - number of processors
T_par - parallel execution time (the maximum value of execution times of separate processors)
T_est - sequential execution time (estimation)
E_est - parallel efficiency estimation = T_est / (T_par * N)
T_wst - waste time = (T_par * N) - T_est

Components of waste time:

T_red - time of waiting for reduction operations, in parentheses - synchronization overhead (a part of time of waiting for collective communication operations caused by non-simultaneous starting of these operations)
T_shd - time of waiting for shadow elements (importing copies of elements from neighboring processors) , in parentheses - synchronization overhead
T_oth - other portion of waste time.

Table 5

	N=1	N=2(2*1)	N=4(4*1)	N=8(8*1)	N=16(2*8)	N=32(2*16)
T_par	693.8	353.5	183.0	99.0	54.3	29.9
T_est	693.7	689.0	670.4	658.8	633.3	635.6
E_est	1.0	0.97	0.92	0.83	0.73	0.66
T_wst	0.0	18.0	61.0	133.2	236.0	320.9
T_red	0.0	0.12(0.0)	9.8(9.0)	20.4(16.7)	120.5(111.0)	157.9(130.0)
T_shd	0.0	17.3(3.0)	49.9(27.4)	109.0(28.6)	105.8(69.7)	142.2(71.6)
T_oth	0.0	0.2	1.35	3.9	9.7	20.7

5. Conclusions

The experience on elaborating the portable program system REACTOR-P on parallel computer confirmed the correctness of the proposed approaches for implementing parallel codes. It has been shown that the elaborated parallel methods and approaches for their implementation give good results for distributed memory systems containing tens of processor nodes.

The experience on developing REACTOR-P programs in the C-DVM language allows us to state that the C-DVM programming system makes it much easier to develop programs on distributed memory computers.

It is very important that the C-DVM program can be dynamically (without recompilation) customized to application parameters (the number and size of arrays) and parallel computer configuration. This allows us to reduce the number of different versions of the program and to simplify program maintenance.

Further development of the REACTOR-P program system will be oriented to parallelization of neutron dynamic processes with thermohydraulic feedback and to modeling the problem of deep penetration radiation by using the transport equation in 3D geometry. To support it the C-DVM language will be extended with features for expressing functional parallelism.

Performance analysis of the REACTOR-P program written in the C-DVM language has shown that it is necessary to include into the language features for irregular mapping of arrays and computation (in a similar manner as in Vienna Fortran or HPF2 languages).

Appendix

Example of (parallel) Jacobi solver in C-DVM for Laplace equation

#define DVM(directive) /* This macros is blank for standard C compiler
                          and used by C-DVM compiler */

#define DO(v,l,u,s) for(v=l;v<=u;v+=s)

/*Definition of loops that can be made parallel in C-DVM */
#define N 1000
#define ITMAX 20

void main(void)

DVM(DISTRIBUTE [BLOCK] [BLOCK]) /* Pointers to arrays A and B 
                                 with block distribution */
float (*A)[N], (*B)[N];
int i,j,k; float eps,MAXEPS=0.5E-7;

A = malloc(N*N*sizeof(float)); // memory allocation for arrays A
B = malloc(N*N*sizeof(float)); // and B they are used below

/* main iterative loop */

for(it=0;it<ITMAX;it++)
{ eps=0.0;
  DVM(PARALLEL ON A[i][j]; REDUCTION MAX(eps))
  /* Nest of two loops and description of reduction operation */
  DO(i,1,N-2,1)
  DO(j,1,N-2,1)
  {
     eps=max(abs(B[i][j] - A[i][j]),eps);
     A[i][j] = B[i][j];
  }
  DVM(PARALLEL ON B[i][j]; SHADOW_RENEW A)
    /*Copying shadow elements of array A from neighbouring
      processors before loop execution */
  DO(i,1,N-2,1)
     DO(j,1,N-2,1)
        B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
  printf("it = %4d eps = %12.3e\n",it,eps);
  if(eps < MAXEPS) break;
  }
}

Acknolegements

This research was performed under the financial support of ISTC Projects # 067-94 and # 068-94.

Referencies

A.V. Voronkov, V.I. Arzhanov, “REACTOR – Program System for Neutron-Physical Calculations”, Proceedings of the International Topical Meeting “Advances in Mathematics, Calculation and Reactor Physics”, 5, 30.5 1-1 - 30.5 1-4, ANS, Pittsburgh, USA (1991).
*A.V. Zabrodin, V.K. Levin, V.V. Korneev, “The Massively Parallel Computer System MBC-100”, Proceedings of the Third International Conference on Parallel Computing Technologies, 341-355, Springer, St. Petersburg, Russia (1995).
David W. Walker, “The design of a standard message-passing interface for distributed memory concurrent computers”, Parallel Computing, 20, 4, 657-673 (1994).
A.O. Lacis, A.E. Lutskiy, “Numerical solution of supersonic overpassing flows on a multi-processor system.”, Proceeding of Russian Society of Computer Sciences and Computer Technics, 3, 19-23, Moscow, Russia (1994).
V.A. Krukov, L.A. Pozdnjakov and I.B. Zadykhailo, “RAMPA - CASE for portable parallel programs development”, Proceedings of the Third International Conference on Parallel Computing Technologies, 3, 643-650, IPPE, Obninck, Russia (1993).
C.A. Thole, “A Proposal for the Abstract GENESIS Machine”, Report on the GENESIS definition phase, October 1988 - June (1989).
A.N. Chebeskov, I.V. Krivitskij, G.V. Matveev, Y.N. Miranovich, A.V. Voronkov, E.P Sychugova, A.D. Knipe “Low reactivity sodium-void benchmark study in an annular heterogeneous assembly”, Proceedings of the International Topical Meeting Sodium cooled fast reactor Safety, 1, 60-70, IPPE, Obninsk, Russia (1994).
G. Fox, S. Hiranandani, K. Kennedy, K. Koelbel, U. Kremer, C. Tseng, and M. Wu “Fortran D Language specification”, Technical Report, TR 90-141, Dept. of Computer Science, Rice University (1990).
F. Andre, J. Pazat, H. Thomas, “A system to manage data distribution”, Proceedings of the 1990 ACM International Conference on Supercomputing, Amsterdam, Netherlands (1990).
B. Chapman, P. Mehrotra, and H. Zima, “Programming in Vienna Fortran”, Science Programming, 1, 31-50 (1992).
High Performance Fortran Forum, “High Performance Fortran Language Specification”, Technical Report, Version 1.0, Rice University (1993).
N.A. Konovalov, V.A. Krukov, S.N. Mihailov and A.A. Pogrebtsov, “Fortran DVM - a Language for Portable Parallel Program Development”, Proceedings of Software For Multiprocessors & Supercomputers: Theory, Practice, Experience, 124-133, Institute for System Programming, RAS, Moscow (1994).
B.T. Smith, “Parallel computing forum (PCF) Fortran”, Proc. IFIP WG 2. 5 Working Conference - Aspects of Computation on Asynchronous Parallel Processors, North-Holland (1989).
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, “PVM 3 User’s Guide and Reference Manual”, Technical report, Oak Ridge National Laboratory ORNL/TM-12187 (1993).