Prefictor of DVM-program performance. Detailed design

Predictor of DVM-program performance
(Predictor)
Detailed design
* September, 2000 *

- last edited 27.12.00 -

Contents

1 Introduction
2 Predictor implementation

2.1 Representation of the program as an hierarchy of intervals
2.2 Program execution characteristics on each processor
2.3 Main characteristics and their components
2.4 Source data for Predictor

3 Simulation

3.1 General principles of simulation
3.2 Ordinary functions
3.3 Interval delimiter functions
3.4 Input/output functions
3.5 Create/delete object functions
3.6 Data and resources distribution functions
3.7 Functions of collective operation initialization
3.8 Functions of collective operation execution
3.9 Parallel loop functions
3.10 Unknown functions

4 Estimation of remote data access overhead

4.1 Main terms and definitions
4.2 Estimation of array redistribution exchanges
4.3 Estimation of distributed array edge exchanges
4.4 Estimation of reduction operation exchanges
4.5 Estimation of exchanges during remote array buffer loading

Appendix 1. Link from field name in output HTML-file to field name in structure
_IntervalResult
Appendix 2. Definition of auxiliary functions and classes
Appendix 3. Main functions of time extrapolation
Appendix 4. Trace fragments and parameters of Lib-DVM functions simulated
by Predictor
References

1 Introduction

The Predictor is intended for performance analysis and performance debugging of DVM-programs without usage of a real parallel computer (access to which is usually limited or complicated). With the Predictor user can get the predicted time characteristics of execution of his program on MPP or workstation cluster in more or less details.

The Predictor is the system for processing the trace information gathered by Run-Time library (Lib-DVM) during the program execution on a workstation and consists of two major components: trace interpreter (PRESAGE) and time extrapolator (RATER). The trace interpreter, using trace information and user-defined parameters, calculates and displays extrapolated time characteristics for the program execution on MPP or workstation cluster, calling functions of time extrapolator, which simulates parallel DVM-program execution. In fact, time extrapolator is a model of Lib-DVM library, low level message passing system (MPI), used by the library and hardware.

The performance of parallel programs on multiprocessor computers with distributed memory is determined by the following major factors:

Program parallelism degree - a part of parallel calculations in the total volume of calculations.
Balance of processor load during parallel calculations.
Time needed for execution of interprocessor communications.
Degree of overlapping of interprocessor communications with calculations.

The Predictor allows user to obtain the above and others characteristics and estimate quality of the parallel program.

There are three major stages of Predictor work:

read trace; process information about interval structures, support system function call sequence and nesting, input and output function parameters used for simulation and time of each call execution;
simulate program execution on the base of the program execution structure obtained at the previous stage and calculate program execution characteristics described in 2.2.1 for each interval;
write characteristics into HTML file.

2 Predictor implementation

2.1 Representation of the program as an hierarchy of intervals

For detailed program efficiency analysis user can split program into intervals and obtain performance characteristics for each interval.

The execution of the program can be represented as a sequence of alternating sequential or parallel intervals (groups of operators) execution. By default, the program is considered as one interval. Also user can define intervals by means of C-DVM and Fortran DVM languages. There is also opportunity to set a mode of compilation when the following loops can be declared as intervals

all parallel loops;
all sequential loops, containing parallel loops;
all sequential loops.

(Constraint: when Predictor is used, the integer expression with an interval must not appear inside parallel loop).

The user can also split any interval into smaller intervals or unite neighbor intervals (in order of execution) in new one, i.e. to present the program as hierarchy of intervals of several levels (the whole program is an interval of highest level 0).

The mechanism of splitting the program into intervals serves for more detailed analysis of behavior of the program during its execution. Looking through results with the help of the Predictor, user can set a depth of details i.e. ignore intervals of prescribed levels.

For simplification of the further description we shall enter the following notions. An interval will be called simple, if it does not contain other intervals (nested intervals). We refer to intervals including nested intervals as composite. Thus, the whole program is a interval tree: an interval of highest level (whole program) is a root of the tree; an interval of the lowest level is a leaf.

While processing trace information one of the intervals is an active interval (it is an interval of the lowest level containing program operator executed at the moment). For the active interval the following information is accumulated:

type of the interval (IntervalType);
number of the interval entering (EXE_count);
source file name (source_file);
line number corresponding to the beginning of the interval (source_line).

At this stage the number of communication operations within the interval is calculated:

the number of input/output operations (num_op_io);
the number of reduction operations (num_op_reduct);
the number of edge exchange operations (num_op_shadow);
the number of remote access operations (num_op_remote);
the number of array redistribution operations (num_op_redist);

In Predictor each interval corresponds with an object of “Interval” type which contains all necessary interval characteristics; during trace processing the interval tree builds. Each interval contains a vector of sub-objects of “Processor” type, each element of the vector saves characteristics of a processor involved in the interval execution. Thus after trace processing stage we have an interval tree with nodes of “Interval” type containing saved characteristics (IntervalType, EXE_count, source_file, source_line, num_op_io, num_op_reduct, num_op_shadow, num_op_remote, num_op_redist). At the same time characteristics of each involved processor are saved in the corresponding “Processor” object.

2.2 Program execution characteristics on each processor

“Processor” object contains the following characteristics of the processor involved in the interval execution:

Execution_time – the interval execution time.
IO_time – input/output time.
SYS_time – productive system time (Lib_DVM call execution time without message exchange time).
CPU_time - productive user program time (user program calculations taking into account loop ‘’slicing’’).
Lost_time - sum of lost times due to insufficient parallelism (Insuff_parallelism), losses because of communications (Communication) and processor idle time (Idle).
Communication – total communication time. Time of every communication type is extrapolated by time extrapolator (RATER).
Insuff_parallelism = Insuff_parallelism_USR + Insuff_parallelism_SYS.
Insuff_parallelism_USR – user program losses because of insufficient parallelism.
Insuff_parallelism_SYS – system losses because of insufficient parallelism.
Synchronization – time of losses because of dissynchronization.
Time_variation - time variation of collective operation completion.
Idle – processor idle time - difference between maximal interval execution time (looking through all processors) and interval execution time at the current processor.
Load_imbalance – time of imbalance of processors loading, difference between maximal processor time (CPU+SYS) and corresponding time at the current processor.
Overlap – total time of overlapping of asynchronous communication operations; sum of time of overlapping of asynchronous input/output operations (IO_overlap), reductions (Reduction_overlap), edge exchanges (Shadow_overlap), remote access (Remote_overlap) array redistributions (Redistribution_overlap).
IO_comm – total time of communications for input/output operations.
IO_synch - losses because of dissynchronization for input/output operations.
IO_overlap - time of overlapping of asynchronous input/output operations.
Wait_reduction - total time of communications for reduction operations.
Reduction_synch - losses because of dissynchronization for reduction operations.
Reduction_overlap - time of overlapping of asynchronous reduction operations.
Wait_shadow - total time of communications for edge exchange operations.
Shadow_synch - losses because of dissynchronization for edge exchange operations.
Shadow_overlap - time of overlapping of asynchronous edge exchange operations.
Remote_access - total time of communications for remote access operations.
Remote_synch - losses because of dissynchronization for remote access operations.
Remote_overlap - time of overlapping of asynchronous remote access operations.
Redistribution - total time of communications for array redistribution operations (redistribute, realign).
Redistribution_synch - losses because of dissynchronization for array redistribution operations.
Redistribution_overlap - time of overlapping of asynchronous array redistribution operations.

As mentioned above accumulation of these characteristics is performed at the first stage of simulation – at the stage of trace processing; the characteristics are saved in “Processor” object for each interval and for each processor used for the interval execution.

While trace processing the tree of intervals (tree of “Interval” objects) is built, each interval contains a vector of “Processor” objects. The vector size is equal to the number of processors in the root processor topology, but processors involved in the interval execution are significant only.

2.3 Main characteristics and their components.

The possibility of processing the trace, accumulated by Lib-DVM at the stage of simulation execution of the program allows Predictor to give user the following main characteristics of the parallel program execution for the whole program as well as for each interval:

execution time (Execution_time) - maximal parallel program execution time (looking through all processors).

efficiency coefficient (Efficiency) – main characteristic of parallel program execution. It is necessary to calculate two amounts of time:

Productive time (Productive_time), required for the parallel program execution on serial computer (consists of three parts):

productive user program time (Productive_CPU_time) – time which the program spends for user calculations.

productive system time (Productive_SYS_time) - time which the program spends for system calls (Lib-DVM calls).

input/output time (IO_time) - time which the program spend for input/output.

Total processor time (Total_time) calculated as a product of execution time by the number of processors.

Efficiency coefficient - is a ratio of the productive time to the total processor time.

lost time (Lost_time) - is the total processor time subtracted by the productive time. There are following components of the lost time:

synchronization (Synchronization) – used for estimation of the total potential losses, which can arise because of non-simultaneous start of collective operations on different processors. A main reason of these losses is imbalance of processors loading during execution of parallel loops. For estimating the potential losses because of imbalance, the generalized characteristic should be given to the user – imbalance.

imbalance (Load_imbalance). Dissynchronization can occur not only due to imbalance, but also due to differences in completion times of a collective operation on different processors connected with its realization on the specific parallel machine.

time variation (Time_variation) of collective operation completion is used by programmer to evaluate the potential dissynchronization.

time of overlapping (Overlap) is an important characteristic, showing the degree of overlapping of interprocessor communications with calculations. It is sum of basic characteristics – time of overlapping of reduction operations and time of overlapping of shadow edge exchanges.

For more detail information about above characteristics see [1].

After the tree of “Interval” objects is built the method “Integrate” is called for each object. This method calculate interval characteristics using corresponding characteristics of processor vector and characteristics of nested intervals. Finally HTML-file is created according to the interval tree: characteristics of every interval are written into the special “template” (see IntervalTemplate.cpp file). Correspondence between “Interval” object fields and HTML-file fields can be found in Appendix 1.

2.4 Source data for Predictor

The following data serve for simulation: trace of the program execution on one processor, configuration information saved in some file (for example, “Predictor.par”) and command line options. To obtain necessary trace information the following parameters should be edited in the configuration file usr.par:

*Is_DVM_TRACE=1;*	- trace on;
*FileTrace=1;*	- accumulate trace in files;
*MaxIntervalLevel=3;*	- maximum level of nested intervals;
*PreUnderLine=0;*	- do not underline “call” in trace file;
*PostUnderLine=0;*	- do not underline “ret” in trace file;
*MaxTraceLevel=0;*	- maximum trace depth for nested functions.

Parameters PreUnderLine, PostUnderLine è MaxTraceLevel say to Lib-DVM that it is not necessary to accumulate lines of underscores in trace and it is not necessary to trace nested Lib-DVM calls, which gives much smaller size of the trace file.

Note. To run the parallel program on one processor with explicitly defined processor configuration or with dynamic set up for allocated processors it is necessary to define corresponding “virtual” processor system by IsUserPS and UserPS parameters.
For example, to define 2*2 “virtual” processor system use the following parameter values:

*IsUserPS=1;*	- use “virtual” processor system definition;
*UserPS=2,2;*	- “virtual” processor system topology.

Let us consider the following trace fragment:

1.       call_getlen_ TIME=0.000005 LINE=31 FILE=GAUSS_C.CDV
2.       ArrayHandlePtr=951cd0;
          …
3.       ret_getlen_ TIME=0.000002 LINE=31 FILE=GAUSS_C.CDV
4.       Res=4;

Line 1 identifies function of Lib-DVM (getlen_). It contains:

time of execution of the user part of the program (time interval between previous Lib-DVM return and the given function call - TIME=0.000005; this time (call_time) is summed up in CPU_Time field of “Interval” structure during trace processing);
source file line number (LINE=31) which contains Lib-DVM function call. This information is transferred to source_line field of “Interval” structure;
source file name (FILE=GAUSS_C.CDV), it is transferred to source_file field of “Interval” structure;

Line 2 (and probably lines after it) contains names and values of actual function parameters. They are transformed into input numerical values, packed in structures and used for simulation of every Lib-DVM function.
Line 3 traces information about Lib-DVM function return. The only value used by Predictor is the time (ret_time) of the function execution (TIME=0.000002); it is usually summed up in SYS_Time field of “Interval” structure.
Line 4 contains function return value and used as it was described for the line 2.

Every Lib-DVM function is processed by Predictor in accordance with its own algorithm, though many functions (“unknown” functions and “ordinary” functions) are processed by the same algorithm.

Predictor configuration information is defined in the special file (“Predictor.par”). The file contains the characteristics of the simulated multiprocessor system and has the following structure:

// System type = network | transputer
type = network;
// Communication characteristics (mks)
start time = 75;
send byte time = 0.2;
// Comparative processors performance
power = 1.00;
// Topology - optional
topology = {2, 2};

Lines beginning from “//” are comments. The parameter topology defines the virtual processor system topology, i.e. its rank and size on each dimension. The value network of type parameter means that the processor system is the network of workstations, and the value transputer means that the dedicated processor system uses transputer system as communication network. The linear approximation is used to calculate the time of transmission of n bytes:

T = (start time) + n *( send byte time),

where:
start time – start time of the data transmission;
send byte time - time of 1 byte transmission.

Parameter Power defines the ratio of the productivity of the processor where Predictor is working to the productivity of the processor where the parallel program will be executed.
The order of parameters is arbitrary.
Structure of the output HTML-file is the same as the interval structure in the program. Every HTML-file fragment corresponds to some interval and contains data characterizing the interval, the integral characteristics of the program execution on the interval and also links to the fragments with the information about nested intervals. HTML-file is a tree of intervals with special buttons to traverse the tree in any direction.

3 Simulation

3.1 General principles of simulation

At the first stage of simulation call-graph of Lib-DVM calls with actual parameters is built. Such a graph is a linear sequence of calls, as nested calls are not accumulated in the trace for Predictor. The call-graph is built in three steps:

Trace file ->
vector of TraceLine objects ->
vector of FuncCall objects.

At the second stage actual simulation of the program execution on the given multiprocessor system is performed.

To calculate the program execution characteristics Lib-DVM calls are simulated in the same order as they are in trace. During the simulation the following auxiliary structures and structure arrays are used:

structure with information about the processor system;
array of structures with information about existing abstract machine representations;
array of structures with information about existing distributed arrays;
structure with information about the parallel loop last created;
array of structures with information about existing reduction variables;
array of structures with information about existing reduction groups;
array of structures with information about existing shadow edge renewal groups;
array of structures with information about reduction operations already started but not finished;
array of structures with information about shadow edge renewal operations already started but not finished;
stack of virtual machines (and the current virtual machine in it) corresponding to dynamic created virtual machines and ”their entries”.

During the call simulation array elements are created and deleted in the same order as they were during the program execution.

All support system functions can be divided in the following classes (from the point of view of simulation):

ordinary functions;
interval delimiter functions;
input/output functions;
data transfer functions;
create/delete object functions;
data and resources distribution functions;
functions of collective operation initialization;
functions of collective operation execution;
parallel loop functions;
unknown functions.

The principles of the simulation of the above functions will be considered in the next chapters.

3.2 Ordinary functions

In this group there are functions executed on every processor of simulated processor system (or its subsystem the current task is mapped on), and the time of their execution does not depend on configuration parameters and is equal to the time of execution on serial computer (taking into account performance of processors). If it is not specified explicitly, Lib-DVM function is simulated according to the following algorithm.

Simulation: times call_time and ret_time are added to Execution_time, time call_time is added to CPU_time, and ret_time is added to SYS_time of each processor. Besides, Insuff_parallelism_USR of each processor is added to the time calculated by formula:

T = Tcall_time* (N_proc- 1) / N_proc

and Insuff_parallelism_SYS is added to the time:

T = Tret_time* (N_proc- 1)/ N_proc

This is a base algorithm for time simulation. Further on we will suppose that execution time is simulated according to this algorithm, unless another algorithm is indicated explicitly.

3.3 Interval delimiter functions

In this group there are functions used as delimiters of intervals of the simulated program.

Functions:

binter_	Create user interval.
bploop_	Create parallel interval.
bsloop_	Create sequential interval.

Simulation algorithm: according to the information from trace (source file name source_file, line number source_line, expression value (optionally) and interval type) an interval with same characteristics is found in the array of intervals nested in the current interval. Interval type depends on interval beginning function according to the following table:

binter_	User (USER)
bsloop_	Sequential (SEQ)
bploop_	Parallel (PAR)

If interval is found its value of EXE_count increases by 1, otherwise a new element of the array of intervals nested in the current interval is created with EXE_count=1. The found interval or new interval becomes current.

Functions:

einter_	- end of user interval;
eloop_	- end of parallel and sequential interval.

Simulation algorithm: an interval which is external to the current interval becomes current. Times are corrected according to the base algorithm.

3.4 Input/output functions

In this group there are functions used for data input/output. These functions are executed on input/output processor.

Simulation algorithm: time simulation algorithm differs from the base one. Function execution time is added to Execution_time and IO_time of input/output processor. Some of these functions broadcast execution result to all processors. In this case the broadcasting time is added to IO_comm of all processors.

3.5 Create/delete object functions

In this group there are functions used for creating and deleting different objects of Lib-DVM support system . When one of the functions is found in the trace it is necessary either to create and initialize or delete corresponding object in simulated multiprocessor system. As a rule creating and deleting objects are performed by corresponding constructor or destructor of the time extrapolator (RATER).

Function:

crtps_

- create subsystem of the given multiprocessor system.

As subtask is run by:

long runam_ (AMRef *AMRefPtr)

where:
*AMRefPtr – reference to an abstract machine of the subtask

and the current subtask is terminated by:

stopam_ ()

there is a stack of pairs (AM, PS) in from the Predictor (where runam_ push pair (AM, PS) in stack, and stopam_ pop stack). The top of the stack defines the current system statdefines root e: current ÀÌ and current PS. The bottom of the stack AM and PS (rootAM, rootPS).

Pair (AM, PS) is created by simulation of call:

mapam_ (AMRef *AMRefPtr, PSRef *PSRefPtr ).

Topology PS the corresponding AM is mapped on is read from Predictor configuration file (Predictor.par) during runam_ call. Predictor needs ÀÌ only once to establish the current PS. Characteristics and topology of source PS are found in the trace (from calls crtps_ and psview_ - AMRef and PSRef are needed).

Call getamr_ has a parameter:

IndexArray - array with i-th element containing index value of the corresponding element (i.e. abstract machine) along (i+1)-th dimension. The only thing which matters for Predictor is that a new AM is created and then mapped on new PS.

Simulation algorithm: on the base of information from parameter file an object of VM class is created (constructor VM::VM(long ARank, long* ASizeArray, int AMType, double ATStart, double ATByte)). Reference to this object and returned parameter PSRef are saved in the structure describing MPS.

Function:

crtamv_

- create representation of an abstract machine.

Simulation algorithm: on the base of parameters from the trace (Rank and SizeArray[]) an object of AMView class is created (constructor AMView::AMView(long ARank, long* ASizeArray)). Reference to this object, actual parameter AMRef and return parameter AMViewRef are saved in a new element of array of abstract machine representations.

Function:

crtda_

- create distributed array.

Simulation algorithm: on the base of parameters from the trace (Rank, SizeArray[] and TypeSize) an object of DArray class is created (constructor DArray::DArray(long ARank, long* ASizeArray, long ATypeSize)). Reference to this object and returned parameter ArrayHandlePtr are saved in a new element of array of distributed arrays.

Function:

crtrg_

- create reduction group.

Simulation algorithm: an object of RedGroup class is created (constructor RedGroup::RedGroup(VM* AvmPtr)). Parameter AvmPtr is a reference to the object of VM class created while crtps_ function simulating. Reference to the created object and returned parameter RedGroupRef are saved in a new element of array of reduction groups.

Function:

crtred_

- create reduction variable.

Simulation algorithm: an object of RedVar class is created (constructor RedVar::RedVar(long ARedElmSize, long ARedArrLength, long ALocElmSize)). Parameters ARedArrLength and ALocElmSize are actual function parameters, parameter ARedElmSize is calculated using RedArrayType parameter according to the following table:

RedArrayType	C type	ARedElmSize
1 (rt_INT)	int	sizeof(int)
2 (rt_LONG)	long	sizeof(long)
3 (rt_FLOAT)	float	sizeof(float)
4 (rt_DOUBLE)	double	sizeof(doble)

Reference to the created object and returned parameter RedRef are saved in a new element of array of reduction variables.

Function:

crtshg_

- create edge group.

Simulation algorithm: an object of BoundGroup class is created (constructor BoundGroup::BoundGroup()). Reference to the created object and returned parameter ShadowGroupRef are saved in a new element of array of edge group.

Function:

delamv_

- delete an abstract machine representation.

Simulation algorithm: an element with key AMViewHandlePtr is looked for in the array of abstract machine representations (AMViewRef). If it is found the object of AMView class is deleted (this object was created during crtamv_ function simulation) and then the element is deleted.

Function:

delda_

- delete distributed array.

Simulation algorithm: an element with key ArrayHandlePtr is looked for in the array of distributed arrays. If it is found the object of DArray class is deleted (this object was created during crtda_ function simulation) and then the element is deleted.

Functions:

delred_	- delete reduction variable;
delrg_	- delete reduction group.

Simulation algorithm: an element with key RedRef (RedGroupRef) is looked for in the array of reduction variables (reduction groups). If it is found the object of RedVar (RedGroup) class is deleted (this object was created during simulation) and then the element is deleted.

Function:

delshg_

- delete edge group.

Simulation algorithm: an element with key ShadowGroupRef is looked for in the array of edge group. If it is found the object of BoundGroup class is deleted (this object was created during simulation) and then the element is deleted.

3.6 Data and resources distribution functions

In this group there are functions used for initial distribution and redistribution of data and resources.

Function:

distr_

define mapping of an abstract machine representation on multiprocessor system (resource distribution).

Simulation algorithm: an element with key AMViewRef is looked for in the array of abstract machine representations. If it is found the method AMView::DisAM(VM *AVM_Dis, long AParamCount, long* AAxisArray, long* ADistrParamArray) is called. Parameter AVM_Dis is a refference to the object of VM class created while crtps_ function simulating, other parameters are actual function parameters.

Function:

align_

- define location of distributed array (alignment).

Simulation algorithm: an element with key ArrayHandlePtr is looked for in the array of distributed arrays. An element with key PatternRef is looked for either in the array of abstract machine representations or in the array of distributed arrays depending on the type of alignment template (AMView or DisArray). The method DArray::AlnDA(AMView*{DArray*} APattern, long* AAxisArray, long* ACoeffArray, long* AConstArray) is called for the object of Darray class which is from the first found record with key ArrayHandlePtr. The first parameter is reference to object of AMView or Darray type, depending on the type of alignment template corresponding to PatternRef key. Other parameters are actual function parameters. Besides, the type of alignment template is saved. Data transfer for distributed array alignment is ignored during the simulation.

Function:

redis_

redefine mapping of an abstract machine representation on multiprocessor system (resource distribution).

Simulation algorithm: an element with key AMViewRef is looked for in the array of abstract machine representations. The method AMView::RDisAM(long AParamCount, long* AAxisArray, long* ADistrParamArray, long ANewSign) is called for the object of AMView class which is from the found record of the array of abstract machine representations. Other parameters are actual function parameters.

Time simulation algorithm differs from the base one. First, time counters of all processors are set to the same value which is equal to maximal counter values at the moment of the beginning of redis_ execution. The time which has been added into the counter sums with Execution_Time and with Synchronization of the given processor. The time returned by AMView::RDisAM(…) method is added to Execution_time and Redistribution of each processor.

Function:

realn_

- redefine location of distributed array.

Simulation algorithm: an element with key ArrayHandlePtr is looked for in the array of distributed arrays. An element with key PatternRef is looked for either in the array of abstract machine representations or in the array of distributed arrays depending on the type of new alignment template (AMView or DisArray). The method DArray::RAlnDA(AMView*{DArray*} APattern, long* AAxisArray, long* ACoeffArray, long* AConstArray, long ANewSign) is called for the object of DArray class which is from the found record with key ArrayHandlePtr. The first parameter is reference to object of AMView or DArray type, depending on the type of template corresponding to PatternRef key. Other parameters are actual function parameters. The type of new alignment template is saved. If parameter NewSign is equal to 1, nested call of crtda_ function is looked for, and array key is substituted by the new value equal to return parameter ArrayHandlePtr of crtda_ function.

Time simulation algorithm differs from the base one. At first, time counters of all processors are set to the same value which is equal to maximal counter values at the moment of the beginning of realn_ execution. The time which has been added to the counter sums with Execution_Time and with Synchronization of the given processor. The time returned by DArray::RAlnDA(…) method is added to Execution_time and Redistribution of each processor.

3.7 Functions of collective operation initialization

In this group there are functions used for including reduction variable in reduction and including distributed array edge in edge group.

Function:

insred_

- include reduction variable in reduction group.

Simulation algorithm: an element with key RedRef and corresponding object of RedVar class are looked for in the array of reduction variables; an element with key RedGroupRef and corresponding object of RedGroup class are looked for in the array of reduction groups. The method RedGroup::AddRV(RedVar* ARedVar) ) is called for the object of RedGroup class, the found object of RedVar class is a parameter.

Function:

inssh_

- include distributed array edge in edge group.

Simulation algorithm: an element with key ShadowGroupRef and corresponding object of BoundGroup class are looked for in the array of edge groups; an element with key ArrayHandlePtr and corresponding object of DArray class are looked for in the array of distributed arrays. The method BoundGroup::AddBound(DArray* ADArray, long* ALeftBSizeArray, long* ARightBSizeArray, long ACornerSign) is called for the found object of BoundGroup class. The first parameter is refference to the found object of DArray class, parameter ALeftBSizeArray is an actual parameter LowShdWidthArray of inssh_ function, parameter ARightBSizeArray – actual parameter HiShdWidthArray, parameter ACornerSign – actual parameter FullShdSign.

3.8 Functions of collective operation execution

In this group there are functions used for collective operation execution.

Function:

arrcpy_

- copy distributed arrays.

Simulation algorithm: elements with keys FromArrayHandlePtr and ToArrayHandlePtr are looked for in the array of distributed arrays. The function ArrayCopy(DArray* AFromArray, long* AFromInitIndexArray, long* AFromLastIndexArray, long* AFromStepArray, DArray* AToArray, long* AToInitIndexArray, long* AToLastIndexArray, long* AToStepArray, long ACopyRegim) is executed. Parameters are references to objects of DArray class from found records. Other parameters are actual parameters FromInitIndexArray[], FromLastIndexArray[], FromStepArray[], ToInitIndexArray[], ToLastIndexArray[], ToStepArray[], CopyRegim of arrcpy_ function.

Time simulation algorithm differs from the base one. At first, time counters of all processors are set to the same value which is equal to maximal counter values at the moment of the beginning of arrcpy_ execution. The time which has been added to the counter sums with Execution_Time and with Synchronization of the given processor. The time returned by ArrayCpy(…) function is added to Execution_time and Remote_access of each processor.

Function:

strtrd_

- start of reduction group.

Simulation algorithm: an element with key RedGroupRef is looked for in the array of reduction group. The method RedGroup::StartR(DArray* APattern, long ALoopRank, long* AAxisArray) is called for the object of RedGroup class from the found element. Parameters APattern, ALoopRank and AAxisArray are from the structure which has been filled in during the last mappl_ call and corresponds to the parallel loop last mapped. An element with key RedGroupRef is created in the array of started reductions. The times of reduction beginning and reduction completion are saved in this element. The time of reduction beginning is equal to maximal value of processor counters at the moment of strtrd_ function call. The time of reduction completion is equal to sum of the time of reduction beginning and the time returned by RedGroup::StartR(…) method. If the pattern for mapping parallel loop is not distributed array but abstract machine representation an error is fixed and simulation is completed.

Time simulation algorithm differs from the base one. At first, time counters of all processors are set to the same value which is equal to maximal counter values at the moment of the beginning of strtrd_ execution. The time which has been added to the counter sums with Execution_Time and with Synchronization of the given processor.

Function:

strtsh_

- start of edge exchange of the given group.

Simulation algorithm: : an element with key ShadowGroupRef is looked for in the array of edge group. The method BoundGroup::StartB() is called for the object of BoundGroup class from the found element. An element with key ShadowGroupRef is created in the array of started edge exchanges. The times of exchange beginning and exchange completion are saved in this element. The time of exchange beginning is equal to maximal value of processor counters at the moment of strtsh_ function call. The time of exchange completion is equal to the sum of the time of exchange beginning and the time returned by BoundGroup::StartB() method.

Time simulation algorithm differs from the base one. At first, time counters of all processors are set to the same value which is equal to maximal counter values at the moment of the beginning of strtsh_ execution. The time which has been added to the counter sums with Execution_Time and with Synchronization of the given processor.

Function:

waitrd_

- wait reduction completion.

Simulation algorithm: an element with key RedGroupRef is looked for in the array of reductions already started. At the end of simulation this element is deleted.

Time simulation algorithm differs from the base one. For each processor the current value of time counter is compared with the time of reduction completion fixed during the simulation of strtrd_ function. The difference between above times is added to Reduction_overlap. If the value of processor time counter is less than the time of reduction completion, then the counter value is set to time of reduction completion. The time added to the counter sums to Execution_Time and Reduction_wait of the given processor.

Function:

waitsh_

- wait completion of edge exchange of the given group.

Simulation algorithm: an element with key ShadowGroupRef is looked for in the array of edge exchanges already started. At the end of simulation this element is deleted.

Time simulation algorithm differs from the base one. For each processor the current value of time counter is compared with the time of exchange completion fixed during the simulation of strtsh_ function. The difference between above times is added to Shadow_overlap. If the value of processor time counter is less than the time of exchange completion, then the counter value is set as time of exchange completion. The time added to the counter sums to Execution_Time and Wait_shadow of the given processor.

3.9 Parallel loop functions

In this group there are functions used for parallel loop initialization, distribution of its iterations between processors and the parallel loop execution.

Function:

crtpl_

- create parallel loop.

Simulation algorithm: parameter Rank – parallel loop dimension is written in the structure describing parallel loop.

Function:

mappl_

- mapping parallel loop.

Simulation algorithm: an object of ParLoop class is created (constructor ParLoop:: ParLoop (long ARank)). The method ParLoop::MapPL(AMView*{DArray*} APattern, long* AAxisArray, long* ACoeffArray, long* AconstArray, long* AInInitIndex, long* AInLastIndex, long* AInLoopStep) is simulated distribution of parallel loop iterations between processors. Then, for each processor, an object corresponding to the loop iteration block mapped on the given processor is created (constructor LoopBlock::LoopBlock(ParLoop *pl, long ProcLI)). Besides the key of the pattern map (PatternRef) and its type are saved in the structure describing the parallel loop.

Function:

dopl_

- define if it is necessary to continue parallel loop execution.

Simulation algorithm: Time simulation algorithm differs from the base one. The total number of loop iterations (N_iter) and the number of loop iterations on each processor (N_i) are calculated. The value T_func*(N_i/N_iter) is calculated for each processor and added into Execution_time and CPU_time of the given processor. Besides, the domain of processor calculations is compared with calculation domains of other processors, the number of processors executing the same loop iterations is calculated (N_proc), for each processor executing the given part of the loop the following value is calculated

(T_func/N_i)*((N_proc-1)/N_proc)

and added to Insufficient_parallelism_Par.

3.10 Unknown functions

Simulation algorithm: unknown functions are simulated according to the base algorithm. The warning about unknown function in the trace is output.

4 Estimation of remote data access overhead

4.1 Main terms and definitions

Before describing algorithms of estimation of remote data access overhead, let us consider the method of mapping calculations on distributed system. Diagram of array distributions on processor topology in DVM model is shown in Fig. 1.

Fig. 1. DVM- model of data distribution on processors.

Let us introduce a notion of Abstract machine representation (for short – Template or AMView) – a special array intended for providing two-stage arrays mapping and calculations mapping on processor system: first they (Darrays) are mapped on AMView, which then are mapped on processor system (VM). The first mapping is entirely determined by interrelation between data and calculations characteristic for algorithm and do not depend on architecture or configuration of distributed computational system. The second mapping serves for the parallel program tuning according to available system resources.

While aligning (realigning) a distributed array, an ordinary distributed array that has already been mapped on some AMView can act as AMView itself.

Information about aligning arrays on template can be received using DArray::AlnDA function call. Distribution of the template on the processor system is performed using AMView::DisAm.

4.2 Estimation of array redistribution exchanges

Let’s consider the algorithm of computing overhead charges of realigning the array over template (realign). This algorithm is implemented in Darray:RalnDA function, which returns amount of time spent in exchanges between processors during the given operation. Besides, the function changes the information of the given array, about the template it is mapped on and by which rule, according to function parameters. Correction of lists of aligned arrays is performed, for corresponding templates.

At the first stage of algorithm we check the value of the ANewSign input parameter. If the value is non-zero, contents of the realigned array will be updated, therefore it is not necessary to transfer useless elements, so function returns zero. Otherwise, algorithm continues working.

Saving the information about the array position before realigning, we get the information about it’s new position using DArray:AlnDA function.

Then, using CommCost::Update(DArray *, DArray *), array CommCost::transfer (filled with zeros at first) is changed, according to the information about array position obtained before. This array (CommCost::transfer) contains the information about number of bytes being transferred between every two processors.

And finally, the amount of time is computed based on data in transfer array, using function CommCost::GetCost.

Algorithm implemented in Update function:

Information about array distribution before and after performing the operation (which we call array redistribution) which changes the array position. At first we check if the array was fully replicated (using the template) before the redistribution (this property is implemented in Darray::AlnDA function). Let it be so, otherwise go to stage 2. If the set of processors (M1) on which template elements were situated before the redistribution, contains the set of processors (M2), on which template elements are located after the redistribution, there’s no exchange, hence transfer array is unchanged and algorithm comes to an end. Otherwise, we look for the nearest processor (P1) to every processor (P2) that belongs to M1/M2. It (P1) is situated on the edge of rectangular section made up by processors that belong to M1. Then we perform the described actions for every such pair of processors. We find the array section that belongs to P2, using constructor Block::Block(DArray *, long). Array section that belonged to P1 before the operation, equals this array, as it was replicated according to the template. Taking intersection of these two sections and using operator ^ (Block &, Block &)^, we get a section we intend to transfer from P1 to P2. After finding out amount of bytes in it with Block::GetBlockSize, we add it to the corresponding element of transfer array. Algorithm comes to an end. It is needed to find out all processor system dimensions on which the array has been distributed or partly distributed before redistribution. If there are no such dimensions go to stage 3. For the processor system dimensions on which the array has been partly distributed, we find out sets of index values for which there were array elements on corresponding processors before redistribution. We go round all the processors (current is P1) which indices, in the dimensions the array is distributed on, are equal to zero and which indexes, in the dimensions the array is partly distributed on, are equal the minimum of the received corresponding sets.

For these processors we find out the array section which have been distributed on every processor before redistribution. For all the processors (current is P2) we find out the array sections which are on every processor after redistribution. We take the intersection of the above sections for the current processors P1 and P2. For the processor P2, we find out the nearest processor with array section coinciding with the section on processor P1. Indexes of this processor coincide with indexes of P1 in all dimensions of processor system, except dimensions by which the array was muliplicated or is replicated. The indexes taken are the nearest indexes to P2 indexes in these dimensions (defined using the obtained sets). If id of some processor isn’t equal to the id of P2, number of bytes in intersection is added to the corresponding element of the transfer array. After processing all P1, P2 pairs, algorithm stops.

For every processor, array sections that belonged to it before and after the redistribution are determined. For every pair of non-coinciding processors, we find their intersection. Number of bytes in the intersection is added to the element of the transfer array, corresponding to the given processor pair. After processing all processor pairs, the algorithm stops.

For finding the time spared by processors, in GetCost function, one of the two algorithms is used. The choice of algorithm depends on distributed processor system type. In case of the net with bus organization, the sought time is computed by this formula:

where N is number of the processors, Ts – start time of the exchange operation, TB – time to send one byte. This formula is a consequence of the fact that several messages can’t be simultaneously sent over the net and, consequently, total time equals to the sum of all exchange times between any two non-coinciding processors.

Total time for the transputer grid generally depends on the shortest way between the two processors most distant from each other, if exchanges between them exist. Also, while evaluating this time it is necessary to consider the possibility of message pipelining. As follows from this remarks, we can make up an algorithm and a formula to determine overhead charges:

While passing the transfer array, for every non-zero element, we determine the distance between corresponding processors using VM:GetDistance(long, long). We also determine the current maximum distance and the current maximum byte number sent over this distance. Going on this way, we pass all the transfer array. As a result we have: l – the distance between the two processors most distant from each other and LB – length of the biggest message sent over distance l. If l = 0, the result is zero and the algorithm stops. Otherwise, move on to the next stage.
If l = 1, the sought time is calculated using formula:

If l > 1, pipelining of the message is possible. So, at first we examine the following function for extremes, which describes the dependence of LB from S – size of the message sent during one pipeline phase:

We get that minimum is reached when:

Considering that the message size and the number of phases are integers, we come to this expression:

where S’ = [S] (S’ = 1, if [S] = 0); c Î {0, 1} – an evidence that S’ doesn’t divide LB. In order to find the exact value we find the time by the same formula, but in the integer points near to S’. If they are bigger, value in the point S’ is the sought value. Otherwise we perform the decreasing search, until value on the stage k is bigger than value on k-1. Then, the sought time is the value on stage k-1. Algorithm stops.

As a conclusion, we’ll consider the algorithm of finding overhead charges while redistributing the template. These charges arise because all the distributed arrays aligned by the template using DArray::AlnDA (both directly and indirectly), will be mapped on the template again, after it changes it’s position in the processor system. Consequently, they will change their position, and it will lead to the exchanges. The given algorithm is implemented in AMView::RdisAM, which returns the time spent in the interprocessor exchanges while performing this operation. Also, the function replaces the information for the given template, about the rule by which it was mapped on the processor system (received by the AMView::DisAM function), with information about it’s new position according to the new parameters, given in the function call.

In the first stage of the algorithm we check the value of the ANewSign input parameter. If the value is non-zero, contents of all the distributed arrays aligned with the given template will be updated, therefore it is not necessary to transfer useless elements, so function returns zero. Otherwise, algorithm continues working.

Saving the information about the template position before redistribution, we get the information about it’s new position after the given operation using the AMView::DisAM function. It is necessary in order to know how the arrays aligned by this template are positioned before and after the redistribution.

Then, for every array aligned by the given template, array CommCost::transfer (filled with zeros at first) is changed, using CommCost::Update(DArray *, DArray *), according to the information about the array position before and after the redistribution received before.

And finally, the sought amount of time is computed based on data in the transfer array, using the function CommCost::GetCost.

4.3 Estimation of distributed array edge exchanges

Let’s consider the algorithm of computing the time needed for the exchange of the given group bounds. It consists of two parts.

Inclusion of all needed bounds of distributed arrays in the given group is performed using BoundGroup::AddBound function for every such array. At the same time the transfer array is changed accordingly.
The sought time is determined by the received transfer array, using the CommCost::GetCost function (it occurs after the BoundGroup::StartB function call).

Algorithm implemented in CommCost::GetCost was described in the previous paragraph, so we’ll consider algorithm of the BoundGroup::AddBound function:

If the array is fully replicated, function exits without altering the transfer array (filled with zeros when the group is created). Otherwise the algorithm continues. Then, the possibility to include the bound of the given array into the group with given function parameters is checked. If it is possible, determine using function DArray::GetMapDim : by what dimensions will the real bounds exchange be conducted (the dimensions by which array was not replicated), corresponding array dimensions and directions of their break-down. This information is put into dimInfo array (its elements are the objects of DimBound class). Then we determine the criteria of inclusion in “corner” elements bounds: it is present, if the parameter of the given ACornerSign function equals 1 and number of processor system dimensions by which the exchange will be conducted is bigger than 1. The CommCost::BoundUpdate function is called with information already determined before as parameters.

In the BoundUpdate function these actions are performed for every processor. Array section on the given processor is determined. If there are no array elements on this processor, move to the next one. Otherwise, we determine the needed transfers for right and left array section bounds, positioned on processors adjacent by the dimensions that are contained in the dimInfo array. In order to do it, we pass the dimInfo array and determine, using Block::IsLeft (IsRight), if they are elements to the right (left) of the given elements according to the current position of the distributed array, for the given section. If there are such elements, and the right (left) bound in the given dimension is determined by the addBound function call, we determine the size of transferred boundary and, using the VM::GetSpecLI function, find the number of the adjacent processor, on which the given bound is transferred (using the information from the dimInfo array corresponding to the given array dimension). Size of the transferred bound is defined like size of the current section , in which number of the elements in the current array dimension is equal to the width of the right (left) boundary in this dimension, in bytes. This value is added to the corresponding element of the transfer matrix. If finished with the left and right boundaries, we move on to the “corner” elements, if their criterion of inclusion in the array boundaries is present (let’s consider only two cases, when the “corner” elements are in the intersection of real borders by two dimensions, so there are two elements in dimInfo array). Similarly to what has been described, but by all the dimensions of the array dimensions in dimInfo, we determine if there exist (for the given array section) elements that are adjacent by the dimensions given for the corner elements of the section by directions of all the diagonals in this set of dimensions (using IsLeft and IsRight functions). If there are such elements and the corresponding bounds by the given dimensions are given, we determine the size of transferred “corner” elements section and, using the VM::GetSpecLI function (which is used number of times equal to the number of the dimensions in the set), find the number of the adjacent processor, on which the given boundary is transferred (using information corresponding to the given array dimensions from the dimInfo array). Size of the section that is transferred is defined like size of the current section, in which number of the elements in every array dimension from this set is equal to the width of the corresponding boundary in this dimension, which takes part in the current “corner” element section, in bytes. This value is added to the corresponding element of the transfer matrix. Algorithm stops.

4.4 Estimation of reduction operation exchanges

The algorithm of determining time spent in exchanges during the reduction operations is implemented in the RedGroup::StartR function. Before the description of algorithm, we’ll show what preparatory work is performed.

When another reduction variable is added to the reduction group, counter of the number of sent bytes (TotalSize) is increased by the size of the reduction variable and auxiliary information (given for some of the reduction operations) in bytes using the RedGroup:AddRV function. Information about the reduction variable type isn’t taken into consideration when evaluating the exchange time.

Description of StartR function:

In input we have the information about which array the parallel loop, in which reduction operations from the given group are performed, is mapped on; number of dimensions of the parallel loop; array that contains the information about how loop dimensions are mapped on the array dimensions. If the array on which the parallel loop is mapped is fully replicated, function exits returning zero time. Otherwise, using the DArray::GetMapDim function, we determine the processor system dimension for every loop dimension. The loop is mapped on this dimension. If the loop dimension is replicated over all the processor system dimensions, we put either the corresponding dimension number or zero in the loopAlign array. Then, using the obtained information, we determine the sought time that depends on the distributed system type.

If we have the net with bus organization, the main idea is to send the information by the dimensions the loop is replicated by while collecting the information. Hence we can collect it only by one section made up of processors with fixed indexes by the given dimensions. Then we broadcast the result to all the processors. This leads us to the following formula:
time = (TS + TB TotalSize) × (Ni1 × ... × Nik + N – 2) ,

where exchange operation, TB – time to send one byte, Nik – N is the number of processors, Ts – start time of the number of the processors by the processor system dimension with number loopAlign[ik], on which the loop iterations are present. Function returns the value calculated.
If it is the transputer grid, we can collect the information by all of the sections described in point 2 in parallel. Then we broadcast the result of the reduction operation to the same sections. Time to collect the values and to broadcast the result depends on the distance from the geometrical center of such section to the most distant corner. Therefore, at first we determine this value (called Distance) using the loopAlign array and information about the size of the corresponding processor system dimensions. Then the sought time is calculated using the following formula:

time = (TS + TB × TotalSize) × (2 × Distance + ConerDistance),

where ConerDistance is distance from the given sections to the most distant corner processor. Return the resulting value.

4.5 Estimation of exchanges during remote array buffer loading

If aligning the arrays doesn’t get us rid of the remote data and no shadow edges can be used to access them, their bufferization is performed through the separate buffer array on every processor. Estimated time spent during the buffer loading is calculated in the ArrayCopy function. The interface of this function allows further expansion of the output language, therefore some parameters aren’t used at this stage. Exactly, exchanges during the replication of the distributed array (array that is read) over all the processors in the processor system are evaluated in this function. Below there is a description of algorithm implemented in ArrayCopy function:

We check that the input parameters that determine the size of the array section don’t exceed the array bounds using the DArray::CheckIndex function. If the result is positive, the algorithm continues, otherwise an error message is generated and the algorithm stops.
transfer array (filled with zeros when the function is called) is altered using function CommCost::CopyUpdate, according to the parameters that determine the section to replicate and to the information about the distribution of the given array.
The sought time is determined using the CommCost::GetCost function (3.2) and data from the transfer array.

In the CopyUpdate function, the following actions are performed:

Check if the criterion of full replication by template (implemented in DArray::AlnDA function) is present for the given array. Otherwise move to stage 2. If the template elements are present on all the processors of the system (determined using the information about how template is mapped on the processor system), there are no transfers and transfer array is unchanged, therefore function exits. Otherwise, for every processor that doesn’t contain template elements, we find the processor nearest to it that does. Size (in bytes) of the given section of array that is read (previously determined by the Block::GetBlockSize function) is put in the corresponding element of the transfer array, for every such processor pair.
All processor system dimensions by which the array is replicated (or partly replicated) are determined. If there are no such dimensions, move on to stage 3. In order to measure the dimensions by which the array is partly replicated, we determine sets of index values. This values are determined so that if indexes have these values, corresponding processors have the array elements on them. We pass all the processors (let the current one be P1) with zero indexes in the dimensions that array is replicated by; and indexes that are equal to minimum of the corresponding sets, in the dimensions that array is partly replicated by. For every such processor we find the intersection (let the current be C1) of the array section on it with the replicated section given in the function call. We determine the number of bytes in the given intersection C1. For every processor (let the current one be P2) we determine the array sections (let the current one be C2) on it. If the intersection of C1 and C2 is not empty, move on to the next processor pair. Otherwise determine the processor nearest to P2 on which the array section coincides with the P1 section. If it’s number isn’t equal to the number of P2, add the number of bytes in intersection determined before to the corresponding element of the transfer array. After handling all the (P1, P2) processor pairs the algorithm stops.
Determine the section that is the intersection of the array section on the processor with the section to be replicated, for every processor. If its empty, move on to the next processor. Otherwise determine the size (in bytes) of the given section. Also, determine the array section on the processor, for every processor. Find the intersection of such sections for every pair of non-coinciding processors. If it isn’t empty, move on to the next processor pair. Otherwise, the number of bytes received before is added to the element of the transfer array corresponding to the given processor pair. After handling all such processor pairs, the algorithm stops.

Appendix 1. Link from field name in output HTML-file to field name in structure _IntervalResult ==>