Lib-DVM interface description (contents) | Part 1 (1-5) |
Part 2 (6-7) |
Part 3 (8-11) |
Part 4 (12-13) |
Part 5 (14-15) |
Part 6 (16-18) |
Part 7 (19) |
created: february, 2001 | - last edited 03.05.01 - |
long begbl_(void);
The function begbl_ defines the beginning of the localization block for the following system objects:
All these objects, being created in the localization block, are local ones. This means that these objects will be automatically deleted, when the control exits from this block. The exception is the static objects, which have to be deleted explicitly. The localization blocks can be nested.
Note, when abstract machine representation is deleted, all distributed arrays, mapped on the representation, are also deleted. By analogy with it, all the abstract machines, their representations and the distributed arrays, mapped onto the deleted processor system, are also deleted.
The function returns zero.
long endbl_(void);*
The function endbl_ marks the end of the localization block. The function returns zero.
LoopRef crtpl_ (long *RankPtr);
*RankPtr - rank of the parallel loop.
The function crtpl_ creates the parallel loop (and so, defines its beginning) and returns reference to the created object. Parallel loop is terminated by endpl_ function (see section 9.5).
Other parallel loop can be created and terminated inside any parallel loop, i.e. parallel loops can be nested.
long mappl_ ( | LoopRef PatternRef long long long AddrType long long long long long long long |
*LoopRefPtr, *PatternRefPtr, AxisArray[], CoeffArray[], ConstArray[], LoopVarAddrArray[], LoopVarTypeArray[], InInitIndexArray[], InLastIndexArray[], InStepArray[], OutInitIndexArray[], OutLastIndexArray[], OutStepArray[] ); |
||
*LoopRefPtr | - | reference to the parallel loop. | ||
*PatternRefPtr | - | reference to the pattern of the parallel loop mapping. | ||
AxisArray | - | array, which j-th element is a dimension number of the parallel loop (that is the number of the index variable) used in linear alignment rule for the pattern (j+1)-th dimension. | ||
CoeffArray | - | array, which j-th element is a coefficient for the parallel loop index variable used in linear alignment rule for the pattern (j+1)-th dimension. | ||
ConstArray | - | array, which j-th element is a constant used in the linear alignment rule for the pattern (j+1)-th dimension. | ||
LoopVarAddrArray | - | array, which i-th element is an address (cast to AddrType type) of the index variable of the parallel loop (i+1)-th dimension. | ||
LoopVarTypeArray | - | array, which i-th element is the type of index variable of the parallel loop (i+1)-th dimension. | ||
InInitIndexArray | - | input array, which i-th element is an initial value for the index variable of the parallel loop (i+1)-th dimension. | ||
InLastIndexArray | - | input array, which i-th element is a last value for the index variable of the parallel loop (i+1)-th dimension. | ||
InStepArray | - | input array, which i-th element is a step value for the index variable of the parallel loop (i+1)-th dimension. |
OutInitIndexArray | - | output array, which i-th element is calculated initial value for the index variable of the parallel loop (i+1)-th dimension. | ||
OutLastIndexArray | - | output array, which i-th element is calculated last value for the index variable of the parallel loop (i+1)-th dimension. | ||
OutStepArray | - | output array, which i-th element is calculated step value for the index variable of the parallel loop (i+1)-th dimension. |
The function mappl_ creates regular mapping of the parallel loop onto the abstract machine representation, according to the defined mapping rules and the descriptions of the loop dimensions. The function distributes iterations of the loop among child abstract machines from this representation. The pattern defines the abstract machine representation. This pattern can be either the representation by itself, or the distributed array ("indirect mapping"). In the last case, the reference to the pattern is the first word of the distributed array header.
Using distributed array as a pattern is allowed only if this array has been mapped on a abstract machine representation; and using an abstract machine representation as a pattern is allowed only if this representation has been mapped on a processor system. So the function mappl_ only distributes iterations of the parallel loop among processors from this processor system.
Parallel loop can't be remapped.
Each description of the parallel loop dimension consists of the initial, last and step values of index. The function mappl_ reevaluates these values for the current virtual processor, that is, the function selects the corresponding subset of iterations. The function writes these new values to the arrays OutInitIndex, OutLastIndexArray and OutStepArray. The number of the descriptions of the parallel loop dimensions is equal to the rank of the loop.
Initial value of index can be as less as more than it's last value. In the first case the step is positive, and in the second case the step is negative.
The array LoopVarAddrArray (in the Fortran language) can be formed using the function dvmadr_ (see section 17.7).
The types of parallel loop index variables, specified in LoopVarTypeArray, may be the following:
0 – long variable;
1 – int variable;
2 – short variable;
3 – char variable.
If LoopVarTypeArray = NULL, index variable types for all dimensions of the loop will be equal to 1 (int).
Consider the description of the parallel loop mapping in detail. Let F be a multifunction, which domain of definition is a space of indexes of the mapped loop and which image is the space of indexes of the pattern:
F( (I_{1}, ... , I_{i}, ... , I_{n}) ) = | F_{1}(I_{1},
... , I_{i}, ... , I_{n})
´ . . . . . . . . . . . .. . . F_{j}(I_{1}, ... , I_{i}, ... , I_{n}) ´ . . . . . . . . . . . .. . . F_{m}(I_{1}, ... , I_{i}, ... , I_{n}) , |
where:
´ | - | symbol of Cartesian product; |
n | - | rank of the parallel loop; |
m | - | rank of the pattern; |
I_{i} | - | index variable of the i-th dimension of the parallel; |
F_{j} | - | multifunction, which image is a range of index variable of the pattern j-th dimension. |
An alignment of the parallel loop by the pattern means for Run-Time System (by the function F specification); that an iteration (i_{1}, ... , i_{n}) of the parallel loop has to be executed only on the processors on which an least one pattern element from F((i_{1}, ... , i_{n})) set is mapped (i.e. set F((i_{1}, ... , i_{n})) is a image-set, and elements of this set are vectors of the space of indexes of the pattern).
The functions F_{1}, ... , F_{m} are named coordinate mapping rules. Run-Time System provides the following set of alignment rules:
1. | F_{j}(I_{1}, ... , I_{n}) = { A_{j}*I_{k }+ B_{j}_{ }} , where: | ||
k=f(j) | - | dimension number of the parallel loop (1 £ k £ n, f(j_{1}) ¹ f(j_{2}) when j_{1 }¹ j_{2}); | |
A_{j}, B_{j} | - | integers. |
This mapping rule means that for each element (i_{1}, … , i_{n}) of index space of the parallel loop a corresponding set contains one element – A_{j}*I_{k }+ B_{j}, belonging to value range of the index variable of the pattern j-th dimension.
Note, that A_{j} and B_{j} have to meet the following requirements:
0 £ A_{j}*MAX_{k} + B_{j}_{ }_{£}_{ }MAX_{j} and 0 £ A_{j}*MIN_{k} + B_{j} £ MAX_{j} , where:
MAX_{j}_{ } | - | maximum of the index variable of the pattern j-th dimension; |
MAX_{k} | - | maximum of the index variable of the parallel loop k-th dimension; |
MIN_{k} | - | minimum of the index variable of the parallel loop k-th dimension. |
2. F_{j}(I_{1}, ... , I_{n}) = { q Î M_{j}: 0 £ q £ MAX_{j}_{ }} , where:
M_{j} | - | range of values of the index value of the pattern j-th dimension; |
MAX_{j} | - | maximum of the index variable of the pattern j-th dimension. |
This mapping rule means that for each element (i_{1}, … , i_{n}) of index space of the mapping loop a corresponding set consists of whole range of values of the index variable of the pattern j-th dimension. In such a case, the symbol "*" ("any of the admissible") is usually used.
As a result of given parallel loop mapping by mappl_ function on specified (directly on indirectly) abstract machine representation each loop iteration will be matched with abstract machine set, which is the iteration image for F mapping rule, considered above. On entry in each iteration all corresponding abstract machines become current ones (each on its own processor). So generally (because of F function multivalue) the execution of the parallel branch, represented by the loop iteration, is performed by several parallel subtasks (see section 10). As processor systems of these subtasks can be different, different ways of further branching (in the parallel loop iteration) are possible. To avoid such replication of parallel loop iterations on different subtasks, it is recommended to avoid without necessity loop iteration replication on dimensions of parental abstract machine representation, i.e. to not apply second rule from coordinate mapping rules, considered above. For this purpose for example, subsidiary representation of parental abstract machine can be created. This representation is of lesser rank and the parallel loop is mapped on the representation without iteration replication.
Note, that although, conceptually, on entry in parallel branch the current abstract machine must be replaced by corresponding descendant abstract machine, actually, (for overheads decreasing) the current abstract machine replacing is performed not when loop iteration is entered, but only if it is necessary (i.e. when those Run-Time System functions are called, which require existence of the current abstract machine as program object, for example, when the reference to the current abstract machine is requested, input/output in parallel loop iteration is performed and so on).
Examples.
Defining loop mapping onto pattern space (that is defining F_{1}, ... , F_{m} functions) has to meet requirement that all iterations of the loop have to be a part of the pattern space. Observance of both: the correct mapping of the loop and the correct mapping of the pattern guarantees the correct final distribution of the loop iterations over the processors. Note, that if the pattern is not an abstract machine representation, then mapping the loop onto the abstract machine representation is a superposition of mapping the loop onto the pattern space and the alignment of the pattern with the abstract machine (see section 7.2).
When the function mappl_ is called the parameters of the mapping rule F_{j}(I_{1}, ... , I_{n})= {A_{j}*I_{k }+ B_{j}} for the j-th pattern dimension have to be defined as follows:
AxisArray[j-1] contains value k;
CoeffArray[j-1] contains value A_{j};
ConstArray[j-1] contains value B_{j}.
To specify the rule F_{j}(I_{1}, ... , I_{n}) with the image in a set of all values of index variable of pattern j-th dimension for any I_{1}, ... , I_{n}, the value AxisArray[j-1] (k value) has to be equal to -1. The values CoeffArray[j-1] and ConstArray[j-1] (A_{j} and B_{j}) are irrelevant in this case.
The number of the mapping rules has to be equal to the rank of the pattern, when the function mappl_ is called.
The function returns non-zero value, if mapped parallel loop has a local part on the current processor, and zero otherwise.
9.3 Reordering parallel loop execution
long exfrst_ ( | LoopRef ShadowGroupRef |
*LoopRefPtr, *ShadowGroupRefPtr ); |
*LoopRefPtr | - | reference to the parallel loop. | ||
*ShadowGroupRefPtr | - | reference to the shadow edge group, which will be renewed after computing the exported elements ærom the local parts of the distributed arrays. |
The function exfrst_ sets the following order of the parallel loop iterations. First, exported elements (elements-originals, that are the images of elements of shadow edges of) of the distributed array local parts) have been computed. Then the shadow edge group renewing (shadow elements update) has been started. At last the internal elements of the local parts of the distributed arrays have been computed (internal elements are local part of distributed array without exported elements). After computing of the internal points, the automatic waiting for the completion of the shadow edge renewing is not performed.
The iteration execution order described above is implemented by partitioning iterations by 2*n+1 parts (portions) (n is the loop rank) on each processor. Information about each iteration portion is requested by dopl_ function (see section 9.4) and during its execution is put in OutInitIndexArray, OutLastIndexArray and OutStepArray arrays that are mappl_ function parameters. Invoking exfrst_ function causes following order of iteration portion requesting by dopl_ function: first 2*n requests put in OutInitIndexArray, OutLastIndexArray and OutStepArray arrays the information about the loop iterations, corresponding to exported elements of distributed arrays; last positive request corresponds to internal elements of distributed arrays.
The function exfrst_ call must precede to parallel loop mapping by mappl_ function.
The shadow edge group, specified by *ShadowGroupRefPtr reference, must be created in the current subtask.
When partitioning parallel loop execution by parts using exfrst_ function, result shadow edge widths of distributed arrays, necessary for iteration portion creation, are calculated in the following way (the calculation is performed by parallel loop mapping function mappl_). Let ShadowArrayGroup be a set of distributed arrays with shadow edges, included in the group, specified by *ShadowGroupRefPtr reference. Let also AMView be abstract machine representation, the parallel loop is (directly or indirectly) mapped on, and PLAxis be the loop dimension, mapped on AMVAxis dimension of the representation. Then the result low (high) shadow edge width for PLAxis loop dimension is equal to maximal value of low (high) shadow edge widths of those dimensions of the arrays from ShadowArrayGroup set, which are aligned with AMVAxis dimension of the representations, equivalent to AMView representation.
If Run-Time System didn't find any distributed array from ShadowArrayGroup set which dimension is aligned with AMVAxis dimension of the representation, equivalent to AMView representation, the result low and high shadow edge widths of PLAxis dimension will be set to zero. Low and high edge widths of parallel loop dimensions, not mapped on any dimension of AMView representation, will be also set to zero.
For more detail description of the shadow edges of the distributed arrays see section 12.
The function returns zero.
Note. The representations of the same abstract machine are equivalent, if they:
Processor subsystems of the same processor system are equivalent, if they:
long imlast_ ( | LoopRef ShadowGroupRef |
*LoopRefPtr, *ShadowGroupRefPtr ); |
*LoopRefPtr | - | reference to the parallel loop. | ||
*ShadowGroupRefPtr | - | reference to the shadow edge group, which renewing completion the Run-Time System awaits after the computation of the internal points of the local parts of the distributed arrays. |
The function imlast_ sets the following order of the parallel loop iterations. First the internal points of the local parts of the distributed arrays have been computed Then Run-Time System awaits the completion of the shadow edge renewing of the specified group. At last the exported elements (elements-originals) of the local parts of the distributed arrays have been computed. After computing of the internal points, the automatic starting the shadow edge renewing is not performed.
As for exfrst_ function required order of iteration execution is implemented by partitioning iterations by 2*n+1 parts (portions) (n is the loop rank) on each procåssor. Information about each iteration portion is requested by dopl_ function (see section 9.4) and during its execution is put in OutInitIndexArray, OutLastIndexArray and OutStepArray arrays that are mappl_ function parameters. Invoking exfrst_ function causes following order of iteration portion requesting by dopl_ function: first request puts in OutInitIndexArray, OutLastIndexArray and OutStepArray arrays the information about loop iterations, corresponding to internal elements of distributed arrays; next 2*n requests correspond to exported elements of distributed arrays.
The function imlast_ must precede parallel loop mapping by mappl_ function.
The shadow edge group, specified by *ShadowGroupRefPtr reference, must be created in the current subtask.
Calculation of result shadow edge widths of distributed arrays, required for iteration portion creation is equivalent to their calculation, described above in the exfrst_ function description.
The function returns zero.
9.4 Inquiry of continuation of parallel loop execution
long dopl_ (LoopRef *LoopRefPtr);
*LoopRefPtr - reference to the parallel loop.
The function dopl_ allows to determine the completion of the execution of all the parallel loop parts, on which the loop has been divided by the functions exfrst_ or imlast_. At each next function dopl_ execution, the Run-Time System corrects the loop parameters in the arrays OutInitIndexArray, OutLastIndexArray and OutStepArray, defined as output arrays in the function mappl_.
The function returns the following values:
0 | - | the execution of all parts of the parallel loop is completed; |
1 | - | the execution of the parallel loop has to be continued (the information about next iteration portion is put in OutInitIndexArray, OutLastIndexArray and OutStepArray arrays). |
On changed order of iteration execution dopl_ function sequentially returns 2*n+1 positive values (see section 9.3). If the loop execution is not partitioned by exfrst_ and imlast_ functions, dopl_ function returns non-zero value only one time. At that the information, put to OutInitIndexArray, OutLastIndexArray and OutStepArray arrays by mappl_ function and corresponding to the local part of the loop, is not changed.
Note, that splitting loop onto parts can be also associated with unregular mapping of pattern space onto the processor system. In this case the correction of the current loop parameters when control exits the next regular interval and requesting completion of all intervals are performed by the function dopl_ also.
Invoking dopl_ function is allowed for mapped parallel loop only.
long endpl_ (LoopRef *LoopRefPtr);
*LoopRefPtr - reference to the parallel loop.
The function endpl_ completes the parallel loop execuôion and forces merging the parallel branches to the parental one. The Run-Time System automatically deletes all objects created inside the parallel loop. The exception is static objects. (that is, the parallel loop is a program block beginning with the function crtpl_ call, see section 8).
Run-Time System allows to terminate a program by lexit_ function (see section 2) during any parallel loop execution without it previous termination by endpl_ function.
When the control exits from the parallel loop, the reference to the loop is undefined and so it cannot be used at any Run-Time System call.
The function returns zero.
9.6 Specifying information about data dependence between parallel loop iterations
Partitioning parallel loop execution into parts (required for asynchronous renewing of distributed array shadow edges, dynamic redistribution of computations over processors and so on) causes changing of the loop iteration execution order that can result in wrong computations if data dependencies exist. For correct partitioning of the loop iterations into portions Run-Time System is needed in information about all points of the form (I_{1}+D_{1}, … , I_{n}+D_{n}), in which the values of computed variable are required to calculate its value in (I_{1}, ... , I_{n}) point (n is the loop rank, all D_{i} are integers). It is assumed that at least two values of D_{i} are non-zero (Run-Time System doesn’t use the information about points, in which a number of non-zero D_{i} is less or equal to one).
The information about every point of the form, described above, is passed to Run-Time System by separate call of the function
long pldpnd_( | LoopRef long |
*LoopRefPtr, DependCodeArray ); |
*LoopRefPtr | - | reference to parallel loop | ||
DependCodeArray | - | array, which i-th element is the code of data dependence between the loop iterations for its (i+1)-th dimension. |
Data dependence codes, specified in i-th element of DependCodeArray array, can be:
0 | - | there is no data dependence (D_{i+1} is zero); |
1 | - | data anti-dependence (D_{i+1} is positive); |
2 | - | data dependence (D_{i+1} is negative). |
To inform Run-Time System about several points pldpnd_ function is called several times. All the function calls must precede parallel loop mapping by mappl_ function (see section 9.2).
Absence of pldpnd_ function calls before the loop mapping means, that there are no data dependences, restricting the loop execution by parts, between the loop iterations.
The function returns zero.
Note. If the loop execution for some dimension is partitioned into parts, this dimension is named split. When selecting split dimensions Run-Time System treats dimensions with lager numbers as more priority ones (their index variables are changed more quickly) and in according to criterion: right side of assignment statement can't contain a point, whose set of split dimensions contains two or more dimensions with multi-directional data dependences (for dimensions with the same signs of index variable steps) or with uni-directional data dependences (for dimensions with different signs of index variable step).
9.7 Measuring execution time of parallel loop iterations
The execution time of different iterations of parallel loop can differ. So for balance loading of processors non-equal distribution of parallel loop iterations over the processors is required. Non-equality of distributions is achieved by specifying of different computational weights of processor coordinates by setpsw_, genbli_, genbld_, genblk_ and setelw_ functions (see sections 4.5 and 4.6).
To define proportions of non-equal disrtibutions in program Run-Time system provides possibility to measure execution time of parallel loop iterations. The results of measuring can be used as parameters of setelw_ function (see section 4.6) when redistributing abstract machine representations and distributed arrays (with the following redistribution of parallel loop iterations).
Measuring execution time of parallel loop iterations is necessary also to predict parallel program performance if it is supposed to run programs on the bigger number of processors.
Process of measuring is implemented in the following way. User program gives to Run-Time system a request to measure time of parallel loop iterations by means of one or several setgrn_ function calls (see section 9.7.1). In every setgrn_ function call the following parameters are specified:
Executing setgrn_ function, Run-Time system creates an array of double type; its size is equal to specified number of the groups (array of loading coordinate weights). All elements of the array are cleared by zero.
Later on the iterations of all parallel loops, mapped on abstract machine representation, specified in setgrn_ function, are split into portions: every dimension of the loop is split into the same number of parts (groups) as the abstract machine representation dimension the loop dimension was mapped by setgrn_ function is done. So, several (in common case) arrays of loading coordinate weights correspond to every parallel loop, and several elements (loading weights) of the arrays correspond to each iteration portion. Run-Time system measures execution time of each portion and adds it to every element correspondent to the portion. If several elements of the array of loading coordinate weights correspond to one portion, its execution time is divided between them.
The results of measuring (arrays of loading coordinate weights) can be requested by gettar_ and getwar_ functions (see section 9.7.2).
Note, that splitting parallel loop iterations by portions changes their execution order. So, if there are date dependencies between iterations, specifying of two or more dimensions of the same representation of the abstract machine by setgrn_ function can cause wrong computations (see section 9.6).
9.7.1 Issuing time measuring request
long setgrn_( | AMViewRef long long |
*AMViewRefPtr, *AxisPtr, *GroupNumberPtr ); |
*AMViewRefPtr | - | reference to abstract machine representation, determining parallel loops, for which execution time of iterations should be measured. | ||
*AxisPtr | - | dimension number of given abstract machine representation; the number indirectly specifies dimension numbers of parallel loops. | ||
*GroupNumberPtr | - | the number of parts (groups), *AxisPtr-th dimension of given abstract machine representation (and all corresponding dimensions of parallel loops) should be split into. |
The setgrn_ function issues a request to Run-Time system to measure execution time of iteration portions of all parallel loops, that have a dimension, mapped onto *AxisPtr-th dimension of given abstract machine representation. Measured times will be accumulated in the array of loading coordinate weights (internal array of Run-Time system).
Repeated call of the function with the same parameters *AMViewRefPtr and *AxisPtr as in previous call is allowed.
If value of *GroupNumberPtr is more than size of *AxisPtr-th dimension, the number of groups will be equal to this size.
The request will be ignoried, if *AxisPtr-th dimension is not block-distributed.
The function returns zero.
9.7.2 Requesting results of measuring (requesting array of loading coordinate weights)
long getwar_( | double AMViewRef long |
WeightArray[], *AMViewRefPtr, *AxisPtr ); |
WeightArray | - | result array to write array of loading coordinate weights. | ||
*AMViewRefPtr | - | reference to abstract machine representation. | ||
*AxisPtr | - | dimension number of abstract machine representation. |
The function getwar_ writes to the array WeightArray, specified by user program, the array of loading coordinate weights, obtained as result of measuring, and returns its size. If for *AxisPtr-th dimension of given abstract machine representation the request to measure is not issued, the function returns zero (the array WeightArray is not filled).
The function cancels the request to measure for *AxisPtr-th dimension of given abstract machine representation and deletes the array of loading coordinate weights. In the case of repeated requesting of measuring results the function returns zero and the array WeightArray is not filled.
Note. The function getwar_ just calls sequentially the functions
long gettar_( | double AMViewRef long |
WeightArray[], *AMViewRefPtr, *AxisPtr ); |
and
long rsttar_( | AMViewRef long |
*AMViewRefPtr, *AxisPtr ); |
The gettar_ function is similar to considered above getwar_ function, but doesn't cancel request to measure for *AxisPtr-th dimension of given abstract machine representation and doesn't delete the array of loading coordinate weights. In the case of repeated requesting of measuring results the same size of the array of loading coordinate weights is returned and the array WeightArray is filled by the same information, as in previous request.
Note, that result array of loading coordinate weights is created at all the processors during the first request; and it causes interprocessor exchanges: central processor receives local parts of the array from all other processors, joints them, and then sends result array to the processors. Therefore repeated requesting of mesuaring results is performed more fastly, then the first request.
The function cancels the request to measure for *AxisPtr-th dimension of given abstract machine representation, specified by *AMViewRefPtr reference and deletes the array of loading coordinate weights. If dimension number *AxisPtr is zero, for all dimensions of given abstract machine representation the requests to measure are canceled and the arrays of loading coordinate weights are deleted.
The function returns zero.
9.7.3 Using results of measuring for dynamic balance of processor loading
Requested by getwar_ (gettar_) function array of loading processor weights contains the weights characterizing distribution of computational load along the dimension of abstract machine representation. In assumption, that the dimension of abstract machine representation is equally distributed along the dimension of processor system, these weights can be considered as characteristics of computation distribution (parallel loop iterations) along dimension of processor system. So requested by getwar_ (gettar_) function array can be used as array with loading coordinate weights of processors in setelw_ function call (see section 4.6).
After time measuring and requesting of the arrays with loading coordinate weigths the abstract machine representation, distributed arrays and parallel loops can be remapped to obtain most balanced loading of the processors. When setelw_ function is requested array of loading coordinate weights must correspond to that processor system dimension, on which the dimension of abstract machine representation is mapped.
9.7.4 Global tracing of execution times of parallel loop iterations
To predict DVM-program performance Run-Time system provides a possibility to trace execution time of iterations of all parallel loops of user program. Output of information (if trace is enabled) is performed after completion of every parallel loop. Tracing of execution times of iterations of all parallel loops is controlled by Run-Time system startup parameters PLGroupNumber and PLTimeTrace.
Parameter PLGroupNumber defines portions the iterations of parallel loops are split into when their execution times are measured:
PLGroupNumber = N_{1}, N_{2}, N_{3}, N_{4};
N_{i} - a number of parts, all dimensions of all representations of abstract machines, mapped on i-th dimension of processor system (the processor system the abstract machine representation is mapped on) are split into. All dimensions of all parallel loops, mapped (through dimensions of representations of abstract machines) on i-th dimension of processor system are divided into the same number of parts. Zero value of N_{i} means, that when measuring execution times of parallel loop iterations all dimensions (mapped on i-th dimension of processor system) of all parallel loops will not be divided into the parts.
The parameter PLTimeTrace is the flag to enable tracing of execution times of parallel loop iterations, and also defines a form of output information. It can have values:
0 | - | trace is enabled; |
1 | - | to output initial and last values of index variable of the loop for every its dimension and time of portion execution for every executed portion of iterations; |
2 | - | to output array of loading coordinate weights for every block-distributed and divided into the parts dimension of abstract machine representation, the loop is mapped on. |
Note. If Run-Time system needs to divide dimension of parallel loop into the parts as for creation of an array of loading coordinate weights (see section 9.7.1), as for global tracing of execution times of iterations, the number of parts, the dimension will be divided into, will be equal to maximal of the numbers, defined by these two requests.
10 Representation of the program as a set of subtasks executed in parallel
A parallel subtask is a pair (<abstract machine>, <processor subsystem>). A subtask group is an abstract machine representation, each element corresponding to a processor subsystem. A parallel subtask is in execution state (active) at the processor belonging to the subtask processor system, if the subtask abstract machine is current. Each processor always executes the only subtask with the (the current) abstract machine mapped on the processor.
A subtask can be created by the functions, mapping an abstract machine representation onto a processor subsystem (distr_, redis_, mdistr_, mredis_), considered in section 5. The created in such a manner subtasks are activated when entering a parallel loop iteration.
Let us consider now creation of the parallel subtasks by explicit specifying of the correspondence <abstract machine> Þ <processor subsystem> and a way of the subtask initialization.
10.1 Mapping abstract machine (subtask creation)
long mapam_ ( | AMRef PSRef |
*AMRefPtr, *PSRefPtr ); |
*AMRefPtr | - | reference to abstract machine to be mapped. | ||
*PSRefPtr | - | reference to the processor system, determining a structure of the processors, assigned to the abstract machine (execution area of the created subtask). |
To create a subtask successfully, its abstract machine and processor system must satisfy to the following requirements:
It is allowed repeated mapping (remapping) of abstract machine, which has no descendants (the abstract machine, corresponding to terminal of abstract machine tree). It is not allowed to remap abstract machine, if it belongs to parental abstract machine representation, mapped by distr_ (mdistr_) function.
The function returns zero.
10.2 Starting subtask (activation)
long runam_ (AMRef *AMRefPtr);
*AMRefPtr - reference to the abstract machine of started subtask.
To start a subtask successfully its abstract machine must belongs to one of the current abstract machine representations.
After the subtask startup, its abstract machine and processor system become the current ones. Together with the current processor system replacing internal numbers of central processor and input/output processor are also replaced (each processor system has its own central and input/output processors).
The function returns the values:
0 | - | the subtask is not started (the current processor doesn't belong to the subtask processor subsystem); |
1 | - | the subtask is started. |
10.3 Completing (stopping) current subtask
long stopam_ (void);
After the function stopam_ execution the abstract machine, parental for the abstract machine, assigned to the stopped subtask, becomes the new current one (the abstract machine of the subtask, activating stopped subtask). Similarly, the processor system, which subsystem is processor system of stopped subtask, will become the current one.
A subtask can't be stopped during parallel loop execution. Initial subtask can't be stopped also (there is no a subtask, that created it).
Run-Time System allows to terminate a program by lexit_ function (see section 2) during any subtask execution without its previous stop by stopam_ function.
When the subtask is stopped, all objects (except of static ones) created from the moment of its initialization, are automatically deleted, i.e. the points of subtask starting and stopping defines a program block (see section 8).
The function returns zero.
The reduction is the computation of the specified function (named "reduction function") with the parameters received from the variable (named "reduction variable") from the different processors executed the different parallel program branches. After reduction execution completes, all copies of the reduction variable in the program branches became equal to the value returned of the reduction function.
For optimization purposes, the Run-Time System executes the reduction over reduction group that is aggregate of the reduction variables and reduction functions. Each reduction variable in reduction group associates with own reduction function.
11.1 Creating reduction variable
Following variables can be used as reduction variables:
Generally it is assumed that reduction variable is a one-dimensional array, and reduction function is executed on each array element.
Creating reduction is declaration of reduction variable and corresponding reduction function:
RedRef crtred_ ( | long void long long void long long |
*RedFuncNumbPtr, *RedArrayPtr, *RedArrayTypePtr, *RedArrayLengthPtr, *LocArrayPtr, *LocElmLengthPtr, *StaticSignPtr ); |
||
*RedFuncNumbPtr | - | the number of the reduction function. | ||
RedArrayPtr | - | pointer to the reduction array-variable. | ||
*RedArrayTypePtr | - | the type of the elements of the array-variable. | ||
*RedArrayLengthPtr | - | the number of the elements in the array-variable. | ||
LocArrayPtr | - | pointer to the array containing additional information about reducôion function (the number of the elements in this array has to be equal to the number of the elements in the reduction array-variable). | ||
*LocElmLengthPtr | - | the size (in bytes) of an element of the array with additional information. | ||
*StaticSignPtr | - | the flag of the static reduction declaration. |
The function crtred_ creates a descriptor of the reduction. The function returns reference to this descriptor (or the reference to the reduction).
Following reduction function numbers are supported (C language named constants are specified in brackets):
The operations with 5, 6, 7 and 8 numbers can be applied to reduction variables of int and long type only (see below).
The result of the operation with number 9 (rf_NE) is non-zero ("TRUE"), if at the moment of the operation start the values of reduction variable are differ at least on two processors and zero ("FALSE") otherwise.
The result of the operation with number 10 (rf_EQ) is non-zero ("TRUE"), if at the moment of the operation start the values of reduction variable are same on all the processors and zero ("FALSE") otherwise.
Additional information (also called localization information), associated with the reduction function, depends on the user task algorithm and is formed by the user program. The additional information can be used only in the functions rf_MAX and rf_MIN and turns them into positional reduction functions MAXLOC and MINLOC. Admittedly the additional information in this case is the specified positions of the maximum or minimum. Run-Time System only sends this information from the processor, where the maximum or minimum was evaluated, to all another processors. If LocArrayPtr = NULL, or if *LocElmLengthPtr = 0 than localizaôion information is omitted.
Following types of reduction array-variable are supported (the name constants of the C language are in brackets):
From C language point of view reduction variable (types 5 and 6) is float or double type one-dimensional array, consisting of two elements (first element is real part of complex number and second one is imaginary part). Only addition and multiplication operations can be applied to complex reduction variables.
If the flag *StaticSignPtr of the static reduction declaration is not equal to zero, then the created (or declared) reduction is not deleted automatically when the control exits from the current program block (see section 8). Such reduction has to be deleted explicitly using the function delred_.
Note. To avoid warnings from Fortran-compiler when the function crtred_ is called with different types of the reduction variables, Run-Time System ðrovides the function
RedRef crtrdf_ ( | long AddrType long long void long long |
*RedFuncNumbPtr, *RedArrayAddrPtr, *RedArrayTypePtr, *RedArrayLengthPtr, *LocArrayPtr, *LocElmLengthPtr, *StaticSignPtr ); |
differing from the function crtred_ in the second parameter:
*RedArrayAddrPtr | - | pointer to the reduction variable-array, cast to AddrType type by one of the functions, considered in section 17.7. |
Other parameters of the function crtrdf_ are similar to corresponding ones of the function crtred_.
long lindtp_ ( | RedRef long |
*RedRefPtr, *LocIndTypePtr ); |
*RedRefPtr | - | reference to the reduction variable. | ||
*LocIndTypePtr | - | code of index variable type. |
The function informs Run-Time system about type of index variables, whose values define coordinates of local maximums or minimums of reduction variables, that form array, passed to crtred_ (crtrdf_) function as RedArrayPtr (RedArrayAddrPtr) parameter (see section 11.1). The values of these index variables are contained in every element of localization information array specified by LocArrayPtr pointer in crtred_ and crtrdf_ function call.
The parameter *LocIndTypePtr can be:
0 – index variables of long type,
1 – int,
2 – short,
3 – char.
Reduction variable, specified by *RedRefPtr reference, must be created in the current subtask and can't be included in any reduction group.
The function returns zero.
Just after creation of reduction variable by crtred_ (crtrdf_) function the type of index variables, which values define coordinates of local maximum or minimum, is integer (default value).
RedGroupRef crtrg_ ( | long long |
*StaticSignPtr, *DelRedSignPtr ); |
*StaticSignPtr | - | the flag of the static reduction group creation. | ||
*DelRedSignPtr | - | the flag of deleting of the all reduction descriptors while deleting the reduction group. |
The function crtrg_ creates empty reduction group (that is group that does not contain any reduction). The function returns reference to the created group.
If the flag *StaticSignPtr of the static reduction group creation is not equal to zero, then the created group does not deleted automatically when the control exits from the current program block (see section 8). Such reduction group has to be deleted explicitly using the function delrg_.
If the flag *DelRedSignPtr is not equal to zero then all reduction variables of the reduction group (at the moment of deleting) will be also deleted while deleting the group. When the group is deleted explicitly, all its variables will be also deleted explicitly, and when the group is deleted implicitly all its variables will be also deleted implicitly.
11.4 Including reduction in reduction group
long insred_ ( | RedGroupRef RedRef PSSpaceRef long |
*RedGroupRefPtr, *RedRefPtr, *PSSpacePefPtr, *RenewSignPtr ); |
*RedGroupRefPtr | - | reference to the reduction group. | ||
*RedRefPtr | - | reference to the reduction variable. | ||
*PSSpaceRefPtr | - | reference to specifier of processor system and reduction variable processor space. | ||
*RenewSignPtr | - | flag to update saved value of reduction variable when it is reincluded in the group. |
Including the reduction variable to the group means its registration as a member of the group reduction operation and its value keeping for the reduction execution.
As specified in the function call reduction group, as the reduction variable included in the group must be created in the current subtask. Moreover, the reduction group must not be started by strtrd_ function (see section 11.6), and the reduction variable must not be already included in other reduction group.
The parameter, specified by *PSSpaceRefPtr reference, defines processor space of reduction variable, that is a processor set, where reduction operation, binded with the variable, is executed. The parameter also binds with the reduction variable a processor system, which will become processor system of reduction group when first reduction variable is included in the group. Processor space of the reduction variable is not strict subset of the set of its processor system elements. Processor systems of all included in reduction group variables must be equivalent, i.e. the must have the same rank and all dimension sizes and consist of the same processors.
The following objects can be used as processor space specifiers:
The processor system, specified by tbecomes the processor PSSpaceRefPtrsystem and he processor space of the reduction variable. All its elements must belong to parameter, the current processor system.
A distributed array, defined by *PSSpaceRefPtr reference, must be mapped on processor system, considered by Run-Time System as the processor system and the processor space of the reduction variable. Each element of the processor system must belong to the current processor system.
A parallel loop, specified by must be the current one and PSSpaceRefPtr reference, mapped (i.e. parallel loop can be used for specification of the reduction variable specification only 9.2). Processor after its mapping by mappl_ function, see section system, the parallel loop is mapped on, is considered as the processor system of the reduction variable. The processor space of the reduction variable is a subset of its processor system elements, which doesn't contain processors, duplicating the execution of replicated loop iterations.
If abstract machine representation is mapped by distr_ (redis_, mdistr_, mredis_) function, the processor system, the mapping is done on, is considered as the processor system and the processor space of the reduction variable. In this case if the representation is not mapped by distr_ (redis_, mdistr_, mredis_) function, but at least one its abstract machine is mapped by mappl_ function, the current processor system is considered as the processor system of reduction variable, and the processor space is a set of central processors of subtasks, created while mapping representation elements (the reduction on parallel subtask group, which competed its execution; see section 10).
Abstract machine representation or its elements can be mapped on the current processor system or its direct or indirect subsystems.
If *PSSpaceRefPtr is zero or PSSpaceRefPtr is NULL, the current processor system is considered as processor system and processor space of the reduction variable.
The reduction variable can be reincluded in the reduction group. Specification of the variable processor space may differ, but its processor system must be equivalent to the processor system of the reduction group. When the reduction variable is reincluded in the group, its saved value is updated, if *RenewSignPtr parameter has non-zero value, and is not changed, if *RenewSignPtr is zero or RenewSignPtr is NULL.
The function returns zero.
11.5 Storing values of reduction variables
The functions, considered below, allow to update values of reduction variables, saved while including the variables in the reduction group without indicating (and, hence, without changing) specifications of processor spaces.
long saverv_ (RedRef *RedRefPtr);
*RedRefPtr - reference to reduction variable.
The function saverv_ updates saved value of given reduction variable for following its usage for group reduction operation execution.
The reduction variable, specified by *RedRefPtr reference, must be created in the current subtask and its reduction group must not be started by strtrd_ function (see sections 11.6).
The function returns zero.
long saverg_ (RedGroupRef *RedGroupRefPtr);
*RedGroupRefPtr - reference to the reduction group.
The function saverg_ stores values of all variables of the reduction group. The reduction group, specified by *RedGroupRefPtr reference, must be created in the current subtask and can't be empty or started by strtrd_ function (see sections 11.6).
The function returns zero.
long strtrd_ (RedGroupRef *RedGroupRefPtr);
*RedGroupRefPtr - reference to the reduction group.
The function strtrd_ starts all reduction operations over all reduction variables of the group. It is not allowed to use these reduction variables until the completion of all reduction operations.
The started reduction group must be created in the current subtask and can't be already started by strtrd_ function. Restarting reduction group is allowed only after completion of the previous start by waitrd_ function (see sections 11.7). Empty reduction group can't be started.
Used for reduction operation execution values of the variables, included in the started group, are defined by Run-Time System in the following way. At central processor reduction variable value is always equal to its current value. At other processors the current value of reduction variable is corrected in the following way:
The reduction variable value, saved at moment of including the variable in the group or renewed by subtracted from its current value.saverv_ or saverg_ function is
Current value of reduction variable is divided by its value saved at moment of including the variable in the group or renewed by saverv_ or saverg_ function (zero divisor is replaced by one).
Current value of reduction variable is bitwise summed up by module 2 (or bitwise summed up by module 2 followed by bitwise inversion) with its value saved when the variable was included in the group or renewed by saverv_ or saverg_ function.
The function returns zero.
11.7 Waiting for completion of reduction group
long waitrd_ (RedGroupRef *RedGroupRefPtr);
*RedGroupRefPtr - reference to the reduction group.
The function waitrd_ awaits the completion of the all reduction operations of the group. It is allowed to use all reduction variables ærom this group after completion of this function.
The reduction group, specified by *RedGroupRefPtr reference, must be created in the current subtask. Waiting for completion of reduction group operations, not started by strtrd_, is not allowed.
After the completion of reduction operations the group will become opened for including new reduction variables and updating variables, already included in the group. The group, which reduction operations were completed, can be restarted by strtrd_ function.
The function returns zero.
long delrg_ (RedGroupRef *RedGroupRefPtr);
*RedGroupRefPtr - reference to the reduction group.
The function delrg_ deletes the reduction group created by the function crtrg_. After deleting the group the reference to the group can be used by user program for any goals.
The reduction group can be deleted by delrg_ function only if it was created in the current subtask and in the current program block (or its sub-block) (see sections 8 and 10). The group, which reduction operations were not completed by waitrd_ function, can't be deleted.
If the reduction group was created by crtrg_ function with zero value of *DelRedSignPtr parameter, then when deleting the group all reduction variables, included in the group, will be deleted also.
To delete reduction group the function delobj_ can also be used (see section 17.5).
The function returns zero.
long delred_ (RedRef *RedRefPtr);
*RedRefPtr - reference to the reduction variable.
The function delred_ deletes reduction variable (the reduction descriptor) created by the function crtred_. After deleting the reference to the descriptor can be used by user program for any goals.
The reduction variable can be deleted by delred_ function only if it was created in the current subtask and in the current program block (or its sub-block (see sections 8 and 10). The variable, belonging to reduction group, which operations are not completed by waitrd_ function, can't be deleted.
If deleted reduction variable belongs to some reduction group, it is excluded from the group.
To delete reduction variable the function delobj_ can also be used (see section 17.5).
The function returns zero.
11.10 Supporting asynchronous reduction during parallel loop execution
Central processor of group processor system performs execution of all reduction operations of reduction group when it calls waitrd_ function: central processor waits arrival of source values of reduction variables from all other processors of the group's processor system, performs required operations and sends result values of reduction variables to the processors.
Central processor can receive source values of reduction variables long before it calls waitrd_ function. So to overlap the message exchange with computations in the best way Run-Time system attempts to perform reduction operations when central processor invokes dopl_ function (before it calls waitrd_ function).
Execution of reduction operations surpassing call of waitrd_ function by central processor is implemented in the following way.
Internal iterations of parallel loop are split into the portions (internal iterations are the iterations, corresponding to internal elements of local part of distributed array, the loop is mapped on, see section 9.3). The number of portions is defined by value of Run-Time system startup parameter InPLQNumber (integer number, more than 1). Splitting internal iterations into the portions is implemented by division into min(InPLQNumber, PLAxisSize) parts of such loop dimension with the least number, for which a number of iteration coordinates is more than one (PLAxisSize is the number of coordinates in such dimension).
When central processor performs dopl_ function that performs transition from one portion to other, Run-Time system checks completion of message receiving with source values of reduction variables from all processors of processor system of reduction group. If all messages are received, all operations of reduction group are performed, and result values of reduction variables are written in central processor memory and sent to all non-central ones.
Surpassing execution of reduction operations is controlled by Run-Time system startup parameter dopl_WaitRD, that can be
0 | - | surpassing execution of reduction operations is not allowed; |
1 | - | to perform the attempts of surpassing execution of reduction operations when transiting from one portion of internal iterations of the loop to other; |
2 | - | to perform additional attempts of surpassing execution of reduction operations when transiting from execution of exported elements to first portion of internal iterations or from last portion of internal iterations to computation of exported elements (see section 9.3). |
Surpassing execution of reduction operations is performed for all reduction groups, started by strtrd_ function.
Note. Splitting of internal iterations of parallel loop into the portions is also performed to request completion of all receivings and sendings of messages when dealing with MPI message passing system: if Run-Time system startup parameter dopl_MPI_Test is non-zero, every call of dopl_ function is accompanied with MPI_Test function call for every non-completed exchange of messages. MPI_Test function is called to "push" asynchronous message passing, i.e. for more overlapping of message exchange with computations (MPI specific).
If both dopl_WaitRD and dopl_MPI_Test parameters are zero, internal iterations of parallel loops are not split into portions.