| Parallel Program Performance Analyzer
        (PPPA) | 
Contents
1 Functions of Performance Analyzer
2
The Content of PPPA
3 Approach and Principle of Performance Analyzer
3.1 Splitting program into intervals
3.2 Accumulation of the information
3.3 Processing the information
3.3.1 Calculation of the characteristics of execution on each processors
3.3.2 Calculation of dissynchronization characteristic and time variation of collective operations
3.3.3 Calculation of overlap time
3.3.4 Calculation of the comparative characteristics of execution on different processors
3.3.5 Calculation of the main characteristics
1 Functions of Performance Analyzer
The performance analyzer is intended for the analysis and debugging of DVM-programs execution efficiency. With the performance analyzer the user can get the execution time characteristics of his program in more or less detail.
The efficiency of execution of the parallel programs on multiprocessor computers with the distributed memory is determined by the following major factors:
The DVM-system has an information whether sequential or parallel part of the program is executed on any processor at any moment. This approach is an essential advantage in comparison with the approaches based on explicit use of communication libraries (MPI, PVM). Besides, all synchronization operations of the program are known. Therefore there is an opportunity to quantify the influence of four above factors on the program execution efficiency. It is impossible to distinguish sequential calculations from parallel and to determine a degree of parallelism of the program, if the parallel program is submitted as a system of cooperating processes using communication libraries explicitly.
The opportunity to distinguish sequential and parallel parts of the program during its execution on the multiprocessor computer allows the performance analyzer to give the user the following basic parameters of the parallel program execution:
Execution time is the maximum from the times of the program execution on each processor.
To calculate the main characteristic of parallel execution (parallelism efficiency coefficient) it is necessary to calculate two amount of time. First, an effective time required for the program execution on serial computer. Second, a total time of parallel execution, calculated as a product of execution time by number of processors. It is important to calculate total time of the parallel execution in this way (instead of the sum of times of execution on all processors) because all processors are allocated to the program at the moment of program start and are released after the end of program execution. Parallelism efficiency coefficient is a ratio of the effective time to the total time of parallel execution.
The lost time is the total time of parallel execution subtracted by the effective time. If the programmer is not satisfied with parallelism efficiency coefficient he should analyze which components the lost time consists of.
Thus, there are following independent components of the lost time:
These components are calculated using total time of parallel execution. Therefore the lost time includes also the times, when the processors spent idling, i.e. variations between the execution times on different processors results in increasing the lost time. These variations can occur due to different time of execution of common actions (reduction operations, execution of sequential or parallel parts) on different processors. Besides such imbalance of processors loading at different stages of the parallel program execution can result in growing processor synchronization time. Therefore imbalance and dissynchronization characteristics should be calculated and given the user. Besides for collective operations requiring processors synchronization (reduction operations and loading of buffers), potential dissynchronization time is determined, as the time, which may be lost because of non-simultaneous start of collective operation execution on different processors. Dissynchronization can occur not only due to imbalance, but also due to distinctions of completion times of collective operations on different processors. To evaluate the dissynchronization a programmer is provided by a special characteristic - time variation of collective operation completion.
For more detailed analysis of the program efficiency user should be able to get characteristics of participation of each processor in execution of the parallel program. Besides the user can apply special language features to split program into intervals and to get performance characteristics for each interval.
The performance analyzer consists of two subsystems - accumulation subsystem and subsystem of information processing.
The first subsystem provides accumulation of execution characteristics of parallel program on each processor. This subsystem is called from Lib-DVM during the parallel program execution. Besides C-DVM and Fortran DVM languages have the features for description of intervals of the program execution, for which the user would like to get the performance characteristics. The compiler inserts accumulation subsystem calls at the beginning and the end of each interval. The information from every processor outputs into a file upon the termination of the program.
The second subsystem running on a workstation, processes the information gathered on parallel computer and outputs performance characteristics requested by user.
3 Approach and Principle of Performance Analyzer
3.1 Splitting program into intervals
The execution of the program can be represented as a sequence of intervals. By default, the program is one interval. Also user can define intervals by means of C-DVM and Fortran DVM languages. There is also opportunity to set a mode of compilation, when all sequential loops, containing parallel loops, or all sequential loops or marked sequences of program statements may be declared as intervals.
The user can also split any interval into smaller intervals or to unite neighbor intervals (in order of execution) in a new one, i.e. to present the program as hierarchy of intervals of several levels (the whole program is an interval of highest level).
The mechanism of splitting the program into intervals serves for more detailed analysis of behavior of the program during its execution. Looking through results with the help of the performance analyzer, user can set the depth of details to leave out of consideration intervals of the requested.
3.2 Accumulation of the information
The following information is stored in memory and written into a file on each processor:
- Time of execution Execution_Time
- Processor time of program calculations User_CPU_time
- Time of system calculations Sys_CPU_time
- Time of losses because of insufficient parallelism Insufficient_par
- Times of the message exchange during the communication function execution Communication
- Time of real synchronization Real_synchronization  a part of message passing time, lost, because a receive operation has started on one processor before the corresponding send operation on other processor
- Time of exchanges with files I/O time.
3.3 Processing the information
The information saved in file during parallel execution of the program is processed then on a workstation as follows.
3.3.1 Calculation of the characteristics of execution on each processors
At this stage idle, load imbalance and lost times are calculated at every processor.
3.3.2 Calculation of dissynchronization characteristic and time variation of collective operations
For collective operations requiring synchronization of processors (reduction, updating of shadow edges), the appropriate dissynchronization times are determined for each processor - times which may be lost during message passing due to non-simultaneous start of collective operation execution on different processors. The variation of completion times is calculated for the collective operations. This characteristic shows possible dissynchronization losses arising when collective operations complete not simultaneously on different processors.
When collective operation dissynchronization is calculated, at first the time Tmax of the operation start on the last of processors is calculated. Then for each processor, started the collective operation at time Ti, dissynchronization is calculated. It is equal to a difference Tmax- Ti. Time variation of collective operations is calculated similarly, but the times of the operation completion are used. These characteristics are calculated for all collective operations excepting times of asynchronous operation starts.
3.3.3 Calculation of overlap time
Time of reduction overlap (Reduction_overlap) is the sum of time intervals which begin from the last completion of a start reduction operation (Start_reduction) on any processor and finish when other processors start corresponding wait_reduction operation (Wait_reduction). Time of shadow renewing overlap (Shadow_renewing_overlap) and times of overlap of other collective operations which have of start and wait parts are calculated in the same way.
3.3.4 Calculation of the comparative characteristics of execution on different processors
For each of the program execution characteristics, accumulated on different processors, two processors are determined, which parameters are maximal and minimal. For each characteristic the numbers of these two processors and three values of the characteristic - minimal, maximal and average are given.
3.3.5 Calculation of the main characteristics
These characteristics concern all parallel program and its intervals and are calculated using the characteristics received on each processor. According to the degree of details, given by the user, (level of intervals) the following information on performance execution of the program can be given.
Time of imbalance (Load_Imbalance), real synchronization (Real_synchronization), potential synchronization (Synchronization), time variation (Time_variation), time of overlapping (Overlap) and all their components are calculated similarly.