NPB BT-MZ Benchmark – The READEX Project

The NPB3.3-MZ-MPI BT-MZ benchmark, prepared to run with the READEX tool suite, is available in the git repository acratus.ichec.ie:readex-apps.git. All scripts required to compile and run the benchmark are located in the following directory:

readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/

Instrumentation

Runtime filtering

The NPB benchmarks are compiled as follows:
make <benchmark name> CLASS=<class> NPROCS=<number of processes>
The script “compile_for_saf.sh”, provided in the directory mentioned above, compiles the bt-mz benchmark for class C with two processes so that scorep-autofilter can be applied. The benchmark binary is placed in the bin directory with the suffix “_saf”.
Apply scorep-autofilter using the script “run_saf.sh”. This script requires “do_scorep_autofilter_single.sh”, which is located in the same directory.
scorep-autofilter generates a filter file named scorep.filt, which lists the region names to be filtered between SCOREP_REGION_NAMES_BEGIN and SCOREP_REGION_NAMES_END, as shown below:

SCOREP_REGION_NAMES_BEGIN
EXCLUDE
!$omp parallel @add.f:22
!$omp parallel @exch_qbc.f:204
!$omp parallel @rhs.f:28
add
binvrhs
compute_rhs
copy_x_face
…
rhs_norm
set_constants
timer_read
timer_start
timer_stop
zone_setup
SCOREP_REGION_NAMES_END
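To check which regions a generated filter file excludes, the names between the two markers can be read back with a few lines of Python. The following is a minimal sketch, not part of the READEX scripts: the function name `excluded_regions` is hypothetical, the sample text is a shortened version of the listing above, and the sketch assumes one region name per line (region names such as the OpenMP ones contain spaces, so splitting on whitespace would not work).

```python
def excluded_regions(filter_text):
    """Return the region names listed between the BEGIN/END markers.

    Assumption: one region name per line; the EXCLUDE keyword may sit on
    its own line or directly in front of the first region name.
    """
    names = []
    inside = False
    for raw in filter_text.splitlines():
        line = raw.strip()
        if line == "SCOREP_REGION_NAMES_BEGIN":
            inside = True
        elif line == "SCOREP_REGION_NAMES_END":
            inside = False
        elif inside and line:
            if line.startswith("EXCLUDE"):
                # Drop the keyword and keep whatever follows it on the line.
                line = line[len("EXCLUDE"):].strip()
            if line:
                names.append(line)
    return names


# Shortened sample, for illustration only.
sample = """SCOREP_REGION_NAMES_BEGIN
EXCLUDE
!$omp parallel @add.f:22
add
binvrhs
timer_stop
zone_setup
SCOREP_REGION_NAMES_END
"""

print(excluded_regions(sample))
```

For a real run, the contents of scorep.filt would be passed in place of `sample`.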

Manual Instrumentation

For the NPB bt-mz benchmark, the regions exch_qbc(), x_solve(), y_solve() and z_solve() are manually annotated as significant regions in the following files, respectively:

readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/exch_qbc.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/x_solve.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/y_solve.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/z_solve.f

Design Time Analysis

Apply readex-dyn-detect

The NPB benchmarks with a manually annotated phase region are built using the script “compile_for_rdd.sh”, which compiles the benchmark with compiler instrumentation and the phase region. The benchmark binary is placed in the bin directory with the suffix “_rdd”. To compile the benchmark with manual instrumentation instead, use the script “compile_for_rdd_manual.sh”; the resulting binary has the suffix “_rdd_manual”.
To apply the readex-dyn-detect tool, run the script “run_rdd.sh”. readex-dyn-detect prints the following lines as part of its output for the NPB bt-mz benchmark:

Significant regions are:
   exch_qbc
   x_solve
   y_solve
   z_solve

Significant region information
===============================
Region name   Min(t)   Max(t)   Time      Time Dev.(%Reg)   Ops/L3miss   Weight(%Phase)
exch_qbc      0.018    0.021    3.634     1.2               0            1
x_solve       0.001    0.019    115.943   74.1              24738        28
y_solve       0.001    0.018    120.023   72.2              107241       29
z_solve       0.001    0.018    119.114   73.9              68568        28

Phase information
=================
Min       Max      Mean      Time     Dev(%Phase)   Dyn.(% Phase)
2.10048   2.1136   2.10225   420.45   0             0.624276

threshold time variation (percent of mean region time): 10.000000
threshold compute intensity deviation (#ops/L3 miss): 10.000000
threshold region importance (percent of phase exec. time): 10.000000

SUMMARY:
========
No inter-phase dynamism

Intra-phase dynamism due to time variation(%) of the following important significant regions
   x_solve
   y_solve
   z_solve

Intra-phase dynamism due to variation in the compute intensity of the following important significant regions
   x_solve
   y_solve
   z_solve

The printed output above for the bt-mz benchmark can be divided into three parts:

First, the list under “Significant regions are:” gives the names of the significant regions found by the detection algorithm. For details of the algorithm, see deliverable D2.1.

Second, the sections “Significant region information” and “Phase information” show the profile statistics for the detected significant regions and for the phase region. The significant region information gives the minimum and maximum execution time of each significant region, as well as its aggregated execution time. It also gives the time deviation in percent relative to the region's mean execution time. The Ops/L3miss column gives the absolute compute intensity. The last column, Weight(%Phase), is the region's execution time as a percentage of the phase execution time.
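As a worked example, the Weight(%Phase) column can be reproduced from the aggregated times in the output above. This is a minimal sketch that assumes the weight is the aggregated region time divided by the aggregated phase time, rounded to a whole percent; the function name `weight_percent` is hypothetical and the exact formula used inside readex-dyn-detect is not stated here.

```python
# Aggregated execution times (column "Time") copied from the output above.
region_time = {
    "exch_qbc": 3.634,
    "x_solve": 115.943,
    "y_solve": 120.023,
    "z_solve": 119.114,
}
phase_time = 420.45  # aggregated phase time from "Phase information"


def weight_percent(region_seconds, phase_seconds):
    """Region time as a percentage of the phase time (assumed definition)."""
    return 100.0 * region_seconds / phase_seconds


weights = {name: round(weight_percent(t, phase_time))
           for name, t in region_time.items()}
print(weights)
```

Under this assumption the rounded results are 1, 28, 29 and 28, which agree with the printed Weight(%Phase) values.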

The tool then prints statistics for the phase region: the minimum, maximum and mean execution time of the phase, as well as the aggregated execution time over all phases. The Dev(%Phase) column gives the time deviation relative to the mean phase execution time. The last column, Dyn.(% Phase), gives the difference between the minimum and maximum execution time relative to the mean phase execution time.
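The Dyn.(% Phase) value can be checked against the other phase columns. The sketch below assumes the definition (Max - Min) / Mean expressed in percent, which is one reading of the description above; the function name `dyn_percent` is hypothetical.

```python
# Phase statistics copied from the "Phase information" output above.
phase_min, phase_max, phase_mean = 2.10048, 2.1136, 2.10225


def dyn_percent(t_min, t_max, t_mean):
    """Spread between slowest and fastest phase, in percent of the mean."""
    return 100.0 * (t_max - t_min) / t_mean


print(f"{dyn_percent(phase_min, phase_max, phase_mean):.3f}")
```

This gives roughly 0.624, in line with the printed 0.624276; the small remaining difference is plausibly due to the rounding of the displayed Min, Max and Mean values.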

Finally, the tool prints a summary of the dynamism analysis. If the standard deviation (in percent) of the phase execution time is larger than the given variation threshold, the tool reports inter-phase dynamism due to variation in the execution time of the phases. Otherwise, it reports that the application has no inter-phase dynamism.

The tool compares the Weight(%Phase) of each significant region with the threshold given by the user. If a significant region has sufficient weight and its time deviation is larger than the time variation threshold, the tool reports intra-phase dynamism for that region due to time variation.

The tool also computes the variation of the compute intensity across the detected significant regions that have a weight of at least 10%.
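The decision rules for the time-based part of the summary can be sketched as follows. This is an illustration under stated assumptions, not the readex-dyn-detect implementation: the names `Region`, `has_inter_phase_dynamism` and `time_dynamic_regions` are hypothetical, the same 10% time variation threshold is assumed for phases and regions, and whether each comparison is strict or inclusive is a guess. The compute intensity check is left out because its formula is not given here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    time_dev_percent: float  # column Time Dev.(%Reg)
    weight_percent: float    # column Weight(%Phase)


# Thresholds as printed in the output above.
TIME_VARIATION_THRESHOLD = 10.0
REGION_IMPORTANCE_THRESHOLD = 10.0


def has_inter_phase_dynamism(phase_dev_percent,
                             threshold=TIME_VARIATION_THRESHOLD):
    """Inter-phase dynamism: phase time deviation exceeds the threshold."""
    return phase_dev_percent > threshold


def time_dynamic_regions(regions,
                         weight_threshold=REGION_IMPORTANCE_THRESHOLD,
                         dev_threshold=TIME_VARIATION_THRESHOLD):
    """Names of important regions whose time deviation exceeds the threshold."""
    return [r.name for r in regions
            if r.weight_percent >= weight_threshold
            and r.time_dev_percent > dev_threshold]


# Values copied from the "Significant region information" output above.
regions = [
    Region("exch_qbc", 1.2, 1),
    Region("x_solve", 74.1, 28),
    Region("y_solve", 72.2, 29),
    Region("z_solve", 73.9, 28),
]

print(has_inter_phase_dynamism(0.0))   # Dev(%Phase) is 0 in the output above
print(time_dynamic_regions(regions))
```

With the values from the output above, the sketch reports no inter-phase dynamism and selects x_solve, y_solve and z_solve, which matches the printed summary. exch_qbc is dropped because its weight of 1% is below the importance threshold.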

In addition, the tool generates a readex_config.xml file, which contains the configuration of the tuning parameters and objective functions to be explored by PTF in the next step.

Apply PTF

The NPB benchmarks with an annotated phase region are built using the script “compile_for_ptf.sh”, which compiles the benchmark with compiler instrumentation and the phase region. The benchmark binary is placed in the bin directory with the suffix “_ptf”. To compile the benchmark with manual instrumentation instead, use the script “compile_for_ptf_manual.sh”; the resulting binary has the suffix “_ptf_manual”.
To apply PTF design time analysis, use the script “run_ptf.sh” for compiler instrumentation or “run_ptf_manual.sh” for manual instrumentation. This step uses the “readex_config.xml” generated by readex-dyn-detect in the previous step and generates a tuning model named “tuning_model.json” in the parent working directory. The tuning model describes all generated runtime situations (rts) and the optimum configuration of the tuning parameters for each rts.
The tuning_model.json is used by the READEX Runtime Library (RRL) to tune the application at runtime.

Runtime Application Tuning

Runtime application tuning is performed by the READEX Runtime Library (RRL) using the following steps.

The script “compile_for_plain.sh” is used to generate a binary without Score-P instrumentation.
The RRL run uses both the plain binary generated in the first step and the binary compiled for PTF with the script “compile_for_ptf.sh” or, in the case of manual instrumentation, “compile_for_ptf_manual.sh”.
The benchmark can be run using the script “run_rrl.sh”. The run script can be regenerated for custom configurations. To do so, edit “rrl_tests.txt” to define the new test configuration. Next, execute the command “generate_plain_rrl_hdeem.sh rrl_tests.txt <number_of_repeat_runs_per_test>”, where “number_of_repeat_runs_per_test” is an integer specifying how many times to repeat each test. This generates a new “run_rrl.sh” script with the updated test configurations.
The script “run_rrl.sh” performs both the plain run and the RRL run and uses HDEEM for energy measurements. It takes as input the “tuning_model.json” generated by PTF. The output is written to “bt-mz_plain_hdeem.out” and “bt-mz_rrl_hdeem.out”, which contain the runtime and energy consumption of the plain runs and the RRL runs respectively.
To use “sacct” instead of HDEEM for energy measurements, use the script “run_sacct_rrl_plain.sh”. It also runs the experiments for both the plain run and the RRL run, and prints the “Average Plain Time”, “Average Plain Energy”, “Average RRL Time” and “Average RRL Energy” to the console.
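Once the averages for the plain and RRL runs are available, a natural next step is to express the difference as a relative saving. The sketch below is only an illustration: the function name `relative_saving_percent` is hypothetical, and the numbers are placeholders chosen to make the arithmetic easy to follow, not measurements from this benchmark.

```python
from statistics import mean


def relative_saving_percent(plain, tuned):
    """Saving of the tuned run relative to the plain run, in percent."""
    return 100.0 * (plain - tuned) / plain


# PLACEHOLDER values for illustration only, not measured results.
plain_energy = [1000.0, 1010.0, 990.0]   # one entry per repeated plain run
rrl_energy = [900.0, 910.0, 890.0]       # one entry per repeated RRL run

avg_plain = mean(plain_energy)
avg_rrl = mean(rrl_energy)
print(relative_saving_percent(avg_plain, avg_rrl))
```

The same function applies to the average runtimes. A negative result would mean that the RRL run consumed more than the plain run.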