NPB BT-MZ Benchmark

The NPB3.3-MZ-MPI BT-MZ benchmark, prepared to run with the READEX tool suite, is available in the git repository acratus.ichec.ie:readex-apps.git. All scripts required to compile and run the benchmark are in the following directory:

readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/

Instrumentation

Runtime filtering

  1. The NPB benchmarks are compiled as follows:
    make <benchmark name> CLASS=<class> NPROCS=<number of processes>
    The script “compile_for_saf.sh” in the directory mentioned above compiles the bt-mz benchmark for class C with two processes, ready for scorep-autofilter. The benchmark binary is placed in the bin directory with the suffix “_saf”.
  2. Apply scorep-autofilter using the script “run_saf.sh”. This script requires “do_scorep_autofilter_single.sh”, which is in the same directory.
  3. Scorep-autofilter generates a filter file named scorep.filt, which lists the region names to be filtered between SCOREP_REGION_NAMES_BEGIN and SCOREP_REGION_NAMES_END, as shown below:

    SCOREP_REGION_NAMES_BEGIN
    EXCLUDE
    !$omp parallel @add.f:22
    !$omp parallel @exch_qbc.f:204
    !$omp parallel @rhs.f:28
    add
    binvrhs
    compute_rhs
    copy_x_face

    rhs_norm
    set_constants
    timer_read
    timer_start
    timer_stop
    zone_setup
    SCOREP_REGION_NAMES_END
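To check which regions ended up in the filter, the file can be read with a few lines of Python. This is a minimal sketch, not part of the READEX scripts: the function name `excluded_regions` is hypothetical, and it assumes one region name per line (names such as “!$omp parallel @add.f:22” contain spaces) and that lines starting with “#” are comments.

```python
def excluded_regions(filter_text):
    """Return the region names listed after EXCLUDE in a region-name filter block.

    Assumption: one region name per line; '#' lines are treated as comments.
    """
    regions = []
    in_block = False    # inside SCOREP_REGION_NAMES_BEGIN ... END
    excluding = False   # current rule is EXCLUDE (not INCLUDE)
    for raw in filter_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "SCOREP_REGION_NAMES_BEGIN":
            in_block = True
            continue
        if line == "SCOREP_REGION_NAMES_END":
            in_block = False
            excluding = False
            continue
        if not in_block:
            continue
        if line.startswith("EXCLUDE"):
            excluding = True
            line = line[len("EXCLUDE"):].strip()
        elif line.startswith("INCLUDE"):
            excluding = False
            line = line[len("INCLUDE"):].strip()
        if excluding and line:
            regions.append(line)
    return regions


# A shortened copy of the filter shown above.
sample = """SCOREP_REGION_NAMES_BEGIN
EXCLUDE
!$omp parallel @add.f:22
add
binvrhs
SCOREP_REGION_NAMES_END
"""
print(excluded_regions(sample))  # ['!$omp parallel @add.f:22', 'add', 'binvrhs']
```

For a real run, pass the contents of scorep.filt, for example `excluded_regions(open("scorep.filt").read())`.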

Manual Instrumentation

For the NPB bt-mz benchmark, the regions exch_qbc(), x_solve(), y_solve() and z_solve() are manually annotated as significant regions in the following files, respectively:

readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/exch_qbc.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/x_solve.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/y_solve.f
readex-apps/readex-repository/benchmark_apps/NPB3.3-MZ-MPI/BT-MZ/z_solve.f

Design Time Analysis

Apply readex-dyn-detect

  1. The NPB benchmarks with a manually annotated phase region are built using the script “compile_for_rdd.sh”. The benchmark binary is placed in the bin directory with the suffix “_rdd”. This compiles the benchmark with compiler instrumentation and the phase region. To compile the benchmark with manual instrumentation instead, use the script “compile_for_rdd_manual.sh”; the resulting binary has the suffix “_rdd_manual”.
  2. To apply the readex-dyn-detect tool, use the script “run_rdd.sh”. readex-dyn-detect prints the following lines as part of its output for the NPB bt-mz benchmark:

Significant regions are:
exch_qbc
x_solve
y_solve
z_solve

Significant region information
===============================

Region name   Min(t)   Max(t)   Time      Time Dev.(%Reg)   Ops/L3miss   Weight(%Phase)
exch_qbc      0.018    0.021    3.634     1.2               0            1
x_solve       0.001    0.019    115.943   74.1              24738        28
y_solve       0.001    0.018    120.023   72.2              107241       29
z_solve       0.001    0.018    119.114   73.9              68568        28

Phase information
=================

Min       Max      Mean      Time     Dev(%Phase)   Dyn.(% Phase)
2.10048   2.1136   2.10225   420.45   0             0.624276

threshold time variation (percent of mean region time): 10.000000
threshold compute intensity deviation (#ops/L3 miss): 10.000000
threshold region importance (percent of phase exec. time): 10.000000

SUMMARY:
========
No inter-phase dynamism
Intra-phase dynamism due to time variation(%) of the following important significant regions
x_solve
y_solve
z_solve
Intra-phase dynamism due to variation in the compute intensity of the following important significant regions
x_solve
y_solve
z_solve

The printed output above for the bt-mz benchmark can be divided into three parts:

First, the block headed “Significant regions are:” lists the names of the significant regions found by the detection algorithm. For details of the algorithm, see deliverable D2.1.

Second, the “Significant region information” and “Phase information” blocks show the profile statistics for the detected significant regions and for the phase region. “Significant region information” gives the minimum and maximum execution time of each significant region, as well as the aggregated execution time of the region.

It also prints the time deviation in percent with respect to the region's mean value. The Ops/L3miss column gives the absolute compute intensity. The last column, Weight(%Phase), is the execution time of the region relative to the phase time.

The tool then prints the statistics for the phase region. It shows the minimum, maximum and mean execution time spent in the phase region, as well as the aggregated execution time of the phase. The Dev(%Phase) column gives the time deviation with respect to the mean phase execution time. The last column, Dyn.(% Phase), gives the difference between the minimum and maximum execution time relative to the mean execution time of the phase.
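The derived columns can be reproduced from the printed values. The sketch below assumes, following the description above, that Weight(%Phase) is the aggregated region time divided by the aggregated phase time, and that Dyn.(% Phase) is (max - min) / mean; the exact formulas used inside readex-dyn-detect are not shown here, so treat these as an approximation.

```python
# Values copied from the readex-dyn-detect output above.
region_time = {
    "exch_qbc": 3.634,
    "x_solve": 115.943,
    "y_solve": 120.023,
    "z_solve": 119.114,
}
phase_min, phase_max, phase_mean, phase_time = 2.10048, 2.1136, 2.10225, 420.45


def weight_percent(region_total, phase_total):
    """Share of the aggregated phase time spent in one region, in percent."""
    return 100.0 * region_total / phase_total


def dyn_percent(t_min, t_max, t_mean):
    """Spread between min and max phase time relative to the mean, in percent."""
    return 100.0 * (t_max - t_min) / t_mean


weights = {name: round(weight_percent(t, phase_time)) for name, t in region_time.items()}
dyn = dyn_percent(phase_min, phase_max, phase_mean)

print(weights)  # {'exch_qbc': 1, 'x_solve': 28, 'y_solve': 29, 'z_solve': 28}
print(dyn)      # roughly 0.62
```

The rounded weights match the Weight(%Phase) column. The computed dynamism is close to the printed 0.624276; the small difference is presumably due to rounding in the printed minimum, maximum and mean.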

Finally, the tool prints the summary of the dynamism analysis. If the standard deviation (in percent) of the phase is larger than the given variation threshold, the tool reports inter-phase dynamism due to variation in the execution time of the phases. Otherwise, the application has no inter-phase dynamism.

The tool compares Weight(%Phase) with the threshold given by the user. If a significant region has enough weight and its time deviation with respect to the region is larger than the time deviation threshold, the tool reports intra-phase dynamism due to time variation for that significant region.

The tool computes the variation of the compute intensity for the set of detected significant regions having a minimum weight of 10%.
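The time-variation part of the summary can be retraced with the printed numbers. This is a sketch under stated assumptions: the function names are hypothetical, the comparisons are taken to be strict (“larger than”), and the compute intensity check is left out because its formula is not given here.

```python
# Thresholds as printed by readex-dyn-detect (all 10 percent).
THRESHOLD_TIME_DEV = 10.0   # percent of mean region time
THRESHOLD_WEIGHT = 10.0     # percent of phase execution time

# name: (Time Dev.(%Reg), Weight(%Phase)), copied from the output above.
regions = {
    "exch_qbc": (1.2, 1),
    "x_solve": (74.1, 28),
    "y_solve": (72.2, 29),
    "z_solve": (73.9, 28),
}
phase_dev = 0.0  # Dev(%Phase) from the phase information block


def has_inter_phase_dynamism(dev_percent, threshold):
    """Inter-phase dynamism: phase time deviation exceeds the threshold."""
    return dev_percent > threshold


def time_variation_regions(stats, dev_threshold, weight_threshold):
    """Regions that are important enough and vary enough in execution time."""
    return [
        name
        for name, (dev, weight) in stats.items()
        if weight > weight_threshold and dev > dev_threshold
    ]


inter_phase = has_inter_phase_dynamism(phase_dev, THRESHOLD_TIME_DEV)
intra_phase = time_variation_regions(regions, THRESHOLD_TIME_DEV, THRESHOLD_WEIGHT)

print(inter_phase)  # False
print(intra_phase)  # ['x_solve', 'y_solve', 'z_solve']
```

With these inputs the sketch agrees with the SUMMARY block: no inter-phase dynamism, and x_solve, y_solve and z_solve flagged for time variation. exch_qbc is dropped because its weight of 1% is below the importance threshold.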

The tool also generates a readex_config.xml file, which contains the configuration of the tuning parameters and objective functions to be explored by PTF in the next step.

Apply PTF

  1. The NPB benchmarks with an annotated phase region are built using the script “compile_for_ptf.sh”. The benchmark binary is placed in the bin directory with the suffix “_ptf”. This compiles the benchmark with compiler instrumentation and the phase region. To compile the benchmark with manual instrumentation instead, use the script “compile_for_ptf_manual.sh”; the resulting binary has the suffix “_ptf_manual”.
  2. To apply PTF design time analysis, use the script “run_ptf.sh” for compiler instrumentation or “run_ptf_manual.sh” for manual instrumentation. This step uses the “readex_config.xml” generated by readex-dyn-detect in the previous step and generates a tuning model named “tuning_model.json” in the parent working directory. The tuning model contains details of all generated runtime situations (rts) and the optimum configuration of the tuning parameters for each rts.
  3. The tuning_model.json is used by the READEX Runtime Library (RRL) to tune the application at runtime.

Runtime Application Tuning

Runtime Application Tuning can be performed by the READEX Runtime Library (RRL) using the following steps.

  1. The script “compile_for_plain.sh” is used to generate a binary without Score-P instrumentation.
  2. Both the plain binary generated in the first step and the binary compiled for PTF, using the script “compile_for_ptf.sh” or, in the case of manual instrumentation, “compile_for_ptf_manual.sh”, are used for the RRL run.
  3. The benchmark can be run using the script “run_rrl.sh”. The run script can be regenerated for custom configurations. To do so, edit “rrl_tests.txt” to define the new test configuration. Next, execute the command “generate_plain_rrl_hdeem.sh rrl_tests.txt <number_of_repeat_runs_per_test>”, where “number_of_repeat_runs_per_test” is an integer specifying how many times to repeat each test. This generates a new “run_rrl.sh” script with the updated test configurations.
  4. The script “run_rrl.sh” performs the plain run and the RRL run and uses HDEEM for energy measurements. It takes as input the “tuning_model.json” generated by applying PTF. The outputs are written to “bt-mz_plain_hdeem.out” and “bt-mz_rrl_hdeem.out”, which contain the runtime and energy consumption of the plain runs and the RRL runs respectively.
  5. To use “sacct” instead of HDEEM for energy measurements, use the script “run_sacct_rrl_plain.sh”. It also performs the experiments for both the plain and the RRL run, and prints the “Average Plain Time”, “Average Plain Energy”, “Average RRL Time” and “Average RRL Energy” to the console.
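To compare the plain and RRL runs, the reported averages can be turned into a relative saving. The helper below is a hypothetical convenience, not part of the READEX scripts; it only assumes that the two averages are given in the same unit.

```python
def relative_saving_percent(plain_value, rrl_value):
    """Saving of the RRL run relative to the plain run, in percent.

    Positive means the RRL run used less time or energy than the plain run;
    negative means it used more.
    """
    if plain_value <= 0:
        raise ValueError("plain_value must be positive")
    return 100.0 * (plain_value - rrl_value) / plain_value
```

Call it once with “Average Plain Energy” and “Average RRL Energy”, and once with the two time averages, to see whether an energy saving comes at the cost of a longer runtime.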