Components and Usage

 

  1. Preparation
  2. Instrumentation
    1. Instrument the code
    2. Runtime filtering
    3. Compile time filtering
    4. Energy Measurements
    5. Filter OpenMP and MPI regions
    6. Manual instrumentation
    7. ATP instrumentation
  3. Design Time Analysis
    1. Apply readex-dyn-detect
    2. Specify the tuning parameters, objectives, the energy metrics and the search strategy
    3. Apply PTF
  4. Runtime Application Tuning
    1. Tune with RRL
    2. Visualize Configuration Switching

 

1 Preparation

Load the modules for the tool suite.

2 Instrumentation

The whole READEX tool suite is based on the instrumentation of the application with Score-P. Instrumentation inserts measurement probes into the source code of the application. This can be done by the compiler, other software tools, or manually. Detailed documentation on Score-P and the instrumentation features can be found at www.vi-hps.org/projects/score-p.

The probes add overhead to your application. Too much overhead distorts the measurements and wastes any tuning effort. Therefore, as the very first step, make sure that the instrumentation overhead is below a few percent.

This section describes the support in Score-P for reducing the measurement overhead.

To measure the overhead, first measure the execution without instrumentation and then measure it with instrumentation.
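The comparison of the two runs can be reduced to a single percentage. The following is a minimal sketch, not part of the READEX tool suite: the helper name overhead_percent and the two timings are hypothetical, and the wall-clock times are assumed to be given in seconds.

```shell
#!/bin/sh
# overhead_percent PLAIN_SECONDS INSTRUMENTED_SECONDS
# Prints the slowdown caused by instrumentation, relative to the
# uninstrumented run, in percent. (Hypothetical helper for illustration.)
overhead_percent() {
    awk -v plain="$1" -v instr="$2" \
        'BEGIN { printf "%.2f\n", (instr - plain) / plain * 100 }'
}

# Hypothetical timings: 100 s without and 103 s with instrumentation.
overhead_percent 100 103
```

A result of a few percent or less would match the target stated above; anything higher suggests continuing with the filtering steps in this section.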

Follow the steps outlined below in this section to reduce the overhead to an acceptable level. First, try to reduce the overhead with runtime and compile time filtering.

Then switch on the energy measurements with HDEEM, since they have a much higher overhead than plain time measurements. Verify the overhead again. If it is still too high, consider manual instrumentation of those regions that are relevant for the READEX tool suite.

Do not proceed to energy tuning if the overhead is too high.

2.1 Instrument the code

  1. First modify the application’s makefile for instrumentation with Score-P. Prepend the scorep command to the compiler command. For example, replace
    MPICXX = mpic++ -fopenmp
    by
    MPICXX = scorep --mpp=mpi mpic++ -fopenmp

    The scorep command switches on compiler instrumentation of program functions as well as instrumentation of MPI routines and OpenMP regions.
    Use --mpp=mpi for MPI applications and --mpp=none for non-MPI applications.
  2. Build the application. Note that Score-P and the application must be built with the same compiler.
  3. Run the application in the same way as the uninstrumented version.

Outcome: Compiler instrumentation of the application is performed and Score-P creates a profile file (profile.cubex) in the scorep-<xyz> directory at the execution location.

2.2 Runtime Filtering

The first way to reduce the instrumentation overhead is to suppress the measurements done by Score-P for instrumented regions. This is called runtime filtering of regions. READEX provides the scorep-autofilter tool, which inspects a generated profile and creates a filter file to guide runtime filtering. This file contains the names of regions that are too fine-granular and are therefore dominated by the measurement overhead.

  1. Apply the scorep-autofilter tool to the profile.cubex file as follows:
    scorep-autofilter -t <region_granularity_threshold_in_sec> \
                      -f <filter_file_name_without_extension> \
                      <path_to_cubex_file>/profile.cubex

    For the READEX tool suite, use a threshold of 100 ms, i.e., -t 0.1.
    This creates a filter file with the .filt extension.
  2. It is advisable, but not required, to rerun the application and scorep-autofilter to detect additional fine-granular regions that were missed in the previous step because their execution time was inflated by the measurement overhead of nested regions. To do so, set the environment variable SCOREP_FILTERING_FILE to the filter file name (including the .filt extension) before rerunning the application.
    Apply scorep-autofilter to the new profile. Be careful not to overwrite the current filter file. Copy the newly found region names into the original filter file.
    Repeat this step until no more regions are found.

Outcome: A filter file with .filt extension containing the application regions that Score-P will not measure.
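Copying newly found region names into the original filter file can be scripted. The sketch below is an illustration only: the function name merge_filters and the file names are hypothetical, and it assumes the plain Score-P filter layout with one region name per line between SCOREP_REGION_NAMES_BEGIN, EXCLUDE and SCOREP_REGION_NAMES_END. Check the files that scorep-autofilter actually produces before relying on it.

```shell
#!/bin/sh
# merge_filters OLD.filt NEW.filt
# Prints a filter file that excludes the union of the region names found
# in both input files. Assumes one region name per line (see lead-in).
merge_filters() {
    echo "SCOREP_REGION_NAMES_BEGIN"
    echo "  EXCLUDE"
    cat "$1" "$2" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        grep -v -e '^$' -e '^#' -e '^EXCLUDE$' -e '^SCOREP_REGION_NAMES_' |
        sort -u |
        sed -e 's/^/    /'
    echo "SCOREP_REGION_NAMES_END"
}

# Usage with hypothetical file names; write to a third file so that the
# current filter file is not overwritten:
#   merge_filters scorep.filt scorep_second_pass.filt > scorep_merged.filt
```

Writing the result to a separate file keeps the current filter file intact, in line with the warning in step 2.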

2.3 Compile time filtering

Runtime filtering only suppresses the measurements; the overhead of the probes themselves remains. You can also apply the filter file during instrumentation of the application to suppress the insertion of probes for the given regions. See the Score-P user manual for details.

2.4 Energy Measurements

Energy measurements with hdeem on Taurus add an overhead of about 5 ms to application profiling with Score-P. It is therefore necessary to check the overhead again once the energy measurements are switched on.

For energy measurements, load the hdeem module and the scorep-hdeem sync plug-in that is compatible with the Score-P built for the READEX tool suite, and set the following environment variables:

module load scorep-hdeem/sync-2017-01-31-git-hdeem2.2.20ms-xmpi-gcc5.3
export SCOREP_METRIC_PLUGINS=hdeem_sync_plugin
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_CONNECTION="INBAND"
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_VERBOSE="WARN"
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_STATS_TIMEOUT_MS=1000

If the overhead for hdeem measurements for the application regions is more than a few percent, you need to switch to manual instrumentation of important coarse-granular regions as shown in Section 2.6.

2.5 Filter OpenMP and MPI regions

Before switching to manual instrumentation, you can remove the instrumentation of MPI routines and OpenMP regions.

To skip the instrumentation of OpenMP regions, use the option --thread=none. As a side effect, no instrumented regions may occur inside parallel regions; otherwise, a runtime error will occur.

Instead of switching off the instrumentation of all OpenMP regions, you can also disable regions selectively via

--opari="--disable=omp:single,master,atomic,critical,barrier"

This still instruments parallel regions, so instrumented regions nested inside them are handled by Score-P as expected.

To disable measurements for MPI routines, add the following line to your batch script:

export SCOREP_MPI_ENABLE_GROUPS=ENV

It suppresses the measurement of all MPI routines except MPI_Init, MPI_Finalize and other environment routines. These are required during DTA with the Periscope Tuning Framework.

2.6 Manual Instrumentation

Finally, if none of the methods above succeeds in reducing the overhead to an acceptable level, switch to manual instrumentation.

Manually annotate regions where most of the computation time is spent. You can find these regions with a standard profiler. It is also recommended to instrument the parents of all the significant regions up until the main caller in the hierarchy. This is an optional step which will allow the annotated regions to be used as identifiers for runtime situations.

  1. Build the application with additional options to disable compiler instrumentation (--nocompiler) and to enable user region instrumentation (--user).
  2. Manually annotate coarse-granular application regions, or any other regions that are of interest for tuning, using SCOREP_USER_REGION_DEFINE inside the function definition as shown below:
    SCOREP_USER_REGION_DEFINE( REGION_HANDLE )
    SCOREP_USER_REGION_BEGIN( REGION_HANDLE, "REGION_NAME", SCOREP_USER_REGION_TYPE_COMMON )
    // application region
    SCOREP_USER_REGION_END( REGION_HANDLE )

Note: You also have to instrument the main routine.

Known issue: The Intel Fortran compiler ifort might report "#error: incomplete macro call" after the macros are inserted. In such a case, the compiler option -extend-source can help.

Example

#include <scorep/SCOREP_User.h>  // Score-P user instrumentation macros

int main() {
    ...
    integrate.run(...);
    ...
}

void Integrate::run(...) {
    SCOREP_USER_REGION_DEFINE( REGION_HANDLE )
    SCOREP_USER_REGION_BEGIN( REGION_HANDLE, "REGION_NAME", SCOREP_USER_REGION_TYPE_COMMON )
    // application region
    SCOREP_USER_REGION_END( REGION_HANDLE )
}

2.7 ATP instrumentation

To exploit application-level tuning parameters (ATPs) in READEX, the code must be annotated and instrumented.

ATP annotation is done manually: pinpoint the parts of the code that can be exposed as application parameters and annotate them with the API functions.

  1. Include the "atplib.h" header file in the source code.
  2. Declare the parameter in the source code with the following functions:
    ATP_PARAM_DECLARE("PARAM_NAME", ATP_PARAM_TYPE_RANGE, DEFAULT_VALUE, NULL);
    ATP_ADD_VALUES("PARAM_NAME", {1,5,1}, 3, NULL);
  3. Add the call for parameter value assignment:
    ATP_PARAM_GET("PARAM_NAME", &control_variable, NULL);
  4. Link the application with the ATP library (-latp).

Example

#include "atplib.h"

void foo() {
    int atp_cv;
    ...
    ATP_PARAM_DECLARE("solver", ATP_PARAM_TYPE_RANGE, 1, NULL);
    // {1,5,1} means a range with a minimum value of 1, a maximum of 5 and an increment of 1
    int solver_values[3] = {1,5,1};
    ATP_ADD_VALUES("solver", solver_values, 3, NULL);
    ATP_PARAM_GET("solver", &atp_cv, NULL);
    switch (atp_cv) {
    case 1:
        // choose algorithm 1
        break;
    case 2:
        // choose algorithm 2
        break;
    ...
    }
}

3 Design Time Analysis

The first step in the DTA is to detect and analyze the dynamism of the application using readex-dyn-detect. The tool automatically identifies the significant regions that are subject to the READEX tuning methodology and generates a report on the potentially exploitable dynamism in these regions.

The following steps describe how to use this tool.

3.1 Apply readex-dyn-detect

The readex-dyn-detect tool requires a single phase region. A phase region is a repetitive region with a single entry and a single exit, typically the body of the main progress loop of the application.

Specify the phase region: Manually annotate the phase region of the application as shown below:
SCOREP_USER_REGION_DEFINE( REGION_HANDLE )
// loop starts
SCOREP_USER_OA_PHASE_BEGIN( REGION_HANDLE, "PHASE_REGION_NAME", SCOREP_USER_REGION_TYPE_COMMON )
// loop body (phase region)
SCOREP_USER_OA_PHASE_END( REGION_HANDLE )
// loop ends

Example: The for-loop body in Integrate::run() is annotated as a phase region in

/projects/p_readex/ichec/test_apps/miniMD_10_ref_alpha_prototype/integrate.cpp

Perform the following steps to use readex-dyn-detect:

  1. Build the application with scorep --online-access --user --thread=none for the manually annotated phase region, and add --nocompiler if the application is manually instrumented.
  2. Run the application with the following environment variables set:
    export SCOREP_PROFILING_FORMAT=cube_tuple
    export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_L3_TCM
    export SCOREP_FILTERING_FILE=<filter_file_name_with_extension>
    This will create a tupled profile.cubex file in the scorep-<xyz> directory at the execution location.
  3. Apply the readex-dyn-detect tool to the profile.cubex file as follows:
    readex-dyn-detect -t <region_granularity_threshold_in_sec> \
                      -p <phase_region_name> \
                      -c <compute_intensity_variation_threshold> \
                      -v <execution_time_variation_threshold_in_percent> \
                      -w <region_execution_time_weight_wrt_phase_execution_time_in_percent> \
                      -r <configuration_file_name_without_extension> \
                      -f <RADAR_report_file_name> \
                      <path_to_cubex_file>/profile.cubex
    The command line options have the following meaning:
    -t This threshold specifies the minimal mean execution time of regions that are to be considered significant regions. Use a value larger than 0.1 (100 ms).
    -p Name of the phase region as given in the instrumentation.
    -c This is the required minimal standard deviation of the compute intensities of significant regions with a weight above the given threshold, such that intra-phase dynamism due to compute intensity variation is reported.
    -v This is the required minimal standard deviation of the execution time of instances of significant regions, in percent of the region’s mean execution time, such that intra-phase dynamism is reported. It is also used to decide whether inter-phase dynamism exists: inter-phase dynamism is reported only if the standard deviation of the phase time, in percent of the mean phase time, is greater than this threshold.
    -w This threshold specifies the minimal weight of a region such that any dynamism due to time variation or compute intensity variation is reported.
    -r This option sets the name of the READEX configuration file (without extension).
    -f If a file name is given, the report is generated in LaTeX form for inclusion in the RADAR report.
  4. The results of readex-dyn-detect are summarized in readex_config.xml in the execution directory, which is used as an input to PTF. An example of readex_config.xml is available in <PTF installation path>/templates/readex_config.xml.default.
    Alternatively, the readex_config.xml file may be created manually from this template and used as input for PTF without applying scorep-autofilter and readex-dyn-detect if the significant regions are already known.
    Note: readex-dyn-detect currently ignores MPI and shared memory regions in the significant regions analysis.

Outcome: The readex_config.xml file containing the tuning potential summary, the list of significant regions, and the intra-phase and inter-phase dynamism due to variation in the execution time and compute intensity.
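To make the -v threshold concrete, the sketch below computes the standard deviation of a set of execution times as a percentage of their mean. It is an illustration under assumptions: the helper name variation_percent and the three timings are hypothetical, and the population standard deviation is used, which may differ from the exact formula inside readex-dyn-detect.

```shell
#!/bin/sh
# variation_percent T1 T2 ...
# Prints the (population) standard deviation of the given execution times
# in percent of their mean. (Hypothetical helper for illustration.)
variation_percent() {
    printf '%s\n' "$@" | awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0    # guard against rounding noise
            printf "%.2f\n", sqrt(var) / mean * 100
        }'
}

# Hypothetical execution times (seconds) of three instances of one region.
variation_percent 1.0 1.2 0.8
```

With these timings the variation is about 16 percent, so a region like this would count as dynamic for any -v value below that, provided it also passes the -t and -w thresholds.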

3.2 Specify the tuning parameters, objectives, the energy metrics and the search strategy

In the next step of the DTA, you may modify the readex_config.xml file by performing the following steps:

  1. Specify the tuning parameters: The READEX tuning plugin supports three tuning parameters: core frequency, uncore frequency and the number of OpenMP threads. At least one tuning parameter must be specified. Specify the ranges (minimum, maximum and step size) for the core frequency in kHz and for the uncore frequency in units of 100 MHz. For OpenMP threads, specify the lower bound and the step size used to increment to the next value.
    Example
    <tuningParameter>
      <frequency>
        <min_freq>1200000</min_freq>
        <max_freq>2400000</max_freq>
        <freq_step>500000</freq_step>
      </frequency>
      <uncore>
        <min_freq>10</min_freq>
        <max_freq>30</max_freq>
        <freq_step>2</freq_step>
      </uncore>
      <openMPThreads>
        <lower_value>1</lower_value>
        <step>2</step>
      </openMPThreads>
    </tuningParameter>
    If application tuning parameters (ATPs) are considered, put the ATP library in DTA mode by setting the following environment variable:
    export ATP_EXECUTION_MODE=DTA
  2. Specify the objectives: Specify at least one objective from Energy, Execution Time, CPU Energy, Energy Delay Product and Energy Delay Product Squared. The plugin measures the values of all specified objectives, but tunes the application for the objective that is specified first.
    Example
    <objectives>
      <objective>Energy</objective>
      <objective>Time</objective>
      <objective>EDP</objective>
      <objective>ED2P</objective>
      <objective>CPUEnergy</objective>
    </objectives>
  3. Specify the energy metrics: Specify the energy plugin name and the associated metric names. For the hdeem_sync_plugin, it is possible to measure the energy for the whole node and/or for each of the two CPUs. The energy metrics should be specified under <periscope> </periscope>.
    Example
    <periscope>
      <metricPlugin>
        <name>hdeem_sync_plugin</name>
      </metricPlugin>
      <metrics>
        <node_energy>hdeem/BLADE/E</node_energy>
        <cpu0_energy>hdeem/CPU0/E</cpu0_energy>
        <cpu1_energy>hdeem/CPU1/E</cpu1_energy>
      </metrics>
    </periscope>
    To specify the RAPL counter energy plugin x86_energy_sync_plugin, use the following configuration:
    Example
    <periscope>
      <metricPlugin>
        <name>x86_energy_sync_plugin</name>
      </metricPlugin>
      <metrics>
        <node_energy>x86_energy/BLADE/E</node_energy>
        <cpu0_energy>x86_energy/CORE0/E</cpu0_energy>
        <cpu1_energy>x86_energy/CORE1/E</cpu1_energy>
      </metrics>
    </periscope>
  4. Specify a search algorithm: Specify a search algorithm from exhaustive, random, individual or genetic search. For the random search strategy, specify the number of samples (scenarios) to which the plugin should limit the search. For the individual search, specify the number of tuning parameter values to keep in the search space. For the genetic search, specify the population size, the maximum number of generations and the timer that sets an upper limit on the tuning execution time. The search algorithm should also be specified under <periscope> </periscope>.
    Example
    <periscope>
      <searchAlgorithm>
        <name>exhaustive</name>
        <name>random</name>
        <samples>2</samples>
        <name>individual</name>
        <keep>2</keep>
        <name>gde3</name>
        <populationSize>10</populationSize>
        <maxGenerations>10</maxGenerations>
        <timer>20</timer>
      </searchAlgorithm>
    </periscope>
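The choice of search algorithm depends on how large the search space is. As a rough sketch, the number of values in each range of the step 1 example can be counted as follows. The helper name count_values is hypothetical, and the sketch assumes that the plugin enumerates min, min + step, ... without exceeding max. The OpenMP thread count is left out because the example only gives its lower bound and step.

```shell
#!/bin/sh
# count_values MIN MAX STEP
# Number of values in the sequence MIN, MIN+STEP, ... that do not exceed MAX.
# (Hypothetical helper; the enumeration rule is an assumption.)
count_values() {
    echo $(( ($2 - $1) / $3 + 1 ))
}

core=$(count_values 1200000 2400000 500000)   # core frequency range in kHz
uncore=$(count_values 10 30 2)                # uncore frequency range in units of 100 MHz
echo "core=$core uncore=$uncore combinations=$((core * uncore))"
```

Under these assumptions an exhaustive search over both frequencies alone would visit every combination, which is where the random, individual or genetic strategies can shorten the tuning time.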

3.3 Apply PTF

The following steps describe how to use PTF. They are contained in the Slurm job script for miniMD below.

  1. Build the application with instrumentation as discussed above (scorep --online-access --user …) for the manually annotated phase.
  2. Set the number of nodes to 2 (line 4), and allocate enough memory per CPU to fit the application as shown in line 12. PTF will use one node for the tool’s agents and the remaining nodes for the application processes.
  3. Load the scorep-hdeem sync plug-in for energy measurements that is compatible with the Score-P built for the READEX tool suite, and set the environment variables as shown in lines 40–45.
  4. Use the parameter control plug-ins compatible with Score-P and PTF, and set the environment variables as shown in lines 36–38.
  5. Apply PTF to the application with the psc_frontend command as shown in lines 52–60. Specify the manually instrumented phase region for the option --phase, the readex_tuning plugin for --tune and the readex configuration file for --config-file.
    The options --info and --selective-info only control debug messages and are optional. For more debug output, set --info=<max info level between 2 and 6> and --selective-info=<comma-separated list of information levels>. For more information about other options, see psc_frontend --help.
    This will produce a tuning model in the execution directory under the name tuning_model.json.
    1 #!/bin/sh
    2
    3 #SBATCH --time=5:00:00 # walltime
    4 #SBATCH --nodes=2 # number of nodes
    5 #SBATCH --ntasks=8
    6 #SBATCH --tasks-per-node=8
    7 #SBATCH --cpus-per-task=1
    8 #SBATCH --exclusive
    9 #SBATCH --partition=haswell
    10 #SBATCH --comment="cpufreqchown"
    11 #SBATCH --cpu-freq=Low
    12 #SBATCH --mem-per-cpu=2500M # memory per CPU core
    13 #SBATCH -J "miniMD_PTF" # job name
    14 #SBATCH -A p_readex
    15
    16 handle_error() {
    17 if [ $1 != 0 ]; then
    18 echo "Error on $2."
    19 exit 1
    20 fi
    21 }
    22
    23 echo "run PTF begin."
    24
    25 module purge
    26 module use /projects/p_readex/modules
    27 module load readex/ci_readex_bullxmpi1.2.8.4_gcc5.3.0
    28 handle_error $? "loading READEX CI module."
    29
    30 INPUT_FILE=in.lj.miniMD
    31 PHASE=INTEGRATE_RUN_LOOP
    32 NP=8 # check against --ntasks and tasks-per-node
    33
    34 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
    35
    36 export SCOREP_SUBSTRATE_PLUGINS=rrl
    37 export SCOREP_RRL_PLUGINS=cpu_freq_plugin,uncore_freq_plugin
    38 export SCOREP_RRL_VERBOSE="WARN"
    39
    40 module load scorep-hdeem/sync-2017-01-31-git-hdeem2.2.20ms-xmpi-gcc5.3
    41 export SCOREP_METRIC_PLUGINS=hdeem_sync_plugin
    42 export SCOREP_METRIC_PLUGINS_SEP=";"
    43 export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_CONNECTION="INBAND"
    44 export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_VERBOSE="WARN"
    45 export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_STATS_TIMEOUT_MS=1000
    46
    47 export SCOREP_MPI_ENABLE_GROUPS=ENV
    48 export SCOREP_FILTERING_FILE=scorep.filt
    49
    50 export OMP_NUM_THREADS=1
    51
    52 psc_frontend --apprun="./miniMD_openmpi -i $INPUT_FILE" \
    53 --mpinumprocs=$NP \
    54 --ompnumthreads=1 \
    55 --phase=$PHASE \
    56 --tune=readex_tuning \
    57 --config-file=readex_config.xml \
    58 --info=2 \
    59 --force-localhost \
    60 --selective-info=AutotuneAll,AutotunePlugins
    61
    62 handle_error $? "run PTF"
    63
    64 echo "run PTF done."
    To use the RAPL counter energy plugin instead, replace lines 40–45 with the following:
    module load scorep_plugin_x86_energy
    export SCOREP_METRIC_PLUGINS=x86_energy_sync_plugin
    export SCOREP_METRIC_X86_ENERGY_SYNC_PLUGIN=*/E
    export SCOREP_METRIC_PLUGINS_SEP=";"
    export SCOREP_METRIC_X86_ENERGY_SYNC_PLUGIN_CONNECTION="INBAND"
    export SCOREP_METRIC_X86_ENERGY_SYNC_PLUGIN_VERBOSE="WARN"
    export SCOREP_METRIC_X86_ENERGY_SYNC_PLUGIN_STATS_TIMEOUT_MS=1000


Outcome:

  • A printed summary of the created scenarios, the properties found in each scenario, the optimum and the worst scenarios for the phase, the measured objective values for the phase in each scenario, the best configuration for each rts, the static and dynamic energy savings for the rts’s, and the static energy savings for the whole phase.
  • A tuning_model.json file containing the list of rts’s that were tuned by the plugin, the scenarios they are classified into, and the best configuration for each scenario.

 

4 Runtime Application Tuning

4.1 Tune with RRL

The following steps describe how to use RRL to tune the application and compare the execution time and energy consumption with an untuned run of the application.

  1. If application tuning parameters are exploited in the application, the ATP-related instrumentation functions should remain in the code.
  2. Use an uninstrumented version of the application to compare its energy consumption and execution time against the version tuned with RRL.
  3. For the application run tuned with RRL, use the application built for analysis with PTF as described in Section 3.
  4. Set the number of nodes to run the application on (line 4), and allocate enough memory per CPU to fit the application (line 10).
  5. Load the scorep-hdeem plug-in for energy measurements using the HDEEM command-line tool on Taurus (line 18).
  6. For the untuned run of the application (lines 20–54), perform the following steps:
    1. Disable Score-P profiling and tracing (lines 21 and 22), and set the Score-P substrate plugins, RRL tuning plugins and the tuning model to empty (lines 23–25).
    2. Before running the uninstrumented version of the application, miniMD_openmpi_plain (line 33), start the HDEEM energy measurements on all nodes (lines 29–30) and get the start timestamp (line 31).
    3. After the application run is complete, get the end timestamp (line 35), stop the HDEEM measurements and print the statistics from all nodes into the file hdeem.out (lines 36–38).
    4. Aggregate the energy consumption for the untuned run of the application from hdeem.out (lines 40–53).
  7. For the RRL-tuned run of the application (lines 56–92), perform the following steps:
    1. Disable Score-P profiling and tracing (lines 59 and 60), set the Score-P substrate plugins to rrl, the RRL plugins to the tuning plugins to use (cpu_freq_plugin and uncore_freq_plugin in this example) and the tuning model to the file generated by PTF (lines 61–63).
    2. Before running the RRL-tuned version of the application, miniMD_openmpi (line 71), start the HDEEM energy measurements on all nodes (lines 67–68) and get the start timestamp (line 69).
    3. After the application run is complete, get the end timestamp (line 73), stop the HDEEM measurements and print the statistics from all nodes into the file hdeem.out (lines 74–76).
    4. Aggregate the energy consumption for the RRL-tuned run of the application from hdeem.out (lines 78–91).

1 #!/bin/sh
2
3 #SBATCH --time=2:00:00
4 #SBATCH --nodes=1
5 #SBATCH --ntasks=24
6 #SBATCH --tasks-per-node=24
7 #SBATCH --cpus-per-task=1
8 #SBATCH --exclusive
9 #SBATCH --partition=haswell
10 #SBATCH --mem-per-cpu=2500M
11 #SBATCH -J "miniMD_rrl"
12 #SBATCH -A p_readex
13
14 INPUT_FILE=in.lj.miniMD
15 energy_label="Energy"
16 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
17
18 module load scorep-hdeem/2016-12-20-hdeem-2.2.20ms
19
20 # start plain run
21 export SCOREP_ENABLE_PROFILING="false"
22 export SCOREP_ENABLE_TRACING="false"
23 export SCOREP_SUBSTRATE_PLUGINS=""
24 export SCOREP_RRL_PLUGINS=""
25 export SCOREP_RRL_TMM_PATH=""
26 export SCOREP_MPI_ENABLE_GROUPS=ENV
27
28 # start measurements
29 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 clearHdeem
30 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 startHdeem
31 start_time=$(($(date +%s%N)/1000000))
32 # run untuned application
33 srun ./miniMD_openmpi_plain -i $INPUT_FILE
34 # stop measurements
35 stop_time=$(($(date +%s%N)/1000000))
36 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 stopHdeem
37 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 sleep 5
38 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 checkHdeem >> hdeem.out
39
40 # aggregate energy measurements from HDEEM
41 energy_total=0
42 if [ -e hdeem.out ]; then
43 exec < hdeem.out
44 while read max max_unit min min_unit average average_unit energy energy_unit; do
45 if [ "$energy" = "$energy_label" ]; then
46 read blade max_val min_val average_val energy_val
47 energy_total=$(echo "${energy_total} + ${energy_val}" | bc)
48 fi
49 done
50 time_total=$(echo "${stop_time} - ${start_time}" | bc)
51 echo "Untuned run: Total time = $time_total ms, Total energy = $energy_total J"
52 rm -rf hdeem.out
53 fi
54 # end plain run
55
56 # start RRL-tuned run
57 module use /projects/p_readex/modules
58 module load readex/ci_readex_bullxmpi1.2.8.4_gcc5.3.0
59 export SCOREP_ENABLE_PROFILING="false"
60 export SCOREP_ENABLE_TRACING="false"
61 export SCOREP_SUBSTRATE_PLUGINS="rrl"
62 export SCOREP_RRL_PLUGINS="cpu_freq_plugin,uncore_freq_plugin"
63 export SCOREP_RRL_TMM_PATH="tuning_model.json"
64 export SCOREP_MPI_ENABLE_GROUPS=ENV
65
66 # start measurements
67 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 clearHdeem
68 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 startHdeem
69 start_time=$(($(date +%s%N)/1000000))
70 # run RRL-tuned application
71 srun ./miniMD_openmpi -i $INPUT_FILE
72 # stop measurements
73 stop_time=$(($(date +%s%N)/1000000))
74 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 stopHdeem
75 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 sleep 5
76 srun -N 1 -n 1 --ntasks-per-node=1 -c 1 checkHdeem >> hdeem.out
77
78 # aggregate energy measurements from HDEEM
79 energy_total=0
80 if [ -e hdeem.out ]; then
81 exec < hdeem.out
82 while read max max_unit min min_unit average average_unit energy energy_unit; do
83 if [ "$energy" = "$energy_label" ]; then
84 read blade max_val min_val average_val energy_val
85 energy_total=$(echo "${energy_total} + ${energy_val}" | bc)
86 fi
87 done
88 time_total=$(echo "${stop_time} - ${start_time}" | bc)
89 echo "RRL-tuned run: Total time = $time_total ms, Total energy = $energy_total J"
90 rm -rf hdeem.out
91 fi
92 # end RRL-tuned run

Outcome:

  • The total execution time and energy consumption of the untuned run of the application and the run tuned by RRL are printed for comparison.
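The two printed totals can be turned into relative savings. The following sketch is not part of the batch script: the helper name savings_percent and the sample values are hypothetical. It works the same way for energy in joules and for time in milliseconds.

```shell
#!/bin/sh
# savings_percent UNTUNED TUNED
# Prints how much smaller the tuned value is than the untuned one, in
# percent. A negative result means the tuned run was worse.
# (Hypothetical helper for illustration.)
savings_percent() {
    awk -v base="$1" -v tuned="$2" \
        'BEGIN { printf "%.2f\n", (base - tuned) / base * 100 }'
}

# Hypothetical totals: 2000 J for the untuned run, 1800 J for the tuned run.
savings_percent 2000 1800
```

Comparing both the energy and the time savings shows whether an energy reduction was bought with a longer run time.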

Example: The batch script presented above is available in

/projects/p_readex/ichec/test_apps/miniMD_10_ref_alpha_prototype/run_rrl.sh

For different applications, run_rrl.sh can be reused by updating the commands that run the application in lines 33 and 71. The script must be run from the directory that contains the application’s executable.

4.2 Visualize Configuration Switching

In addition to the steps specified in Section 4.1, perform the following steps to visualize the configuration switching for each tuning parameter.
Because the visualization is implemented as a synchronous plugin, Score-P supports it only in profiling mode. To get the metrics into the trace, both profiling and tracing have to be enabled.

export SCOREP_ENABLE_TRACING=true
export SCOREP_ENABLE_PROFILING=true

  1. Set the environment variable that specifies the metric plugin from RRL for visualizing tuning parameters as metrics in the Vampir trace.
    export SCOREP_METRIC_PLUGINS="scorep_substrate_rrl"
  2. Set the environment variable that specifies the tuning parameters to be added to the trace. For the hardware and software tuning parameters, the names of the PCPs are used. All hardware and software parameters can be loaded by simply setting the environment variable to "*". Application tuning parameters (ATPs) need to be specified explicitly. To load an ATP, set the value to 'ATP/<atp_name>', where atp_name is the name of the ATP. The prefix 'ATP/' is required to recognize ATPs.
    export SCOREP_METRIC_SCOREP_SUBSTRATE_RRL='ATP/<atp_name>, <pcp_name>'

Example

  1. The environment variables to specify RRL as the metric plug-in and to view the CPU frequency switching in the trace in Vampir can be set as follows:
    export SCOREP_METRIC_PLUGINS="scorep_substrate_rrl"
    export SCOREP_METRIC_SCOREP_SUBSTRATE_RRL='cpu_freq_plugin'
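When several parameters are to be traced, the value can be assembled from two lists. This is a minimal sketch under assumptions: the helper name rrl_metric_list is hypothetical, the ATP name solver is taken from the example in Section 2.7, the PCP names are assumed to match the plugin names used in the batch scripts, and joining the entries with commas and no whitespace is a guess about what the plugin accepts.

```shell
#!/bin/sh
# rrl_metric_list "ATP_NAMES" "PCP_NAMES"
# Both arguments are space-separated lists. ATP names get the required
# ATP/ prefix; all entries are joined with commas.
# (Hypothetical helper for illustration.)
rrl_metric_list() {
    list=""
    for atp in $1; do
        list="${list:+$list,}ATP/$atp"
    done
    for pcp in $2; do
        list="${list:+$list,}$pcp"
    done
    printf '%s\n' "$list"
}

SCOREP_METRIC_SCOREP_SUBSTRATE_RRL="$(rrl_metric_list "solver" "cpu_freq_plugin uncore_freq_plugin")"
export SCOREP_METRIC_SCOREP_SUBSTRATE_RRL
echo "$SCOREP_METRIC_SCOREP_SUBSTRATE_RRL"
```

Building the list in one place avoids forgetting the ATP/ prefix when parameters are added later.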