Example – NPB-OMP BT Benchmark

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications. Problem sizes in NPB are predefined and grouped into classes. In this example we use the pseudo-application “BT” (Block Tri-diagonal solver) with the problem size class “D”.

Prerequisite

This tutorial assumes that you have installed the READEX components and that their binaries are available in your $PATH. We use the Intel Compiler Suite and Intel MPI in this test case; other compilers and MPI versions may work as well.

Preparation

First, we download the NPB3.3.1 tarball of the benchmark suite, extract the archive and change to the NPB3.3.1/NPB3.3-OMP directory:

mian@tauruslogin5:~/web_examples> wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
mian@tauruslogin5:~/web_examples> tar -xzvf NPB3.3.1.tar.gz
mian@tauruslogin5:~/web_examples> cd NPB3.3.1/NPB3.3-OMP

Copy the file “config/make.def.template” to “config/make.def”

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> cp config/make.def.template config/make.def

Edit “config/make.def” and set the appropriate F77 compiler. In this case, F77 is set to the Intel compiler invocation “ifort -qopenmp -mcmodel=medium”.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> vim config/make.def
...
#---------------------------------------------------------------------------
# This is the fortran compiler used for Fortran programs
#---------------------------------------------------------------------------
F77 =ifort -qopenmp -mcmodel=medium
# This links fortran programs; usually the same as ${F77}
FLINK = $(F77)
...
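
The same change can be made without an interactive editor. The following is a minimal sketch, assuming that make.def contains exactly one line starting with “F77 =”; the helper name set_f77 and the scratch file are hypothetical and only serve to illustrate the edit.

```shell
#!/bin/sh
# Hypothetical helper: replace the F77 line of an NPB make.def in place.
# $1: path to make.def, $2: compiler command to use
set_f77() {
    # -i.bak keeps a backup copy and works with both GNU and BSD sed
    sed -i.bak "s|^F77[[:space:]]*=.*|F77 = $2|" "$1"
}

# Demo on a scratch file that mimics the two relevant lines of make.def
demo=$(mktemp)
printf 'F77 = f77\nFLINK = $(F77)\n' > "$demo"

set_f77 "$demo" "ifort -qopenmp -mcmodel=medium"
grep '^F77' "$demo"
```

On the real file, the call would be set_f77 config/make.def "ifort -qopenmp -mcmodel=medium". The FLINK line is left untouched because it refers to $(F77).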

Now we compile the “bt” benchmark for CLASS=D.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> make bt CLASS=D
============================================
= NAS PARALLEL BENCHMARKS 3.3 =
= OpenMP Versions =
= F77/C =
============================================
cd BT; make CLASS=D VERSION=
make[1]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
../sys/setparams bt D
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
ifort -qopenmp -mcmodel=medium -O -o ../bin/bt.D.x bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o adi.o rhs.o x_solve.o y_solve.o solve_subs.o z_solve.o add.o error.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[1]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'

After the benchmark is compiled, a binary named “bt.D.x” is created in the “bin” directory. We rename it so that the original (non-instrumented) program is kept for later comparison.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> mv bin/bt.D.x bin/bt.D.x_clean

First Compilation with Score-P and application of Autofilter

Now we can compile the benchmark with Score-P. We edit the “config/make.def” and change the compiler to “F77 = scorep ifort -qopenmp -mcmodel=medium” and then compile the benchmark.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> vim config/make.def
...
#---------------------------------------------------------------------------
# This is the fortran compiler used for Fortran programs
#---------------------------------------------------------------------------
F77 =scorep ifort -qopenmp -mcmodel=medium
# This links fortran programs; usually the same as ${F77}
FLINK = $(F77)
...

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> make bt CLASS=D
============================================
= NAS PARALLEL BENCHMARKS 3.3 =
= OpenMP Versions =
= F77/C =
============================================
cd BT; make CLASS=D VERSION=
make[1]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
../sys/setparams bt D
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
scorep ifort -qopenmp -mcmodel=medium -c -O bt.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/bt.f(48): (col. 16) remark: MAIN__ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O initialize.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(210): (col. 18) remark: lhsinit_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(4): (col. 19) remark: initialize_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O exact_solution.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f(4): (col. 18) remark: exact_solution_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O exact_rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f(5): (col. 18) remark: exact_rhs_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O set_constants.f
scorep ifort -qopenmp -mcmodel=medium -c -O adi.f
scorep ifort -qopenmp -mcmodel=medium -c -O rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/rhs.f(4): (col. 18) remark: compute_rhs_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O x_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f(5): (col. 18) remark: x_solve_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O y_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f(4): (col. 18) remark: y_solve_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O solve_subs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(5): (col. 18) remark: matvec_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(58): (col. 18) remark: matmul_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(206): (col. 18) remark: binvcrhs_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(494): (col. 18) remark: binvrhs_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O z_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f(4): (col. 18) remark: z_solve_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O add.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/add.f(4): (col. 19) remark: add_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O error.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(67): (col. 18) remark: rhs_norm_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(4): (col. 18) remark: error_norm_ has been targeted for automatic cpu dispatch
scorep ifort -qopenmp -mcmodel=medium -c -O verify.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/verify.f(5): (col. 20) remark: verify_ has been targeted for automatic cpu dispatch
cd ../common; scorep ifort -qopenmp -mcmodel=medium -c -O print_results.f
cd ../common; scorep ifort -qopenmp -mcmodel=medium -c -O timers.f
cd ../common; cc -c -O -o wtime.o ../common/wtime.c
scorep ifort -qopenmp -mcmodel=medium -O -o ../bin/bt.D.x bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o adi.o rhs.o x_solve.o y_solve.o solve_subs.o z_solve.o add.o error.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[1]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'

Now we run the benchmark for the first time. We use srun (instead of mpirun) since taurus uses the SLURM batch system. We use one compute node with 24 processor cores.

To create the filter file, we remove the instrumentation of all functions that run for less than 100 ms (this threshold can be changed with the -t switch) and create an Intel filter file (since we use an Intel compiler). Please make sure to use the correct scorep folder; yours will have a different timestamp.

Now we can recompile “bt” with the given filter file for next steps.

Phase Instrumentation and Application of Periscope

Phase Instrumentation

Next, we instrument the phase manually. We add the definition of the phase region (SCOREP_USER_REGION_DEFINE) and mark the beginning and end of the loop body (SCOREP_USER_OA_PHASE_BEGIN, SCOREP_USER_OA_PHASE_END). We also include the Fortran header file scorep/SCOREP_User.inc.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> vim BT/bt.f
c---------------------------------------------------------------------
c---------------------------------------------------------------------
#include "scorep/SCOREP_User.inc"
program BT
c---------------------------------------------------------------------
include 'header.h'
SCOREP_USER_REGION_DEFINE( my_region_handle )
...
...
...
c---------------------------------------------------------------------
c do one time step to touch all code, and reinitialize
c---------------------------------------------------------------------
call adi
call initialize
do i = 1, t_last
call timer_clear(i)
end do
call timer_start(1)
do step = 1, niter
SCOREP_USER_OA_PHASE_BEGIN(my_region_handle,"phase",SCOREP_USER_REGION_TYPE_COMMON )
if (mod(step, 20) .eq. 0 .or.
> step .eq. 1) then
write(*, 200) step
200 format(' Time step ', i4)
endif
call adi
SCOREP_USER_OA_PHASE_END( my_region_handle )
end do
...
...
...
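
Since a class D build takes a while, it can be worth checking the instrumentation before recompiling. The following is a minimal sketch of such a check; the helper check_phase_macros is hypothetical (it is not part of Score-P or READEX) and only counts the macro names in a source file. The scratch file stands in for BT/bt.f.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file defines a region handle and
# contains exactly one OA phase begin and one OA phase end macro.
check_phase_macros() {
    define=$(grep -c 'SCOREP_USER_REGION_DEFINE' "$1")
    begin=$(grep -c 'SCOREP_USER_OA_PHASE_BEGIN' "$1")
    end=$(grep -c 'SCOREP_USER_OA_PHASE_END' "$1")
    [ "$define" -ge 1 ] && [ "$begin" -eq 1 ] && [ "$end" -eq 1 ]
}

# Demo on a scratch file that mimics the instrumented lines of BT/bt.f
src=$(mktemp)
cat > "$src" <<'EOF'
#include "scorep/SCOREP_User.inc"
      SCOREP_USER_REGION_DEFINE( my_region_handle )
      do step = 1, niter
      SCOREP_USER_OA_PHASE_BEGIN(my_region_handle,"phase",SCOREP_USER_REGION_TYPE_COMMON )
      call adi
      SCOREP_USER_OA_PHASE_END( my_region_handle )
      end do
EOF

if check_phase_macros "$src"; then
    result="ok"
else
    result="unbalanced"
fi
echo "phase instrumentation: $result"
```

The check is purely textual: it does not verify that the begin and end macros enclose the intended loop body.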

Preparing Design Time Analysis (Compilation)

We recompile the application, this time enabling the user instrumentation and the Online Access interface, which is needed by Periscope. Before doing so, we make a copy of the old binary for later use with the READEX Runtime Library.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> cp bin/bt.D.x bin/bt.D.x_rrl
mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> vim config/make.def

...
#---------------------------------------------------------------------------
# This is the fortran compiler used for Fortran programs
#---------------------------------------------------------------------------
F77 =scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source
# This links fortran programs; usually the same as ${F77}
FLINK = $(F77)
...

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> make bt CLASS=D
============================================
= NAS PARALLEL BENCHMARKS 3.3 =
= OpenMP Versions =
= F77/C =
============================================
cd BT; make CLASS=D VERSION=
make[1]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
../sys/setparams bt D
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O bt.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/bt.f(49): (col. 16) remark: MAIN__ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O initialize.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(210): (col. 18) remark: lhsinit_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(4): (col. 19) remark: initialize_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O exact_solution.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f(4): (col. 18) remark: exact_solution_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O exact_rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f(5): (col. 18) remark: exact_rhs_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O set_constants.f
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O adi.f
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/rhs.f(4): (col. 18) remark: compute_rhs_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O x_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f(5): (col. 18) remark: x_solve_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O y_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f(4): (col. 18) remark: y_solve_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O solve_subs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(5): (col. 18) remark: matvec_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(58): (col. 18) remark: matmul_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(206): (col. 18) remark: binvcrhs_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(494): (col. 18) remark: binvrhs_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O z_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f(4): (col. 18) remark: z_solve_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O add.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/add.f(4): (col. 19) remark: add_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O error.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(67): (col. 18) remark: rhs_norm_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(4): (col. 18) remark: error_norm_ has been targeted for automatic cpu dispatch
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O verify.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/verify.f(5): (col. 20) remark: verify_ has been targeted for automatic cpu dispatch
cd ../common; scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O print_results.f
cd ../common; scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O timers.f
cd ../common; cc -c -O -o wtime.o ../common/wtime.c
scorep --online-access --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -O -o ../bin/bt.D.x bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o adi.o rhs.o x_solve.o y_solve.o solve_subs.o z_solve.o add.o error.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[1]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'

Applying readex-dyn-detect and Creating readex_config.xml

Now everything is ready for Design Time Analysis. First, we run the application to produce the profile used by readex-dyn-detect. Afterwards, we run readex-dyn-detect to create a configuration file for Periscope.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> export SCOREP_PROFILING_FORMAT=cube_tuple
mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_L3_TCM
mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> export SCOREP_EXPERIMENT_DIRECTORY=readex_dyn_detect
mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> srun -n 1 -c 24 --mem-per-cpu=6G -p haswell ./bin/bt.D.x
mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> readex-dyn-detect -p phase readex_dyn_detect/profile.cubex
Reading readex_dyn_detect/profile.cubex...
Done.
Granularity threshold: 0.1
32candidate_nodes
Phase node phase found in candidate regions. It has 1 children.
Granularity of MAIN__: 1058.71, Number of children: 6
Granularity of initialize_: 0.372545, Number of children: 1
Granularity of !$omp parallel @initialize.f:21: 0.370085, Number of children: 0
Granularity of exact_rhs_: 0.760675, Number of children: 1
Granularity of !$omp parallel @exact_rhs.f:19: 0.760091, Number of children: 0
Granularity of adi_: 4.20779, Number of children: 5
Granularity of compute_rhs_: 0.973933, Number of children: 1
Granularity of !$omp parallel @rhs.f:17: 0.973826, Number of children: 0
Granularity of x_solve_: 0.817286, Number of children: 1
Granularity of !$omp parallel @x_solve.f:44: 0.817243, Number of children: 0
Granularity of y_solve_: 1.08139, Number of children: 1
Granularity of !$omp parallel @y_solve.f:42: 1.08135, Number of children: 0
Granularity of z_solve_: 1.25959, Number of children: 1
Granularity of !$omp parallel @z_solve.f:42: 1.25955, Number of children: 0
Granularity of !$omp parallel @add.f:18: 0.0755035, Number of children: 0
Granularity of phase: 4.20595, Number of children: 1
Granularity of adi_: 4.20779, Number of children: 5
Granularity of compute_rhs_: 0.973933, Number of children: 1
Granularity of !$omp parallel @rhs.f:17: 0.973826, Number of children: 0
Granularity of x_solve_: 0.817286, Number of children: 1
Granularity of !$omp parallel @x_solve.f:44: 0.817243, Number of children: 0
Granularity of y_solve_: 1.08139, Number of children: 1
Granularity of !$omp parallel @y_solve.f:42: 1.08135, Number of children: 0
Granularity of z_solve_: 1.25959, Number of children: 1
Granularity of !$omp parallel @z_solve.f:42: 1.25955, Number of children: 0
Granularity of !$omp parallel @add.f:18: 0.0755035, Number of children: 0
Granularity of verify_: 1.03483, Number of children: 3
Granularity of !$omp parallel @error.f:24: 0.0377083, Number of children: 0
Granularity of compute_rhs_: 0.973933, Number of children: 1
Granularity of !$omp parallel @rhs.f:17: 0.973826, Number of children: 0
Granularity of !$omp parallel @error.f:82: 0.0248786, Number of children: 0
Granularity of !$omp parallel @print_results.f:25: 1.82854e-05, Number of children: 0
27 rest_nodes
27 coarse_nodes
Phase node phase
There is a phase region

Granularity of phase: 4.20595
Granularity of adi_: 4.20779
Granularity of compute_rhs_: 0.973933
Granularity of x_solve_: 0.817286
Granularity of y_solve_: 1.08139
Granularity of z_solve_: 1.25959

Candidate regions are:
phase
adi_
compute_rhs_
x_solve_
y_solve_
z_solve_

Call node: compute_rhs_ Inclusive Time 5827.68
Parent node: adi_ Exclusive Time 0.0176662
Call node: x_solve_ Inclusive Time 4903.4
Parent node: adi_ Exclusive Time 0.0176662
Call node: y_solve_ Inclusive Time 6488.15
Parent node: adi_ Exclusive Time 0.0176662
Call node: z_solve_ Inclusive Time 7557.32
Parent node: adi_ Exclusive Time 0.0176662

Significant regions are:
compute_rhs_
x_solve_
y_solve_
z_solve_

Significant region information
==============================
Region name     Min(t)   Max(t)   Time       Time Dev.(%Reg)   Ops/L3miss   Weight(%Phase)
compute_rhs_    0.963    0.982    5827.675   0.0               16           23
x_solve_        0.816    0.892    4903.400   0.0               5046         19
y_solve_        1.080    1.187    6488.150   0.0               118          26
z_solve_        1.259    1.267    7557.325   0.0               123          30

Phase information
=================
Min         Max       Mean      Time      Dev.(% Phase)   Dyn.(% Phase)
0.0754267   4.38705   100.918   25229.5   0               4.2724

threshold time variation (percent of mean region time): 0.000000
threshold compute intensity deviation (#ops/L3 miss): 0.000000
threshold region importance (percent of phase exec. time): 0.000000

SUMMARY:
========
Inter-phase dynamism due to variation of the execution time of phases
No intra-phase dynamism due to time variation
Intra-phase dynamism due to variation in the compute intensity of the following important significant regions
compute_rhs_
x_solve_
y_solve_
z_solve_
Writing into the configuration file...
Config file did not exist. Copied template with cp /lustre/ssd/p_readex/ptf/ci_readex_intelmpi2017.2.174_intel2017.2.174_slurm_starter_with_scorep/templates/readex_config.xml.default readex_config.xml
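
The Weight(%Phase) column in the output above appears to be each region's Time divided by the phase Time (25229.5). This is an observation from the numbers shown here, not a documented definition. A minimal sketch that reproduces the column from the values printed above:

```shell
#!/bin/sh
# Phase time and per-region times copied from the readex-dyn-detect output above.
phase_time=25229.5
regions="compute_rhs_ 5827.675
x_solve_ 4903.400
y_solve_ 6488.150
z_solve_ 7557.325"

# Assumption: weight = region time / phase time, rounded to whole percent.
weights=$(printf '%s\n' "$regions" |
    awk -v total="$phase_time" '{ printf "%s %.0f\n", $1, $2 / total * 100 }')

echo "$weights"
# compute_rhs_ 23
# x_solve_ 19
# y_solve_ 26
# z_solve_ 30
```

The four regions together account for about 98% of the phase time, which is consistent with adi_ having almost no exclusive time.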

Next, we adapt the configuration file to the available frequencies. On a compute node, the file /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies lists the available core frequencies in kHz: 2501000 2500000 2400000 2300000 2200000 2100000 2000000 1900000 1800000 1700000 1600000 1500000 1400000 1300000 1200000
We want to test core frequencies between 1.3 and 2.5 GHz in steps of 200 MHz. For concurrency throttling, we test thread counts between 4 and 24 in steps of 4. For uncore frequency scaling, we test uncore frequencies between 1.4 and 3.0 GHz in steps of 400 MHz. We therefore change the file readex_config.xml accordingly:

...
<tuningParameter>
<frequency>
<min_freq>1300</min_freq>
<max_freq>2500</max_freq>
<freq_step>200</freq_step>
<default>2500</default>
</frequency>
<uncore>
<min_freq>1400</min_freq>
<max_freq>3000</max_freq>
<freq_step>400</freq_step>
<default>3000</default>
</uncore>
<openMPThreads>
<lower_value>4</lower_value>
<step>4</step>
</openMPThreads>
</tuningParameter>
...
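
To see which values these ranges produce, the settings can be enumerated with seq. The following is a minimal sketch; the upper thread bound of 24 is an assumption taken from the number of cores per node (the XML fragment only gives the lower value and the step), and the product is only an upper bound on the number of configurations, since the search strategy used by Periscope may not test all of them.

```shell
#!/bin/sh
# Core frequencies in kHz as printed by scaling_available_frequencies above.
avail_khz="2501000 2500000 2400000 2300000 2200000 2100000 2000000 1900000 1800000 1700000 1600000 1500000 1400000 1300000 1200000"

# Values implied by readex_config.xml (MHz for frequencies).
core=$(seq 1300 200 2500)
uncore=$(seq 1400 400 3000)
threads=$(seq 4 4 24)   # assumption: upper bound 24 = cores per node

# Report any requested core frequency that the node does not offer.
missing=0
for f in $core; do
    case " $avail_khz " in
        *" ${f}000 "*) ;;
        *) echo "core frequency $f MHz is not available"; missing=$((missing + 1)) ;;
    esac
done

# Count the values per parameter and the size of the full grid.
count() { echo $#; }
n_core=$(count $core)
n_uncore=$(count $uncore)
n_threads=$(count $threads)
total=$((n_core * n_uncore * n_threads))

echo "$n_core core x $n_uncore uncore x $n_threads thread settings = $total combinations"
# 7 core x 5 uncore x 6 thread settings = 210 combinations
```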

Applying Periscope Tuning Framework and Creating the tuning model

Now we have a configuration file that we can use with Periscope to analyze the program. We apply the Periscope frontend to the application using the SLURM Periscope starter, launched from a SLURM batch file like the following. First, we define some SLURM parameters (see the SLURM documentation to learn more). Then we set up the READEX Runtime Library and the energy monitoring plugin. There is one plugin for HDEEM and one for RAPL and APM; here, HDEEM is used. Finally, we start Periscope.

#!/bin/bash
#SBATCH --time=08:00:00 # walltime
#SBATCH --nodes=2 # number of nodes
##SBATCH --ntasks=
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --exclusive
#SBATCH --partition=haswell
#SBATCH --reservation=READEX
#SBATCH --mem-per-cpu=2500M # memory per CPU core
#SBATCH -A p_readex
#SBATCH -J "bt_PTF" # job name
###############################################################################
# loading READEX environment
module purge
module use /projects/p_readex/modules
module load readex/ci_readex_intelmpi2017.2.174_intel2017.2.174
# set-up Score-P
export SCOREP_SUBSTRATE_PLUGINS=rrl
export SCOREP_RRL_PLUGINS=cpu_freq_plugin,uncore_freq_plugin,OpenMPTP
export SCOREP_RRL_VERBOSE="WARN"
# set-up energy measurement
module load scorep-hdeem/sync-intelmpi-intel2017
export SCOREP_METRIC_PLUGINS=hdeem_sync_plugin
export SCOREP_METRIC_PLUGINS_SEP=";"
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_CONNECTION="INBAND"
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_VERBOSE="WARN"
export SCOREP_METRIC_HDEEM_SYNC_PLUGIN_STATS_TIMEOUT_MS=1000
#lower instrumentation overhead
export SCOREP_MPI_ENABLE_GROUPS=ENV
# Define Thread affinity
export GOMP_CPU_AFFINITY=0-23
export OMP_NUM_THREADS=24
PHASE=phase
psc_frontend --apprun="./bin/bt.D.x_ptf" --mpinumprocs=1 --ompnumthreads=24 --phase=$PHASE --tune=readex_intraphase --config-file=readex_config.xml --force-localhost --info=2 --selective-info=AutotuneAll,AutotunePlugins

Applying the Tuning Model

From Periscope, we obtained a tuning model (tuning_model.json) that we can now apply. First, we remove the Online Access instrumentation by recompiling “bt” without the Online Access interface.

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> vim config/make.def
...
#---------------------------------------------------------------------------
# This is the fortran compiler used for Fortran programs
#---------------------------------------------------------------------------
F77 =scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source
# This links fortran programs; usually the same as ${F77}
FLINK = $(F77)
...

mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP> make bt CLASS=D
============================================
= NAS PARALLEL BENCHMARKS 3.3 =
= OpenMP Versions =
= F77/C =
============================================
cd BT; make CLASS=D VERSION=
make[1]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
cc -o setparams setparams.c
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/sys'
../sys/setparams bt D
make[2]: Entering directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make.def modified. Rebuilding npbparams.h just in case
rm -f npbparams.h
../sys/setparams bt D
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O bt.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/bt.f(49): (col. 16) remark: MAIN__ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O initialize.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(210): (col. 18) remark: lhsinit_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/initialize.f(4): (col. 19) remark: initialize_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O exact_solution.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_solution.f(4): (col. 18) remark: exact_solution_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O exact_rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/exact_rhs.f(5): (col. 18) remark: exact_rhs_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O set_constants.f
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O adi.f
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O rhs.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/rhs.f(4): (col. 18) remark: compute_rhs_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O x_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/x_solve.f(5): (col. 18) remark: x_solve_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O y_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f(4): (col. 18) remark: y_solve_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O solve_subs.f
~/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(5): (col. 18) remark: matvec_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(58): (col. 18) remark: matmul_sub_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(206): (col. 18) remark: binvcrhs_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f(494): (col. 18) remark: binvrhs_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O z_solve.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/z_solve.f(4): (col. 18) remark: z_solve_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O add.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/add.f(4): (col. 19) remark: add_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O error.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(67): (col. 18) remark: rhs_norm_ has been targeted for automatic cpu dispatch
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/error.f(4): (col. 18) remark: error_norm_ has been targeted for automatic cpu dispatch
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O verify.f
/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT/verify.f(5): (col. 20) remark: verify_ has been targeted for automatic cpu dispatch
cd ../common; scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O print_results.f
cd ../common; scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -c -O timers.f
cd ../common; cc -c -O -o wtime.o ../common/wtime.c
scorep --user --thread=omp --noopenmp --nomemory ifort -fopenmp -tcollect-filter=../scorep_icc.filt -mcmodel=medium -extend_source -O -o ../bin/bt.D.x bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o adi.o rhs.o x_solve.o y_solve.o solve_subs.o z_solve.o add.o error.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[2]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'
make[1]: Leaving directory `/home/h1/mian/web_examples/NPB3.3.1/NPB3.3-OMP/BT'

We then compare the default and the optimized version with the following batch script:

#!/bin/bash
#SBATCH --time=02:00:00 # walltime
#SBATCH --nodes=2 # number of nodes requested for application run
#SBATCH --tasks-per-node=24 # number of processes per node for application run
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --partition=haswell
#SBATCH --mem-per-cpu=2500M # memory per CPU core
#SBATCH -J "bt_RRL" # job name
#SBATCH -A p_readex
#SBATCH --reservation=READEX
# loading READEX environment
module purge
module use /projects/p_readex/modules
module load readex/ci_readex_intelmpi2017.2.174_intel2017.2.174
# set-up Score-P
export SCOREP_SUBSTRATE_PLUGINS=rrl
export SCOREP_RRL_PLUGINS=cpu_freq_plugin,uncore_freq_plugin,OpenMPTP
export SCOREP_RRL_VERBOSE="DEBUG"
export SCOREP_TUNING_CPU_FREQ_PLUGIN_VERBOSE="DEBUG"
export SCOREP_TUNING_UNCORE_FREQ_PLUGIN_VERBOSE="DEBUG"
# Set-up RRL
export SCOREP_RRL_TMM_PATH="./tuning_model.json"
# Disable Profiling
export SCOREP_ENABLE_PROFILING=false
# Lower MPI overhead
export SCOREP_MPI_ENABLE_GROUPS=ENV
# run default (non-instrumented binary)
../bin/bt.D.x_clean
# run optimized (with the READEX Runtime Library)
../bin/bt.D.x_rrl

Results

Now we can use the SLURM accounting tool sacct to retrieve the energy values measured with HDEEM:


mian@tauruslogin5:~/web_examples/NPB3.3.1/NPB3.3-OMP/scripts_taurus_hsw> sacct --format JobID,JobName,CPUTime,ConsumedEnergy -j 17283761
JobID        JobName    CPUTime    ConsumedEnergy
------------ ---------- ---------- --------------
17283761     bt_rrl     17:44:24
17283761.ba+ batch      17:44:24   617.96K
17283761.0   bt.D.x_cl+ 07:04:24   312.85K
17283761.1   bt.D.x_rrl 10:34:48   304.02K
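
The two job steps can be compared directly from these numbers. The following is a minimal sketch that computes the relative differences; it assumes that both ConsumedEnergy values use the same unit (the “K” suffix) and that step .0 is the non-instrumented binary and step .1 the run with the READEX Runtime Library, as the truncated step names suggest. The helper to_seconds is hypothetical.

```shell
#!/bin/sh
# Values copied from the sacct output above.
e_clean=312.85          # ConsumedEnergy of bt.D.x_clean (K)
e_rrl=304.02            # ConsumedEnergy of bt.D.x_rrl (K)
t_clean="07:04:24"      # CPUTime of bt.D.x_clean
t_rrl="10:34:48"        # CPUTime of bt.D.x_rrl

# Hypothetical helper: convert HH:MM:SS to seconds.
to_seconds() { echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'; }

s_clean=$(to_seconds "$t_clean")
s_rrl=$(to_seconds "$t_rrl")

# Relative energy saving and relative CPU time increase of the RRL run.
saving=$(awk -v a="$e_clean" -v b="$e_rrl" 'BEGIN { printf "%.2f", (a - b) / a * 100 }')
slowdown=$(awk -v a="$s_clean" -v b="$s_rrl" 'BEGIN { printf "%.1f", (b - a) / a * 100 }')

echo "energy saving: $saving %"        # energy saving: 2.82 %
echo "CPU time increase: $slowdown %"  # CPU time increase: 49.6 %
```

For this particular job, the run with the tuning model consumed about 2.8% less energy while its reported CPU time was about 50% higher.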