
Using Intel Compilers on HPC Clusters

This document explains the syntax and usage of Intel compilers included in Intel oneAPI Toolkits, with examples for each usage scenario.
  1. Introduction to Intel oneAPI Toolkits
    1. Intel compilers
    2. Intel MPI
    3. OpenMP
    4. Hybrid OpenMP/MPI
    5. Intel MKL
  2. Using Intel compilers: Sequential programs
    1. Example: C
    2. Example: Fortran
  3. Using Intel compilers: Parallel programs
    1. MPI
      1. MPI Example: C
      2. MPI Example: Fortran
    2. OpenMP
      1. OpenMP Example: C
      2. OpenMP Example: Fortran
    3. Hybrid OpenMP/MPI
      1. Hybrid Example: C
      2. Hybrid Example: Fortran
Prerequisites
In general, high-performance computing (HPC) users should have a basic knowledge of the Linux environment, HPC systems, and job scheduling and workload management systems (specifically, PBS Pro, which is used on the Thunder/Thunder Prime clusters at CCAST), as well as some Linux shell scripting experience. The essential information can be found in the CCAST User Guide.
Example files
All the source code and job submission scripts discussed in this document can be found in the following compressed file:
• on Thunder: /mmfs1/thunder/projects/ccastest/examples/intel_examples.tar.gz
• on Thunder Prime: /mmfs1/projects/ccastest/examples/intel_examples.tar.gz

Conventions used in this document

• Terminal commands are denoted by inline code prefixed with $; output omits the $

   $ echo You are the coolest programmer ever

   You are the coolest programmer ever

• Code is denoted by code blocks

   if (hacker) {

       access_granted = True

   }

• Variable inputs are denoted by capital letters in brackets

   [PASSWORD]

1. Introduction to Intel oneAPI Toolkits

Intel Parallel Studio XE is a software package that enables developers to build, analyze, and scale their applications. CCAST/NDSU currently uses the 2020 Cluster Edition for Linux. 

On Thunder, execute

• module avail to display all available environment modules

• module display intel/2020.1.217 to view environment variables associated with Intel Parallel Studio XE version 2020.1.217

• module load intel/2020.1.217 to load those variables into your working environment (required, e.g., when you want to compile and link a program using Intel compilers and libraries).

On Thunder Prime, execute

• module avail to display all available environment modules

• module display intel-parallel-studio/cluster.2020.4-gcc-vcxt to view environment variables associated with Intel Parallel Studio XE version 2020.4

• module load intel-parallel-studio/cluster.2020.4-gcc-vcxt to load those variables into your working environment (required, e.g., when you want to compile and link a program using Intel compilers and libraries).

This tutorial will focus on the building aspects, namely the compilers and math libraries. Only very basic knowledge of UNIX/Linux command line, C++, and Fortran is assumed.

NOTE: In Dec. 2020, Intel Parallel Studio XE Cluster Edition transitioned into Intel oneAPI Toolkits (hereafter referred to as Intel oneAPI). To load environment variables associated with Intel oneAPI on Thunder, execute: module load intel/2021.1.1. All the following examples on Thunder now use Intel oneAPI; the examples on Thunder Prime still use Intel Parallel Studio 2020.

1.1 Intel compilers

The GNU compilers are the compilers most commonly used among developers. The Intel compilers are often able to generate more highly optimized code because they take advantage of the latest features of Intel CPUs. They include C/C++ and Fortran compilers.
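For instance, a typical invocation that asks the compiler to optimize aggressively and to target the instruction set of the machine doing the compiling might look like the following (the source file name is hypothetical; -O3 and -xHost are standard Intel compiler options):

$ icc -O3 -xHost mycode.c -o exe_fast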

1.2 Intel MPI

Message Passing Interface (MPI) is an interface that enables scaling on HPC clusters by creating multiple running processes of the same code and passing messages among them. These processes are not threads, but simply separate instances of the code written. MPI has a very high scaling capacity, but message overhead and redundant data among nodes can become bottlenecks. Intel provides its own MPI implementation (Intel MPI), along with a developer guide covering common usage scenarios.
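To make the idea of message passing concrete, here is a minimal hedged sketch (not one of the CCAST example files) in which rank 0 sends a single integer to rank 1; it needs at least two MPI processes to run:

// minimal message-passing sketch: rank 0 sends one integer to rank 1
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        // send one MPI_INT to rank 1, using message tag 0
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive one MPI_INT from rank 0, matching tag 0
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %i from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}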

Aside from the examples in this document, additional Intel MPI examples can be found in Intel's online documentation.

1.3 OpenMP

OpenMP uses threads and takes advantage of shared memory resources to enable near seamless data exchange in parallelized sections of code. As a result, a process’ OpenMP threads can only exist on a single compute node and are not scalable beyond that node. On the upside, it takes almost no code refactoring to implement.
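As a small hedged illustration of the shared-memory model (not one of the CCAST example files), the loop below is parallelized with a single directive; all threads work on the same array, and their partial sums are combined by the reduction clause. It would be compiled with the -qopenmp option described later in this document:

// shared-memory sketch: the threads of one process share the array "a"
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a[1000], sum = 0.0;
    int i;

    for (i = 0; i < 1000; i++)
        a[i] = 1.0;

    // loop iterations are divided among the threads; partial sums are combined
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   // expected 1000.000000
    return 0;
}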

OpenMP examples of C/C++ and Fortran can be found on Intel’s developer guide and reference online.

1.4 Hybrid OpenMP/MPI

Since MPI is scalable but has high overhead and OpenMP cannot scale beyond a single compute node, hybrid applications are sometimes necessary. The following hybrid discussion is taken from this article. The benefits of a hybrid application include additional parallelism, reduced overhead, reduced load imbalance, and reduced memory access cost. Pure MPI is often the best approach, but, of course, that depends upon the implementation. Some disadvantages of a hybrid application include idle threads outside OpenMP calls, synchronization issues, and imbalanced memory access when more than one socket exists on a compute node. Refer to said article before adding OpenMP to an MPI application.

1.5 Intel MKL

Intel Math Kernel Library (Intel MKL) is a math library that exploits the core counts and architectures of Intel CPUs to reach a high degree of optimization and parallelization. It provides implementations of many standard math packages, such as BLAS and LAPACK. This means no code changes are required if these libraries are already being used; a developer merely needs to link against Intel MKL.
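As a hedged illustration (not one of the CCAST example files), the following program calls the standard BLAS routine dgemm through MKL's C interface; apart from including the MKL header, no MKL-specific code is needed:

// hedged sketch: a standard BLAS call (dgemm) provided by Intel MKL
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    // 2x2 matrices in row-major order; compute C = 1.0*A*B + 0.0*C
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C[0][0] = %f\n", C[0]);   // expected 19.000000
    return 0;
}

With the Intel module loaded, such a program can typically be built by adding an MKL link flag to the compiler command, e.g. icc mkl_test.c -o exe_mkl -mkl (newer oneAPI compilers also accept -qmkl); the exact flag may vary by compiler version.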

For more information, see the article Using Intel Math Kernel Library (MKL) on HPC Clusters.

2. Using Intel compilers: Sequential programs

Intel provides three compiler commands, for C, C++, and Fortran:

• C++ compiler - C language: icc

• C++ compiler - C++ language: icpc

• Fortran compiler - Fortran77/Fortran95 language: ifort

A simple compiler command to compile and link code has the following format:

[COMPILER] [SOURCE_FILE ...] -o [OUTPUT_FILE_NAME]

where the ellipsis in [SOURCE_FILE ...] denotes one or more source files.
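For example, compiling two hypothetical C source files into a single executable would look like:

$ icc main.c utils.c -o exe_app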

This Intel webpage has many code samples that we can use and learn from.

The following are simple sequential examples to demonstrate the basic usage of the compilers. The only requirement to run these programs is an appropriate compiler.

2.1 Example: C

Here is a simple C program to print “Hello World!”. We include stdio.h, print, and then return from our program. stdio.h provides the declarations necessary to print to the console.

seq.c

// standard input/output
#include <stdio.h>

int main(int argc, char* argv[])
{
    // standard output
    printf("Hello World!\n");
    return 0;
}

Before we can compile this program, we need to load the Intel environment module

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

Since this is a C program, denoted by the file extension “.c”, we will use the icc command from Intel’s C++ compiler. We pass in our source file and denote the name of the executable with the -o option.

Now, our compilation and linking looks like this:

$ icc seq.c -o exe_seq_c

CCAST uses PBS Pro as the batch (job scheduling and workload management) system on the Thunder/Thunder Prime clusters. A batch system queues submitted programs, called jobs, and arranges them for execution on the cluster's resources.

To run our program on a cluster, we need a job submission (PBS Pro) script. To customize our script, we can use PBS directives in the form of either:

• an option to the qsub command

• a PBS header line

We will be using PBS header lines. They are prefixed with #PBS. The # symbol is also used for comments, so it is common to use ## for ordinary comments to differentiate them from PBS header lines.

## this is a comment
# this is also a comment, but below is a PBS header line
#PBS -[OPTION] [VALUES]
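Equivalently, the same directive can be passed as an option to the qsub command instead of being placed in the script; for example (a hedged illustration using standard qsub options):

$ qsub -N job_seq_c -l walltime=00:10:00 [SCRIPT_NAME]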

In our job script, we need to

• specify the program used to interpret the file; this must be on the first line: #!/bin/bash

• add appropriate PBS options: #PBS -[OPTION] [VALUES]

• go to our current working directory: cd $PBS_O_WORKDIR

• execute our program: ./[EXECUTABLE_NAME]

• end the script: exit 0

 

seq_c.pbs

#!/bin/bash
## name of the queue
#PBS -q default
## name of the job
#PBS -N job_seq_c
## request resources: Only 1 CPU core is needed for a serial job!
#PBS -l select=1:mem=1gb:ncpus=1
## time we need for the job
#PBS -l walltime=00:10:00
## replace "[GROUP_NAME]" below with your actual group ID
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

## use PBS environment variable to change to working directory
cd $PBS_O_WORKDIR

## run the executable
./exe_seq_c

exit 0

We can submit this job with the qsub command:

$ qsub seq_c.pbs
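While the job is waiting in the queue or running, its status can be checked with the standard PBS qstat command (shown here as a convenience; it is not part of the example files):

$ qstat -u [USER_NAME]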

To view the output we can use the cat command, short for “concatenate”:

$ cat job_seq_c.o[JOB_ID]

Hello World!

2.2 Example: Fortran

The equivalent “Hello World” program in Fortran looks like the following. We are using Fortran’s print statement, specifying a format with the '(a12)' edit descriptor, which prints 12 characters (the a, i, and f edit descriptors are for characters, integers, and reals, respectively). Comments are denoted with !, but we use !! to differentiate ordinary comments from the !$omp directives that appear later in this document.

 

seq.f90

program seq
    !! standard output, with character edit descriptor 'a'
    print '(a12)', 'Hello World!'
end program seq

For our Fortran file extensions in this tutorial, we will be using .f90. This extension indicates free-form source, which tells the compiler how to interpret the layout of the Fortran source code.

We compile and link in a similar way with Intel’s Fortran compiler. Again, loading Intel first (if we haven’t already) and using the -o compiler option for the executable name:

On Thunder:

$ module load intel/2021.1.1

$ ifort seq.f90 -o exe_seq_f90

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

$ ifort seq.f90 -o exe_seq_f90

We can use the same PBS script as before except we have swapped out the executable name for exe_seq_f90:

 

seq_f90.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_seq_f90
#PBS -l select=1:mem=1gb:ncpus=1
#PBS -l walltime=08:00:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

cd $PBS_O_WORKDIR

./exe_seq_f90

exit 0

Submit the job with qsub:

$ qsub seq_f90.pbs

View the output with cat:

$ cat job_seq_f90.o[JOB_ID]

Hello World!

3. Using Intel compilers: Parallel programs

3.1 MPI

MPI programs require:

• appropriate compiler options

• an MPI implementation to link against

• a program to handle the separate instances of the code, mpirun in our case.

To use the appropriate compiler options, it is often easiest to use a compiler wrapper. Intel provides MPI wrappers for its three compiler commands. A wrapper is a routine that calls another routine with predefined parameters; the MPI wrappers call the Intel compilers with the options required to build an MPI program.

• C++ Compiler - C Language: mpiicc

• C++ Compiler - C++ Language: mpiicpc

• Fortran Compiler - Fortran77/Fortran95 Language: mpiifort
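To see exactly what a wrapper does on a given system, the Intel MPI wrappers accept a -show option, which prints the underlying compiler command and options without compiling anything (the exact output depends on the installation):

$ mpiicc -show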

 

mpirun takes the following options:

• -n: number of MPI processes to launch

• -ppn: number of processes to launch on each node. By default, processes are assigned to the physical cores on the first node, overflowing to the following node and so on.

• -f: filepath to host file listing the cluster nodes.
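For example, a hypothetical run of 8 MPI processes placed 4 per node would look like:

mpirun -n 8 -ppn 4 -f $PBS_NODEFILE ./[EXECUTABLE_NAME]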

To run an MPI program we must:

• specify mpiprocs. This can vary based on how many MPI processes we want to launch. Here, we will set it to the number of CPU cores we have.

#PBS -l select=1:mem=2gb:ncpus=[NUMBER_CPU_CORES]:mpiprocs=[NUMBER_CPU_CORES] 

• load the Intel module:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

• set an environment variable named I_MPI_HYDRA_TOPOLIB to ipl. This fixes an issue concerning the topology detection:

export I_MPI_HYDRA_TOPOLIB=ipl

• select tcp as the Open Fabrics Interfaces (OFI) provider for the libfabric library, via the use of the I_MPI_OFI_PROVIDER environment variable:

        export I_MPI_OFI_PROVIDER=tcp

• obtain the number of processes from the node file (hostfile):

NUM_PROC=$(wc -l < $PBS_NODEFILE)

With the [VARIABLE]=$([COMMAND]) syntax, we are simply storing the output of a command into a variable. This command counts the number of lines in the file $PBS_NODEFILE, i.e., the file containing the identifiers of the nodes we will use.

• run the program with the command mpirun:

mpirun -n $NUM_PROC -f $PBS_NODEFILE ./exe

We don’t specify the -ppn option in this case, opting for the default behavior.

 

3.1.1 MPI example: C

Every MPI program will require the following statements:

• #include <mpi.h>

• MPI_Init(&argc, &argv)

• MPI_Finalize()

In addition, we will create a variable called mpi_id and store the current rank using the MPI_Comm_rank function. The rank of an MPI process is simply a unique identifier represented as an integer.

 

mpi.c

#include <stdio.h>

// MPI interface
#include <mpi.h>

int main(int argc, char* argv[])
{
    int mpi_id;

    // initialize the MPI execution environment
    MPI_Init(&argc, &argv);

    // determine the rank of the calling process
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);

    // standard output, with integer format specifier %i replaced with mpi_id
    printf("Process %i says \"Hello World!\"\n", mpi_id);

    // terminate the MPI execution environment
    MPI_Finalize();

    return 0;
}
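If the total number of MPI processes is also needed, it can be obtained with MPI_Comm_size; a minimal hedged variation of the example above (not one of the CCAST example files) is:

// variation of mpi.c that also reports the communicator size
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int mpi_id, mpi_size;

    MPI_Init(&argc, &argv);

    // rank of the calling process and total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    printf("Process %i of %i says \"Hello World!\"\n", mpi_id, mpi_size);

    MPI_Finalize();
    return 0;
}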

We load the Intel module again, which is only required if it has not been loaded previously in the current login session.

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

This time, we use our C MPI compiler wrapper for compilation and linking:

$ mpiicc mpi.c -o exe_mpi_c

Our job submission script with the previously mentioned modifications will look like the following:

 

mpi_c.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_mpi_c
#PBS -l select=4:mem=1gb:ncpus=2:mpiprocs=2
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

## load Intel to use mpirun and link with Intel MPI
## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

## fix possible issues with topology detection and inter-node communication
export I_MPI_HYDRA_TOPOLIB=ipl
export I_MPI_OFI_PROVIDER=tcp

cd $PBS_O_WORKDIR

## count the number of MPI processes
NUM_PROC=$(wc -l < $PBS_NODEFILE)

## run the executable
mpirun -n $NUM_PROC -f $PBS_NODEFILE ./exe_mpi_c

exit 0

Submit our job:

$ qsub mpi_c.pbs

Output the results:

$ cat job_mpi_c.o[JOB_ID]

Process 0 says "Hello World!"
Process 1 says "Hello World!"
...
 
As an exercise, modify the resource request in the job submission script (mpi_c.pbs) as the following and resubmit the job:
#PBS -l select=2:mem=5gb:ncpus=4:mpiprocs=4
#PBS -l place=scatter
 
The "place=scatter" option is selected since we want the two chunks (select=2) to be on two different compute nodes (i.e., not on the same node).
 

3.1.2 MPI example: Fortran

Our Fortran example is the same as our C example, but with slightly different syntax. Most notably, a variable ierror is required for error handling. The statements now look like these:

 include "mpif.h"

• call mpi_init(ierror)

• call mpi_finalize(ierror)

We create variables mpi_id and ierror for the rank and errors, respectively. We get our MPI rank with mpi_comm_rank and then print out our rank with the appropriate formatting.

 

mpi.f90

program mpi

    !! MPI interface
    include "mpif.h"

    integer :: mpi_id, ierror

    !! initialize the MPI execution environment
    call mpi_init(ierror)

    !! determine the rank of the calling process
    call mpi_comm_rank(mpi_comm_world, mpi_id, ierror)

    !! standard output, with integer and character edit descriptors, 'i' and 'a', respectively
    print "(a8, i3, a20)", 'Process ', mpi_id, ' says "Hello World!"'

    !! terminate the MPI execution environment
    call mpi_finalize(ierror)
   
end program mpi

Load the Intel module if we have not yet done so:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

A call to the Fortran MPI compiler wrapper for compilation and linking looks like:

$ mpiifort mpi.f90 -o exe_mpi_f90

Again, we use the same job submission script, with the executable name changed to exe_mpi_f90.

 

mpi_f90.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_mpi_f90
#PBS -l select=1:mem=1gb:ncpus=2:mpiprocs=2
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

## load Intel to use mpirun and link with Intel MPI
## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

## fix possible issues with topology detection and inter-node communication
export I_MPI_HYDRA_TOPOLIB=ipl
export I_MPI_OFI_PROVIDER=tcp

cd $PBS_O_WORKDIR

## count the number of MPI processes
NUM_PROC=$(wc -l < $PBS_NODEFILE)

## run the executable
mpirun -n $NUM_PROC -f $PBS_NODEFILE ./exe_mpi_f90

exit 0

 
Submit our job:
$ qsub mpi_f90.pbs
 
Output the results:
$ cat job_mpi_f90.o[JOB_ID]
 
Process 1 says "Hello World!"
Process 0 says "Hello World!"
...
 
As an exercise, modify the resource request in the job submission script (mpi_f90.pbs) as the following and resubmit the job:
#PBS -l select=2:mem=5gb:ncpus=4:mpiprocs=4
#PBS -l place=scatter
 
The "place=scatter" option is selected since we want the two chunks (select=2) to be on two different compute nodes (i.e., not on the same node).
 

3.2 OpenMP

To use OpenMP, the -qopenmp option must be added to the compiler command:

[COMPILER] [SOURCE_FILE ...] -o [OUTPUT_FILE_NAME] -qopenmp

By default, the OpenMP run-time libraries are linked dynamically. To link them statically, the -qopenmp-link=static option can be added:

[COMPILER] [SOURCE_FILE ...] -o [OUTPUT_FILE_NAME] -qopenmp -qopenmp-link=static

In addition, we will need a compiler and an OpenMP implementation to link against.

When using OpenMP, the ompthreads resource will need to be set in the job's resource request. Its value can vary based on how many threads are desired.

#PBS -l ompthreads=[NUMBER_OF_THREADS]
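As far as we understand, PBS Pro uses the ompthreads value to set the OMP_NUM_THREADS environment variable inside the job, which the OpenMP runtime reads to decide how many threads to create. This can be checked by adding a line such as the following to a job script (a hedged check, not part of the example scripts):

echo "OMP_NUM_THREADS = $OMP_NUM_THREADS"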

3.2.1 OpenMP example: C

Here is our OpenMP example. In this code, we:

• include the omp.h header

• create the omp_id variable to store the ID of each OpenMP thread

• create a threaded section of code with #pragma omp parallel

Inside the threaded section, we:

• use the private(omp_id) clause to make our omp_id variable private to each thread

• obtain the thread ID with the omp_get_thread_num() function

• print out results.

 

omp.c

#include<stdio.h>

// OpenMP interface
#include<omp.h>

int main(int argc, char *argv[])
{
    int omp_id;

    // define a parallel region
    // define a private var, omp_id, for each thread
    #pragma omp parallel private(omp_id)
    {
        // get the thread number of the executing thread
        omp_id = omp_get_thread_num();
        printf("Thread %i says \"Hello World!\"\n", omp_id);
    }

    return 0;
}

Load the Intel module if we have not yet done so:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

Using the -qopenmp option with the C compiler command, compilation and linking looks like this:

$ icc omp.c -o exe_omp_c -qopenmp

For our job submission script, ompthreads was added and we continue to load intel/2021.1.1 on Thunder (intel-parallel-studio/cluster.2020.4-gcc-vcxt on Thunder Prime). We are not using mpirun in this instance; it is only required for MPI applications.

 

omp_c.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_omp_c
#PBS -l select=1:mem=2gb:ncpus=4:ompthreads=4
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

## load Intel to link with Intel OpenMP
## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

cd $PBS_O_WORKDIR

./exe_omp_c

exit 0

Submit our job:

$ qsub omp_c.pbs

Output the results:

$ cat job_omp_c.o[JOB_ID]

Thread 3 says "Hello World!"
Thread 0 says "Hello World!"

...

 

3.2.2 OpenMP example: Fortran

Looking at our Fortran example, we have the same fundamental steps for this program:

• include the omp_lib.h header

• create the omp_id variable to store our ID

• create a threaded section of code with !$omp parallel

Inside the threaded section we again:

• use the private(omp_id) clause to make our omp_id variable private to each thread

• obtain the thread ID with the omp_get_thread_num() function

• print out results with appropriate formatting.

 

omp.f90

program omp

    !! OpenMP interface
    include "omp_lib.h"

    integer :: omp_id

    !! define a parallel region
    !! define a private var, omp_id, for each thread
    !$omp parallel private(omp_id)

        !! get the thread number of the executing thread
        omp_id =  omp_get_thread_num()
        print '(a7, i3, a20)', 'Thread ', omp_id, ' says "Hello World!"'

    !$omp end parallel

end program omp

Load the Intel module if we have not yet done so:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

Using the -qopenmp option with the Fortran compiler command, compilation and linking looks like this:

$ ifort omp.f90 -o exe_omp_f90 -qopenmp

We are using the same job submission script, with ompthreads, module load intel/2021.1.1 (intel-parallel-studio/cluster.2020.4-gcc-vcxt on Thunder Prime), and ./exe_omp_f90.

 

omp_f90.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_omp_f90
#PBS -l select=1:mem=2gb:ncpus=4:ompthreads=4
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

## load Intel to link with Intel OpenMP
## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

cd $PBS_O_WORKDIR

./exe_omp_f90

exit 0

Submit our job:

$ qsub omp_f90.pbs

Output the results:

$ cat job_omp_f90.o[JOB_ID]

Thread 3 says "Hello World!"
Thread 1 says "Hello World!"

...

3.3 Hybrid OpenMP/MPI

For a hybrid application, we need all of the required elements of the previous two technologies. On top of that, it is recommended to set the I_MPI_PIN_DOMAIN environment variable, which controls process pinning, i.e., how each MPI process (and its threads) is bound to a set of CPU cores. Setting it to omp ensures that each MPI process has exactly ompthreads cores at its disposal for OpenMP multithreading.

export I_MPI_PIN_DOMAIN=omp
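To check how the processes were actually pinned, Intel MPI can print its pinning decisions at startup; to the best of our knowledge, setting the I_MPI_DEBUG environment variable to 4 or higher before calling mpirun prints a rank-to-core mapping in the job output (a debugging aid, not part of the example scripts):

export I_MPI_DEBUG=4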

 

3.3.1 Hybrid example: C

In this example, we combine code from both approaches. The logical order of the disparate parts is:

• Include the mpi.h and omp.h headers

• Create the mpi_id and omp_id variables before they are used

• Initialize MPI with MPI_Init(&argc, &argv)

• Get our mpi_id with MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id)

• Create our threaded section with #pragma omp parallel private(omp_id)

• Obtain our omp_id with omp_get_thread_num()

• Print out results

• Finalize MPI with MPI_Finalize().

 

hybrid.c

#include <stdio.h>

#include <omp.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int mpi_id, omp_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_id);

    #pragma omp parallel private(omp_id)
    {
        omp_id = omp_get_thread_num();
        printf("Hello World! I am thread %i from process %i\n", omp_id, mpi_id);
    }

    MPI_Finalize();

    return 0;
}

We compile and link with both mpiicc and -qopenmp.

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

 

Then,

$ mpiicc hybrid.c -o exe_hybrid_c -qopenmp

In our job submission script we have to load Intel:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

 

We have our MPI parts:

• mpiprocs

• export I_MPI_HYDRA_TOPOLIB=ipl

• export I_MPI_OFI_PROVIDER=tcp

• mpirun

Our OpenMP part:

• ompthreads

And our new statement to allow the combination of the two:

• export I_MPI_PIN_DOMAIN=omp

 

hybrid_c.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_hybrid_c
#PBS -l select=4:mem=1gb:ncpus=4:mpiprocs=2:ompthreads=2
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

 

## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

export I_MPI_HYDRA_TOPOLIB=ipl

export I_MPI_OFI_PROVIDER=tcp

export I_MPI_PIN_DOMAIN=omp

cd $PBS_O_WORKDIR


NUM_PROC=$(wc -l < $PBS_NODEFILE)

mpirun -n $NUM_PROC -f $PBS_NODEFILE ./exe_hybrid_c

exit 0

Submit the job:

$ qsub hybrid_c.pbs

Output the results:

$ cat job_hybrid_c.o[JOB_ID]

Hello World! I am thread 0 from process 1
Hello World! I am thread 1 from process 1
Hello World! I am thread 0 from process 0
Hello World! I am thread 1 from process 0

...

 

3.3.2 Hybrid example: Fortran

Here is what the process looks like in Fortran:

• Include the mpif.h and omp_lib.h headers in any order

• Create the mpi_id, omp_id, and ierror variables before they are used

• Initialize MPI with call mpi_init(ierror)

• Get our mpi_id with call mpi_comm_rank(mpi_comm_world, mpi_id, ierror)

• Create our threaded section with !$omp parallel private(omp_id)

• Obtain our omp_id with omp_get_thread_num()

• Print out results with formatting

• Finalize MPI with call mpi_finalize(ierror)

 

hybrid.f90

program hybrid

    include "mpif.h"
    include "omp_lib.h"

    integer :: mpi_id, omp_id, ierror

    call mpi_init(ierror)
    call mpi_comm_rank(mpi_comm_world, mpi_id, ierror)

    !$omp parallel private(omp_id)
        omp_id = omp_get_thread_num()
        print '(a26, i3, a8, i3)', 'Hello World! I am process ', mpi_id, ' thread ', omp_id
    !$omp end parallel

    call mpi_finalize(ierror)

end program hybrid

Compiling and linking with MPI wrapper and OpenMP option:

On Thunder:

$ module load intel/2021.1.1

On Thunder Prime:

$ module load intel-parallel-studio/cluster.2020.4-gcc-vcxt

Then,

$ mpiifort hybrid.f90 -o exe_hybrid_f90 -qopenmp

Our job submission script is unsurprisingly identical, with the executable swapped to ./exe_hybrid_f90:

 

hybrid_f90.pbs

#!/bin/bash
#PBS -q default
#PBS -N job_hybrid_f90
#PBS -l select=4:mem=1gb:ncpus=4:mpiprocs=2:ompthreads=2
#PBS -l walltime=00:30:00
#PBS -W group_list=x-ccast-prj-[GROUP_NAME]

 

## On Thunder: module load intel/2021.1.1
## On Thunder Prime, replace the line below with: module load intel-parallel-studio/cluster.2020.4-gcc-vcxt
module load intel/2021.1.1

export I_MPI_HYDRA_TOPOLIB=ipl

export I_MPI_OFI_PROVIDER=tcp

export I_MPI_PIN_DOMAIN=omp

 

cd $PBS_O_WORKDIR

NUM_PROC=$(wc -l < $PBS_NODEFILE)

mpirun -n $NUM_PROC -f $PBS_NODEFILE ./exe_hybrid_f90

exit 0

Submit the job:

$ qsub hybrid_f90.pbs

Output the results:

$ cat job_hybrid_f90.o[JOB_ID]

 

Hello World! I am process 1 thread 0
Hello World! I am process 1 thread 1
Hello World! I am process 0 thread 0
Hello World! I am process 0 thread 1

...