Running Machine Learning on HPC Clusters

A tutorial on running machine learning software tools on HPC clusters.

This document provides instructions on (i) how to run TensorFlow, PyTorch, and Keras jobs on CCAST's Thunder and Thunder Prime clusters, and (ii) [optional but strongly recommended] how to install those machine learning tools in your HOME directory.

I. Running machine learning jobs

1. Introduction to machine learning

As a subset of artificial intelligence, machine learning consists of computer algorithms that improve automatically through experience. Based on sample data, known as "training data", machine learning algorithms build mathematical models to make predictions or decisions after "learning" from those training data.

Machine learning software tools are often written in Python, an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, together with dynamic typing and dynamic binding, make it very attractive for machine learning.

2. Machine learning frameworks and tools

TensorFlow is a Python-friendly, free and open-source software library for machine learning. It eases the process of acquiring data, training models, serving predictions, and refining future results.

PyTorch is a Python-based scientific computing package that harnesses the power of GPUs. It is a machine learning research platform that provides maximum flexibility and speed.

Keras is a free and open-source Python library that is powerful and easy to use for developing and evaluating machine learning models.

3. Running machine learning tools on clusters

3.1 Example files

All the examples and job submission scripts discussed in this document can be found in the following compressed file:
/mmfs1/thunder/projects/ccastest/examples/ml_examples.tar.gz (on Thunder).
/mmfs1/projects/ccastest/examples/ml_examples.tar.gz (on Thunder Prime).

Copy the example file ml_examples.tar.gz to your SCRATCH directory (you must run jobs from SCRATCH, NOT from your HOME directory!):
$ cp /mmfs1/thunder/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH (on Thunder).
$ cp /mmfs1/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH (on Thunder Prime).

Go to your SCRATCH directory and decompress the file:
$ cd $SCRATCH
$ tar -xvf ml_examples.tar.gz

3.2 TensorFlow


Running serial TensorFlow jobs on CPUs

This example is an MNIST dataset-based neural network with 3 layers.
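For reference, a minimal sketch of what a 3-layer MNIST classifier might look like is shown below. This is illustrative only; the actual tf_serial.py shipped in the example archive may differ (and the MNIST data may be bundled with the examples rather than downloaded):

import tensorflow as tf

# Load and normalize MNIST (downloaded on first use if not already cached).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A simple network with 3 fully connected layers.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)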

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/tf_serial 

Modify the tf_serial.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N test
## serial job: only 1 CPU core
#PBS -l select=1:mem=2gb:ncpus=1
#PBS -l walltime=08:00:00
## replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##/gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate is for Thunder
##replace it with /mmfs1/apps/tfvenv240/bin/activate for Thunder Prime
source /gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate
python3 tf_serial.py

Submit the PBS script to the queue:
$ qsub tf_serial.pbs  

The job output is written to a file named test.o[job ID] in the same directory as tf_serial.pbs.

Running parallel (multithreaded) TensorFlow jobs on CPUs

By default, all CPU cores requested by the user via the "ncpus" parameter in the PBS script are aggregated under the cpu:0 device, and TensorFlow uses those multiple CPU cores automatically. TensorFlow also provides distribution strategies to make it easier to distribute neural networks across multiple devices; strategies for distributing TensorFlow across multiple devices or nodes include MirroredStrategy, MultiWorkerMirroredStrategy, and ParameterServerStrategy.
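For example (a sketch only, not necessarily how tf_multithreaded.py is written), TensorFlow's thread pools can be matched explicitly to the cores requested with ncpus by reading the $NCPUS environment variable that PBS defines:

import os
import tensorflow as tf

# Match TensorFlow's thread pools to the number of cores requested with ncpus.
# PBS exports this as $NCPUS; fall back to 1 if the variable is not set.
ncpus = int(os.environ.get('NCPUS', 1))
tf.config.threading.set_intra_op_parallelism_threads(ncpus)  # threads within a single op
tf.config.threading.set_inter_op_parallelism_threads(ncpus)  # ops that may run in parallel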

This example is an MNIST dataset-based neural network with 3 layers.

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/tf_parallel

Modify the tf_multithreaded.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N test
##changes "ncpus" and "mem" as needed; keep select=1
#PBS -l select=1:mem=4gb:ncpus=2
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##/gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate is for Thunder
##replace it with /mmfs1/apps/tfvenv240/bin/activate for Thunder Prime
source /gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate

python3 tf_multithreaded.py

Submit the PBS script to the queue:
$ qsub tf_multithreaded.pbs

The job output is written to a file named test.o[job ID] in the same directory as tf_multithreaded.pbs.

Running TensorFlow jobs on GPUs

TensorFlow code will run on a single GPU with no code changes required. The simplest way to run on multiple GPUs is to use a distribution strategy. If you would like a particular operation to run on a specific device, you can use tf.device to create a device context, and all operations within that context will run on the designated device. While a job is running on a GPU node, you can access that node with the ssh node[node ID] command and check GPU usage with the nvidia-smi command.
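Both mechanisms can be sketched as follows (illustrative only; tf_GPU.py in the example archive may be structured differently):

import tensorflow as tf

# Pin particular operations to a specific device with a tf.device context.
with tf.device('/GPU:0'):
    a = tf.random.uniform((1000, 1000))
    b = tf.random.uniform((1000, 1000))
    c = tf.matmul(a, b)  # runs on GPU 0

# Distribute a Keras model across all visible GPUs with MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # The model and its variables must be created inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(32 * 32 * 3,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# model.fit(...) then trains with gradients synchronized across the GPUs.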

This example is a CIFAR-10 dataset-based convolutional neural network with 3 convolution layers.

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/tf_GPU 

Modify the tf_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q gpus
#PBS -N test
##keep ncpus=1
#PBS -l select=1:mem=10gb:ncpus=1:ngpus=2
#PBS -l walltime=03:00:00
##change "x-ccast-prj" to "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

#module load gcc/7.3.0-gcc and module load CUDAToolkit/10.1 are for Thunder
#replace them with module load cudnn and module load cuda/11.0.2-gcc-dayq for Thunder Prime
module load gcc/7.3.0-gcc
module load CUDAToolkit/10.1

##/gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate is for Thunder
##replace it with /mmfs1/apps/tfvenv240/bin/activate for Thunder Prime
source /gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate

#pass $NGPUS to TensorFlow
# Because $NGPUS is not defined by PBS like $NCPUS is, we need to extract this value in a different way.
# Note: In Bash, $0 stores the path of the currently running script
NGPUS=$(grep -oP 'ngpus=\d+' $0 | cut -d= -f2)

# Also, code should be parameterized. Using text substitution to modify
# code at runtime is considered bad practice.
python tf_GPU.py $NGPUS

Submit the PBS script to the queue:
$ qsub tf_GPU.pbs 

The job output is written to a file named test.o[job ID] in the same directory as tf_GPU.pbs.

3.3 PyTorch


Running PyTorch jobs

PyTorch uses libraries that attempt optimizations to utilize the CPU to its full capacity. PyTorch code will automatically use the multiple CPU cores assigned by the user through the ncpus parameter in the PBS script.
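If finer control is needed, the thread count can also be set explicitly to match ncpus; a minimal sketch, assuming the $NCPUS environment variable exported by PBS:

import os
import torch

# Limit PyTorch's intra-op thread pool to the cores requested with ncpus.
ncpus = int(os.environ.get('NCPUS', 1))
torch.set_num_threads(ncpus)
print('PyTorch is using', torch.get_num_threads(), 'threads')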

This example is a CIFAR-10 dataset-based convolutional neural network with 6 convolution layers, using a memristor device model as the synapse. This example can use multiple CPUs.

Go to the example directory:
$ cd $SCRATCH/pytorch 

Modify the pytorch.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N test
#PBS -j oe
#PBS -l select=1:mem=10gb:ncpus=4
#PBS -l walltime=01:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your sponsor's project group]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##/gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate is for Thunder
##replace it with /mmfs1/apps/tfvenv240/bin/activate for Thunder Prime
source /gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate

python3  pytorch.py

Submit the PBS script to the queue:
$ qsub pytorch.pbs

The job output is written to a file named test.o[job ID] in the same directory as pytorch.pbs.

3.4 Keras


Running Keras jobs

Keras automatically runs computations on as many cores as are available, i.e., the cores assigned by the user through the ncpus parameter in the PBS script.

This example is sequence-to-sequence learning for performing number addition.
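A rough sketch of such a model is shown below for illustration; the addition_rnn.py shipped with the examples may differ in its details (layer sizes, sequence lengths, and data encoding are assumptions here):

from tensorflow import keras
from tensorflow.keras import layers

MAXLEN = 7   # question string length, e.g. "345+678"
DIGITS = 4   # answer string length, e.g. "1023"
VOCAB = 12   # characters "0123456789+ " (one-hot encoded)

# Encoder-decoder RNN: read the question, then emit the answer character by character.
model = keras.Sequential([
    layers.Input(shape=(MAXLEN, VOCAB)),
    layers.LSTM(128),                           # encoder summarizes the question
    layers.RepeatVector(DIGITS),                # repeat the summary for each output position
    layers.LSTM(128, return_sequences=True),    # decoder
    layers.Dense(VOCAB, activation='softmax'),  # per-position character probabilities
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()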

Go to the example directory:
$ cd $SCRATCH/Keras2.3

Modify the keras.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N keras_test
#PBS -j oe
##change "ncpus" and "mem" if needed
#PBS -l select=1:mem=2gb:ncpus=2
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##/gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate is for Thunder
##replace it with /mmfs1/apps/tfvenv240/bin/activate for Thunder Prime
source /gpfs1/apps/centos7/opt/tensorflow2/tf2/bin/activate
python3 addition_rnn.py

Submit the PBS script to the queue:
$ qsub keras.pbs

The job output is written to a file named keras_test.o[job ID] in the same directory as keras.pbs.

4. Parallel scaling performance

Before you submit the main bulk of your jobs, it is advisable to first submit a test job. A test job should be representative of the main body of your work but scaled down (e.g., a small subset of your data or a small number of job steps) to ensure a short queue time, a short run time, and minimal resource use. Then run your test job with different numbers of CPU cores or GPUs to optimize your settings. Here, the same TensorFlow example as above is used to show the results of CPU and GPU parallel scaling.

4.1 CPU scaling

The CPU scaling examples are converted from the previous tf_GPU example to run on CPUs only. The following table shows the scaling results for the parallel TensorFlow example on CPUs:

Dataset    Processors   Time      Efficiency   Speedup   Training Size*
CIFAR-10   1            0:41:16   1            1         128 x 500
CIFAR-10   2            0:23:21   0.88         1.77      128 x 500
CIFAR-10   3            0:16:23   0.84         2.52      128 x 500
CIFAR-10   4            0:13:29   0.77         3.06      128 x 500
CIFAR-10   5            0:11:47   0.70         3.50      128 x 500
CIFAR-10   6            0:09:57   0.69         4.15      128 x 500
CIFAR-10   8            0:08:34   0.60         4.82      128 x 500
CIFAR-10   9            0:08:00   0.57         5.16      128 x 500
CIFAR-10   12           0:06:30   0.53         6.35      128 x 500
CIFAR-10   16           0:05:26   0.47         7.60      128 x 500

* Training Size = (batch size) x (training steps)
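The Speedup and Efficiency columns appear to be computed as Speedup = T(1 core) / T(N cores) and Efficiency = Speedup / N. For example, with 2 cores, Speedup = 41:16 / 23:21 (2476 s / 1401 s) ≈ 1.77, and Efficiency = 1.77 / 2 ≈ 0.88, matching the second row of the table.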

It can be concluded that 4 CPU cores is the optimal choice for this example, since the run time does not decrease significantly when more than 4 cores are used.

4.2 GPU scaling

The GPU scaling examples come from the previous tf_GPU example. The following table shows the scaling results for the parallel TensorFlow example on GPUs:

No.   Dataset    Processors (CPU/GPU)   Queue   Time           Training Size*
1     CIFAR-10   1/1                    gpus    0:03:54        1024 x 1 x 1000
2     CIFAR-10   1/2                    gpus    0:06:04        1024 x 2 x 1000
3     CIFAR-10   1/3                    gpus    0:08:11        1024 x 3 x 1000
4     CIFAR-10   1/4                    gpus    0:08:57        1024 x 4 x 1000
5     CIFAR-10   1/1                    gpus    0:06:27        1024 x 1 x 2000
6     CIFAR-10   1/2                    gpus    0:06:04        1024 x 2 x 1000
7     CIFAR-10   1/3                    gpus    0:06:04        1024 x 3 x 667
8     CIFAR-10   1/4                    gpus    0:05:59        1024 x 4 x 500
9     CIFAR-10   1/1                    gpus    0:01:28        8192 x 1 x 10
10    CIFAR-10   1/2                    gpus    0:01:56        4096 x 2 x 10
11    CIFAR-10   1/3                    gpus    0:02:24        2731 x 3 x 10
12    CIFAR-10   1/4                    gpus    0:03:05        2048 x 4 x 10
13    CIFAR-10   1/1                    gpus    OOM** (20GB)   81920 x 1 x 10
14    CIFAR-10   1/2                    gpus    OOM (20GB)     40960 x 2 x 10
15    CIFAR-10   1/3                    gpus    OOM (20GB)     27310 x 3 x 10
16    CIFAR-10   1/4                    gpus    OOM (20GB)     20480 x 4 x 10

* Training Size = (batch size per GPU) x (number of GPUs) x (training steps)
** OOM = Out-of-Memory
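For example, job 1 processes 1024 x 1 x 1000 = 1,024,000 images in total, while jobs 5 to 8 each process roughly the same total of about 2,048,000 images (e.g., 1024 x 1 x 2000 = 1024 x 4 x 500), only split differently across GPUs and training steps.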

The following conclusions can be drawn from this table:

(i) The CPU time consumed increases as the number of GPUs increases. As seen in jobs 1 to 4, although the number of GPUs increases, each GPU processes the same training size (1024 x 1000). The CPU spends more time moving data between the CPU and the GPUs, so the run time increases as the number of GPUs increases.

(ii) Each training step takes more time with more GPUs than with fewer GPUs. As a comparison, in jobs 5 to 8 each training step processes more images as the number of GPUs increases, while the total training size is kept fixed by decreasing the number of training steps. Yet the run time does not change significantly from job 5 to job 8. Therefore, the time spent transferring data between CPU and GPU dominates over the GPU compute time.

(iii) To examine the last conclusion further, in jobs 9 to 12 the number of training steps is fixed and the batch size is reduced as the number of GPUs increases, keeping the total training size the same. The run time increases with the number of GPUs: job 9 has the largest batch size and the shortest run time, while job 12 has the most GPUs and the longest run time. This again indicates that data transfer between CPU and GPU dominates over the GPU compute time.

(iv) Increasing the batch size per GPU is a trade-off between computing time and memory. According to the previous conclusions, increasing the batch size to maximize GPU utilization and reduce data transfer between CPU and GPU is a good strategy when running jobs on GPU nodes. However, the batch size cannot be increased arbitrarily, because an out-of-memory error may occur, as in jobs 13 to 16.

II. [Optional] Installing machine learning tools

This part explains the installation of TensorFlow, PyTorch, and Keras using Anaconda in a user's HOME directory, along with running examples for each tool.

1. Introduction to Anaconda

Sometimes high-performance computing (HPC) users need a specific version of a machine learning tool that is not installed system-wide. Using Anaconda, users can easily install machine learning tools in their HOME directory. Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. Package versions in Anaconda are managed by the package management system conda.

2. Anaconda installation

This method will install Anaconda in /mmfs1/thunder/home/[your user name] on Thunder and /mmfs1/home/[your user name] on Thunder Prime by default:
$ wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
$ bash Anaconda3-2019.07-Linux-x86_64.sh
---
Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
yes
---
$ source ~/.bashrc

3. Installing TensorFlow, PyTorch, and Keras

3.1 TensorFlow

#create TensorFlow-GPU environment and name it ‘tf-gpu’
$ conda create -n tf-gpu tensorflow-gpu 
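After the environment is created, a quick sanity check such as the following can be run on a GPU node inside the activated environment (conda activate tf-gpu) to confirm that TensorFlow is installed and can see a GPU. This is only a suggested check (for TensorFlow 2.x; older versions use different APIs), not part of the examples:

import tensorflow as tf

print(tf.__version__)                            # installed TensorFlow version
print(tf.config.list_physical_devices('GPU'))    # should list at least one GPU on a GPU node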

3.2 PyTorch

#create python environment and name it ‘pytorch1.7’
$ conda create -n pytorch1.7  pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch

3.3 Keras

#create keras environment with tensorflow2.3 and name it ‘keras2.3.1-TF2.3’
$ conda create -n keras2.3.1-TF2.3  Keras==2.3.1 tensorflow==2.3

#create keras environment with tensorflow gpu and name it ‘keras_gpu’
$ conda create -n keras_gpu  Keras==2.3.1 tensorflow-gpu

4. Running examples on Thunder/Thunder Prime with local environment

The jobs are the same as those using the system-wide installation; they are included in the ml_examples.tar.gz file in the /mmfs1/thunder/projects/ccastest/examples directory on Thunder and the /mmfs1/projects/ccastest/examples directory on Thunder Prime. Note that the local environment must be activated in the PBS script, which differs from the system-wide version and is shown in the PBS scripts of the three examples below.

4.1 Copy and decompress examples

Copy the example file ml_examples.tar.gz to your SCRATCH directory, if you haven't done so:
$ cp /mmfs1/thunder/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH/ml_examples.tar.gz (on Thunder).
$ cp /mmfs1/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH/ml_examples.tar.gz (on Thunder Prime).

Go to your SCRATCH directory and decompress the file:
$ cd $SCRATCH
$ tar -zvxf ml_examples.tar.gz 

4.2 Running TensorFlow examples

This example is a CIFAR-10 dataset-based convolutional neural network with 3 convolution layers, running on GPUs.

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/tf_localenv_GPU

Modify the tf_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q gpus
#PBS -N test
##keep ncpus=1
#PBS -l select=1:mem=10gb:ncpus=1:ngpus=2
#PBS -l walltime=03:00:00
##change "x-ccast-prj" to "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##replace "source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu" with "source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate tf-gpu"
##this is for activating local environment
##/mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu is for Thunder
##replace it with /mmfs1/home/xxx.xxx/anaconda3/bin/activate tf-gpu for Thunder Prime
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu

#pass $NGPUS to TensorFlow
#sed -i "10c num_gpus = '$NGPUS';" TensorFlow_GPU.py
# Because $NGPUS is not defined by PBS like $NCPUS is, we need
# to extract this value in a different way.
# Note: In Bash, $0 stores the path of the currently running script
NGPUS=$(grep -oP 'ngpus=\d+' $0 | cut -d= -f2)

# Also, code should be parameterized. Using text substitution to modify
# code at runtime is considered bad practice.
python tf_GPU.py $NGPUS

Submit the PBS script to the queue:
$ qsub tf_GPU.pbs  

The job output is written to a file named test.o[job ID] in the same directory as tf_GPU.pbs.

4.3 Running PyTorch examples


Example 1 (running on CPUs):

This example uses PyTorch tensors to fit a two-layer network to random data, running on CPUs.
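A minimal sketch of this classic exercise is shown below; it is illustrative only, and the pytorch.py in the example archive may differ:

import torch

# Random training data: 64 samples, 1000 input features, 10 outputs.
x = torch.randn(64, 1000)
y = torch.randn(64, 10)

# Two-layer fully connected network.
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for step in range(500):
    y_pred = model(x)            # forward pass
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()              # backward pass
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())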

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/pytorch_localenv

Modify the pytorch.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N test
#PBS -j oe
##keep select=1
#PBS -l select=1:mem=10gb:ncpus=4
#PBS -l walltime=01:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your sponsor's project group]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##replace "source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7" with "source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate pytorch1.7"
##this is for activating local environment
##/mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu is for Thunder
##replace it with /mmfs1/home/xxx.xxx/anaconda3/bin/activate tf-gpu for Thunder Prime
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7

python3  pytorch.py

Submit the PBS script to the queue:
$ qsub pytorch.pbs

The job output is written to a file named test.o[job ID] in the same directory as pytorch.pbs.

Example 2 (running on GPUs):

This example uses PyTorch tensors to fit a two-layer network to random data, running on a GPU.
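The GPU variant differs mainly in where the tensors and the model live; a sketch of the device-selection pattern (assuming CUDA is available on the GPU node):

import torch

# Use the GPU if one is visible, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Running on', device)

x = torch.randn(64, 1000, device=device)
y = torch.randn(64, 10, device=device)
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
).to(device)  # move the model parameters to the same device

# Training then proceeds exactly as in the CPU example above.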

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/pytorch_localenv_GPU

Modify the pytorch_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q gpus
#PBS -N test
#PBS -j oe
##keep select=1
#PBS -l select=1:mem=10gb:ncpus=2:ngpus=1
#PBS -l walltime=01:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your sponsor's project group]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##replace "source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7" with "source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate pytorch1.7"
##this is for activating local environment
##/mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu is for Thunder
##replace it with /mmfs1/home/xxx.xxx/anaconda3/bin/activate tf-gpu for Thunder Prime
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7

python3  pytorch_GPU.py

Submit the PBS script to the queue:
$ qsub pytorch_GPU.pbs

The job output is written to a file named test.o[job ID] in the same directory as pytorch_GPU.pbs.

4.4 Running Keras examples


Example 1 (running on CPUs):

This example is sequence-to-sequence learning for performing number addition, running on CPUs.

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/keras_localenv

Modify the keras.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q default
#PBS -N test 
#PBS -j oe
##change "ncpus" and "mem" if needed
#PBS -l select=1:mem=2gb:ncpus=2
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##replace "source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras2.3.1-TF2.3" with "source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate keras2.3.1-TF2.3"
##this is for activating local environment
##/mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu is for Thunder
##replace it with /mmfs1/home/xxx.xxx/anaconda3/bin/activate tf-gpu for Thunder Prime
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras2.3.1-TF2.3

python3 addition_rnn.py

Submit the PBS script to the queue:
$ qsub keras.pbs

The job output is written to a file named test.o[job ID] in the same directory as keras.pbs.

Example 2 (running on GPUs):

This example is sequence-to-sequence learning for performing number addition, running on a GPU.

Go to the example directory from your SCRATCH directory:
$ cd $SCRATCH/keras_localenv_GPU

Modify the keras_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):

#!/bin/bash
#PBS -q gpus
#PBS -N test
#PBS -j oe
##keep ncpus=1
#PBS -l select=1:mem=2gb:ncpus=1:ngpus=1
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj

cd ${PBS_O_WORKDIR}

##replace "source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras_gpu " with "source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate keras_gpu "
##this is for activating local environment
##/mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu is for Thunder
##replace it with /mmfs1/home/xxx.xxx/anaconda3/bin/activate tf-gpu for Thunder Prime
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras_gpu

python3 addition_rnn.py

Submit the PBS script to the queue:
$ qsub keras_GPU.pbs

The job output is written to a file named test.o[job ID] in the same directory as keras_GPU.pbs.

See Also: