1. Introduction to Anaconda
High-performance computing (HPC) users sometimes need a specific version of a machine learning tool that is not installed system-wide. With Anaconda, users can easily install such tools in their HOME directory. Anaconda is a free, open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. Package versions in Anaconda are managed by the conda package manager.
2. Anaconda installation
This method will install Anaconda in /mmfs1/thunder/home/[your user name] on Thunder and /mmfs1/home/[your user name] on Thunder Prime by default:
$ wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
$ bash Anaconda3-2019.07-Linux-x86_64.sh
When the installer asks:
---
Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
---
type yes. Then reload your shell configuration:
$ source ~/.bashrc
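After reloading the shell, you can confirm that conda is on your PATH and see which environments exist. These commands are run on the cluster login node; exact output varies by installation:

```shell
# Confirm the conda installation and report its version
conda --version

# List available environments; "base" is created by default
conda env list
```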
3. Installing TensorFlow, PyTorch, and Keras
3.1 TensorFlow
#create TensorFlow-GPU environment and name it ‘tf-gpu’
$ conda create -n tf-gpu tensorflow-gpu
3.2 PyTorch
#create python environment and name it ‘pytorch1.7’
$ conda create -n pytorch1.7 pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
3.3 Keras
#create keras environment with tensorflow2.3 and name it ‘keras2.3.1-TF2.3’
$ conda create -n keras2.3.1-TF2.3 Keras==2.3.1 tensorflow==2.3
#create keras environment with tensorflow gpu and name it ‘keras_gpu’
$ conda create -n keras_gpu Keras==2.3.1 tensorflow-gpu
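Once an environment has been created, it can be activated and checked before being used in a job. A sketch using the tf-gpu environment from section 3.1 (run on the cluster; the import check assumes TensorFlow installed successfully):

```shell
# Activate the environment created in section 3.1
source ~/anaconda3/bin/activate tf-gpu

# Verify the TensorFlow version inside the environment
python -c "import tensorflow as tf; print(tf.__version__)"

# Leave the environment when done
conda deactivate
```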
4. Running examples on Thunder/Thunder Prime with local environment
These jobs are the same as the system-wide-version jobs included in the ml_examples.tar.gz file, located in the /mmfs1/thunder/projects/ccastest/examples directory on Thunder and the /mmfs1/projects/ccastest/examples directory on Thunder Prime. Note that, unlike with the system-wide version, the local environment must be activated inside the PBS script, as shown in the PBS scripts of the three examples below.
4.1 Copy and decompress examples
Copy the example file ml_examples.tar.gz to your SCRATCH directory, if you haven't done so:
$ cp /mmfs1/thunder/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH/ml_examples.tar.gz (on Thunder).
$ cp /mmfs1/projects/ccastest/examples/ml_examples.tar.gz $SCRATCH/ml_examples.tar.gz (on Thunder Prime).
Go to your SCRATCH directory and decompress the file:
$ cd $SCRATCH
$ tar -zvxf ml_examples.tar.gz
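The tar flags used above are -z (gzip compression), -v (verbose), -x (extract), and -f (archive file). A minimal self-contained sketch of creating and extracting an archive with the same flags (the file and directory names here are illustrative):

```shell
# Create a small directory and archive it with gzip compression
mkdir -p demo_src
echo "hello" > demo_src/file.txt
tar -zcf demo.tar.gz demo_src

# Remove the original, then extract it back with the same flags as above
rm -rf demo_src
tar -zvxf demo.tar.gz

# The extracted file is restored
cat demo_src/file.txt
```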
4.2 Running TensorFlow examples
This example is a convolutional neural network with 3 convolution layers, trained on the CIFAR-10 dataset and running on GPU.
Get into example directory from your SCRATCH directory:
$ cd $SCRATCH/tf_localenv_GPU
Modify the tf_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):
#!/bin/bash
#PBS -q gpus
#PBS -N test
##keep ncpus=1
#PBS -l select=1:mem=10gb:ncpus=1:ngpus=2
#PBS -l walltime=03:00:00
##change "x-ccast-prj" to "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj
cd ${PBS_O_WORKDIR}
##activate the local environment
##replace "xxx.xxx" with your user name
##on Thunder, use: source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate tf-gpu
##on Thunder Prime, use: source /mmfs1/home/[your user name]/anaconda3/bin/activate tf-gpu
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate tf-gpu
#pass $NGPUS to TensorFlow
#sed -i "10c num_gpus = '$NGPUS';" TensorFlow_GPU.py
# Because $NGPUS is not defined by PBS like $NCPUS is, we need
# to extract this value in a different way.
# Note: In Bash, $0 stores the path of the currently running script
NGPUS=$(grep -oP 'ngpus=\d+' $0 | cut -d= -f2)
# Also, code should be parameterized. Using text substitution to modify
# code at runtime is considered bad practice.
python tf_GPU.py $NGPUS
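The grep/cut pipeline in the script above pulls the ngpus value out of the script's own #PBS -l line (read via $0). Its behavior can be checked in isolation (requires GNU grep for the -P flag; the sample line below mirrors the resource request above):

```shell
# A sample PBS resource-request line like the one in the script
line='#PBS -l select=1:mem=10gb:ncpus=1:ngpus=2'

# Extract "ngpus=2" with a Perl-compatible regex, then keep only the number
NGPUS=$(echo "$line" | grep -oP 'ngpus=\d+' | cut -d= -f2)
echo "$NGPUS"
```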
Submit the PBS script to the queue:
$ qsub tf_GPU.pbs
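After submission, job status can be monitored with standard PBS commands (run on the cluster; qstat output format may vary by PBS version):

```shell
# Show your queued and running jobs
qstat -u $USER

# Show detailed information for a specific job (replace [job ID])
qstat -f [job ID]
```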
The job output is written to a file named test.o[job ID] in the same directory as tf_GPU.pbs.
4.3 Running PyTorch examples
Example 1 (running on CPUs):
This example is using PyTorch Tensors to fit a two-layer network to random data running on CPU.
Get into example directory from your SCRATCH directory:
$ cd $SCRATCH/pytorch_localenv
Modify the pytorch.pbs file as needed (using a text editor such as vi, nano, or emacs):
#!/bin/bash
#PBS -q default
#PBS -N test
#PBS -j oe
##keep select=1
#PBS -l select=1:mem=10gb:ncpus=4
#PBS -l walltime=01:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your sponsor's project group]"
#PBS -W group_list=x-ccast-prj
cd ${PBS_O_WORKDIR}
##activate the local environment
##replace "xxx.xxx" with your user name
##on Thunder, use: source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate pytorch1.7
##on Thunder Prime, use: source /mmfs1/home/[your user name]/anaconda3/bin/activate pytorch1.7
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7
python3 pytorch.py
Submit the PBS script to the queue:
$ qsub pytorch.pbs
The job output is written to a file named test.o[job ID] in the same directory as pytorch.pbs.
Example 2 (running on GPUs):
This example is using PyTorch Tensors to fit a two-layer network to random data running on GPU.
Get into example directory from your SCRATCH directory:
$ cd $SCRATCH/pytorch_localenv_GPU
Modify the pytorch_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):
#!/bin/bash
#PBS -q gpus
#PBS -N test
#PBS -j oe
##keep select=1
#PBS -l select=1:mem=10gb:ncpus=2:ngpus=1
#PBS -l walltime=01:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your sponsor's project group]"
#PBS -W group_list=x-ccast-prj
cd ${PBS_O_WORKDIR}
##activate the local environment
##replace "xxx.xxx" with your user name
##on Thunder, use: source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate pytorch1.7
##on Thunder Prime, use: source /mmfs1/home/[your user name]/anaconda3/bin/activate pytorch1.7
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate pytorch1.7
python3 pytorch_GPU.py
Submit the PBS script to the queue:
$ qsub pytorch_GPU.pbs
The job output is written to a file named test.o[job ID] in the same directory as pytorch_GPU.pbs.
4.4 Running Keras examples
Example 1 (running on CPUs):
This example is sequence to sequence learning for performing number addition running on CPU.
Get into example directory from your SCRATCH directory:
$ cd $SCRATCH/keras_localenv
Modify the keras.pbs file as needed (using a text editor such as vi, nano, or emacs):
#!/bin/bash
#PBS -q default
#PBS -N test
#PBS -j oe
##change "ncpus" and "mem" if needed
#PBS -l select=1:mem=2gb:ncpus=2
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj
cd ${PBS_O_WORKDIR}
##activate the local environment
##replace "xxx.xxx" with your user name
##on Thunder, use: source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate keras2.3.1-TF2.3
##on Thunder Prime, use: source /mmfs1/home/[your user name]/anaconda3/bin/activate keras2.3.1-TF2.3
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras2.3.1-TF2.3
python3 addition_rnn.py
Submit the PBS script to the queue:
$ qsub keras.pbs
The job output is written to a file named test.o[job ID] in the same directory as keras.pbs.
Example 2 (running on GPUs):
This example is sequence to sequence learning for performing number addition running on GPU.
Get into example directory from your SCRATCH directory:
$ cd $SCRATCH/keras_localenv_GPU
Modify the keras_GPU.pbs file as needed (using a text editor such as vi, nano, or emacs):
#!/bin/bash
#PBS -q gpus
#PBS -N test
#PBS -j oe
##keep ncpus=1
#PBS -l select=1:mem=2gb:ncpus=1:ngpus=1
#PBS -l walltime=08:00:00
##replace "x-ccast-prj" below with "x-ccast-prj-[your project group name]"
#PBS -W group_list=x-ccast-prj
cd ${PBS_O_WORKDIR}
##activate the local environment
##replace "xxx.xxx" with your user name
##on Thunder, use: source /mmfs1/thunder/home/[your user name]/anaconda3/bin/activate keras_gpu
##on Thunder Prime, use: source /mmfs1/home/[your user name]/anaconda3/bin/activate keras_gpu
source /mmfs1/thunder/home/xxx.xxx/anaconda3/bin/activate keras_gpu
python3 addition_rnn.py
Submit the PBS script to the queue:
$ qsub keras_GPU.pbs
The job output is written to a file named test.o[job ID] in the same directory as keras_GPU.pbs.