CCAST User Guide
1. Introduction, Context, and Qualifications
The Center for Computationally Assisted Science and Technology (CCAST; pronounced "c-cast") provides advanced cyberinfrastructure for research and education at NDSU and beyond. CCAST develops, manages, brokers, and operates high-performance (HPC), cloud, and interactive computing resources, and educates researchers on proper and efficient use of the resources and on other topics of interest to the computational science and engineering community.
We use UNIX/Linux primarily. The basic level of services is FREE of charge to NDSU faculty, staff, and students as well as certain external collaborators (upon approval of CCAST's Executive Director). Additional services are available for a fee.
1.1 Acknowledging CCAST
Users are required to include the following statement (or a close variant) in all research outputs (papers, presentations, theses, etc.) that have used CCAST resources: "This work used resources of the Center for Computationally Assisted Science and Technology (CCAST) at North Dakota State University, which were made possible in part by NSF MRI Award No. 2019077."
The wording is subject to change; e.g., when we need to acknowledge specific funding sources that support certain CCAST resources. Please check the welcome message that appears when you log in to CCAST systems for the most accurate acknowledgment statement.
1.2 Reporting requirements
Users, usually through their Principal Investigators (PIs; i.e., sponsors of their CCAST accounts), are required to report any research outputs and activities that have been enabled by the use of CCAST resources. Reporting items often include publications, presentations, grant applications, patents, theses, etc.
1.3 CCAST usage policies
Users are required to carefully read and comply with CCAST Usage Policies.
1.4 How can you get help?
Read this User Guide carefully and check the CCAST website and related Knowledge Base articles before contacting us. If you still cannot find answers to your questions, send an e-mail to ndsu.ccast.support@ndsu.edu. In the e-mail, describe the issues, clearly state your questions, and provide a copy of the error messages and job submission script, the IDs of your failed jobs, the name of the code, and any other information (including input and output files) that may help debug the issues. Please do not contact individual CCAST staff directly for technical support, as this bypasses our tracking system, which is designed to avoid dropped requests.
1.5 About this document
This document is updated often because hardware specifications, system administration practices, usage policies, etc. are subject to change.
2. Getting Started
2.1 Applying for an account
To access Thunder and Thunder Prime, the two HPC clusters at CCAST, you need an active account with us. Please apply for a CCAST account if you have not already done so. A link to the online application form is available on the CCAST website.
2.2 Connecting to CCAST's HPC clusters
See Logging into CCAST.
2.3 Transferring files
Between a Windows computer and Thunder/Thunder Prime: Use the WinSCP client. Download it (for free), install it, and open the application. In the "WinSCP Login" window, enter the hostname thunder.ccast.ndsu.edu (Thunder) or prime.ccast.ndsu.edu (Thunder Prime) as well as your username and password, then click on "Login". Once logged in, you will see a screen with two panels: the left shows files on your computer and the right shows your files on Thunder or Thunder Prime (usually your HOME directory, but you can double-click on the address bar and change the location). You can then drag and drop files between your computer and Thunder or Thunder Prime.
Between a Mac/Linux computer and Thunder/Thunder Prime: Use the scp command.
To transfer files from Thunder/Thunder Prime to your computer: scp [username@hostname]:[source-file] [destination]
Example (for Thunder Prime): scp username@prime.ccast.ndsu.edu:/mmfs1/home/username/myfile.txt /home/mycomputer/myfile.txt
To transfer files from your computer to Thunder/Thunder Prime: scp [source-file] [username@hostname]:[destination]
Example (for Thunder Prime): scp myfile.txt username@prime.ccast.ndsu.edu:/mmfs1/home/username
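Whole directories can be copied with the same command by adding the -r (recursive) option. The following is a minimal sketch, not an official CCAST recipe: the username, the directory names (results, inputs), and the helper function remote_spec are placeholders made up for illustration. The commands are printed rather than executed so that they can be checked before use.

```shell
#!/bin/bash
# Placeholder values; replace them with your own username and cluster.
CCAST_USER="username"
CCAST_HOST="prime.ccast.ndsu.edu"   # or thunder.ccast.ndsu.edu

# Build the "user@host:path" form that scp expects for the remote side.
remote_spec() {
    printf '%s@%s:%s' "$CCAST_USER" "$CCAST_HOST" "$1"
}

# Print (rather than run) the commands so they can be reviewed first.
# Cluster -> local computer: copy a whole directory.
echo scp -r "$(remote_spec /mmfs1/scratch/username/results)" ./results
# Local computer -> cluster: copy a whole directory.
echo scp -r ./inputs "$(remote_spec /mmfs1/scratch/username)"
```

Remove the leading echo from a line to actually perform that transfer.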
2.4 Learning UNIX/Linux and HPC
Users are strongly encouraged to attend the CCAST Advanced Research Computing Training Program, offered every Fall and Spring semester, as well as other special training events. Specialized training for individual researchers/research groups is also available. Contact CCAST for more information.
There are also many free training materials available on the Internet. See also the CCAST Reference Card for a list of the most useful Linux commands and tricks. Tutorials for certain applications on Thunder/Thunder Prime can be found in our Knowledge Base articles.
3. Research Computing Resources
3.1 Hardware
CCAST’s Thunder and Thunder Prime clusters currently have over 12,000 CPU cores and 70 GPUs combined. There are several big-memory nodes on each cluster. To check which nodes are currently free or partially free on Thunder or Thunder Prime, execute the command freenodes (run freenodes --help to see all available options). This information will help you make the right choice when you request computing resources for your jobs.
3.2 Software
There are many software programs installed on Thunder and Thunder Prime. Most are available to all CCAST users; some (e.g., ANSYS and VASP) are available only to those who have valid licenses and to other authorized users. Software is usually organized as modules; to check available modules, execute module avail. You can also install software for yourself. Contact CCAST at ndsu.ccast.support@ndsu.edu if you need help.
3.3 Storage space
Once logged in, you are in your HOME directory: /mmfs1/thunder/home/username (Thunder) or /mmfs1/home/username (Thunder Prime). Data in HOME is backed up periodically to tape, so it is a reliable storage area. However, do NOT use your HOME directory for data or job input/output. Running jobs out of HOME is prohibited because it degrades interactive use and other important tasks.
Each research group usually has a PROJECTS directory; the full path is /mmfs1/thunder/projects/PI-username (Thunder) or /mmfs1/projects/PI-username (Thunder Prime), where PI-username is the username of the Principal Investigator (PI). This area has a larger storage space and is backed up periodically to tape. All researchers working under the PI can store and share data in this space.
Backup practice: CCAST runs backups of HOME and PROJECTS data regularly. Contact CCAST for more details.
Each regular user has a SCRATCH directory: /mmfs1/thunder/scratch/username (Thunder) or /mmfs1/scratch/username (Thunder Prime). It is designed as a place for working directories for jobs. Please submit your jobs from this directory. Note that SCRATCH data is NOT backed up, and the systems are currently set up to automatically DELETE files in SCRATCH that are 60 days old.
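Because old SCRATCH files are deleted automatically, it can help to list files that are approaching the limit and copy anything important to PROJECTS. The sketch below uses the standard find command and rests on two assumptions that this guide does not state: that file age is judged by modification time, and that 50 days is a sensible early-warning threshold. The function name list_old_files is made up for illustration.

```shell
#!/bin/bash
# List regular files under a directory whose modification time lies more
# than the given number of days in the past.
# $1 = directory to scan, $2 = age threshold in days.
list_old_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example: files in SCRATCH older than 50 days, i.e. candidates to copy
# elsewhere before they reach the 60-day limit.
# The path is the Thunder Prime form; adjust it for Thunder.
scratch_dir="/mmfs1/scratch/${USER:-username}"
if [ -d "$scratch_dir" ]; then
    list_old_files "$scratch_dir" 50
fi
```

Contact CCAST if you need to know exactly how file age is measured by the cleanup process.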
Contact CCAST if your research group really needs more storage space beyond the basic level.
3.4 Compute Condominium
Researchers can purchase condo nodes using equipment purchase funds from their grants or other available funds. These PI-owned compute nodes are attached to CCAST’s Thunder Prime cluster to take advantage of the existing infrastructure. Contact CCAST if you have questions regarding the condominium model.
4. Running Jobs
Once you have logged in to a CCAST HPC cluster, you are on one of its login nodes. Login nodes have limited resources and are intended only for basic tasks such as transferring data, managing files, compiling software, editing scripts, and checking on or managing jobs. DO NOT run your jobs on the login nodes!
Jobs must be submitted to a queue system, which is monitored by a job scheduler, using a job script. The job scheduler currently used on the Thunder and Thunder Prime clusters is OpenPBS. The scheduler handles job submission requests and assigns jobs to specific compute nodes available at the time.
To be able to run your jobs and run them efficiently, you need to have some basic knowledge of the application you are using. This includes whether the application is serial (i.e., runs on only one CPU core) or parallel (i.e., can run on multiple CPU cores). If it is parallel, what is the underlying parallel programming model: shared-memory (e.g., using OpenMP, Pthreads, etc.), distributed-memory (e.g., using MPI), or hybrid? You need such information to determine how you would like to request resources for your jobs.
4.1 Sample input files and job scripts
If you are new to running jobs on Thunder and/or Thunder Prime, or if it has been a while since the last time you ran an application, it is highly recommended that you first run some of the sample jobs we provide before running your own jobs. Users can copy sample input files and job scripts for various applications from /mmfs1/thunder/projects/ccastest/examples (Thunder) or /mmfs1/projects/ccastest/examples (Thunder Prime). More job examples for more applications will be added as they become available. Please check this directory frequently for the latest version of the job scripts.
A job submission script (also referred to as a "PBS job script" or "PBS script") to run a serial job is given below as an example:
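#!/bin/bash
# Submit to the default route queue and name the job "test"
#PBS -q default
#PBS -N test
# Request one chunk with 1 CPU core and 1 GB of memory for up to 8 hours
#PBS -l select=1:mem=1gb:ncpus=1
#PBS -l walltime=08:00:00
# Run the job under your project group
#PBS -W group_list=x-ccast-prj-<prjname>
# Start in the directory from which the job was submitted
cd "$PBS_O_WORKDIR"
./my-serial-program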
For any job script, you need to replace prjname with your project group name. If you do not know your prjname, execute the command id or groups on Thunder or Thunder Prime and look for the name x-ccast-prj-...
Also, if you are not sure how to select a value for mem, set it to M*ncpus, where M is 1gb or 2gb. Keep in mind that CCAST resources are shared among many users. Only request what you actually need.
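The two lookups described above (finding your project group and sizing mem as M*ncpus) can be sketched in a few lines of shell. This is an illustration only: the function names filter_project_groups and select_line are made up, and the 4-core, 2gb-per-core request is an arbitrary example rather than a recommendation.

```shell
#!/bin/bash
# Keep only group names that start with the x-ccast-prj- prefix.
# Reads whitespace-separated group names on stdin, prints one match per line.
filter_project_groups() {
    tr -s ' \t' '\n' | grep '^x-ccast-prj-' || true
}

# Build the resource request for a single-node job using mem = M * ncpus.
# $1 = number of CPU cores, $2 = memory per core in gb (1 or 2 per this guide).
select_line() {
    local ncpus=$1 mem_per_core=$2
    echo "#PBS -l select=1:mem=$((ncpus * mem_per_core))gb:ncpus=${ncpus}"
}

id -Gn | filter_project_groups     # your project group(s), if any
select_line 4 2                    # prints: #PBS -l select=1:mem=8gb:ncpus=4
```

Run on a machine where you have no CCAST project group, the first command simply prints nothing.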
A PBS job script is simply a text file in your working directory. The easiest way to create the file is to copy an appropriate sample PBS job script from /mmfs1/thunder/projects/ccastest/examples (Thunder) or /mmfs1/projects/ccastest/examples (Thunder Prime) and then modify it as needed using a text editor such as nano (for novice Linux users), emacs, or vi (for more experienced users). See also the PBS Cheat Sheet.
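As an illustration of how the serial script in this section might be adapted for a shared-memory (OpenMP) application, the sketch below requests several cores on one node and sets the thread count to match. It is not one of the official CCAST samples: the job name, the 4-core/4gb request, the walltime, and the program name my-openmp-program are all assumptions chosen for the example. Prefer the sample scripts in the examples directory when one exists for your application.

```shell
#!/bin/bash
#PBS -q default
#PBS -N omp-test
#PBS -l select=1:mem=4gb:ncpus=4
#PBS -l walltime=01:00:00
#PBS -W group_list=x-ccast-prj-<prjname>

# Run from the directory where qsub was invoked (fall back to the current
# directory when the script is executed outside PBS for a quick check).
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Match the OpenMP thread count to the ncpus value requested above.
export OMP_NUM_THREADS=4

# my-openmp-program is a hypothetical executable name; the guard keeps the
# sketch harmless in a directory where no such program exists.
if [ -x ./my-openmp-program ]; then
    ./my-openmp-program
fi
```

The key design point is that ncpus and OMP_NUM_THREADS agree, so the program neither oversubscribes the cores it was given nor leaves requested cores idle.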
4.2 Queue policies on Thunder and Thunder Prime
The different types of queues on Thunder are listed below. Users can also find information about the queues on Thunder or Thunder Prime by executing qstat -q.
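Route Queue | Execution Queue | Walltime (hours) | Authorized Group |
default | def-short | 24 | users who belong to a project group other than x-ccast-prj-training |
default | def-medium | 72 | users who belong to a project group other than x-ccast-prj-training |
default | def-long | 168 | users who belong to a project group other than x-ccast-prj-training |
gpus | | 168 | users who belong to a project group other than x-ccast-prj-training |
preemptible | | - | users who belong to a project group other than x-ccast-prj-training |
bigmem | bm-short | 24 | users who belong to a project group other than x-ccast-prj-training |
bigmem | bm-long | 168 | users who belong to a project group other than x-ccast-prj-training |
training | | 24 | users in the project group x-ccast-prj-training |
condo01, condo02, etc. | | - | authorized condo users only |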
4.3 Launching and monitoring jobs
After preparing a suitable job script (named job.pbs, for instance; see Sec. 4.1), you can submit the job by typing qsub job.pbs. This places your job in the queue. Depending on the available resources, it may or may not start immediately. To check the status of your job(s), type qstat -u $USER. If you want to kill a job, use the command qdel ID, where ID is the ID of the job you want to kill. For more useful PBS commands and options, see the PBS Cheat Sheet.
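The submit, check, and kill steps can be combined into a short script. The sketch below assumes that qsub prints the new job's identifier in the usual PBS form of a number followed by a server name (for example 12345.pbsserver, a made-up value), and the helper job_number is hypothetical. The scheduler commands run only when the PBS client tools and a file named job.pbs are present.

```shell
#!/bin/bash
# Extract the numeric part of a PBS job identifier such as "12345.pbsserver".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Only talk to the scheduler when the PBS client tools are available.
if command -v qsub >/dev/null 2>&1 && [ -f job.pbs ]; then
    jobid=$(qsub job.pbs)          # qsub prints the new job's identifier
    echo "Submitted job $(job_number "$jobid")"
    qstat -u "$USER"               # list your queued and running jobs
    qstat -f "$jobid"              # full details for this one job
    # qdel "$jobid"                # uncomment to cancel the job
fi
```

Saving the identifier in a variable avoids copying it by hand when checking on or cancelling the job later.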
4.4 How to get your work done faster?
If you use software packages developed by others, be mindful of the parameters used in your input files. A small tuning of the parameters can significantly improve computational efficiency. If you write and run your own code, see if it can be optimized to make it run faster or parallelize it if it is not yet parallel.
When running parallel jobs, a question arises: How many cores/nodes should you request for the jobs? Note: the requested resources in the sample PBS job scripts we provide are not optimized for your jobs! Also note that, if you want to get your jobs done faster, simply adding a lot more cores/nodes is rarely the answer! You should do some scaling tests to identify the optimal number of cores/nodes for your jobs.
When you have many similar parallel jobs, we recommend that you run the first few jobs with different numbers of CPU cores. By looking at the computing time needed to finish the jobs vs. the number of cores, you will have a good idea of how many cores to choose for the remaining jobs. Contact CCAST for help with improving your job efficiency and speeding up your research process.
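One common way to read the results of such a scaling test is to compute the speedup (baseline time divided by the new time) and the parallel efficiency (speedup divided by the increase in core count). The sketch below does this with awk. The function name scaling and all timings are hypothetical values made up for illustration, not measurements from Thunder or Thunder Prime.

```shell
#!/bin/bash
# Print speedup and parallel efficiency relative to a baseline run.
# $1 = baseline walltime (s), $2 = baseline cores,
# $3 = walltime of the larger run (s), $4 = cores of the larger run.
scaling() {
    awk -v t1="$1" -v n1="$2" -v t2="$3" -v n2="$4" 'BEGIN {
        speedup = t1 / t2
        efficiency = speedup / (n2 / n1)
        printf "cores=%d speedup=%.2f efficiency=%.2f\n", n2, speedup, efficiency
    }'
}

# Hypothetical timings from three test runs of the same job,
# each compared with a 1000-second run on 1 core.
scaling 1000 1 520 2
scaling 1000 1 290 4
scaling 1000 1 210 8
```

When the efficiency falls well below 1 as cores are added, the extra cores are mostly idle, and a smaller request is likely the better choice for the remaining jobs.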