Bioinformatics - MaSuRCA

Instructions on how to run (and, if needed, install a customized version of) MaSuRCA

MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches.

  1. Running MaSuRCA on Thunder
  2. Install Customized MaSuRCA on Thunder
Please refer to the CCAST User Guide and the article Running Bioinformatics Software on HPC Clusters for general information about using CCAST resources and running bioinformatics software on CCAST's HPC clusters.

1. Running MaSuRCA on Thunder

Example: Assemble single- or paired-end reads 

Location: /gpfs1/projects/ccastest/training/examples/MaSuRCA_example

File list:

· masurca_job.pbs: job submission script  

· frag_1.fastq: a set of sequences in fastq format

· frag_2.fastq: a set of sequences in fastq format

· config.txt: a configuration file to direct MaSuRCA to generate the desired bash script


Steps:

· Copy the example directory to your SCRATCH directory

o   cp -r /gpfs1/projects/ccastest/training/examples/MaSuRCA_example $SCRATCH

· Go to the copied directory

o   cd  $SCRATCH/MaSuRCA_example

· Edit the job submission script as needed, then submit the job

o    qsub masurca_job.pbs
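
The copy-and-change-directory steps above can be wrapped in a small helper that refuses to overwrite an earlier copy. This is only a sketch: the function name stage_example is hypothetical (it is not a CCAST-provided command), and it assumes $SCRATCH is set as in the steps above.

```shell
# Copy an example directory into a destination (e.g. $SCRATCH) and move into the copy.
# Refuses to overwrite an existing copy so earlier results are not clobbered.
# stage_example is a hypothetical helper name, not a CCAST-provided command.
stage_example() {
    local src="$1" dest_root="$2"
    local target="${dest_root}/$(basename "$src")"
    [ -d "$src" ] || { echo "source directory not found: $src" >&2; return 1; }
    [ ! -e "$target" ] || { echo "already exists: $target" >&2; return 1; }
    cp -r "$src" "$dest_root" && cd "$target"
}

# Intended use on Thunder (paths taken from the steps above):
#   stage_example /gpfs1/projects/ccastest/training/examples/MaSuRCA_example "$SCRATCH" \
#       && qsub masurca_job.pbs
```

Because the function changes directory on success, the job is submitted from inside the copied directory, as in the manual steps.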


2. Install Customized MaSuRCA on Thunder

Warning: This part is intended ONLY for those who want to install and test their own version in their HOME directory.


Summary

(a)    GCC 4.7 or higher is required. (The system GCC is 4.8.5, so no module load is needed.)

(b)   bzip2-devel is required for building (available via 'module load bzip2').

(c)    Other tools, such as jellyfish, are installed by the MaSuRCA installer itself (no module load needed).

(d)    NUM_THREADS must be set in the config file that is used in the first step of a run.
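
The GCC requirement in (a) can be checked before building. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available; the function name version_ge is hypothetical.

```shell
# Return success (0) when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort), which is an assumption about the system.
# version_ge is a hypothetical helper name.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a login node (the 4.7 minimum comes from the summary above):
#   version_ge "$(gcc -dumpversion)" 4.7 && echo "GCC is new enough"
```

The comparison sorts the two version strings and checks that the required minimum comes first, which avoids comparing dotted versions as plain numbers.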


Details


In the following steps, we assume that you want to install the software in a directory named “SOFTWARE” inside your HOME directory on CCAST’s Thunder cluster. “USERNAME” is your username on Thunder.


(a) Install

· Go to the SOFTWARE directory: 

o    cd /gpfs1/home/USERNAME/SOFTWARE

· Download and unzip: 

o    wget https://github.com/alekseyzimin/masurca/releases/download/3.3.2/MaSuRCA-3.3.2.tar.gz

o    tar -zxvf MaSuRCA-3.3.2.tar.gz

· Go to the MaSuRCA directory:

o    cd /gpfs1/home/USERNAME/SOFTWARE/MaSuRCA-3.3.2

· Load bzip2 module

o    module load bzip2

· Install MaSuRCA

o    ./install.sh
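
After install.sh finishes, it can be worth confirming that the masurca executable exists before moving on to the test. The sketch below assumes the executable ends up in the bin subdirectory of the MaSuRCA directory (the location added to PATH in the test job script); check_masurca_install is a hypothetical helper name.

```shell
# Report whether a MaSuRCA install directory contains an executable
# bin/masurca. check_masurca_install is a hypothetical helper name.
check_masurca_install() {
    local prefix="$1"
    if [ -x "${prefix}/bin/masurca" ]; then
        echo "found ${prefix}/bin/masurca"
    else
        echo "masurca not found under ${prefix}/bin" >&2
        return 1
    fi
}

# Intended use after ./install.sh finishes:
#   check_masurca_install /gpfs1/home/USERNAME/SOFTWARE/MaSuRCA-3.3.2
```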

(b) Test


MaSuRCA runs in two steps. The first step uses a configuration file to generate a shell script called assemble.sh. The second step executes that shell script to complete the actual assembly. The easiest approach is to copy the sample configuration file to the directory where you will run the assembly and then modify it.
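
The two-step flow described above can be sketched as a small wrapper that stops if the first step does not produce assemble.sh. This is an illustration only: it assumes the masurca command is already on your PATH, and the function name run_masurca_two_step is hypothetical.

```shell
# Run the two MaSuRCA steps in order and stop if the first step fails to
# produce assemble.sh. Assumes the masurca command is already on PATH.
# run_masurca_two_step is a hypothetical helper name.
run_masurca_two_step() {
    local config="$1"
    [ -f "$config" ] || { echo "config file not found: $config" >&2; return 1; }
    # Step 1: generate assemble.sh from the configuration file
    masurca "$config" || return 1
    [ -f assemble.sh ] || { echo "assemble.sh was not generated" >&2; return 1; }
    # Step 2: run the generated script to perform the assembly
    bash ./assemble.sh
}

# Intended use from the directory that holds the config file:
#   run_masurca_two_step sr_config_example.txt
```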


· Test in scratch directory

o    cd /gpfs1/scratch/USERNAME

· Make a directory for the test:

o    mkdir MaSuRCA_test

· Go into it:

o    cd MaSuRCA_test

· Download data and unzip:

o    wget http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_1.fastq.gz

o    wget http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/frag_2.fastq.gz

o    gunzip frag_1.fastq.gz frag_2.fastq.gz

· Write the config file:

o    Copy the template config file sr_config_example.txt:

o    cp /gpfs1/home/USERNAME/SOFTWARE/MaSuRCA-3.3.2/sr_config_example.txt .

· Modify sr_config_example.txt:

o    Specify the input reads (the files downloaded into the scratch test directory above):

o    PE= pe 180 27 /gpfs1/scratch/USERNAME/MaSuRCA_test/frag_1.fastq  /gpfs1/scratch/USERNAME/MaSuRCA_test/frag_2.fastq

· Ignore the JUMP entry by commenting it out:

o    #JUMP......

· Set the number of threads:

o    NUM_THREADS = 4

· Write the job script masurca_test.pbs (listed below) and submit it:

o    qsub masurca_test.pbs

---------------masurca_test.pbs-----------------

#!/bin/bash  

#PBS -q default  

#PBS -N MaSuRCA_test  

#PBS -l select=1:mem=20gb:ncpus=4  

#PBS -l walltime=10:00:00  

#PBS -W group_list=x-ccast-prj-[your project group name here]

cd $PBS_O_WORKDIR  

#Set path to your MaSuRCA binaries

export PATH=$PATH:/gpfs1/home/USERNAME/SOFTWARE/MaSuRCA-3.3.2/bin

masurca sr_config_example.txt 

./assemble.sh

exit 0
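
The manual edits to sr_config_example.txt described in the test steps (input reads, commenting out JUMP, thread count) could also be scripted. The sketch below is an assumption-laden illustration, not part of MaSuRCA: it assumes your copy of the template has lines beginning with "PE=", "JUMP=" and "NUM_THREADS", that GNU sed is available, and that the read paths contain no "|" or "&" characters. The function name patch_masurca_config is hypothetical.

```shell
# Patch a copy of the MaSuRCA config template in place.
# Assumption: the template contains lines starting with "PE=", "JUMP=" and
# "NUM_THREADS"; check your copy of sr_config_example.txt before relying on this.
# Uses GNU sed's -i option. patch_masurca_config is a hypothetical helper name.
patch_masurca_config() {
    local config="$1" reads1="$2" reads2="$3" threads="$4"
    sed -i \
        -e "s|^PE=.*|PE= pe 180 27 ${reads1} ${reads2}|" \
        -e 's|^JUMP=|#JUMP=|' \
        -e "s|^NUM_THREADS.*|NUM_THREADS = ${threads}|" \
        "$config"
}

# Intended use from the test directory (the 180 and 27 values are the ones
# used in the PE line of the test steps):
#   patch_masurca_config sr_config_example.txt \
#       "$PWD/frag_1.fastq" "$PWD/frag_2.fastq" 4
```

Keeping the thread count as an argument makes it easier to keep NUM_THREADS in step with the ncpus value requested in the job script.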
