Bioinformatics - MEGAHIT

Instructions on how to run (and, if needed, install a customized version of) MEGAHIT

MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well for generic single-genome assembly (small or mammalian size) and single-cell assembly.

  1. Running MEGAHIT on Thunder
  2. Install Customized MEGAHIT on Thunder
Please refer to the CCAST User Guide and the article Running Bioinformatics Software on HPC Clusters for general information about using CCAST resources and running bioinformatics software on CCAST's HPC clusters.

1. Running MEGAHIT on Thunder


Example: Assemble single- or paired-end reads 


Location: /gpfs1/projects/ccastest/training/examples/MEGAHIT_example


File list:

·  megahit_job.pbs: job submission script  

·  r3_1.fa: first reads of each pair, in FASTA format (passed to MEGAHIT with -1)

·  r3_2.fa: second reads of each pair, in FASTA format (passed to MEGAHIT with -2)
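
A quick sanity check before assembling paired-end data is to confirm that both files hold the same number of records. The sketch below assumes plain (uncompressed) FASTA input in which every record starts with ">". The function names count_fasta_records and check_pair are illustrative and not part of MEGAHIT.

```shell
#!/bin/bash
# Sketch: confirm that two paired-end FASTA files hold the same number of records.
# Assumes uncompressed FASTA; helper names are hypothetical.

# Print the number of records (header lines starting with ">") in a FASTA file.
count_fasta_records() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$1" || true
}

# Compare the record counts of two files; return non-zero on a mismatch.
check_pair() {
    n1=$(count_fasta_records "$1") || return 1
    n2=$(count_fasta_records "$2") || return 1
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 records in each file"
    else
        echo "MISMATCH: $1 has $n1 records, $2 has $n2"
        return 1
    fi
}

# Usage (from the example directory):
#   check_pair r3_1.fa r3_2.fa
```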


Steps:

· Copy the example directory to your SCRATCH directory

o   cp -r /gpfs1/projects/ccastest/training/examples/MEGAHIT_example $SCRATCH

· Go to the copied directory

o   cd  $SCRATCH/MEGAHIT_example

· Edit the job submission script as needed, then submit the job

o    qsub megahit_job.pbs
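
Once the job finishes, MEGAHIT writes the assembled contigs to a file named final.contigs.fa inside the directory given with -o. The following is a minimal sketch for summarizing that file. contig_summary is a hypothetical helper name, and OUTPUT_DIR is assumed to match the -o value in your copy of megahit_job.pbs.

```shell
#!/bin/bash
# Sketch: count contigs and total bases in a FASTA file of assembled contigs.
# contig_summary is an illustrative helper, not a MEGAHIT command.

contig_summary() {
    # Header lines start with ">"; every other line is sequence.
    awk '/^>/ { n++; next }
         { bp += length($0) }
         END { printf "contigs=%d total_bp=%d\n", n, bp }' "$1"
}

# Usage (OUTPUT_DIR is an assumption; use the -o value from your job script):
#   contig_summary OUTPUT_DIR/final.contigs.fa
```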


2. Install Customized MEGAHIT on Thunder

Warning: This part is intended ONLY for those who want to install and test their own version in their HOME directory.

Summary

(a)    For building: zlib (installed; check with "ldconfig -p | grep libz"), cmake >= 2.8 (CCAST provides 2.8.12.2), g++ >= 4.8.4 (CCAST provides 4.8.5).

(b)    For running: gzip (installed) and bzip2 (installed).

(c)    For self-testing: Python 3 (available through "module load").

(d)    Use the "-t" option to set the number of threads.
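
The requirements above can be checked from a login shell before building. This is a rough sketch only: version_ge and report are illustrative helper names, the version comparison relies on GNU "sort -V", and the way each tool prints its version may differ between releases.

```shell
#!/bin/bash
# Sketch: check the build and run prerequisites listed in the summary.
# version_ge and report are hypothetical helper names.

# Succeeds if version $1 >= version $2 (relies on GNU "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print one status line: tool name, version found, minimum version required.
report() {
    if [ -n "$2" ] && version_ge "$2" "$3"; then
        echo "OK       $1 $2 (need >= $3)"
    else
        echo "MISSING  $1 >= $3 (found: ${2:-none})"
    fi
}

report cmake "$(cmake --version 2>/dev/null | awk 'NR == 1 {print $3}')" 2.8
report g++ "$(g++ -dumpversion 2>/dev/null)" 4.8.4

# gzip and bzip2 only need to be on the PATH at run time.
for tool in gzip bzip2; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK       $tool"
    else
        echo "MISSING  $tool"
    fi
done
```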


Details


In the following steps, we assume that you want to install the software in a directory named "SOFTWARE" inside your HOME directory on CCAST's Thunder cluster. "USERNAME" is your username on Thunder.


(a) Install

· Go to your software directory:

o    "cd /gpfs1/home/USERNAME/SOFTWARE"

· Clone the MEGAHIT repository:

o    "git clone https://github.com/voutcn/megahit.git"

· Go to the MEGAHIT directory and update the submodule:  

o    "cd megahit"

o    "git submodule update --init"

· Create a build directory and go into it:

o    "mkdir build && cd build"

· Build:

o    "cmake .. -DCMAKE_BUILD_TYPE=Release"

o    "make -j4"

· Run the self-test (requires Python 3):

o    "module load python/3.4.3-gcc"

o    "make simple_test"
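
For repeated installs, the steps in (a) can be collected into one function. The sketch below assumes the SOFTWARE directory already exists and that $HOME resolves to /gpfs1/home/USERNAME. build_megahit is a hypothetical name. The body runs in a subshell so your current directory is left unchanged, and the "&&" chain stops at the first failing step.

```shell
#!/bin/bash
# Sketch: the install steps from section (a) as a single function.
# build_megahit is an illustrative name; the default path is an assumption.

build_megahit() (
    software_dir=${1:-"$HOME/SOFTWARE"}
    cd "$software_dir" &&                                # software directory
    git clone https://github.com/voutcn/megahit.git &&   # fetch the sources
    cd megahit &&
    git submodule update --init &&                       # fetch submodules
    mkdir -p build && cd build &&                        # out-of-source build
    cmake .. -DCMAKE_BUILD_TYPE=Release &&
    make -j4 &&
    module load python/3.4.3-gcc &&                      # self-test needs Python 3
    make simple_test
)

# Usage:
#   build_megahit                       # installs under $HOME/SOFTWARE
#   build_megahit /path/to/other/dir    # or under a directory of your choice
```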

(b) Test

· Make a test directory and go into it:

o    "cd /gpfs1/scratch/USERNAME"

o    "mkdir Megahit_example"

o    "cd Megahit_example"

· Copy the two paired-end read files from the bundled test data to the current directory:

o    "cp /gpfs1/home/USERNAME/SOFTWARE/megahit/test_data/r3* ."

· Write the job script megahit_job.pbs (shown below) and submit it:

o    "qsub megahit_job.pbs"

------------------------------------------- file megahit_job.pbs -------------------------------------------

#!/bin/bash

#PBS -q default

#PBS -N test

## MEGAHIT does not run across multiple nodes, so keep select=1

##change mem, ncpus, and walltime as needed:

#PBS -l select=1:mem=10gb:ncpus=4

#PBS -l walltime=02:00:00

## Replace “x-ccast-prj” with “x-ccast-prj-[your project group name here]”

#PBS -W group_list=x-ccast-prj

 

cd $PBS_O_WORKDIR

# Set path to MEGAHIT binaries

export MY_MEGAHIT=/gpfs1/home/USERNAME/SOFTWARE/megahit/build

$MY_MEGAHIT/megahit -1 r3_1.fa -2 r3_2.fa -t $NCPUS -o OUTPUT_DIR

 

exit 0
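
MEGAHIT normally stops with an error when the directory passed to -o already exists, so resubmitting the same script unchanged can fail. One possible workaround, sketched here under that assumption, is to pick an unused directory name before calling megahit. fresh_outdir is a hypothetical helper, not a MEGAHIT option.

```shell
#!/bin/bash
# Sketch: choose an output directory name that does not exist yet.
# fresh_outdir is an illustrative helper name.

fresh_outdir() {
    base=$1
    out=$base
    n=1
    # Append _1, _2, ... until the name is unused.
    while [ -e "$out" ]; do
        out="${base}_$n"
        n=$((n + 1))
    done
    printf '%s\n' "$out"
}

# Usage inside the job script (replaces the fixed "-o OUTPUT_DIR"):
#   $MY_MEGAHIT/megahit -1 r3_1.fa -2 r3_2.fa -t $NCPUS -o "$(fresh_outdir OUTPUT_DIR)"
```

The function only prints a name and does not create the directory, so MEGAHIT can still create it itself.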

 
