Bioinformatics - RepeatMasker

Instructions on how to run (and, if needed, install a customized version of) RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences.

  1. Running RepeatMasker on Thunder
  2. Install Customized RepeatMasker on Thunder

Please refer to the CCAST User Guide and the article Running Bioinformatics Software on HPC Clusters for general information about using CCAST resources and running bioinformatics software on CCAST's HPC clusters.

1. Running RepeatMasker on Thunder


Example: screen sequences and find repeats


Location: /gpfs1/projects/ccastest/training/examples/RepeatMasker_example


File list

· repeatmasker_job.pbs: job submission script  

· simple-good-large.fa: sequences to be screened in fasta format


Steps

· Copy example directory to your SCRATCH directory

o    cp -r /gpfs1/projects/ccastest/training/examples/RepeatMasker_example $SCRATCH

· Go to the copied directory

o    cd  $SCRATCH/RepeatMasker_example

· Edit the job submission script as needed, then submit the job

o    qsub repeatmasker_job.pbs
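
The steps above can be combined into one shell function. This is a minimal sketch, not part of the CCAST example: the function name run_repeatmasker_example and the EXAMPLE_SRC variable are hypothetical, and it assumes that $SCRATCH is set in your login environment and that qsub is on your PATH.

```shell
#!/bin/bash
# Location of the example, as given above. EXAMPLE_SRC is a hypothetical variable name.
EXAMPLE_SRC="${EXAMPLE_SRC:-/gpfs1/projects/ccastest/training/examples/RepeatMasker_example}"

# Hypothetical helper: copy the example into $SCRATCH and submit the job.
run_repeatmasker_example() {
    # Stop early if SCRATCH is unset, so nothing is copied to an unintended place.
    if [ -z "$SCRATCH" ]; then
        echo "SCRATCH is not set" >&2
        return 1
    fi
    # Stop early if the scheduler command is unavailable (for example, off the cluster).
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found" >&2
        return 1
    fi
    cp -r "$EXAMPLE_SRC" "$SCRATCH" || return 1
    cd "$SCRATCH/RepeatMasker_example" || return 1
    # Edit repeatmasker_job.pbs before this point if the defaults do not fit.
    qsub repeatmasker_job.pbs
}
```

Call run_repeatmasker_example after sourcing the snippet. Once the job is submitted, "qstat -u $USER" lists your queued and running jobs.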


2. Install Customized RepeatMasker on Thunder

Warning: This part is intended ONLY for those who want to install and test their own version in their HOME directory.

Summary

(a)    Perl 5.0.4 or higher is required. (Perl is available via “module load perl/5.26.1-gcc”.)

(b)    The Perl Text::Soundex module is required. (Installation is described below.)

(c)    One of the supported sequence search engines is required: nhmmer, Cross_Match, ABBlast/WUBlast, RMBlast, or Decypher. (Cross_Match is built in by default. RMBlast is also installed in this tutorial.)

(d)    Tandem Repeats Finder is required. (Installation is described below.)

(e)    The RepBase libraries are recommended. (The RepBase RepeatMasker Edition needs a license, so it is not installed here.)

(f)     Check for Dfam updates. (The update is described below, but it is optional.)

(g)    "-pa" option: Number of processors to use in parallel (only works for batch files or sequences larger than 50 kb).

Details

In the following steps, we assume that you want to install the software in a directory named “SOFTWARE” inside your HOME directory on CCAST's Thunder cluster. “USERNAME” is your username on Thunder.
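
A minimal sketch of setting up that directory without typing your username by hand. It assumes that $HOME resolves to /gpfs1/home/USERNAME on Thunder; SOFTWARE_DIR and make_software_dir are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Assumption: $HOME is /gpfs1/home/USERNAME on Thunder.
# SOFTWARE_DIR is a hypothetical variable; the steps below spell out the full path instead.
SOFTWARE_DIR="${SOFTWARE_DIR:-$HOME/SOFTWARE}"

# Hypothetical helper: create the install directory if needed and print its path.
make_software_dir() {
    mkdir -p "$SOFTWARE_DIR" && echo "$SOFTWARE_DIR"
}
```

For example, cd "$(make_software_dir)" creates the directory and moves into it.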


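(a) Download the Tandem Repeats Finder (TRF) binary (no source code was found)

· Go to the SOFTWARE directory (create it first if it does not exist):

o    "mkdir -p /gpfs1/home/USERNAME/SOFTWARE"

o    "cd /gpfs1/home/USERNAME/SOFTWARE"

· Download:

o    "wget https://tandem.bu.edu/trf/downloads/trf409.linux64"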

· Give executable privilege:

o    "chmod +x trf409.linux64"


(b) Install Perl Text::Soundex module

· Load Perl:

o    "module load perl/5.26.1-gcc"

· Install the Text::Soundex module locally (to uninstall it later if needed: cpanm --uninstall Text::Soundex):

o    "cpanm --local-lib=/gpfs1/home/USERNAME/SOFTWARE/perl5 Text::Soundex"

· Update Perl lib path:

o    "echo 'export PERL5LIB=/gpfs1/home/USERNAME/SOFTWARE/perl5/lib/perl5/x86_64-linux:$PERL5LIB' >> /gpfs1/home/USERNAME/.bashrc"

· Reload the settings: 

o    "source /gpfs1/home/USERNAME/.bashrc"
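
After reloading the settings, it can help to confirm that Perl actually finds the module before moving on. The following is a small sketch under that assumption; perl_module_ok is a hypothetical helper name, and the check relies only on Perl's standard -M and -e switches.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if Perl can load the named module.
perl_module_ok() {
    perl -M"$1" -e 1 2>/dev/null
}

# Report whether Text::Soundex is visible through the current PERL5LIB.
if perl_module_ok Text::Soundex; then
    echo "Text::Soundex is available"
else
    echo "Text::Soundex not found; check PERL5LIB" >&2
fi
```

If the module is reported as missing, check that the PERL5LIB line was appended to your .bashrc and that the perl module is loaded in the current shell.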


(c) Install RMBlast from source (optional) (RMBlast is a modified version of NCBI Blast+)

· Go to the SOFTWARE directory: 

o    "cd /gpfs1/home/USERNAME/SOFTWARE"

· Download NCBI Blast+, unzip and go into the uncompressed directory: 

o    "wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.9.0/ncbi-blast-2.9.0+-src.tar.gz"

o    "tar zxvf ncbi-blast-2.9.0+-src.tar.gz"

o    "cd ncbi-blast-2.9.0+-src"

· Download the RMBlast patch file, unzip it, and apply the patch:

o    "wget http://www.repeatmasker.org/isb-2.9.0+-rmblast.patch.gz"

o    "gunzip isb-2.9.0+-rmblast.patch.gz"

o    "patch -p1 < isb-2.9.0+-rmblast.patch"

· Go to c++ directory and install RMBlast in "/gpfs1/home/USERNAME/SOFTWARE/rmblast_install_here": 

o    "cd c++"

o    "./configure --with-mt --without-debug --without-krb5 --without-openssl --with-projects=scripts/projects/rmblastn/project.lst --prefix=/gpfs1/home/USERNAME/SOFTWARE/rmblast_install_here --without-bdb --without-lmdb --without-boost"

· Build

o    "make"

o    "make install"
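
Once "make install" finishes, a quick check that the rmblastn binary landed under the install prefix can save a failed RepeatMasker configure later. This is a sketch only: check_rmblast_install and RMBLAST_PREFIX are hypothetical names, and the prefix simply mirrors the --prefix value used above.

```shell
#!/bin/bash
# Mirrors the --prefix used above; USERNAME is a placeholder for your username.
RMBLAST_PREFIX="${RMBLAST_PREFIX:-/gpfs1/home/USERNAME/SOFTWARE/rmblast_install_here}"

# Hypothetical helper: succeeds only if an executable rmblastn exists under PREFIX/bin.
check_rmblast_install() {
    local prefix="${1:-$RMBLAST_PREFIX}"
    if [ -x "$prefix/bin/rmblastn" ]; then
        echo "found $prefix/bin/rmblastn"
    else
        echo "rmblastn not found under $prefix/bin" >&2
        return 1
    fi
}
```

A typical use is check_rmblast_install && "$RMBLAST_PREFIX/bin/rmblastn" -version, which prints the version of the build if the binary is in place.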


(d) Install RepeatMasker

· Go back to the SOFTWARE directory, download RepeatMasker, and unpack it (this creates a directory named RepeatMasker):

o    "cd /gpfs1/home/USERNAME/SOFTWARE"

o    "wget http://www.repeatmasker.org/RepeatMasker-open-4-0-9-p2.tar.gz"

o    "tar xzvf RepeatMasker-open-4-0-9-p2.tar.gz"

· Update the Dfam library (file "Dfam.hmm") (optional):

· Go to Libraries directory:

o    "cd RepeatMasker/Libraries"

· Download the Dfam.hmm.gz library:

o    "wget http://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz"

· Unzip it to replace the old one.

o    "gunzip -f Dfam.hmm.gz"

· Run the configure script. Re-run it every time you want to change the search engine or update the Dfam library.

· Go back to the RepeatMasker directory and run the configure script (if you skipped the Dfam update, use "cd RepeatMasker" instead of "cd .."):

o    "cd .."

o    "perl ./configure"

· When it asks for the TRF program, enter the full path:

o    "/gpfs1/home/USERNAME/SOFTWARE/trf409.linux64"

· When it prompts "Add a Search Engine", you can choose the default, Cross_Match. Here we use the RMBlast build installed above.

· Input "2" to use RMBlast.

· When it asks for the RMBlast install location, enter:

o    "/gpfs1/home/USERNAME/SOFTWARE/rmblast_install_here/bin"

· Input "Y" when it prompts "Do you want RMBlast to be your default?"

· When it prompts "Add a Search Engine" again, input "5" (Done) to finish. 


(e) Test RepeatMasker

· Make a test directory and go into it: 

o    "cd /gpfs1/scratch/USERNAME/"

o    "mkdir RepeatMasker_example"

o    "cd RepeatMasker_example"

· Download a FASTA-format DNA sequence file:

o    "wget https://github.com/rmhubley/RepeatMasker/raw/master/t/seqs/fastaformat/simple-good-large.fa"

· Create the job script repeatmasker_job.pbs (listed below) in this directory, then submit the job:

o    "qsub repeatmasker_job.pbs"


------------------------------------------- file repeatmasker_job.pbs -------------------------------------------

#!/bin/bash

#PBS -q default

#PBS -N test

##RepeatMasker does not run across multiple nodes, so keep select=1

##change mem, ncpus, and walltime as needed:

#PBS -l select=1:mem=10gb:ncpus=4

#PBS -l walltime=02:00:00

##change "x-ccast-prj" to "x-ccast-prj-[your project group name here]"

#PBS -W group_list=x-ccast-prj

 

cd $PBS_O_WORKDIR

module load perl/5.26.1-gcc

 

# Set path to your RepeatMasker binaries

export MY_REPEATMASKER=/gpfs1/home/USERNAME/SOFTWARE/RepeatMasker

 

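# Create the output directory first (harmless if RepeatMasker would create it itself)

mkdir -p OUTPUT_DIR

$MY_REPEATMASKER/RepeatMasker -pa $NCPUS -dir OUTPUT_DIR simple-good-large.fa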

 

exit 0
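
Once the job finishes, the results are written under the directory passed to -dir. The sketch below checks for them, assuming the usual RepeatMasker output suffixes: .out for the annotation table, .tbl for the summary, and .masked for the masked sequence, which is only written when repeats are found. The helper name list_repeatmasker_outputs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report which expected RepeatMasker output files exist.
# $1 = output directory passed to -dir, $2 = name of the input FASTA file.
list_repeatmasker_outputs() {
    local outdir="$1" input="$2" ext
    for ext in out tbl masked; do
        if [ -f "$outdir/$input.$ext" ]; then
            echo "present: $outdir/$input.$ext"
        else
            echo "absent: $outdir/$input.$ext"
        fi
    done
}
```

For the job script above, the call would be list_repeatmasker_outputs OUTPUT_DIR simple-good-large.fa, run from the directory the job was submitted from.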


Keywords: ccast, hpc, thunder, bioinformatics, repeatmasker
Doc ID: 108078
Owner: Liu Y.
Group: IT Knowledge Base
Created: 2020-12-24 10:23 CST
Updated: 2020-12-29 01:12 CST
Sites: IT Knowledge Base
CleanURL: https://kb.ndsu.edu/repeatmasker