Bioinformatics - SortMeRNA

Instructions on how to run (and, if needed, install a customized version of) SortMeRNA

SortMeRNA is a tool for filtering, mapping, and OTU-picking of NGS reads in metatranscriptomic and metagenomic data.

  1. Running SortMeRNA on Thunder
  2. Install Customized SortMeRNA on Thunder

Please refer to the CCAST User Guide and the article Running Bioinformatics Software on HPC Clusters for general information about using CCAST resources and running bioinformatics software on CCAST's HPC clusters.

1. Running SortMeRNA on Thunder


Example: Map transcriptomic data to references


Location: /gpfs1/projects/ccastest/training/examples/SortMeRNA_example


File list

· sortmerna_job.pbs: job submission script  

· set5_simulated_amplicon_silva_bac_16s.fasta: simulated RNA sequences in fasta

· rRNA_databases (directory): a collection of RNA sequence sets


Steps

· Copy example directory to your SCRATCH directory

o    cp -r /gpfs1/projects/ccastest/training/examples/SortMeRNA_example $SCRATCH

· Go to the copied directory

o    cd  $SCRATCH/SortMeRNA_example

· Edit the job submission script as needed, then submit the job

o    qsub sortmerna_job.pbs
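
The copy step above can be wrapped in a small guard so that it fails loudly when $SCRATCH is unset or when the example has already been copied. This is only a sketch: stage_example is a hypothetical helper name, not a command provided by CCAST, and the commented usage line assumes the paths given above.

```shell
#!/bin/bash
# stage_example SRC DEST_ROOT: copy the directory SRC into DEST_ROOT and
# print the path of the copy. (Hypothetical helper, not a CCAST command.)
stage_example() {
    local src="$1" dest_root="$2"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    if [ -z "$dest_root" ] || [ ! -d "$dest_root" ]; then
        echo "destination is unset or not a directory: $dest_root" >&2
        return 1
    fi
    local dest
    dest="$dest_root/$(basename "$src")"
    if [ -e "$dest" ]; then
        echo "already exists, not overwriting: $dest" >&2
        return 1
    fi
    cp -r "$src" "$dest_root"/ && echo "$dest"
}

# On Thunder (left commented so that sourcing this sketch has no side effects):
# dest=$(stage_example /gpfs1/projects/ccastest/training/examples/SortMeRNA_example "$SCRATCH") \
#     && cd "$dest" && qsub sortmerna_job.pbs
```

Refusing to overwrite an existing copy keeps any edits already made to the job script from being lost.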


2. Install Customized SortMeRNA on Thunder

Warning: This part is intended ONLY for those who want to install and test their own version in their HOME directory.

Summary

(a) ZLib (system installed; check with the command "ldconfig -p | grep libz\.so");

(b) RocksDB (install described below);

(c) RapidJson (install described below);

(d) cmake-3.13+ (available via “module load cmake/3.14.5-gcc”);

(e) GCC supporting C++ 14 (available via “module load gcc/7.3.0-gcc”);

(f) git (system installed).

When building RocksDB and SortMeRNA, the default system git (1.8.3.1) causes an "Unknown option: -C" error during the build. This appears to affect only a version check, so the error can simply be ignored. Alternatively, comment out the relevant code in the CMakeLists.txt file in the top directory of each source tree, as shown here:

------------------------------------------Comment git in CMakeLists.txt------------------------------------------------

# Prepare Version information

#

#if(GIT_FOUND AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/.git")

#  if(WIN32)

#    execute_process(COMMAND $ENV{COMSPEC} /C ${GIT_EXECUTABLE} -C ${CMAKE_CURRENT_SOURCE_DIR} rev-parse HEAD OUTPUT_VARIABLE GIT_SHA)

#  else()

#    execute_process(COMMAND ${GIT_EXECUTABLE} -C ${CMAKE_CURRENT_SOURCE_DIR} rev-parse HEAD OUTPUT_VARIABLE GIT_SHA)

#  endif()

#else()

  set(GIT_SHA 0)

#endif()
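
The prerequisites in the summary can be checked in one pass before starting the build. The following is a rough sketch, not an official CCAST script: the helper names (version_ge, first_version, report_tool) are made up for illustration, the GCC threshold of 5.0 is my assumption for "supports C++14", and the git threshold of 1.8.5 reflects my understanding of when the -C option was introduced, which would explain the error seen with 1.8.3.1.

```shell
#!/bin/bash
# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# first_version: print the first dotted version number found on stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# report_tool NAME MIN_VERSION FOUND_VERSION: print a one-line verdict.
report_tool() {
    local name="$1" min="$2" found="$3"
    if [ -z "$found" ]; then
        echo "$name: not found"
    elif version_ge "$found" "$min"; then
        echo "$name: $found (ok, need >= $min)"
    else
        echo "$name: $found (too old, need >= $min)"
    fi
}

# ZLib check, same command as in item (a) of the summary.
if ldconfig -p 2>/dev/null | grep -q 'libz\.so'; then
    echo "zlib: found"
else
    echo "zlib: not found via ldconfig"
fi

report_tool cmake 3.13 "$(cmake --version 2>/dev/null | first_version)"
report_tool gcc 5.0 "$(gcc --version 2>/dev/null | first_version)"    # assumed C++14 minimum
report_tool git 1.8.5 "$(git --version 2>/dev/null | first_version)"  # assumed minimum for "git -C"
```

Run it after loading the cmake and gcc modules. A "too old" line for git only means that the version check discussed above will print its error.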

Details

In the following steps, we assume that you want to install the software in a directory named “SOFTWARE” inside your HOME directory on CCAST’s Thunder cluster. “USERNAME” is your username on Thunder.


(a) Load gcc 7.3.0

"module load gcc/7.3.0-gcc"

(b) Install RocksDB from source

· Download RocksDB:

o    "git clone https://github.com/facebook/rocksdb.git"

· Build using cmake (pushd and popd make it easy to return to the starting directory):

o    "mkdir -p rocksdb/build/Release"

o    "pushd rocksdb/build/Release"

o    "cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/gpfs1/home/USERNAME/SOFTWARE/rocksdb_install_here -DWITH_ZLIB=1 -DWITH_GFLAGS=0 -DPORTABLE=1 -DWITH_TESTS=0 -DWITH_TOOLS=0 ../.."

o    "cmake --build ."

o    "cmake --build . --target install"

o    "popd"

(c) Install RapidJSON from source

(SortMeRNA only needs 'include')

· Download RapidJSON:

o    "git clone https://github.com/Tencent/rapidjson.git"

· Build using cmake:

o    "mkdir -p rapidjson/build/Release"

o    "pushd rapidjson/build/Release"

o    "cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/gpfs1/home/USERNAME/SOFTWARE/rapidjson_install_here -DRAPIDJSON_BUILD_EXAMPLES=OFF -DRAPIDJSON_BUILD_DOC=OFF ../.."

o    "cmake --build ."

o    "cmake --build . --target install"

o    "popd"
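
The RocksDB and RapidJSON builds above follow the same configure, build, install pattern, so they can be expressed with one helper. This is a sketch under stated assumptions: run and cmake_install are hypothetical names, the SOFTWARE variable is introduced here for convenience, and the helper uses cmake's -S and -B options (available in CMake 3.13 and later, which matches the prerequisite) instead of pushd/popd. With DRY_RUN=1 it only prints the commands.

```shell
#!/bin/bash
# run CMD...: execute the command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# cmake_install SRC_DIR PREFIX [extra cmake options...]
# Configure in SRC_DIR/build/Release, build, then install into PREFIX.
cmake_install() {
    local src="$1" prefix="$2"
    shift 2
    local build="$src/build/Release"
    run mkdir -p "$build" || return 1
    run cmake -G "Unix Makefiles" -S "$src" -B "$build" \
        -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$prefix" "$@" || return 1
    run cmake --build "$build" || return 1
    run cmake --build "$build" --target install
}

# Assumed layout from this article; replace USERNAME with your username.
SOFTWARE="${SOFTWARE:-/gpfs1/home/USERNAME/SOFTWARE}"

# Dry run: show what would be executed for the two dependencies.
DRY_RUN=1 cmake_install "$SOFTWARE/rocksdb" "$SOFTWARE/rocksdb_install_here" \
    -DWITH_ZLIB=1 -DWITH_GFLAGS=0 -DPORTABLE=1 -DWITH_TESTS=0 -DWITH_TOOLS=0
DRY_RUN=1 cmake_install "$SOFTWARE/rapidjson" "$SOFTWARE/rapidjson_install_here" \
    -DRAPIDJSON_BUILD_EXAMPLES=OFF -DRAPIDJSON_BUILD_DOC=OFF
```

Drop the DRY_RUN=1 prefix to perform the builds. Note that the printed lines are for review only: arguments containing spaces, such as the generator name, are shown without their quotes.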

(d) Install SortMeRNA from source

· Download SortMeRNA:

o    "git clone https://github.com/biocore/sortmerna.git"

· Build using cmake:

· The CMakeLists.txt file in the sortmerna directory forces the install prefix to its own "dist" subdirectory, which prevents installing to other locations. Comment out that line as shown in the following excerpt. (I don't know why the author added this line and have asked about it on the project's GitHub site.)

------------------------------Comment "set..." in "sortmerna/CMakeLists.txt"----------------------------------------

# Installation and packaging

#set(CMAKE_INSTALL_PREFIX ${CMAKE_CURRENT_SOURCE_DIR}/dist CACHE PATH "Install path prefix, prepended onto install directories." FORCE)

 

o    "mkdir -p sortmerna/build/Release"

o    "pushd sortmerna/build/Release"

o    "cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/gpfs1/home/USERNAME/SOFTWARE/sortmerna_install_here -DCPACK_BINARY_TGZ=ON -DROCKSDB_HOME=/gpfs1/home/USERNAME/SOFTWARE/rocksdb_install_here -DROCKSDB_SRC=/gpfs1/home/USERNAME/SOFTWARE/rocksdb -DRAPIDJSON_HOME=/gpfs1/home/USERNAME/SOFTWARE/rapidjson_install_here -DEXTRA_CXX_FLAGS_RELEASE="-lrt" ../.."

o    "cmake --build ."

o    "cmake --build . --target install"

o    "cmake --build . --target package"

o    "popd"
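
After the install step it is worth confirming that the two commands used later in this article, sortmerna and indexdb, actually landed in the bin directory of the install prefix. A minimal sketch follows; missing_binaries and SORTMERNA_BIN are hypothetical names introduced here, and the default path assumes the install prefix used above.

```shell
#!/bin/bash
# missing_binaries BIN_DIR NAME...: print each NAME that is not an
# executable file in BIN_DIR; return non-zero if any is missing.
missing_binaries() {
    local bin_dir="$1" status=0 name
    shift
    for name in "$@"; do
        if [ ! -x "$bin_dir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}

# Assumed install prefix from this article; replace USERNAME with your username.
SORTMERNA_BIN="${SORTMERNA_BIN:-/gpfs1/home/USERNAME/SOFTWARE/sortmerna_install_here/bin}"

if missing=$(missing_binaries "$SORTMERNA_BIN" sortmerna indexdb); then
    echo "sortmerna and indexdb found in $SORTMERNA_BIN"
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing from $SORTMERNA_BIN:" $missing
fi
```

If either name is reported missing, the job scripts later in this article will fail when they add this directory to PATH and call the commands.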

(e) Run the provided Integration Test

An integration test written in Python is provided in the source code directory "sortmerna/tests". The latest version of SortMeRNA needs Python 3.5 (or higher) and the Python package scikit-bio to run this test. I installed the Anaconda Python platform to make package management easier in the future. Anaconda is popular in scientific computing and installs many common scientific packages such as numpy. Miniconda can be installed instead if the bundled packages are not needed.

· Download Anaconda:

o    "wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh"

· Install Anaconda: 

o    "chmod +x Anaconda3-2019.03-Linux-x86_64.sh"

o    "./Anaconda3-2019.03-Linux-x86_64.sh"

· Type “yes” to accept the license.

· Specify the install location as "/gpfs1/home/USERNAME/SOFTWARE/anaconda3".

· Type “no” when asked whether to run conda init.

· Install scikit-bio using conda

o    conda install -c https://conda.anaconda.org/biocore scikit-bio

· Run the provided tests: 

o    Two tests written with Python unittest are provided. The file "test_sortmerna.py" is for normal data and the file "test_sortmerna_zlib.py" is for compressed data.

o    Note that the sortmerna commands generate a directory "~/kvdb". It is in the user's home directory by default unless you change the test files. It needs to be emptied before testing again; otherwise an error is raised.

o    These two test files use two different, incompatible versions of the scikit-bio package.

· Create a working directory in scratch, then go to the provided test directory:

o    "cd /gpfs1/scratch/USERNAME"

o    "mkdir sortmernatest"

o    "cd /gpfs1/home/USERNAME/SOFTWARE/sortmerna/tests"

· Write and submit the test job for normal data ("test_sortmerna.py"):

o    "qsub sortmerna_selftest1_job.pbs"

--------------- sortmerna_selftest1_job.pbs -----------------

#!/bin/bash

#PBS -q default

#PBS -N SortMeRNA_test1

#PBS -l select=1:mem=10gb:ncpus=4

#PBS -l walltime=1:00:00

## Replace “x-ccast-prj” with “x-ccast-prj-[your project group name here]”

#PBS -W group_list=x-ccast-prj

cd $PBS_O_WORKDIR

# Add your SortMeRNA binary directory to path

export PATH=$PATH:/gpfs1/home/USERNAME/SOFTWARE/sortmerna_install_here/bin

python test_sortmerna.py

exit 0
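
Because the tests leave data in "~/kvdb" and fail on a rerun unless that directory is empty, a small cleanup step before each rerun is convenient. The following is a sketch only: reset_kvdb is a hypothetical helper, and the commented usage line assumes the default "~/kvdb" location mentioned above.

```shell
#!/bin/bash
# reset_kvdb DIR: empty DIR (creating it if needed) so a rerun starts clean.
reset_kvdb() {
    local dir="$1"
    # Guard against clearing the wrong place by accident.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "refusing to clear an empty or root path" >&2
        return 1
    fi
    mkdir -p "$dir" || return 1
    # Remove everything inside DIR, including hidden entries, but keep DIR itself.
    find "$dir" -mindepth 1 -delete
}

# Before rerunning a test (left commented so this sketch has no side effects):
# reset_kvdb "$HOME/kvdb"
```

The same call could be placed in the job script just before the python command.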


· Write and submit the test job for compressed data ("test_sortmerna_zlib.py"):

· For this test, the provided Python code has two issues.

o    It uses an outdated version of the scikit-bio package that is not compatible with the one used in the previous test. The incompatible function is "skbio.parse.sequences.parse_fasta". The simplest fix is to create a minimal stand-in package that mimics it. Use the following commands to create the mimic package:

o    "mkdir skbio"

o    "cd skbio"

o    "touch __init__.py"

o    "mkdir parse"

o    "cd parse"

cat > sequences.py <<EOF

def parse_fasta(f):

    while True:

        mark = f.readline(1)

        if not mark: break

        if mark == '>': yield ':)'

EOF

o    "cd ../.."


This produces the following file structure: skbio/__init__.py and skbio/parse/sequences.py.

o    I think the test case has a wrong assertion value, possibly because the test data was updated. The "272" in "test_sortmerna_zlib.py" should be changed to "264". This can be done with the command:

o    "sed -i 's/272/264/g' test_sortmerna_zlib.py"
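
Since the sed command above rewrites every occurrence of the digits 272 in the file, it can be safer to review the matching lines first and keep a backup. The sketch below shows one way to do that; replace_number is a hypothetical helper, and it only replaces 272 where it stands alone as a number (not inside a longer number).

```shell
#!/bin/bash
# replace_number FILE OLD NEW: replace OLD with NEW only where OLD is not
# part of a longer number, keeping a backup copy in FILE.bak.
replace_number() {
    local file="$1" old="$2" new="$3"
    # Print the lines that will be changed so they can be reviewed.
    grep -nE "(^|[^0-9])${old}([^0-9]|\$)" "$file" || {
        echo "no standalone $old found in $file" >&2
        return 1
    }
    # \1 and \2 put back the non-digit characters around the match.
    sed -E -i.bak "s/(^|[^0-9])${old}([^0-9]|\$)/\\1${new}\\2/g" "$file"
}

# In the tests directory (left commented so this sketch has no side effects):
# replace_number test_sortmerna_zlib.py 272 264
```

The original file remains available as test_sortmerna_zlib.py.bak if the change needs to be undone.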

· Finally, write and submit the job:

o    "qsub sortmerna_selftest2_job.pbs"

--------------- sortmerna_selftest2_job.pbs -----------------

#!/bin/bash

#PBS -q default

#PBS -N SortMeRNA_test2

#PBS -l select=1:mem=10gb:ncpus=4

#PBS -l walltime=1:00:00

## Replace “x-ccast-prj” with “x-ccast-prj-[your project group name here]”

#PBS -W group_list=x-ccast-prj

cd $PBS_O_WORKDIR

# Add your SortMeRNA binary directory to path

export PATH=$PATH:/gpfs1/home/USERNAME/SOFTWARE/sortmerna_install_here/bin

python test_sortmerna_zlib.py

exit 0


(f) Run a simple test example

The user must first index the reference database with the command indexdb and then filter/map reads against the database with the command sortmerna. These commands cannot create directories for you, so any directory named in the arguments must be created beforehand.

· Go to the scratch directory:

o    "cd /gpfs1/scratch/USERNAME"

· Make and go into test directory: 

o    "mkdir -p SortMeRNA_example/rRNA_databases"

o    "cd SortMeRNA_example/rRNA_databases"

· Copy the data:

o    "cp /gpfs1/home/USERNAME/SOFTWARE/sortmerna/rRNA_databases/*.fasta ."

o    "cd .."

o    "cp /gpfs1/home/USERNAME/SOFTWARE/sortmerna/tests/data/set5_simulated_amplicon_silva_bac_16s.fasta ."

· Write and submit the job:

· Make sure to make directories for the generated index data, temporary data and key-value data.

o    "qsub sortmerna_job.pbs"

-------------------------------------------------------- sortmerna_job.pbs------------------------------------------

#!/bin/bash

#PBS -q default

#PBS -N SortMeRNA_test

#PBS -l select=1:mem=8gb:ncpus=4

#PBS -l walltime=4:00:00

## Replace “x-ccast-prj” with “x-ccast-prj-[your project group name here]”

#PBS -W group_list=x-ccast-prj

cd $PBS_O_WORKDIR

# Add your SortMeRNA binary directory to path

export PATH=$PATH:/gpfs1/home/USERNAME/SOFTWARE/sortmerna_install_here/bin

#create directories for generated data

mkdir index

mkdir tmp

mkdir kvdb

#index the reference rRNA database

indexdb --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\

./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\

./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\

./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\

./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\

./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\

./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\

./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db \

--tmpdir "${PBS_O_WORKDIR}/tmp"

#filter the rRNA reads

sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\

./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\

./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\

./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\

./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\

./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\

./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\

./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db \

--reads set5_simulated_amplicon_silva_bac_16s.fasta --aligned set5_aligned \

--other set5_other -d "${PBS_O_WORKDIR}/kvdb" --log --otu_map --de_novo_otu \

--blast "1 cigar qcov" -v --fastx -a $NCPUS

exit 0
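
Once the job has finished, a quick sanity check is to count the reads in the input and in the two output files. This is a sketch under an assumption: I expect the options --aligned set5_aligned, --other set5_other and --fastx to produce files named set5_aligned.fasta and set5_other.fasta, so adjust the names if your version writes something else. The helper names are hypothetical.

```shell
#!/bin/bash
# count_fasta_records FILE: number of FASTA header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# summarize_run FILE...: print a read count for each file, or note that it is missing.
summarize_run() {
    local f
    for f in "$@"; do
        if [ -f "$f" ]; then
            echo "$f: $(count_fasta_records "$f") reads"
        else
            echo "$f: not found"
        fi
    done
}

# In the job directory (output file names are assumptions, see the note above):
# summarize_run set5_simulated_amplicon_silva_bac_16s.fasta set5_aligned.fasta set5_other.fasta
```

For FASTA input, the aligned and other counts would be expected to add up to the number of input reads.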

 

Keywords: ccast, hpc, thunder, bioinformatics, sortmerna
Doc ID: 108082
Owner: Liu Y.
Group: IT Knowledge Base
Created: 2020-12-26 10:05 CST
Updated: 2020-12-29 01:10 CST
Sites: IT Knowledge Base
CleanURL: https://kb.ndsu.edu/sortmerna