3. Dependencies

This manual helps you install the programs and dependencies used by the Gene-regulation library.

Some of them are mandatory, and some are optional, depending on the Snakemake workflows you need to run.

The instructions were tested under Ubuntu 14.04.

3.1. Manual installation

This manual is organized in sections, so you can cherry-pick the programs you want to install manually. For “all inclusive” solutions, please refer to the following sections (3.2 Makefile and 3.3 Conda).

3.1.1. General

3.1.1.1. Generic tools

3.1.1.1.1. nano

Nano is a simple command-line text editor.

sudo apt-get install nano

3.1.1.1.2. rsync

Rsync is an open source utility that provides fast incremental file transfer.

sudo apt-get install rsync

3.1.1.1.3. git

Git is a version control system (VCS) for tracking changes in computer files and coordinating work on those files among multiple people.

sudo apt-get install git

Optional:

  • Create an account on GitHub.
  • Add your ssh public key to your GitHub account settings (account > settings > SSH keys > add SSH key).

less ~/.ssh/id_rsa.pub

3.1.1.1.4. zlib

Unix package required by several tools, including Sickle and Bamtools.

sudo apt-get install libz-dev

3.1.1.1.5. Java

Java is required by several tools using GUIs, such as FastQC or IGV.

Java 9 seems to cause issues with IGV, so we use Java 8 here.

echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer

Check installation:

java -version
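
Since the note above concerns the major version (8 versus 9), it can help to extract it from the version string. The sketch below is only illustrative: java_major and installed_java_version are hypothetical helper names, not part of Gene-regulation, and the parsing assumes the usual "1.8.0_131" format (Java 8 and earlier) and "9.0.1" format (Java 9 and later).

```shell
# Hypothetical helpers (not part of Gene-regulation).

# Print the version string reported by `java -version`,
# which writes its output to stderr.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

# Print the major version for a given version string.
# Assumes "1.x.y_z" up to Java 8 and "x.y.z" from Java 9 onwards.
java_major() {
    case "$1" in
        1.*) printf '%s\n' "$1" | cut -d. -f2 ;;
        *)   printf '%s\n' "$1" | cut -d. -f1 ;;
    esac
}
```

For example, java_major "$(installed_java_version)" is expected to print 8 when the Java 8 installed above is the first java on the PATH.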

3.1.1.2. Create bin/ and app_sources/ (optional)

While some programs will be installed completely automatically, others will not. Here we create two directories: app_sources/ for downloaded sources and bin/ for manually installed executables.

mkdir -p $HOME/bin
mkdir -p $HOME/app_sources

You might then have to edit your $PATH manually (see next section).

3.1.1.3. Edit $PATH

To run manually installed programs by name, you may have to add their location to your $PATH environment variable. You can do so by editing the ~/.profile file.

nano ~/.profile

Find this paragraph (add it if it is missing); it adds the directory of manually installed executables to the path:

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi

Source the file to apply the change.

source ~/.profile
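
Note that the paragraph above prepends $HOME/bin again each time the file is sourced in the same session. A minimal sketch of an idempotent alternative is given below; path_prepend is a hypothetical helper name introduced here for illustration, not something provided by Ubuntu or Gene-regulation.

```shell
# Hypothetical helper: prepend a directory to PATH only if it exists
# and is not already listed, so that re-sourcing the file is harmless.
path_prepend() {
    [ -d "$1" ] || return 0
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

path_prepend "$HOME/bin"
```

The case pattern wraps both PATH and the directory in colons so that only whole entries match, not substrings of other entries.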

3.1.1.4. Graphviz

Snakemake can generate useful graphviz outputs.

sudo apt-get install graphviz

3.1.2. Python

Snakemake requires Python 3.3 or later. You can check which versions are installed by issuing the following commands in a terminal:

python --version
python3 --version
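
The version check can also be scripted. The following is a sketch under stated assumptions: version_at_least is a hypothetical helper name, and the output of python3 --version is assumed to have the form "Python 3.4.3". Major and minor numbers are compared numerically, so that 3.10 is correctly seen as newer than 3.3.

```shell
# Hypothetical helper: succeed if version $1 is at least MAJOR.MINOR ($2.$3).
version_at_least() {
    major=$(printf '%s' "$1" | cut -d. -f1)
    minor=$(printf '%s' "$1" | cut -d. -f2)
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# `python3 --version` prints e.g. "Python 3.4.3"; keep the second word.
# Older Python 3 releases print it on stderr, hence the 2>&1.
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 --version 2>&1 | cut -d' ' -f2)
    if version_at_least "$py_version" 3 3; then
        echo "python3 $py_version is recent enough for Snakemake"
    else
        echo "python3 $py_version is older than 3.3"
    fi
else
    echo "python3 not found"
fi
```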

If Python 3 is not installed, install it.

sudo apt-get install python3

Install the Python package managers and development libraries.

sudo apt-get install python-dev
sudo apt-get install python3.4-dev
sudo apt-get install python-pip
sudo apt-get install python3-pip

3.1.2.1. Pandas library

Python Data Analysis Library is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

This library is used in order to read tab-delimited files used in the workflows (see files samples.tab and design.tab).

pip3 install pandas

3.1.2.2. Package rpy2

The package rpy2 allows access to R from within Python code.

sudo apt-get install python-matplotlib
pip3 install "rpy2<2.3.10"

3.1.3. R

You can fetch a CRAN mirror here.

sudo sh -c "echo 'deb <your mirror> trusty/' >> /etc/apt/sources.list"                          ## Repository for Ubuntu 14.04 Trusty Tahr
#sudo sh -c "echo 'deb http://ftp.igh.cnrs.fr/pub/CRAN/ trusty/' >> /etc/apt/sources.list"      ## Mirror in Montpellier, France
sudo apt-get -y update
sudo apt-get -y install r-base r-base-dev libcurl4-openssl-dev libxml2-dev
echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" >> ~/.Rprofile
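
The echo ... >> redirection above appends a new line every time it is run, so repeating the installation leaves duplicate entries in ~/.Rprofile. A minimal sketch of an append-only-once helper follows; append_once is a hypothetical name introduced for illustration.

```shell
# Hypothetical helper: append a line to a file only if the file does not
# already contain exactly that line.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x: match whole lines, -F: fixed string, -e: protect leading dashes
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

It could be called with the R options line shown above as first argument and "$HOME/.Rprofile" as second argument. The same idea applies to /etc/apt/sources.list, but that file is only writable by root.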

Check installation:

R --version

3.1.4. Snakemake

“Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and output filenames.”

(Köster and Rahman, 2012)

NB: Python 3 and pip3 are required (see section 3.1.2).

pip3 install snakemake

You can check that snakemake works properly with this basic script.

"""Snakefile to test basic functions of snakemake.
"""
rule all:
    input: expand("bye.txt")

rule hello:
    """Write HELLO in a text file named hello.txt.
    """
    output: "hello.txt"
    message: "Generating {output} file."
    shell: "echo HELLO > {output}"

rule bye:
    """Write BYE in a text file named bye.txt.
    """
    input: "hello.txt"
    output: "bye.txt"
    message: "Generating {output} file."
    shell: "echo BYE > {output}"

Save the script as $HOME/hello.py:

touch $HOME/hello.py
nano $HOME/hello.py             ## copy/paste script above and save

Execute the workflow; two files should be created: hello.txt and bye.txt.

cd ; snakemake -s hello.py
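
To confirm that the test workflow did what is described above, the two output files can be checked from the shell. This is only a sketch: check_hello_workflow is a hypothetical helper, and it assumes the workflow was run in the directory given as argument (the home directory in the command above).

```shell
# Hypothetical helper: verify the two files produced by the test workflow.
# $1 is the directory where snakemake was run (defaults to $HOME).
check_hello_workflow() {
    dir=${1:-$HOME}
    [ -f "$dir/hello.txt" ] && [ "$(cat "$dir/hello.txt")" = "HELLO" ] || {
        echo "hello.txt is missing or has unexpected content" >&2
        return 1
    }
    [ -f "$dir/bye.txt" ] && [ "$(cat "$dir/bye.txt")" = "BYE" ] || {
        echo "bye.txt is missing or has unexpected content" >&2
        return 1
    }
    echo "test workflow outputs look fine"
}
```

Usage: check_hello_workflow "$HOME"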

In case you need to upgrade snakemake:

pip3 install snakemake --upgrade

If you want to use Snakemake’s report function (optional):

pip3 install docutils

3.1.5. Quality control

3.1.5.1. FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application

FastQC is available from Linux repositories:

sudo apt-get install fastqc

However, the packaged version is older and can cause dependency problems.

We recommend installing it manually:

cd $HOME/app_sources
wget --no-clobber http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip -o fastqc_v0.11.5.zip
chmod +x FastQC/fastqc
ln -s -f $HOME/app_sources/FastQC/fastqc $HOME/bin/fastqc

NB: FastQC requires Java, even for command-line use. See the Java section (3.1.1.1.5) to install it.

Check installation:

fastqc --version

3.1.5.2. MultiQC

MultiQC searches a given directory for analysis logs and compiles an HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.

sudo pip install multiqc

NB: depending on the installed versions, the following error can appear:

Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/matplotlib
Storing debug log for failure in /home/gr/.pip/pip.log

If so, it can be avoided by installing the Ubuntu dependencies, then reinstalling multiqc:

sudo apt-get install libfreetype6-dev python-matplotlib
sudo pip install multiqc

Check installation:

multiqc --version

3.1.6. Trimming

The quality of the reads generated by high-throughput sequencing technologies tends to decrease at their ends. Trimming consists of cutting off these ends, which improves the quality of the reads before mapping.

3.1.6.1. Sickle

Sickle is a trimming tool that improves the quality of NGS reads.

Sickle uses sliding windows to compute sequencing quality along the reads. When the quality falls below a chosen q-value threshold, the read is cut. If the remaining read is too short, it is removed completely. Sickle takes into account three different types of read quality: Illumina, Solexa, Sanger.
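
The sliding-window idea described above can be illustrated with a deliberately simplified sketch. It is not Sickle's actual implementation: it only trims the 3' end, works on already decoded quality scores, and the function name trim_length and its argument order are assumptions made for this example.

```shell
# Simplified illustration of sliding-window trimming (NOT Sickle's code).
# Usage: trim_length WINDOW THRESHOLD MIN_LEN Q1 Q2 ... Qn
# Prints the number of bases to keep, or 0 if the read would be discarded.
trim_length() {
    window=$1
    threshold=$2
    min_len=$3
    shift 3
    printf '%s\n' "$*" | awk -v w="$window" -v t="$threshold" -v m="$min_len" '{
        keep = NF                              # by default keep the whole read
        for (i = 1; i + w - 1 <= NF; i++) {
            sum = 0
            for (j = i; j < i + w; j++) sum += $j
            # mean quality of the window below threshold: cut at window start
            if (sum < t * w) { keep = i - 1; break }
        }
        if (keep < m) keep = 0                 # remaining read too short
        print keep
    }'
}

# Window of 3, threshold 20, minimum length 4, eight quality scores.
trim_length 3 20 4 30 30 30 30 30 10 10 10
```

In this example the first window whose mean quality falls below 20 starts at the fifth base, so four bases are kept and the command prints 4. With a minimum length of 5 the same read would be discarded and the command would print 0.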

  • Pre-requisite: install zlib (see section 3.1.1.1.4).
  • Clone the git repository into app_sources (see section 3.1.1.2), run make, and copy the executable to your bin.

cd $HOME/app_sources
git clone https://github.com/najoshi/sickle.git
cd sickle
make
cp sickle $HOME/bin

Check installation:

sickle --version

3.1.6.2. Cutadapt

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

pip install --user --upgrade cutadapt
mv $HOME/.local/bin/cutadapt $HOME/bin

Check installation:

cutadapt --version

3.1.6.3. TrimGalore

In our workflows we use TrimGalore, a wrapper around Cutadapt and FastQC. Install it if you want to run the Cutadapt-based trimming steps of the workflows.

cutadapt --version                              # Check that cutadapt is installed
fastqc -v                                       # Check that FastQC is installed

cd $HOME/app_sources
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.4.3.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz
mv TrimGalore-0.4.3/trim_galore $HOME/bin

Check installation:

trim_galore --version

3.1.6.4. BBDuk

cd $HOME/app_sources
wget https://sourceforge.net/projects/bbmap/files/BBMap_37.31.tar.gz
tar xvzf BBMap_37.31.tar.gz
cp `find bbmap/ -maxdepth 1 -executable -type f` $HOME/bin

3.1.7. Alignment/mapping

The point of mapping is to place the reads obtained from the sequencing step back onto a reference genome. When a read is long enough, it can be mapped onto the genome with fairly good confidence, tolerating a certain number of so-called mismatches. However, genomes can contain repeated regions that are harder to deal with.

We call “sequencing depth” the average number of reads mapped at each position of the genome. The bigger the sequencing depth, the better the quality of the alignment, and the better the downstream analyses.
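
As a rough back-of-the-envelope estimate, the mean depth can be approximated as (number of mapped reads × read length) / genome size. The sketch below applies this formula; the function name mean_depth and the example figures (10 million reads of 50 bp, a 12 Mb genome) are illustrative assumptions, not values taken from the workflows.

```shell
# Hypothetical helper: approximate mean sequencing depth.
# depth = mapped_reads * read_length / genome_size
mean_depth() {
    LC_ALL=C awk -v n="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * l / g }'
}

# 10 million mapped reads of 50 bp on a 12 Mb genome.
mean_depth 10000000 50 12000000
```

With these figures the command prints 41.67, that is, each position is covered by about 42 reads on average. The actual depth varies along the genome, in particular in repeated regions.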

3.1.7.1. BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It is designed for short-read alignment.

Li H. and Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.

sudo apt-get install bwa

Check installation:

bwa

3.1.7.2. Bowtie

Bowtie performs ungapped alignment, and is therefore not suitable for certain types of data, like RNA-seq data.

cd $HOME/app_sources
wget --no-clobber http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.1.1/bowtie-1.1.1-linux-x86_64.zip
unzip bowtie-1.1.1-linux-x86_64.zip
cp `find bowtie-1.1.1/ -maxdepth 1 -executable -type f` $HOME/bin

Check installation:

bowtie --help

3.1.7.3. Bowtie2

“Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie 2 outputs alignments in SAM format, enabling interoperation with a large number of other tools (e.g. SAMtools, GATK) that use SAM. Bowtie 2 is distributed under the GPLv3 license, and it runs on the command line under Windows, Mac OS X and Linux.”

General documentation

Instructions

Downloads

Reference:

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25. DOI: 10.1186/gb-2009-10-3-r25

cd $HOME/app_sources
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
unzip bowtie2-2.2.6-linux-x86_64.zip
cp `find bowtie2-2.2.6/ -maxdepth 1 -executable -type f` $HOME/bin

Check installation:

bowtie2 --version

3.1.7.4. Subread-align

The Subread package comprises a suite of software programs for processing next-gen sequencing read data including:

  • Subread: a general-purpose read aligner which can align both genomic DNA-seq and RNA-seq reads. It can also be used to discover genomic mutations including short indels and structural variants.
  • Subjunc: a read aligner developed for aligning RNA-seq reads and for the detection of exon-exon junctions. Gene fusion events can be detected as well.
  • featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins.
  • exactSNP: a SNP caller that discovers SNPs by testing signals against local background noises.

Reference:

Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013

cd $HOME/app_sources
wget -nc https://sourceforge.net/projects/subread/files/subread-1.5.2/subread-1.5.2-source.tar.gz
tar zxvf subread-1.5.2-source.tar.gz
cd subread-1.5.2-source/src
make -f Makefile.Linux
cd ../bin
cp `find * -executable -type f` $HOME/bin

Check installation:

subread-align --version

3.1.7.5. Tophat

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

cd $HOME/app_sources
wget --no-clobber https://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.14.Linux_x86_64.tar.gz
tar xvfz tophat-2.0.14.Linux_x86_64.tar.gz
cd tophat-2.0.14.Linux_x86_64
rm -Rf AUTHORS LICENSE README intervaltree/ sortedcontainers/
mv ./* $HOME/bin
cd ..; rm -Rf tophat-2.0.14.Linux_x86_64*

Check installation:

tophat --version

3.1.8. Peak-calling

The following tools can be used to perform ChIP-seq peak-calling.

3.1.8.1. Homer

Required in order to run the tutorials.

Web page

Install instructions

mkdir $HOME/app_sources/homer
cd $HOME/app_sources/homer
wget "http://homer.salk.edu/homer/configureHomer.pl"
perl configureHomer.pl -install homer
cp `find $HOME/app_sources/homer/bin -maxdepth 1 -executable -type f` $HOME/bin

The basic Homer installation does not contain any sequence data. To download sequences for use with HOMER, use the configureHomer.pl script. To get a list of available packages:

perl $HOME/app_sources/homer/configureHomer.pl -list

To install packages, simply use the -install option and the name(s) of the package(s).

However, Homer can also work with custom genomes in FASTA format and gene annotations in GTF format. Thus the Gene-regulation workflows don’t require any genome to be installed.

Check installation:

findMotifs.pl

3.1.8.2. MACS 1.4

Required in order to run the demo workflow “ChIP-seq” on dataset GSE20870 (in the tutorials section).

cd $HOME/app_sources
wget "https://github.com/downloads/taoliu/MACS/MACS-1.4.2.tar.gz"
tar -xvzf MACS-1.4.2.tar.gz
cd MACS-1.4.2
sudo python setup.py install

Check installation:

macs14 --version

3.1.8.3. MACS 2

Required in order to run the tutorials.

sudo apt-get install python-numpy
sudo pip install MACS2

Check installation:

macs2 --version

3.1.8.4. bPeaks

Peak-caller developed specifically for yeast. It is only useful for processing small genomes.

It is currently not used in demo workflows, and is therefore not mandatory to run the tutorials.

Available as an R package.

Web page

install.packages("bPeaks")
library(bPeaks)

3.1.8.5. SPP

This peak-caller is used in the ChIP-seq case study GSE20870.

  • Method 1: git

See github page.

Commands in R:

require(devtools)
devtools::install_github('hms-dbmi/spp', build_vignettes = FALSE)

  • Method 2: Bioconductor [deprecated]

source("http://bioconductor.org/biocLite.R")
biocLite("spp")
install.packages("caTools")
install.packages("spp")

  • Method 3: command line [deprecated]

sudo apt-get install libboost-all-dev
cd $HOME/app_sources
wget -nc http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz

  • Method 4: the ultimate protocol for Ubuntu 14.04

In unix shell:

# unix libraries
sudo apt-get update
sudo apt-get -y install r-base
sudo apt-get -y install libboost-dev zlibc zlib1g-dev

In R shell:

# Rsamtools
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

In unix shell:

# spp
cd $HOME/app_sources
wget http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz

Check installation in R:

library(spp)

A few links:

  • The download page can be found here; version 1.11 is recommended.
  • SPP requires the Bioconductor library Rsamtools to be installed beforehand.
  • Unix packages gcc and libboost (or equivalents) must be installed.
  • You can find a few more notes here.
  • Good luck!

3.1.8.6. Mosaics

This peak-caller is used in the ChIP-seq case study GSE20870.

Installation in R from Bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite("mosaics")

3.1.8.7. SWEMBL

This peak-caller is used in the ChIP-seq case study GSE20870.

cd $HOME/app_sources
git clone https://github.com/stevenwilder/SWEMBL.git
cd SWEMBL
make
cp SWEMBL $HOME/bin

Deprecated method

cd $HOME/app_sources
wget "http://www.ebi.ac.uk/~swilder/SWEMBL/SWEMBL.3.3.1.tar.bz2" && \
bunzip2 -f SWEMBL.3.3.1.tar.bz2 && \
tar xvf SWEMBL.3.3.1.tar && \
rm SWEMBL.3.3.1.tar && \
chown -R ubuntu-user SWEMBL.3.3.1 && \
cd SWEMBL.3.3.1 && \
make

# This method requires a manual fix of the C flags in the makefile
# gcc main.c IO.c calc.c stack.c summit.c refcalc.c wiggle.c overlap.c -o SWEMBL -lz -lm

3.1.9. Motif discovery, motif analysis

These tools can be useful for the analysis of ChIP-seq peaks.

3.1.9.1. Regulatory Sequence Analysis Tools (RSAT)

See the dedicated section.

Link

Software suite specialised in the analysis of cis-regulatory motifs, developed by the teams of Morgane Thomas-Chollier (ENS, Paris) and Jacques van Helden (TAGC, Marseille). It includes tools specific to the analysis of ChIP-seq data.

3.1.9.2. MEME

Link

Software suite specialised in the analysis of cis-regulatory motifs, developed by Tim Bailey’s team. It includes tools specific to the analysis of ChIP-seq data.

3.1.10. RNA-seq

3.1.10.1. featureCounts

Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014

3.1.11. Miscellaneous

3.1.11.1. SRA toolkit

This toolkit includes a number of programs, allowing the conversion of *.sra files. fastq-dump translates *.sra files to *.fastq files.

You can download the latest version here, or issue the following commands:

cd $HOME/app_sources
wget -nc http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.5.2/sratoolkit.2.5.2-ubuntu64.tar.gz
tar xzf sratoolkit.2.5.2-ubuntu64.tar.gz
cp `find sratoolkit.2.5.2-ubuntu64/bin -maxdepth 1 -executable -type l` $HOME/bin

You can also install SRA toolkit simply by issuing this command, but likely it won’t be the most recent release:

sudo apt-get install sra-toolkit
fastq-dump --version
  fastq-dump : 2.1.7

3.1.11.2. Samtools

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

SAMtools provides several tools to process such files.

cd $HOME/app_sources
wget --no-clobber http://sourceforge.net/projects/samtools/files/samtools/1.3/samtools-1.3.tar.bz2
bunzip2 -f samtools-1.3.tar.bz2
tar xvf samtools-1.3.tar
cd samtools-1.3
make
sudo make install

3.1.11.3. Bedtools

The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

sudo apt-get install bedtools

or install a more recent version from source:

cd $HOME/app_sources
wget --no-clobber https://github.com/arq5x/bedtools2/releases/download/v2.24.0/bedtools-2.24.0.tar.gz
tar xvfz bedtools-2.24.0.tar.gz
cd bedtools2
make
sudo make install

3.1.11.4. Bedops

cd $HOME/app_sources
wget -nc https://github.com/bedops/bedops/releases/download/v2.4.19/bedops_linux_x86_64-v2.4.19.tar.bz2
tar jxvf bedops_linux_x86_64-v2.4.19.tar.bz2
mkdir bedops
mv bin bedops
cp bedops/bin/* $HOME/bin

3.1.11.5. Deeptools

cd $HOME/app_sources
git clone https://github.com/fidelram/deepTools
cd deepTools
python setup.py install

3.1.11.6. Picard tools

todo

3.1.11.7. Other

3.2. Makefile

The Gene-regulation library includes a makefile that can install most of the dependencies described in the previous section. It is recommended when you’re setting up a virtual environment, as described in these tutorials.

If you want to run the workflows on your personal computer or on a server, you should follow the manual installation, or contact a sysadmin.

The makefile currently allows running the following workflows:

  • import_from_sra.wf
  • quality_control.wf
  • ChIP-seq.wf

It does not yet handle all the RNA-seq dependencies.

# it is assumed that you have defined a global variable with the path to the Gene-regulation library
cd ${GENE_REG_PATH}
make -f scripts/makefiles/install_tools_and_libs.mk all
source ~/.bashrc
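
Since the commands above assume that GENE_REG_PATH is defined, a small guard can avoid running make from the wrong directory. This is a sketch only: check_gene_reg_path is a hypothetical helper, and it simply tests for the makefile path used in the command above.

```shell
# Hypothetical guard: fail early if GENE_REG_PATH is unset or does not
# contain the makefile used above.
check_gene_reg_path() {
    if [ -z "${GENE_REG_PATH:-}" ]; then
        echo "GENE_REG_PATH is not set" >&2
        return 1
    fi
    if [ ! -f "$GENE_REG_PATH/scripts/makefiles/install_tools_and_libs.mk" ]; then
        echo "makefile not found under $GENE_REG_PATH" >&2
        return 1
    fi
}
```

Usage: check_gene_reg_path && cd "$GENE_REG_PATH" && make -f scripts/makefiles/install_tools_and_libs.mk all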

3.3. Conda

A number of dependencies of Gene-regulation can be installed through a Conda environment. This list is not exhaustive.

conda install -c bioconda sickle=0.5
conda install -c bioconda bowtie=1.2.0
conda install -c bioconda bowtie2=2.3.0
conda install -c bioconda subread=1.5.0.post3
conda install -c bioconda tophat=2.1.1
conda install -c bioconda bwa=0.7.15
conda install -c bioconda fastqc=0.11.5
conda install -c bioconda macs2=2.1.1.20160309
conda install -c bioconda homer=4.8.3
conda install -c bioconda bedtools=2.26.0
conda install -c bioconda samtools=1.3.1
conda install -c bioconda bamtools=2.4.0
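
Whichever installation route was used (manual, makefile or Conda), it can be useful to check which tools are actually reachable from the PATH. The sketch below is an assumption-laden example: check_tools is a hypothetical helper, and the list of command names in the usage line is taken from the "Check installation" commands of this manual.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_tools sickle bowtie bowtie2 subread-align tophat bwa fastqc macs2 findMotifs.pl bedtools samtools bamtools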