4. Dependencies¶
Note: this section needs to be refreshed
4.1. Manual installation¶
This manual is meant to help you install the programs and dependencies required to run the snakemake workflows. It was written for Unix systems (Ubuntu, Debian).
4.1.1. General requirements¶
4.1.1.1. Generic tools¶
4.1.1.1.1. ssh¶
sudo apt-get install ssh
ssh-keygen
4.1.1.1.2. rsync¶
rsync is an open source utility that provides fast incremental file transfer.
sudo apt-get install rsync
4.1.1.1.3. git¶
- Install git on your machine.
sudo apt-get install git
Optional:
- Create an account on GitHub.
- Add your ssh public key to your GitHub account settings (account > settings > SSH keys > add SSH key).
less ~/.ssh/id_rsa.pub
4.1.1.1.4. zlib¶
Several tools require this dependency (e.g. sickle, bamtools...).
sudo apt-get install libz-dev
4.1.1.1.5. qsub¶
4.1.1.2. Create bin/ and app_sources/ (optional)¶
While some programs are installed fully automatically, others are not. Here we create two directories that will be used for manual installations.
mkdir -p $HOME/bin
mkdir -p $HOME/app_sources
You might then have to edit your $PATH manually (see next section).
4.1.1.3. Edit $PATH¶
In order to run manually installed programs by name, you may have to update your $PATH environment variable. You can do so by editing the ~/.profile file.
nano ~/.profile
Find the following block (or add it if it is missing) so that the directory holding manually installed executables is on your $PATH:
# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi
Source the file to apply the change to the current shell.
source ~/.profile
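To confirm that the change took effect, you can test whether the directory is really one of the entries of $PATH. The following is a minimal sketch; path_contains is a hypothetical helper name, not something provided by the workflows.

```shell
# Return success if directory $1 is one of the colon-separated entries of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$HOME/bin"; then
    echo "$HOME/bin is in PATH"
else
    echo "$HOME/bin is missing from PATH; check ~/.profile"
fi
```

Wrapping both the variable and the pattern in colons avoids false positives on partial matches such as $HOME/bin2.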
4.1.2. Snakemake workflows basic requirements¶
4.1.2.1. Python¶
Snakemake requires Python 3.3+. You can check which versions are installed by issuing the following commands in a terminal:
python --version # usually the default python version is 2.7+
python3 --version
If you don’t have python 3 you should install it.
sudo apt-get install python3
Install pip and pip3.
sudo apt-get install python-pip
sudo apt-get install python3-pip
If the Python development headers are not already installed:
sudo apt-get install python-dev
sudo apt-get install python3.4-dev
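The Python 3.3+ requirement can also be checked from a script rather than by eye. This is a minimal sketch, assuming GNU coreutils (for sort -V); version_ge is a hypothetical helper name.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v python3 >/dev/null 2>&1; then
    # `python3 --version` prints something like "Python 3.4.3"; keep the second field.
    py_version=$(python3 --version 2>&1 | awk '{print $2}')
    if version_ge "$py_version" 3.3; then
        echo "python3 $py_version is recent enough"
    else
        echo "python3 $py_version is older than 3.3"
    fi
else
    echo "python3 not found"
fi
```

The same helper could presumably be applied to the output of snakemake --version to check the minimal snakemake release.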
4.1.2.1.1. Pandas library¶
This library is used to read the tab-delimited files used in the workflows (see files samples.tab and design.tab).
pip3 install pandas
4.1.2.1.2. Package rpy2¶
pip3 install "rpy2<2.3.10"
4.1.2.2. R¶
todo
4.1.2.3. Snakemake¶
- Documentation
- FAQ
- Forum
- See also Snakemake section for tutorials.
Now that you have installed Python 3 and pip3 (see previous section), you can install snakemake.
pip3 install snakemake
You can check that snakemake works properly with this basic script:
"""Snakefile to test basic functions of snakemake.
"""
rule all:
    input: expand("bye.txt")

rule hello:
    """Write HELLO in a text file named hello.txt.
    """
    output: "hello.txt"
    message: "Generating {output} file."
    shell: "echo HELLO > {output}"

rule bye:
    """Write BYE in a text file named bye.txt.
    """
    input: "hello.txt"
    output: "bye.txt"
    message: "Generating {output} file."
    shell: "echo BYE > {output}"
- Save it to ~/workspace/hello.py.
- Issue the command cd ~/workspace ; snakemake -s hello.py.
- 2 files should be created: hello.txt and bye.txt.
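The expected result of this test can be verified with a small shell function. This is a minimal sketch; check_snakemake_test is a hypothetical helper name, and it assumes the test workflow was run in the directory passed as argument.

```shell
# Check that the two files produced by the test Snakefile exist in directory $1
# (default: current directory) and contain the expected words.
check_snakemake_test() {
    dir=${1:-.}
    if [ "$(cat "$dir/hello.txt" 2>/dev/null)" != "HELLO" ]; then
        echo "hello.txt is missing or has unexpected content"
        return 1
    fi
    if [ "$(cat "$dir/bye.txt" 2>/dev/null)" != "BYE" ]; then
        echo "bye.txt is missing or has unexpected content"
        return 1
    fi
    echo "snakemake test outputs look fine"
}

# Usage, after running the workflow:
#   check_snakemake_test ~/workspace
```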
As of December 2015, you need snakemake version 3.4+.
pip3 install snakemake --upgrade
If you want to use the Snakemake report function (optional):
pip3 install docutils
4.1.3. NGS analysis software & tools¶
4.1.3.1. Quality assessment¶
4.1.3.1.1. FastQC¶
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
The main functions of FastQC are:
- Import of data from BAM, SAM or FastQ files (any variant)
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML based permanent report
- Offline operation to allow automated generation of reports without running the interactive application
FastQC is available from linux repositories:
sudo apt-get install fastqc
However, the packaged version is older and can cause dependency problems.
We recommend installing it manually:
cd $HOME/app_sources
wget --no-clobber http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip -o fastqc_v0.11.5.zip
chmod +x FastQC/fastqc
ln -s -f $HOME/app_sources/FastQC/fastqc $HOME/bin/fastqc
4.1.3.2. Trimming¶
The quality of the reads generated by high-throughput sequencing technologies tends to decrease at their ends. Trimming consists in cutting off these ends, which improves the quality of the reads before mapping.
4.1.3.2.1. Sickle¶
Sickle is a trimming tool that improves the quality of NGS reads.
Sickle uses sliding windows to compute sequencing quality along the reads. When the quality falls below a chosen q-value threshold, the read is cut. If the remaining read is too short, it is removed completely. Sickle takes into account three different types of read quality: Illumina, Solexa, Sanger.
- Pre-requisite: install zlib (see the zlib section above).
- Clone the git repository into $HOME/app_sources and run make.
cd $HOME/app_sources
git clone https://github.com/najoshi/sickle.git
cd sickle
make
cp sickle $HOME/bin
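Once sickle is installed, trimming a single-end file might look like the sketch below. The function name sickle_trim_se, the DRY_RUN switch, the file names and the threshold values (20) are illustrative assumptions; the quality type (sanger here) must match your data.

```shell
# Trim a single-end FASTQ file with sickle.
# $1: input FASTQ, $2: output FASTQ,
# $3: quality threshold (-q), $4: minimal read length kept (-l).
# With DRY_RUN=1 the command is printed instead of executed.
sickle_trim_se() {
    in_fastq=$1
    out_fastq=$2
    qual=${3:-20}
    min_len=${4:-20}
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sickle se -f $in_fastq -t sanger -o $out_fastq -q $qual -l $min_len"
    else
        sickle se -f "$in_fastq" -t sanger -o "$out_fastq" -q "$qual" -l "$min_len"
    fi
}

# Usage (hypothetical file names):
#   sickle_trim_se reads.fastq reads_trimmed.fastq 20 20
```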
4.1.3.3. Alignment/mapping¶
The goal of alignment is to place the reads produced by sequencing back at their location on a reference genome. When a read is long enough, it can generally be mapped onto the genome with good confidence, while tolerating a certain number of mismatches, i.e. mispaired nucleotides. Nevertheless, some repeated sequences of the genome can prove harder to align. The term "sequencing depth" denotes the average number of reads aligned per position on the genome. The greater this depth, the better the quality of the alignment and of the downstream analyses.
4.1.3.3.1. BWA¶
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.
Li H. and Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.
sudo apt-get install bwa
4.1.3.3.2. Bowtie¶
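Before running these commands, set the BOWTIE1_VER shell variable to the Bowtie release you want to install (the release is not specified here).
cd $HOME/app_sources
wget --no-clobber http://downloads.sourceforge.net/project/bowtie-bio/bowtie/${BOWTIE1_VER}/bowtie-${BOWTIE1_VER}-linux-x86_64.zip
unzip bowtie-${BOWTIE1_VER}-linux-x86_64.zip
cp `find bowtie-${BOWTIE1_VER}/ -maxdepth 1 -executable -type f` $HOME/bin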
4.1.3.3.3. Bowtie2¶
cd $HOME/app_sources
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.6/bowtie2-2.2.6-linux-x86_64.zip
unzip bowtie2-2.2.6-linux-x86_64.zip
cp `find bowtie2-2.2.6/ -maxdepth 1 -executable -type f` $HOME/bin
4.1.3.4. Peak-calling¶
4.1.3.4.1. bPeaks¶
A peak-caller developed specifically for yeast; it is only suitable for processing small genomes.
Available as an R library.
install.packages("bPeaks")
library(bPeaks)
4.1.3.4.2. HOMER¶
mkdir $HOME/app_sources/homer
cd $HOME/app_sources/homer
wget "http://homer.salk.edu/homer/configureHomer.pl"
perl configureHomer.pl -install homer
cp `find $HOME/app_sources/homer/bin -maxdepth 1 -executable -type f` $HOME/bin
The basic HOMER installation does not contain any sequence data. To download sequences for use with HOMER, use the configureHomer.pl script downloaded above. To get a list of available packages:
perl $HOME/app_sources/homer/configureHomer.pl -list
To install packages, simply use the -install option and the name(s) of the package(s).
perl $HOME/app_sources/homer/configureHomer.pl -install mouse # (to download the mouse promoter set)
perl $HOME/app_sources/homer/configureHomer.pl -install mm8 # (to download the mm8 version of the mouse genome)
perl $HOME/app_sources/homer/configureHomer.pl -install hg19 # (to download the hg19 version of the human genome)
Supported organisms:
Organism | Assembly |
---|---|
Human | hg17, hg18, hg19 |
Mouse | mm8, mm9, mm10 |
Rat | rn4, rn5 |
Frog | xenTro2, xenTro3 |
Zebrafish | danRer7 |
Drosophila | dm3 |
C. elegans | ce6, ce10 |
S. cerevisiae | sacCer2, sacCer3 |
S. pombe | ASM294v1 |
Arabidopsis | tair10 |
Rice | msu6 |
HOMER can also work with custom genomes in FASTA format and gene annotations in GTF format.
4.1.3.4.3. MACS 1.4¶
cd $HOME/app_sources
wget "https://github.com/downloads/taoliu/MACS/MACS-1.4.3.tar.gz"
tar -xvzf MACS-1.4.3.tar.gz
cd MACS-1.4.3
sudo python setup.py install
macs14 --version
4.1.3.4.5. SPP R package¶
This one might be a little bit tricky (to put it mildly).
There are several possibilities, none of which I have had the courage to retry lately.
- In R
source("http://bioconductor.org/biocLite.R")
biocLite("spp")
install.packages("caTools")
install.packages("spp")
- On the command line
sudo apt-get install libboost-all-dev
cd $HOME/app_sources
wget -nc http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz
- Using git (I haven’t tried this one but it looks more recent) (see github page)
require(devtools)
devtools::install_github('hms-dbmi/spp', build_vignettes = FALSE)
I also wrote a little protocol a while ago. Here’s the procedure on Ubuntu 14.04, in this very order:
In unix shell:
# unix libraries
sudo apt-get update
sudo apt-get -y install r-base
sudo apt-get -y install libboost-dev zlibc zlib1g-dev
In R shell:
# Rsamtools
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
In unix shell:
# spp
wget http://compbio.med.harvard.edu/Supplements/ChIP-seq/spp_1.11.tar.gz
sudo R CMD INSTALL spp_1.11.tar.gz
4.1.3.4.6. SWEMBL¶
cd $HOME/app_sources
wget "http://www.ebi.ac.uk/~swilder/SWEMBL/SWEMBL.3.3.1.tar.bz2"
bunzip2 -f SWEMBL.3.3.1.tar.bz2
tar xvf SWEMBL.3.3.1.tar
rm SWEMBL.3.3.1.tar
chown -R "$USER" SWEMBL.3.3.1 # use another account name here if a different user should own the files
cd SWEMBL.3.3.1
make
It seems there could be issues with C flags. To be investigated.
4.1.3.5. Motif discovery, motif analysis¶
4.1.3.5.1. Regulatory Sequence Analysis Tools (RSAT)¶
See the dedicated section.
RSAT is a software suite specialized in the analysis of cis-regulatory motifs, developed by the teams of Morgane Thomas-Chollier (ENS, Paris) and Jacques van Helden (TAGC, Marseille). It includes tools specific to the analysis of ChIP-seq data.
4.1.3.6. Miscellaneous¶
4.1.3.6.1. SRA toolkit¶
This toolkit includes a number of programs for converting *.sra files. fastq-dump translates *.sra files to *.fastq files.
You can download the latest version manually, or issue the following commands:
cd $HOME/app_sources
wget -nc http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.5.2/sratoolkit.2.5.2-ubuntu64.tar.gz
tar xzf sratoolkit.2.5.2-ubuntu64.tar.gz
cp `find sratoolkit.2.5.2-ubuntu64/bin -maxdepth 1 -executable -type l` $HOME/bin
You can also install the SRA toolkit with the following command, but it will likely not be the most recent release:
sudo apt-get install sra-toolkit
fastq-dump --version
fastq-dump : 2.1.7
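As an illustration of the conversion described in this section, the sketch below wraps a fastq-dump call. The function name sra_to_fastq, the DRY_RUN switch and the file names are assumptions made for the example; whether --split-files is appropriate depends on your data (it writes paired-end mates to separate files).

```shell
# Convert one .sra file to FASTQ with fastq-dump.
# $1: input .sra file, $2: output directory (default: current directory).
# With DRY_RUN=1 the command is printed instead of executed.
sra_to_fastq() {
    sra_file=$1
    out_dir=${2:-.}
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "fastq-dump --split-files --outdir $out_dir $sra_file"
    else
        fastq-dump --split-files --outdir "$out_dir" "$sra_file"
    fi
}

# Usage (hypothetical file names):
#   sra_to_fastq sample.sra fastq/
```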
4.1.3.6.2. Samtools¶
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
SAMtools provides several tools to process such files.
cd $HOME/app_sources
wget --no-clobber http://sourceforge.net/projects/samtools/files/samtools/1.3/samtools-1.3.tar.bz2
bunzip2 -f samtools-1.3.tar.bz2
tar xvf samtools-1.3.tar
cd samtools-1.3
make
sudo make install
4.1.3.6.3. Bedtools¶
The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
sudo apt-get install bedtools
or get the latest version:
cd $HOME/app_sources
wget --no-clobber https://github.com/arq5x/bedtools2/releases/download/v2.24.0/bedtools-2.24.0.tar.gz
tar xvfz bedtools-2.24.0.tar.gz
cd bedtools2
make
sudo make install
4.1.3.6.4. Bedops¶
cd $HOME/app_sources
wget -nc https://github.com/bedops/bedops/releases/download/v2.4.19/bedops_linux_x86_64-v2.4.19.tar.bz2
tar jxvf bedops_linux_x86_64-v2.4.19.tar.bz2
mkdir bedops
mv bin bedops
cp bedops/bin/* $HOME/bin
4.1.3.6.5. Deeptools¶
cd $HOME/app_sources
git clone https://github.com/fidelram/deepTools
cd deepTools
python setup.py install
4.1.3.6.6. Picard tools¶
todo
4.1.3.6.7. Other¶
- MICSA: peak-calling & motifs discovery (publication).
- ChIPMunk: deep and wide digging for binding motifs in ChIP-Seq data (publication).
- HMCan: a method for detecting chromatin modifications in cancer samples using ChIP-seq data (publication).
- seqMINER
- Crunch project
- CSDeconv
- ...
4.2. Makefile¶
Has to be revised
The Gene-regulation library comprises a makefile that can install most of the dependencies described in the previous section.
It currently allows running the following workflows:
- import_from_sra.wf
- quality_control.wf
- ChIP-seq.wf
cd $GENE_REG_PATH
make -f gene-regulation/scripts/makefiles/install_tools_and_libs.mk all
source ~/.bashrc
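The commands above rely on the GENE_REG_PATH variable being set. A small guard such as the following can be run first; this is a minimal sketch, and check_gene_reg_path is a hypothetical helper name. The makefile path is the one used in the command above.

```shell
# Succeed only if GENE_REG_PATH is set and contains the installation makefile.
check_gene_reg_path() {
    if [ -z "${GENE_REG_PATH:-}" ]; then
        echo "GENE_REG_PATH is not set"
        return 1
    fi
    mk="$GENE_REG_PATH/gene-regulation/scripts/makefiles/install_tools_and_libs.mk"
    if [ ! -f "$mk" ]; then
        echo "makefile not found: $mk"
        return 1
    fi
    echo "found $mk"
}

# Usage:
#   check_gene_reg_path && cd "$GENE_REG_PATH" && \
#       make -f gene-regulation/scripts/makefiles/install_tools_and_libs.mk all
```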
4.3. Conda¶
This section has to be written
conda install -c bioconda sickle=0.5
conda install -c bioconda bowtie=1.2.0
conda install -c bioconda bowtie2=2.3.0
conda install -c bioconda subread=1.5.0.post3
conda install -c bioconda tophat=2.1.1
conda install -c bioconda bwa=0.7.15
conda install -c bioconda fastqc=0.11.5
conda install -c bioconda macs2=2.1.1.20160309
conda install -c bioconda homer=4.8.3
conda install -c bioconda bedtools=2.26.0
conda install -c bioconda samtools=1.3.1
conda install -c bioconda bamtools=2.4.0
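Whichever installation route is used, it can be useful to check which tools are actually reachable from the shell. This is a minimal sketch; report_missing_tools is a hypothetical helper name, and the list of executable names in the example call is an assumption (a package name does not always match the name of the executable it provides).

```shell
# Print every command name given as argument that cannot be found in PATH.
# Returns 0 if all commands were found, 1 otherwise.
report_missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}

report_missing_tools sickle bowtie bowtie2 bwa fastqc macs2 bedtools samtools bamtools \
    || echo "Some tools are missing; see the list above."
```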