Install and run PEPATAC

This guide will get you running PEPATAC quickly. If you get stuck or prefer more detailed instructions, refer to the extended tutorial.


1.1: Clone the PEPATAC pipeline

To begin, we need to get the PEPATAC pipeline itself. To clone the pipeline, you can use one of the following methods:

  • using SSH:
git clone git@github.com:databio/pepatac.git
  • using HTTPS:
git clone https://github.com/databio/pepatac.git

1.2: Install required software

You have two options for installing the software prerequisites: 1) use a container, in which case you need only either docker or singularity; or 2) install all prerequisites natively. If you want to install it natively, skip to the native installation instructions.

1.2.1: Use containers!

If you have experience using containers, you may simply run PEPATAC directly in a provided container. First, make sure your environment is set up to run either docker or singularity containers. Then, pull the container image:

Docker: You can pull the docker image from dockerhub like this:

docker pull databio/pepatac

Or build the image using the included Dockerfile (you can use a recipe in the included Makefile):

cd pepatac/
make docker

Singularity: You can download the singularity image or build it from the docker image following the recipe in the Makefile:

cd pepatac/
make singularity

Now you'll need to tell the pipeline where you saved the singularity image. You can either create an environment variable called $SIMAGES that points to the folder where your image is stored, or you can tweak the pipeline_interface.yaml file so that the compute.singularity_image attribute is pointing to the right location on disk.
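As a minimal sketch of the environment-variable route, the snippet below sets SIMAGES and checks that an image is where the pipeline will look. The folder $HOME/simages and the helper name check_simage are illustrative assumptions, not part of PEPATAC. The image file name pepatac is taken from the singularity command shown later in this guide.

```shell
# Point SIMAGES at the folder holding the image (hypothetical location;
# an existing SIMAGES value is left untouched).
export SIMAGES="${SIMAGES:-$HOME/simages}"

# check_simage is an illustrative helper, not part of PEPATAC.
# It reports whether an image named "pepatac" exists under $SIMAGES
# and returns non-zero if it does not.
check_simage() {
  if [ -e "$SIMAGES/pepatac" ]; then
    echo "found image: $SIMAGES/pepatac"
  else
    echo "no image at $SIMAGES/pepatac" >&2
    return 1
  fi
}
```

Calling check_simage after make singularity gives a quick confirmation before you launch a job.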

If your containers are set up correctly, you can skip the next section on installing software and jump to obtaining refgenie assemblies. You can also go straight to reading more detailed instructions on running the pipeline in a container.

1.2.2: Install software requirements natively

To use PEPATAC, we need the following software:

Python packages. The pipeline uses pypiper to run a single sample, looper to handle multi-sample projects (for either local or cluster computation), and pararead for parallel processing of sequence reads. For peak calling, the pipeline uses MACS2 by default. You can do a user-specific install of these like this:

pip install --user numpy \
  pandas \
  piper \
  https://github.com/pepkit/looper/zipball/master \
  pararead \
  MACS2

Required executables. We will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the tools section of the pipeline configuration file (pipelines/pepatac.yaml). The following tools are used by the pipeline:

That should do it for required packages! To obtain the full benefit of PEPATAC's QC and annotation features, install the following R packages as well.


1.3: Install optional software

PEPATAC uses R to generate quality control and read/peak annotation plots. These are optional: the pipeline will run without them, but you will not get any QC or annotation plots. The following packages are necessary:

To install the needed packages, run the following at the command prompt:

Rscript -e "install.packages(c('argparser','devtools', 'data.table', \
  'ggplot2', 'gplots', 'gtable', 'scales'), \
  repos='http://cran.us.r-project.org/'); \
  source('https://bioconductor.org/biocLite.R'); biocLite('GenomicRanges'); \
  devtools::install_github(c('pepkit/pepr', 'databio/GenomicDistributions'))"

To extract files faster, PEPATAC can use pigz in place of gzip if you have it installed. It's not required, but it can speed things up when you have many samples to process and multiple processors available.

If you install pigz, don't forget to add it to your PATH too! That's it! Everything we need to run PEPATAC to its full potential should be installed.
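One quick way to confirm that an executable is reachable is to ask the shell where it would find it. The sketch below uses the standard command -v lookup. The helper name check_tools is an illustrative assumption, and you pass it whichever tool names you care about.

```shell
# check_tools is an illustrative helper, not part of PEPATAC.
# It prints each named executable that is missing from PATH and
# returns non-zero if any of them are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not on PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_tools Rscript pigz reports whether the optional R and pigz executables can be found.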


2.1: Download refgenie assemblies

Whether using the container or native version, you will need to provide external reference genome assemblies. The pipeline requires genome assemblies produced by refgenie. One feature of the pipeline is prealignments, which siphon off reads by aligning to small genomes before the main alignment to the primary reference genome. Any prealignments you want to use will also require refgenie assemblies. Ideas for common prealignment references are provided by ref_decoy.

You may download pre-indexed references or you may index your own (see refgenie instructions). The pre-indexed references are compressed files, so you need to untar/unzip them after download. In this guide, we will download pre-built genomes.

Grab the hg38, human_repeats, and rCRSd (Revised Cambridge Reference Sequence for human mtDNA) genomes.

wget http://big.databio.org/refgenomes/hg38.tgz
wget http://big.databio.org/refgenomes/human_repeats_170502.tgz
wget http://big.databio.org/refgenomes/rCRSd_170502.tgz
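Because the downloads are compressed archives, they need to be unpacked into the folder where you keep your genomes. A minimal sketch follows. The helper name extract_genomes and the choice of destination folder are assumptions for illustration, and the folder names inside each archive are not shown in this guide, so list the destination afterwards to confirm them.

```shell
# extract_genomes is an illustrative helper, not part of PEPATAC or refgenie.
# Usage: extract_genomes DEST_DIR ARCHIVE...
# It creates DEST_DIR if needed and unpacks each .tgz archive into it,
# stopping at the first archive that fails to extract.
extract_genomes() {
  dest=$1
  shift
  mkdir -p "$dest" || return 1
  for archive in "$@"; do
    # -x extract, -z gunzip, -f archive file, -C switch to DEST_DIR first
    tar -xzf "$archive" -C "$dest" || return 1
  done
}
```

With the three downloads above, a call could look like extract_genomes /path/to/genomes hg38.tgz human_repeats_170502.tgz rCRSd_170502.tgz.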

2.2: Download or create annotation files

To calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. If a pre-built version for your genome of interest isn't present, you can quickly create that file yourself. In the reference genome directory, run the following commands (in this example, for hg38):

wget -O hg38_TSS_full.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
zcat hg38_TSS_full.txt.gz | \
  awk  '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | \
  LC_COLLATE=C sort -k1,1 -k2,2n -u > hg38_TSS.tsv
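To see what the awk step does, it can help to run the same program on a couple of made-up records. The sketch below wraps the awk and sort commands from above in a function (the name tss_from_refgene is illustrative) and feeds it two synthetic lines laid out like refGene rows, where column 3 is the chromosome, column 4 the strand, columns 5 and 6 the transcript start and end, and column 13 the gene name. The record values are invented for the demonstration.

```shell
# tss_from_refgene is an illustrative wrapper around the awk/sort steps above.
# For "+" strand rows it emits the transcript start (column 5) as the TSS;
# for all other rows it emits the transcript end (column 6).
tss_from_refgene() {
  awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' |
    LC_COLLATE=C sort -k1,1 -k2,2n -u
}

# Two synthetic 13-column records (values are made up for illustration).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  0 NM_A chr1 + 1000 5000 1200 4800 2 1000,3000, 2000,5000, 0 GENEA \
  0 NM_B chr1 - 7000 9000 7200 8800 1 7000, 9000, 0 GENEB |
  tss_from_refgene
```

The plus-strand record yields position 1000 (its transcript start) and the minus-strand record yields position 9000 (its transcript end), each written as a single-base interval with strand and gene name.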

We also have downloadable pre-built genome annotation files for hg38, hg19, mm10, and mm9 that you can use to annotate the reads and peaks. These files annotate 3' and 5' UTR, Exonic, Intronic, Intergenic, Promoter, and Promoter Flanking Regions of the corresponding genome as indicated in Ensembl or UCSC. Simply move the corresponding genome annotation file into the pepatac/anno folder. Once present in the pepatac/anno folder you don't need to do anything else as the pipeline will look there automatically. Alternatively, you can use the --anno-name pipeline option to directly point to this file when running. You can also learn how to create a custom annotation file to calculate coverage using your own features of interest.


2.3: Configure the pipeline

Once you've obtained assemblies for all genomes you wish to use, you must point the pipeline to where you store these. You can do this in two ways, either: 1) with an environment variable, or 2) by adjusting a configuration option. The pipeline looks for genomes stored in a folder specified by the resources.genomes attribute in the pipeline config file. By default, this points to the shell variable GENOMES, so all you have to do is set an environment variable to the location of your refgenie genomes:

export GENOMES="/path/to/genomes/"

Alternatively, you can skip the GENOMES variable and simply change the value of that configuration option to point to the folder where you stored the assemblies. The advantage of using an environment variable is that it makes the configuration file portable: the same pipeline can be run in any computing environment, because the location of the reference assemblies is not hard-coded.
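Before running anything, you may want to confirm that GENOMES points where you expect. The sketch below assumes each assembly lives in a subfolder of $GENOMES named after the assembly (for example $GENOMES/hg38). That layout and the helper name check_genomes are assumptions for illustration, so adjust them to match what you actually extracted.

```shell
# check_genomes is an illustrative helper, not part of PEPATAC or refgenie.
# Usage: check_genomes ASSEMBLY...
# Assumes one subfolder per assembly directly under $GENOMES.
# Prints each missing folder and returns non-zero if any are missing.
check_genomes() {
  status=0
  for assembly in "$@"; do
    if [ ! -d "$GENOMES/$assembly" ]; then
      echo "missing assembly folder: $GENOMES/$assembly" >&2
      status=1
    fi
  done
  return "$status"
}
```

For the genomes used in this guide, that would be check_genomes hg38 rCRSd human_repeats.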


2.4: Run the pipeline

The pipeline can be run directly from the command line for a single sample. Here we'll outline how to run an individual sample, followed by instructions for looping.

2.4.1: Running the pipeline script directly (without looper)

The pipeline is at its core just a Python script, and you can run it on the command line for a single sample. For the full list of command-line options, see the usage documentation, which you can also print by running pipelines/pepatac.py --help. You just need to pass a few command-line parameters to specify sample name, reference genome, input files, etc. See example commands that use test data. Here's the basic command to run a small test example through the pipeline:

pipelines/pepatac.py --single-or-paired paired \
  --prealignments rCRSd human_repeats \
  --genome hg38 \
  --sample-name test1 \
  --input examples/data/test1_r1.fastq.gz \
  --input2 examples/data/test1_r2.fastq.gz \
  --genome-size hs \
  -O $HOME/pepatac_test

This example should take about 15 minutes to complete.

2.4.2: Running the pipeline directly in a container

A full tutorial on using containers is outside the scope of this guide, but here are the basics. Individual jobs can be run in a container by simply running the pepatac.py command through docker run or singularity exec. You can run containers either on your local computer, or in an HPC environment, as long as you have docker or singularity installed. For example, run it locally in singularity like this:

singularity exec --bind $GENOMES $SIMAGES/pepatac pipelines/pepatac.py --help

With docker, you can use:

docker run --rm -it databio/pepatac pipelines/pepatac.py --help

Be sure to mount the volumes you need with --volume. If you're utilizing any environment variables (e.g. $GENOMES), don't forget to include those in your docker command with the -e option. For a more detailed example, check out our guide to learn how to run pepatac in a container.
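Putting the --volume and -e advice together, a minimal sketch might look like the following. The wrapper name pepatac_docker and the DOCKER override variable are illustrative assumptions, not part of PEPATAC. Mounting the genomes folder at the same path inside the container is one simple choice that keeps $GENOMES valid on both sides.

```shell
# pepatac_docker is an illustrative wrapper, not part of PEPATAC.
# It mounts the genomes folder at the same path inside the container,
# passes GENOMES through with -e, and forwards any arguments to the
# pipeline script. Set DOCKER=echo to print the command instead of
# running it.
pepatac_docker() {
  ${DOCKER:-docker} run --rm -it \
    --volume "$GENOMES:$GENOMES" \
    -e GENOMES="$GENOMES" \
    databio/pepatac pipelines/pepatac.py "$@"
}
```

For example, pepatac_docker --help should print the pipeline usage from inside the container. A real run would also need --volume entries for your input data and output folder, which this sketch leaves out.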

To run on multiple samples, you can just write a loop to process each sample independently with the pipeline, or you can use...looper! Learn more about using looper with PEPATAC in the how-to guides or in the extended tutorial. Any questions? Feel free to reach out to us. Otherwise, go analyze some ATAC-seq!

2.4.3: Running the pipeline on multiple samples

If you need to run it on many samples, you could write your own sample handling code, but we have pre-configured everything to work nicely with looper, our sample handling engine. The extended tutorial includes a more detailed explanation of how to use looper to analyze some provided example data.
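If you do want to handle samples yourself, a plain shell loop over the single-sample command is enough for small projects. The sketch below reuses the options from the test command earlier in this guide. The helper name run_samples, the RUNNER override variable, and the assumption that every sample is paired-end with files named SAMPLE_r1.fastq.gz and SAMPLE_r2.fastq.gz are illustrative. Passing the same -O folder for every sample is also an assumption about how you want results organized.

```shell
# run_samples is an illustrative helper, not part of PEPATAC.
# Usage: run_samples DATA_DIR OUT_DIR SAMPLE...
# Assumes paired-end inputs named SAMPLE_r1.fastq.gz and SAMPLE_r2.fastq.gz
# inside DATA_DIR. Stops at the first sample whose run fails.
# Set RUNNER=echo to print each command instead of running it.
run_samples() {
  data_dir=$1
  out_dir=$2
  shift 2
  for sample in "$@"; do
    ${RUNNER:-} pipelines/pepatac.py --single-or-paired paired \
      --prealignments rCRSd human_repeats \
      --genome hg38 \
      --sample-name "$sample" \
      --input "$data_dir/${sample}_r1.fastq.gz" \
      --input2 "$data_dir/${sample}_r2.fastq.gz" \
      --genome-size hs \
      -O "$out_dir" || return 1
  done
}
```

Running RUNNER=echo with run_samples examples/data "$HOME/pepatac_test" test1 first lets you inspect the generated commands before committing compute time. Unlike looper, this loop runs samples one after another and does no cluster submission.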