This guide walks you through obtaining the reference genome assemblies that PEPATAC needs.
Whether you use the container or native version, you must provide external reference genome assemblies. The pipeline requires genome assemblies produced by refgenie.
One feature of the pipeline is prealignments, which siphon off reads by aligning to small genomes before the main alignment to the primary reference genome. Any prealignments you want to do will also require refgenie assemblies. With the default configuration files, the pipeline prealigns to the mitochondrial genome, so if you want to use the default settings, you will need refgenie assemblies for the rCRSd genome (for human) or mouse_chrM (for mouse) in addition to the primary assembly you wish to use. Other ideas for common prealignment references are provided by ref_decoy.
You have two options for using refgenie assemblies with PEPATAC. If you're using a common genome, there's a good chance a refgenie assembly already exists for it. Otherwise, you can create your own.
Pre-built genome indices exist for commonly used genomes, including mm9. You can simply download the corresponding pre-indexed references to get started immediately.
For complete and detailed information on indexing your own genomes, see the refgenie documentation.
For a quick introduction, a simple example is presented here.
You'll need Pypiper (which you will already have if you've installed PEPATAC) and, of course, refgenie itself:
pip install --user piper
git clone https://github.com/databio/refgenie.git
Refgenie will produce indices for many alignment tools, provided you have them installed. PEPATAC aligns with bowtie2, so make sure you have that installed (see the bowtie2 documentation) and in your PATH. Then build the indices with:
src/refgenie.py -i INPUT_FILE.fa
INPUT_FILE.fa is a fasta file of your reference genome, and can be either a local file or a URL.
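Before running the indexing command above, it can help to confirm that bowtie2 is discoverable from your shell. The following is a minimal sketch, not part of refgenie or PEPATAC; check_tool is a hypothetical helper name introduced only for illustration.

```shell
# check_tool is a hypothetical helper: it reports whether a program is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Warn instead of failing so the check can be run interactively.
check_tool bowtie2 || echo "Install bowtie2 and add it to your PATH before indexing."
```

If the tool is reported missing, fix your PATH first; otherwise refgenie may not be able to build the bowtie2 index.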
Once you've procured assemblies for all the genomes you wish to use, you must point the pipeline to where you store them. You can do this in one of two ways: 1) with an environment variable, or 2) by adjusting a configuration option.
The pipeline looks for genomes stored in the folder specified by the resources.genomes attribute in the pipeline config file. By default, this points to the shell variable GENOMES, so all you have to do is set the GENOMES environment variable to the location of your refgenie genomes. (Add the export line to your .profile to ensure it persists.)
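As a minimal sketch of the environment-variable option, the snippet below sets GENOMES for the current shell. The folder $HOME/genomes is a hypothetical location chosen for illustration; substitute the folder where you actually stored your assemblies.

```shell
# Hypothetical location: replace with the folder that holds your refgenie assemblies.
export GENOMES="$HOME/genomes"

# Confirm the variable is set in the current shell.
echo "GENOMES is set to: $GENOMES"

# To persist it, place the same export line in ~/.profile, for example:
#   export GENOMES="$HOME/genomes"
```

Because the variable is exported, programs launched from this shell, including the pipeline, inherit it.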
Alternatively, you can skip the GENOMES variable and simply change the value of that configuration option to point to the folder where you stored the assemblies. The advantage of using an environment variable is that it keeps the configuration file portable: the location of the reference assemblies is not hard-coded, so the same pipeline can be run in any computing environment.
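Whichever option you choose, a quick check that the expected assemblies are present can save a failed run. The sketch below assumes one subfolder per assembly inside the genomes folder, which is an assumption about the layout rather than something this guide specifies; missing_genomes is a hypothetical helper name.

```shell
# missing_genomes is a hypothetical helper: it prints each genome name
# that has no subfolder under the given directory.
# Assumption: each assembly lives in its own subfolder named after the genome.
missing_genomes() {
  dir="$1"
  shift
  for g in "$@"; do
    [ -d "$dir/$g" ] || echo "$g"
  done
}

# Example: a mouse run with the default mitochondrial prealignment.
# Falls back to the current directory if GENOMES is unset.
missing_genomes "${GENOMES:-.}" mm9 mouse_chrM
```

An empty result means every listed assembly has a folder; any name printed still needs to be downloaded or built.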