An introduction to projectInit

projectInit is an R package that helps you load a project-specific R workspace. It reads environment variables to coordinate your working directory, code location, raw data folders, and output folders. It then provides universal functions to access those directories, so you don’t have to worry about keeping track of annoying file path bookkeeping, but can concentrate on your R code instead. This lets you easily work in different environments and share inputs and outputs across sessions and across users.

Environment setup

The package relies on shell environment variables you have set up to define where you store different classes of data (e.g. input data, output plots, code, shared resources, etc.). The specific variables used by projectInit are:

$CODE - Pointer to the collection of project folders where code is kept. Each folder corresponds to a project (for me, each folder is also a git repository)
$PROCESSED - Pointer to the ‘processed data’ filesystem
$RESOURCES - Pointer to the common shared resources directory

So, for instance, you could add this to your .bashrc to populate these locations for your computing environment:

export PROCESSED="/path/to/processed/data/"
export CODE="$HOME/code/"
export RESOURCES="/path/to/resources/"

You would just set these up to correct locations for each user and each computer or server, and projectInit will handle the rest.

Getting started

A data analysis project is a collection of R scripts. We assume that you have a folder for each project where you keep these R scripts, and these folders are each stored in your root $CODE folder. Then, you can initialize an R project using projectInit with this code:

library("projectInit")

## Loading required package: folderfun

projectInit("project_name")

## Created folder function ffCode(): /tmp/RtmpsibWsg/code/project_name

## Created folder function ffProc(): /tmp/RtmpsibWsg/processed/project_name

## Created folder function ffRaw(): /tmp/RtmpsibWsg/data/project_name

## Created folder function ffWeb(): /tmp/RtmpsibWsg/web/project_name

## Created folder function ffOut(): /tmp/RtmpsibWsg/processed/project_name/analysis

## Created folder function ffRes(): /tmp/RtmpsibWsg/resources

## Created folder function ffResCache(): /tmp/RtmpsibWsg/resources/cache/RCache

## Created folder function ffCache(): /tmp/RtmpsibWsg/processed/project_name/RCache

## Created folder function ffProcRoot(): /tmp/RtmpsibWsg/processed

## Created folder function ffRawRoot(): /tmp/RtmpsibWsg/data

## Created folder function ffWebRoot(): /tmp/RtmpsibWsg/web

## Created folder function ffOutRoot(): /tmp/RtmpsibWsg/processed

## Created folder function ffBuild(): /tmp/RtmpsibWsg/code/project_name/RBuild

## Did not find fixed-name config: /tmp/RtmpsibWsg/code/project_name/metadata/config.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/project_config.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/pconfig.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/project_name.yaml

## No RGenomeUtils found.

## No project init script: '/tmp/RtmpsibWsg/code/project_name/src/00-init.R'.

This will do a number of things:

Changes your working directory to ${CODE}/project_name
Runs a project-specific R init script, if you have one saved at ${CODE}/project_name/src/00-init.R.
Populates several project-level variables:

an processed data folder at ${PROCESSED}/project_name
a project-level cache directory at ${PROCESSED}/project_name/RCache. This variable is read as the default by R package simpleCache, so your caches will now be stored in a specified cache directory for this project.

Let’s go through some of these in a bit more detail:

Initialization script

In the init script (by default at ${CODE}/project_name/src/00-init.R), you should put whatever library() calls or other things you need that you want to be loaded all the time, for every script in this project. This is also where I define project-level functions that I want to use multiple times in the project, and where I load up project metadata (like sample annotation sheets) and other configuration options.

Output path functions

The second major thing done by the projectInit() function is that it provides a series of helper functions for input and output directories, without worrying about file paths. projectInit uses the folderfun package (available on CRAN) to do this, and you can read more about how this works there.

For example, try:

ffOut()

## [1] "/tmp/RtmpsibWsg/processed/project_name/analysis"

You can use this convenient function for all your plots. Instead of passing your filename directly to the pdf() function, run it through ffOut() to automatically save it in the project-specific output directory like. First, though, we need to make sure the folder exists, which we can do with the create argument to the folder function:

ffOut(create=TRUE)

## [1] "/tmp/RtmpsibWsg/processed/project_name/analysis/TRUE"

Then we’ll be able to generate our PDF in that folder:

pdf(ffOut("filename.pdf"))
plot(1)
dev.off()

This has two advantages: First, each project has a set space, so your outputs are organized by project systematically. Second, it provides portability: if you change projects or locations, or share this script with others, the code will just work, putting the plot in the correct location. If you had hard-coded the path everywhere, you’d have to change all that to lift code to a new project or to change compute environments, or if you changed the project output folder.

You can also extend this easily with subfolders. When projects become large, it’s useful to divide them into components. For example, I usually organize my R scripts around figures for a scientific paper, so all my scripts that produce panels or supporting information for figure 1 belong together, and so on. There may be a dozen scripts for each figure, and I’d like to easily include all the related output files in a separate subfolder. I can accomplish this like this. For all scripts relevant to figure 1, I use this code at the top of each script:

projectInit("project_name")

## Created folder function ffCode(): /tmp/RtmpsibWsg/code/project_name

## Created folder function ffProc(): /tmp/RtmpsibWsg/processed/project_name

## Created folder function ffRaw(): /tmp/RtmpsibWsg/data/project_name

## Created folder function ffWeb(): /tmp/RtmpsibWsg/web/project_name

## Created folder function ffOut(): /tmp/RtmpsibWsg/processed/project_name/analysis

## Created folder function ffRes(): /tmp/RtmpsibWsg/resources

## Created folder function ffResCache(): /tmp/RtmpsibWsg/resources/cache/RCache

## Created folder function ffCache(): /tmp/RtmpsibWsg/processed/project_name/RCache

## Created folder function ffProcRoot(): /tmp/RtmpsibWsg/processed

## Created folder function ffRawRoot(): /tmp/RtmpsibWsg/data

## Created folder function ffWebRoot(): /tmp/RtmpsibWsg/web

## Created folder function ffOutRoot(): /tmp/RtmpsibWsg/processed

## Created folder function ffBuild(): /tmp/RtmpsibWsg/code/project_name/RBuild

## Did not find fixed-name config: /tmp/RtmpsibWsg/code/project_name/metadata/config.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/project_config.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/pconfig.yaml; /tmp/RtmpsibWsg/code/project_name/metadata/project_name.yaml

## No RGenomeUtils found.

## No project init script: '/tmp/RtmpsibWsg/code/project_name/src/00-init.R'.

setOutputSubdir("fig1_clustering")

## Created folder function ffOut(): /tmp/RtmpsibWsg/processed/project_name/analysis/fig1_clustering

Now in these scripts when I use dirOut(), it’s smart enough to automatically place these figures in an appropriate subfolder:

ffOut()

## [1] "/tmp/RtmpsibWsg/processed/project_name/analysis/fig1_clustering"

You can also just use projectInit("project_name", outputSubdir="fig1_clustering") for short.

Data path functions

You get similar functions to easily access project-specific data, if you need it:

ffProc()

## [1] "/tmp/RtmpsibWsg/processed/project_name"

You can get a quick look at all the folder functions populated using, listff():

folderfun::listff()

##             funcNames    pdirOptVals                                                      
## FF_Build    "ffBuild"    "/tmp/RtmpsibWsg/code/project_name/RBuild"                       
## FF_Cache    "ffCache"    "/tmp/RtmpsibWsg/processed/project_name/RCache"                  
## FF_Code     "ffCode"     "/tmp/RtmpsibWsg/code/project_name"                              
## FF_Out      "ffOut"      "/tmp/RtmpsibWsg/processed/project_name/analysis/fig1_clustering"
## FF_OutRoot  "ffOutRoot"  "/tmp/RtmpsibWsg/processed"                                      
## FF_Proc     "ffProc"     "/tmp/RtmpsibWsg/processed/project_name"                         
## FF_ProcRoot "ffProcRoot" "/tmp/RtmpsibWsg/processed"                                      
## FF_Raw      "ffRaw"      "/tmp/RtmpsibWsg/data/project_name"                              
## FF_RawRoot  "ffRawRoot"  "/tmp/RtmpsibWsg/data"                                           
## FF_Res      "ffRes"      "/tmp/RtmpsibWsg/resources"                                      
## FF_ResCache "ffResCache" "/tmp/RtmpsibWsg/resources/cache/RCache"                         
## FF_Web      "ffWeb"      "/tmp/RtmpsibWsg/web/project_name"                               
## FF_WebRoot  "ffWebRoot"  "/tmp/RtmpsibWsg/web"

Caching

When we run projectInit(), one of the results is that we’ve populated a global variable to point to a project-level cache directory (which by default is located at ${PROCESSED}/project_name/RCache). Now, any calls to simpleCache() will automatically store and load caches from this directory. Since this is a shared, group-level space, anyone running the script on the same computing environment will be able to share caches. So any cached compute needs only be done once; even another user can re-use those caches.

Furthermore, it sets a global cache directory (by default at ${RESOURCES}/cache/RCache), which will automatically be used for global caches by simpleCache::simpleCacheShared(). This lets you share caches across projects with ease.

Setting up `.Rprofile`

To load projectInit by default, add this to your .Rprofile:

tryCatch({
    library(projectInit)
}, error = function(e) {
    message(e)
})

Nathan Sheffield

2019-04-15

Environment setup

Getting started

Initialization script

Output path functions

Data path functions

Caching

Setting up `.Rprofile`

Contents

An introduction to projectInit

Nathan Sheffield

2019-04-15

Environment setup

Getting started

Initialization script

Output path functions

Data path functions

Caching

Setting up .Rprofile

Contents

Setting up `.Rprofile`