The folderfun package makes it easy for you to manage files on disk for your R project. folderfun is short for folder functions, but you’ll soon discover that it’s fun as well.
In a basic R project, you’ll probably want to read in data and write out plots or results. By default, the reading, plotting, and writing functions will read or write files in your current working directory. That might be fine for small, simple projects, but it breaks down for many real-world use cases.
For example, what if you save your R project as a git repository (which is a good idea)? You don’t want to store large, compressed input files in the same folder, nor do you want to commit your plot outputs. Instead, you’ll want to store the data and results in other folders. Large projects can also require multiple folders for both input and output – for example, you may load a shared data resource that lives in a group folder as well as some of your own project-specific resources. What if you want to work on a project with multiple people? These distributed folders can be organized in different ways and reside on different file systems in different computing environments. It can become a nightmare to keep track of all the folders on disk where different data and results are stored. And if you start hard-coding paths inside your R script, you make your code less portable, because it will run only in that one computing environment. What if the data change locations? Your code breaks.
folderfun solves all these issues by making it dead simple to use wrapper folder functions to point to different data sources. Instead of pointing to input or output files with absolute file names, we define a function that remembers a root folder, and then use relative filenames with that function to identify individual files. Coupled with environment variables that define parent folder locations, you can easily maintain project-level subfolders with code that works across individuals and computing environments with almost no effort. This makes your code more portable and sharable and enables multiple users to work together on complex projects in different compute environments while sharing a single code base. Are you convinced yet?
Let’s say we have a project that needs to read data from one folder, let’s call it data, and write results to another folder, let’s call it results. Here’s how you might start this analysis naively:
# Load our data:
input1 = read.table("/long/and/annoying/hard/coded/path/data.txt")
input2 = read.table("/long/and/annoying/hard/coded/path/data2.txt")
output1 = processData(input1)
output2 = processData2(input2)
# Run other analysis...
# Now write results:
write.table(output1, "/different/long/annoying/hard/coded/path/result.txt")
write.table(output2, "/different/long/annoying/hard/coded/path/result2.txt")
OK, that works… but it has problems. First, you repeat the paths, making them harder to change if the data move; second, if you want to refer to these same locations in a different script, you’d have to repeat the paths yet again; and third, this script won’t work in a different compute environment, since file paths may differ.
We can solve the first problem by defining a path variable, and then using it in multiple places:
inputDir = "/long/and/annoying/hard/coded/path"
outputDir = "/different/long/annoying/hard/coded/path"
input1 = read.table(file.path(inputDir, "data.txt"))
input2 = read.table(file.path(inputDir, "data2.txt"))
output1 = processData(input1)
output2 = processData2(input2)
# Run other analysis...
write.table(output1, file.path(outputDir, "result.txt"))
write.table(output2, file.path(outputDir, "result2.txt"))
That’s much nicer; it limits the hard-coded folders to a single variable per folder, making them easier to maintain. Plus, now someone else could re-use this script by just adjusting the variable pointers at the top. But we still haven’t solved the problems of using these variables in another script or using this script in another environment. And besides, that file.path(...) syntax is really annoying! With folderfun we can do better.
folderfun approach
With folderfun, we’ll use a function called setff to create functions, each of which will provide a path to a folder of interest. This is analogous to what we’re trying to do with inputDir and outputDir above; we just use a function call instead of a variable. We assign each folder function a name (In and Out in this example), and provide the location of the folder:
library(folderfun)
setff("In", "/long/and/annoying/hard/coded/path/")
## Created folder function ffIn(): /long/and/annoying/hard/coded/path/
setff("Out", "/different/long/annoying/hard/coded/path/")
## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
These calls have created new functions, each named by prepending the text ff (for folder function) to the name we gave. These functions allow us to build paths to files inside those folders by simply passing a relative path (filename), like this:
ffIn("data.txt")
ffOut("result.txt")
## [1] "/long/and/annoying/hard/coded/path//data.txt"
## [1] "/different/long/annoying/hard/coded/path//result.txt"
So our original analysis would look something like this:
input1 = read.table(ffIn("data.txt"))
input2 = read.table(ffIn("data2.txt"))
output1 = processData(input1)
output2 = processData2(input2)
# Run other analysis...
write.table(output1, ffOut("result.txt"))
write.table(output2, ffOut("result2.txt"))
So, to reiterate: setff("In", ...) creates a folder function called ffIn that will prepend the inputDir path to its argument, giving you easy access to files in the directory referenced in the setff call. You can have as many folder functions as you want, with whatever names you like. Creating a function with a name already in use will overwrite the older function with that name.
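To make the idea of "a function that remembers a root folder" concrete, here is a minimal base-R sketch. It is not how folderfun is implemented; makeFolderFunction and inSketch are hypothetical names used only for illustration, and the sketch joins paths with file.path rather than reproducing the package's exact output.

```r
# Hypothetical helper, NOT part of folderfun: build a function that
# remembers a root folder and joins relative paths onto it.
makeFolderFunction <- function(root) {
  force(root)  # capture the root now rather than lazily at first use
  function(...) {
    # With no arguments, return the root folder itself.
    if (length(list(...)) == 0) return(root)
    file.path(root, ...)
  }
}

inSketch <- makeFolderFunction("/long/and/annoying/hard/coded/path")
inSketch()            # the remembered root folder
inSketch("data.txt")  # the root joined with a relative file name
```

The closure keeps the root in its enclosing environment, so callers only ever pass relative file names.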
So far, so good – the folderfun syntax is much nicer than what we had before. But we still haven’t solved the problem of referring to these same folders from multiple scripts, or sharing scripts across computing environments. What if there was a way to share folder functions across scripts and servers? This is where folderfun becomes very useful. By using environment variables (or R options), we eliminate the step of hard-coding anything in the R script.
For example, say we put this code into our .bashrc or .profile to define the locations for a particular server:
export INDIR="/long/and/annoying/hard/coded/path/"
export OUTDIR="/different/long/annoying/hard/coded/path/"
Or, from within R, we could set the same environment variables with Sys.setenv.
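As a small sketch of the in-R route, base R's Sys.setenv takes name = value pairs. The values below simply reuse the paths from the shell example; a variable set this way lasts only for the current R session (and any processes it launches), so it is less persistent than an entry in .bashrc.

```r
# Set the same two variables from inside R (values copied from the shell example).
Sys.setenv(
  INDIR  = "/long/and/annoying/hard/coded/path/",
  OUTDIR = "/different/long/annoying/hard/coded/path/"
)

# Read one back to confirm it is visible to this session.
Sys.getenv("INDIR")
```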
Or perhaps our locations are R-specific, and so we store them in our .Rprofile as R options:
options(INDIR="/long/and/annoying/path/to/hard/coded/file/")
options(OUTDIR="/different/long/annoying/hard/coded/path/")
Setting these variables or options creates global values that can be read by any R script. Furthermore, we could define variables with the same names on different systems. We have effectively outsourced the specification of the root directories to our .Rprofile or .bashrc. Now, all we need to do is use the global variables to build our folder functions. We could do this like so:
setff("In", Sys.getenv("INDIR"))
## Created folder function ffIn(): /long/and/annoying/hard/coded/path/
setff("Out", Sys.getenv("OUTDIR"))
## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
ffIn()
ffIn("data.txt")
## [1] "/long/and/annoying/hard/coded/path/"
## [1] "/long/and/annoying/hard/coded/path//data.txt"
Alternatively, using R options:
setff("In", getOption("INDIR"))
## Created folder function ffIn(): /long/and/annoying/path/to/hard/coded/file/
setff("Out", getOption("OUTDIR"))
## Created folder function ffOut(): /different/long/annoying/hard/coded/path/
ffIn()
ffIn("data.txt")
## [1] "/long/and/annoying/path/to/hard/coded/file/"
## [1] "/long/and/annoying/path/to/hard/coded/file//data.txt"
That code is now portable across scripts and servers because it uses the global folders. But it gets even easier: we’ve wrapped the Sys.getenv and getOption calls into setff so you just need to specify the global variable name to the pathVar argument:
setff("In", pathVar="INDIR")
When you pass the pathVar argument, setff will look first for an R option with that name, and then for an environment variable with that name. So, this has the same effect as above, but no longer requires specifying the path directly in any particular R script. That one line of code, then, is all you need in your script to get the universal ffIn function.
But wait, there’s more! Now, here’s the ultimate syntactic sugar to make it dead simple to create portable folder functions. If your folder function name matches the name of the pathVar, then you don’t even need to provide the pathVar. For example, say we wanted to name our folder function ffIndir instead of just ffIn. In that case, you’d get the same result with:
setff("Indir")
The name provided exactly determines the function name (ffIndir), and it also specifies a priority of places to search for a pathVar value: it favors R options over environment variables, and it first looks for the name exactly as given, then tries an all-caps and then an all-lowercase version of the name until a nonempty value (neither NULL nor "") is found. If no match is found, the setff call will result in an error.
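The search order described above can be sketched in base R. This is a hypothetical illustration, not folderfun's code: resolvePathVar and the variable name FFSKETCHDIR are made up, and the sketch assumes that, for each spelling of the name, the R option is checked before the environment variable. The package may interleave those checks differently.

```r
# Hypothetical sketch of a prioritized lookup; NOT folderfun's actual code.
resolvePathVar <- function(name) {
  # A usable value is a single nonempty value (neither NULL nor "").
  isUsable <- function(x) !is.null(x) && length(x) == 1 && nzchar(x)
  # Try the name as given, then all-caps, then all-lowercase.
  for (candidate in unique(c(name, toupper(name), tolower(name)))) {
    # Assumption: for each spelling, the R option is checked before
    # the environment variable.
    optValue <- getOption(candidate)
    if (isUsable(optValue)) return(optValue)
    envValue <- Sys.getenv(candidate, unset = "")
    if (isUsable(envValue)) return(envValue)
  }
  stop("No R option or environment variable found for: ", name)
}

# Made-up variable name, chosen to avoid clashing with real settings.
Sys.setenv(FFSKETCHDIR = "/from/environment/")
resolvePathVar("Ffsketchdir")  # found through the all-caps spelling

options(FFSKETCHDIR = "/from/option/")
resolvePathVar("Ffsketchdir")  # the R option now takes precedence
```

Raising an error when nothing matches mirrors the behavior the text describes for setff, and it surfaces a missing configuration early instead of producing a path built from an empty string.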
So far we’ve addressed how to create universal folder functions. We’ve solved the main problems with the traditional approach. Using folder functions combined with R options or environment variables allows us to: 1) avoid repeating paths either within a script or across scripts, because they are stored globally; 2) let the exact same script work in two different computing environments. We can do all of this with a simple, easy-to-understand call to setff, and then wrapping all our references to disk resources with the appropriate ff function.
But let’s go one step further: what if we want more than just a set of global folders? What if we also want to specify project-specific folders? We might want input or output subfolders that reside in our parent INDIR and OUTDIR folders but give us a separate space for each project. This is possible with another setff argument: postpend. Using postpend allows you to append additional text (e.g. subfolders) to the folder function’s path. For example, here’s some code that will give you a subfolder named by projectName at the location specified by your $DATA environment variable:
projectName="myproject"
setff("Data", pathVar="DATA", postpend=projectName)
## Created folder function ffData(): /long/and/annoying/path/to/hard/coded/file/myproject
Remember, you could also take advantage of folderfun’s smart matching in this case by leaving off the pathVar argument:
projectName="myproject"
setff("Data", postpend=projectName)
## Created folder function ffData(): /long/and/annoying/path/to/hard/coded/file/myproject
There you have it! A single line gives you portable and project-specific input and output folder functions, making it easier for you to manage your data and results.
You can get a list of all your loaded folder functions with the listff function:
listff()
## funcNames pdirOptVals
## FF_Data "ffData" "/long/and/annoying/path/to/hard/coded/file/myproject"
## FF_In "ffIn" "/long/and/annoying/path/to/hard/coded/file/"
## FF_Out "ffOut" "/different/long/annoying/hard/coded/path/"
Now let’s see how this fits into a real-world system. In our lab, we have set aside a few locations on our primary server where we store both raw and processed data, and we store the folder locations in shell environment variables called $RAWDATA and $PROCESSED. We also have a few other variables that point to shared resources, like $RESOURCES and $GENOMES. Our server uses an environment modules system, so we have set up a lab environment module that populates these variables. If we ever need to move anything to a new file system, it’s as simple as updating the environment module, and all lab members’ pointers will automatically point to the new folder.
We use folderfun to access these folders in R. By convention, we assign a subfolder for each project in each of the RAW and PROCESSED folders. Then, we simply need to have this code in each script:
projectName="myproject"
setff("Raw", postpend=projectName)
setff("Processed", postpend=projectName)
Because every project is set up the same way, we’ve wrapped this capability into another function called projectInit, so we merely need to put projectInit(projectName) at the beginning of each script, and it will have access to the folder functions it needs. The beautiful thing about this approach is that these scripts are now automatically functional in any computing environment and are robust to data moves as long as the environment variables are kept up-to-date.
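The text does not show projectInit itself, so the following is only a guess at the shape such a wrapper could take, written in base R without folderfun. The function name projectInitSketch, its rootVars argument, and the example locations are all hypothetical.

```r
# Hypothetical stand-in for a lab-specific projectInit(); the real function
# is not shown in the text, and this sketch does not use folderfun.
projectInitSketch <- function(projectName,
                              rootVars = c(Raw = "RAWDATA", Processed = "PROCESSED")) {
  # Build one folder function per environment variable, keeping the names.
  lapply(rootVars, function(varName) {
    root <- Sys.getenv(varName, unset = "")
    if (!nzchar(root)) stop("Environment variable not set: ", varName)
    # Each project gets its own subfolder under the shared root.
    projectRoot <- file.path(root, projectName)
    function(...) file.path(projectRoot, ...)
  })
}

# Made-up locations, set here only so the example is self-contained.
Sys.setenv(RAWDATA = "/data/raw", PROCESSED = "/data/processed")
ff <- projectInitSketch("myproject")
ff$Raw("sample1.txt")        # path under the raw-data project subfolder
ff$Processed("counts.txt")   # path under the processed-data project subfolder
```

Returning a named list keeps the sketch free of side effects, whereas the text describes setff as creating the ff functions directly for the script to use.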
As noted, setff attempts to find a path value from either an R option or a shell environment variable. To do so, it uses a function in this package called optOrEnvVar. This prioritized name resolution function may be useful in other contexts, so it’s independently available:
name = "DUMMYTESTVAR"
value = "test_value"
optOrEnvVar(name) # NULL
# Sys.setenv() requires named arguments, so build the call from the name held in `name`
do.call(Sys.setenv, setNames(list(value), name))
optOrEnvVar(name) # Now resolves
Sys.unsetenv(name)
optOrEnvVar(name) # NULL
optArg = list(value)
names(optArg) = name
options(optArg)
optOrEnvVar(name) # Now resolves
do.call(Sys.setenv, setNames(list("new?"), name))
optOrEnvVar(name) # on name collision, option trumps environment variable.