BiocProject is a (pending) Bioconductor package that provides a way to use Portable Encapsulated Projects (PEPs) within the Bioconductor framework.
This vignette assumes you are already familiar with PEPs. If not, see pep.databio.org to learn more about PEP, and the pepr documentation to learn more about reading PEPs in R.
BiocProject uses objects of the Project class (from pepr) to handle your project metadata, and allows you to provide a data loading/processing function so that you can load both project metadata and data for an entire project with a single line of R code.
The output of the BiocProject function is the object that your function returns, but enriched with the PEP in its metadata slot. This way of storing metadata is uniform across all objects within the Bioconductor project (see ?Annotated-class for details).
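To make the idea of a metadata slot more concrete, here is a minimal sketch that uses only the methods package. ToyAnnotated and ToyResult are hypothetical stand-ins invented for this illustration; they are not the real S4Vectors classes, and the PEP stored here is just a plain list.

```r
library(methods)

# Hypothetical stand-in for a virtual class that only contributes a 'metadata' slot
setClass("ToyAnnotated", representation("VIRTUAL", metadata = "list"))

# Hypothetical result class: it holds data and inherits the 'metadata' slot
setClass("ToyResult", contains = "ToyAnnotated",
         representation(values = "numeric"))

res <- new("ToyResult", values = c(1, 2, 3))

# Store project-level information next to the data
res@metadata <- list(PEP = list(name = "example_project", samples = 2L))

res@metadata$PEP$name
is(res, "ToyAnnotated")
```

Any class that inherits the slot can carry project information in the same place, which is the property the real Annotated class provides.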
You must first install pepr:
devtools::install_github(repo='pepkit/pepr')
Then, install BiocProject:
devtools::install_github(repo='pepkit/BiocProject')
In order to use the BiocProject package, you first need a PEP. For this vignette, we have included a basic example PEP within the package, but if you like, you can create your own, or download an example PEP.
The central component of a PEP is the project configuration file. Let’s load up BiocProject and grab the path to our example configuration file:
library(BiocProject)
configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject",
  "project_config.yaml",
  package = "BiocProject"
)
configFile
#> [1] "/tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml"
This path points to a YAML project config file that looks like this:
pep_version: 2.0.0
sample_table: sample_table.csv
bioconductor:
  readFunName: readBedFiles
  readFunPath: readBedFiles.R
This configuration file points to the second major part of a PEP: the sample annotation CSV file (sample_table.csv). Here are the contents of that file:
sample_name | file_path |
---|---|
laminB1Lads | data/laminB1Lads.bed |
vistaEnhancers | data/vistaEnhancers.bed |
In this example, our PEP has two samples, each with two attributes: sample_name and file_path. The file_path attribute points to the location of the data.
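As a rough illustration of how such a table maps onto rows (samples) and columns (attributes), the sketch below writes the same two rows to a temporary CSV file and reads them back with base R. This is only an assumption-free stand-in for inspection; pepr does its own reading when it builds a Project, and the temporary file is not part of the package.

```r
# Write a two-row sample table to a temporary CSV (same columns as shown above)
csv <- tempfile(fileext = ".csv")
writeLines(c(
  "sample_name,file_path",
  "laminB1Lads,data/laminB1Lads.bed",
  "vistaEnhancers,data/vistaEnhancers.bed"
), csv)

samples <- read.csv(csv, stringsAsFactors = FALSE)

nrow(samples)      # one row per sample
colnames(samples)  # one column per attribute
samples$file_path  # paths relative to the project folder
```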
The configuration file also points to a third file (readBedFiles.R). This file holds a single R function called readBedFiles, which has these contents:
function (project)
{
    cwd = getwd()
    paths = pepr::sampleTable(project)$file_path
    sampleNames = pepr::sampleTable(project)$sample_name
    setwd(dirname(project@file))
    result = lapply(paths, function(x) {
        df = read.table(x)
        colnames(df) = c("chr", "start", "end")
        gr = GenomicRanges::GRanges(df)
    })
    setwd(cwd)
    names(result) = sampleNames
    return(GenomicRanges::GRangesList(result))
}
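And that’s all there is to it! This PEP really consists of 3 components: the project configuration file (project_config.yaml), the sample annotation table (sample_table.csv), and the R file that holds the data reading function (readBedFiles.R).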
With that, we’re ready to see how BiocProject works.
BiocProject function
With a PEP in hand, it takes only a single line of code to do all the magic with BiocProject:
bp = BiocProject(file=configFile)
#> Loading config file: /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml
#> Used function 'readBedFiles' from the environment
This loads the project metadata from the PEP, then loads and calls the actual data processing function, and returns the R object that the data processing function produces, but enriched with the PEP metadata. Consequently, the object contains all your project metadata and data! Let’s inspect it:
bp
#> GRangesList object of length 2:
#> $laminB1Lads
#> GRanges object with 1302 ranges and 0 metadata columns:
#>          seqnames              ranges strand
#>             <Rle>           <IRanges>  <Rle>
#>      [1]     chr1   11401198-11694590      *
#>      [2]     chr1   14877629-15246452      *
#>      [3]     chr1   18229570-19207602      *
#>      [4]     chr1   29618442-31162049      *
#>      [5]     chr1   33943885-35623392      *
#>      ...      ...                 ...    ...
#>   [1298]     chrX 154066672-154251301      *
#>   [1299]     chrY     2880166-7112793      *
#>   [1300]     chrY   15047033-15333970      *
#>   [1301]     chrY   15603977-16627892      *
#>   [1302]     chrY   16966225-21013116      *
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
#>
#> $vistaEnhancers
#> GRanges object with 1339 ranges and 0 metadata columns:
#>          seqnames              ranges strand
#>             <Rle>           <IRanges>  <Rle>
#>      [1]     chr1     3190581-3191428      *
#>      [2]     chr1     8130439-8131887      *
#>      [3]     chr1   10593123-10594209      *
#>      [4]     chr1   10732070-10733118      *
#>      [5]     chr1   10757664-10758631      *
#>      ...      ...                 ...    ...
#>   [1335]     chrX 139380916-139382199      *
#>   [1336]     chrX 139593502-139594774      *
#>   [1337]     chrX 139674499-139675403      *
#>   [1338]     chrX 147829016-147830159      *
#>   [1339]     chrX 150407692-150409052      *
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
Since the data processing function returned a GenomicRanges::GRangesList object, the final result of the BiocProject function is an object of the same class.
The created object provides all the pepr::Project methods (which you can find in the pepr reference documentation):
sampleTable(bp)
#>       sample_name               file_path
#> 1:    laminB1Lads    data/laminB1Lads.bed
#> 2: vistaEnhancers data/vistaEnhancers.bed
config(bp)
#> Config object. Class: Config
#>  pep_version: 2.0.0
#>  sample_table:
#>   /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/sample_table.csv
#>  bioconductor:
#>     readFunName: readBedFiles
#>     readFunPath:
#>      /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/readBedFiles.R
#>  name: example_BiocProject
Finally, there are a few methods specific to BiocProject objects:
getProject(bp)
#> PEP project object. Class: Project
#>   file:
#>    /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml
#>   samples: 2
In the basic case the function name (and path to source file, if necessary) is specified in the YAML config file itself, like:
bioconductor:
  readFunName: function_name
or
bioconductor:
  readFunName: function_name
  readFunPath: /path/to/the/file.R
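The two configurations differ in where the function comes from: either it is already defined in the R session, or it has to be read from a file first. The sketch below shows one way such a lookup could work in base R. It is not BiocProject's actual implementation; resolveReadFun, fromFile and the temporary file are hypothetical names used only for illustration.

```r
# Hypothetical helper: find a reader function by name, optionally sourcing a file first
resolveReadFun <- function(funName, funPath = NULL, envir = parent.frame()) {
  force(envir)
  if (!is.null(funPath)) {
    # Source the file into its own environment so nothing leaks into the caller
    fileEnv <- new.env(parent = envir)
    source(funPath, local = fileEnv)
    envir <- fileEnv
  }
  if (!exists(funName, envir = envir, mode = "function")) {
    stop("Function '", funName, "' not found")
  }
  get(funName, envir = envir, mode = "function")
}

# Case 1: only a name is given, so the function must already exist
function_name <- function(project) "from environment"
f1 <- resolveReadFun("function_name")

# Case 2: a name and a path are given, so the file is sourced first
rFile <- tempfile(fileext = ".R")
writeLines("fromFile <- function(project) 'from file'", rFile)
f2 <- resolveReadFun("fromFile", funPath = rFile)

f1(NULL)
f2(NULL)
```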
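The function specified can be a data processing function of any complexity, but has to follow the 3 rules listed below.
- It must take at least one argument.
- That argument must be a pepr::Project object (the function should use that input to load all the relevant data into R).
- It must return an object of a class that extends the class Annotated.
Listed below are some of the classes that extend the class Annotated: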
showClass("Annotated")
#> Virtual Class "Annotated" [package "S4Vectors"]
#>
#> Slots:
#>
#> Name:  metadata
#> Class:     list
#>
#> Known Subclasses:
#> Class "Vector", directly
#> Class "Hits", by class "Vector", distance 2
#> Class "SelfHits", by class "Hits", distance 3
#> Class "SortedByQueryHits", by class "Hits", distance 3
#> Class "SortedByQuerySelfHits", by class "SelfHits", distance 4
#> Class "Rle", by class "Vector", distance 2
#> Class "Factor", by class "Vector", distance 2
#> Class "List", by class "Vector", distance 2
#> Class "SimpleList", by class "List", distance 3
#> Class "HitsList", by class "SimpleList", distance 4
#> Class "SelfHitsList", by class "HitsList", distance 5
#> Class "SortedByQueryHitsList", by class "HitsList", distance 5
#> Class "SortedByQuerySelfHitsList", by class "SelfHitsList", distance 6
#> Class "DataFrame", by class "SimpleList", distance 4
#> Class "DFrame", by class "DataFrame", distance 5
#> Class "TransposedDataFrame", by class "List", distance 3
#> Class "Pairs", by class "Vector", distance 2
#> Class "FilterRules", by class "SimpleList", distance 4
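Before wiring a function into a PEP, it can be handy to check that it accepts an argument for the Project object at all. The sketch below does this with base R only, under the assumption that the project is passed as an argument; checkReadFun is a hypothetical helper and not part of BiocProject.

```r
# Hypothetical pre-flight check for a candidate data processing function
checkReadFun <- function(fun) {
  if (!is.function(fun)) stop("not a function")
  # formals() lists the declared arguments; one is needed to receive the project
  if (length(formals(fun)) < 1) {
    stop("the function must accept at least one argument")
  }
  invisible(TRUE)
}

good <- function(project) NULL
bad <- function() NULL

okGood <- checkReadFun(good)
msgBad <- tryCatch(checkReadFun(bad), error = function(e) conditionMessage(e))
msgBad
```

Whether the returned object extends Annotated can only be checked after the function has run, for example with methods::is(result, "Annotated") once S4Vectors is loaded.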
Consider the readBedFiles function as an example of a function that can be used with the BiocProject package:
function (project)
{
    cwd = getwd()
    paths = pepr::sampleTable(project)$file_path
    sampleNames = pepr::sampleTable(project)$sample_name
    setwd(dirname(project@file))
    result = lapply(paths, function(x) {
        df = read.table(x)
        colnames(df) = c("chr", "start", "end")
        gr = GenomicRanges::GRanges(df)
    })
    setwd(cwd)
    names(result) = sampleNames
    return(GenomicRanges::GRangesList(result))
}
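readBedFiles changes the working directory so that the relative file_path values resolve against the folder that holds the config file. If you would rather not touch the working directory in your own function, one alternative is to build the full paths with file.path. The sketch below shows this on a throwaway project layout; resolveSamplePaths, the temporary folder and a.bed are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: resolve sample paths relative to the config file location
resolveSamplePaths <- function(configPath, relPaths) {
  file.path(dirname(configPath), relPaths)
}

# Throwaway layout: <tmp>/project_config.yaml and <tmp>/data/a.bed
projDir <- tempfile("pep_")
dir.create(file.path(projDir, "data"), recursive = TRUE)
configPath <- file.path(projDir, "project_config.yaml")
invisible(file.create(configPath))
writeLines(c("chr1\t100\t200", "chr2\t300\t400"),
           file.path(projDir, "data", "a.bed"))

absPaths <- resolveSamplePaths(configPath, "data/a.bed")

# Read the BED-like file with the same column names readBedFiles assigns
df <- read.table(absPaths[1], col.names = c("chr", "start", "end"))
df
```

If you keep the setwd approach instead, registering the restore with on.exit(setwd(cwd)) right after changing directory restores it even when reading a file fails.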
The BiocProject function provides a way to rigorously monitor exceptions related to your data reading function. All the produced warnings and errors are caught, processed and displayed in an organized way:
configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject_exceptions",
  "project_config.yaml",
  package = "BiocProject"
)
bpExceptions = BiocProject(configFile)
#> Loading config file: /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject_exceptions/project_config.yaml
#> Function 'readBedFilesExceptions' read from file '/tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject_exceptions/readBedFilesExceptions.R'
#>
#> ------------------------------ Your function error ------------------------------
#>
#> error 1: test error
#> --------------------------------------------------------------------------------
#> No data was read. The error message was returned instead: test error
#> Warning in .callBiocFun(getFunction(funcName, where = e, mustFind = TRUE), :
#> There were warnings associated with your function execution.
#>
#> -------------------------- Your function warnings (2) --------------------------
#>
#> warning 1: first test warning
#>
#> warning 2: second test warning
#> --------------------------------------------------------------------------------
As indicated in the messages above, no data is returned. Instead, an S4Vectors::List with the PEP in its metadata slot is produced.
bpExceptions
#> CharacterList of length 1
#> [[1]] test error
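For readers curious how warnings and an error from a user function can be collected rather than lost, here is a minimal base R sketch of the general technique (withCallingHandlers nested inside tryCatch). It is an illustration under our own assumptions, not BiocProject's internal code; captureConditions and faultyReader are hypothetical names.

```r
# Hypothetical helper: run a function and collect its warnings and its error
captureConditions <- function(fun, ...) {
  warns <- character(0)
  err <- NULL
  value <- tryCatch(
    withCallingHandlers(
      fun(...),
      warning = function(w) {
        # Record the warning, then muffle it so execution continues
        warns <<- c(warns, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) {
      # Record the error message and return no data
      err <<- conditionMessage(e)
      NULL
    }
  )
  list(value = value, warnings = warns, error = err)
}

# A reader that warns twice and then fails, mirroring the messages above
faultyReader <- function(project) {
  warning("first test warning")
  warning("second test warning")
  stop("test error")
}

res <- captureConditions(faultyReader, project = NULL)
res$warnings
res$error
```

The calling handler deals with each warning as it is raised, which lets all of them be recorded before the error finally unwinds the call.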
See the “More arguments than just a PEP in your function?” vignette if your data processing function needs more arguments than just a PEP.
See the “Working with remote data” vignette to learn how to download the data from the Internet, process it and store it conveniently with related metadata in any object from the Bioconductor project.
See the “Working with large datasets - simpleCache” vignette to learn how the simpleCache R package can be used to avoid repeated, lengthy recalculation of results when working with large datasets.