Introduction

BiocProject is a (pending) Bioconductor package that provides a way to use Portable Encapsulated Projects (PEPs) within the Bioconductor framework.

This vignette assumes you are already familiar with PEPs. If not, see pep.databio.org to learn more about PEP, and the pepr documentation to learn more about reading PEPs in R.

BiocProject uses objects of the Project class (from pepr) to handle your project metadata, and lets you provide a data loading/processing function so that you can load both the metadata and the data for an entire project with a single line of R code.

The output of the BiocProject function is the object that your function returns, enriched with the PEP in its metadata slot. Bioconductor classes that extend the Annotated class all store metadata this way (see ?Annotated-class for details).
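As a minimal sketch of what a metadata slot is, the example below attaches a list to an S4Vectors::SimpleList and reads it back. It uses only S4Vectors, not BiocProject, and the element name `note` is an assumption chosen for illustration; it is not the name BiocProject uses internally.

```r
library(S4Vectors)

# Any object whose class extends Annotated has a metadata slot that holds a list.
obj = S4Vectors::SimpleList(a = 1:3)

# Attach arbitrary metadata (the name "note" is illustrative only).
S4Vectors::metadata(obj) = list(note = "project metadata would live here")

# The data and the metadata now travel together in one object.
S4Vectors::metadata(obj)$note
```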

Installation

You must first install pepr:

devtools::install_github(repo='pepkit/pepr')

Then, install BiocProject:

devtools::install_github(repo='pepkit/BiocProject')

How to use BiocProject

Introduction to PEP components

In order to use the BiocProject package, you first need a PEP. For this vignette, we have included a basic example PEP within the package, but if you like, you can create your own, or download an example PEP.

The central component of a PEP is the project configuration file. Let’s load up BiocProject and grab the path to our example configuration file:

library(BiocProject)

configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject",
  "project_config.yaml",
  package = "BiocProject"
)
configFile
#> [1] "/tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml"

This path points to a YAML project config file that looks like this:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   bioconductor:
      readFunName: readBedFiles
      readFunPath: readBedFiles.R

This configuration file points to the second major part of a PEP: the sample annotation CSV file (sample_table.csv). Here are the contents of that file:

sample_name     file_path
laminB1Lads     data/laminB1Lads.bed
vistaEnhancers  data/vistaEnhancers.bed

In this example, our PEP has two samples, each with two attributes: sample_name and file_path, which points to the location of the data.
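To make the structure of the annotation sheet concrete, the sketch below recreates the two-row table in a temporary file and reads it back with base R. This is only an illustration: pepr does its own parsing when it loads a PEP, and the comma-separated layout and the temporary file are assumptions made for this example.

```r
# Recreate the example sample table in a temporary CSV file (illustration only).
csvPath = tempfile(fileext = ".csv")
writeLines(c(
    "sample_name,file_path",
    "laminB1Lads,data/laminB1Lads.bed",
    "vistaEnhancers,data/vistaEnhancers.bed"
), csvPath)

# Read it back: one row per sample, one column per attribute.
samples = read.csv(csvPath, stringsAsFactors = FALSE)
nrow(samples)      # number of samples
colnames(samples)  # names of the sample attributes
```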

The configuration file also points to a third file (readBedFiles.R). This file holds a single R function called readBedFiles, which has these contents:

function (project) 
{
    cwd = getwd()
    paths = pepr::sampleTable(project)$file_path
    sampleNames = pepr::sampleTable(project)$sample_name
    setwd(dirname(project@file))
    result = lapply(paths, function(x) {
        df = read.table(x)
        colnames(df) = c("chr", "start", "end")
        gr = GenomicRanges::GRanges(df)
    })
    setwd(cwd)
    names(result) = sampleNames
    return(GenomicRanges::GRangesList(result))
}

And that’s all there is to it! This PEP really consists of three components:

  1. the project configuration file (which points to an annotation sheet and specifies your function name)
  2. the annotation sheet
  3. an R file that holds a function that knows how to process this data.

With that, we’re ready to see how BiocProject works.

How to use the BiocProject function

With a PEP in hand, it takes only a single line of code to do all the magic with BiocProject:

bp = BiocProject(file=configFile)
#> Loading config file: /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml
#> Used function 'readBedFiles' from the environment

This loads the project metadata from the PEP, then loads and calls the data processing function, and returns the R object that the function produces, enriched with the PEP metadata. Consequently, the object contains all your project metadata and data! Let’s inspect it:

bp
#> GRangesList object of length 2:
#> $laminB1Lads
#> GRanges object with 1302 ranges and 0 metadata columns:
#>          seqnames              ranges strand
#>             <Rle>           <IRanges>  <Rle>
#>      [1]     chr1   11401198-11694590      *
#>      [2]     chr1   14877629-15246452      *
#>      [3]     chr1   18229570-19207602      *
#>      [4]     chr1   29618442-31162049      *
#>      [5]     chr1   33943885-35623392      *
#>      ...      ...                 ...    ...
#>   [1298]     chrX 154066672-154251301      *
#>   [1299]     chrY     2880166-7112793      *
#>   [1300]     chrY   15047033-15333970      *
#>   [1301]     chrY   15603977-16627892      *
#>   [1302]     chrY   16966225-21013116      *
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
#> 
#> $vistaEnhancers
#> GRanges object with 1339 ranges and 0 metadata columns:
#>          seqnames              ranges strand
#>             <Rle>           <IRanges>  <Rle>
#>      [1]     chr1     3190581-3191428      *
#>      [2]     chr1     8130439-8131887      *
#>      [3]     chr1   10593123-10594209      *
#>      [4]     chr1   10732070-10733118      *
#>      [5]     chr1   10757664-10758631      *
#>      ...      ...                 ...    ...
#>   [1335]     chrX 139380916-139382199      *
#>   [1336]     chrX 139593502-139594774      *
#>   [1337]     chrX 139674499-139675403      *
#>   [1338]     chrX 147829016-147830159      *
#>   [1339]     chrX 150407692-150409052      *
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Since the data processing function returned a GenomicRanges::GRangesList object, the final result of the BiocProject function is an object of the same class.

How to interact with the returned object

The returned object supports all the pepr::Project methods, which are listed in the pepr reference documentation.

sampleTable(bp)
#>       sample_name               file_path
#> 1:    laminB1Lads    data/laminB1Lads.bed
#> 2: vistaEnhancers data/vistaEnhancers.bed
config(bp)
#> Config object. Class: Config
#>  pep_version: 2.0.0
#>  sample_table: 
#> /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/sample_table.csv
#>  bioconductor:
#>     readFunName: readBedFiles
#>     readFunPath: 
#> /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/readBedFiles.R
#>  name: example_BiocProject

Finally, there are a few methods specific to BiocProject objects:

getProject(bp)
#> PEP project object. Class:  Project
#>   file:  
#> /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject/project_config.yaml
#>   samples:  2

How to provide a data load function

In the basic case, the function name (and the path to its source file, if necessary) is specified in the YAML config file itself:

bioconductor:
  readFunName: function_name

or

bioconductor:
  readFunName: function_name
  readFunPath: /path/to/the/file.R

The specified function can be a data processing function of any complexity, but it has to follow the three rules listed below.

Rules:

  1. must take at least one argument,
  2. the argument must be a pepr::Project object (the function should use that input to load all the relevant data into R),
  3. must return an object of a class that extends the Annotated class.

Listed below are some of the classes that extend the class Annotated:

showClass("Annotated")
#> Virtual Class "Annotated" [package "S4Vectors"]
#> 
#> Slots:
#>                
#> Name:  metadata
#> Class:     list
#> 
#> Known Subclasses: 
#> Class "Vector", directly
#> Class "Hits", by class "Vector", distance 2
#> Class "SelfHits", by class "Hits", distance 3
#> Class "SortedByQueryHits", by class "Hits", distance 3
#> Class "SortedByQuerySelfHits", by class "SelfHits", distance 4
#> Class "Rle", by class "Vector", distance 2
#> Class "Factor", by class "Vector", distance 2
#> Class "List", by class "Vector", distance 2
#> Class "SimpleList", by class "List", distance 3
#> Class "HitsList", by class "SimpleList", distance 4
#> Class "SelfHitsList", by class "HitsList", distance 5
#> Class "SortedByQueryHitsList", by class "HitsList", distance 5
#> Class "SortedByQuerySelfHitsList", by class "SelfHitsList", distance 6
#> Class "DataFrame", by class "SimpleList", distance 4
#> Class "DFrame", by class "DataFrame", distance 5
#> Class "TransposedDataFrame", by class "List", distance 3
#> Class "Pairs", by class "Vector", distance 2
#> Class "FilterRules", by class "SimpleList", distance 4
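Rule 3 can be checked on a candidate return value before the function is wired into a PEP. The sketch below is one way to do that with methods::is; `canCarryPEP` is a hypothetical helper name introduced here and is not part of BiocProject or pepr.

```r
library(S4Vectors)

# Hypothetical helper: TRUE when the class of `x` extends Annotated,
# i.e. when `x` has a metadata slot that a PEP could be stored in.
canCarryPEP = function(x) methods::is(x, "Annotated")

canCarryPEP(S4Vectors::DataFrame(a = 1:3))  # DataFrame extends Annotated
canCarryPEP(data.frame(a = 1:3))            # a base data.frame does not
```

If the check fails for the object your function produces, wrapping the result in a class from the list above, such as S4Vectors::DataFrame or S4Vectors::SimpleList, is one possible way to satisfy the rule.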

Consider the readBedFiles function as an example of a function that can be used with the BiocProject package:

function (project) 
{
    cwd = getwd()
    paths = pepr::sampleTable(project)$file_path
    sampleNames = pepr::sampleTable(project)$sample_name
    setwd(dirname(project@file))
    result = lapply(paths, function(x) {
        df = read.table(x)
        colnames(df) = c("chr", "start", "end")
        gr = GenomicRanges::GRanges(df)
    })
    setwd(cwd)
    names(result) = sampleNames
    return(GenomicRanges::GRangesList(result))
}
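The function above changes the working directory and restores it only at the end, so an error raised while reading a file would leave the session in the project directory. One possible alternative, sketched below, is to resolve the relative paths against the location of the config file and not call setwd at all. `resolveSamplePaths` is a hypothetical helper, not part of BiocProject or pepr, and its absolute-path test covers only Unix-style paths.

```r
# Hypothetical helper: turn paths given relative to the config file into
# paths that work from any working directory. Paths that already look
# absolute (starting with "/" or "~") are returned unchanged.
resolveSamplePaths = function(configPath, samplePaths) {
    isAbsolute = grepl("^(/|~)", samplePaths)
    ifelse(isAbsolute, samplePaths, file.path(dirname(configPath), samplePaths))
}

resolved = resolveSamplePaths(
    "/projects/example/project_config.yaml",
    c("data/laminB1Lads.bed", "/shared/vistaEnhancers.bed")
)
resolved
```

Inside a reading function, the same helper could be applied to project@file and the file_path column of pepr::sampleTable(project) before calling read.table.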

Data reading function error/warning handling

The BiocProject function provides a way to rigorously monitor exceptions related to your data reading function. All the produced warnings and errors are caught, processed and displayed in an organized way:

configFile = system.file(
  "extdata",
  "example_peps-master",
  "example_BiocProject_exceptions",
  "project_config.yaml",
  package = "BiocProject"
)

bpExceptions = BiocProject(configFile)
#> Loading config file: /tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject_exceptions/project_config.yaml
#> Function 'readBedFilesExceptions' read from file '/tmp/Rtmpp7Kvae/temp_libpath658d0e8e5/BiocProject/extdata/example_peps-master/example_BiocProject_exceptions/readBedFilesExceptions.R'
#> 
#> ------------------------------ Your function error ------------------------------
#> 
#> error 1: test error
#> --------------------------------------------------------------------------------
#> No data was read. The error message was returned instead: test error
#> Warning in .callBiocFun(getFunction(funcName, where = e, mustFind = TRUE), :
#> There were warnings associated with your function execution.
#> 
#> -------------------------- Your function warnings (2) --------------------------
#> 
#> warning 1: first test warning
#> 
#> warning 2: second test warning
#> --------------------------------------------------------------------------------

As indicated in the messages above, no data is returned. Instead, an S4Vectors::List with the PEP in its metadata slot is produced.

bpExceptions
#> CharacterList of length 1
#> [[1]] test error
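The general pattern of collecting every warning and the error message from a function call can be sketched in base R with withCallingHandlers and tryCatch. This is not BiocProject's internal implementation, which this vignette does not show; `collectConditions` and `failingReader` are hypothetical names used only for illustration.

```r
# Call `fun(...)`, recording all warnings and, if it fails, the error message.
collectConditions = function(fun, ...) {
    warningMessages = character(0)
    errorMessage = NULL
    value = tryCatch(
        withCallingHandlers(
            fun(...),
            warning = function(w) {
                # Record the warning, then let `fun` continue running.
                warningMessages <<- c(warningMessages, conditionMessage(w))
                invokeRestart("muffleWarning")
            }
        ),
        error = function(e) {
            # An error aborts `fun`; keep its message in place of a value.
            errorMessage <<- conditionMessage(e)
            NULL
        }
    )
    list(value = value, warnings = warningMessages, error = errorMessage)
}

# Hypothetical reader that emits two warnings and then fails.
failingReader = function(project) {
    warning("first test warning")
    warning("second test warning")
    stop("test error")
}

res = collectConditions(failingReader, project = NULL)
res$warnings
res$error
```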

Further reading

See the “More arguments than just a PEP in your function?” vignette if you want to:

  • use an anonymous function instead of one defined a priori
  • use a function that requires more arguments than just a PEP

See the “Working with remote data” vignette to learn how to download the data from the Internet, process it and store it conveniently with related metadata in any object from the Bioconductor project.

See the “Working with large datasets - simpleCache” vignette to learn how the simpleCache R package can be used to avoid repeated, lengthy recalculation of results when working with large datasets.