Learn the implied sample modifier in pepr

This vignette will show you how and why to use the implied attributes functionality of the pepr package.

Problem/Goal

The example below shows how implied attributes save time and effort when several sample attributes must be defined for many samples and their values follow a pattern. Consider the following sample table:

sample_name  organism  time  file_path                        genome  genome_size
frog_0h      frog      0     data/lab/project/frog_0h.fastq
frog_1h      frog      1     data/lab/project/frog_1h.fastq
human_1h     human     1     data/lab/project/human_1h.fastq  hg38    hs
human_0h     human     0     data/lab/project/human_0h.fastq  hg38    hs
mouse_1h     mouse     1     data/lab/project/mouse_1h.fastq  mm10    mm
mouse_0h     mouse     0     data/lab/project/mouse_1h.fastq  mm10    mm

Solution

Samples whose organism attribute is human or mouse follow two distinct patterns: each carries additional values in the genome and genome_size columns of the sample_table.csv file, and those values are fully determined by the organism. You can therefore use implied attributes to add them to the sample annotations, setting species-level attributes once at the project level instead of duplicating that information for every sample of a species. The rules are stated explicitly in the project_config.yaml file, presented below.

   pep_version: 2.0.0
   sample_table: sample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
   sample_modifiers:
      imply:
         - if:
              organism: human
           then:
              genome: hg38
              macs_genome_size: hs
         - if:
              organism: mouse
           then:
              genome: mm10
              macs_genome_size: mm

The sample_modifiers.imply section of the project_config.yaml file is a multi-level key-value section in which each rule pairs an if block with a then block. Note that the keys and values in each if block must match the column names and attribute values in the sample_table.csv file.
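To make the if/then semantics concrete, here is a minimal base R sketch of how such rules could be applied to a sample table. It is an illustration only and not pepr's implementation: apply_imply is a hypothetical helper, the three-row samples table is a shortened stand-in for the example data, and filling unmatched samples with an empty string is an assumption.

```r
# Hypothetical helper (not part of pepr): apply imply-style rules to a data frame.
apply_imply <- function(samples, rules) {
  for (rule in rules) {
    # A sample matches when every attribute in the "if" block has the listed value.
    matches <- rep(TRUE, nrow(samples))
    for (attr in names(rule$`if`)) {
      matches <- matches & samples[[attr]] %in% rule$`if`[[attr]]
    }
    # Set each "then" attribute on the matching samples only.
    for (attr in names(rule$then)) {
      # Assumption: samples that match no rule get an empty string.
      if (is.null(samples[[attr]])) samples[[attr]] <- ""
      samples[[attr]][matches] <- rule$then[[attr]]
    }
  }
  samples
}

# Shortened stand-in for the sample table, without the implied columns.
samples <- data.frame(
  sample_name = c("frog_0h", "human_1h", "mouse_1h"),
  organism = c("frog", "human", "mouse"),
  stringsAsFactors = FALSE
)

# The two rules from the config, written as nested R lists.
rules <- list(
  list(`if` = list(organism = "human"),
       then = list(genome = "hg38", macs_genome_size = "hs")),
  list(`if` = list(organism = "mouse"),
       then = list(genome = "mm10", macs_genome_size = "mm"))
)

implied <- apply_imply(samples, rules)
print(implied)
```

The frog sample matches neither rule, so its genome and macs_genome_size stay empty, while the human and mouse samples pick up the values from their respective then blocks.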

To use the sample_modifiers.imply section of the config, modify the original sample_table.csv file: simply omit the attributes that will be implied and let pepr do the work for you.

sample_name  organism  time  file_path
frog_0h      frog      0     data/lab/project/frog_0h.fastq
frog_1h      frog      1     data/lab/project/frog_1h.fastq
human_1h     human     1     data/lab/project/human_1h.fastq
human_0h     human     0     data/lab/project/human_0h.fastq
mouse_1h     mouse     1     data/lab/project/mouse_1h.fastq
mouse_0h     mouse     0     data/lab/project/mouse_1h.fastq

Code

Load pepr and read in the project metadata by specifying the path to the project_config.yaml:

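library(pepr)
# Branch of the bundled example_peps data (the paths in the output below use "master")
branch = "master"
projectConfig = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_imply",
  "project_config.yaml",
  package = "pepr"
)
p = Project(projectConfig)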
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_imply/project_config.yaml
#> Warning in readLines(con): incomplete final line found on '/private/var/folders/
#> 3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/
#> extdata/example_peps-master/example_imply/project_config.yaml'

And inspect it:

sampleTable(p)
#>    sample_name organism time                       file_path genome
#> 1:     frog_0h     frog    0  data/lab/project/frog_0h.fastq       
#> 2:     frog_1h     frog    1  data/lab/project/frog_1h.fastq       
#> 3:    human_1h    human    1 data/lab/project/human_1h.fastq   hg38
#> 4:    human_0h    human    0 data/lab/project/human_0h.fastq   hg38
#> 5:    mouse_1h    mouse    1 data/lab/project/mouse_1h.fastq   mm10
#> 6:    mouse_0h    mouse    0 data/lab/project/mouse_1h.fastq   mm10
#>    macs_genome_size
#> 1:                 
#> 2:                 
#> 3:               hs
#> 4:               hs
#> 5:               mm
#> 6:               mm

As you can see, the resulting samples are annotated exactly as if they had been read from the original annotation file, where the values in the last two columns were entered manually.
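The equivalence can also be checked programmatically. The sketch below uses base R only and does not call pepr: it rebuilds the manually annotated table by hand (using the macs_genome_size column name from the config), derives the same two columns from organism with simple lookup vectors, and compares the results. The lookup helper and all variable names are hypothetical.

```r
# Manually annotated table: the last two columns are typed in for every sample.
manual <- data.frame(
  sample_name = c("frog_0h", "frog_1h", "human_1h", "human_0h", "mouse_1h", "mouse_0h"),
  organism = c("frog", "frog", "human", "human", "mouse", "mouse"),
  genome = c("", "", "hg38", "hg38", "mm10", "mm10"),
  macs_genome_size = c("", "", "hs", "hs", "mm", "mm"),
  stringsAsFactors = FALSE
)

# Trimmed table: the implied columns are left out.
trimmed <- manual[, c("sample_name", "organism")]

# Species-level values, stated once per organism.
genome_by_organism <- c(human = "hg38", mouse = "mm10")
size_by_organism <- c(human = "hs", mouse = "mm")

# Hypothetical helper: look up each key, using "" when the organism has no entry.
lookup <- function(map, keys) {
  out <- unname(map[keys])
  out[is.na(out)] <- ""
  out
}

implied <- trimmed
implied$genome <- lookup(genome_by_organism, trimmed$organism)
implied$macs_genome_size <- lookup(size_by_organism, trimmed$organism)

# Compare the derived table with the manually annotated one.
same <- isTRUE(all.equal(implied, manual))
print(same)
```

With a real project, the same comparison could be made between sampleTable(p) and a table read from a fully annotated CSV file.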

Moreover, the p object contains all the information from the project config file (project_config.yaml). Run the following line to explore it:

config(p)
#> Config object. Class: Config
#>  pep_version: 2.0.0
#>  sample_table: 
#> /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_imply/sample_table.csv
#>  looper:
#>     output_dir: /Users/mstolarczyk/hello_looper_results
#>  sample_modifiers:
#>     imply:
#>             if:
#>                 organism: human
#>             then:
#>                 genome: hg38
#>                 macs_genome_size: hs
#>             if:
#>                 organism: mouse
#>             then:
#>                 genome: mm10
#>                 macs_genome_size: mm
#>  name: example_imply