Learn sample subannotations in `pepr`

This vignette will show you how and why to use the subsample table functionality of the pepr package.

For basic information about the PEP concept, visit the project website.
For a broader theoretical description, see the subsample table documentation section.

Problem/Goal

The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.

Solutions

Example 1: basic sample subannotation table

This example demonstrates how the sample subannotation functionality is used. Here, 2 samples (frog_1 and frog_2) have multiple input files that need merging, while 1 sample (frog_3) does not. Therefore, frog_3 takes its file value from the sample_table.csv file, while frog_1 and frog_2 get theirs from the several rows listed for them in the subsample_table.csv file. In the tables below, the file column of sample_table.csv holds the placeholder value multi for every sample.

This example is made up of these components:

Project config file:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/example_results

Sample table:

sample_name	protocol	file
frog_1	anySampleType	multi
frog_2	anySampleType	multi
frog_3	anySampleType	multi

Subsample table:

sample_name	subsample_name	file
frog_1	sub_a	data/frog1a_data.txt
frog_1	sub_b	data/frog1b_data.txt
frog_1	sub_c	data/frog1c_data.txt
frog_2	sub_a	data/frog2a_data.txt
frog_2	sub_b	data/frog2b_data.txt
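
To make the merge concrete, here is a minimal base R sketch of the idea, using the two tables above. It is not how pepr implements it; `merge_column` is a hypothetical helper written only for this illustration. It collects the subsample values per sample and falls back to the sample table value when a sample has no subsample rows.

```r
# Sample table as shown above: every sample carries the placeholder "multi".
sample_table <- data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  file = "multi",
  stringsAsFactors = FALSE
)

# Subsample table as shown above: one row per input file.
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c("data/frog1a_data.txt", "data/frog1b_data.txt",
           "data/frog1c_data.txt", "data/frog2a_data.txt",
           "data/frog2b_data.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pepr): build a list with one element per
# sample, holding all subsample values or the original sample table value.
merge_column <- function(samples, subsamples, column) {
  grouped <- split(subsamples[[column]], subsamples$sample_name)
  merged <- lapply(samples$sample_name, function(name) {
    if (name %in% names(grouped)) {
      grouped[[name]]
    } else {
      samples[[column]][samples$sample_name == name]
    }
  })
  names(merged) <- samples$sample_name
  merged
}

merged_files <- merge_column(sample_table, subsample_table, "file")
str(merged_files)
```

The result is a list rather than a plain character vector because samples can have different numbers of files, which is the same shape the file column takes in the Project object.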

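Let’s create the Project object and see if multiple files are present:

library(pepr)
# The example PEPs shipped with pepr sit in a folder named after the branch
# ("example_peps-master" in the output path below)
branch = "master"
projectConfig1 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable1",
  "project_config.yaml",
  package = "pepr"
)
p1 = Project(projectConfig1)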
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_subtable1/project_config.yaml
# Check the files
p1Samples = sampleTable(p1)
p1Samples$file
#> [[1]]
#> [1] "data/frog1a_data.txt" "data/frog1b_data.txt" "data/frog1c_data.txt"
#> 
#> [[2]]
#> [1] "data/frog2a_data.txt" "data/frog2b_data.txt"
#> 
#> [[3]]
#> [1] "multi"
# Check the subsample names
p1Samples$subsample_name
#> [[1]]
#> [1] "sub_a" "sub_b" "sub_c"
#> 
#> [[2]]
#> [1] "sub_a" "sub_b"
#> 
#> [[3]]
#> NULL

And inspect the whole table in the p1@samples slot:

sample_name	protocol	file	subsample_name
frog_1	anySampleType	c("data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt")	c("sub_a", "sub_b", "sub_c")
frog_2	anySampleType	c("data/frog2a_data.txt", "data/frog2b_data.txt")	c("sub_a", "sub_b")
frog_3	anySampleType	multi	NULL

You can also access a single subsample by calling the getSubsample method with the appropriate combination of sample_name and subsample_name attributes. Note that this is only possible if the subsample_name column is defined in the subsample_table.csv file.

sampleName = "frog_1"
subsampleName = "sub_a"
getSubsample(p1, sampleName, subsampleName)
#>    sample_name      protocol                 file subsample_name
#> 1:      frog_1 anySampleType data/frog1a_data.txt          sub_a
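
Conceptually, this lookup is a filter on the two key columns. The base R sketch below shows that filter on a copy of the subsample table; `pick_subsample` is a hypothetical name used only here, and the sketch does not reproduce what getSubsample does internally (for instance, it does not add the protocol column seen in the output above).

```r
# Copy of the subsample table from this example.
subsample_table <- data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c("data/frog1a_data.txt", "data/frog1b_data.txt",
           "data/frog1c_data.txt", "data/frog2a_data.txt",
           "data/frog2b_data.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pepr): keep the rows whose sample_name
# and subsample_name both match the requested values.
pick_subsample <- function(subsamples, sample_name, subsample_name) {
  keep <- subsamples$sample_name == sample_name &
    subsamples$subsample_name == subsample_name
  subsamples[keep, , drop = FALSE]
}

hit <- pick_subsample(subsample_table, "frog_1", "sub_a")
miss <- pick_subsample(subsample_table, "frog_3", "sub_a")
print(hit)
```

Without a subsample_name column there would be nothing to match the second key against, which is why the column is required for this kind of access.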

Example 2: subannotations and derived attributes

This example uses a subsample_table.csv file and a derived attribute to point to files. This is a rather complex example. Notice we must include the file_id column in the sample_table.csv file and leave it blank; it is then populated from the subsample_table.csv file for just some of the samples (frog_1 and frog_2), and is left empty for the samples that are not merged.

This example is made up of these components:

Project config file:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}_data.txt

Sample annotation table:

sample_name	protocol	identifier	file
frog_1	anySampleType	frog1	local_files
frog_2	anySampleType	frog2	local_files
frog_3	anySampleType	frog3	local_files_unmerged
frog_4	anySampleType	frog4	local_files_unmerged

Sample subannotation table:

sample_name	file_id	subsample_name
frog_1	a	a
frog_1	b	b
frog_1	c	c
frog_2	a	a
frog_2	b	b

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig2 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable2",
"project_config.yaml",
package = "pepr"
)
p2 = Project(projectConfig2)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_subtable2/project_config.yaml
#> Warning in readLines(con): incomplete final line found on '/private/var/folders/
#> 3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/
#> extdata/example_peps-master/example_subtable2/project_config.yaml'
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows to
#> replace 1 rows
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 2 rows to
#> replace 1 rows
# Check the files
p2Samples = sampleTable(p2)
p2Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2a_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4_data.txt"

And inspect the whole table in the p2@samples slot:

sample_name	protocol	identifier	file	file_id	subsample_name
frog_1	anySampleType	frog1	../data/frog1a_data.txt	c("a", "b", "c")	c("a", "b", "c")
frog_2	anySampleType	frog2	../data/frog2a_data.txt	c("a", "b")	c("a", "b")
frog_3	anySampleType	frog3	../data/frog3_data.txt	NULL	NULL
frog_4	anySampleType	frog4	../data/frog4_data.txt	NULL	NULL
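
The derived file values come from filling the `{identifier}` and `{file_id}` placeholders of the source templates in the project config with each sample's attributes. As a rough illustration of that substitution, here is a base R sketch; `fill_template` is a hypothetical helper, not a pepr function, and the sketch says nothing about how pepr itself performs the derivation. It expands the local_files template once per file_id listed for frog_1, and the local_files_unmerged template once for frog_3.

```r
# Hypothetical helper (not part of pepr): replace each "{key}" placeholder
# in the template with the matching length-1 value.
fill_template <- function(template, values) {
  for (key in names(values)) {
    template <- gsub(paste0("{", key, "}"), values[[key]], template,
                     fixed = TRUE)
  }
  template
}

# Templates copied from the project config above.
local_files <- "../data/{identifier}{file_id}_data.txt"
local_files_unmerged <- "../data/{identifier}_data.txt"

# frog_1 has three file_id values in the subsample table.
frog1_paths <- vapply(
  c("a", "b", "c"),
  function(id) fill_template(local_files,
                             list(identifier = "frog1", file_id = id)),
  character(1),
  USE.NAMES = FALSE
)

# frog_3 is not merged, so only the identifier is substituted.
frog3_path <- fill_template(local_files_unmerged, list(identifier = "frog3"))

print(frog1_paths)
print(frog3_path)
```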

Example 3: subannotations and expansion characters

This example gives the exact same results as Example 2, but in this case, uses a wildcard for frog_2 instead of including it in the subsample_table.csv file. Since we can’t use a wildcard and a subannotation for the same sample, this necessitates specifying a second data source class (local_files_unmerged) that uses an asterisk (*). The outcome is the same.

This example is made up of these components:

Project config file:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}*_data.txt

Sample annotation table:

sample_name	protocol	identifier	file	file_id
frog_1	anySampleType	frog1	local_files	NA
frog_2	anySampleType	frog2	local_files_unmerged	NA
frog_3	anySampleType	frog3	local_files_unmerged	NA
frog_4	anySampleType	frog4	local_files_unmerged	NA

Sample subannotation table:

sample_name	file_id
frog_1	a
frog_1	b
frog_1	c

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig3 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable3",
"project_config.yaml",
package = "pepr"
)
p3 = Project(projectConfig3)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_subtable3/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows to
#> replace 1 rows
# Check the files
p3Samples = sampleTable(p3)
p3Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2*_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3*_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4*_data.txt"

And inspect the whole table in the p3@samples slot:

sample_name	protocol	identifier	file	file_id
frog_1	anySampleType	frog1	../data/frog1a_data.txt	c("a", "b", "c")
frog_2	anySampleType	frog2	../data/frog2*_data.txt
frog_3	anySampleType	frog3	../data/frog3*_data.txt
frog_4	anySampleType	frog4	../data/frog4*_data.txt
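
The file values above keep the asterisk verbatim, so the pattern still has to be matched against real files by whatever consumes it. The sketch below assumes that this expansion happens downstream and shows one way to do it with base R's `Sys.glob`. The directory and files are created in a temporary location purely for the demonstration; they are not the example data shipped with pepr.

```r
# Create a throwaway directory with a few empty files to match against.
data_dir <- file.path(tempdir(), "subtable3_demo")
dir.create(data_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  data_dir,
  c("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt")
)))

# Same shape as the derived value "../data/frog2*_data.txt", but pointing
# at the temporary directory.
pattern <- file.path(data_dir, "frog2*_data.txt")

# Expand the wildcard; sort() makes the order independent of the platform.
matches <- sort(basename(Sys.glob(pattern)))
print(matches)
```

A single pattern can therefore stand in for any number of files, which is what lets frog_2 be dropped from the subsample table in this example.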

Example 4: subannotations and multiple (separate-class) inputs

Merging is for same-class inputs (for example, multiple files for read1). Different-class inputs (such as read1 vs read2) are handled by different attributes (columns). This example shows how to handle paired-end data while also merging multiple files within each read.

This example is made up of these components:

Project config file:

   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml

Sample annotation table:

sample_name	protocol
frog_1	anySampleType
frog_2	anySampleType
frog_3	anySampleType
frog_4	anySampleType

Sample subannotation table:

sample_name	read1	read2
frog_1	frog1a_data.txt	frog1a_data2.txt
frog_1	frog1b_data.txt	frog1b_data2.txt
frog_1	frog1c_data.txt	frog1b_data2.txt

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig4 = system.file(
"extdata",
paste0("example_peps-", branch),
"example_subtable4",
"project_config.yaml",
package = "pepr"
)
p4 = Project(projectConfig4)
#> Loading config file: /private/var/folders/3f/0wj7rs2144l9zsgxd3jn5nxc0000gn/T/RtmpEYsaEm/temp_libpath43b174cbd72/pepr/extdata/example_peps-master/example_subtable4/project_config.yaml
# Check the read1 and read2 columns
p4Samples = sampleTable(p4)
p4Samples$read1
#> [[1]]
#> [1] "frog1a_data.txt" "frog1b_data.txt" "frog1c_data.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
p4Samples$read2
#> [[1]]
#> [1] "frog1a_data2.txt" "frog1b_data2.txt" "frog1b_data2.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL

And inspect the whole table in the p4@samples slot:

sample_name	protocol	read1	read2
frog_1	anySampleType	c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt")	c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt")
frog_2	anySampleType	NULL	NULL
frog_3	anySampleType	NULL	NULL
frog_4	anySampleType	NULL	NULL
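
Because read1 and read2 are merged column by column from the same subsample rows, the two lists stay aligned: the i-th read1 file of a sample belongs with its i-th read2 file. The base R sketch below illustrates that property on the table from this example. `collect_column` is a hypothetical helper written for this illustration and is not part of pepr.

```r
sample_names <- c("frog_1", "frog_2", "frog_3", "frog_4")

# Subsample table as shown above: one row per read1/read2 pair.
subsample_table <- data.frame(
  sample_name = rep("frog_1", 3),
  read1 = c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt"),
  read2 = c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pepr): gather one column per sample,
# leaving NULL for samples that have no subsample rows.
collect_column <- function(column) {
  grouped <- split(subsample_table[[column]], subsample_table$sample_name)
  out <- lapply(sample_names, function(name) grouped[[name]])
  names(out) <- sample_names
  out
}

read1 <- collect_column("read1")
read2 <- collect_column("read2")

# Both columns came from the same rows, so the pairs line up by position.
pairs <- data.frame(read1 = read1$frog_1, read2 = read2$frog_1,
                    stringsAsFactors = FALSE)
print(pairs)
```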

Subsample table in pepr

Michal Stolarczyk & Nathan Sheffield

2020-10-16
