Introduction

A pipeline interface tells the pipeline submission engine (such as looper) how to interact with your project and pipelines. It is a YAML file with two sections:

protocol_mapping - maps a sample protocol (the assay type, sometimes called “library” or “library strategy”) to one or more pipeline programs
pipelines - describes the arguments and resources required by each pipeline

Read more about the pipeline interface concept in the looper documentation.

Main features

The examples below illustrate the pipeline interface-related functionality of the BiocProject package.

`bioconductor` section in the pipeline interface

The first advantage of the pipeline interface concept is that the data processing function can be declared in the pipeline interface itself. Since the data processing function is pipeline-specific rather than project-specific, it is more convenient to place the bioconductor section within the pipeline section of the pipeline interface file.

   name: PIPELINE1
   path: pipelines/pipeline1.py
   looper_args: TRUE
   required_input_files: read1
   all_input_files: read1 read2
   ngs_input_files: read1 read2
   arguments:
      --sample-name: sample_name
   outputs:
      output1: pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_1.bw
      output2: pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_2.bw
   bioconductor:
      readFunName: readData
      readFunPath: readData.R
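
The two bioconductor keys name a function and the script that defines it. As a rough illustration of how such a declaration could be resolved in R, the base-R sketch below sources a script into a fresh environment and looks the function up by name. This is an assumption about the general mechanism, not BiocProject's actual implementation; the throwaway script and its one-line `readData` body are made up for the demonstration.

```r
# Values as they would appear after parsing the bioconductor section.
bioconductor <- list(readFunName = "readData", readFunPath = "readData.R")

# Create a throwaway script so the sketch is self-contained (hypothetical content).
scriptDir <- tempfile("piface_")
dir.create(scriptDir)
scriptPath <- file.path(scriptDir, bioconductor$readFunPath)
writeLines('readData <- function(project) paste("processed", project)', scriptPath)

# Source the script into its own environment, then fetch the function by name.
env <- new.env()
source(scriptPath, local = env)
readFun <- get(bioconductor$readFunName, envir = env, mode = "function")

readFun("myProject")
```

Sourcing into a separate environment keeps the script's definitions from overwriting objects in the user's workspace.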

Get output file paths

The outputs section in the pipeline interface file and the outputsByPipeline and outputsByProtocols functions provide convenient access to the list of output file paths that are to be produced. For instance, the pipeline pipeline1.py, with the set of outputs defined above, produces the following set of output files when run on the samples listed below.

sample_name	protocol	data_source	SRR	Sample_geo_accession	read1	read2
sample1	PROTO1	SRA	SRR5210416	GSM2471255	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210416_1.fastq.gz	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210416_2.fastq.gz
sample2	PROTO1	SRA	SRR5210450	GSM2471300	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210450_1.fastq.gz	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210450_2.fastq.gz
sample3	PROTO2	SRA	SRR5210398	GSM2471249	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210398_1.fastq.gz	/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210398_2.fastq.gz

outputsByPipeline(project=p, pipelineName="pipeline1.py")
#> $output1
#> $output1$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_1.bw"
#> 
#> $output1$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_1.bw"
#> 
#> 
#> $output2
#> $output2$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_2.bw"
#> 
#> $output2$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_2.bw"
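
The file names in the listing above follow from the templates in the outputs section: each `{sample.<attribute>}` placeholder is replaced by that sample's attribute value. A minimal base-R sketch of this substitution is shown here; `fillTemplate` is a hypothetical helper written for illustration and is not part of BiocProject.

```r
# Hypothetical helper: replace every "{sample.<attr>}" placeholder in a
# template with the matching value from a named list of sample attributes.
fillTemplate <- function(template, sample) {
  for (attr in names(sample)) {
    placeholder <- paste0("{sample.", attr, "}")
    # fixed = TRUE treats the braces and dots literally, not as a regex
    template <- gsub(placeholder, sample[[attr]], template, fixed = TRUE)
  }
  template
}

# Attributes of sample1 taken from the sample table above.
sample1 <- list(sample_name = "sample1", Sample_geo_accession = "GSM2471255")

fillTemplate(
  "pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_1.bw",
  sample1
)
```

The call returns the relative part of the first path listed for sample1; the project's output directory is prepended separately.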

sample3, which has its protocol attribute set to PROTO2, is not included in the outputs, since pipeline1.py is mapped only to the PROTO1 protocol in the protocol_mapping section of the pipeline interface file:

getProtocolMappings(getPipelineInterfaces(p)[[1]])
#> $PROTO1
#> [1] "pipeline1.py"
#> 
#> $PROTO2
#> [1] "pipeline2.py"
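
The selection logic implied by this mapping can be sketched in base R: find the protocols mapped to a pipeline, then keep the samples that use one of them. The `samplesForPipeline` helper is hypothetical and only mirrors the behaviour described here; the sample table and mapping are copied from the listings above.

```r
# Sample names and protocols from the sample table above.
samples <- data.frame(
  sample_name = c("sample1", "sample2", "sample3"),
  protocol = c("PROTO1", "PROTO1", "PROTO2"),
  stringsAsFactors = FALSE
)

# Protocol mapping as printed by getProtocolMappings().
protocolMapping <- list(PROTO1 = "pipeline1.py", PROTO2 = "pipeline2.py")

# Hypothetical helper: names of the samples a given pipeline would run on.
samplesForPipeline <- function(samples, mapping, pipelineName) {
  # a protocol may map to several pipelines, hence %in% rather than ==
  isMapped <- vapply(mapping, function(p) pipelineName %in% p, logical(1))
  protocols <- names(mapping)[isMapped]
  samples$sample_name[samples$protocol %in% protocols]
}

samplesForPipeline(samples, protocolMapping, "pipeline1.py")
```

With this mapping, only sample1 and sample2 are selected for pipeline1.py, which matches the outputs listed earlier.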

Similarly, the output file paths can be determined for a given protocol or set of protocols:

outputsByProtocols(project=p, protocolNames="PROTO1")
#> [[1]]
#> [[1]]$PROTO1
#> [[1]]$PROTO1$pipeline1.py
#> [[1]]$PROTO1$pipeline1.py$output1
#> [[1]]$PROTO1$pipeline1.py$output1$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_1.bw"
#> 
#> [[1]]$PROTO1$pipeline1.py$output1$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_1.bw"
#> 
#> 
#> [[1]]$PROTO1$pipeline1.py$output2
#> [[1]]$PROTO1$pipeline1.py$output2$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_2.bw"
#> 
#> [[1]]$PROTO1$pipeline1.py$output2$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_2.bw"
#> 
#> 
#> 
#> 
#> 
#> [[2]]
#> [[2]]$PROTO1
#> [[2]]$PROTO1$other_pipeline1.py
#> [[2]]$PROTO1$other_pipeline1.py$output1
#> [[2]]$PROTO1$other_pipeline1.py$output1$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//other_pipeline1/sample1_GSM2471255_1.bw"
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output1$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//other_pipeline1/sample2_GSM2471300_1.bw"
#> 
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output2
#> [[2]]$PROTO1$other_pipeline1.py$output2$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//other_pipeline1/sample1_GSM2471255_2.bw"
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output2$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//other_pipeline1/sample2_GSM2471300_2.bw"

Use case

This functionality provides a convenient way to process the files produced by the pipeline when used in the data processing function indicated in the bioconductor section of the pipeline interface file. The example function below demonstrates the use of the outputsByPipeline function.

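# Read every output file declared for the pipeline into a GRanges object.
# The result is a nested list: output name -> sample name -> GRanges.
readData = function(project, pipName = "pipeline1.py") {
    lapply(outputsByPipeline(project, pipName), function(x) {
        lapply(x, function(x1) {
            message("Reading: ", basename(x1))
            # keep character columns (e.g. chromosome names) as characters
            df = read.table(x1, stringsAsFactors = FALSE)
            # the first three columns are treated as genomic coordinates
            colnames(df)[1:3] = c("chr", "start", "end")
            GenomicRanges::GRanges(df)
        })
    })
}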
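
The nested lapply pattern used in this function can be tried without any pipeline results. The sketch below runs it on a mock list shaped like the printed outputsByPipeline result (output name, then sample name, then path), with shortened paths and `basename` standing in for the real file reader. Both the mock list and the stand-in reader are assumptions made for illustration.

```r
# Mock of the list shape shown earlier: output name -> sample name -> path.
outputs <- list(
  output1 = list(
    sample1 = "results/sample1/pipeline1/sample1_GSM2471255_1.bw",
    sample2 = "results/sample2/pipeline1/sample2_GSM2471300_1.bw"
  ),
  output2 = list(
    sample1 = "results/sample1/pipeline1/sample1_GSM2471255_2.bw",
    sample2 = "results/sample2/pipeline1/sample2_GSM2471300_2.bw"
  )
)

# Outer lapply walks the outputs, inner lapply walks the samples.
# basename() is a stand-in for the function that would read each file.
result <- lapply(outputs, function(perSample) {
  lapply(perSample, function(path) basename(path))
})

str(result)
```

Because lapply preserves names, the result keeps the same output and sample names as the input, so a value can be retrieved with, for example, `result$output2$sample1`.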

Such a link between the project and the outputs (declared in the pipeline interface) makes it possible to read and process the pipeline results with just a line of code:

bp = BiocProject(configFile)
#> Loaded config file: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/project_config.yaml
#> The 'bioconductor' key found in the pipeline interface
#> Reading: sample1_GSM2471255_1.bw
#> Reading: sample2_GSM2471300_1.bw
#> Reading: sample1_GSM2471255_2.bw
#> Reading: sample2_GSM2471300_2.bw
#> Used function 'readData' from the environment

Using a pipeline interface in your project

Michal Stolarczyk

2019-04-30
