Introduction

A pipeline interface tells the pipeline submission engine (such as looper) how to interact with your project and pipelines. It is a YAML file with two sections:

  • protocol_mapping - maps a sample protocol (the assay type, sometimes called “library” or “library strategy”) to one or more pipeline programs
  • pipelines - describes the arguments and resources required by each pipeline

Read more about the pipeline interface concept in the looper documentation.

Main features

The examples below illustrate the pipeline interface-related functionality of the BiocProject package.

bioconductor section in the pipeline interface

The first advantage of the pipeline interface concept is that the data processing function can be declared in the pipeline interface itself. Since the data processing function is pipeline-specific rather than project-specific, it is more convenient to place the bioconductor section within the pipeline section of the pipeline interface file.

   name: PIPELINE1
   path: pipelines/pipeline1.py
   looper_args: TRUE
   required_input_files: read1
   all_input_files: read1 read2
   ngs_input_files: read1 read2
   arguments:
      --sample-name: sample_name
   outputs:
      output1: pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_1.bw
      output2: pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_2.bw
   bioconductor:
      readFunName: readData
      readFunPath: readData.R
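
To illustrate how an entry in the outputs section turns into a concrete file name, here is a minimal base-R sketch of the placeholder substitution. The helper fillTemplate is hypothetical and written only for this illustration; it is not part of BiocProject or looper, whose actual implementation may differ.

```r
# Hypothetical helper (not part of BiocProject): replace every
# "{sample.<attribute>}" placeholder with the matching sample attribute.
fillTemplate <- function(template, sample) {
  matches <- gregexpr("\\{sample\\.[A-Za-z0-9_]+\\}", template)
  placeholders <- unique(regmatches(template, matches)[[1]])
  for (ph in placeholders) {
    # "{sample.sample_name}" -> "sample_name"
    attrName <- sub("^\\{sample\\.", "", sub("\\}$", "", ph))
    value <- sample[[attrName]]
    if (is.null(value)) stop("Missing sample attribute: ", attrName)
    template <- gsub(ph, value, template, fixed = TRUE)
  }
  template
}

# Attributes of sample1, taken from the sample table used in this document
sample1 <- list(sample_name = "sample1", Sample_geo_accession = "GSM2471255")
template <- "pipeline1/{sample.sample_name}_{sample.Sample_geo_accession}_1.bw"

filled <- fillTemplate(template, sample1)
print(filled)
```

For sample1 this yields pipeline1/sample1_GSM2471255_1.bw, the path relative to the sample's results directory.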

Get output file paths

The outputs section in the pipeline interface file, together with the outputsByPipeline and outputsByProtocols functions, provides convenient access to the list of output file paths that are to be produced. For instance, the pipeline pipeline1.py, with the set of outputs defined above, produces the following set of output files when run on the samples listed below.

sample_name | protocol | data_source | SRR        | Sample_geo_accession | read1 | read2
sample1     | PROTO1   | SRA         | SRR5210416 | GSM2471255           | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210416_1.fastq.gz | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210416_2.fastq.gz
sample2     | PROTO1   | SRA         | SRR5210450 | GSM2471300           | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210450_1.fastq.gz | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210450_2.fastq.gz
sample3     | PROTO2   | SRA         | SRR5210398 | GSM2471249           | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210398_1.fastq.gz | /Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/SRR5210398_2.fastq.gz

sample3, which has its protocol attribute set to PROTO2, is not included in the outputs, since pipeline1.py is mapped only to the PROTO1 protocol in the protocol_mapping section of the pipeline interface file.
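
This protocol-based selection can be sketched in base R as follows. The helper samplesForPipeline and the hand-written protocolMapping list are assumptions made for illustration; only the PROTO1 entry is included, since it is the only mapping this document shows. BiocProject reads this information from the pipeline interface file itself.

```r
# Sample annotation reduced to the two columns that matter here
samples <- data.frame(
  sample_name = c("sample1", "sample2", "sample3"),
  protocol = c("PROTO1", "PROTO1", "PROTO2"),
  stringsAsFactors = FALSE
)

# Assumed stand-in for the protocol_mapping section (protocol -> pipelines)
protocolMapping <- list(PROTO1 = c("pipeline1.py", "other_pipeline1.py"))

# Hypothetical helper: names of the samples whose protocol maps to a pipeline
samplesForPipeline <- function(samples, protocolMapping, pipName) {
  mapped <- vapply(protocolMapping, function(pips) pipName %in% pips, logical(1))
  protocols <- names(protocolMapping)[mapped]
  samples$sample_name[samples$protocol %in% protocols]
}

selected <- samplesForPipeline(samples, protocolMapping, "pipeline1.py")
print(selected)
```

Only sample1 and sample2 are selected; sample3 is dropped because its protocol has no entry that lists pipeline1.py.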

Similarly, the output file paths can be determined for a given protocol or set of protocols, for example:

outputsByProtocols(project=p, protocolNames="PROTO1")
#> [[1]]
#> [[1]]$PROTO1
#> [[1]]$PROTO1$pipeline1.py
#> [[1]]$PROTO1$pipeline1.py$output1
#> [[1]]$PROTO1$pipeline1.py$output1$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_1.bw"
#> 
#> [[1]]$PROTO1$pipeline1.py$output1$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_1.bw"
#> 
#> 
#> [[1]]$PROTO1$pipeline1.py$output2
#> [[1]]$PROTO1$pipeline1.py$output2$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//pipeline1/sample1_GSM2471255_2.bw"
#> 
#> [[1]]$PROTO1$pipeline1.py$output2$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//pipeline1/sample2_GSM2471300_2.bw"
#> 
#> 
#> 
#> 
#> 
#> [[2]]
#> [[2]]$PROTO1
#> [[2]]$PROTO1$other_pipeline1.py
#> [[2]]$PROTO1$other_pipeline1.py$output1
#> [[2]]$PROTO1$other_pipeline1.py$output1$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//other_pipeline1/sample1_GSM2471255_1.bw"
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output1$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//other_pipeline1/sample2_GSM2471300_1.bw"
#> 
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output2
#> [[2]]$PROTO1$other_pipeline1.py$output2$sample1
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample1//other_pipeline1/sample1_GSM2471255_2.bw"
#> 
#> [[2]]$PROTO1$other_pipeline1.py$output2$sample2
#> [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BiocProject/extdata/example_peps-master/example_piface/../output/results_pipeline/sample2//other_pipeline1/sample2_GSM2471300_2.bw"
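
The result is a deeply nested list, indexed by protocol, pipeline, output name and sample name. When a flat vector of paths is easier to work with, base R's unlist can collapse it. The sketch below uses a small mock list of the same shape with shortened relative paths, not the actual return value.

```r
# Mock object mimicking the shape printed above:
# protocol -> pipeline -> output -> sample (paths shortened for readability)
outputs <- list(
  list(PROTO1 = list(pipeline1.py = list(
    output1 = list(
      sample1 = "pipeline1/sample1_GSM2471255_1.bw",
      sample2 = "pipeline1/sample2_GSM2471300_1.bw"
    )
  )))
)

# Direct access to a single path by walking the levels
first <- outputs[[1]]$PROTO1$pipeline1.py$output1$sample1

# Collapse all levels into one character vector of paths
allPaths <- unlist(outputs)
print(allPaths)
```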

Use case

This functionality provides a convenient way to process the files produced by the pipeline when it is used in the data processing function indicated in the bioconductor section of the pipeline interface file. The example function below demonstrates the use of the outputsByPipeline function.

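# Read every output file of a pipeline into a GRanges object.
# Named readData to match readFunName in the bioconductor section.
readData <- function(project, pipName = "pipeline1.py") {
    # Outer loop: one element per declared output; inner loop: one path per sample
    lapply(outputsByPipeline(project, pipName), function(x) {
        lapply(x, function(x1) {
            message("Reading: ", basename(x1))
            df <- read.table(x1, stringsAsFactors = FALSE)
            # The first three columns hold the genomic coordinates
            colnames(df)[1:3] <- c("chr", "start", "end")
            GenomicRanges::GRanges(df)
        })
    })
}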
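
The nested lapply pattern used in this function can be tried without a project at hand. The sketch below is a stand-in built on assumptions: it writes two small tab-separated files with made-up intervals to a temporary directory, builds a mock two-level list (output name, then sample name) like the one the function iterates over, and reads each file into a data frame. It stops short of the GenomicRanges::GRanges conversion so that it runs with base R only.

```r
# Write two tiny BED-like files to a temporary directory (made-up intervals)
tmpDir <- tempfile("piface_demo_")
dir.create(tmpDir)
paths <- file.path(tmpDir, c("sample1_1.txt", "sample2_1.txt"))
for (p in paths) {
  write.table(
    data.frame(c("chr1", "chr2"), c(100L, 200L), c(150L, 260L)),
    file = p, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
  )
}

# Mock of the two-level list being traversed: output -> sample -> path
mockOutputs <- list(output1 = list(sample1 = paths[1], sample2 = paths[2]))

# Same traversal as in the function above, without the GRanges step
result <- lapply(mockOutputs, function(x) {
  lapply(x, function(x1) {
    message("Reading: ", basename(x1))
    df <- read.table(x1, stringsAsFactors = FALSE)
    colnames(df)[1:3] <- c("chr", "start", "end")
    df
  })
})

str(result$output1$sample1)
```

The returned object keeps the output and sample names of the input list, so an individual table can be addressed as result$output1$sample1.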

Such a link between the project and the outputs (declared in the pipeline interface) makes it possible to read and process the pipeline results with just a line of code: