This loading profile can show enrichment of genomic signals with high loading values in region set but not in surrounding genome, suggesting that variation is linked specifically to that region set.

getLoadingProfile(loadingMat, signalCoord, regionSet,
  PCsToAnnotate = c("PC1", "PC2"), binNum = 25, verbose = TRUE)

Arguments

loadingMat

matrix of loadings (the coefficients of the linear combination that defines each PC). One named column for each PC. One row for each original dimension/variable (should be same order as original data/signalCoord). Given by prcomp(x)$rotation.

signalCoord

a GRanges object or data frame with coordinates for the genomic signal/original data (eg DNA methylation) included in the PCA. Coordinates should be in the same order as the original data and the loadings (each item/row in signalCoord corresponds to a row in loadingMat). If a data.frame, must have chr and start columns. If end is included, start and end should be the same. Start coordinate will be used for calculations.

regionSet

A genomic ranges (GRanges) object with regions corresponding to the same biological annotation. Must be from the same reference genome as the coordinates for the actual data/samples (signalCoord).

PCsToAnnotate

A character vector with principal components to include. eg c("PC1", "PC2") These should be column names of loadingMat.

binNum

Number of bins to split each region into when making the aggregate loading profile. More bins will give a higher resolution but perhaps more noisy profile.

verbose

A "logical" object. Whether progress of the function should be shown, one bar indicates the region set is completed. Useful when using `lapply` to get the loading profiles of many region sets.

Value

A data.frame with the binned loading profile, one row per bin. columns: binID and one column for each PC in PCsToAnnotate.

Details

All regions in a given region set are combined into a single aggregate profile. Regions should be expanded on each side to include a wider area of the genome around the regions of interest (see example and vignettes). To make the profile, first we take the absolute value of the loadings. Then each region is split into `binNum` bins. All loadings in each bin are averaged to get one value per bin. Finally, corresponding bins from the different regions are averaged (eg all bin1's averaged with each other, all bin2's averaged with each other, etc.) to get a single "meta-region" loading profile. Since DNA strand information is not considered, the profile is averaged symmetrically around the center. A peak in the middle of this profile suggests that variability is specific to the region set of interest and is not a product of the surrounding genome. A region set can still be significant even if it does not have a peak. For example, some histone modification region sets may be in large genomic blocks and not show a peak, despite having variation across samples.

Examples

data("brcaMCoord1") data("brcaLoadings1") data("esr1_chr1") esr1_chr1_expanded <- resize(esr1_chr1, 14000, fix="center") getLoadingProfile(loadingMat=brcaLoadings1, signalCoord=brcaMCoord1, regionSet=esr1_chr1_expanded, PCsToAnnotate=c("PC1", "PC2"), binNum=25)
#> binID PC1 PC2 #> 1 1 0.001579187 0.0012772220 #> 2 2 0.001419394 0.0012085044 #> 3 3 0.001246910 0.0009932502 #> 4 4 0.001268014 0.0011317951 #> 5 5 0.001308523 0.0012141147 #> 6 6 0.001167137 0.0011503114 #> 7 7 0.001176674 0.0009985293 #> 8 8 0.001137848 0.0008553383 #> 9 9 0.001098797 0.0008743427 #> 10 10 0.001370241 0.0009616829 #> 11 11 0.001208272 0.0009799370 #> 12 12 0.001513095 0.0009693535 #> 13 13 0.002290633 0.0008470308 #> 14 14 0.001513095 0.0009693535 #> 15 15 0.001208272 0.0009799370 #> 16 16 0.001370241 0.0009616829 #> 17 17 0.001098797 0.0008743427 #> 18 18 0.001137848 0.0008553383 #> 19 19 0.001176674 0.0009985293 #> 20 20 0.001167137 0.0011503114 #> 21 21 0.001308523 0.0012141147 #> 22 22 0.001268014 0.0011317951 #> 23 23 0.001246910 0.0009932502 #> 24 24 0.001419394 0.0012085044 #> 25 25 0.001579187 0.0012772220