Visualize how individual regions are associated with target variables

Visualize how strongly each region in a region set is associated with each target variable. For each target variable (each entry of `signalCol`), the average (absolute) signal value is calculated for each region in the region set. Then, for a given target variable, each region's average signal is converted to a percentile/quantile based on the distribution of all individual signal values for that target variable. These percentiles are plotted in a heatmap.
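The two steps described above (per-region averaging, then conversion to a percentile) can be sketched in base R. This is only an illustration under stated assumptions, not the function's actual implementation: the toy data, the helper variable names, the rule that a feature belongs to a region when its start coordinate falls inside it, and the use of `ecdf()` for the percentile are all assumptions made for this example.

```r
# Toy inputs (hypothetical values, not from any real dataset)
set.seed(1)
signalCoord <- data.frame(chr = "chr1", start = seq(100, 2000, by = 100))
signal <- matrix(rnorm(nrow(signalCoord) * 2), ncol = 2,
                 dimnames = list(NULL, c("PC1", "PC2")))
regions <- data.frame(chr = "chr1",
                      start = c(50, 950, 1550),
                      end = c(450, 1250, 2050))

absSignal <- abs(signal)  # mirrors absVal = TRUE

# Step 1: average the absolute signal of the features inside each region
# (assumption: a feature is "inside" when its start lies within the region)
regionAverage <- t(vapply(seq_len(nrow(regions)), function(i) {
    inRegion <- signalCoord$chr == regions$chr[i] &
        signalCoord$start >= regions$start[i] &
        signalCoord$start <= regions$end[i]
    colMeans(absSignal[inRegion, , drop = FALSE])
}, numeric(ncol(absSignal))))

# Step 2: convert each region average to a quantile of all individual
# absolute signal values for the same target variable (empirical CDF)
regionQuantile <- vapply(colnames(absSignal), function(pc) {
    ecdf(absSignal[, pc])(regionAverage[, pc])
}, numeric(nrow(regionAverage)))

regionQuantile  # rows are regions, columns are target variables
```

Each entry of `regionQuantile` lies between 0 and 1; a matrix of this shape is what the heatmap colors represent.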

regionQuantileByTargetVar(
  signal,
  signalCoord,
  regionSet,
  rsName = "",
  signalCol = paste0("PC", 1:5),
  maxRegionsToPlot = 8000,
  cluster_rows = TRUE,
  row_title = "Region",
  column_title = rsName,
  column_title_side = "top",
  cluster_columns = FALSE,
  name = "Percentile of Loading Scores in PC",
  col = c("skyblue", "yellow"),
  absVal = TRUE,
  ...
)

Arguments

signal	Matrix of feature contribution scores (the contribution of each epigenetic feature to each target variable). One named column for each target variable. One row for each original epigenetic feature (rows should be in the same order as the original data/signalCoord). As an unsupervised example, if PCA was done on epigenetic data and the goal was to find region sets associated with the principal components, you could use the `rotation` element of the prcomp() output (e.g. prcomp(t(epigeneticData))$rotation when features are in rows) as the feature contribution scores/`signal` parameter.
signalCoord	A GRanges object or data frame with coordinates for the genomic signal/original epigenetic data. Coordinates should be in the same order as the original data and the feature contribution scores (each item/row in signalCoord corresponds to a row in signal). If a data.frame, must have chr and start columns (optionally can have end column, depending on the epigenetic data type).
regionSet	A genomic ranges (GRanges) object with regions corresponding to the same biological annotation. Must be from the same reference genome as the coordinates for the actual data/samples (signalCoord). The regions that will be visualized.
rsName	Character. Name of the region set. For use as a title for the heatmap.
signalCol	A character vector with the names of the sample variables of interest/target variables (e.g. PCs or sample phenotypes).
maxRegionsToPlot	Maximum number of regions from the region set to include in the heatmap. Including too many may slow down computation and increase memory use. If regionSet has more regions than maxRegionsToPlot, a number of regions equal to maxRegionsToPlot will be randomly sampled from the region set and these regions will be plotted. Clustering rows is a major limiting factor on how long it takes to plot the regions, so if you want to plot many regions, you can also set cluster_rows to FALSE.
cluster_rows	Logical object, whether to cluster rows or not (may increase computation time significantly for a large number of rows)
row_title	Character object, row title
column_title	Character object, column title
column_title_side	Character object, where to put the column title: "top" or "bottom"
cluster_columns	Logical object, whether to cluster columns. It is recommended to keep this as FALSE so it will be easier to compare target variables that have a certain order such as PCs (with cluster_columns = FALSE, they will be in the same specified order in different heatmaps)
name	Character object, legend title
col	A vector of colors or a color mapping function which will be passed to the ComplexHeatmap::Heatmap() function. See ?Heatmap (the "col" parameter) for more details.
absVal	Logical. If TRUE, take the absolute value of the values in signal. Choose TRUE if you think some genomic loci in a region set may increase while others decrease (that is, if regions within a region set may be anticorrelated). Choose FALSE if you expect the regions in a given region set to all change in the same direction (all be positively correlated with each other).
...	Optional parameters for ComplexHeatmap::Heatmap()
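
As a rough illustration of how the `signal` and `signalCoord` arguments relate to each other, the sketch below builds both from simulated data. The matrix `epiData`, its dimensions, and the coordinate values are hypothetical and exist only for this example; the point is that `signal` must have one row per row of `signalCoord`, in the same order, and must contain a column for every name in `signalCol`.

```r
set.seed(42)
# Hypothetical epigenetic data: 50 features (rows) x 6 samples (columns)
nFeature <- 50
nSample <- 6
epiData <- matrix(runif(nFeature * nSample), nrow = nFeature,
                  dimnames = list(NULL, paste0("sample", seq_len(nSample))))

# Coordinates: one row per feature, in the same order as the rows of epiData
signalCoord <- data.frame(chr = "chr1",
                          start = seq(1000, by = 500, length.out = nFeature))

# prcomp() expects samples in rows, so transpose first;
# the rotation matrix then has one row per feature and one column per PC
pca <- prcomp(t(epiData))
signal <- pca$rotation

# Sanity checks before plotting: matching rows and available target variables
signalCol <- paste0("PC", 1:5)
stopifnot(nrow(signal) == nrow(signalCoord),
          all(signalCol %in% colnames(signal)))
```

Note that PCA yields at most as many components as there are samples, so with very few samples the default `signalCol = paste0("PC", 1:5)` may name columns that do not exist in `signal`.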

Value

A heatmap. Columns are the target variables in signalCol, rows are regions. This heatmap allows you to see if some regions are associated with certain target variables but not others. You can also see if one subset of regions in the region set is associated with target variables while another subset is not associated with any target variables. To color each region, first the (absolute) signal values within that region are averaged. Then this average is compared to the distribution of all (absolute) individual signal values for the given target variable to get a quantile/percentile for that region. Colors are based on this quantile/percentile. The output is a Heatmap object (ComplexHeatmap package).
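
The role of the absolute value in the averaging step can be seen with a small numeric sketch. The loadings below are made-up values for five features inside a single hypothetical region, chosen so that some features increase with the target variable while others decrease.

```r
# Hypothetical loadings for five features inside one region
loadings <- c(0.8, -0.7, 0.9, -0.85, -0.15)

# Signed average (in the spirit of absVal = FALSE): opposite signs cancel
signedMean <- mean(loadings)

# Absolute average (in the spirit of absVal = TRUE): magnitudes are kept
absoluteMean <- mean(abs(loadings))

signedMean
absoluteMean
```

Here the signed average is essentially zero while the absolute average is about 0.68, so a region with anticorrelated features would look unassociated with the target variable unless absolute values are used.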

Examples

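# regionQuantileByTargetVar() and the example datasets come from the COCOA package
library(COCOA)
data("brcaATACCoord1")
data("brcaATACData1")
data("esr1_chr1")
# PCA with samples as rows; the rotation matrix has one row per feature
featureContributionScores <- prcomp(t(brcaATACData1))$rotation
# Define the region set name once so it can be reused for the column title
rsName <- "Estrogen Receptor Chr1"
regionByPCHM <- regionQuantileByTargetVar(signal = featureContributionScores,
                                   signalCoord = brcaATACCoord1,
                                   regionSet = esr1_chr1,
                                   rsName = rsName,
                                   signalCol = paste0("PC", 1:2),
                                   maxRegionsToPlot = 8000,
                                   cluster_rows = TRUE,
                                   cluster_columns = FALSE,
                                   column_title = rsName,
                                   name = "Percentile of Loading Scores in PC")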