Introduction.html 49.8 KB
 1 2 3 4 5 6 7   Christian Arnold committed Dec 14, 2021 8 Introduction and Methodological Details • GRaNIE  9 10 11 12 13 14   Christian Arnold committed Dec 14, 2021 15   16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 
 Christian Arnold committed Dec 18, 2021 87 
 88 89 90 

Abstract

 Christian Arnold committed Dec 14, 2021 103 

This vignette introduces the GRaNIE package and explains the main features, methods and necessary background.

 104 105 
 Christian Arnold committed Dec 18, 2021 106 107 108 

Motivation and Necessity

 109 110 111   Christian Arnold committed Dec 14, 2021 112   Christian Arnold committed May 25, 2021 113   114 115   Christian Arnold committed Dec 14, 2021 116 

Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have celltype specific activity. This TF activity can be quantified with existing tools such as diffTF and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a gene regulatory network (GRN) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a GRN using bulk RNAseq and open chromatin (e.g., using ATACseq or ChIPseq for open chromatin marks) and optionally TF activity data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which is connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using a permutation-based approach.

 Christian Arnold committed Jun 01, 2021 117 

In summary, we present a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types.

 118 
 Christian Arnold committed Dec 18, 2021 119 120 121 

Installation and Example Workflow

 Christian Arnold committed Dec 14, 2021 122 

Please see the quick start vignette for how to install our GRaNIE package(s) and the workflow vignette for an example workflow.

 123 
 Christian Arnold committed Dec 18, 2021 124 125 126 

Input

 Christian Arnold committed May 25, 2021 127 

In our GRN approach, we integrate multiple data modalities. Here, we describe them in detail and their required format.

 Christian Arnold committed Dec 18, 2021 128 129 130 

Open chromatin and RNA-seq data

 131 132 133 134 135 136 137 138 139 

Open chromatin data may come from ATAC-seq, DNAse-seq or ChIP-seq data for particular histone modifications that associate with open chromatin such as histone acetylation (e.g., H3K27ac). They all capture open chromatin either directly or indirectly, and while we primarily tested and used ATAC-seq while developing the package, the others should also be applicable for our framework. From here on, we will refer to these regions simply as peaks.

For RNA-seq, the data represent expression counts per gene across samples.

Here is a quick graphical representation which format is required to be compatible with our framework:

• columns are samples, rows are peaks / genes
• column names are required while rownames are ignored
• except for one ID columns, all other columns must be numeric and represent counts per sample
• ID column: for peaks
 Christian Arnold committed May 25, 2021 140 
• The name of the ID column can be anything and can be specific later in the pipeline. For peaks, we usually use peakID while for RNA-seq, we use EnsemblID  141 
•  Christian Arnold committed Dec 18, 2021 142 
• for peaks, the required format is “chr:start-end”, with chr denoting the chromosome, followed by :, and then start, -, and end for the peak start and end, respectively.
•  143 144 
•  Christian Arnold committed May 25, 2021 145 146 
• counts should be raw if possible (that is, integers), but we also support pre-normalized data. See here for more information.
•  Christian Arnold committed Dec 14, 2021 147 
• peak and RNA-seq data may contain a distinct set of samples, with some samples overlapping but others not. This is no issue and as long as some samples are found in both of them, the GRaNIE pipeline can work with it. Note that only the shared samples between both data modalities are kept, however, so make sure that the sample names match between them and share as many samples as possible. See the methods part for guidelines on how many samples we recommend.
•  148 
 Christian Arnold committed May 25, 2021 149 150 

Note that peaks must not overlap. If they do, an informative error message is thrown and the user is requested to modify the peak input data so that no overlaps exist among all peaks. This can be done by either merging overlapping peaks or deleting those that overlap with other peaks based on other criteria such as peak signal, by keeping only the strongest peak, for example.

For guidelines on how many peaks are necessary or recommended, see the section below.

 151 
 Christian Arnold committed Dec 18, 2021 152 153 154 

TF and TFBS data

 Christian Arnold committed May 25, 2021 155 156 

TF and TFBS data is mandatory as input. Specifically, the package requires a bed file per TF with TF binding sites (TFBS). TFBS can either be in-silico predicted, or experimentally verified, as long as genome-wide TFBS can be used. For convenience and orientation, we provide TFBS predictions for HOCOMOCO-based TF motifs that were used with PWMScan for hg19, hg38 and mm10. Check the workflow vignette for an example.

However, you may also use your own TFBS data, and we provide full flexibility in doing so. Only some manual preparation is necessary. Briefly, if you decide to use your own TFBS data, you have to prepare the following:

 157 
 Christian Arnold committed May 25, 2021 158 159 160 
• a folder that contains one TFBS file per TF in bed format, 6 columns
• file names must be {TF}{suffix}.{fileEnding}, where {TF} specifies the name of the TF, {suffix} an optional and arbitrary string (we use _TFBS, for example), and {fileEnding} the file type (supported are bed and bed.gz).
• the folder must also contain a so-called translation table
•  161 
 Christian Arnold committed May 25, 2021 162 

For more methodological details, details on how to construct these files, their exact format etc we refer to diffTF paper for details.

 163 
 Christian Arnold committed Dec 18, 2021 164 165 166 

Sample metadata (optional but highly recommended)

 Christian Arnold committed Dec 14, 2021 167 

Providing sample metadata is optional, but highly recommended - if available, the sample metadata is integrated into the PCA plots to understand where the variation in the data comes from and whether any of the metadata (e.g., age, sex, sequencing batch) is associated with the PCs from a PC, indicating a batch effect that needs to be addressed before running the GRaNIE pipeline.

 Christian Arnold committed Jun 01, 2021 168 169 

 Christian Arnold committed Dec 18, 2021 170 171 172 

Hi-C data (optional)

 Christian Arnold committed May 25, 2021 173 174 175 176 

Integration of Hi-C data is optional and serves as alternative to identifying peak-gene pairs to test for correlation based on a predefined and fixed neighborhood size (see Methods).

If Hi-C data are available, the pipeline expects a BED file format with at least 3 columns: chromosome name, start, and end. An ID column is optional and assumed to be in the 4th column, all additional columns are ignored.

For more details, see the R help (?addConnections_peak_gene) and the Methods.

 Christian Arnold committed Dec 18, 2021 177 178 179 

SNP data (optional, coming soon)

 180 181 182 

We also plan to integrate SNP data soon, stay tuned!

 Christian Arnold committed Dec 18, 2021 183 184 185 

Methodological Details and Basic Mode of Action

 186 

In this section, we give methodological details and guidelines.

 Christian Arnold committed Dec 18, 2021 187 188 189 

Data normalization

 Christian Arnold committed Jun 01, 2021 190 

An important consideration is data normalization for RNA and open chromatin data. We currently support three choices of normalization of either peak or RNA-Seq data: quantile, DESeq_sizeFactor and none and refer to the R help for more details (?addData). The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it is DESeq_sizeFactor (i.e., a “regular” DESeq size factor normalization). Importantly, DESeq_sizeFactor requires raw data, while quantile does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data as input and then topping this up with another normalization method or “none”.

 Christian Arnold committed Dec 14, 2021 191 

While we recommend raw counts for both peaks and RNA-Seq as input and offer several normalization choices in the pipeline, it is also possible to provide pre-normalized data. Note that the normalization method may have a large influence on the resulting eGRN network, so make sure the choice of normalization is reasonable. For more details, see the next sections.

 192 
 Christian Arnold committed Dec 18, 2021 193 194 195 

Permutations

 Christian Arnold committed Jun 01, 2021 196 197 

RNA-Seq is shuffled, this is permutation 1. TODO: More

 Christian Arnold committed Dec 18, 2021 198 199 200 201 202 203 

TF-peak connections

 Christian Arnold committed Jun 01, 2021 204 205 

TODO: Describe hoe we establish TF-peak links

 Christian Arnold committed Dec 18, 2021 206 207 208 

TF Activity connections

 Christian Arnold committed Jun 01, 2021 209 

As explained above, TF-peak connections are found by correlation TF expression with peak accessibility. In addition to expression, we also offer to identify statistically significant TF-peak links based on TF Activity and not expression of the TFs. The concept of TF Activity is described in more detail in our diffTF paper. In short, we define TF motif activity, or TF activity for short, as the effect of a TF on the state of chromatin as measured by chromatin accessibility or active chromatin marks (i.e., ATAC-seq, DNase sequencing [DNase-seq], or histone H3 lysine 27 acetylation [H3K27ac] ChIP-seq). A TF Activity score is therefore needed for each TF and each sample.

 Christian Arnold committed Dec 14, 2021 210 

TF Activity information can either be calculated within the GRaNIE framework using a simplified and empirical approach) or it can be calculated outside of our framework using designated methods and then imported into our framework. We now describe these two choices in more detail.

 Christian Arnold committed Dec 18, 2021 211 212 213 

Calculating TF Activity

 Christian Arnold committed Dec 14, 2021 214 

In our GRaNIE approach, we empirically estimate TF Activity for each TF with the following approach:

 Christian Arnold committed Jun 01, 2021 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 
• normalize the raw peak counts by one of the supported normalization methods (see below)
• from the TF-peak accessibility matrix as calculated before, identify the subset of peaks with a TFBS overlap for the particular TF based on the user-provided TFBS data
• scaling and centering of the normalized accessibility scores per row (i.e., peak) so that row means are close to 0 for each peak
• the column means (i.e., sample means) from the scaled and centered counts are then taken as approximation for the TF Activity

By default, we currently offer the four different types of normalizing the raw data for calculating TF Activity. Options 2 to 4 are described in more detail in the section Data normalization above, while option 1 is currently only available for TF Activity and therefore explained below (this may change in the future):

1. Cyclic LOESS normalization (default)
Local regression (LOESS) is a commonly used approach for fitting flexible non-linear functions, which involves computing many local linear regression fits and combining them. Briefly, a normalization factor is derived per gene and sample using the normOffsets function of the csaw package in R as opposed to using one size factor for each sample only as with the regular size factor normalization in DeSeq. For each sample, a LOWESS (Locally Weighted Scatterplot Smoothing) curve is fitted to the log-counts against the log-average count. The fitted value for each bin pair is used as the generalized linear model offset for that sample. The use of the average count provides more stability than the average log-count when low counts are present. For more details, see the csaw package in R and the normOffsets methods therein.
2. Standard size factor normalization from DeSeq
3. Quantile normalization
4. No normalization
 Christian Arnold committed Dec 18, 2021 230 231 232 

Importing TF Activity

 Christian Arnold committed Jun 01, 2021 233 234 

Soon, it will also be possible to import TF Activity data into our framework as opposed to calculating it using the procedure as described above. This feature is currently in development and will be available soon.

 Christian Arnold committed Dec 18, 2021 235 236 237 

 Christian Arnold committed Jun 01, 2021 238 239 240 241 

Once TF Activity data is available, finding TF-peak links and assessing their significance is then done in complete analogy as for the TF expression data - just the input data is different (TF Activity as opposed to TF expression). The so-called connection type - expression or TF Activity, is stored in the GRN object and output tables and therefore allows to tailor and filter the resulting network accordingly. All output PDFs also contain the information whether a TF-peak link has been established via the TF expression or TF Activity.

 Christian Arnold committed Dec 18, 2021 242 243 244 

Peak-gene associations

 Christian Arnold committed Jun 01, 2021 245 

We offer two options of where in the gene the overlap with the extended peak may occur: at the 5’ end of the gene (the default) or anywhere in the gene. For more information see the R help (?addConnections_peak_gene and the parameter overlapTypeGene in particular)

 Christian Arnold committed Dec 18, 2021 246 247 248 

Two approaches for identifying peak-gene pairs to test for correlation

 Christian Arnold committed Jun 01, 2021 249 250 

We offer two options to decide which peak-gene pairs to test for correlation: in absence of additional topologically associating domain (TADs) data from Hi-C or similar approaches, the pipeline used a local neighborhood-based approach with a custom neighborhood size (default: 250 kb up- and downstream of the peak) to select peak-gene pairs to test. In the presence of TAD data, all peak-gene pairs within a TAD are tested, while peaks located outside of any TAD domain are ignored. The user has furthermore the choice to specify whether overlapping TADs should be merged or not.

 251 
 Christian Arnold committed May 25, 2021 252 
 Christian Arnold committed Dec 18, 2021 253 254 255 

Guidelines, Recommendations, Limitations, Scope

 Christian Arnold committed May 25, 2021 256 

In this section, we provide a few guidelines and recommendations that may be helpful for your analysis.

 Christian Arnold committed Dec 18, 2021 257 258 259 

Package scope

 Christian Arnold committed Dec 14, 2021 260 

In this section, we want explicitly mention the designated scope of the GRaNIE package, its limitations and additional / companion packages that may be used subsequently or beforehand.

 Christian Arnold committed Jun 01, 2021 261 262 

Coming soon.

 Christian Arnold committed Dec 18, 2021 263 264 265 

Transcription factor binding sites (TFBS)

 Christian Arnold committed Dec 14, 2021 266 

TFBS are a crucial input for any GRaNIE analysis. Our GRaNIE approach is very agnostic as to how these files are generated - as long as one BED file per TF is provided with TFBS positions, the TF can be integrated.As explained above, we usually work with TFBS as predicted by PWMScan based on HOCOMOCO TF motifs, while in-silico predicted TFBS are by no means a requirement of the pipeline. Instead, JASPAR TFBS or TFBS from any other database can also be used. The total number of TF and TFBS per TF seems more relevant here, due to the way we integrate TFBS: We create a binary 0/1 overlap matrix for each peak and TF, with 0 indicating that no TFBS for a particular TF overlaps with a particular peak, while 1 indicates that at least 1 TFBS from the TFBS input data does indeed overlap with the particular peak by at least 1 bp. Thus, having more TFBS in general also increases the number of 1s and therefore the foreground of the TF (see the diagnostic plots) while it makes the foreground also more noisy if the TFBS list contains too many false positives. As always in biology, this is a trade-off.

 267 
 Christian Arnold committed Dec 18, 2021 268 269 270 

Peaks

 Christian Arnold committed May 25, 2021 271 272 

The number of peaks that is provided as input matters greatly for the resulting GRN and its connectivity. From our experience, this number should be in a reasonable range so that there is enough data to build a GRN, but also not so many that the whole pipeline runs unnecessarily long. We have good experience with the number of peaks ranging between 50,000 and 200,000 or so, although these are not hard thresholds but rather recommendations.

With respect to the recommended width of the peaks, we usually use peaks that have a width of a couple of hundred base pairs until a few kb, while the default is to filter peaks if they are wider than 10,000 bp (parameter maxSize_peaks in the function filterData). Remember: peaks are used to overlap them with TFBS, so if a particular peak is too narrow, the likelihood of not overlapping with any (predicted) TFBS from any TF increases, and such a peak is subsequently essentially ignored.

 273 
 Christian Arnold committed Dec 18, 2021 274 275 276 

RNA-Seq

 Christian Arnold committed May 25, 2021 277 278 279 280 281 282 

The following list is subject to change and provides some rough guidelines for the RNA-Seq data:

1. We recommend using raw counts if possible, and checking carefully in a PCA whether any batch effects are visible.
2. Genes with very small counts across samples are advised to be removed by running the function filterData, see the argument minNormalizedMeanRNA for more information. You may want to check beforehand how many gens have a row mean of >1. This number is usually in the tens of thousands.
3. At the moment, we did not properly test our framework for single-cell RNA-Seq data, and therefore cannot provide support for it. Thus, use regular bulk data until we advanced with the single-cell applicability.
 Christian Arnold committed May 07, 2021 283 
 Christian Arnold committed Dec 18, 2021 284 285 286 

Peak-gene p-values accuracy and violations

 Christian Arnold committed May 25, 2021 287 288 289 

Coming soon!

 Christian Arnold committed Dec 18, 2021 290 291 292 

Output

 Christian Arnold committed May 25, 2021 293 

Here, we describe the various output files that are produced by the pipeline. They are described in the order they are produced in the pipeline.

 Christian Arnold committed Dec 18, 2021 294 295 296 

GRN object

 Christian Arnold committed Jun 01, 2021 297 

Our pipeline works and output a so-called GRN object. The goal is simple: All information is stored in it, and by keeping everything within one object and sharing it with others, they have all the necessary data and information to run the GRN workflow. A consistent and simple workflow logic makes it easy and intuitive to work with it, similar to other packages such as DESeq2.

 Christian Arnold committed Dec 14, 2021 298 

Technically speaking, it is an S4 object of class GRN. As you can see from the workflow vignette, almost all GRaNIE functions return a GRN object (with the notable exception of get functions). All GRaNIE functions (except initializeGRN, which creates an empty GRN object) also require a GRN object as first argument, which makes is easy and intuitive to work with the package, at least this was the goal we had in mind. We would be happy to receive your feedback about it!

 Christian Arnold committed May 25, 2021 299 

GRN objects contain all data and results necessary for the various functions the package provides, and various extractor functions allow to extract information out of an GRN object such as the various get functions. In addition, printing a GRN object results in an object summary that is printed (try it out and just type GRN in the console if your GRN object is called like this!). In the future, we aim to add more convenience functions. If you have specific ideas, please let us know.

 Christian Arnold committed Dec 18, 2021 300 

The slots of a GRN object are described in the R help, see ?GRN for details. While we work on general extractor functions for the various slots for optimal user experience, we currently suggest to also access and explore the data directly with the @ operator. For example, GRN@config accesses the configuration slot that contains all parameters and object metadata, and slotNames(GRN) prints all available slots of the object.

 301 
 Christian Arnold committed Dec 18, 2021 302 303 304 

PCA plots and results

 Christian Arnold committed May 25, 2021 305 306 

The pipeline outputs PCA plots for both peaks and RNA as well as original (i..e, the counts the user provided as input) and normalized (i.e., the counts after normalizing them if any normalization method has been provided) data. Thus, in total, 4 different PCA plots are produced, 2 per data modality (peaks and RNA) and 2 per data type (original and normalized counts).

Each PDF consists of three parts: PCA results based on the top 500, top 1000 and top 5000 features (see page headers). For each part, different plot types are available and briefly explained in the following:

 Christian Arnold committed May 07, 2021 307 
 Christian Arnold committed May 25, 2021 308 
1. Multi-density plot across all samples (1 page)
2.  Christian Arnold committed May 07, 2021 309 
 Christian Arnold committed Jun 01, 2021 310 311 312 313 

Multi-density plot for read counts across all samples

 Christian Arnold committed May 07, 2021 314 
 Christian Arnold committed May 25, 2021 315 316 317 
1. Screeplot (1 page)
 Christian Arnold committed Jun 01, 2021 318 319 320 321 

PCA screeplot

 Christian Arnold committed May 07, 2021 322 
 Christian Arnold committed May 25, 2021 323 324 325 
 Christian Arnold committed Jun 01, 2021 326 327 328 329 

 Christian Arnold committed May 07, 2021 330 
 Christian Arnold committed May 25, 2021 331 332 333 
1. PCA plots with different metadata being colored (5 or more pages, depending on available metadata)
 Christian Arnold committed Jun 01, 2021 334 335 336 337 

 Christian Arnold committed May 07, 2021 338 
 Christian Arnold committed May 25, 2021 339 

Currently, the actual PCA result data are not stored in the GRN object, but this will be available soon as well. We will update the Vignette once this is done and mention it in the Changelog.

 340 
 Christian Arnold committed Dec 18, 2021 341 342 343 

TF-peak diagnostic plots

 Christian Arnold committed May 25, 2021 344 

TF-peak diagnostic plots are available for each TF, and they currently look as follows:

 Christian Arnold committed Jun 01, 2021 345 346 347 348 

TF-peak diagnostic plots for an example TF

 Christian Arnold committed May 07, 2021 349 
 Christian Arnold committed May 25, 2021 350 

The TF name is indicated in the title, and each page shows two plots. In each plot, the TF-peak FDR for each correlation bin (ranging from -1 to 1 in bins of size 0.05) is shown. The only difference between the two plots is the directionality upon which the FDR is empirically derived from: the upper plot is for the positive and the lower plot for the negative direction. Each plot is also colored by the number of distinct TF-peak connections that fall into the particular bin. Mostly, correlation bins with smaller absolute correlation values have higher frequencies (i.e., more TF-peak links fall into them) while correlation bins with more extreme correlation values are less frequent. In the end, for the resulting network, the directionality can be ignored and only those TF-peak links are kept with small FDRs, irrespective of the directionality.

 Christian Arnold committed May 07, 2021 351 
 Christian Arnold committed Dec 18, 2021 352 353 354 

Activator-repressor classification diagnostic plots and results

 Christian Arnold committed Dec 14, 2021 355 

The pipeline produces 3 different plot types related to the activator-repressor (AR) classification that can optionally be run as part of the GRaNIE workflow. For each of the 3 types, plots are produced for both the original, non-permuted (labeled as original) as well as the permuted (labeled as permuted) data.

 Christian Arnold committed May 25, 2021 356 357 358 359 

The AR classification is run for the RNA expression data (labeled as expression) and can additionally also be run for TF activity data (labeled as TFActivity, see the function addConnections_TF_peak and its parameter options).

In the following, the 3 plot types are briefly explained:

1.  Christian Arnold committed Dec 18, 2021 360 Summary heatmaps (files starting with TF_classification_summaryHeatmap): This is described in detail in the diffTF documentation.  Christian Arnold committed May 25, 2021 361 362 
 Christian Arnold committed Jun 01, 2021 363 364 365 366 

AR summary heatmap

 Christian Arnold committed May 07, 2021 367 
 Christian Arnold committed May 25, 2021 368 369 
1.  Christian Arnold committed Dec 18, 2021 370 Summary stringency plots (files starting with TF_classification_stringencyThresholds): This is described in detail in the diffTF documentation.  Christian Arnold committed May 25, 2021 371 372 
 Christian Arnold committed Jun 01, 2021 373 374 375 376 

AR stringency thresholds

 Christian Arnold committed May 07, 2021 377 
 Christian Arnold committed May 25, 2021 378 379 380 381 
1. Density plots per TF (files starting with TF_classification_densityPlotsForegroundBackground): Density plots for each TF, with one TF per page. The plot shows the foreground (red, labeled as Motif) and background (gray, labeled as Non-motif) densities of the correlation coefficient (either Pearson or Spearman, see x-axis label) from peaks with (foreground) or without (background) a (predicted) TFBS in the peak for the particular TF. The numbers in the parenthesis summarize the underlying total number of peaks.
 Christian Arnold committed Jun 01, 2021 382 383 384 385 

Density plots per TF

 Christian Arnold committed May 07, 2021 386 
 Christian Arnold committed Jun 01, 2021 387 

It is also possible to extract the results from the AR classification out of a GRN object. Currently, this can only be done manually, extractor functions are in the works that will further enhance the user experience. The results are stored in the slot GRN@data$TFs$classification[[permIndex]] [[connectionType]]$TF.classification. Here, permIndex refers to the original, non-permuted (“0”) or permuted (“1”) data, while connectionType here is either expression or TFActivity, depending on whether the pipeline has also be run for TF Activity in addition to expression (see function addConnections_TF_peak). Thus, typically, the results for the original data are stored in GRN@data$TFs$classification[["0"]] [["expression"]]$TF.classification. If intermediate results from the classification have not been deleted (the default is to delete them as they can occupy a large amount of memory in the object, see the parameters of AR_classification_wrapper for details), they can be accessed similarly within GRN@data$TFs$classification[[permIndex]] [[connectionType]]: TF_cor_median_foreground, TF_cor_median_background, TF_peak_cor_foreground, TF_peak_cor_background, and act.rep.thres.l.

 Christian Arnold committed May 07, 2021 388 
 Christian Arnold committed Dec 18, 2021 389 390 391 

Peak-gene diagnostic plots

 Christian Arnold committed May 25, 2021 392 393 

We provide a number of diagnostic plots for the peak-gene links that are imperative for understanding the biological system and GRN. In what follows, we describe them briefly, along with some notes on expected patterns, implications etc. Note that this section is subject to continuous change.

We currently offer a summary QC figure for the peak-gene connections that looks as follows:

 Christian Arnold committed Jun 01, 2021 394 395 396 397 

Summary peak-gene QC figure

 Christian Arnold committed May 07, 2021 398 
 Christian Arnold committed May 25, 2021 399 400 

As you can see, the Figure is divided into two rows: the upper row focuses on the peak-gene raw p-value of the correlation results, while the lower row focuses on the peak-gene correlation coefficient. The left side visualizes the data of the corresponding metrics via density plots, while the right side bins the metrics and visualizes them with barplots for highlighting differences between real and permuted data as well as negatively and positively correlated peak-gene links (denoted as r+ and r-, respectively).

We now describe these plots in more detail.

 Christian Arnold committed Dec 18, 2021 401 402 403 

Correlation raw p-value distribution

 Christian Arnold committed Jun 01, 2021 404   Christian Arnold committed May 25, 2021 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 

First and most importantly, we focus on the distribution of the raw p-values from the correlation tests (peak accessibility vs gene expression) of all peak-gene links. We can investigate these from multiple perspectives.

Let’s start with a density plot. The upper left plot shows the raw p-value density for the particular gene type as indicated in the title (here: all gene types), stratified on two levels:

1. Random (permuted, left) and non-random (real, right) connections
2. Connections that have a negative (r-, gray) and positive (r+, black) correlation coefficient, respectively

Generally speaking, we consider both the random connections as well as r- connections as background.

What we would like to see is:

• random connections show little to no signal, with a flat curve along the x-axis, with little to no difference between r+ and r- connections
• for the real connections and r+ links, a strong peak at small p-values, and a (marginally) flat distribution for higher ones (similar to a well-calibrated raw p-value distribution for any hypothesis test such as differential expression). For r- links, the peak at small p-values should be much smaller and ideally the curve is completely flat. However, from the datasets we examined, this is rarely the case.

If any of these quality controls is not met, it may indicate an issue with data normalization, the underlying biology and what the GRN represents, or an issue with data size or covariates influencing the results.

To quantify the difference between r+ and r- links within each connection type (random vs real), we can also plot the results in form of ratios rather than densities for either the r+ / r- or the real / permuted dimension. These plots are shown in the upper right panel of the summary plot.

For the r+ / r- dimension and permuted data, the ratios should be close to 1 across all p-value bins, while for the real data, a high ratio is typically seen for small p-values. In general, the difference between the permuted and real bar should be large for small p-values and close to 1 for larger ones.

For the real / permuted dimension, what we want to see is again a high ratio for small p-value bins for the r+ links, indicating that when comparing permuted vs real, there are many more small p-value links in real data as compared to permuted. This usually does not hold true for the r- links, though, as can be seen also from the plot: the gray bars are smaller and closer to 1 across the whole binned p-value range.

Lastly, we can also stratify the raw p-value distribution for r+ and r- peak-gene connections according to different biological properties such as the peak-gene distance and others (see below). Here, we focus solely on the real data and additionally stratify the p-value distributions of peak-gene links by their genomic distance (measured as the distance of the middle of the peak to the TSS of the gene, in base pairs). For this, we bin the peak-gene distance equally into 10 bins, which results in the bins containing a non-equal number of data points but genomic distance is increased uniformly from bin to bin:

 Christian Arnold committed Jun 01, 2021 422 423 424 425 

Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) r+ / r- links

 426 
 Christian Arnold committed May 25, 2021 427 428 

We generally (hope to) see that for smaller peak-gene distances (in particular those that overlap, i.e., the peak and the gene are in direct vicinity or even overlapping), the difference between r+ and r- links is bigger as for more distant links. We also include the random links, for which no difference between r+ and r- links is visible, as expected for a well-calibrated background.

Let’s plot the same, but stratified by peak-gene distance and r+ / r- within each plot instead:

 Christian Arnold committed Jun 01, 2021 429 430 431 432 

Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) r+ / r- links

 Christian Arnold committed May 07, 2021 433 434 
 Christian Arnold committed Dec 18, 2021 435 436 437 

Correlation coefficient distribution

 Christian Arnold committed May 25, 2021 438 

So far, we analyzed the raw p-value distribution in detail. Let’s focus now on the distribution of the correlation coefficient per se.

 Christian Arnold committed Jun 01, 2021 439 440 441 442 

Density of the correlation coefficient, stratified by (1) peak-gene distance (using equally sized bins)

 Christian Arnold committed May 25, 2021 443 444 

Here, we can also include the random links for comparison. We see that the distribution of the correlation coefficient is also slightly different across bins, and most extreme for links with a small genomic distance (darker colored lines).

 Christian Arnold committed May 07, 2021 445 446 
 Christian Arnold committed Dec 18, 2021 447 448 449 

Connection summary plots

 Christian Arnold committed May 25, 2021 450 451 452 

We currently offer two different connection summary PDFs, both of which are produced from the function plot_stats_connectionSummary. Both PDFs shows the number of connections for each node type (TF, peak, and gene), while in the boxplots, peaks are further differentiated into TF-peak and peak-gene entities. They also iterate over various parameters and plot one plot per page and parameter combination, as indicated in the title:

1.  Christian Arnold committed Dec 14, 2021 453 allowMissingTFs: TRUE or FALSE (i.e., allow TFs to be missing when summarizing the eGRN network. If set to TRUE, a valid connection may consist of just peak to gene, with no TF being connected to the peak. For more details, see the R help for plot_stats_connectionSummary)
2.  Christian Arnold committed May 25, 2021 454 455 456 457 458 459 460 
3. allowMissingGenes: TRUE or FALSE (i.e., allow genes to be missing when summarizing the GRN network. If set to TRUE, a valid connection may consist of just TF to peak, with no gene being connected to the peak. For more details, see the R help for plot_stats_connectionSummary)
4. TF_peak.connectionType. Either expression or TFActivity to denote which connection type the summary is based on.

Both plot types compare the connectivity for the real and permuted data (denoted as Network type in the boxplot PDF), which allows a better judgment of the connectivity from the real data.

An example page for the summary heatmap looks like this:

 Christian Arnold committed Jun 01, 2021 461 462 463 464 

Example heatmap for the connection summary

 465 
 Christian Arnold committed May 25, 2021 466 467 

Here, two heatmaps are shown, one for real (top) and one for permuted (bottom) network. Each of them shows for different combinations of TF-peak and peak-gene FDRs (0.01 to 0.2) the number of unique node types for the given FDR combination (here: TFs).

Second, a multi-page summary PDF for the connections in form of a boxplot, as exemplified with the following Figure:

 Christian Arnold committed Jun 01, 2021 468 469 470 471 

Example boxplot for the connection summary

 472 
 Christian Arnold committed May 25, 2021 473 474 

Here, we can see that [TODO]

TF- genes, TF-peaks and peak-genes for both real and permuted data, similar to the summary heatmap (see above).

 Christian Arnold committed May 07, 2021 475 
 Christian Arnold committed Dec 18, 2021 476 477 478 

Output tables

 Christian Arnold committed Dec 14, 2021 479 

While no designated output table is created during a typical GRaNIE analysis from the core functions, the function getGRNConnections can be used to extract the resulting eGRN network from a GRN object and save it as a separate data frame. This can then be stored on disk or processed further within R. For more information, see the corresponding R help ( ?getGRNConnections).

 480 481 
 Christian Arnold committed Dec 18, 2021 482 483 484 

Memory footprint and execution time, feasibility with large datasets

 Christian Arnold committed May 25, 2021 485 486 487 488 489 490 491 492 

In this section, we will give an overview over CPU and memory requirements when running or planning a GRN analysis.

Both CPU time and memory footprint primarily depend on similar factors, namely the number of

1. TFs
2. peaks
3. samples

While the number of TFs is typically similar across analyses when the default database is used (HOCOMOCO + PWMScan), the number of peaks and samples may vary greatly across analyses.

 Christian Arnold committed Dec 18, 2021 493 494 495 

CPU time

 Christian Arnold committed May 25, 2021 496 

A typical analysis runs within an hour or two on a standard machine with 2 to 4 cores or so. CPU time non-surprisingly depends primarily on the number of used cores for those functions that support multiple cores and that are time-consuming in nature.

 497 
 Christian Arnold committed Dec 18, 2021 498 499 500 

Memory footprint

 Christian Arnold committed May 25, 2021 501 502 503 

The maximum memory footprint can be a few GB in R. We recommend not using more than 100,000 to 200,000 or so peaks as the memory footprint as well as running time may otherwise increase notably.

Given that our approach is a correlation-based one, it seems preferable to maximize the number of samples while retaining only peaks that carry a biological signal that is differing among samples.

The size of the GRN object is typically in the range of a few hundred MB, but can exceed 1 GB for large datasets. If you have troubles with big datasets, let us know! We always look for ways to further optimize the memory footprint.

 Christian Arnold committed May 07, 2021 504 505 
 Christian Arnold committed Dec 18, 2021 506 507 508 

References

 509 510 511 512 513 514 515 516 517 518 519 520 521 522