<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Introduction and Methodological Details • GRaNIE</title>
<!-- jquery --><script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js" integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo=" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://cdnjs.cloudflare.com/ajax/libs/bootswatch/3.4.0/flatly/bootstrap.min.css" rel="stylesheet" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.4.1/js/bootstrap.min.js" integrity="sha256-nuL8/2cJ5NDSSwnKD8VqreErSWHtnEP9E7AySL+1ev4=" crossorigin="anonymous"></script><!-- bootstrap-toc --><link rel="stylesheet" href="../bootstrap-toc.css">
<script src="../bootstrap-toc.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet">
<script src="../pkgdown.js"></script><meta property="og:title" content="Introduction and Methodological Details">
<meta property="og:description" content="GRaNIE">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]--><!-- Global site tag (gtag.js) - Google Analytics --><script async src="https://www.googletagmanager.com/gtag/js?id=G-530L9SXFM1"></script><script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-530L9SXFM1');
</script>
</head>
<body data-spy="scroll" data-target="#toc">
    <div class="container template-article">
      <header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
  <div class="container">
    <div class="navbar-header">
      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false">
        <span class="sr-only">Toggle navigation</span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">GRaNIE</a>
        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Released version">0.14.4</span>
      </span>
    </div>

    <div id="navbar" class="navbar-collapse collapse">
      <ul class="nav navbar-nav">
<li>
  <a href="../index.html"></a>
</li>
<li>
  <a href="../articles/quickStart.html">Getting Started</a>
</li>
<li class="dropdown">
  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
    Vignettes
     
    <span class="caret"></span>
  </a>
  <ul class="dropdown-menu" role="menu">
<li>
      <a href="../articles/quickStart.html">Getting Started</a>
    </li>
    <li>
      <a href="../articles/Introduction.html">Introduction</a>
    </li>
    <li>
      <a href="../articles/workflow.html">Workflow example</a>
    </li>
  </ul>
</li>
<li>
  <a href="../reference/index.html">Reference</a>
</li>
<li>
  <a href="../news/index.html">Changelog &amp; News</a>
</li>
      </ul>
<ul class="nav navbar-nav navbar-right"></ul>
</div>
<!--/.nav-collapse -->
  </div>
<!--/.container -->
</div>
<!--/.navbar -->

      

      </header><script src="Introduction_files/header-attrs-2.11/header-attrs.js"></script><div class="row">
  <div class="col-md-9 contents">
    <div class="page-header toc-ignore">
      <h1 data-toc-skip>Introduction and Methodological Details</h1>
                        <h4 class="author">Christian Arnold, Judith Zaugg</h4>
            <h4 class="date">18 December 2021</h4>
      
      
      <div class="hidden name"><code>Introduction.Rmd</code></div>

    </div>

    
        <div class="abstract">
      <p class="abstract">Abstract</p>
      <p>This vignette introduces the <code>GRaNIE</code> package and explains the main features, methods and necessary background.</p>
    </div>
    
<div id="motivation" class="section level1">
<h1 class="hasAnchor">
<a href="#motivation" class="anchor"></a>Motivation and Necessity</h1>
<!-- <div align="center"> -->
<!-- <figure> -->
<!-- <img src="figs/Logo.png" height="200px"/> -->
<!-- <figcaption><i>Figure 1 - `GRaNIE` logo.</i></figcaption> -->
<!-- </figure><br/><br/> -->
<!-- </div> -->
<!-- <br/><br/> -->
<p>Genetic variants associated with diseases often affect non-coding regions and thus likely have a regulatory role. To understand the effects of genetic variants in these regulatory regions, it is crucial to identify the genes that are modulated by specific regulatory elements (REs). The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that regulate a given enhancer have cell-type-specific activity. This TF activity can be quantified with existing tools such as <em>diffTF</em> and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a gene regulatory network (GRN) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a GRN using bulk RNA-seq and open chromatin data (e.g., from ATAC-seq or ChIP-seq for open chromatin marks) and, optionally, TF activity data. Our network contains different types of links: TFs are connected to regulatory elements, which in turn are connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using a permutation-based approach.</p>
<p>In summary, we present a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating expression and chromatin accessibility/activity patterns across individuals, and provide a comprehensive resource of cell-type-specific gene regulatory networks for particular cell types.</p>
</div>
<div id="installation" class="section level1">
<h1 class="hasAnchor">
<a href="#installation" class="anchor"></a>Installation and Example Workflow</h1>
<p>Please see the <a href="quickStart.html">quick start vignette for how to install our <code>GRaNIE</code> package(s)</a> and the <a href="workflow.html">workflow vignette for an example workflow</a>.</p>
</div>
<div id="input" class="section level1">
<h1 class="hasAnchor">
<a href="#input" class="anchor"></a>Input</h1>
<p>In our <code>GRN</code> approach, we integrate multiple data modalities. Here, we describe them and their required formats in detail.</p>
<div id="input_peaks" class="section level2">
<h2 class="hasAnchor">
<a href="#input_peaks" class="anchor"></a>Open chromatin and RNA-seq data</h2>
<p>Open chromatin data may come from ATAC-seq, DNase-seq or ChIP-seq data for particular histone modifications that associate with open chromatin such as histone acetylation (e.g., H3K27ac). They all capture open chromatin either directly or indirectly, and while we primarily tested and used ATAC-seq while developing the package, the others should also be applicable to our framework. <em>From here on, we will refer to these regions simply as peaks.</em></p>
<p>For RNA-seq, the data represent expression counts per gene across samples.</p>
<p>The following format is required to be compatible with our framework:</p>
<ul>
<li>columns are samples, rows are peaks / genes</li>
<li>column names are required while rownames are ignored</li>
<li>except for one ID column, all other columns must be numeric and represent counts per sample</li>
<li>ID column:
<ul>
<li>The name of the ID column can be anything and can be specified later in the pipeline. For peaks, we usually use <code>peakID</code> while for RNA-seq, we use <code>EnsemblID</code>
</li>
<li>for peaks, the required format is “chr:start-end”, with <code>chr</code> denoting the chromosome, followed by <code>:</code>, and then <code>start</code>, <code>-</code>, and <code>end</code> for the peak start and end, respectively.</li>
</ul>
</li>
<li>counts should be raw if possible (that is, integers), but we also support pre-normalized data. <a href="#methods_dataNorm">See here for more information.</a>
</li>
<li>peak and RNA-seq data may contain different sets of samples, with some samples overlapping but others not. This is not an issue: as long as <em>some</em> samples are found in both of them, the <code>GRaNIE</code> pipeline can work with it. Note, however, that only the samples shared between both data modalities are kept, so make sure that the sample names match between them and that they share as many samples as possible. See the methods part for guidelines on how many samples we recommend.</li>
</ul>
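<p>The format requirements above can be checked before loading data into the package. The following is a minimal Python sketch and not part of <code>GRaNIE</code>: the helper names <code>parse_peak_id</code> and <code>shared_samples</code> are hypothetical, and we assume that the ID columns are called <code>peakID</code> and <code>EnsemblID</code> as in the examples above.</p>

```python
import re

# Hypothetical helpers for illustration only; they are not GRaNIE functions.
PEAK_ID_PATTERN = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_peak_id(peak_id):
    """Split a peak ID such as 'chr1:1000-1500' into (chr, start, end)."""
    match = PEAK_ID_PATTERN.match(peak_id)
    if match is None:
        raise ValueError(f"Peak ID not in 'chr:start-end' format: {peak_id!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"Peak start is larger than peak end: {peak_id!r}")
    return chrom, start, end


def shared_samples(peak_columns, rna_columns,
                   peak_id_col="peakID", rna_id_col="EnsemblID"):
    """Return the sample names found in both tables, in peak-table order."""
    rna_samples = set(rna_columns) - {rna_id_col}
    return [name for name in peak_columns
            if name != peak_id_col and name in rna_samples]
```

<p>For example, <code>shared_samples(["peakID", "s1", "s2"], ["EnsemblID", "s2", "s3"])</code> keeps only <code>s2</code>, which mirrors the statement above that only shared samples are retained.</p>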
<p>Note that peaks must not overlap. If they do, an informative error message is thrown and the user is requested to modify the peak input data so that no overlaps exist among all peaks. This can be done by either merging overlapping peaks or deleting those that overlap with other peaks based on other criteria such as peak signal, by keeping only the strongest peak, for example.</p>
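<p>One of the two strategies mentioned above, merging overlapping peaks, can be sketched as follows. This is an illustrative Python example under stated assumptions and not the check that the package itself performs: peaks are given as <code>(chr, start, end)</code> tuples with inclusive coordinates, and the function name <code>merge_overlapping_peaks</code> is hypothetical.</p>

```python
def merge_overlapping_peaks(peaks):
    """Merge overlapping peaks given as (chrom, start, end) tuples.

    Assumes inclusive coordinates, so two peaks sharing a single base
    are treated as overlapping.
    """
    merged = []
    # Sorting by chromosome and start puts overlapping peaks next to each other.
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and merged[-1][2] >= start:
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def has_overlaps(peaks):
    """True if at least two peaks in the input overlap."""
    return len(peaks) > len(merge_overlapping_peaks(peaks))
```

<p>If a peak signal is available, keeping only the strongest peak of each overlapping group is an alternative to merging; that variant is not shown here.</p>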
<p>For guidelines on how many peaks are necessary or recommended, see <a href="#guidelines">the section below</a>.</p>
</div>
<div id="input_TF" class="section level2">
<h2 class="hasAnchor">
<a href="#input_TF" class="anchor"></a>TF and TFBS data</h2>
<p>TF and TFBS data are mandatory as input. Specifically, the package requires a <code>bed</code> file per TF with TF binding sites (TFBS). TFBS can either be in-silico predicted or experimentally verified, as long as genome-wide TFBS can be used. For convenience and orientation, we provide TFBS predictions for HOCOMOCO-based TF motifs that were used with <code>PWMScan</code> for <code>hg19</code>, <code>hg38</code> and <code>mm10</code>. Check the <a href="workflow.html">workflow vignette for an example</a>.</p>
<p>However, you may also use your own TFBS data, and we provide full flexibility in doing so. Only some manual preparation is necessary. Briefly, if you decide to use your own TFBS data, you have to prepare the following:</p>
<ul>
<li>a folder that contains one TFBS file per TF in <code>bed</code> format, 6 columns</li>
<li>file names must be <code>{TF}{suffix}.{fileEnding}</code>, where <code>{TF}</code> specifies the name of the TF, <code>{suffix}</code> an optional and arbitrary string (we use <code>_TFBS</code>, for example), and <code>{fileEnding}</code> the file type (supported are <code>bed</code> and <code>bed.gz</code>).</li>
<li>the folder must also contain a so-called translation table</li>
</ul>
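<p>The file naming convention above can be illustrated with a short Python sketch that recovers the TF name from a file name. The function <code>tf_name_from_file</code> is a hypothetical helper and not part of the package; the default suffix <code>_TFBS</code> is the example value mentioned above.</p>

```python
def tf_name_from_file(file_name, suffix="_TFBS", file_ending="bed"):
    """Extract {TF} from a file named {TF}{suffix}.{fileEnding}."""
    if file_ending not in ("bed", "bed.gz"):
        raise ValueError(f"Unsupported file ending: {file_ending!r}")
    expected_tail = f"{suffix}.{file_ending}"
    # The file name must end with the expected tail and have a non-empty TF part.
    if not file_name.endswith(expected_tail) or len(file_name) == len(expected_tail):
        raise ValueError(f"{file_name!r} does not match '{{TF}}{expected_tail}'")
    return file_name[: -len(expected_tail)]
```

<p>With the defaults, <code>CTCF_TFBS.bed</code> yields the TF name <code>CTCF</code>; for compressed files, pass <code>file_ending="bed.gz"</code>.</p>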
<p>For more methodological details, including how to construct these files and their exact format, we refer to the <code>diffTF</code> paper.</p>
</div>
<div id="input_metadata" class="section level2">
<h2 class="hasAnchor">
<a href="#input_metadata" class="anchor"></a>Sample metadata (optional but highly recommended)</h2>
<p>Providing sample metadata is optional, but highly recommended. If available, the sample metadata is integrated into the PCA plots to understand where the variation in the data comes from and whether any of the metadata (e.g., age, sex, sequencing batch) is associated with the PCs from a PCA, indicating a batch effect that needs to be addressed before running the <code>GRaNIE</code> pipeline.</p>
<p>The integration of sample metadata is done in the <code>addData</code> function, see <code><a href="../reference/addData.html">?addData</a></code> for more information.</p>
</div>
<div id="input_HiC" class="section level2">
<h2 class="hasAnchor">
<a href="#input_HiC" class="anchor"></a>Hi-C data (optional)</h2>
<p>Integration of Hi-C data is optional and serves as an alternative to identifying peak-gene pairs to test for correlation based on a predefined and fixed <em>neighborhood</em> size (see <a href="#methods_peakGene">Methods</a>).</p>
<p>If Hi-C data are available, the pipeline expects a BED file format with at least 3 columns: chromosome name, start, and end. An ID column is optional and assumed to be in the 4th column, all additional columns are ignored.</p>
<p>For more details, see the R help (<code><a href="../reference/addConnections_peak_gene.html">?addConnections_peak_gene</a></code>) and the <a href="#methods_peakGene">Methods</a>.</p>
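<p>To make the expected Hi-C input layout concrete, the following Python sketch parses one line of such a BED file. It is an illustration only: we assume tab-separated fields, and the function name <code>parse_tad_line</code> as well as the choice to return <code>None</code> for a missing ID are our own assumptions, not package behavior.</p>

```python
def parse_tad_line(line):
    """Parse one BED line into (chrom, start, end, tad_id).

    Columns 1 to 3 are required, column 4 is an optional ID, and any
    further columns are ignored. Tab-separated fields are assumed.
    """
    fields = line.rstrip("\n").split("\t")
    if 3 > len(fields):
        raise ValueError("A TAD line needs at least chromosome, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    tad_id = fields[3] if len(fields) > 3 else None
    return chrom, start, end, tad_id
```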
</div>
<div id="input_SNP" class="section level2">
<h2 class="hasAnchor">
<a href="#input_SNP" class="anchor"></a>SNP data (optional, coming soon)</h2>
<p>We also plan to integrate SNP data soon, stay tuned!</p>
</div>
</div>
<div id="methods" class="section level1">
<h1 class="hasAnchor">
<a href="#methods" class="anchor"></a>Methodological Details and Basic Mode of Action</h1>
<p>In this section, we give methodological details and guidelines.</p>
<div id="methods_dataNorm" class="section level2">
<h2 class="hasAnchor">
<a href="#methods_dataNorm" class="anchor"></a>Data normalization</h2>
<p>An important consideration is data normalization for RNA and open chromatin data. We currently support three normalization choices for both peak and RNA-Seq data: <code>quantile</code>, <code>DESeq_sizeFactor</code> and <code>none</code>; see the R help for more details (<code><a href="../reference/addData.html">?addData</a></code>). The default for RNA-Seq is quantile normalization, while for the open chromatin peak data, it is <code>DESeq_sizeFactor</code> (i.e., a “regular” <code>DESeq</code> size factor normalization). Importantly, <code>DESeq_sizeFactor</code> requires raw data, while <code>quantile</code> does not necessarily. We nevertheless recommend raw data as input, although it is also possible to provide pre-normalized data and then either apply another normalization method on top or choose “none”.</p>
<p>Note that the normalization method may have a large influence on the resulting <code>eGRN</code>, so make sure the choice of normalization is reasonable. For more details, see the next sections.</p>
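<p>To illustrate why size factor normalization needs raw counts, here is a minimal Python sketch of the median-of-ratios idea that underlies DESeq-style size factors. It is a simplified illustration under our own assumptions (non-negative counts, features with any zero count skipped) and does not claim to reproduce what <code>addData</code> computes internally; the function names are hypothetical.</p>

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of rows (peaks or genes), each a list of raw counts
    with one value per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            continue  # the log geometric mean is undefined for zero counts
        log_geo_mean = sum(math.log(value) for value in row) / n_samples
        for j, value in enumerate(row):
            log_ratios[j].append(math.log(value) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("No feature has non-zero counts in all samples")
    return [math.exp(median(values)) for values in log_ratios]


def normalize_by_size_factor(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[value / factors[j] for j, value in enumerate(row)] for row in counts]
```

<p>Because each sample receives a single factor derived from count ratios, applying this to data that were already scaled or transformed would give misleading factors, which is consistent with the recommendation above to start from raw counts.</p>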
</div>
<div id="methods_permutedData" class="section level2">
<h2 class="hasAnchor">
<a href="#methods_permutedData" class="anchor"></a>Permutations</h2>
<p>For the permuted data, the RNA-Seq data are shuffled; we refer to this as permutation 1. More details will follow.</p>
</div>
<div id="methods_TF_peak" class="section level2">
<h2 class="hasAnchor">
<a href="#methods_TF_peak" class="anchor"></a>TF-peak connections</h2>
<div id="methods_TF_peak_build" class="section level3">
<h3 class="hasAnchor">
<a href="#methods_TF_peak_build" class="anchor"></a>Establishing TF-peak links</h3>
<p>A description of how we establish TF-peak links will follow.</p>
</div>
<div id="methods_TF_peak_TFActivity" class="section level3">
<h3 class="hasAnchor">
<a href="#methods_TF_peak_TFActivity" class="anchor"></a>TF Activity connections</h3>
<p>As explained above, TF-peak connections are found by correlating TF <em>expression</em> with peak accessibility. In addition to <em>expression</em>, we also offer to identify statistically significant TF-peak links based on <em>TF Activity</em> instead of the expression of the TFs. The concept of TF Activity is described in more detail in our <em>diffTF</em> paper. In short, we define TF motif activity, or TF activity for short, as the effect of a TF on the state of chromatin as measured by chromatin accessibility or active chromatin marks (i.e., ATAC-seq, DNase sequencing [DNase-seq], or histone H3 lysine 27 acetylation [H3K27ac] ChIP-seq). A <em>TF Activity</em> score is therefore needed <em>for each TF and each sample</em>.</p>
<p>TF Activity information can either be calculated within the <code>GRaNIE</code> framework <a href="#methods_TF_peak_TFActivity_calculating">using a simplified and empirical approach</a> or it can be calculated outside of our framework using designated methods and then <a href="#methods_TF_peak_TFActivity_importing">imported into our framework</a>. We now describe these two choices in more detail.</p>
<div id="methods_TF_peak_TFActivity_calculating" class="section level4">
<h4 class="hasAnchor">
<a href="#methods_TF_peak_TFActivity_calculating" class="anchor"></a>Calculating TF Activity</h4>
<p>In our <em>GRaNIE</em> approach, we empirically estimate TF Activity for each TF with the following approach:</p>
<ul>
<li>normalize the raw peak counts by one of the supported normalization methods (see below)</li>
<li>from the TF-peak accessibility matrix as calculated before, identify the subset of peaks with a TFBS overlap for the particular TF based on the user-provided TFBS data</li>
<li>scaling and centering of the normalized accessibility scores per row (i.e., peak) so that row means are close to 0 for each peak</li>
<li>the column means (i.e., sample means) from the scaled and centered counts are then taken as approximation for the TF Activity</li>
</ul>
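<p>The steps above can be sketched in a few lines of Python. This is a simplified illustration and not the package implementation: the input is assumed to be already normalized, the TFBS overlap is given as one boolean per peak, and skipping peaks with zero variance is our own assumption. The function name <code>estimate_tf_activity</code> is hypothetical.</p>

```python
from statistics import mean, stdev


def estimate_tf_activity(normalized_counts, has_tfbs):
    """Approximate the TF Activity per sample for a single TF.

    normalized_counts: list of rows (peaks), each a list of normalized
        accessibility values with one value per sample.
    has_tfbs: list of booleans, True if the peak overlaps a TFBS of the TF.
    """
    scaled_rows = []
    for row, bound in zip(normalized_counts, has_tfbs):
        if not bound:
            continue  # keep only peaks with a TFBS overlap for this TF
        row_mean, row_sd = mean(row), stdev(row)
        if row_sd == 0:
            continue  # assumption: constant peaks cannot be scaled and are skipped
        # Center and scale per peak so that each row has mean 0 and unit variance.
        scaled_rows.append([(value - row_mean) / row_sd for value in row])
    if not scaled_rows:
        raise ValueError("No usable peak with a TFBS overlap for this TF")
    n_samples = len(scaled_rows[0])
    # The column means of the scaled matrix approximate the TF Activity.
    return [mean(row[j] for row in scaled_rows) for j in range(n_samples)]
```

<p>The result is one value per sample, which matches the requirement stated earlier that a TF Activity score is needed for each TF and each sample.</p>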
<p>We currently offer four different types of normalizing the raw data for calculating TF Activity. Options 2 to 4 are described in more detail in the section <a href="#methods_dataNorm">Data normalization above</a>, while option 1 is currently only available for TF Activity and therefore explained below (this may change in the future):</p>
<ol style="list-style-type: decimal">
<li>Cyclic LOESS normalization (default)  <br> Local regression (LOESS) is a commonly used approach for fitting flexible non-linear functions, which involves computing many local linear regression fits and combining them. Briefly, a normalization factor is derived per gene and sample using the <code>normOffsets</code> function of the <code>csaw</code> package in R as opposed to using one size factor for each sample only as with the regular size factor normalization in <code>DESeq</code>. For each sample, a LOWESS (Locally Weighted Scatterplot Smoothing) curve is fitted to the log-counts against the log-average count. The fitted value for each bin pair is used as the generalized linear model offset for that sample. The use of the average count provides more stability than the average log-count when low counts are present. For more details, see the <code>csaw</code> package in <code>R</code> and the <code>normOffsets</code> methods therein.</li>
<li>Standard size factor normalization from <code>DESeq</code>
</li>
<li>Quantile normalization</li>
<li>No normalization</li>
</ol>
</div>
<div id="methods_TF_peak_TFActivity_importing" class="section level4">
<h4 class="hasAnchor">
<a href="#methods_TF_peak_TFActivity_importing" class="anchor"></a>Importing TF Activity</h4>
<p>It will also be possible to import TF Activity data into our framework as opposed to calculating it using the procedure described above. This feature is currently in development and will be available soon.</p>
</div>
<div id="methods_TF_peak_TFActivity_adding" class="section level4">
<h4 class="hasAnchor">
<a href="#methods_TF_peak_TFActivity_adding" class="anchor"></a>Adding TF Activity TF-peak connections</h4>
<p>Once TF Activity data is available, finding TF-peak links and assessing their significance is done in complete analogy to the TF expression data - only the input data is different (TF Activity as opposed to TF expression). The so-called connection type, <em>expression</em> or <em>TF Activity</em>, is stored in the <em>GRN</em> object and output tables and therefore allows tailoring and filtering the resulting network accordingly. All output PDFs also state whether a TF-peak link has been established via <em>TF expression</em> or <em>TF Activity</em>.</p>
</div>
</div>
</div>
<div id="methods_peakGene" class="section level2">
<h2 class="hasAnchor">
<a href="#methods_peakGene" class="anchor"></a>Peak-gene associations</h2>
<p>We offer two options of where in the gene the overlap with the extended peak may occur: at the 5’ end of the gene (the default) or anywhere in the gene. For more information, see the R help (<code><a href="../reference/addConnections_peak_gene.html">?addConnections_peak_gene</a></code> and the parameter <code>overlapTypeGene</code> in particular).</p>
<div id="two-approaches-for-identifying-peak-gene-pairs-to-test-for-correlation" class="section level3">
<h3 class="hasAnchor">
<a href="#two-approaches-for-identifying-peak-gene-pairs-to-test-for-correlation" class="anchor"></a>Two approaches for identifying peak-gene pairs to test for correlation</h3>
<p>We offer two options to decide which peak-gene pairs to test for correlation: in the absence of additional topologically associating domain (TAD) data from Hi-C or similar approaches, the pipeline uses a local neighborhood-based approach with a custom neighborhood size (default: 250 kb up- and downstream of the peak) to select peak-gene pairs to test. In the presence of TAD data, all peak-gene pairs within a TAD are tested, while peaks located outside of any TAD are ignored. The user furthermore has the choice to specify whether overlapping TADs should be merged or not.</p>
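<p>The neighborhood-based selection can be sketched as follows. This Python example is an illustration under stated assumptions and not the package code: genes are given as <code>(gene_id, chr, start, end, strand)</code> tuples, the 5’ end is taken as the start for genes on the plus strand and as the end otherwise, and the function and parameter names are hypothetical.</p>

```python
def candidate_genes(peak, genes, neighborhood_size=250_000, five_prime_only=True):
    """Select the genes to test for correlation with one peak.

    peak: (chrom, start, end)
    genes: iterable of (gene_id, chrom, start, end, strand)
    """
    chrom, start, end = peak
    # Extend the peak up- and downstream by the neighborhood size.
    window_start = max(0, start - neighborhood_size)
    window_end = end + neighborhood_size
    selected = []
    for gene_id, gene_chrom, gene_start, gene_end, strand in genes:
        if gene_chrom != chrom:
            continue
        if five_prime_only:
            # Assumption: the 5' end depends on the strand of the gene.
            five_prime = gene_start if strand == "+" else gene_end
            hit = window_end >= five_prime >= window_start
        else:
            # Any overlap between the gene body and the extended peak counts.
            hit = window_end >= gene_start and gene_end >= window_start
        if hit:
            selected.append(gene_id)
    return selected
```

<p>With TAD data, the window would instead be given by the boundaries of the TAD that contains the peak, and peaks outside of any TAD would yield no candidate genes.</p>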
</div>
</div>
</div>
<div id="guidelines" class="section level1">
<h1 class="hasAnchor">
<a href="#guidelines" class="anchor"></a>Guidelines, Recommendations, Limitations, Scope</h1>
<p>In this section, we provide a few guidelines and recommendations that may be helpful for your analysis.</p>
<div id="guidelines_scope" class="section level2">
<h2 class="hasAnchor">
<a href="#guidelines_scope" class="anchor"></a>Package scope</h2>
<p>In this section, we want to explicitly mention the designated scope of the <code>GRaNIE</code> package, its limitations and additional / companion packages that may be used subsequently or beforehand.</p>
<p>Coming soon.</p>
</div>
<div id="guidelines_TFBS" class="section level2">
<h2 class="hasAnchor">
<a href="#guidelines_TFBS" class="anchor"></a>Transcription factor binding sites (TFBS)</h2>
<p>TFBS are a crucial input for any <code>GRaNIE</code> analysis. Our <code>GRaNIE</code> approach is agnostic as to how these files are generated: as long as one BED file per TF is provided with TFBS positions, the TF can be integrated. As explained above, we usually work with TFBS as predicted by <code>PWMScan</code> based on <code>HOCOMOCO</code> TF motifs, but in-silico predicted TFBS are by no means a requirement of the pipeline; <code>JASPAR</code> TFBS or TFBS from any other database can also be used. The total number of TFs and of TFBS per TF seems more relevant here, due to the way we integrate TFBS: we create a binary 0/1 overlap matrix for each peak and TF, with 0 indicating that no TFBS for a particular TF overlaps with a particular peak, while 1 indicates that at least one TFBS from the TFBS input data overlaps with the particular peak by at least 1 bp. Thus, having more TFBS in general increases the number of 1s and therefore the <em>foreground</em> of the TF (see the diagnostic plots), while it also makes the foreground more noisy if the TFBS list contains too many false positives. As always in biology, this is a trade-off.</p>
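<p>The binary overlap matrix described above can be illustrated with a small Python sketch. It is deliberately naive (every peak is compared with every TFBS) and assumes inclusive coordinates; the package itself may compute the overlaps differently, and the function name <code>overlap_matrix</code> is hypothetical.</p>

```python
def overlap_matrix(peaks, tfbs_by_tf):
    """Build a 0/1 peak-by-TF overlap matrix.

    peaks: list of (chrom, start, end)
    tfbs_by_tf: dict mapping a TF name to a list of (chrom, start, end) sites
    Returns a dict mapping each TF to a list with one 0/1 entry per peak.
    """
    matrix = {}
    for tf, sites in tfbs_by_tf.items():
        matrix[tf] = [
            # 1 if at least one site of this TF overlaps the peak by 1 bp or more
            int(any(site_chrom == chrom and end >= site_start and site_end >= start
                    for site_chrom, site_start, site_end in sites))
            for chrom, start, end in peaks
        ]
    return matrix
```

<p>The number of 1s per TF in this matrix corresponds to the size of the foreground mentioned above, so a longer TFBS list directly increases it.</p>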
</div>
<div id="guidelines_peaks" class="section level2">
<h2 class="hasAnchor">
<a href="#guidelines_peaks" class="anchor"></a>Peaks</h2>
<p>The number of peaks that is provided as input matters greatly for the resulting GRN and its connectivity. From our experience, this number should be in a reasonable range so that there is enough data to build a GRN, but also not so many that the whole pipeline runs unnecessarily long. We have good experience with the number of peaks ranging between 50,000 and 200,000 or so, although these are not hard thresholds but rather recommendations.</p>
<p>With respect to the recommended width of the peaks, we usually use peaks with a width ranging from a couple of hundred base pairs up to a few kb, while the default is to filter peaks if they are wider than 10,000 bp (parameter <code>maxSize_peaks</code> in the function <code>filterData</code>). Remember: peaks are overlapped with TFBS, so if a particular peak is too narrow, the likelihood of not overlapping with any (predicted) TFBS from any TF increases, and such a peak is subsequently essentially ignored.</p>
</div>
<div id="guidelines_RNA" class="section level2">
<h2 class="hasAnchor">
<a href="#guidelines_RNA" class="anchor"></a>RNA-Seq</h2>
<p>The following list is subject to change and provides some rough guidelines for the RNA-Seq data:</p>
<ol style="list-style-type: decimal">
<li>We recommend using raw counts if possible, and checking carefully in a PCA whether any batch effects are visible.</li>
<li>We advise removing genes with very small counts across samples by running the function <code>filterData</code>, see the argument <code>minNormalizedMeanRNA</code> for more information. You may want to check beforehand how many genes have a row mean of &gt;1. This number is usually in the tens of thousands.</li>
<li>At the moment, we have not properly tested our framework for single-cell RNA-Seq data, and therefore cannot provide support for it. Thus, use regular bulk data until we have made progress on the single-cell applicability.</li>
</ol>
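<p>The check suggested in item 2 can be done in a few lines. The following Python sketch counts how many genes have a row mean above a threshold; it is independent of <code>filterData</code>, and the function name <code>count_expressed_genes</code> and the input layout (gene ID mapped to a list of counts per sample) are assumptions made for illustration.</p>

```python
def count_expressed_genes(counts_by_gene, min_mean=1.0):
    """Count the genes whose mean count across samples exceeds min_mean."""
    return sum(
        1 for counts in counts_by_gene.values()
        if sum(counts) / len(counts) > min_mean
    )
```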
</div>
<div id="guidelines_peakGene" class="section level2">
<h2 class="hasAnchor">
<a href="#guidelines_peakGene" class="anchor"></a>Peak-gene p-values accuracy and violations</h2>
<p>Coming soon!</p>
</div>
</div>
<div id="output" class="section level1">
<h1 class="hasAnchor">
<a href="#output" class="anchor"></a>Output</h1>
<p>Here, we describe the various output files that are produced by the pipeline. They are described in the order they are produced in the pipeline.</p>
<div id="output_GRN" class="section level2">
<h2 class="hasAnchor">
<a href="#output_GRN" class="anchor"></a>GRN object</h2>
Christian Arnold's avatar
Christian Arnold committed
297
<p><strong>Our pipeline works and output a so-called <code>GRN</code> object. The goal is simple: All information is stored in it, and by keeping everything within one object and sharing it with others, they have all the necessary data and information to run the <code>GRN</code> workflow. A consistent and simple workflow logic makes it easy and intuitive to work with it, similar to other packages such as <code>DESeq2</code>.</strong></p>
Christian Arnold's avatar
Christian Arnold committed
298
<p>Technically speaking, it is an S4 object of class <code>GRN</code>. As you can see from the <a href="workflow.html">workflow vignette</a>, almost all <code>GRaNIE</code> functions return a <code>GRN</code> object (with the notable exception of <code>get</code> functions). All <code>GRaNIE</code> functions (except <code>initializeGRN</code>, which creates an empty <code>GRN</code> object) also require a <code>GRN</code> object as first argument, which makes is easy and intuitive to work with the package, at least this was the goal we had in mind. We would be happy to receive your feedback about it!</p>
299
<p><code>GRN</code> objects contain all data and results necessary for the various functions the package provides, and various extractor functions allow to extract information out of an <code>GRN</code> object such as the various <code>get</code> functions. In addition, printing a <code>GRN</code> object results in an object summary that is printed (try it out and just type <code>GRN</code> in the console if your <code>GRN</code> object is called like this!). In the future, we aim to add more convenience functions. If you have specific ideas, please let us know.</p>
300
<p>The slots of a <code>GRN</code> object are described in the R help, see <code>?GRN</code> for details. While we work on general extractor functions for the various slots for optimal user experience, we currently suggest to also access and explore the data directly with the <code><a href="https://rdrr.io/r/base/slotOp.html">@</a></code> operator. For example, <code>GRN@config</code> accesses the configuration slot that contains all parameters and object metadata, and <code>slotNames(GRN)</code> prints all available slots of the object.</p>
</div>
<div id="output_PCA" class="section level2">
<h2 class="hasAnchor">
<a href="#output_PCA" class="anchor"></a>PCA plots and results</h2>
<p>The pipeline outputs PCA plots for both peaks and RNA, and for both the original (i.e., the counts the user provided as input) and the normalized (i.e., the counts after normalization, if a normalization method has been specified) data. Thus, in total, 4 different PCA plots are produced, 2 per data modality (peaks and RNA) and 2 per data type (original and normalized counts).</p>
<p>Each PDF consists of three parts: PCA results based on the top 500, top 1000 and top 5000 features (see page headers). For each part, different plot types are available and briefly explained in the following:</p>
<ol style="list-style-type: decimal">
<li>Multi-density plot across all samples (1 page)</li>
</ol>
<div class="figure">
<img src="figs/PCA_peaks_raw/p-01.png" alt="&lt;i&gt;Multi-density plot for read counts across all samples&lt;/i&gt;" width="100%"><p class="caption">
<i>Multi-density plot for read counts across all samples</i>
</p>
</div>
<ol start="2" style="list-style-type: decimal">
<li>Screeplot (1 page)</li>
</ol>
<div class="figure">
<img src="figs/PCA_peaks_raw/p-02.png" alt="&lt;i&gt;PCA screeplot&lt;/i&gt;" width="100%"><p class="caption">
<i>PCA screeplot</i>
</p>
</div>
<ol start="3" style="list-style-type: decimal">
<li>Metadata correlation plot</li>
</ol>
<div class="figure">
<img src="figs/PCA_peaks_raw/p-03.png" alt="&lt;i&gt;Metadata correlation plot for PCA&lt;/i&gt;" width="100%"><p class="caption">
<i>Metadata correlation plot for PCA</i>
</p>
</div>
<ol start="4" style="list-style-type: decimal">
<li>PCA plots with different metadata being colored (5 or more pages, depending on available metadata)</li>
</ol>
<div class="figure">
<img src="figs/PCA_peaks_raw/p-06.png" alt="&lt;i&gt;PCA plots for various metadata&lt;/i&gt;" width="100%"><p class="caption">
<i>PCA plots for various metadata</i>
</p>
</div>
<p>Currently, the actual PCA result data are not stored in the <code>GRN</code> object, but this will be available soon as well. We will update the Vignette once this is done and mention it in the Changelog.</p>
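<p>Until the PCA results are stored in the object, a comparable PCA can be computed by hand. The following is a minimal sketch on simulated counts, not the code the package uses: in particular, selecting the "top" features by their variance across samples is an assumption made here for illustration.</p>

```r
set.seed(1)

# Toy count matrix: 2000 features (rows) x 10 samples (columns), simulated data
counts <- matrix(rpois(2000 * 10, lambda = 50), nrow = 2000,
                 dimnames = list(paste0("feature", 1:2000), paste0("sample", 1:10)))

# Assumption: "top" features are those with the highest variance across samples
topN <- 500
featureVar  <- apply(counts, 1, var)
topFeatures <- names(sort(featureVar, decreasing = TRUE))[seq_len(topN)]

# prcomp expects samples in rows, hence the transpose
pca <- prcomp(t(counts[topFeatures, ]), center = TRUE, scale. = FALSE)

# Proportion of variance explained per principal component (basis of a screeplot)
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
round(varExplained[1:3], 3)

# Sample coordinates on the first two components (basis of the PCA scatter plots)
head(pca$x[, 1:2])
```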
</div>
<div id="output_TF_peak" class="section level2">
<h2 class="hasAnchor">
<a href="#output_TF_peak" class="anchor"></a>TF-peak diagnostic plots</h2>
<p>TF-peak diagnostic plots are available for each TF, and they currently look as follows:</p>
<div class="figure">
<img src="figs/TFPeak_fdr_orig/p-25.png" alt="&lt;i&gt;TF-peak diagnostic plots for an example TF&lt;/i&gt;" width="100%"><p class="caption">
<i>TF-peak diagnostic plots for an example TF</i>
</p>
</div>
<p>The TF name is indicated in the title, and each page shows two plots. In each plot, the TF-peak FDR for each correlation bin (ranging from -1 to 1 in bins of size 0.05) is shown. The only difference between the two plots is the direction from which the FDR is empirically derived: the upper plot is for the <em>positive</em> and the lower plot for the <em>negative</em> direction. Each plot is also colored by the number of distinct TF-peak connections that fall into the particular bin. Typically, correlation bins with smaller absolute correlation values contain more TF-peak links, while bins with more extreme correlation values contain fewer. In the end, for the resulting network, the directionality can be ignored and only those TF-peak links with small FDRs are kept, irrespective of the directionality.</p>
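<p>To make the binning concrete, the following sketch assigns simulated correlation coefficients to bins of size 0.05 between -1 and 1 and counts the links per bin. The data are simulated and the empirical FDR calculation itself is deliberately not shown, as its details are not described here.</p>

```r
set.seed(1)

# Simulated TF-peak correlation coefficients (toy data), clamped to [-1, 1]
corReal <- pmin(pmax(rnorm(5000, mean = 0, sd = 0.3), -1), 1)

# Correlation bins from -1 to 1 in steps of 0.05, i.e., 40 bins
breaks <- seq(-1, 1, by = 0.05)
bins   <- cut(corReal, breaks = breaks, include.lowest = TRUE)

# Number of TF-peak links per bin: bins close to 0 contain the most links
nPerBin <- table(bins)
head(sort(nPerBin, decreasing = TRUE))
```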
</div>
<div id="output_AR" class="section level2">
<h2 class="hasAnchor">
<a href="#output_AR" class="anchor"></a>Activator-repressor classification diagnostic plots and results</h2>
<p>The pipeline produces 3 different plot types related to the activator-repressor (AR) classification that can optionally be run as part of the <code>GRaNIE</code> workflow. For each of the 3 types, plots are produced for both the original, non-permuted (labeled as <code>original</code>) as well as the permuted (labeled as <code>permuted</code>) data.</p>
<p>The AR classification is run for the RNA expression data (labeled as <code>expression</code>) and can additionally also be run for TF activity data (labeled as <code>TFActivity</code>, see the function <code>addConnections_TF_peak</code> and its parameter options).</p>
<p>In the following, the 3 plot types are briefly explained:</p>
<ol style="list-style-type: decimal">
<li>
<strong>Summary heatmaps (files starting with <code>TF_classification_summaryHeatmap</code>)</strong>: <a href="https://difftf.readthedocs.io/en/latest/chapter2.html#files-comparisontype-diagnosticplotsclassification1-pdf-and-comparisontype-diagnosticplotsclassification2-pdf">This is described in detail in the <code>diffTF</code> documentation.</a>
</li>
</ol>
<div class="figure">
<img src="figs/AR_heatmap_expr_real/p-1.png" alt="&lt;i&gt;AR summary heatmap&lt;/i&gt;" width="100%"><p class="caption">
<i>AR summary heatmap</i>
</p>
</div>
<ol start="2" style="list-style-type: decimal">
<li>
<strong>Summary stringency plots (files starting with <code>TF_classification_stringencyThresholds</code>)</strong>: <a href="https://difftf.readthedocs.io/en/latest/chapter2.html#files-comparisontype-diagnosticplotsclassification1-pdf-and-comparisontype-diagnosticplotsclassification2-pdf">This is described in detail in the <code>diffTF</code> documentation.</a>
</li>
</ol>
<div class="figure">
<img src="figs/AR_stringency_expr_real/p-1.png" alt="&lt;i&gt;AR stringency thresholds&lt;/i&gt;" width="100%"><p class="caption">
<i>AR stringency thresholds</i>
</p>
</div>
<ol start="3" style="list-style-type: decimal">
<li>
<strong>Density plots per TF (files starting with <code>TF_classification_densityPlotsForegroundBackground</code>)</strong>: Density plots for each TF, with one TF per page. The plot shows the foreground (red, labeled as <code>Motif</code>) and background (gray, labeled as <code>Non-motif</code>) densities of the correlation coefficient (either Pearson or Spearman, see x-axis label) from peaks with (foreground) or without (background) a (predicted) TFBS in the peak for the particular TF. The numbers in parentheses give the underlying total number of peaks.</li>
</ol>
<div class="figure">
<img src="figs/AR_density_expr_real/p-06.png" alt="&lt;i&gt;Density plots per TF&lt;/i&gt;" width="100%"><p class="caption">
<i>Density plots per TF</i>
</p>
</div>
<p>It is also possible to extract the results from the AR classification out of a <code>GRN</code> object. Currently, this can only be done manually; extractor functions that will further enhance the user experience are in the works. The results are stored in the slot <code>GRN@data$TFs$classification[[permIndex]] [[connectionType]]$TF.classification</code>. Here, <code>permIndex</code> refers to the original, non-permuted (“0”) or permuted (“1”) data, while <code>connectionType</code> is either <code>expression</code> or <code>TFActivity</code>, depending on whether the pipeline has also been run for TF activity in addition to expression (see function <code>addConnections_TF_peak</code>). Thus, typically, the results for the original data are stored in <code>GRN@data$TFs$classification[["0"]] [["expression"]]$TF.classification</code>. If intermediate results from the classification have not been deleted (the default is to delete them as they can occupy a large amount of memory in the object, see the parameters of <code>AR_classification_wrapper</code> for details), they can be accessed similarly within <code>GRN@data$TFs$classification[[permIndex]] [[connectionType]]</code>: <code>TF_cor_median_foreground</code>, <code>TF_cor_median_background</code>, <code>TF_peak_cor_foreground</code>, <code>TF_peak_cor_background</code>, and <code>act.rep.thres.l</code>.</p>
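<p>The nesting described above can be illustrated with a plain list. This is only a sketch: the list below mimics the path <code>classification[[permIndex]][[connectionType]]$TF.classification</code>, but it is not a real <code>GRN</code> object, and the TF names, column names and class labels are made up.</p>

```r
# Toy list mimicking the nesting described above (NOT a real GRN object;
# TF names, column names and labels are hypothetical)
classification <- list(
  "0" = list(expression = list(
    TF.classification = data.frame(TF = c("TF_A", "TF_B"),
                                   label = c("activator", "repressor")))),
  "1" = list(expression = list(
    TF.classification = data.frame(TF = c("TF_A", "TF_B"),
                                   label = c("undetermined", "undetermined"))))
)

permIndex      <- "0"           # "0": original data, "1": permuted data
connectionType <- "expression"  # or "TFActivity", if it has been run

res <- classification[[permIndex]][[connectionType]]$TF.classification
res

# Check which connection types are actually available before accessing them
names(classification[[permIndex]])
```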
</div>
<div id="output_peak_gene" class="section level2">
<h2 class="hasAnchor">
<a href="#output_peak_gene" class="anchor"></a>Peak-gene diagnostic plots</h2>
<p>We provide a number of diagnostic plots for the peak-gene links that are imperative for understanding the biological system and GRN. In what follows, we describe them briefly, along with some notes on expected patterns, implications etc. Note that this section is subject to continuous change.</p>
<p>We currently offer a summary QC figure for the peak-gene connections that looks as follows:</p>
<div class="figure">
<img src="figs/peakGene_QC_all/p-1.png" alt="&lt;i&gt;Summary peak-gene QC figure&lt;/i&gt;" width="100%"><p class="caption">
<i>Summary peak-gene QC figure</i>
</p>
</div>
<p>As you can see, the figure is divided into two rows: the upper row focuses on the peak-gene raw p-value of the correlation results, while the lower row focuses on the peak-gene correlation coefficient. The left side visualizes the data of the corresponding metrics via density plots, while the right side bins the metrics and visualizes them with barplots to highlight differences between real and permuted data as well as between positively and negatively correlated peak-gene links (denoted as <code>r+</code> and <code>r-</code>, respectively).</p>
<p>We now describe these plots in more detail.</p>
<div id="correlation-raw-p-value-distribution" class="section level3">
<h3 class="hasAnchor">
<a href="#correlation-raw-p-value-distribution" class="anchor"></a>Correlation raw p-value distribution</h3>
<!-- unlikely that we are getting increased accessibility in a peak when a repressor binds based on the plots from diffTF (Fig3., repressor/activator footprints). I also think it’s due to strong differences in condition -->
<p>First and most importantly, we focus on the distribution of the raw p-values from the correlation tests (peak accessibility vs gene expression) of all peak-gene links. We can investigate these from multiple perspectives.</p>
<p>Let’s start with a density plot. The upper left plot shows the raw p-value density for the particular gene type as indicated in the title (here: all gene types), stratified on two levels:</p>
<ol style="list-style-type: decimal">
<li>Random (permuted, left) and non-random (real, right) connections</li>
<li>Connections that have a negative (<code>r-</code>, gray) and positive (<code>r+</code>, black) correlation coefficient, respectively</li>
</ol>
<p>Generally speaking, we consider both the random connections as well as <code>r-</code> connections as background.</p>
<p>What we would like to see is:</p>
<ul>
<li>random connections show little to no signal, with a flat curve along the x-axis, with little to no difference between <code>r+</code> and <code>r-</code> connections</li>
<li>for the real connections and <code>r+</code> links, a strong peak at small p-values, and a (marginally) flat distribution for higher ones (similar to a well-calibrated raw p-value distribution for any hypothesis test such as differential expression). For <code>r-</code> links, the peak at small p-values should be much smaller and ideally the curve is completely flat. However, from the datasets we examined, this is rarely the case.</li>
</ul>
<p>If any of these quality controls is not met, it may indicate an issue with data normalization, the underlying biology and what the GRN represents, or an issue with data size or covariates influencing the results.</p>
<p>To quantify the difference between <code>r+</code> and <code>r-</code> links within each connection type (random vs real), we can also plot the results in the form of ratios rather than densities, for either the <code>r+</code> / <code>r-</code> or the real / permuted dimension. These plots are shown in the upper right panel of the summary plot.</p>
<p>For the <code>r+</code> / <code>r-</code> dimension and permuted data, the ratios should be close to 1 across all p-value bins, while for the real data, a high ratio is typically seen for small p-values. In general, the difference between the permuted and real bar should be large for small p-values and close to 1 for larger ones.</p>
<p>For the real / permuted dimension, what we want to see is again a high ratio for small p-value bins for the <code>r+</code> links, indicating that when comparing permuted vs real, there are many more small p-value links in real data as compared to permuted. This usually does not hold true for the <code>r-</code> links, though, as can be seen also from the plot: the gray bars are smaller and closer to 1 across the whole binned p-value range.</p>
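<p>The idea behind the ratio plots can be sketched as follows for the <code>r+</code> / <code>r-</code> dimension. The p-values are simulated, and the bin width of 0.1 is an assumption chosen for illustration; the bins used in the actual plots may differ.</p>

```r
set.seed(1)
n <- 10000

# Simulated raw p-values (toy data): r+ links are enriched for small p-values,
# r- links are uniformly distributed
p_pos <- c(rbeta(n / 2, 0.5, 5), runif(n / 2))
p_neg <- runif(n)

# Bin the p-values into 10 bins of width 0.1 (assumed bin width)
breaks   <- seq(0, 1, by = 0.1)
freq_pos <- table(cut(p_pos, breaks, include.lowest = TRUE)) / length(p_pos)
freq_neg <- table(cut(p_neg, breaks, include.lowest = TRUE)) / length(p_neg)

# Ratio of relative frequencies per bin; values above 1 indicate an excess of r+ links
ratio <- as.numeric(freq_pos / freq_neg)
names(ratio) <- names(freq_pos)
round(ratio, 2)
```

<p>In this toy example, the ratio is well above 1 in the smallest p-value bin and below 1 in the higher bins, which is the pattern one hopes to see for real data.</p>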
<p>Lastly, we can also stratify the raw p-value distribution for <code>r+</code> and <code>r-</code> peak-gene connections according to different biological properties such as the peak-gene distance and others (see below). Here, we focus solely on the real data and additionally stratify the p-value distributions of peak-gene links by their genomic distance (measured as the distance of the middle of the peak to the TSS of the gene, in base pairs). For this, we divide the peak-gene distance into 10 bins of equal width, so that the genomic distance increases uniformly from bin to bin while the number of data points per bin differs:</p>
<div class="figure">
<img src="figs/peakGene_QC_all/p-2.png" alt="&lt;i&gt;Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links&lt;/i&gt;" width="100%"><p class="caption">
<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) <code>r+</code> / <code>r-</code> links</i>
</p>
</div>
<p>We generally (hope to) see that for smaller peak-gene distances (in particular those that overlap, i.e., the peak and the gene are in direct vicinity or even overlapping), the difference between r+ and r- links is bigger than for more distant links. We also include the random links, for which no difference between r+ and r- links is visible, as expected for a well-calibrated background.</p>
<p>Let’s plot the same, but stratified by peak-gene distance and <code>r+</code> / <code>r-</code> within each plot instead:</p>
<div class="figure">
<img src="figs/peakGene_QC_all/p-3.png" alt="&lt;i&gt;Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links&lt;/i&gt;" width="100%"><p class="caption">
<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) <code>r+</code> / <code>r-</code> links</i>
</p>
</div>
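<p>The equal-width distance binning used for the stratified plots above can be sketched with base R. The distances below are simulated, and the maximum distance of 250,000 bp is an arbitrary value chosen for this example.</p>

```r
set.seed(1)

# Simulated peak-gene distances in bp (toy data), skewed towards short distances
distance <- round(rexp(5000, rate = 1 / 50000))
distance <- pmin(distance, 250000)  # arbitrary maximum distance for this example

# cut() with a single number splits the observed range into that many equal-width bins
distBin <- cut(distance, breaks = 10, include.lowest = TRUE, dig.lab = 6)

# The bin widths are uniform, but the number of links per bin is not
table(distBin)
```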
</div>
<div id="correlation-coefficient-distribution" class="section level3">
<h3 class="hasAnchor">
<a href="#correlation-coefficient-distribution" class="anchor"></a>Correlation coefficient distribution</h3>
<p>So far, we analyzed the raw p-value distribution in detail. Let’s focus now on the distribution of the correlation coefficient per se.</p>
<div class="figure">
<img src="figs/peakGene_QC_all/p-4.png" alt="&lt;i&gt;Density of the correlation coefficient, stratified by (1) peak-gene distance (using equally sized bins)&lt;/i&gt;" width="100%"><p class="caption">
<i>Density of the correlation coefficient, stratified by (1) peak-gene distance (using equally sized bins)</i>
</p>
</div>
<p>Here, we can also include the <em>random</em> links for comparison. We see that the distribution of the correlation coefficient is also slightly different across bins, and most extreme for links with a small genomic distance (darker colored lines).</p>
</div>
</div>
<div id="output_connectionSummary" class="section level2">
<h2 class="hasAnchor">
<a href="#output_connectionSummary" class="anchor"></a>Connection summary plots</h2>
<p>We currently offer two different connection summary PDFs, both of which are produced by the function <code>plot_stats_connectionSummary</code>. Both PDFs show the number of connections for each node type (TF, peak, and gene), while in the boxplots, peaks are further differentiated into TF-peak and peak-gene entities. They also iterate over various parameters and show one plot per page and parameter combination, as indicated in the title:</p>
<ol style="list-style-type: decimal">
<li>
<code>allowMissingTFs</code>: TRUE or FALSE (i.e., allow TFs to be missing when summarizing the <code>eGRN</code> network. If set to <code>TRUE</code>, a valid connection may consist of just peak to gene, with no TF being connected to the peak. For more details, see the <code>R</code> help for <code>plot_stats_connectionSummary</code>)</li>
<li>
<code>allowMissingGenes</code>: TRUE or FALSE (i.e., allow genes to be missing when summarizing the <code>GRN</code> network. If set to <code>TRUE</code>, a valid connection may consist of just TF to peak, with no gene being connected to the peak. For more details, see the <code>R</code> help for <code>plot_stats_connectionSummary</code>)</li>
<li>
<code>TF_peak.connectionType</code>: either <code>expression</code> or <code>TFActivity</code>, denoting which connection type the summary is based on.</li>
</ol>
<p>Both plot types compare the connectivity for the real and permuted data (denoted as <code>Network type</code> in the boxplot PDF), which allows a better judgment of the connectivity from the real data.</p>
<p>An example page for the summary heatmap looks like this:</p>
<div class="figure">
<img src="figs/connectionsHeatmap/p-3.png" alt="&lt;i&gt;Example heatmap for the connection summary&lt;/i&gt;" width="100%"><p class="caption">
<i>Example heatmap for the connection summary</i>
</p>
</div>
<p>Here, two heatmaps are shown, one for the real (top) and one for the permuted (bottom) network. For different combinations of TF-peak and peak-gene FDRs (0.01 to 0.2), each of them shows the number of unique nodes of the given type (here: TFs).</p>
<p>The second is a multi-page summary PDF for the connections in the form of boxplots, as exemplified by the following figure:</p>
<div class="figure">
<img src="figs/connectionsBoxplot/p-04.png" alt="&lt;i&gt;Example boxplot for the connection summary&lt;/i&gt;" width="100%"><p class="caption">
<i>Example boxplot for the connection summary</i>
</p>
</div>
<p>Here, the number of connections is shown for TF-genes, TF-peaks and peak-genes, for both real and permuted data, similar to the summary heatmap (see above).</p>
</div>
<div id="output_other" class="section level2">
<h2 class="hasAnchor">
<a href="#output_other" class="anchor"></a>Output tables</h2>
<p>While the core functions do not create a designated output table during a typical <code>GRaNIE</code> analysis, the function <code>getGRNConnections</code> can be used to extract the resulting <code>eGRN</code> network from a <code>GRN</code> object and save it as a separate data frame. This can then be stored on disk or processed further within R. For more information, see the corresponding R help (<code><a href="../reference/getGRNConnections.html">?getGRNConnections</a></code>).</p>
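<p>Once the network has been extracted as a data frame, it can be written to disk with base R. The following sketch uses a small hand-made data frame as a stand-in; its column names and values are hypothetical and do not necessarily match what <code>getGRNConnections</code> returns.</p>

```r
# Toy data frame standing in for an extracted network
# (hypothetical columns and values, not actual getGRNConnections output)
connections <- data.frame(
  TF   = c("TF_A", "TF_B"),
  peak = c("chr1:1000-1500", "chr2:2000-2600"),
  gene = c("gene_1", "gene_2"),
  stringsAsFactors = FALSE
)

# Store the table on disk as a tab-separated file ...
outFile <- tempfile(fileext = ".tsv")
write.table(connections, outFile, sep = "\t", quote = FALSE, row.names = FALSE)

# ... and read it back in for further processing
connectionsReloaded <- read.delim(outFile, stringsAsFactors = FALSE)
str(connectionsReloaded)
```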
</div>
</div>
<div id="resources" class="section level1">
<h1 class="hasAnchor">
<a href="#resources" class="anchor"></a>Memory footprint and execution time, feasibility with large datasets</h1>
<p>In this section, we give an overview of the CPU and memory requirements when running or planning a <code>GRN</code> analysis.</p>
<p>Both CPU time and memory footprint primarily depend on similar factors, namely the number of</p>
<ol style="list-style-type: decimal">
<li>TFs</li>
<li>peaks</li>
<li>samples</li>
</ol>
<p>While the number of TFs is typically similar across analyses when the default database is used (HOCOMOCO + PWMScan), the number of peaks and samples may vary greatly across analyses.</p>
<div id="resources_cpu" class="section level2">
<h2 class="hasAnchor">
<a href="#resources_cpu" class="anchor"></a>CPU time</h2>
<p>A typical analysis runs within an hour or two on a standard machine with 2 to 4 cores. Unsurprisingly, CPU time depends primarily on the number of cores used by those functions that are time-consuming and support multiple cores.</p>
</div>
<div id="resources_memory" class="section level2">
<h2 class="hasAnchor">
<a href="#resources_memory" class="anchor"></a>Memory footprint</h2>
<p>The maximum memory footprint can be a few GB in <code>R</code>. We recommend not using more than about 100,000 to 200,000 peaks, as the memory footprint as well as the running time may otherwise increase notably.</p>
<p>Given that our approach is a correlation-based one, it seems preferable to maximize the number of samples while retaining only peaks that carry a biological signal that differs among samples.</p>
<p>The size of the <code>GRN</code> object is typically in the range of a few hundred MB, but can exceed 1 GB for large datasets. If you have trouble with big datasets, let us know! We always look for ways to further optimize the memory footprint.</p>
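<p>To get a feeling for how the number of peaks affects memory, object sizes can be checked with base R. This sketch uses a simulated count matrix rather than a real <code>GRN</code> object, and the variance-based filter with a median cutoff is an arbitrary choice for illustration, not a recommendation of the package.</p>

```r
set.seed(1)

# Toy count matrix: 20,000 peaks x 20 samples (simulated data)
counts <- matrix(rpois(2e4 * 20, lambda = 20), nrow = 2e4)

# Approximate memory used by the matrix
format(object.size(counts), units = "MB")

# Keeping only the more variable peaks (arbitrary cutoff: median variance)
# reduces the footprint accordingly
peakVar        <- apply(counts, 1, var)
countsFiltered <- counts[peakVar >= median(peakVar), , drop = FALSE]
format(object.size(countsFiltered), units = "MB")
```

<p>The same <code>object.size</code> call can be applied to a <code>GRN</code> object to check its size before saving or sharing it.</p>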
</div>
</div>
<div id="refs" class="section level1">
<h1 class="hasAnchor">
<a href="#refs" class="anchor"></a>References</h1>
</div>
  </div>

  <div class="col-md-3 hidden-xs hidden-sm" id="pkgdown-sidebar">

        <nav id="toc" data-toggle="toc"><h2 data-toc-skip>Contents</h2>
    </nav>
</div>

</div>



      <footer><div class="copyright">
  <p>Developed by Christian Arnold, Judith Zaugg, Rim Moussa.</p>
</div>

<div class="pkgdown">
  <p>Site built with <a href="https://pkgdown.r-lib.org/">pkgdown</a> 1.6.1.</p>
</div>

      </footer>
</div>

  


  </body>
</html>