Commit 5b366a6b authored by Christian Arnold's avatar Christian Arnold
Browse files

Vignette updates

parent 357fb3fd
Pipeline #28908 passed with stage
in 15 seconds
......@@ -14,6 +14,15 @@ output:
BiocStyle::html_document
---
```{r <knitr, echo=FALSE, message=FALSE, results="hide", class.output="scroll-200"}
library("knitr")
opts_chunk$set(
tidy = TRUE,
cache = FALSE,
message = FALSE,
fig.align="center")
```
# Motivation and Necessity {#motivation}
......@@ -81,7 +90,7 @@ For more methodological details, details on how to construct these files, their
Providing sample metadata is optional, but highly recommended - if available, the sample metadata is integrated into the PCA plots to understand where the variation in the data comes from and whether any of the metadata (e.g., age, sex, sequencing batch) is associated with the PCs from a PC, indicating a batch effect that needs to be addressed before running the `GRaNIE` pipeline.
The integration of sample metadata can be achieved in the `addData()` function, see `?addData` for more information.
The integration of sample metadata can be achieved in the `addData()` function (click the link for more information).
## Hi-C data (optional) {#input_HiC}
......@@ -235,27 +244,27 @@ Each PDF consists of three parts: PCA results based on the top 500, top 1000 and
1. Multi-density plot across all samples (1 page)
```{r, echo=FALSE, fig.cap="<i>Multi-density plot for read counts across all samples</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Multi-density plot for read counts across all samples</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/PCA_peaks_raw/p-01.png")
```
2. Screeplot (1 page)
```{r, echo=FALSE, fig.cap="<i>PCA screeplot</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>PCA screeplot</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/PCA_peaks_raw/p-02.png")
```
3. Metadata correlation plot
```{r, echo=FALSE, fig.cap="<i>Metadata correlation plot for PCA</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Metadata correlation plot for PCA</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/PCA_peaks_raw/p-03.png")
```
4. PCA plots with different metadata being colored (5 or more pages, depending on available metadata)
```{r, echo=FALSE, fig.cap="<i>PCA plots for various metadata</i>", out.width = '100%'}
```{r, echo=FALSE, fig.cap="<i>PCA plots for various metadata</i>", out.width = '100%', fig.align='center'}
knitr::include_graphics("figs/PCA_peaks_raw/p-06.png")
```
......@@ -267,7 +276,7 @@ Currently, the actual PCA result data are not stored in the `GRN` object, but we
TF-peak diagnostic plots are available for each TF, and they currently look as follows:
```{r, echo=FALSE, fig.cap="<i>TF-peak diagnostic plots for an example TF</i>", out.width = '100%'}
```{r, echo=FALSE, fig.cap="<i>TF-peak diagnostic plots for an example TF</i>", out.width = '100%', fig.align='center'}
knitr::include_graphics("figs/TFPeak_fdr_orig/p-25.png")
```
......@@ -284,21 +293,21 @@ In the following, the 3 plot types are briefly explained:
1. **Summary heatmaps (files starting with `TF_classification_summaryHeatmap`)**: [This is described in detail in the `diffTF` documentation.](https://difftf.readthedocs.io/en/latest/chapter2.html#files-comparisontype-diagnosticplotsclassification1-pdf-and-comparisontype-diagnosticplotsclassification2-pdf)
```{r, echo=FALSE, fig.cap="<i>AR summary heatmap</i>", out.width = '60%'}
```{r, echo=FALSE, fig.cap="<i>AR summary heatmap</i>", out.width = '60%', fig.align='center', fig.align='center'}
knitr::include_graphics("figs/AR_heatmap_expr_real/p-1.png")
```
2. **Summary stringency plots (files starting with `TF_classification_stringencyThresholds`)**: [This is described in detail in the `diffTF` documentation.](https://difftf.readthedocs.io/en/latest/chapter2.html#files-comparisontype-diagnosticplotsclassification1-pdf-and-comparisontype-diagnosticplotsclassification2-pdf)
```{r, echo=FALSE, fig.cap="<i>AR stringency thresholds</i>", out.width = '60%'}
```{r, echo=FALSE, fig.cap="<i>AR stringency thresholds</i>", out.width = '40%', fig.align='center', fig.align='center'}
knitr::include_graphics("figs/AR_stringency_expr_real/p-1.png")
```
3. **Density plots per TF (files starting with `TF_classification_densityPlotsForegroundBackground`)**: Density plots for each TF, with one TF per page. The plot shows the foreground (red, labeled as `Motif`) and background (gray, labeled as `Non-motif`) densities of the correlation coefficient (either Pearson or Spearman, see x-axis label) from peaks with (foreground) or without (background) a (predicted) TFBS in the peak for the particular TF. The numbers in the parenthesis summarize the underlying total number of peaks.
```{r, echo=FALSE, fig.cap="<i>Density plots per TF</i>", out.width = '60%'}
```{r, echo=FALSE, fig.cap="<i>Density plots per TF</i>", out.width = '60%', fig.align='center', fig.align='center'}
knitr::include_graphics("figs/AR_density_expr_real/p-06.png")
```
......@@ -313,7 +322,7 @@ We provide a number of diagnostic plots for the peak-gene links that are imperat
We currently offer a summary QC figure for the peak-gene connections that looks as follows:
```{r, echo=FALSE, fig.cap="<i>Summary peak-gene QC figure</i>", out.width = '100%'}
```{r, echo=FALSE, fig.cap="<i>Summary peak-gene QC figure</i>", out.width = '100%', fig.align='center'}
knitr::include_graphics("figs/peakGene_QC_all/p-1.png")
```
......@@ -355,7 +364,7 @@ For the real / permuted dimension, what we want to see is again a high ratio for
Lastly, we can also stratify the raw p-value distribution for `r+` and `r-` peak-gene connections according to different biological properties such as the peak-gene distance and others (see below). Here, we focus solely on the real data and additionally stratify the p-value distributions of peak-gene links by their genomic distance (measured as the distance of the middle of the peak to the TSS of the gene, in base pairs). For this, we bin the peak-gene distance equally into 10 bins, which results in the bins containing a non-equal number of data points but genomic distance is increased uniformly from bin to bin:
```{r, echo=FALSE, fig.cap="<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/peakGene_QC_all/p-2.png")
```
......@@ -366,7 +375,7 @@ We generally (hope to) see that for smaller peak-gene distances (in particular t
Let's plot the same, but stratified by peak-gene distance and `r+` / `r-` within each plot instead:
```{r, echo=FALSE, fig.cap="<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Density of raw p-values, stratified by (1) peak-gene distance (using equally sized bins) and (2) `r+` / `r-` links</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/peakGene_QC_all/p-3.png")
```
......@@ -375,7 +384,7 @@ knitr::include_graphics("figs/peakGene_QC_all/p-3.png")
So far, we analyzed the raw p-value distribution in detail. Let's focus now on the distribution of the correlation coefficient per se.
```{r, echo=FALSE, fig.cap="<i>Density of the correlation coefficient, stratified by (1) peak-gene distance (using equally sized bins)</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Density of the correlation coefficient, stratified by (1) peak-gene distance (using equally sized bins)</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/peakGene_QC_all/p-4.png")
```
......@@ -396,7 +405,7 @@ Both plot types compare the connectivity for the real and permuted data (denoted
An example page for the summary heatmap looks like this:
```{r, echo=FALSE, fig.cap="<i>Example heatmap for the connection summary</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Example heatmap for the connection summary</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/connectionsHeatmap/p-3.png")
```
......@@ -406,7 +415,7 @@ Here, two heatmaps are shown, one for real (top) and one for permuted (bottom)
Second, a multi-page summary PDF for the connections in form of a boxplot, as exemplified with the following Figure:
```{r, echo=FALSE, fig.cap="<i>Example boxplot for the connection summary</i>", out.width = '80%'}
```{r, echo=FALSE, fig.cap="<i>Example boxplot for the connection summary</i>", out.width = '80%', fig.align='center'}
knitr::include_graphics("figs/connectionsBoxplot/p-04.png")
```
......
......@@ -56,4 +56,4 @@ The `GRaNIEData` package provides example data that can be used to test the pack
## Bug Reports, Feature Requests and Contact Information
Please check out **[https://grp-zaugg.embl-community.io/GRaNIE](https://grp-zaugg.embl-community.io/GRaNIE)** for how to get in contact with us.
\ No newline at end of file
Please check out **[the main page](https://grp-zaugg.embl-community.io/GRaNIE)** for how to get in contact with us.
\ No newline at end of file
......@@ -10,7 +10,8 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output:
BiocStyle::html_document
BiocStyle::html_document:
code_folding: hide
---
......@@ -63,7 +64,8 @@ library("knitr")
opts_chunk$set(
tidy = TRUE,
cache = FALSE,
message = FALSE)
message = FALSE,
fig.align="center")
```
First, let's load the required libraries `GRaNIE` and `GRaNIEData`. The `tidyverse` package is already loaded and attached when loading the `GRaNIE` package, but we nevertheless load it here explicitly to highlight that we'll use various `tidyverse` functions for data import.
......@@ -135,8 +137,7 @@ You may notice that the peaks and RNA-seq data have different samples being incl
We got all the data in the right format, we can start with our *GRaNIE* analysis now! We start by specifying some parameters such as the genome assembly version the data have been produced with, as well as some optional object metadata that helps us to distinguish this *GRaNIE* object from others.
```{r initializeObject, echo=TRUE, include=TRUE, class.output="scroll-200"}
# Genome assembly shortcut. Either hg19, hg38 or mm10. Both peaks and RNA data must have the same genome assembly
genomeAssembly = "hg38"
genomeAssembly = "hg38" #Either hg19, hg38 or mm10. Both peaks and RNA data must have the same genome assembly
# Optional and arbitrary list with information and metadata that is stored within the GRaNIE object
objectMetadata.l = list(name = paste0("Macrophages_infected_primed"),
......@@ -152,7 +153,6 @@ GRN = GRaNIE::initializeGRN(objectMetadata = objectMetadata.l,
genomeAssembly = genomeAssembly)
GRN
```
Initializing a *GRaNIE* object occurs in the function `initializeGRN()` and is trivial: All we need to specify is an output folder (this is where all the pipeline output is automatically being saved unless specified otherwise) and the genome assembly shortcut of the data. We currently support *hg19*, *hg38*, and *mm10*. Please contact us if you need additional genomes. The *metadata* argument is recommended but optional and may contain an arbitrarily complex named list that is stored as additional metadata for the *GRaNIE* object. Here, we decided to specify a name for the *GRaNIE* object as well as the original paths for all 3 input files and the genome assembly.
......@@ -168,14 +168,11 @@ An important consideration is data normalization for RNA and ATAC. We currently
```{r addData, echo=TRUE, include=TRUE, class.output="scroll-200"}
GRN = GRaNIE::addData(GRN,
countsPeaks.df, normalization_peaks = "DESeq_sizeFactor", idColumn_peaks = idColumn_peaks,
countsRNA.df, normalization_rna = "quantile", idColumn_RNA = idColumn_RNA,
sampleMetadata = sampleMetadata.df,
forceRerun = TRUE)
```
We can see from the output the details for the used normalization method, and the number of samples that are kept in the *GRaNIE* object. Here, all 29 samples from the RNA data are kept because they are also found in the peak data, while only 29 out of 31 samples from the peak data are also found in the RNA data, resulting in 29 shared samples overall. The RNA counts are also permuted, which will be the basis for all analysis and plots in subsequent steps that repeat the analysis for permuted data in addition to the real, non-permuted data.
......@@ -187,10 +184,8 @@ It is time for our first QC plots using the function `plotPCA_all()`! Now that w
Note that while this step is recommended to do, it is fully optional from a workflow point of view.
```{r runPCA, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::plotPCA_all(GRN, type = c("rna", "peaks"), topn = 500, forceRerun = TRUE)
```{r runPCA, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200", collapse=FALSE, results="hold"}
GRN = GRaNIE::plotPCA_all(GRN, type = c("rna", "peaks"), topn = 500, forceRerun = TRUE)
```
We can see from the output that four PDF files have been produced, each of which plots the PCA results for the most variable 500, 1000, and 5000 features, respectively. For reasons of brevity and organization, we describe their interpretation and meaning in detail in the Introductory vignette and not here, however (click here for guidance and example plots).
......@@ -202,11 +197,9 @@ Now it is time to add data for TFs and predicted TF binding sites (TFBS)! Our *G
For more parameter details, see the R help (`?addTFBS` and `?overlapPeaksAndTFBS`).
```{r addTFBS, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::addTFBS(GRN, motifFolder = folder_TFBS_first50, TFs = "all", filesTFBSPattern = "_TFBS", fileEnding = ".bed.gz", forceRerun = TRUE)
GRN = GRaNIE::overlapPeaksAndTFBS(GRN, nCores = 1, forceRerun = TRUE)
```
We see from the output that 75 TFs have been found in the specified input folder, and the number of TFBS that overlap our peaks for each of them. We now successfully added our TFs and TFBS to the *GRaNIE* object.
......@@ -231,11 +224,9 @@ The default values are usually suitable for bulk data and should result in the r
For more parameter details, see the R help (`?filterData`).
```{r filterData, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
# Chromosomes to keep for peaks. This should be a vector of chromosome names
chrToKeep_peaks = c(paste0("chr", 1:22), "chrX", "chrY")
GRN = GRaNIE::filterData(GRN, minNormalizedMean_peaks = 5, minNormalizedMeanRNA = 1, chrToKeep_peaks = chrToKeep_peaks, maxSize_peaks = 10000, forceRerun = TRUE)
```
We can see from the output that no peaks have been filtered due to their size and almost 11,000 have been filtered due to their small mean read counts, which collectively leaves around 64,000 peaks out of 75,000 originally. For the RNA data, almost half of the data has been filtered (16,211 out of around 35,000 genes).
......@@ -258,11 +249,9 @@ This function may run a while, and each time-consuming step has a built-in progr
For more parameter options and parameter details, see the R help (`?addConnections_TF_peak`).
```{r addTFPeakConnections, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::addConnections_TF_peak(GRN, plotDiagnosticPlots = TRUE,
connectionTypes = c("expression"),
corMethod = "pearson", forceRerun = TRUE)
```
From the output, we see that a total of 65 TFs have RNA-seq data available and consequently will be included and correlated with the peak accessibility. The created PDF files are mentioned in the output and these we'll take a look at now!
......@@ -313,10 +302,8 @@ For more parameter details, see the R help (`?AR_classification_wrapper`).
After steps that take up a bit of time, it may make sense to store the *GRaNIE* object to disk in order to be able to restore it at any time point. This can simply be done, for example, by saving it as an *rds* file using the built-in function *saveRDS* from R to save our *GRaNIE* object in a compressed rds format.
```{r saveObjectIntermediate, echo=TRUE, include=TRUE, eval = FALSE}
GRN_file_outputRDS = paste0(dir_output, "/GRN.rds")
saveRDS(GRN, GRN_file_outputRDS)
```
You can then, at any time point, read it back into R with the following line:
......@@ -347,11 +334,9 @@ Note that to make the function run faster, we restrict the maximum peak-gene dis
For more parameter details, see the R help (`?addConnections_peak_gene`).
```{r addPeakGeneConnections, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::addConnections_peak_gene(GRN,
overlapTypeGene = "TSS", corMethod = "pearson", promoterRange = 10000, TADs = NULL,
nCores = 1, plotDiagnosticPlots = TRUE, plotGeneTypes = list(c("all")), forceRerun = TRUE)
```
We see from the output that almost 42,000 peak-gene links have been identified that match our parameters (here: a maximum peak-gene distance of 10 kb). From these 42.000, however, only around 18,804 actually had corresponding RNA-seq data available, while RNA-seq data was missing or has been filtered for the other. This is a rather typical case, as not all known and annotated genes are included in the RNA-seq data in the first place. Similar to before, the correlations have also been performed for the permuted data.
......@@ -366,13 +351,11 @@ Let's now check some diagnostic plots for the peak-gene connections. In analogy
Now that we added both TF-peaks and peak-gene links to our *GRaNIE* object, we are ready to filter and combine them. So far, they are stored separately in the object for various reasons (see the Introductory Vignette for details), but ultimately, we aim for combining them to derive TF-peak-gene connections. To do so, we can simply run the `filterGRNAndConnectGenes()` function and filter the individual TF-peak and peak-gene links to our liking. The function has many more arguments, and we only specify a few in the example below. As before, we get a *GRaNIE* object back that now contains the merged and filtered TF-peak-gene connections that we can later extract. Some of the filters apply to the TF-peak links, some of them to the peak-gene links, the parameter name is intended to indicate that.
```{r combineAndFilter, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::filterGRNAndConnectGenes(GRN, TF_peak.fdr.threshold = 0.2,
peak_gene.fdr.threshold = 0.2, peak_gene.fdr.method = "BH",
gene.types = c("protein_coding", "lincRNA"),
allowMissingTFs = FALSE, allowMissingGenes = FALSE
)
```
The output shows the number of links before and after applying a particular filter that has been set for both real and permuted data. As expected and reassuringly, almost no connections remain for the permuted data, while the real data keeps around 2500 connections.
......@@ -427,8 +410,6 @@ GRN = GRaNIE::generateStatsSummary(GRN,
GRN = GRaNIE::plot_stats_connectionSummary(GRN, type = "heatmap", forceRerun = TRUE)
GRN = GRaNIE::plot_stats_connectionSummary(GRN, type = "boxplot", forceRerun = TRUE)
```
The output is not very informative here and just tells us about the current progress and parameter it iterates over. We can now check the two new PDF files that have been created! Please see the Introductory Vignette for examples and interpretation.
......@@ -447,9 +428,7 @@ All functions can be called individually, adjusted flexibly and the data is stor
For user convenience, all aforementioned functions can be called at once via a designated wrapper function `performAllNetworkAnalyses()`.
```{r enrichment, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = GRaNIE::performAllNetworkAnalyses(GRN, forceRerun = TRUE)
```
......@@ -458,10 +437,8 @@ GRN = GRaNIE::performAllNetworkAnalyses(GRN, forceRerun = TRUE)
We are now finished with the main workflow, all that is left to do is to save our *GRaNIE* object to disk so we can load it at a later time point without having to repeat the analysis. We recommend to run the convenience function `deleteIntermediateData()` beforehand that aims to reduce its size by deleting some intermediate data that may still be stored within the object. For more parameter details, see the R help (`?deleteIntermediateData`). Finally, as we did already in the middle of the workflow, we save the object finally in rds format.
```{r saveObject, echo=TRUE, include=TRUE, eval = FALSE, class.output="scroll-200"}
GRN = GRaNIE::deleteIntermediateData(GRN)
saveRDS(GRN, GRN_file_outputRDS)
```
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment