Commit 4683bce0 authored by Christian Arnold's avatar Christian Arnold
Browse files

Version 1.8, see Changelog for details

parent 6789855e
......@@ -210,6 +210,17 @@ Details
.. note:: Importantly, *conditionSummary* can either be specified as "factor" or "numeric"/"integer", which changes the way the results are interpreted and what the log2 fold-changes represent. *conditionSummary* is usually specified as factor because you want to make a pairwise comparison of exactly two conditions. If *conditionSummary* is specified as "integer" or "numeric" (i.e., continuous-valued), however, the variable is treated as continuously-scaled, which changes the interpretation of the results: the reported log2 fold change is then per unit of change of that variable. That is, in the final Volcano plot, TFs displayed in the left side have a negative slope per unit of change of that variable, while TFs at the right side have a positive one.
``filterChromosomes`` (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
TODO
Details
TODO
.. _parameter_nPermutations:
......@@ -468,9 +479,18 @@ Summary
String. Default “”. Path to the file with RNA-Seq counts.
Details
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with *raw* RNA-Seq counts. We apply some basic filtering for lowly expressed genes and exclude genes with small counts, so there is principally no need for manual filtering unless you want to do so. For guidance, you may want to read `Question 4 here <https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`__.
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with *raw* RNA-Seq counts (i.e., non-negative integer values, see Figure below). We apply some basic filtering for lowly expressed genes and exclude genes with small counts, so there is principally no need for manual filtering unless you want to do so. For guidance, you may want to read `Question 4 here <https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`__.
The first line must be used for labeling the samples, with column names being identical to the sample names as specific in the sample summary table (``summaryFile``, :ref:`parameter_summaryFile`). If you have RNA-Seq data for only a subset of the input samples, this is no problem - the classification will then naturally only be based on the subset. The first column must be named ENSEMBL and it must contain ENSEMBL IDs (e.g., *ENSG00000028277*) without dots. The IDs are then matched to the IDs from the file as specified in ``HOCOMOCO_mapping`` (:ref:`parameter_HOCOMOCO_mapping`). For an example of how the count table should look like, check this image:
.. figure:: Figures/rawCounts_RNA.png
:scale: 40 %
:alt: raw RNA-seq counts
:align: center
Example of how the RNA-Seq raw count table has to look like. Thanks to Phoebe Valdes for the image and the permission to use it here. Click on the image for better quality.
The first line must be used for labeling the samples, with column names being identical to the sample names as specific in the sample summary table (``summaryFile``, :ref:`parameter_summaryFile`). If you have RNA-Seq data for only a subset of the input samples, this is no problem - the classification will then naturally only be based on the subset. The first column must be named ENSEMBL and it must contain ENSEMBL IDs (e.g., *ENSG00000028277*) without dots. The IDs are then matched to the IDs from the file as specified in ``HOCOMOCO_mapping`` (:ref:`parameter_HOCOMOCO_mapping`).
.. _parameter_HOCOMOCO_mapping:
......
......@@ -47,7 +47,7 @@ master_doc = 'index'
# General information about the project.
project = 'diffTF'
copyright = '2020, Christian Arnold, Ivan Berest, Judith B. Zaugg'
copyright = '2021, Christian Arnold, Ivan Berest, Judith B. Zaugg'
author = 'Christian Arnold, Ivan Berest, Judith B. Zaugg'
# The version info for the project you're documenting, acts as replacement for
......@@ -55,9 +55,9 @@ author = 'Christian Arnold, Ivan Berest, Judith B. Zaugg'
# built documents.
#
# The short X.Y version.
version = '1.7.1'
version = '1.8'
# The full version, including alpha/beta/rc tags.
release = '1.7.1'
release = '1.8'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -63,16 +63,23 @@ Open Access. DOI: `https://doi.org/10.1016/j.celrep.2019.10.106 <https://doi.org
Change log
============================
Version 1.8 (2020-07-15)
- more stringent criteria for when to include a TF in the binning step. Previously, one bin with data was enough to include the TF. We identified that in rare edge cases, this may not be enough to reliably estimate the TF activity, and now at least 2 distinct bins with enough data are required
- added a parameter *filterChr* to specify whether or not the sex chromosomes (or any other chromosomes) should be filtered from the peaks. Until now, chrX, chrY and chrM were filtered by default. If the parameter is not set explicitly, the previous behavior will be executed.
- the summaryFinal step now outputs also the DESeq object as an rds file for the RNA-Seq data. This allows to extract gene counts etc.
- other minor changes
- updated TFBS predictions for hg38
Version 1.7.1 (2020-05-20)
- Fixed one typo in the file ``TF_Gene_TranslationTables/HOCOMOCO_v10/translationTable_mm10.csv``. For the TF PAX5.S, the wrong Ensembl ID was provided (the one for PAX2 and not PAX5). This may have caused differences in the classification for PAX5.S when integrating RNA-seq data. Only this TF is affected, and this was also only an issue forthe combination of mm10 and HOCOMOCO v10. Thanks to Jiang Kan for letting us know!
Version 1.7 (2020-05-14)
- Multiple small fixes, thanks to Guandong Shang and Jiang Kan for reporting them:
- The TF name may now contain underscores. Before, that caused an error and is fixed now. We also cleared up the documentation about this.
- TFs with zero TFBS overlapping with the peaks (and therefore, overlap files with 0 lines) do not cause an error anymore and are skipped in subsequent steps, in analogy to TFs that had between 1 and 10 TFBS.
- Fixed a bug that caused an error when running the last ``summaryFinal`` step that related to duplicated TFs in the HOCOMOCO table.
- Implemented a debug mode via the new optional parameter ``debugMode``. This mode may be used to store the R session in a file and can be used to send to us for easier troubleshooting. See the documentation for more details.
Version 1.6 (2020-01-22)
......
......@@ -154,7 +154,7 @@ if (par.l$debugMode) {
TF.motifs.ori.l = list()
for (fileCur in par.l$files_input_TF_allMotives) {
TF.motifs.ori.l[[fileCur]] = read_tidyverse_wrapper(fileCur, type = "tsv", col_names = TRUE,
TF.motifs.ori.l[[fileCur]] = read_tidyverse_wrapper(fileCur, type = "tsv", col_names = FALSE,
col_types = list(
col_character(), # "TF",
col_character(), # "TFBSID"
......@@ -402,11 +402,11 @@ for (fileCur in par.l$files_input_TF_allMotives) {
} # end for each bin
if (nBinsWithData == 0) {
# Changed from requiring at least 1 to at least 2
if (nBinsWithData <= 2) {
nPermutationsSkipped = nPermutationsSkipped + 1
flog.info(paste0(" Not enough (non-NA) data for any of the ", nBinsAll, " bins. This may happen for individual permutations, see warning at the end."))
flog.info(paste0(" Not enough (non-NA) data for the bins. Data from at least 2 bins is required. This may happen for TFs with an overall small number of TFBS or for individual permutations, see warning at the end."))
calculateVariance = FALSE
} else {
......@@ -450,7 +450,7 @@ for (fileCur in par.l$files_input_TF_allMotives) {
}
}
# Estimate the variance
# Filter for bins for which we actually have data for
......
......@@ -178,7 +178,7 @@ read_tidyverse_wrapper <- function(file, type = "tsv", ncolExpected = NULL, minR
install.packages("BiocManager")
for (packageCur in packagesToInstall) {
BiocManager::install(packageCur)
BiocManager::install(packageCur, update = FALSE, ask = FALSE)
}
}
......@@ -885,7 +885,7 @@ createDebugFile <- function(snakemake) {
nWorkers = multicoreWorkers()
}
MulticoreParam(workers = nWorkers, progressBar = TRUE, stop.on.error = TRUE)
MulticoreParam(workers = nWorkers, stop.on.error = TRUE)
}
......@@ -1207,7 +1207,7 @@ createBindingMatrixFromFiles <- function(mapping, peakCounts, rootOutdir, extens
readHOCOMOCOTable <- function(file, delim = " ") {
HOCOMOCO_mapping.df = read_tidyverse_wrapper(file, type = "delim", delim = delim) %>%
HOCOMOCO_mapping.df = read_tidyverse_wrapper(file, type = "delim", delim = delim, col_types = cols()) %>%
mutate(ENSEMBL = gsub("\\..+", "", ENSEMBL, perl = TRUE)) # Clean ENSEMBL IDs
assertSubset(c("ENSEMBL", "HOCOID"), colnames(HOCOMOCO_mapping.df))
......@@ -1267,7 +1267,7 @@ filterPeaksByRowMeans <- function(peakCounts, TF.peakMatrix = NULL, minMean = 1,
normalizeCounts <- function(rawCounts, method = "quantile", idColumn, removeCols = c()) {
normalizeCounts <- function(rawCounts, method = "quantile", idColumn, removeCols = c(), returnDESeqObj = FALSE) {
start = Sys.time()
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "preprocessCore"), verbose = FALSE)
......@@ -1333,6 +1333,10 @@ normalizeCounts <- function(rawCounts, method = "quantile", idColumn, removeCols
dd = estimateSizeFactors(dd)
counts.norm = DESeq2::counts(dd, normalized = TRUE)
if (returnDESeqObj) {
return(dd)
}
} else if (method == "none") {
......@@ -1467,8 +1471,8 @@ correlateATAC_RNA <- function(countsRNA, countsATAC, HOCOMOCO_mapping, corMethod
}
HOCOMOCO_mapping.exp = dplyr::filter(HOCOMOCO_mapping, ENSEMBL %in% countsRNA.norm.TFs.df$ENSEMBL)
flog.info(paste0(" Correlate RNA-Seq and ATAC-Seq counts for ", nrow(countsATAC), " peaks and ", nrow(countsRNA.norm.TFs.df), " TF genes"))
flog.info(paste0(" Correlate RNA-Seq and ATAC-Seq counts for ", nrow(countsATAC), " peaks and ", nrow(countsRNA.norm.TFs.df), " unique TF genes."))
flog.info(paste0(" Note: For subsequent steps, the same gene may be associated with multiple TF, depending on the translation table."))
# Correlate TF gene counts with ATAC-Seq counts
# countsRNA: rows: all TF genes, columns: all samples
# countsATAC: rows: peak IDs, columns: samples
......@@ -1571,6 +1575,8 @@ calculate_classificationThresholds <- function(background, par.l) {
finalizeClassificationAndAppend <- function(output.global.TFs, median.cor.tfs, act.rep.thres.l, par.l, t.cor.sel.matrix, t.cor.sel.matrix.non, significanceThreshold_Wilcoxon = 0.05) {
checkAndLoadPackages(c("progress"), verbose = FALSE)
start = Sys.time()
flog.info(paste0("Finalize classification"))
colnameMedianCor = paste0("median.cor.tfs")
......@@ -1581,7 +1587,16 @@ finalizeClassificationAndAppend <- function(output.global.TFs, median.cor.tfs, a
colnames(AR.data)[1] = colnameMedianCor
output.global.TFs[,colnameMedianCor] = NULL
output.global.TFs = merge(output.global.TFs, AR.data, by = "TF",all.x = TRUE)
if ("TF" %in% colnames(output.global.TFs)) {
output.global.TFs = merge(output.global.TFs, AR.data, by = "TF",all.x = TRUE)
} else if ("TF.name" %in% colnames(output.global.TFs)) {
output.global.TFs = merge(output.global.TFs, AR.data, by = "TF.name",all.x = TRUE)
} else {
message = paste0("Could npt find column for merging.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
# Define classes, tidyverse style
......@@ -1599,11 +1614,15 @@ finalizeClassificationAndAppend <- function(output.global.TFs, median.cor.tfs, a
if (!is.null(significanceThreshold_Wilcoxon)) {
assertNumber(significanceThreshold_Wilcoxon, lower = 0, upper = 1)
flog.info(paste0(" Perform Wilcoxon test."))
flog.info(paste0(" Perform Wilcoxon test for each TF. This may take a few minutes."))
output.global.TFs[,colnameClassificationPVal] = NULL
# Do a Wilcoxon test for each TF as a 2nd filtering criterion
pb <- progress_bar$new(total = ncol(t.cor.sel.matrix))
for (TFCur in colnames(t.cor.sel.matrix)) {
pb$tick()
rowNo = which(output.global.TFs$TF == TFCur)
# Should normally not happen
......@@ -1676,6 +1695,15 @@ finalizeClassificationAndAppend <- function(output.global.TFs, median.cor.tfs, a
} # end if doWilcoxon
# Print a summary of the classification
flog.info(" Summary of classification:")
colnamesIndex = which(grepl("final", colnames(output.global.TFs)))
for (colnameCur in colnamesIndex) {
flog.info(paste0(" Column ", colnames(output.global.TFs)[colnameCur]))
tbl = table(output.global.TFs %>% pull(colnameCur))
flog.info(paste0(" ", paste0(names(tbl), ": ", tbl), collapse = ", "))
}
.printExecutionTime(start)
output.global.TFs
......@@ -1693,11 +1721,16 @@ plot_density <- function(foreground.m, background.m, file = NULL, ...) {
# 1. Determine maximum y-values across all TFs
yMax = 2
for (colCur in seq_len(ncol(foreground.m))) {
yMaxCur = max(c(density(foreground.m[,colCur], na.rm = TRUE)$y, density(background.m[,colCur], na.rm = T)$y))
if (yMaxCur > yMax) {
yMax = yMaxCur + 0.1
}
n_notNA = length(which(!is.na(foreground.m[,colCur])))
if (n_notNA > 1){
yMaxCur = max(c(density(foreground.m[,colCur], na.rm = TRUE)$y, density(background.m[,colCur], na.rm = T)$y))
if (yMaxCur > yMax) {
yMax = yMaxCur + 0.1
}
}
}
if (!is.null(file)) {
......@@ -1711,14 +1744,23 @@ plot_density <- function(foreground.m, background.m, file = NULL, ...) {
dataBackground = background.m[,colCur]
mainLabel = paste0(TFCur," (#TFBS = ",length(which(!is.na(dataMotif)))," )")
plot(density(dataMotif, na.rm = TRUE), xlim = c(-1,1), ylim = c(0,yMax),
main= mainLabel, lwd=2.5, col="red", axes = FALSE, xlab = "Pearson correlation")
abline(v=0, col="black", lty=2)
legend("topleft",box.col = adjustcolor("white",alpha.f = 0), legend = c("Motif","Non-motif"), lwd = c(2,2),cex = 0.8, col = c("red","darkgrey"), lty = c(1,1) )
axis(side = 1, lwd = 1, line = 0)
axis(side = 2, lwd = 1, line = 0, las = 1)
n_notNA1 = length(which(!is.na(dataMotif)))
n_notNA2 = length(which(!is.na(dataBackground)))
if (n_notNA1 > 1 & n_notNA2 > 1 ){
plot(density(dataMotif, na.rm = TRUE), xlim = c(-1,1), ylim = c(0,yMax),
main= mainLabel, lwd=2.5, col="red", axes = FALSE, xlab = "Pearson correlation")
abline(v=0, col="black", lty=2)
legend("topleft",box.col = adjustcolor("white",alpha.f = 0), legend = c("Motif","Non-motif"), lwd = c(2,2),cex = 0.8, col = c("red","darkgrey"), lty = c(1,1) )
axis(side = 1, lwd = 1, line = 0)
axis(side = 2, lwd = 1, line = 0, las = 1)
lines(density(dataBackground, na.rm = T), lwd = 2.5, col = "darkgrey")
} else {
flog.warn(paste0(" Not enough data for estimating densities for TF ", TFCur, ", skip for plotting."))
}
lines(density(dataBackground, na.rm = T), lwd = 2.5, col = "darkgrey")
}
......@@ -1881,12 +1923,10 @@ plot_AR_thresholds <- function(median.cor.tfs, median.cor.tfs.non, par.l, act.r
}
# plot_heatmapAR(peak_TF_overlapCur.df, HOCOMOCO_mapping.df.exp, sort.cor.m.l[[index]], par.l,
# median.cor.tfs, median.cor.tfs.non, act.rep.thres.l, finalClassification = output.global.TFs,
# file = paste0(file_output_plot_densityClass, "_perm", permutationCur, ".pdf"), width = 5, height = 8)
# Code from Armando Reyes
plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m, par.l, median.cor.tfs, median.cor.tfs.non, act.rep.thres.l, finalClassification = NULL, file = NULL, ...) {
plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m, par.l,
median.cor.tfs, median.cor.tfs.non, act.rep.thres.l, finalClassification = NULL, file = NULL, ...) {
start = Sys.time()
flog.info(paste0("Plotting AR heatmap", if_else(is.null(file), "", paste0(" to file ", file))))
......@@ -1894,7 +1934,7 @@ plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m
missingGenes = which(!HOCOMOCO_mapping.df.exp$HOCOID %in% colnames(sort.cor.m))
if (length(missingGenes) > 0) {
HOCOMOCO_mapping.df.exp = dplyr::filter(HOCOMOCO_mapping.df.exp, ENSEMBL %in% colnames(sort.cor.m))
HOCOMOCO_mapping.df.exp = dplyr::filter(HOCOMOCO_mapping.df.exp, HOCOID %in% colnames(sort.cor.m))
}
if (!is.null(file)) {
......@@ -1903,9 +1943,6 @@ plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m
cor.r.pearson.m <- sort.cor.m[,as.character(HOCOMOCO_mapping.df.exp$HOCOID)]
# Filter to only expressed genes
TF.peakMatrix.filt.df = TF.peakMatrix.df[, which(colnames(TF.peakMatrix.df) %in% HOCOMOCO_mapping.df.exp$HOCOID)]
stopifnot(identical(colnames(TF.peakMatrix.filt.df), as.character(HOCOMOCO_mapping.df.exp$HOCOID)))
stopifnot(identical(colnames(cor.r.pearson.m), as.character(HOCOMOCO_mapping.df.exp$HOCOID)))
BREAKS = seq(-1,1,0.05)
......@@ -1917,13 +1954,12 @@ plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m
for (i in 1:ncol(cor.r.pearson.m)) {
TF = colnames(cor.r.pearson.m)[i]
TF_name = TF #as.character(HOCOMOCO_mapping.df.exp$HOCOID[HOCOMOCO_mapping.df.exp$HOCOID==TF])
## for the background, use all peaks
h_noMotif = hist(cor.r.pearson.m[,TF][TF_Peak_all.m[,TF] == 0], breaks = BREAKS, plot = FALSE)
## for the foreground use only peaks with less than min_mot_n different TF motifs
h_Motif = hist(cor.r.pearson.m[,TF][TF_Peak.m[,TF] != 0], breaks = BREAKS, plot = FALSE)
h_Motif = hist(cor.r.pearson.m[,TF][TF_Peak.m[,TF] != 0], breaks = BREAKS, plot = FALSE)
diff_density = h_Motif$density - h_noMotif$density
diffDensityMat[rownames(diffDensityMat) == TF_name[1], ] <- diff_density
diffDensityMat[rownames(diffDensityMat) == TF, ] <- diff_density
}
diffDensityMat = diffDensityMat[!is.na(diffDensityMat[,1]),]
colnames(diffDensityMat) = signif(h_Motif$mids,1)
......@@ -1933,7 +1969,10 @@ plot_heatmapAR <- function(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp, sort.cor.m
n_min = if_else(colSums(TF_Peak.m) < nrow(TF_Peak.m),colSums(TF_Peak.m), nrow(TF_Peak.m) - colSums(TF_Peak.m))
names(n_min) = HOCOMOCO_mapping.df.exp$HOCOID#[match(names(n_min), as.character(tf2ensg$ENSEMBL))]
n_min <- sapply(split(n_min,names(n_min)),sum)
quantile(n_min)
# Make sure n_min and diffDenityMat are compatible because some NA rows may have been filtered out for diffDensityMat
n_min <- n_min[rownames(diffDensityMat)]
#quantile(n_min)
remove_smallN = which(n_min < par.l$threshold_minNoTFBS_heatmap)
cor(n_min[-remove_smallN], rowMax(diffDensityMat)[-remove_smallN], method = 'pearson')
......
......@@ -376,7 +376,7 @@ NFIB ENSG00000147862 NFIB.0.D
NFIC ENSG00000141905 NFIC.0.A
NFIC ENSG00000141905 NFIC.1.A
NFIL3 ENSG00000165030 NFIL3.0.D
NFKB1 ENSG00000109320 NFKB1.0.A
NFKB1 ENSG00000109320 NFKB1.1.B
NFKB2 ENSG00000077150 NFKB2.0.B
NFYA ENSG00000001167 NFYA.0.A
NFYB ENSG00000120837 NFYB.0.A
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment