Commit febff89d authored by Christian Arnold's avatar Christian Arnold

Version 1.3, see Changelog for Details

parent 6792c214
......@@ -133,9 +133,12 @@ A working ``R`` installation is needed and a number of packages from either CRAN
.. code-block:: R
install.packages(c("checkmate", "futile.logger", "tidyverse", "reshape2", "RColorBrewer", "ggrepel", "lsr", "modeest", "boot", "grDevices", "pheatmap", "matrixStats", "locfdr", "pheatmap"))
source("https://bioconductor.org/biocLite.R")
biocLite(c("limma", "vsn", "csaw", "DESeq2", "DiffBind", "geneplotter", "Rsamtools"))
install.packages(c("checkmate", "futile.logger", "tidyverse", "reshape2", "RColorBrewer", "ggrepel", "lsr", "modeest", "boot", "grDevices", "pheatmap", "matrixStats", "locfdr"))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("limma", "vsn", "csaw", "DESeq2", "DiffBind", "geneplotter", "Rsamtools", "preprocessCore"))
.. _docs-runOwnAnalysis:
......
......@@ -164,10 +164,10 @@ PARAMETER ``conditionComparison``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
String. Default "". Specifies the two conditions you want to compare.
String. Default "". Specifies the two conditions you want to compare. Only relevant if *conditionSummary* is specified as a factor.
Details
This parameter specifies the contrast you are making in *diffTF*. Two conditions have to be specified, separated by a comma. For example, if you want to compare GMP and MPP samples, the parameter should be "GMP,MPP". Both conditions have to be present in the column "conditionSummary" in the sample file table (see parameter ``summaryFile`` (:ref:`parameter_summaryFile`)).
This parameter is only relevant if *conditionSummary* is specified as a factor, in which case it specifies the contrast you are making in *diffTF*. Otherwise, it is ignored. Exactly two conditions have to be specified, comma-separated. For example, if you want to compare GMP and MPP samples, the parameter should be "GMP,MPP". Both conditions have to be present in the column "conditionSummary" in the sample file table (see parameter ``summaryFile`` (:ref:`parameter_summaryFile`)).
.. note:: The order of the two conditions matters. The condition specified first is the reference condition. For the "GMP,MPP" example, all log2 fold-changes will be the log2fc of *MPP* as compared to *GMP*. That means that a positive log2 fold-change means it is higher in *MPP* as compared to *GMP*. This is particularly relevant for the *allMotifs* output file.
......@@ -181,7 +181,20 @@ Summary
String. Default *conditionSummary*. Design formula for the differential accessibility analysis.
Details
This important parameter defines the actual contrast that is done in the differential analysis. That is, which groups of samples are being compared? Examples include mutant vs wild type, mutated vs. unmutated, etc. The last element in the formula must always be *conditionSummary*, which defines the two groups that are being compared. This name is currently hard-coded and required by the pipeline. Our pipeline allows including additional variables to model potential confounding variables, like gender, batches etc. For each additional variable that is part of the formula, a corresponding and identically named column in the sample summary file must be specified. For example, for an analysis that also includes the batch number of the samples, you may specify this as "*~ Treatment + conditionSummary*".
This important parameter defines the actual contrast that is done in the differential analysis. That is, which groups of samples are being compared? Examples include mutant vs wild type, mutated vs. unmutated, etc. The last element in the formula must always be *conditionSummary*, which defines the two groups that are being compared or the continuous variable that is used for inferring negative or positive changes, respectively (see parameter :ref:`parameter_designVariableTypes`). This name is currently hard-coded and required by the pipeline. Our pipeline allows including additional variables to model potential confounding variables, like gender, batches etc. For each additional variable that is part of the formula, a corresponding and identically named column in the sample summary file must be specified. For example, for an analysis that also includes the batch number of the samples, you may specify this as "*~ Treatment + conditionSummary*".
.. _parameter_designContrastRNA:
PARAMETER ``designContrastRNA``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
String. Default *conditionSummary*. Design formula for the RNA-Seq data. Only relevant and needed if parameter (:ref:`parameter_RNASeqIntegration`) is set to *true*. If missing (to increase compatibility with previous versions of *diffTF*), the default value will be taken.
Details
This important parameter defines the actual contrast that is done in the differential analysis. That is, which groups of samples are being compared? Examples include mutant vs wild type, mutated vs. unmutated, etc. The last element in the formula must always be *conditionSummary*, which defines the two groups that are being compared or the continuous variable that is used for inferring negative or positive changes, respectively (see parameter :ref:`parameter_designVariableTypes`). This name is currently hard-coded and required by the pipeline. Our pipeline allows including additional variables to model potential confounding variables, like gender, batches etc. For each additional variable that is part of the formula, a corresponding and identically named column in the sample summary file must be specified. For example, for an analysis that also includes the batch number of the samples, you may specify this as "*~ Treatment + conditionSummary*".
.. _parameter_designVariableTypes:
......@@ -193,10 +206,9 @@ Summary
String. Default *conditionSummary:factor*. The data types of **all** elements listed in ``designContrast`` (:ref:`parameter_designContrast`).
Details
Names must be separated by commas, spaces are allowed and will be eliminated automatically. The data type must be specified with a “:”, followed by either “numeric”, “integer”, “logical”, or “factor”. For example, if ``designContrast`` (:ref:`parameter_designContrast`) is specified as "*~ Treatment + conditionSummary*", the corresponding types might be "Treatment:factor, conditionSummary:factor". If a data type is specified as either "logical" or "factor", the variable will be treated as a discrete variable with a finite number of distinct possibilities (something like batch, for example). *conditionSummary* is usually specified as factor because you want to make a pairwise comparison of exactly two conditions. If *conditionSummary* is specified as "integer" or "numeric", however, the variable is treated as continuously-scaled, which changes the interpretation of the results, see the note below.
.. note:: Importantly, if the variable of interest is continuous-valued (i.e., marked as being integer or numeric), then the reported log2 fold change is per unit of change of that variable. That is, in the final Volcano plot, TFs displayed in the left side have a negative slope per unit of change of that variable, while TFs at the right side have a positive one.
Names must be separated by commas, spaces are allowed and will be eliminated automatically. The data type must be specified with a “:”, followed by either “numeric”, “integer”, “logical”, or “factor”. For example, if ``designContrast`` (:ref:`parameter_designContrast`) is specified as "*~ Treatment + conditionSummary*", the corresponding types might be "Treatment:factor, conditionSummary:factor". If a data type is specified as either "logical" or "factor", the variable will be treated as a discrete variable with a finite number of distinct possibilities (something like batch, for example).
.. note:: Importantly, *conditionSummary* can either be specified as "factor" or "numeric"/"integer", which changes the way the results are interpreted and what the log2 fold-changes represent. *conditionSummary* is usually specified as factor because you want to make a pairwise comparison of exactly two conditions. If *conditionSummary* is specified as "integer" or "numeric" (i.e., continuous-valued), however, the variable is treated as continuously-scaled, which changes the interpretation of the results: the reported log2 fold change is then per unit of change of that variable. That is, in the final Volcano plot, TFs displayed in the left side have a negative slope per unit of change of that variable, while TFs at the right side have a positive one.
.. _parameter_nPermutations:
......@@ -443,7 +455,7 @@ Summary
String. Default “”. Path to the file with RNA-Seq counts.
Details
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with normalized RNA-Seq counts. It does not matter whether the values have been variance-stabilized or not, as long as values across samples are comparable. Also, consider filtering lowly expressed genes. For guidance, you may want to read `Question 4 here <https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`_.
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with *raw* RNA-Seq counts. We apply some basic filtering for lowly expressed genes and exclude genes with small counts, so there is principally no need for manual filtering unless you want to do so. For guidance, you may want to read `Question 4 here <https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`_.
The first line must be used for labeling the samples, with column names being identical to the sample names as specific in the sample summary table (``summaryFile``, :ref:`parameter_summaryFile`). If you have RNA-Seq data for only a subset of the input samples, this is no problem - the classification will then naturally only be based on the subset. The first column must be named ENSEMBL and it must contain ENSEMBL IDs (e.g., *ENSG00000028277*) without dots. The IDs are then matched to the IDs from the file as specified in ``HOCOMOCO_mapping`` (:ref:`parameter_HOCOMOCO_mapping`).
......
......@@ -55,9 +55,9 @@ author = 'Christian Arnold, Ivan Berest, Judith B. Zaugg'
# built documents.
#
# The short X.Y version.
version = '1.2'
version = '1.3'
# The full version, including alpha/beta/rc tags.
release = '1.2.5'
release = '1.3.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -44,7 +44,7 @@ Citation
If you use this software, please cite the following reference:
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rassmussen, Kristian Helin & Judith B. Zaugg. *Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: diffTF*. 2018. in review.
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rassmussen, Holly Giles, Peter-Martin Bruch, Sascha Dietrich, Wolfgang Huber, Kristian Helin & Judith B. Zaugg. *Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: diffTF*. 2019. in review.
We also put the paper on *bioRxiv*, please read all methodological details here:
`Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/12/01/368498>`_.
......@@ -53,6 +53,10 @@ We also put the paper on *bioRxiv*, please read all methodological details here:
Change log
============================
Version 1.3 (2019-07-17)
- Various minor changes and small bug fixes as reported by users
- improved the RNA-Seq classification, further information will follow soon.
Version 1.2.5 (2019-03-13)
- Updated the TFBS_hg38_FIMO_HOCOMOCOv11 archive one more time to exclude non-assembled contigs such as HLA*. To make the pipeline more stable for such edge cases, the parameter ``dir_TFBS_sorted`` has been removed, and sorting and filtering of chromosomes is now always performed. Only chromosomes are kept in both the consensus peak files and the TFBS bed files that start with ``chr`` and are neither sex chromosomes (``chrX`` or ``chrY``) nor ``chrM``. If you want to keep sex chromosomes in your analysis (although we think this is not recommended), simply edit the Snakefile and remove the "chrX" and "chrY" occurences in the two filtering rules.
......
......@@ -99,6 +99,9 @@ checkAndLogWarningsAndErrors(par.l$designFormulaVariableTypes, checkCharacter(pa
par.l$nPermutations = snakemake@config$par_general$nPermutations
assertIntegerish(par.l$nPermutations, lower = 0)
par.l$conditionComparison = snakemake@config$par_general$conditionComparison
checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$conditionComparison, len = 1))
## PARAMS ##
assertList(snakemake@params, min.len = 1)
assertSubset(names(snakemake@params), c("", "doCyclicLoess", "allBAMS"))
......@@ -135,7 +138,6 @@ printParametersLog(par.l)
# READ METADATA #
#################
sampleData.l = readRDS(par.l$file_input_metadata)
if (length(sampleData.l) == 0) {
......@@ -163,6 +165,19 @@ if (length(colnamesNew) != nrow(sampleData.df)) {
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
designComponents.l = checkDesignIntegrity(snakemake, par.l, sampleData.df)
components3types = designComponents.l$types
variableToPermute = designComponents.l$variableToPermute
# Which of the two modes should be done, pairwise or quantitative?
comparisonMode = "quantitative"
if (components3types["conditionSummary"] == "logical" | components3types["conditionSummary"] == "factor") {
comparisonMode = "pairwise"
}
# Initiate data structures that are populated hereafter
TF_output.df = tribble(~permutation, ~TF, ~chr, ~MSS, ~MES, ~TFBSID, ~strand, ~peakID, ~limma_avgExpr, ~l2FC, ~limma_B, ~limma_t_stat, ~DESeq_ldcSE, ~DESeq_stat, ~DESeq_baseMean, ~pval, ~pval_adj)
......@@ -393,6 +408,8 @@ if (skipTF) {
}
)
res_DESeq
}
)
......@@ -405,8 +422,18 @@ if (skipTF) {
if (!skipTF) {
# Enforce the correct order
res_DESeq.df <- as.data.frame(DESeq2::results(res_DESeq, contrast = c(variableToPermute, conditionComparison[1], conditionComparison[2])))
#Enforce the correct order of the comparison
if (comparisonMode == "pairwise") {
res_DESeq.df <- as.data.frame(DESeq2::results(res_DESeq, contrast = c(variableToPermute, conditionComparison[1], conditionComparison[2])))
} else {
# Same as without specifying contrast at all
res_DESeq.df <- as.data.frame(DESeq2::results(res_DESeq, contrast = list(variableToPermute)))
}
final.TF.df = data_frame("TFBSID" = rownames(res_DESeq.df),
"DESeq_baseMean" = res_DESeq.df$baseMean,
......
......@@ -215,6 +215,8 @@ CGBins = seq(0,1, 1/par.l$nBins)
# PERMUTATIONS #
################
nPermutationsSkipped = 0
for (fileCur in par.l$files_input_TF_allMotives) {
# Log 2 fold-changes from the particular permutation
......@@ -282,6 +284,9 @@ for (fileCur in par.l$files_input_TF_allMotives) {
uniqueBins = unique(TF.motifs.all$CG.bins)
nBins = length(uniqueBins)
# Sometimes, a bin is missing in the TF.motifs.all data, therefore decreasing the apparent number of bins
nBinsAll = length(levels(TF.motifs.all$CG.bins))
nCol = ncol(perm.l[[TFCur]])
# TODO: TFBSID needed?
......@@ -381,12 +386,14 @@ for (fileCur in par.l$files_input_TF_allMotives) {
if (nBinsWithData == 0) {
message = paste0(" Not enough data for any of the ", nBins, " bins, this TF will be skipped in subsequent steps")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
nPermutationsSkipped = nPermutationsSkipped + 1
flog.info(paste0(" Not enough (non-NA) data for any of the ", nBinsAll, " bins. This may happen for individual permutations, see warning at the end."))
calculateVariance = FALSE
} else {
flog.info(paste0(" Finished calculation across bins successfully for ", nBinsWithData, " out of ", nBins, " bins"))
}
###################################################################
# Summarize bootstrap results and estimate covariance across bins #
......@@ -515,6 +522,19 @@ for (fileCur in par.l$files_input_TF_allMotives) {
} # end for each permutation
if (nPermutationsSkipped > 0) {
message = paste0("Could not calculate results for ", nPermutationsSkipped, " out of ", par.l$nPermutations , " permutations. If this happens only for a small fraction of permutations, this warning can be ignored. If this happens for a large fraction, however, the statistical significance as given by diffTF may have to be treated with caution. For individual permutations, this may happen if in none of the bins, there are at least ", par.l$minNoDatapoints, " TFBS per bin for which a log2 fold-change could be calculated beforehand.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
if (nPermutationsSkipped >= par.l$nPermutations ) {
message = paste0("This TF will be skipped in subsequent steps due to missing values across all permutations. It will not appear in the final output plots.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
}
}
# Save objects
saveRDS(list( binSummary = perm.l, covarianceSummary = summaryCov.df), file = par.l$file_output_permResults)
......
......@@ -44,8 +44,9 @@ assertList(snakemake@config, min.len = 1)
file_peaks = snakemake@config$peaks$consensusPeaks
conditionComparison = strsplit(snakemake@config$par_general$conditionComparison, ",")[[1]]
assertVector(conditionComparison, len = 2)
par.l$designFormula = snakemake@config$par_general$designContrast
checkAndLogWarningsAndErrors(par.l$designFormula, checkCharacter(par.l$designFormula, len = 1, min.chars = 3))
TFBS_dir = snakemake@config$additionalInputFiles$dir_TFBS
assertDirectoryExists(dirname(TFBS_dir), access = "r")
......@@ -63,12 +64,15 @@ assertIntegerish(par.l$nPermutations, lower = 0, len = 1)
par.l$nBootstraps = as.integer(snakemake@config$par_general$nBootstraps)
assertIntegerish(par.l$nBootstraps, len = 1)
par.l$designFormula = snakemake@config$par_general$designContrast
checkAndLogWarningsAndErrors(par.l$designFormula, checkCharacter(par.l$designFormula, len = 1, min.chars = 3))
par.l$file_input_sampleData = snakemake@config$samples$summaryFile
checkAndLogWarningsAndErrors(par.l$file_input_sampleData, checkFileExists(par.l$file_input_sampleData, access = "r"))
par.l$conditionComparison = snakemake@config$par_general$conditionComparison
checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$conditionComparison, len = 1))
par.l$plotRNASeqClassification = as.logical(snakemake@config$par_general$RNASeqIntegration)
assertFlag(par.l$plotRNASeqClassification)
## LOG ##
assertList(snakemake@log, min.len = 1)
par.l$file_log = snakemake@log[[1]]
......@@ -94,22 +98,26 @@ printParametersLog(par.l)
# TODO: First loading DESeq2 before DiffBind seems to prevent the segfault
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "DiffBind", "checkmate", "stats"), verbose = FALSE)
# Step 2
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "csaw", "checkmate", "limma", "tools", "geneplotter", "RColorBrewer"), verbose = FALSE)
# Step 2: diffPeaks
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "csaw", "checkmate", "limma", "tools", "geneplotter", "RColorBrewer", "matrixStats"), verbose = FALSE)
# Step 3
# Step 3: analyzeTF
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "modeest", "checkmate", "limma", "geneplotter", "RColorBrewer", "tools"), verbose = FALSE)
# Step 4
# Step 4: summary1
checkAndLoadPackages(c("tidyverse", "futile.logger", "modeest", "checkmate", "ggrepel"), verbose = FALSE)
# Step 5
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods", "boot"), verbose = FALSE)
# Step 5: binningTF
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods", "boot", "lsr"), verbose = FALSE)
# Step 6
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "grDevices", "pheatmap"), verbose = FALSE)
# Step 6: summaryFinal
checkAndLoadPackages(c("tidyverse", "futile.logger","ggrepel", "checkmate", "tools", "methods", "grDevices", "pheatmap"), verbose = FALSE)
if (par.l$plotRNASeqClassification) {
# Require some more packages here
checkAndLoadPackages(c( "lsr", "DESeq2", "matrixStats", "pheatmap", "preprocessCore"), verbose = FALSE)
}
# Check the version of readr, at least 1.1.0 is required to properly write gz files
......@@ -129,18 +137,37 @@ if (par.l$nPermutations == 0 && par.l$nBootstraps < 1000) {
sampleData.df = read_tsv(par.l$file_input_sampleData, col_names = TRUE, col_types = cols())
# Check the sample table
nDistValues = length(unique(sampleData.df$conditionSummary))
if (nDistValues != 2) {
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (!testSubset(unique(sampleData.df$conditionSummary), conditionComparison)) {
message = paste0("The elements specified in 'conditionComparison' in the config file must be a subset of the values in the column 'conditionSummary' in the sample file")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
components3types = checkDesignIntegrity(snakemake, par.l, sampleData.df)$types
# Check the sample table. Distinguish between the two modes: Factor and integer for conditionSummary
if (components3types["conditionSummary"] == "logical" | components3types["conditionSummary"] == "factor") {
conditionComparison = strsplit(snakemake@config$par_general$conditionComparison, ",")[[1]]
assertVector(conditionComparison, len = 2)
nDistValues = length(unique(sampleData.df$conditionSummary))
if (nDistValues != 2) {
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (!testSubset(unique(sampleData.df$conditionSummary), conditionComparison)) {
message = paste0("The elements specified in 'conditionComparison' in the config file must be a subset of the values in the column 'conditionSummary' in the sample file")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
} else {
# Test whether at least two distinct values are present
nDistinct = length(unique(sampleData.df$conditionSummary))
if (nDistinct < 2) {
message = paste0("At least 2 distinct values for the column 'conditionSummary' must be present in the sample file in the quantitative mode.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
}
# Check the design formula
designFormula = convertToFormula(par.l$designFormula, colnames(sampleData.df))
......@@ -221,7 +248,7 @@ if (file_peaks != "") {
flog.info(paste0("Peak file contains ", nrow(peaks.df), " peaks."))
if (nrow(peaks.df) > 100000) {
message = paste0("The number of peaks is very high, subsequent steps may be slow, particularly in the prepareBinning and binningTF steps. Make sure the preparingBinning step has enough memory available. We recommend at least 50 GB. Alternatively, consider decreasing the number of peaks for improved performance.")
message = paste0("The number of peaks is high (", nrow(peaks.df), "), subsequent steps may be slow, particularly the binningTF steps. If you encounter problems related to execution time during the analysis, consider decreasing the number of peaks for improved performance.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
if (snakemake@config$par_general$nPermutations > 5) {
......
This diff is collapsed.
......@@ -412,7 +412,7 @@ myMAPlot <- function(M, idx, main, minMean = 0) {
main <- paste(main, ", Number of genes:", dim(M)[1])
pl <- (qplot(mean, difference, main = main, ylim = c(-5,5), asp = 2, geom = "point", alpha = I(.5), color = I("grey30"), shape = I(16)) #
+ geom_hline(aes(yintercept=0), col = "#9850C3", show.legend = FALSE)
+ geom_hline(aes(yintercept = 0), col = "#9850C3", show.legend = FALSE)
+ geom_smooth(method = "loess", se = FALSE, col = "#5D84C5", span = .4)
+ theme_bw()
)
......@@ -664,3 +664,182 @@ checkAndLogWarningsAndErrors <- function(object, checkResult, isWarning = FALSE)
}
}
#############
# FUNCTIONS #
#############
# Code from Armando Reyes
heatmap.act.rep <- function(df.tf.peak.matrix, HOCOMOCO_mapping.df.exp, cor.m, par.l, median.cor.tfs, median.cor.tfs.non, act.rep.thres.l){
missingGenes = which(!HOCOMOCO_mapping.df.exp$ENSEMBL %in% colnames(cor.m))
if (length(missingGenes) > 0) {
HOCOMOCO_mapping.df.exp = filter(HOCOMOCO_mapping.df.exp, ENSEMBL %in% colnames(cor.m))
}
cor.r.pearson.m <- cor.m[,as.character(HOCOMOCO_mapping.df.exp$ENSEMBL)]
stopifnot(identical(colnames(df.tf.peak.matrix), as.character(HOCOMOCO_mapping.df.exp$HOCOID)))
stopifnot(identical(colnames(cor.r.pearson.m), as.character(HOCOMOCO_mapping.df.exp$ENSEMBL)))
colnames(cor.r.pearson.m) <- HOCOMOCO_mapping.df.exp$HOCOID
BREAKS = seq(-1,1,0.05)
diffDensityMat = matrix(NA, nrow = ncol(cor.r.pearson.m), ncol = length(BREAKS) - 1)
rownames(diffDensityMat) = HOCOMOCO_mapping.df.exp$HOCOID
TF_Peak_all.m <- df.tf.peak.matrix
TF_Peak.m <- TF_Peak_all.m
for (i in 1:ncol(cor.r.pearson.m)) {
TF = colnames(cor.r.pearson.m)[i]
TF_name = TF #as.character(HOCOMOCO_mapping.df.exp$HOCOID[HOCOMOCO_mapping.df.exp$HOCOID==TF])
## for the background, use all peaks
h_noMotif = hist(cor.r.pearson.m[,TF][TF_Peak_all.m[,TF] == 0], breaks = BREAKS, plot = FALSE)
## for the foreground use only peaks with less than min_mot_n different TF motifs
h_Motif = hist(cor.r.pearson.m[,TF][TF_Peak.m[,TF] != 0], breaks = BREAKS, plot = FALSE)
diff_density = h_Motif$density - h_noMotif$density
diffDensityMat[rownames(diffDensityMat) == TF_name[1], ] <- diff_density
}
diffDensityMat = diffDensityMat[!is.na(diffDensityMat[,1]),]
colnames(diffDensityMat) = signif(h_Motif$mids,1)
quantile(diffDensityMat)
## check to what extent the number of TF motifs affects the density values
n_min = ifelse(colSums(TF_Peak.m) < nrow(TF_Peak.m),colSums(TF_Peak.m), nrow(TF_Peak.m)-colSums(TF_Peak.m))
names(n_min) = HOCOMOCO_mapping.df.exp$HOCOID#[match(names(n_min), as.character(tf2ensg$ENSEMBL))]
n_min <- sapply(split(n_min,names(n_min)),sum)
quantile(n_min)
remove_smallN = which(n_min < 100)
cor(n_min[-remove_smallN],rowMax(diffDensityMat)[-remove_smallN], method = 'pearson')
factorClassificationPlot <- sort(median.cor.tfs, decreasing = TRUE)
diffDensityMat_Plot = diffDensityMat[match(names(factorClassificationPlot), rownames(diffDensityMat)), ]
diffDensityMat_Plot = diffDensityMat_Plot[!is.na(rownames(diffDensityMat_Plot)),]
annotation_rowDF = data.frame(median_diff = factorClassificationPlot[match(rownames(diffDensityMat_Plot), names(factorClassificationPlot))])
for (thresCur in names(act.rep.thres.l)) {
thresCur.v = act.rep.thres.l[[thresCur]]
labelMain = paste0(as.numeric(thresCur)*100, " / ", (1 - as.numeric(thresCur))*100, " % quantiles")
colBreaks = unique(c((-1),
thresCur.v[1],
thresCur.v[2],
1))
anno_rowDF = data.frame(threshold = cut(annotation_rowDF$median_diff, breaks = colBreaks))
rownames(anno_rowDF) = rownames(diffDensityMat_Plot)
colors = c(par.l$colorCategories["repressor"],par.l$colorCategories["not-expressed"], par.l$colorCategories["activator"])
names(colors) = levels(anno_rowDF$threshold)
pheatmap(diffDensityMat_Plot, cluster_rows = FALSE, cluster_cols = FALSE,
fontsize_row = 1.25, scale = 'row' , fontsize_col = 10, fontsize = 8, labels_col = c(-1, -0.5, 0, 0.5, 1),
annotation_row = anno_rowDF,annotation_legend = FALSE,
annotation_colors = list(threshold = colors), legend = TRUE, annotation_names_row = FALSE, main = labelMain)
}
} # end function
# Which transformation of the y values to do?
transform_yValues <- function(values, addPseudoCount = TRUE, nPermutations, onlyForZero = TRUE) {
# Should only happen with the permutation-based approach
zeros = which(values == 0)
if (length(zeros) > 0 & addPseudoCount) {
values[zeros] = 1 / nPermutations
}
-log10(values)
}
transform_yValues_caption <- function() {
"-log10"
}
my.median = function(x) median(x, na.rm = TRUE)
my.mean = function(x) mean(x, na.rm = TRUE)
checkDesignIntegrity <- function(snakemake, par.l, sampleData.df, useRNA = FALSE) {
par.l$designFormulaVariableTypes = snakemake@config$par_general$designVariableTypes
checkAndLogWarningsAndErrors(par.l$designFormulaVariableTypes, checkCharacter(par.l$designFormulaVariableTypes, len = 1, min.chars = 3))
par.l$designFormulaVariableTypes = gsub(" ", "", par.l$designFormulaVariableTypes)
components = strsplit(par.l$designFormulaVariableTypes, ",")[[1]]
par.l$conditionComparison = snakemake@config$par_general$conditionComparison
checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$conditionComparison, len = 1))
if (useRNA) {
# The design formula for RNA-Seq is different from the one we used before for ATAC-Seq
# Either take the one that the user provided or, if he did not, use a general one with only the condition
par.l$designFormulaRNA = snakemake@config$par_general$designContrastRNA
if (is.null(par.l$designFormulaRNA)) {
par.l$designFormulaRNA = "~conditionSummary"
flog.warn(paste0("Could not find the parameter designContrastRNA in the configuration file. The default of \"~conditionSummary\" will be taken as formula. If you know about confounding variables, rerun this step and add the parameter (see the Documentation for details)"))
}
designFormula = as.formula(par.l$designFormulaRNA)
} else {
par.l$designFormula= snakemake@config$par_general$designContrast
designFormula = as.formula(par.l$designFormula)
}
formulaVariables = attr(terms(designFormula), "term.labels")
checkAndLogWarningsAndErrors(components, checkVector(components, min.len = length(formulaVariables)))
# Extract the variable that defines the contrast. Always the last element in the formula
variableToPermute = formulaVariables[length(formulaVariables)]
# Split further
components2 = strsplit(components, ":")
if (!all(sapply(components2,length) == 2)) {
message = "The parameter \"designVariableTypes\" has not been specified correctly. It must contain all the variables that appear in the parameter \"designContrast\". See the documentation for details"
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
components3 = unlist(lapply(components2, "[[", 1))
components3types = tolower(unlist(lapply(components2, "[[", 2)))
names(components3types) = components3
checkAndLogWarningsAndErrors(formulaVariables, checkSubset(formulaVariables, components3))
checkAndLogWarningsAndErrors(components3types, checkSubset(components3types, c("factor", "integer", "numeric", "logical")))
# Check the sample table. Distinguish between the two modes: Factor and integer for conditionSummary
if (components3types["conditionSummary"] == "logical" | components3types["conditionSummary"] == "factor") {
nDistValues = length(unique(sampleData.df$conditionSummary))
if (nDistValues != 2) {
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
conditionsVec = strsplit(par.l$conditionComparison, ",")[[1]]
if (!testSubset(as.character(unique(sampleData.df$conditionSummary)), conditionsVec)) {
message = paste0("The specified elements for the parameter conditionComparison (", par.l$conditionComparison, ") do not correspond to what is specified in the sample summary table (", paste0(unique(sampleData.df$conditionSummary), collapse = ","), "). All elements of conditionComparison must be present in the column conditionSummary.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
} else {
# Test whether at least two distinct values are present
nDistinct = length(unique(sampleData.df$conditionSummary))
if (nDistinct < 2) {
message = paste0("At least 2 distinct values for the column 'conditionSummary' must be present in the sample file in the quantitative mode.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
}
list(types = components3types, variableToPermute = variableToPermute)
}
This diff is collapsed.
......@@ -47,7 +47,8 @@ onstart:
def read_samplesTable(samplesSummaryFile, consensusPeaks):
"""text"""
data = pandas.read_table(samplesSummaryFile)
# Deprecated: data = pandas.read_table(samplesSummaryFile)
data = pandas.read_csv(samplesSummaryFile, sep = "\t")
# Expect a particular number of columns, do a sanity check here
......@@ -625,7 +626,7 @@ rule calcNucleotideContent:
rule binningTF:
input:
nucContent = rules.calcNucleotideContent.output.bed,
sampleDataR = rules.DiffPeaks.output.sampleDataR,
sampleDataR = rules.DiffPeaks.output.sampleDataR,
motifes = expand("{dir}/{compType}allMotifs_log2FC_perm{perm}.tsv.gz", dir = TEMP_EXTENSION_DIR, compType = compType, perm = list(range(0, nPermutationsAdjusted + 1)))
output:
permResults = expand('{dir}/{{TF}}/{extension}/{compType}{{TF}}.permutationResults.rds', dir = TF_DIR, extension = extDir, compType = compType),
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment