Commit f653cb17 authored by Christian Arnold's avatar Christian Arnold

Version 1.1.1, see changelog for details

parent 03a4b949
......@@ -747,12 +747,13 @@ Stores various log and error files.
- ``*.log`` files from R scripts: Each log file is produced by the corresponding R script and contains debugging information as well as warnings and errors:
- ``1.produceConsensusPeaks.R.log``
- ``2.DiffPeaks.R.log``
- ``3.analyzeTF.{TF}.R.log`` for each TF ``{TF}``
- ``4.summary1.R.log``
- ``5.binningTF.{TF}.log`` for each TF ``{TF}``
- ``6.summaryFinal.R.log``
- ``checkParameterValidity.R.log``
- ``produceConsensusPeaks.R.log``
- ``diffPeaks.R.log``
- ``analyzeTF.{TF}.R.log`` for each TF ``{TF}``
- ``summary1.R.log``
- ``binningTF.{TF}.log`` for each TF ``{TF}``
- ``summaryFinal.R.log``
- ``*.log`` summary files: Summary logs for user convenience, produced at very end of the pipeline only. They should contain all errors and warnings from the pipeline run.
......@@ -972,9 +973,9 @@ If an R script fails with a technical error such as ``caught segfault`` (a segme
.. code-block:: R
snakemake = readRDS("{outputFolder}/LOGS_AND_BENCHMARKS/0.checkParameters.R.rds")
snakemake = readRDS("{outputFolder}/LOGS_AND_BENCHMARKS/checkParameters.R.rds")
Replace ``{outputFolder}`` by the folder you used for the analysis, and adjust the ``0.checkParameters`` part also accordingly. Essentially, you just have to provide the path to the corresponding file that is located in the ``LOGS_AND_BENCHMARKS`` subdirectly within the specified output directory.
Replace ``{outputFolder}`` by the folder you used for the analysis, and adjust the ``checkParameters`` part also accordingly. Essentially, you just have to provide the path to the corresponding file that is located in the ``LOGS_AND_BENCHMARKS`` subdirectly within the specified output directory.
Rerunning *Snakemake*
----------------------
......
......@@ -31,20 +31,22 @@ We also put the paper on *bioRxiv*, please read all methodological details here:
Change log
============================
Version 1.1.1 (2018-08-01, coming soon)
Version 1.1.1 (2018-08-01)
- Documentation updates (referenced the bioRxiv paper, extended the section about errors)
- updated the information on how to load the snakemake object into the R workspace in the corresponding R scripts
- fixed a small bug that made the Volcano plot and the circular one appear to have switched sides.
- fixed a bug that made the labels in the Volcano plot switch sides (thanks to Jonas Ungerbeck)
- merged some diagnostic plots for the AR classification in the last step
- renamed R scripts and R log files to make them consistent with the cluster output and error files
Version 1.1 (2018-07-27)
- added a new parameter *dir_TFBS_sorted* in the config file to specify that the TFBS input files are already sorted, which saves some computation time by not resorting them
- added a new parameter ``dir_TFBS_sorted`` in the config file to specify that the TFBS input files are already sorted, which saves some computation time by not resorting them
- updated the TFBS files that are available via download (some files were not presorted correctly)
- added support for single-end BAM files. There is a new parameter *pairedEnd* in the config file that specifies whether reads are paired-end or not.
- restructured some of the permutation-related output files to save space and computation time. The rule *concatenateMotifsPerm* should now be much faster, and the TF-specific *...outputPerm.tsv.gz* files are now much smaller due to an improved column structure
- added support for single-end BAM files. There is a new parameter ``pairedEnd`` in the config file that specifies whether reads are paired-end or not.
- restructured some of the permutation-related output files to save space and computation time. The rule ``concatenateMotifsPerm`` should now be much faster, and the TF-specific ``...outputPerm.tsv.gz`` files are now much smaller due to an improved column structure
Version 1.0.1 (2018-07-25)
- fixed a bug in *2.DiffPeaks.R* that sometimes caused the step to fail, thanks to Jonas Ungerbeck for letting us know
- fixed a bug in *3.analyzeTF* for rare corner cases when *DESeq* fails
- fixed a bug in ``2.DiffPeaks.R`` that sometimes caused the step to fail, thanks to Jonas Ungerbeck for letting us know
- fixed a bug in ``3.analyzeTF`` for rare corner cases when *DESeq* fails
Version 1.0 (2018-07-01)
- released stable version
......
......@@ -99,11 +99,11 @@ assertFileExists(par.l$file_input_metadata, access = "r")
assertList(snakemake@output, min.len = 1)
assertSubset(c("", "summary", "circularPlot", "diagnosticPlots", "plotsRDS"), names(snakemake@output))
par.l$file_output_summary = snakemake@output$summary
par.l$file_plotCircular = snakemake@output$circularPlot
par.l$file_plotVolcano = snakemake@output$volcanoPlot
par.l$file_plotDiagnostic = snakemake@output$diagnosticPlots
par.l$file_output_plots = snakemake@output$plotsRDS
par.l$file_output_summary = snakemake@output$summary
par.l$file_plotCircular = snakemake@output$circularPlot
par.l$file_plotVolcano = snakemake@output$volcanoPlot
par.l$files_plotDiagnostic = snakemake@output$diagnosticPlots
par.l$file_output_plots = snakemake@output$plotsRDS
......@@ -127,12 +127,7 @@ if (par.l$plotRNASeqClassification) {
par.l$file_input_geneCountsPerSample = snakemake@config$additionalInputFiles$RNASeqCounts
assertFileExists(par.l$file_input_HOCOMOCO_mapping, access = "r")
assertFileExists(par.l$file_input_geneCountsPerSample, access = "r")
par.l$file_plotAR1 = snakemake@output$plotClassificationDiagnostic1
par.l$file_plotAR2 = snakemake@output$plotClassificationDiagnostic2
par.l$file_plotAR3 = snakemake@output$plotClassificationDiagnostic3
par.l$file_plotAR4 = snakemake@output$plotClassificationDiagnostic4
}
......@@ -145,7 +140,7 @@ par.l$file_log = snakemake@log[[1]]
allDirs = c(dirname(par.l$file_output_summary),
dirname(par.l$file_plotCircular),
dirname(par.l$file_plotDiagnostic),
dirname(par.l$files_plotDiagnostic),
dirname(par.l$file_log)
)
......@@ -482,7 +477,7 @@ if (length(TF_NA) > 0) {
diagPlots.l = list()
pdf(par.l$file_plotDiagnostic)
pdf(par.l$files_plotDiagnostic[1])
output.global.TFs.orig$pvalue = 0
if (par.l$nPermutations > 0) {
......@@ -564,12 +559,12 @@ if (par.l$nPermutations > 0) {
output.global.TFs.orig$weighted_Tstat_centralized = output.global.TFs.orig$weighted_Tstat - MLE.delta
output.global.TFs.orig$variance = as.numeric(output.global.TFs.orig$variance)
output.global.TFs.orig$pvalue2 = 2*pnorm(-abs(output.global.TFs.orig$weighted_Tstat_centralized), sd = sqrt(output.global.TFs.orig$variance))
output.global.TFs.orig$pvalue = 2*pnorm(-abs(output.global.TFs.orig$weighted_Tstat_centralized), sd = sqrt(output.global.TFs.orig$variance))
# Handle extreme cases with p-values that are practically 0 and would cause subsequent issues
index0 = which(output.global.TFs.orig$pvalue2 < .Machine$double.xmin)
index0 = which(output.global.TFs.orig$pvalue < .Machine$double.xmin)
if (length(index0) > 0) {
output.global.TFs.orig$pvalue2[index0] = .Machine$double.xmin
output.global.TFs.orig$pvalue[index0] = .Machine$double.xmin
}
}
......@@ -586,11 +581,8 @@ output.global.TFs$Cohend_factor = ifelse(output.global.TFs$weighted_CD < par.l$t
output.global.TFs$Cohend_factor = factor(output.global.TFs$Cohend_factor, levels = par.l$classes_CohensD, labels = seq_len(length(par.l$classes_CohensD)))
# TODO move?
output.global.TFs$pvalueAdj = p.adjust(output.global.TFs$pvalue, method = "BH")
#output.global.TFs$pvalueAdj2 = p.adjust(output.global.TFs$pvalue2, method = "BH")
colnamesToPlot = colnames(output.global.TFs)[-which(colnames(output.global.TFs) %in% c("TF", "sign", "classification"))]
......@@ -878,9 +870,9 @@ if (par.l$plotRNASeqClassification) {
# DIAGNOSTIC PLOTS #
####################
####################
# par.l$file_plotAR1 = "classification_activators_repressors.pdf"
pdf(file = par.l$file_plotAR1, width = 3, height = 8)
pdf(file = par.l$files_plotDiagnostic[2], width = 3, height = 8)
xlab="median pearson correlation (r)"
ylab=""
xlim= c(-0.2,0.2)
......@@ -910,10 +902,7 @@ if (par.l$plotRNASeqClassification) {
abline(v=act.rep.thres[2], col=par.l$colorCategories["activator"])
axis(side = 1, lwd = 1, line = 0, at = c(-0.2,0,0.2), cex=1)
dev.off()
# par.l$file_plotAR2 = "Heatmap_correlation_densities.pdf"
pdf(par.l$file_plotAR2, width = 2, height = 8, onefile = FALSE)
heatmap.act.rep(TF.peakMatrix.df, HOCOMOCO_mapping.df.exp)
dev.off()
......@@ -964,8 +953,7 @@ if (par.l$plotRNASeqClassification) {
# Correlation plots for the 3 classes #
#######################################
# par.l$file_plotAR3 = "correlationByClass.pdf"
pdf(par.l$file_plotAR3)
pdf(par.l$files_plotDiagnostic[3])
for (classificationCur in unique(output.global.TFs.merged$classification)) {
output.global.TFs.cur = filter(output.global.TFs.merged, classification == classificationCur)
......@@ -989,7 +977,7 @@ if (par.l$plotRNASeqClassification) {
plot(g)
}
dev.off()
#############################
# Density plots for each TF #
......@@ -997,8 +985,6 @@ if (par.l$plotRNASeqClassification) {
stopifnot(identical(colnames(t.cor.sel.matrix), colnames(t.cor.sel.matrix.non)))
# par.l$file_plotAR4 = "densityPlots.pdf"
pdf(par.l$file_plotAR4)
for (colCur in seq_len(ncol(t.cor.sel.matrix))) {
TFCur = colnames(t.cor.sel.matrix)[colCur]
......@@ -1091,36 +1077,48 @@ for (significanceThresholdCur in par.l$significanceThresholds) {
ymax = max(transform_yValues(significanceThresholdCur), max(output.global.TFs$pValueAdj_log10, na.rm = TRUE)) * 1.1
alphaValueNonSign = 0.3
# Reverse here because negative values at left mean that the condition that has been specified in the beginning is higher.
# Reverse the rev() that was done before for this plot therefore to restore the original order
labelsConditionsNew = rev(conditionComparison)
g = ggplot()
if (par.l$plotRNASeqClassification) {
g = g + geom_point(data = output.global.TFs, aes(weighted_meanDifference, pValueAdj_log10, alpha = pValue_sig, size = TFBS, fill = classification), shape=21, stroke = 0.5, color = "black") + scale_fill_manual("TF class", values = par.l$colorCategories)
g = g +
geom_rect(aes(xmin = -Inf,xmax = 0,ymin = -Inf, ymax = Inf, color = par.l$colorConditions[2]),
alpha = .3, fill = par.l$colorConditions[2], size = 0) +
geom_rect(aes(xmin = 0, xmax = Inf, ymin = -Inf,ymax = Inf, color = par.l$colorConditions[1]), alpha = .3, fill = par.l$colorConditions[1], size = 0) +
scale_color_manual(name = 'TF activity higher in', values = par.l$colorConditions, labels = conditionComparison)
} else {
g = g + geom_point(data = output.global.TFs, aes(weighted_meanDifference, pValueAdj_log10, alpha = pValue_sig, size = TFBS), shape=21, stroke = 0.5, color = "black")
g = g + geom_rect(aes(xmin = -Inf,
xmax = 0,
ymin = -Inf,
ymax = Inf, color = par.l$colorConditions[2]),
alpha = .3) +
geom_rect(aes(xmin = 0,
xmax = Inf,
ymin = -Inf,
ymax = Inf, color = par.l$colorConditions[1]),
alpha = .3)
g = g + scale_fill_manual(name = 'TF activity higher in', values = rev(par.l$colorConditions), labels = labelsConditionsNew)
}
g = g +
geom_rect(aes(xmin = 0,
xmax = Inf,
ymin = -Inf,
ymax = Inf, fill = "condition1"),
alpha = .3) +
geom_rect(aes(xmin = -Inf,
xmax = 0,
ymin = -Inf,
ymax = Inf, fill = "condition2"),
alpha = .3) +
scale_fill_manual(name = 'TF activity higher in', values = par.l$colorConditions, labels = conditionComparison) +
ylim(-0.1,ymax) +
ylab(paste0(transform_yValues_caption(), " (adj. p-value)")) +
xlab("weighted mean difference") +
scale_alpha_manual(paste0("adj. p-value < ", significanceThresholdCur), values = c(alphaValueNonSign, 1), labels = c("no", "yes")) +
geom_hline(yintercept = transform_yValues(significanceThresholdCur), linetype = "dotted")
g = g + ylim(-0.1,ymax) +
ylab(paste0(transform_yValues_caption(), " (adj. p-value)")) +
xlab("weighted mean difference") +
scale_alpha_manual(paste0("adj. p-value < ", significanceThresholdCur), values = c(alphaValueNonSign, 1), labels = c("no", "yes")) +
geom_hline(yintercept = transform_yValues(significanceThresholdCur), linetype = "dotted")
if (par.l$plotRNASeqClassification) {
g = g + geom_label_repel(data = ggrepel_df, aes(weighted_meanDifference, pValueAdj_log10, label = TF, fill = classification),
size = TFLabelSize, fontface = 'bold', color = 'black',
size = TFLabelSize, fontface = 'bold', color = 'white',
segment.size = 0.3, box.padding = unit(0.2, "lines"), max.iter = 5000,
label.padding = unit(0.2, "lines"), # how thick is connectin line
nudge_y = 0.05, nudge_x = 0, # how far from center points
......@@ -1144,12 +1142,14 @@ for (significanceThresholdCur in par.l$significanceThresholds) {
if (par.l$plotRNASeqClassification) {
g = g + guides(alpha = guide_legend(override.aes = list(size=5), order = 2),
fill = guide_legend(override.aes = list(size=5), order = 3))
fill = guide_legend(override.aes = list(size=5), order = 3),
color = guide_legend(override.aes = list(size=5), order = 1))
allPlots.l[["volcano"]] [[pValThrStr]] [[paste0(showClasses,collapse = "-")]] = g
} else {
g = g + guides(alpha = guide_legend(override.aes = list(size=5), order = 2),
fill = guide_legend(override.aes = list(size=5), order = 3))
allPlots.l[["volcano"]] [[pValThrStr]] = g
}
......
......@@ -284,13 +284,13 @@ if config["par_general"]["nPermutations"] == 0 and config["par_general"]["nBoots
# Not to be confused with the dir_scripts directory, which is only used to load the correct functions.R file in all R scripts
dir_scripts = "R/"
script_checkParValidity = "0.checkParameters.R"
script_createConsensusPeaks = "1.createConsensusPeaks.R"
script_DiffPeaks = "2.DiffPeaks.R"
script_analyzeTF = "3.analyzeTF.R"
script_summary1 = "4.summary1.R"
script_binningTF = "5.binningTF.R"
script_summaryFinal = "6.summaryFinal.R"
script_checkParValidity = "checkParameterValidity.R"
script_createConsensusPeaks = "createConsensusPeaks.R"
script_DiffPeaks = "diffPeaks.R"
script_analyzeTF = "analyzeTF.R"
script_summary1 = "summary1.R"
script_binningTF = "binningTF.R"
script_summaryFinal = "summaryFinal.R"
......@@ -314,7 +314,7 @@ rule checkParameterValidity:
output:
flag = touch(TEMP_DIR + "/" + compType + "checkParameterValidity.done"),
consPeaks = TEMP_DIR + "/" + compType + "consensusPeaks.clean.bed",
log: expand('{dir}/0.checkParameters.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/checkParameterValidity.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Check parameter validity {script_checkParValidity}..."
threads: 1
priority: 1
......@@ -329,7 +329,7 @@ rule produceConsensusPeaks:
output:
consensusPeaks_bed = TEMP_DIR + "/" + compType + "consensusPeaks.bed",
summaryPlot = TEMP_DIR + "/" + compType + "consensusPeaks_lengthDistribution.pdf"
log: expand('{dir}/1.produceConsensusPeaks.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/produceConsensusPeaks.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Calculate consensus peaks for all peak files with the script {script_createConsensusPeaks}..."
threads: 1
params:
......@@ -520,7 +520,7 @@ rule DiffPeaks:
normCounts = PEAKS_DIR + "/" + compType + "countsNormalized.tsv.gz",
plots = name_plots,
DESeqObj = PEAKS_DIR + "/" + compType + "DESeq.object.rds"
log: expand('{dir}/2.DiffPeaks.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/DiffPeaks.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Run R script {script_DiffPeaks}"
threads: 1
params:
......@@ -543,7 +543,7 @@ rule analyzeTF:
outputPermTSV = TF_DIR + "/{TF}/" + extDir + "/" + compType + "{TF}.outputPerm.tsv.gz",
outputRDS = TF_DIR + "/{TF}/" + extDir + "/" + compType + "{TF}.summary.rds",
plot_diagnostic = name_plotsDiag
log: expand('{dir}/3.analyzeTF.{{TF}}.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/analyzeTF.{{TF}}.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Run R script {script_analyzeTF} for TF {wildcards.TF}..."
threads: 1
params:
......@@ -558,7 +558,7 @@ rule summary1:
TF = expand('{dir}/{TF}/{ext}/{compType}{TF}.summary.rds', dir = TF_DIR, TF = allTF, ext = extDir, compType = compType)
output:
outputTable = FINAL_DIR + "/" + compType + "TF_vs_peak_distribution.tsv.gz"
log: expand('{dir}/4.summary1.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/summary1.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Run R script {script_summary1} ..."
threads: 1
script: dir_scripts + script_summary1
......@@ -615,7 +615,7 @@ rule binningTF:
output:
permResults = expand('{dir}/{{TF}}/{extension}/{compType}{{TF}}.permutationResults.rds', dir = TF_DIR, extension = extDir, compType = compType),
summary = expand('{dir}/{{TF}}/{extension}/{compType}{{TF}}.permutationSummary.tsv.gz', dir = TF_DIR, extension = extDir, compType = compType)
log: expand('{dir}/5.binningTF.{{TF}}.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/binningTF.{{TF}}.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Run R script {script_binningTF} for TF {wildcards.TF}..."
threads: 1
script: dir_scripts + script_binningTF
......@@ -624,7 +624,7 @@ rule binningTF:
allDiagnosticFiles = []
allDiagnosticFiles.append(FINAL_DIR + "/" + compType + "diagnosticPlots.pdf")
if config["par_general"]["RNASeqIntegration"]:
classificationFiles = expand("{dir}/{compType}diagnosticPlotsClassification{no}.pdf", dir = FINAL_DIR, compType = compType, no = [1,2,3,4])
classificationFiles = expand("{dir}/{compType}diagnosticPlotsClassification{no}.pdf", dir = FINAL_DIR, compType = compType, no = [1,2])
allDiagnosticFiles.append(classificationFiles)
......@@ -640,7 +640,7 @@ rule summaryFinal:
volcanoPlot = FINAL_DIR + "/" + compType + "summary.volcano.pdf",
diagnosticPlots = allDiagnosticFiles,
plotsRDS = FINAL_DIR + "/" + compType + "summary.circular.rds",
log: expand('{dir}/6.summaryFinal.R.log', dir = LOG_BENCHMARK_DIR)
log: expand('{dir}/summaryFinal.R.log', dir = LOG_BENCHMARK_DIR)
message: "{ruleDisplayMessage}Run R script {script_summaryFinal} ..."
threads: 1
params: TFs = ",".join(allTF)
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment