Commit dbc8213e authored by Christian Arnold's avatar Christian Arnold

diffTF version 1.7, see Changelog for details

parent 4cd927af
......@@ -116,11 +116,11 @@ Details
``maxCoresPerRule``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``maxCoresPerRule`` (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Integer > 0. Default 16. Maximum number of cores to use for rules that support multithreading.
Integer > 0. Default 16. Maximum number of cores to use for rules that support multithreading. This parameter is optional; if it is missing from the configuration file, a default of 4 is used.
Details
Currently, this affects only rules involving *featureCounts*, that is, *intersectPeaksAndBAM*; for the rule *intersectTFBSAndBAM*, the number of cores is hard-coded to 4. When running *Snakemake* locally, each rule will use at most this number of cores, while in a cluster setting, this value refers to the maximum number of CPUs an individual job / rule will occupy. If the node the job is executed on has fewer cores, the maximum number of cores available on that node will be used.
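For orientation, this is how the parameter might look in the configuration file — a minimal sketch assuming it sits in the ``par_general`` section like the other general parameters (surrounding keys omitted):

```json
{
  "par_general": {
    "maxCoresPerRule": 16
  }
}
```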
......@@ -277,7 +277,7 @@ Summary
Details
If the analysis should be restricted to a subset of TFs, list the names of the TFs to include here, comma-separated.
.. note:: For each TF ``{TF}``, a corresponding file ``{TF}_TFBS.bed`` needs to be present in the directory that is specified by ``dir_TFBS`` (:ref:`parameter_dir_TFBS`).
.. note:: For each TF ``{TF}``, a corresponding file ``{TF}_TFBS.bed`` needs to be present in the directory that is specified by ``dir_TFBS`` (:ref:`parameter_dir_TFBS`). The name of the TF can be arbitrary and, from version 1.7 onwards, may also contain underscores. See the changelog for details; if you run an older version of *diffTF*, please update.
.. warning:: We strongly recommending running *diffTF* with as many TF as possible due to our statistical model that we use that compares against a background model.
......@@ -308,6 +308,20 @@ Details
.. note:: RNA-Seq integration is only included in the very last step of the pipeline, so it can also easily be integrated later.
.. _parameter_debugMode:
``debugMode`` (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Logical. true or false. Default false. Enables the debug mode for R scripts. Only available for *diffTF* v1.7 or higher (added in May 2020); so far, only R scripts support the debug mode.
Details
If set to true, the debug mode for R scripts is enabled. The typical usage is as follows: if you receive errors when running one of the R scripts, set this parameter to ``true`` and restart Snakemake. You will see a printed message that the debug mode is enabled, and a corresponding R session file (``.RData``) is saved in the ``LOGS_AND_BENCHMARKS`` folder. The script then continues running, and the error will appear again. Send this file to us for troubleshooting if asked; it contains all information necessary to rerun the step on a different machine, and all input files are already read in so they are available within the R session.
.. note:: The debug mode should only be used for the step that fails, as it may produce large session files.
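To enable it, the flag can be set in the configuration file — a minimal sketch assuming the ``par_general`` section used by the other general parameters (surrounding keys omitted):

```json
{
  "par_general": {
    "debugMode": true
  }
}
```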
SECTION ``samples``
--------------------------------------------
......@@ -426,7 +440,7 @@ Summary
String. Path to the directory where the TF-specific files for TFBS results are stored.
Details
Each TF *{TF}* has to have one *BED* file, in the format *{TF}.bed*. Each file must be a valid *BED6* file with 6 columns, as follows:
Each TF *{TF}* has to have one *BED* file, in the format *{TF}.bed*. Each file must be a valid *BED6* file with exactly 6 columns, as follows:
1. chromosome
2. start
......@@ -441,7 +455,7 @@ Details
- hg38: For a pre-compiled list of 767 human TF with in-silico predicted TFBS based on the *HOCOMOCO 11* database and *FIMO* from the MEME suite for hg38, `download this file: <https://www.embl.de/download/zaugg/diffTF/TFBS/TFBS_hg38_FIMO_HOCOMOCOv11.tar.gz>`_. For a pre-compiled list of 768 human TF with in-silico predicted TFBS based on the *HOCOMOCO 11* database and *PWMScan* for hg38, `download this file: <https://www.embl.de/download/zaugg/diffTF/TFBS/TFBS_hg38_PWMScan_HOCOMOCOv11.tar.gz>`__
- mm10: For a pre-compiled list of 422 mouse TF with in-silico predicted TFBS based on the *HOCOMOCO 10* database and *PWMScan* for mm10, `download this file: <https://www.embl.de/download/zaugg/diffTF/TFBS/TFBS_mm10_PWMScan_HOCOMOCOv10.tar.gz>`__
However, you may also manually create these files to include additional TF of your choice or to be more or less stringent with the predicted TFBS. For this, you only need PWMs for the TF of interest and then a motif prediction tool such as *FIMO* or *MOODS*.
However, you may also manually create these files to include additional TFs of your choice or to be more or less stringent with the predicted TFBS. For this, you only need PWMs for the TFs of interest and a motif prediction tool such as *FIMO* or *MOODS*. Note that postprocessing of the BED files may be required to make sure the files used with *diffTF* have exactly 6 columns.
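Such postprocessing is usually a one-liner. A hedged sketch with made-up file names (``CTCF_raw.bed`` / ``CTCF_TFBS.bed`` are placeholders; only the 6-column BED6 requirement comes from the documentation):

```shell
# Demo motif-scan output with one extra, seventh column:
printf 'chr1\t100\t110\tCTCF\t9.5\t+\tmotif_extra\n' > CTCF_raw.bed

# Keep only the first 6 tab-separated columns so the file is valid BED6 for diffTF:
cut -f1-6 CTCF_raw.bed > CTCF_TFBS.bed

# Verify the column count:
awk -F'\t' '{print NF}' CTCF_TFBS.bed   # 6
```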
.. _parameter_RNASeqCounts:
......@@ -722,8 +736,9 @@ Details
The pages are as follows:
(1) Density plots of non-normalized (page 1) and normalized (page 2) mean log counts as well as their respective empirical cumulative distribution functions (ECDF, pages 3 and 4 for non-normalized and normalized mean log counts, respectively)
(2) pairwise mean-average plots (average of the log-transformed counts vs the fold-change per peak) for each of the sample pairs. This can be useful to further assess systematic differences between the samples. Note that only a maximum of 20 different pairwise plots are shown for time and efficacy reasons.
(3) mean SD plots (row standard deviations versus row means, last page)
- Page 5-6: Regular (5) and MA-plot-based shrunken (6) log2 fold-changes of the RNA-Seq counts based on the ``DESeq2`` analysis for the peaks (not the TF binding sites). Both show the log2 fold changes attributable to a given variable over the mean of normalized counts for all samples, while the latter removes the noise associated with log2 fold changes from low-count genes without requiring arbitrary filtering thresholds. Points are colored red if the adjusted p-value is less than 0.1. Points that fall out of the window are plotted as open triangles pointing either up or down. For more information, see `here <http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#ma-plot>`__.
(2) Page 7 onward: pairwise mean-average plots (average of the log-transformed counts vs the fold-change per peak) for each of the sample pairs. This can be useful to further assess systematic differences between the samples. Note that only a maximum of 20 different pairwise plots are shown for time and efficacy reasons.
(3) Last page: mean SD plots (row standard deviations versus row means)
FILE ``{comparisonType}.DESeq.object.rds``
......
......@@ -55,9 +55,9 @@ author = 'Christian Arnold, Ivan Berest, Judith B. Zaugg'
# built documents.
#
# The short X.Y version.
version = '1.6'
version = '1.7'
# The full version, including alpha/beta/rc tags.
release = '1.6'
release = '1.7'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -63,6 +63,14 @@ Open Access. DOI: `https://doi.org/10.1016/j.celrep.2019.10.106 <https://doi.org
Change log
============================
Version 1.7 (2020-05-14)
- Multiple small fixes, thanks to Guandong Shang and Jiang Kan for reporting them:
- The TF name may now contain underscores. Previously, this caused an error; it is now fixed. We also clarified the documentation about this.
- TFs with zero TFBS overlapping the peaks (and therefore overlap files with 0 lines) no longer cause an error; they are skipped in subsequent steps, analogous to TFs with between 1 and 10 TFBS.
- Fixed a bug that caused an error when running the last ``summaryFinal`` step that related to duplicated TFs in the HOCOMOCO table.
- Implemented a debug mode via the new optional parameter ``debugMode``. This mode stores the R session in a file that can be sent to us for easier troubleshooting. See the documentation for more details.
Version 1.6 (2020-01-22)
- The documentation received a major update, in particular the section output files. In addition, a few new methodological figures have been added as well as an interpretation section.
......
......@@ -69,6 +69,7 @@ assertFileExists(par.l$file_input_normFacs, access = "r")
par.l$file_input_conditionComparison = snakemake@input$condComp
assertFileExists(par.l$file_input_conditionComparison, access = "r")
## OUTPUT ##
assertList(snakemake@output, min.len = 1)
assertSubset(names(snakemake@output), c("", "outputTSV", "outputPermTSV", "outputRDS", "plot_diagnostic"))
......@@ -102,9 +103,13 @@ assertIntegerish(par.l$nPermutations, lower = 0)
par.l$conditionComparison = snakemake@config$par_general$conditionComparison
checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$conditionComparison, len = 1))
par.l$debugMode = setDebugMode(snakemake@config$par_general$debugMode)
## PARAMS ##
assertList(snakemake@params, min.len = 1)
assertSubset(names(snakemake@params), c("", "doCyclicLoess", "allBAMS"))
assertSubset(names(snakemake@params), c("", "doCyclicLoess", "allBAMS", "debugFile"))
par.l$doCyclicLoess = as.logical(snakemake@params$doCyclicLoess)
assertFlag(par.l$doCyclicLoess)
......@@ -134,6 +139,26 @@ testExistanceAndCreateDirectoriesRecursively(allDirs)
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
if (par.l$debugMode) {
flog.info(paste0("Debug mode active. Reading all files that this step requires and saving the session to ", snakemake@params$debugFile))
# Read all files already here and then save the session so as much as possible from the script can be executed without file dependencies
sampleData.l = readRDS(par.l$file_input_metadata)
normFacs = readRDS(par.l$file_input_normFacs)
nTFBS = length(readLines(par.l$file_input_peakTFOverlaps))
if (nTFBS > 0) {
overlapsAll.df = read_tidyverse_wrapper(par.l$file_input_peakTFOverlaps, type = "tsv", col_names = TRUE, col_types = cols(), comment = "#")
}
peaks.df = read_tidyverse_wrapper(par.l$file_input_peak2, type = "tsv", col_types = cols())
peaksFiltered.df = readRDS(par.l$file_input_peaks)
conditionComparison = readRDS(par.l$file_input_conditionComparison)
save(list = ls(), file = snakemake@params$debugFile)
flog.info(paste0("File ", snakemake@params$debugFile, " has been saved. You may use it for troubleshooting and debugging; see the documentation for more details."))
}
#################
# READ METADATA #
#################
......@@ -189,39 +214,37 @@ outputSummary.df = tribble(~permutation, ~TF, ~Pos_l2FC, ~Mean_l2FC, ~Median_l2
# READ OVERLAP FILE #
#####################
overlapsAll.df = read_tidyverse_wrapper(par.l$file_input_peakTFOverlaps, type = "tsv",
col_names = TRUE, col_types = cols(), comment = "#")
if (nrow(overlapsAll.df) > 0) {
colnames(overlapsAll.df) = c("annotation", "chr","MSS","MES", "strand","length", colnamesNew)
overlapsAll.df = overlapsAll.df %>%
dplyr::mutate(TFBSID = paste0(chr,":", MSS, "-",MES),
mean = apply(dplyr::select(overlapsAll.df, one_of(colnamesNew)), 1, mean),
peakID = sapply(strsplit(overlapsAll.df$annotation, split = "_", fixed = TRUE),"[[", 1)) %>%
dplyr::distinct(TFBSID, .keep_all = TRUE) %>%
dplyr::select(-one_of("length"))
# Check number of lines. If file is empty, the TF has to be skipped
nTFBS = length(readLines(par.l$file_input_peakTFOverlaps))
if (nTFBS > 0) {
overlapsAll.df = read_tidyverse_wrapper(par.l$file_input_peakTFOverlaps, type = "tsv",
col_names = TRUE, col_types = cols(), comment = "#")
nTFBS = nrow(overlapsAll.df)
colnames(overlapsAll.df) = c("annotation", "chr","MSS","MES", "strand","length", colnamesNew)
overlapsAll.df = overlapsAll.df %>%
dplyr::mutate(TFBSID = paste0(chr,":", MSS, "-",MES),
mean = apply(dplyr::select(overlapsAll.df, one_of(colnamesNew)), 1, mean),
peakID = sapply(strsplit(overlapsAll.df$annotation, split = "_", fixed = TRUE),"[[", 1)) %>%
dplyr::distinct(TFBSID, .keep_all = TRUE) %>%
dplyr::select(-one_of("length"))
skipTF = FALSE
} else {
skipTF = TRUE
}
nTFBS = nrow(overlapsAll.df)
# Create formula based on user-defined design
designFormula = convertToFormula(par.l$designFormula, colnames(sampleData.df))
formulaVariables = attr(terms(designFormula), "term.labels")
# Extract the variable that defines the contrast. Always the last element in the formula
variableToPermute = formulaVariables[length(formulaVariables)]
if (nTFBS >= par.l$minNoDatapoints) {
# Create formula based on user-defined design
designFormula = convertToFormula(par.l$designFormula, colnames(sampleData.df))
formulaVariables = attr(terms(designFormula), "term.labels")
# Extract the variable that defines the contrast. Always the last element in the formula
variableToPermute = formulaVariables[length(formulaVariables)]
# Group by peak ID: To avoid biases and dependencies based on TFBS clustering within peaks, we then select the TFBS per TF per peak with the highest average read count across all samples.
coverageAll_grouped.df = overlapsAll.df %>%
dplyr::group_by(peakID) %>%
......@@ -325,6 +348,7 @@ if (skipTF) {
peaksFiltered.df = readRDS(par.l$file_input_peaks)
# READ FILE HERE
conditionComparison = readRDS(par.l$file_input_conditionComparison)
################################
......
......@@ -83,6 +83,8 @@ assertIntegerish(par.l$nBins, lower = 1, upper = 100, len = 1)
par.l$nBootstraps = as.integer(snakemake@config$par_general$nBootstraps)
assertIntegerish(par.l$nBootstraps, len = 1)
par.l$debugMode = setDebugMode(snakemake@config$par_general$debugMode)
## WILDCARDS ##
assertList(snakemake@wildcards, min.len = 1)
assertSubset(names(snakemake@wildcards), c("", "TF"))
......@@ -125,6 +127,49 @@ summaryCov.df = tribble(~permutation, ~bin1, ~bin2, ~weight1, ~weight2, ~cov
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
if (par.l$debugMode) {
flog.info(paste0("Debug mode active. Reading all files that this step requires and saving the session to ", snakemake@params$debugFile))
# Read all files already here and then save the session so as much as possible from the script can be executed without file dependencies
sampleData.l = readRDS(par.l$file_input_metadata)
TF.motifs.CG = read_tidyverse_wrapper(par.l$file_input_nucContentGenome, type = "tsv", col_names = TRUE,
col_types = list(
col_skip(), # "chr"
col_skip(), # "MSS"
col_skip(), # "MES"
col_character(), # "TFBSID"
col_character(), # "TF",
col_skip(), # "AT"
col_double(), # "CG"
col_skip(), # "A"
col_skip(), # "C"
col_skip(), # "G"
col_skip(), # "T"
col_skip(), # "N"
col_skip(), # "other_nucl"
col_skip() # "length"
))
TF.motifs.ori.l = list()
for (fileCur in par.l$files_input_TF_allMotives) {
TF.motifs.ori.l[[fileCur]] = read_tidyverse_wrapper(fileCur, type = "tsv", col_names = TRUE,
col_types = list(
col_character(), # "TF",
col_character(), # "TFBSID"
col_double() # "log2FoldChange", important to have double here as col_number would not parse numbers in scientific notation correctly
))
}
save(list = ls(), file = snakemake@params$debugFile)
flog.info(paste0("File ", snakemake@params$debugFile, " has been saved. You may use it for troubleshooting and debugging; see the documentation for more details."))
}
# Function for the bootstrap
ttest <- function(x, d, all) {
......
......@@ -26,6 +26,7 @@ par.l = list()
par.l$verbose = TRUE
par.l$log_minlevel = "INFO"
#####################
# VERIFY PARAMETERS #
#####################
......@@ -73,6 +74,11 @@ checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$con
par.l$plotRNASeqClassification = as.logical(snakemake@config$par_general$RNASeqIntegration)
assertFlag(par.l$plotRNASeqClassification)
par.l$debugMode = setDebugMode(snakemake@config$par_general$debugMode)
### PARAMS ##
par.l$TFBSPattern = snakemake@params$suffixTFBS
## LOG ##
assertList(snakemake@log, min.len = 1)
par.l$file_log = snakemake@log[[1]]
......@@ -89,6 +95,61 @@ testExistanceAndCreateDirectoriesRecursively(allDirs)
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
if (par.l$debugMode) {
flog.info(paste0("Debug mode active. Reading all files that this step requires and saving the session to ", snakemake@params$debugFile))
# Read all files already here and then save the session so as much as possible from the script can be executed without file dependencies
sampleData.df = read_tidyverse_wrapper(par.l$file_input_sampleData, type = "tsv", col_names = TRUE, col_types = cols())
if (file_peaks != "") {
peaks.df = read_tidyverse_wrapper(snakemake@config$peaks$consensusPeaks, type = "tsv", col_names = FALSE)
}
useAllTFs = FALSE
if (length(allTFs) == 1)
if (allTFs == "all") {
useAllTFs = TRUE
}
if (useAllTFs) {
TFs = createFileList(TFBS_dir, par.l$TFBSPattern, verbose = FALSE)
if (length(TFs) == 0) {
message = paste0("No files with pattern ", par.l$TFBSPattern, " found in directory ", TFBS_dir, " as specified by the parameter \"dir_TFBS\"")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
allTFs = gsub(par.l$TFBSPattern, "", basename(TFs))
stopifnot(length(allTFs) > 0)
}
TFs.l = list()
for (TFCur in allTFs) {
TFCur = gsub(pattern = " ",replacement = "",TFCur)
flog.info(paste0(" Checking TF ", TFCur, "..."))
fileCur = paste0(TFBS_dir, "/", TFCur, par.l$TFBSPattern)
if (!file.exists(fileCur)) {
message = paste0("File ", fileCur, " does not exist even though the TF ", TFCur, " has been specified")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
TFs.l[[TFCur]] = read_tidyverse_wrapper(fileCur, type = "tsv", ncolExpected = 6, col_names = FALSE, col_types = "ciicnc", minRows = 1)
}
save(list = ls(), file = snakemake@params$debugFile)
flog.info(paste0("File ", snakemake@params$debugFile, " has been saved. You may use it for troubleshooting and debugging; see the documentation for more details."))
}
######################
# LOAD ALL LIBRARIES #
......@@ -315,10 +376,10 @@ if (length(allTFs) == 1)
if (useAllTFs) {
TFs = createFileList(TFBS_dir, "_TFBS.bed", verbose = FALSE)
TFs = createFileList(TFBS_dir, par.l$TFBSPattern, verbose = FALSE)
if (length(TFs) == 0) {
message = paste0("No files with pattern _TFBS.bed found in directory ", TFBS_dir, " as specified by the parameter \"dir_TFBS\"")
message = paste0("No files with pattern ", par.l$TFBSPattern, " found in directory ", TFBS_dir, " as specified by the parameter \"dir_TFBS\"")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
......@@ -332,7 +393,7 @@ for (TFCur in allTFs) {
TFCur = gsub(pattern = " ",replacement = "",TFCur)
flog.info(paste0(" Checking TF ", TFCur, "..."))
fileCur = paste0(TFBS_dir, "/", TFCur, "_TFBS.bed")
fileCur = paste0(TFBS_dir, "/", TFCur, par.l$TFBSPattern)
if (!file.exists(fileCur)) {
message = paste0("File ", fileCur, " does not exist even though the TF ", TFCur, " has been specified")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
......
......@@ -66,6 +66,9 @@ assertSubset(peakType, c("raw", "bed", "narrow", "macs", "swembl", "bayes", "pea
minOverlap = snakemake@config$peaks$minOverlap
assertIntegerish(minOverlap, lower = 0)
par.l$debugMode = setDebugMode(snakemake@config$par_general$debugMode)
## LOG ##
assertList(snakemake@log, min.len = 1)
par.l$file_log = snakemake@log[[1]]
......@@ -85,13 +88,27 @@ testExistanceAndCreateDirectoriesRecursively(allDirs)
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
if (par.l$debugMode) {
flog.info(paste0("Debug mode active. Reading all files that this step requires and saving the session to ", snakemake@params$debugFile))
# Read all files already here and then save the session so as much as possible from the script can be executed without file dependencies
sampleMetaData.df = read_tidyverse_wrapper(file_sampleData, type = "tsv", col_types = cols())
save(list = ls(), file = snakemake@params$debugFile)
flog.info(paste0("File ", snakemake@params$debugFile, " has been saved. You may use it for troubleshooting and debugging; see the documentation for more details."))
}
##########################
# Create consensus peaks #
##########################
# Provide the metadata file and parse the CSV here
sampleMetaData.df = read_tidyverse_wrapper(file_sampleData, type = "tsv",
col_types = cols())
sampleMetaData.df = read_tidyverse_wrapper(file_sampleData, type = "tsv", col_types = cols())
assertSubset(c("SampleID", "bamReads", "conditionSummary", "Peaks"), colnames(sampleMetaData.df))
assertIntegerish(minOverlap, upper = nrow(sampleMetaData.df))
......
......@@ -86,9 +86,11 @@ checkAndLogWarningsAndErrors(par.l$nPermutations, checkIntegerish(par.l$nPermuta
par.l$conditionComparison = snakemake@config$par_general$conditionComparison
checkAndLogWarningsAndErrors(par.l$conditionComparison, checkCharacter(par.l$conditionComparison, len = 1))
par.l$debugMode = setDebugMode(snakemake@config$par_general$debugMode)
## PARAMS ##
checkAndLogWarningsAndErrors(snakemake@params,checkmate::checkList(snakemake@params, min.len = 1))
checkAndLogWarningsAndErrors(names(snakemake@params), checkSubset(names(snakemake@params), c("", "doCyclicLoess")))
checkAndLogWarningsAndErrors(names(snakemake@params), checkSubset(names(snakemake@params), c("", "doCyclicLoess", "debugFile")))
par.l$doCyclicLoess = as.logical(snakemake@params$doCyclicLoess)
checkAndLogWarningsAndErrors(par.l$doCyclicLoess, checkFlag(par.l$doCyclicLoess))
......@@ -115,12 +117,27 @@ startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
if (par.l$debugMode) {
flog.info(paste0("Debug mode active. Reading all files that this step requires and saving the session to ", snakemake@params$debugFile))
# Read all files already here and then save the session so as much as possible from the script can be executed without file dependencies
sampleData.df = read_tidyverse_wrapper(par.l$file_input_sampleData, type = "tsv", col_names = TRUE, col_types = cols())
coverageAll.df = read_tidyverse_wrapper(par.l$file_input_peakOverlaps, type = "tsv", col_names = TRUE, comment = "#", col_types = cols())
save(list = ls(), file = snakemake@params$debugFile)
flog.info(paste0("File ", snakemake@params$debugFile, " has been saved. You may use it for troubleshooting and debugging; see the documentation for more details."))
}
#################
# READ METADATA #
#################
sampleData.df = read_tidyverse_wrapper(par.l$file_input_sampleData, type = "tsv",
col_names = TRUE, col_types = cols())
sampleData.df = read_tidyverse_wrapper(par.l$file_input_sampleData, type = "tsv", col_names = TRUE, col_types = cols())
checkAndLogWarningsAndErrors(colnames(sampleData.df), checkSubset(c("bamReads"), colnames(sampleData.df)))
......
......@@ -80,6 +80,15 @@ startLogger <- function(logfile, level, removeOldLog = TRUE, appenderName = "con
}
stopifnot_custom <- function(testForTruth, directory) {
    assert_flag(testForTruth)
    if (!testForTruth) {
        # Log and abort with a message instead of calling flog.error() without arguments
        flog.error(paste0("Assertion failed. See the directory ", directory, " for details."))
        stop("Assertion failed.")
    }
}
printParametersLog <- function(par.l, verbose = FALSE) {
checkAndLoadPackages(c("futile.logger"), verbose = verbose)
......@@ -92,12 +101,12 @@ printParametersLog <- function(par.l, verbose = FALSE) {
}
}
read_tidyverse_wrapper <- function(file, type = "tsv", ncolExpected = NULL, ...) {
read_tidyverse_wrapper <- function(file, type = "tsv", ncolExpected = NULL, minRows = 0, ...) {
assertSubset(type, c("csv", "csv2", "tsv", "delim"))
start = Sys.time()
flog.info(paste0(" Reading file ", file))
flog.info(paste0("Reading file ", file))
if (type == "tsv") {
......@@ -123,12 +132,19 @@ read_tidyverse_wrapper <- function(file, type = "tsv", ncolExpected = NULL, ...
}
if (!is.null(ncolExpected)) {
if (ncol(tbl) != ncolExpected) {
if (! ncol(tbl) %in% ncolExpected) {
message = paste0("The file ", file, " does not have the expected number of columns (expected ", paste(ncolExpected, collapse = " or "), ", found ", ncol(tbl), ").")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
}
if (minRows > 0) {
if (nrow(tbl) < minRows) {
message = paste0("The file ", file, " does not have the expected minimum number of rows. Expected at least ", minRows, ", but found only ", nrow(tbl), ".")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
}
.printExecutionTime(start)
tbl
......@@ -955,8 +971,14 @@ chooseTFLabelSize <- function(nTF_label) {
flog.info(paste0(" Finished successfully. Execution time: ", round(endTime - startTime, 1), " ", units(endTime - startTime)))
}
# Development helper to test the error-dump mechanism below
testFunction <- function(matrix) {
    matrix2 = matrix(data = 1, nrow = 10, ncol = 10)
    checkAndLogWarningsAndErrors(NULL, FALSE, isWarning = FALSE, dumpfile = paste0(getwd(), "/diffTF_error_dump.RData"))
}
checkAndLogWarningsAndErrors <- function(object, checkResult, isWarning = FALSE) {
checkAndLogWarningsAndErrors <- function(object, checkResult, isWarning = FALSE, dumpfile = paste0(getwd(), "/diffTF_error_dump.RData")) {
assert(checkCharacter(checkResult, len = 1), checkLogical(checkResult))
......@@ -982,6 +1004,10 @@ checkAndLogWarningsAndErrors <- function(object, checkResult, isWarning = FALSE)
flog.warn(messageWarning)
warning(messageWarning)
} else {
flog.info(paste0("Saving a dump file to ", dumpfile, ". You may use this file to troubleshoot or send to others."))
save.image(file = dumpfile)
flog.error(messageError)
stop(messageError)
}
......@@ -1005,11 +1031,11 @@ filterLowlyExpressedGenes <- function(dd, comparisonMode, minMeanGroup, minMedia
assertIntegerish(minMeanGroup)
assertIntegerish(minMedianAll)
dd_counts = DESeq2::counts(dd, normalize = FALSE)
if (comparisonMode == "pairwise") {
dd_counts = DESeq2::counts(dd, normalize = FALSE)
samples_cond1 = colData(dd)$SampleID[which(colData(dd)$conditionSummary == levels(colData(dd)$conditionSummary)[1])]
samples_cond2 = colData(dd)$SampleID[which(colData(dd)$conditionSummary == levels(colData(dd)$conditionSummary)[2])]
......@@ -1210,10 +1236,10 @@ filterPeaksByRowMeans <- function(peakCounts, TF.peakMatrix = NULL, minMean = 1,
start = Sys.time()
flog.info(paste0("Filter peaks with a mean across samples of smaller than ", minMean))
index_IDColumn = which(colnames(peakCounts) == idColumn)
stopifnot(length(index_IDColumn) == 1)
index_lastColumn = which(colnames(peakCounts) == idColumn)
stopifnot(length(index_lastColumn) == 1)
rowMeans2 = rowMeans(peakCounts[,-index_IDColumn])
rowMeans2 = rowMeans(peakCounts[,-index_lastColumn])
rowsToDelete = which(rowMeans2 < minMean)
if (length(rowsToDelete) > 0) {
flog.info(paste0("Removed ", length(rowsToDelete), " peaks out of ", nrow(peakCounts),
......@@ -1239,10 +1265,12 @@ filterPeaksByRowMeans <- function(peakCounts, TF.peakMatrix = NULL, minMean = 1,
}
normalizeCounts <- function(rawCounts, method = "quantile", idColumn, removeCols = c()) {
start = Sys.time()
checkAndLoadPackages(c("tidyverse", "DESeq2", "futile.logger", "checkmate", "preprocessCore"), verbose = FALSE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "preprocessCore"), verbose = FALSE)
flog.info(paste0("Normalize counts. Method: ", method, ", ID column: ", idColumn))
......@@ -1343,9 +1371,6 @@ intersectData <- function(countsRNA, countsATAC, idColumn_RNA = "ENSEMBL", idCol
flog.info(paste0(" Number of samples for RNA before filtering: " , ncol(countsRNA) - 1))
flog.info(paste0(" Number of samples for ATAC before filtering: ", ncol(countsATAC) - 1))
# Clean ENSEMBL IDs
countsRNA = dplyr::mutate(countsRNA, ENSEMBL = gsub("\\..+", "", ENSEMBL, perl = TRUE))
# Subset ATAC and RNA to the same set of samples
sharedColumns = intersect(colnames(countsRNA), colnames(countsATAC))
......@@ -1376,9 +1401,11 @@ intersectData <- function(countsRNA, countsATAC, idColumn_RNA = "ENSEMBL", idCol
list(RNA = countsRNA.df, ATAC = countsATAC.df)
}
filterHOCOMOCOTable <- function(HOCOMOCO_table, output.global.TFs) {
filterHOCOMOCOTable <- function(HOCOMOCO_table, TFs) {
HOCOMOCO_mapping.df.overlap <- dplyr::filter(HOCOMOCO_table, HOCOID %in% output.global.TFs$TF)
HOCOMOCO_mapping.df.overlap <- HOCOMOCO_table %>%
dplyr::filter(HOCOID %in% TFs) %>%
dplyr::distinct(ENSEMBL, HOCOID)
if (nrow(HOCOMOCO_mapping.df.overlap) == 0) {
message = paste0("Number of rows of HOCOMOCO_mapping.df.overlap is 0. Something is wrong with the mapping table or the filtering")
......@@ -1463,16 +1490,6 @@ correlateATAC_RNA <- function(countsRNA, countsATAC, HOCOMOCO_mapping, corMethod
sort.cor.m = cor.m[,names(sort(colMeans(cor.m)))]
# Change the column names from ENSEMBL ID to TF names.
# Reorder to make sure the order is the same. Due to the duplication ID issue, the number of columns may increase after the column selection
colnamesIntegrity = as.character(HOCOMOCO_mapping.exp$ENSEMBL) %in% colnames(sort.cor.m)