Commit 21fb4f25 authored by Christian Arnold's avatar Christian Arnold

Major update. Integrated old and new version into one

parent f3da2354
This diff is collapsed.
......@@ -85,7 +85,7 @@ Snakemake
Please ensure that you have at least version 4.3 installed. Principally, there are `multiple ways to install Snakemake <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>`_. We recommend installing it, along with all the other required software, via conda.
*samtools*, *bedtool*s, *Subread*
*samtools*, *bedtools*, *Subread*
----------------------------------
In addition, `samtools <http://www.htslib.org/download>`_, `bedtools <http://bedtools.readthedocs.io>`_ and `Subread <http://subread.sourceforge.net>`_ are needed to run *diffTF*. We recommend installing them, along with all the other required software, via conda.
......@@ -98,11 +98,10 @@ A working ``R`` installation is needed and a number of packages from either CRAN
.. code-block:: R
install.packages(c("checkmate", "futile.logger", "tidyverse", "reshape2", "gridExtra", "scales", "jsonlite", "RColorBrewer", "rlist", "ggrepel", "lsr", "modeest", "locfdr", "boot"))
install.packages(c("checkmate", "futile.logger", "tidyverse", "reshape2", "RColorBrewer", "ggrepel", "lsr", "modeest", "boot", "grDevices", "pheatmap", "matrixStats", "locfdr"))
source("https://bioconductor.org/biocLite.R")
biocLite(c("limma", "vsn", "csaw", "DESeq2", "EDASeq", "DiffBind", "geneplotter", "Rsamtools"))
biocLite(c("limma", "vsn", "csaw", "DESeq2", "DiffBind", "geneplotter", "Rsamtools"))
.. _docs-runOwnAnalysis:
......@@ -115,4 +114,4 @@ Running your own analysis is almost as easy as running the example analysis. Car
2. Modify the file ``config.json`` accordingly. For example, we strongly recommend running the analysis for all TF instead of just 50 as for the example analysis. For this, simply change the parameter “TFs” to “all”. See Section :ref:`configurationFile` for details about the meaning of the parameters. Do not delete or rename any parameters or sections.
3. Create a tab-separated file that defines the input data, in analogy to the file ``sampleData.tsv`` from the example analysis, and refer to that in the file ``config.json`` (parameter ``summaryFile``)
4. Adapt the file ``startAnalysis.sh`` if necessary (the exact command line call to Snakemake and the various Snakemake-related parameters)
5. Since running the pipeline might be computationally demanding, read Section :ref:`timeMemoryRequirements` and decide on which machine to run the pipeline. In most cases, we recommend running *diffTF* in a cluster environment. The pipeline is written in Snakemake, and we strongly suggest to also read Section :ref:`workingWithPipeline` to get a basic understanding of how the pipeline works.
5. Since running the pipeline is often computationally demanding, read Section :ref:`timeMemoryRequirements` and decide on which machine to run the pipeline. In most cases, we recommend running *diffTF* in a cluster environment (see Section :ref:`clusterEnvironment` for details). The pipeline is written in Snakemake, and we strongly suggest to also read Section :ref:`workingWithPipeline` to get a basic understanding of how the pipeline works.
This diff is collapsed.
......@@ -2,7 +2,10 @@
Biological motivation
============================
Transcription factor (TF) activity constitutes an important readout of cellular signalling pathways and thus for assessing regulatory differences across conditions. However, current technologies lack the ability to simultaneously assessing activity changes for multiple TFs and surprisingly little is known about whether a TF acts as repressor or activator. To this end, we introduce the widely applicable genome-wide method diffTF to assess differential TF binding activity and classifying TFs as activator or repressor by integrating any type of genome-wide chromatin with RNA-Seq data and in-silico predicted TF binding sites
Transcription factor (TF) activity constitutes an important readout of cellular signalling pathways and thus for assessing regulatory differences across conditions. However, current technologies lack the ability to simultaneously assessing activity changes for multiple TFs and surprisingly little is known about whether a TF acts as repressor or activator. To this end, we introduce the widely applicable genome-wide method diffTF to assess differential TF binding activity and classifying TFs as activator or repressor by integrating any type of genome-wide chromatin with RNA-Seq data and in-silico predicted TF binding sites.
For a graphical summary of the idea, see the section :ref:`workflow`
Help, contribute and contact
============================
......@@ -16,7 +19,7 @@ Citation
If you use this software, please cite the following reference:
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Kasper Dindler Rassmussen, Kristian Helin & Judith B. Zaugg. *Genome-wide quantification of differential transcription factor activity: diffTF*. 2017. submitted.
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rassmussen, Kristian Helin & Judith B. Zaugg. *Genome-wide quantification of differential transcription factor activity: diffTF*. 2018. submitted.
Change log
......
{
"par_general":
{
"outdir" : "../output",
"regionExtension" : 100,
"comparisonType" : "GMPvsMPP",
"designContrast" : "~ conditionSummary",
"designVariableTypes" : "conditionSummary:factor",
"nPermutations" : 3,
"nBootstraps" : 0,
"nCGBins" : 10,
"TFs" : "CTCF,CEBPB,SNAI2,CEBPA,UBIP1,CEBPG,CEBPD,ZFX,AP2D,PAX5.S,SNAI1,ZEB1,SP4,MBD2,IRF1,MECP2,PAX5.D,SP3,NFIA.C,SP1.A,IRF7,MYF6, NRF1,DBP,MAZ,NKX28,DLX2,GATA1,P53,ZN143,AIRE,NR2C2,HMGA1,FUBP1,TEAD3,OVOL1,HXD4,KLF1,RXRG,HNF1B,ZIC3,HNF1A,NANOG.S,GFI1,PO3F1,NR2C1,ELF5,TF65.C,NFAC3,TEAD1",
"dir_scripts" : "../../src/R",
"RNASeqIntegration" : true
},
"samples":
{
"summaryFile" : "sampleData.tsv"
"par_general": {
"outdir": "../output",
"regionExtension": 100,
"comparisonType": "GMPvsMPP.all",
"conditionComparison": "GMP,MPP",
"designContrast": "~ conditionSummary",
"designVariableTypes": "conditionSummary:factor",
"nPermutations": 5,
"nBootstraps": 0,
"nCGBins": 10,
"TFs": "CTCF,CEBPB,SNAI2,CEBPA,UBIP1,CEBPG,CEBPD,ZFX,AP2D,PAX5.S,SNAI1,ZEB1,SP4,MBD2,IRF1,MECP2,PAX5.D,SP3,NFIA.C,SP1.A,IRF7,MYF6, NRF1,DBP,MAZ,NKX28,DLX2,GATA1,P53,ZN143,AIRE,NR2C2,HMGA1,FUBP1,TEAD3,OVOL1,HXD4,KLF1,RXRG,HNF1B,ZIC3,HNF1A,NANOG.S,GFI1,PO3F1,NR2C1,ELF5,TF65.C,NFAC3,TEAD1",
"TFs_disabled": "all",
"dir_scripts": "../../src/R",
"RNASeqIntegration": true
},
"peaks":
{
"consensusPeaks" : "",
"peakType" : "narrow",
"minOverlap" : 2
},
"additionalInputFiles":
{
"refGenome_fasta" : "referenceGenome/mm10.fa",
"dir_TFBS" : "mm10/PWMScan_HOCOMOCOv10",
"RNASeqCounts" : "data/RNA-Seq/RNA.counts.tsv",
"HOCOMOCO_mapping" : "../../src/TF_Gene_TranslationTables/HOCOMOCO_v10/translationTable_mm10.csv"
"samples": {
"summaryFile": "sampleData.tsv"
},
"peaks": {
"consensusPeaks": "",
"peakType": "narrow",
"minOverlap": 2
},
"additionalInputFiles": {
"refGenome_fasta": "referenceGenome/mm10.fa",
"dir_TFBS": "mm10/PWMScan_HOCOMOCOv10",
"RNASeqCounts": "data/RNA-Seq/RNA.counts.tsv",
"HOCOMOCO_mapping": "../../src/TF_Gene_TranslationTables/HOCOMOCO_v10/translationTable_mm10.csv"
}
}
......@@ -17,4 +17,4 @@ echo "# INCREASING THE NUMBER OF CORES SPEEDS UP THE ANALYSIS #"
echo "#########################################################"
# Real run, using 2 cores
snakemake --snakefile ../../src/Snakefile --cores 2 --configfile config.json --timestamp --directory .
snakemake --snakefile ../../src/Snakefile --cores 2 --configfile config.json --timestamp
......@@ -10,5 +10,6 @@ echo "####################################################################"
echo "\nThis wrapper script calls Snakemake in dryrun mode to start diffTF. Nothing will actually be executed.\n"
# Dryrun, using 2 cores
snakemake --snakefile ../../src/Snakefile --dryrun --cores 2 --configfile config.json
snakemake --snakefile ../../src/Snakefile --dryrun --quiet --cores 2 --configfile config.json
par.l$designFormula = "~Treatment + conditionSummary"
par.l =list()
library(checkmate)
par.l$designFormula = "~Treatment + conditionSummary"
as.formula(par.l$designFormula)
formulaDesign = as.formula(par.l$designFormula)
as.formula("blaaaa")
as.formula("bla")
as.formula("~bla")
designFormula = tryCatch({
as.formula(par.l$designFormula)
}, warning = function(w) {
stop("Converting the design formula ", par.l$designFormula, " created a warning, which should be checked carefully.")
}, error = function(e) {
stop("Design formula ", par.l$designFormula, " not valid")
}
designFormula
designFormula = tryCatch({
as.formula(par.l$designFormula)
}, warning = function(w) {
stop("Converting the design formula ", par.l$designFormula, " created a warning, which should be checked carefully.")
}, error = function(e) {
stop("Design formula ", par.l$designFormula, " not valid")
}
par.l$designFormula = "~Treatment + conditionSummary"
)
designFormula = tryCatch({
as.formula(par.l$designFormula)
}, warning = function(w) {
stop("Converting the design formula ", par.l$designFormula, " created a warning, which should be checked carefully.")
}, error = function(e) {
stop("Design formula ", par.l$designFormula, " not valid")
})
designFormula
par.l$designFormula = "~Treatment design"
designFormula = tryCatch ({
as.formula(par.l$designFormula)
}, warning = function(w) {
stop("Converting the design formula ", par.l$designFormula, " created a warning, which should be checked carefully.")
}, error = function(e) {
stop("Design formula ", par.l$designFormula, " not valid")
})
designFormula = tryCatch ({
as.formula(par.l$designFormula)
}, warning = function(w) {
stop("Converting the design formula \", par.l$designFormula, "\" created a warning, which should be checked carefully.")
}, error = function(e) {
stop("Design formula \", par.l$designFormula, "\" not valid")
})
designFormula
str(designFormula)
?formula
terms(formula)
terms(designFormula)
str(terms(designFormula))
attr(terms(designFormula), "term.labels")
??removeBatchEffect
library(limma)
install.packages(limma)
install.packages("limma")
source("https://bioconductor.org/biocLite.R")
biocLite("limma")
librar(limma)
library(limma)
??removeBatchEffect
library(matrixStats)
??matrixStats
?matrixStats
......@@ -45,6 +45,8 @@ assertList(snakemake@config, min.len = 1)
file_peaks = snakemake@config$peaks$consensusPeaks
conditionComparison = strsplit(snakemake@config$par_general$conditionComparison, ",")[[1]]
assertVector(conditionComparison, len = 2)
TFBS_dir = snakemake@config$additionalInputFiles$dir_TFBS
assertDirectoryExists(dirname(TFBS_dir), access = "r")
......@@ -53,6 +55,13 @@ fastaFile = snakemake@config$additionalInputFiles$refGenome_fasta
assertFileExists(fastaFile)
assertDirectoryExists(dirname(fastaFile), access = "w")
par.l$nPermutations = snakemake@config$par_general$nPermutations
assertIntegerish(par.l$nPermutations, lower = 0, len = 1)
par.l$nBootstraps = as.integer(snakemake@config$par_general$nBootstraps)
assertIntegerish(par.l$nBootstraps, len = 1)
par.l$file_input_sampleData = snakemake@config$samples$summaryFile
checkAndLogWarningsAndErrors(par.l$file_input_sampleData, checkFileExists(par.l$file_input_sampleData, access = "r"))
......@@ -81,7 +90,7 @@ printParametersLog(par.l)
checkAndLoadPackages(c("tidyverse", "futile.logger", "DiffBind", "checkmate", "stats"), verbose = FALSE)
# Step 2
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "csaw", "checkmate", "limma", "tools", "EDASeq", "geneplotter", "RColorBrewer", "BiocParallel", "rlist"), verbose = FALSE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "csaw", "checkmate", "limma", "tools", "geneplotter", "RColorBrewer"), verbose = FALSE)
# Step 3
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "modeest", "checkmate", "limma", "geneplotter", "RColorBrewer", "tools"), verbose = FALSE)
......@@ -90,11 +99,24 @@ checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "modeest",
checkAndLoadPackages(c("tidyverse", "futile.logger", "modeest", "checkmate", "ggrepel"), verbose = FALSE)
# Step 5
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods"), verbose = FALSE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods", "boot"), verbose = FALSE)
# Step 6
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "boot"), verbose = TRUE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "grDevices", "pheatmap"), verbose = TRUE)
# Check the version of readr, at least 1.1.0 is required to properly write gz files
if (packageVersion("readr") < "1.1.0") {
message = paste0("Version of readr library is too old, at least version 1.1.0 is required). Execute the following in R: install.packages('readr') ")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (par.l$nPermutations == 0 && par.l$nBootstraps < 1000) {
flog.warn(paste0("The value for nBootstraps is < 1000. We strongly recommend using a higher value in order to reliably estimate the statistical variance."))
}
#############################
# CHECK FASTA AND BAM FILES #
......@@ -102,6 +124,18 @@ checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmat
sampleData.df = read_tsv(par.l$file_input_sampleData, col_names = TRUE, col_types = cols())
# Check the sample table
nDistValues = length(unique(sampleData.df$conditionSummary))
if (nDistValues != 2) {
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (!testSubset(unique(sampleData.df$conditionSummary), conditionComparison)) {
message = paste0("The elements specified in 'conditionComparison' in the config file must be a subset of the values in the column 'conditionSummary' in the sample file")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
# Build the fasta index. Requires write access to the folder where the fasta is stored (limitation of samtools faidx)
fastaIndex = paste0(fastaFile, ".fai")
......@@ -147,6 +181,8 @@ for (bamCur in sampleData.df$bamReads) {
}
flog.info(paste0("Check peak files..."))
###################
# CHECK PEAK FILE #
###################
......@@ -219,6 +255,8 @@ if (file_peaks != "") {
}
flog.info(paste0("Check TF-specific TFBS files..."))
TFs = createFileList(TFBS_dir, "_TFBS.bed", verbose = FALSE)
if (length(TFs) == 0) {
......
......@@ -41,7 +41,7 @@ assertClass(snakemake, "Snakemake")
## INPUT ##
assertList(snakemake@input, min.len = 1)
assertSubset(names(snakemake@input), c("", "peaks", "checkFlag"))
assertSubset(names(snakemake@input), c("", "peaks", "checkFlag", "sampleFile"))
par.l$file_input_peaks = snakemake@input$peaks
......@@ -127,6 +127,7 @@ consensusPeaks.df$annotation = paste0(consensusPeaks.df$chr, ":", consensusPeaks
# Reorganize columns
consensusPeaks.df = consensusPeaks.df[,c("chr", "start", "end", "annotation", "pvalue")]
consensusPeaks.df = mutate_if(consensusPeaks.df, is.numeric, as.character)
write_tsv(consensusPeaks.df, path = par.l$file_output_consensusPeaks, col_names = FALSE)
.printExecutionTime(start.time)
......
......@@ -99,8 +99,6 @@ for (fileCur in par.l$files_input_TF_summary) {
peaks.df = read_tsv(par.l$file_input_peaks, col_types = cols())
assertSubset(colnames(peaks.df), c("permutation", "position", "D2_baseMean", "D2_l2FC", "D2_ldcSE", "D2_stat", "D2_pval", "D2_padj"))
summary.df = NULL
nTFs = length(par.l$filesTransl.l)
......@@ -149,18 +147,18 @@ if (nTFMissing == nrow(summary.df)) {
summary.df$pvalue_raw[summary.df$pvalue_raw == 0] = .Machine$double.xmin
mode_peaks = mlv(peaks.df$D2_l2FC, method = "mfv", na.rm = TRUE)
mode_peaks = mlv(peaks.df$l2FC, method = "mfv", na.rm = TRUE)
summary.df = summary.df %>%
dplyr::mutate(
adj_pvalue = p.adjust(pvalue_raw, method = "fdr"),
Diff_mean = Mean_l2FC - mean(peaks.df$D2_l2FC, na.rm = TRUE),
DiffMedian = Median_l2FC - median(peaks.df$D2_l2FC, na.rm = TRUE),
Diff_mode = Mode_l2FC - mode_peaks[[1]],
Diff_skew = skewness_l2FC - mode_peaks[[2]]) %>%
adj_pvalue = p.adjust(pvalue_raw, method = "fdr"),
Diff_mean = Mean_l2FC - mean(peaks.df$l2FC, na.rm = TRUE),
Diff_median = Median_l2FC - median(peaks.df$l2FC, na.rm = TRUE),
Diff_mode = Mode_l2FC - mode_peaks[[1]],
Diff_skew = skewness_l2FC - mode_peaks[[2]]) %>%
na.omit(summary.df)
summary.df = mutate_if(summary.df, is.numeric, as.character)
write_tsv(summary.df, par.l$file_output_table) # TODO: check the dec = "." parameter
.printExecutionTime(start.time)
......
This diff is collapsed.
## README: prepare Permutations, to make it more efficient
start.time <- Sys.time()
#########################
# LIBRARY AND FUNCTIONS #
#########################
library("checkmate")
assertClass(snakemake, "Snakemake")
assertDirectoryExists(snakemake@config$par_general$dir_scripts)
source(paste0(snakemake@config$par_general$dir_scripts, "/functions.R"))
initFunctionsScript(packagesReq = NULL, minRVersion = "3.1.0", warningsLevel = 1, disableScientificNotation = TRUE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods", "BiocParallel", "rlist"), verbose = FALSE)
# methods needed here because Rscript does not loads this package automatically, see http://stackoverflow.com/questions/19468506/rscript-could-not-find-function
########################################################################
# SAVE SNAKEMAKE S4 OBJECT THAT IS PASSED ALONG FOR DEBUGGING PURPOSES #
########################################################################
# Use the following line to load the Snakemake object to manually rerun this script (e.g., for debugging purposes)
# Replace {outputFolder} correspondingly.
# snakemake = readRDS("{outputFolder}/LOGS_AND_BENCHMARKS/5.prepareBinning.R.rds")
createDebugFile(snakemake)
###################
#### PARAMETERS ###
###################
par.l = list()
par.l$verbose = TRUE
par.l$log_minlevel = "INFO"
#####################
# VERIFY PARAMETERS #
#####################
assertClass(snakemake, "Snakemake")
## INPUT ##
assertList(snakemake@input, min.len = 1)
assertSubset(names(snakemake@input), c("", "nucContent", "motifes"))
par.l$files_input_nucContentGenome = snakemake@input$nucContent
for (fileCur in par.l$files_input_nucContentGenome) {
assertFileExists(fileCur , access = "r")
}
par.l$file_input_TF_allMotives = snakemake@input$motifes
assertFileExists(par.l$file_input_TF_allMotives, access = "r")
## OUTPUT ##
assertList(snakemake@output, min.len = 1)
assertSubset(names(snakemake@output), c("", "allTFData", "allTFUniqueData"))
par.l$file_output_allTF = snakemake@output$allTFData
par.l$file_output_allTFUnique = snakemake@output$allTFUniqueData
## CONFIG ##
assertList(snakemake@config, min.len = 1)
par.l$nPermutations = snakemake@config$par_general$nPermutations
assertIntegerish(par.l$nPermutations, lower = 0)
## PARAMS ##
assertList(snakemake@params, min.len = 1)
assertSubset(names(snakemake@params), c("", "nCGBins"))
par.l$nBins = as.integer(snakemake@params$nCGBins)
assertIntegerish(par.l$nBins, lower = 1, upper = 100)
## LOG ##
assertList(snakemake@log, min.len = 1)
par.l$file_log = snakemake@log[[1]]
allDirs = c(dirname(par.l$file_output_allTF),
dirname(par.l$file_output_allTFUnique),
dirname(par.l$file_log)
)
testExistanceAndCreateDirectoriesRecursively(allDirs)
######################
# FINAL PREPARATIONS #
######################
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
#########################
# READ ALL MOTIVES FILE #
#########################
# import original dataframe with motifs. Has to be read only once
TF.motifs.ori = read_tsv(par.l$file_input_TF_allMotives, col_names = TRUE, col_types = cols())
if (nrow(TF.motifs.ori) == 0) {
message = paste0("The file ", par.l$file_input_TF_allMotives, " is empty. Something went wrong before. Make sure the previous steps succeeded.")
flog.fatal(message)
stop(message)
}
# TODO: eventuell löschen: chr, MSS, MES, PSS, PES
#Filter permutations in the original files that the user does not want anymore
TF.motifs.ori = filter(TF.motifs.ori, permutation <= par.l$nPermutations)
TF.motifs.ori$CG.identifier = paste0(TF.motifs.ori$TF,":",TF.motifs.ori$TFBSID)
#########################
# READ ALL NUC CG FILES #
#########################
readFile <- function(permutationCur, par.l) {
fileCur = par.l$files_input_nucContentGenome[permutationCur + 1]
if (permutationCur > 0) {
flog.info(paste0("Reading and processing file ", fileCur, " from permutation ", permutationCur))
} else {
flog.info(paste0("Reading and processing file ", fileCur, " from original data ", permutationCur))
}
TF.motifs.CG.cur = read_tsv(fileCur, col_names = TRUE,
col_types = list(
col_skip(), # "chr"
col_skip(), # "MSS"
col_skip(), # "MES"
col_character(), # "strand"
col_character(), # "TF",
col_skip(), # "AT"
col_number(), # "CG"
col_skip(), # "A"
col_skip(), # "C"
col_skip(), # "G"
col_skip(), # "T"
col_skip(), # "N"
col_skip(), # "other_nucl"
col_skip() # "length"
)
)
if (nrow(problems(TF.motifs.CG.cur)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(TF.motifs.CG.cur), capture = TRUE)
stop("Error when parsing the file ", fileCur, ", see errors above")
}
colnames(TF.motifs.CG.cur) = c("TFBSID","TF","CG")
nRowsNA = length(which(is.na(TF.motifs.CG.cur$CG)))
if (nRowsNA > 0) {
message <- paste0("The file ", fileCur, " contains ", nRowsNA, " rows out of ", nrow(TF.motifs.CG.cur), " with a missing value for the CG content, which most likely results from an assembly discordance between the BAM files and the specified fasta file. These regions will be removed in subsequent steps. The first 10 are printed with their first 3 columns here for debugging purposes:")
message = paste0(message, paste0(unlist(TF.motifs.CG.cur[1:10,"chr"]), ":", unlist(TF.motifs.CG.cur[1:10,"MSS"]), "-", unlist(TF.motifs.CG.cur[1:10,"MES"]), collapse = ", "))
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
TF.motifs.CG.cur = TF.motifs.CG.cur[-nRowsNA,]
}
TF.motifs.CG.cur = TF.motifs.CG.cur %>%
mutate(CG.identifier = paste0(TF,":" ,TFBSID), permutation = permutationCur) %>%
select(-one_of("TFBSID", "TF"))
return(TF.motifs.CG.cur)
}
nCores = snakemake@threads
results.l = .execInParallelGen(nCores,
returnAsList = TRUE, listNames = paste0("permutation", 0:par.l$nPermutations),
iteration = 0:par.l$nPermutations,
abortIfErrorParallel = TRUE,
verbose = TRUE,
readFile, par.l)
TF.motifs.CG = do.call("rbind", results.l)
#########
# MERGE #
#########
CGBins = seq(0,1, 1/par.l$nBins)
TF.motifs.all = TF.motifs.ori %>%
full_join(TF.motifs.CG, by = c("CG.identifier", "permutation")) %>%
select(-one_of("CG.identifier")) %>%
mutate(CG.bins = cut(CG, breaks = CGBins, labels = paste0(round(CGBins[-1] * 100,0),"%"), include.lowest = TRUE))
# Not needed anymore, delete
rm(TF.motifs.CG)
rm(TF.motifs.ori)
# remove duplicated TFBS from different TFs to use in the permuations
TF.motifs.all.unique = TF.motifs.all[!duplicated(TF.motifs.all[,c("permutation", "TFBSID")]),]
# TODO: TFBSID and peakID needed?
flog.info(paste0("Found ", nrow(TF.motifs.all) - nrow(TF.motifs.all.unique), " duplicated TFBS across all TF."))
saveRDS(TF.motifs.all, file = par.l$file_output_allTF)
saveRDS(TF.motifs.all.unique, file = par.l$file_output_allTFUnique)
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
......@@ -421,101 +421,34 @@ myMAPlot <- function(M, idx, main, minMean = 0) {
}
printMultipleGraphsPerPage <- function(plots.l, nCol = 1, nRow = 1, pdfFile = NULL, height = NULL, width = NULL, verbose = FALSE) {
plotDiagnosticPlots <- function(dd, differentialResults, conditionComparison, filename = NULL, maxPairwiseComparisons = 5, alpha = 0.05) {
checkAndLoadPackages(c("grDevices", "gridExtra"), verbose = verbose)
assertList(plots.l, min.len = 1)
assertInt(nCol, lower = 1)
assertInt(nRow, lower = 1)
assert(checkNull(pdfFile), checkCharacter(pdfFile))
if (!testNull(pdfFile)) assertDirectory(dirname(pdfFile), access = "r")
assert(checkNull(height), checkNumber(height, lower = 1))
assert(checkNull(width), checkNumber(width, lower = 1))
if (testNull(height)) {
height = 7
}
if (testNull(width)) {
width = 7
}
for (i in seq_len(length(plots.l))) {
assertClass(plots.l[[i]], classes = c("ggplot", "gg"))
}
nPlotsPerPage = nCol * nRow
plotsNew.l = list()
if (!testNull(pdfFile)) {
clearOpenDevices()
pdf(pdfFile, height = height, width = width)
}
index = 0
for (indexAll in 1:length(plots.l)) {
index = index + 1
plotsNew.l[[index]] = plots.l[[indexAll]]
# Print another page
if (index %% nPlotsPerPage == 0) { ## print 8 plots on a page
suppressMessages(print(do.call(grid.arrange, c(plotsNew.l, list(ncol = nCol, nrow = nRow)))))
plotsNew.l = list() # reset plot
index = 0 # reset index
}
}
# Print the remainding plots in case they don't perfectly fit with the layout
if (length(plotsNew.l) != 0) {
suppressMessages(print(do.call(grid.arrange, c(plotsNew.l, list(ncol = nCol, nrow = nRow)))))
}
if (!testNull(pdfFile))
dev.off()
cat("Finished writing plots to file ", pdfFile, "\n")
}
computeDESeqDiagnosticPlots <- function(dd, filename = NULL, maxPairwiseComparisons = 5) {
checkAndLoadPackages(c("tidyverse", "checkmate", "geneplotter", "DESeq2", "vsn", "RColorBrewer"), verbose = FALSE)
checkAndLoadPackages(c("tidyverse", "checkmate", "geneplotter", "DESeq2", "vsn", "RColorBrewer", "limma"), verbose = FALSE)
assertClass(dd, "DESeqDataSet")
assert(testClass(differentialResults, "MArrayLM"), testClass(differentialResults, "DESeqDataSet"))
assertVector(conditionComparison, len = 2)
assert(checkNull(filename), checkDirectory(dirname(filename), access = "w"))
if (!is.null(filename)) {
pdf(filename)
}
# 1. MA plots: An MA-plot (R. Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. You may hear this plot also referred to as a mean-difference plot, or a Bland-Altman plot.
# Before making the MA-plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of dex treated vs untreated samples:
# The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis. Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.
# The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts, as can be seen by the narrowing of the vertical spread of points on the left side of the MA-plot. As shown above, the lfcShrink function performs this operation. For a detailed explanation of the rationale of moderated fold changes, please see the DESeq2 paper (Love, Huber, and Anders 2014).
if (testClass(differentialResults, "MArrayLM")) {
title = paste0("limma results\n", conditionComparison[1], " vs. ", conditionComparison[2])
isSign = ifelse(p.adjust(differentialResults$p.value[,ncol(differentialResults$p.value)], method = "BH") < alpha, paste0("sign. (BH, ", alpha, ")"), "not-significant")
limma::plotMA(differentialResults, main = title, status = isSign)
# TODO: dd2 <- lfcShrink(dds, contrast=c("dex","trt","untrt"), res=res)
# vary alpha from 0.01 to 0.2
for (alphaCur in c(0.001, 0.01, 0.05, 0.1, 0.2)) {
suppressWarnings(DESeq2::plotMA(dd, alpha = alphaCur, main = paste0("MA plot for alpha = ", alphaCur)))
} else {
# TODO: DeSEQ diagnostic plots
}
# Another useful diagnostic plot is the histogram of the p values (figure below).
# This plot is best formed by excluding genes with very small counts, which otherwise generate spikes in the histogram.
# TODO: hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20, col = "grey50", border = "white")
# 2. Densities of counts for the different samples.
# Since most of the genes are (heavily) affected by the experimental conditions, a succesful normalization will lead to overlapping densities
......@@ -552,7 +485,7 @@ computeDESeqDiagnosticPlots <- function(dd, filename = NULL, maxPairwiseComparis
if (nrow(MA.idx) > maxPairwiseCo