Commit f31efdc3 authored by Christian Arnold

Version 1.5. See changelog for details.

parent f5fa7903
......@@ -138,7 +138,7 @@ A working ``R`` installation is needed and a number of packages from either CRAN
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("limma", "vsn", "csaw", "DESeq2", "DiffBind", "geneplotter", "Rsamtools", "preprocessCore"))
BiocManager::install(c("limma", "vsn", "csaw", "DESeq2", "DiffBind", "geneplotter", "Rsamtools", "preprocessCore", "apeglm"))
.. _docs-runOwnAnalysis:
......
......@@ -55,9 +55,9 @@ author = 'Christian Arnold, Ivan Berest, Judith B. Zaugg'
# built documents.
#
# The short X.Y version.
version = '1.4'
version = '1.5'
# The full version, including alpha/beta/rc tags.
release = '1.4'
release = '1.5'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -2,12 +2,11 @@
Biological motivation
============================
Transcription factor (TF) activity constitutes an important readout of cellular signalling pathways and thus for assessing regulatory differences across conditions. However, current technologies lack the ability to simultaneously assessing activity changes for multiple TFs and surprisingly little is known about whether a TF acts as repressor or activator. To this end, we introduce the widely applicable genome-wide method diffTF to assess differential TF binding activity and classifying TFs as activator or repressor by integrating any type of genome-wide chromatin with RNA-Seq data and in-silico predicted TF binding sites.
Transcription factors (TFs) regulate many cellular processes and can therefore serve as readouts of the signaling and regulatory state. Yet for many TFs, the mode of action—repressing or activating transcription of target genes—is unclear. Here, we present diffTF (`https://git.embl.de/grp-zaugg/diffTF <https://git.embl.de/grp-zaugg/diffTF>`_) to calculate differential TF activity (basic mode) and classify TFs into putative transcriptional activators or repressors (classification mode). In basic mode, it combines genome-wide chromatin accessibility/activity with putative TF binding sites that, in classification mode, are integrated with RNA-seq. We apply diffTF to compare (1) mutated and unmutated chronic lymphocytic leukemia patients and (2) two hematopoietic progenitor cell types. In both datasets, diffTF recovers most known biology and finds many previously unreported TFs. It classifies almost 40% of TFs based on their mode of action, which we validate experimentally. Overall, we demonstrate that diffTF recovers known biology, identifies less well-characterized TFs, and classifies TFs into transcriptional activators or repressors.
For a graphical summary of the idea, see the section :ref:`workflow`
We also put the paper on *bioRxiv*, please see the section :ref:`citation` for details.
The paper is open access and available online, please see the section :ref:`citation` for details.
.. _exampleDataset:
......@@ -44,15 +43,25 @@ Citation
If you use this software, please cite the following reference:
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rassmussen, Holly Giles, Peter-Martin Bruch, Sascha Dietrich, Wolfgang Huber, Kristian Helin & Judith B. Zaugg. *Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: diffTF*. 2019. in review.
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rasmussen, Holly Giles, Peter-Martin Bruch, Wolfgang Huber, Sascha Dietrich, Kristian Helin, Judith B. Zaugg. *Quantification of Differential Transcription Factor Activity and Multiomics-Based Classification into Activators and Repressors: diffTF*. 2019. Cell Reports 29(10), P3147-3159.E12.
Open Access. DOI:https://doi.org/10.1016/j.celrep.2019.10.106
We also put the paper on *bioRxiv*, please read all methodological details here:
`Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/12/01/368498>`_.
.. We also put the paper on *bioRxiv*, please read all methodological details here:
.. `Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/12/01/368498>`_.
.. _changelog:
Change log
============================
Version 1.5 (2019-12-03)
- *diffTF* has been published in Cell Reports! See the section :ref:`citation` for details.
- raw and adjusted p-values for the permutation-based approach can no longer be 0. We now use the approach described `here <https://genomicsclass.github.io/book/pages/permutation_tests.html>`_. In a nutshell, the smallest possible p-value is now 1/(*nPermutations* + 1), with *nPermutations* denoting the number of permutations; running more permutations therefore lowers the minimum p-value.
- for plotting TF densities, a fixed bandwidth of 0.1 was used before. We have now removed this, and the bandwidth is determined automatically. We noticed that in some cases the fixed bandwidth smoothed the density curves, whereas without it the densities look more rugged. As we do not want to introduce visual artifacts, we decided to remove it.
- various minor code improvements, particularly related to the AR classification
- small fixes in the Snakefile
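
As a rough illustration of the p-value change described above, the following is a minimal sketch and not the actual *diffTF* code. The function name ``permutationPValue`` and the two-sided comparison on absolute values are assumptions made for this example; only the lower bound of 1/(*nPermutations* + 1) is taken from the change log entry.

```r
# Hypothetical helper (not part of diffTF): permutation p-value with the
# "+1" correction, so the result can never be exactly 0.
# Assumption: a two-sided comparison on absolute values of the statistic.
permutationPValue <- function(observed, permuted) {
  nPermutations <- length(permuted)
  # Number of permuted statistics at least as extreme as the observed one
  nExtreme <- sum(abs(permuted) >= abs(observed))
  (nExtreme + 1) / (nPermutations + 1)
}

set.seed(1)
permuted <- rnorm(1000)

# An observed value more extreme than every permuted value yields the
# minimum possible p-value, 1 / (nPermutations + 1)
pMin <- permutationPValue(max(abs(permuted)) + 1, permuted)
stopifnot(isTRUE(all.equal(pMin, 1 / 1001)))
```

With 1000 permutations the smallest attainable value in this sketch is 1/1001; with 100 permutations it would be 1/101, which is why running more permutations lowers the minimum p-value.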
Version 1.4 (2019-09-24)
- Various small improvements for increasing user experience
......
......@@ -189,13 +189,8 @@ outputSummary.df = tribble(~permutation, ~TF, ~Pos_l2FC, ~Mean_l2FC, ~Median_l2
# READ OVERLAP FILE #
#####################
overlapsAll.df = read_tsv(par.l$file_input_peakTFOverlaps, col_names = TRUE, col_types = cols(), comment = "#")
if (nrow(problems(overlapsAll.df)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(overlapsAll.df), capture = TRUE)
stop("Error when parsing the file ", fileCur, ", see errors above")
}
overlapsAll.df = read_tidyverse_wrapper(par.l$file_input_peakTFOverlaps, type = "tsv",
col_names = TRUE, col_types = cols(), comment = "#")
if (nrow(overlapsAll.df) > 0) {
......@@ -326,11 +321,7 @@ if (skipTF) {
# Preallocate data frame so no expensive reallocation has to be done
log2fc.m = matrix(NA, nrow = nPeaks , ncol = par.l$nPermutations + 1)
peaks.df = read_tsv(par.l$file_input_peak2, col_types = cols())
if (nrow(problems(peaks.df)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(peaks.df), capture = TRUE)
stop("Error when parsing the file ", par.l$file_input_peak2, ", see errors above")
}
peaks.df = read_tidyverse_wrapper(par.l$file_input_peak2, type = "tsv", col_types = cols())
peaksFiltered.df = readRDS(par.l$file_input_peaks)
......@@ -344,6 +335,7 @@ if (skipTF) {
if (par.l$nPermutations > 0) {
# Generate normalized counts for limma analysis
countsRaw = DESeq2::counts(TF.cds.filt, norm = FALSE)
countsNorm = DESeq2::counts(TF.cds.filt, norm = TRUE)
countsNorm.transf = log2(countsNorm + par.l$pseudocountAddition)
rownames(countsNorm.transf) = rownames(TF.cds.filt)
......@@ -355,11 +347,12 @@ if (skipTF) {
for (permutationCur in 0:par.l$nPermutations) {
if (permutationCur == 0) {
flog.info(paste0("Running for real data"))
} else {
flog.info(paste0("Running for permutation ", permutationCur))
}
if (permutationCur > 0 & (permutationCur %% 10 == 0 | permutationCur == par.l$nPermutations)) {
flog.info(paste0("Running permutation ", permutationCur))
} else {
flog.info(paste0("Running for real data "))
}
sampleData.df = sampleData.l[[paste0("permutation", permutationCur)]]
##############################
......@@ -392,14 +385,14 @@ if (skipTF) {
# We already set the factors for conditionSummary explicitly. The reference level is the first level for DeSeq.
# Run the local fit first, if that throws an error try the default fit type
res_DESeq = tryCatch( {
TF.cds.filt = tryCatch( {
suppressMessages(DESeq(TF.cds.filt,fitType = 'local'))
}, error = function(e) {
message = "Could not run DESeq with local fitting, retry with default fitting type..."
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
res_DESeq = tryCatch( {
TF.cds.filt = tryCatch( {
suppressMessages(DESeq(TF.cds.filt))
}, error = function(e) {
......@@ -408,12 +401,12 @@ if (skipTF) {
}
)
res_DESeq
TF.cds.filt
}
)
if (class(res_DESeq) == "character") {
if (class(TF.cds.filt) == "character") {
skipTF = TRUE
TF_outputInclPerm.df = as.data.frame(matrix(nrow = 0, ncol = 2 + par.l$nPermutations + 1))
......@@ -425,14 +418,18 @@ if (skipTF) {
#Enforce the correct order of the comparison
if (comparisonMode == "pairwise") {
res_DESeq.df <- as.data.frame(DESeq2::results(res_DESeq, contrast = c(variableToPermute, conditionComparison[1], conditionComparison[2])))
contrast = c(variableToPermute, conditionComparison[1], conditionComparison[2])
} else {
# Same as without specifying contrast at all
res_DESeq.df <- as.data.frame(DESeq2::results(res_DESeq, contrast = list(variableToPermute)))
contrast = list(variableToPermute)
}
res_DESeq = DESeq2::results(TF.cds.filt, contrast = contrast)
res_DESeq.df <- as.data.frame(res_DESeq)
final.TF.df = tibble("TFBSID" = rownames(res_DESeq.df),
......@@ -475,9 +472,10 @@ if (skipTF) {
pdf(par.l$file_output_plot_diagnostic)
if (par.l$nPermutations == 0) {
plotDiagnosticPlots(TF.cds.filt, res_DESeq, conditionComparison, filename = NULL, maxPairwiseComparisons = 0, plotMA = FALSE)
plotDiagnosticPlots(TF.cds.filt, conditionComparison, file = NULL, maxPairwiseComparisons = 0, plotMA = FALSE)
} else {
plotDiagnosticPlots(TF.cds.filt, fit, conditionComparison, filename = NULL, maxPairwiseComparisons = 0, plotMA = FALSE)
plotDiagnosticPlots(fit, conditionComparison, file = NULL, maxPairwiseComparisons = 0, plotMA = FALSE, counts.raw = countsRaw, counts.norm = countsNorm)
}
......
......@@ -104,7 +104,7 @@ testExistanceAndCreateDirectoriesRecursively(allDirs)
TFCur = par.l$TF
# TODO: perm.l with all TF instead of TFCur
perm.l = list()
calculateVariance = par.l$nPermutations == 0
......@@ -161,31 +161,23 @@ if (par.l$nPermutations + 1 < length(sampleData.l)) {
# READ NUC CG FILE #
####################
flog.info(paste0("Reading and processing file ", fileCur))
TF.motifs.CG = read_tsv(par.l$file_input_nucContentGenome, col_names = TRUE,
col_types = list(
col_skip(), # "chr"
col_skip(), # "MSS"
col_skip(), # "MES"
col_character(), # "TFBSID"
col_character(), # "TF",
col_skip(), # "AT"
col_double(), # "CG"
col_skip(), # "A"
col_skip(), # "C"
col_skip(), # "G"
col_skip(), # "T"
col_skip(), # "N"
col_skip(), # "other_nucl"
col_skip() # "length"
)
)
if (nrow(problems(TF.motifs.CG)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(TF.motifs.CG), capture = TRUE)
stop("Error when parsing the file ", fileCur, ", see errors above")
}
TF.motifs.CG = read_tidyverse_wrapper(par.l$file_input_nucContentGenome, type = "tsv", col_names = TRUE,
col_types = list(
col_skip(), # "chr"
col_skip(), # "MSS"
col_skip(), # "MES"
col_character(), # "TFBSID"
col_character(), # "TF",
col_skip(), # "AT"
col_double(), # "CG"
col_skip(), # "A"
col_skip(), # "C"
col_skip(), # "G"
col_skip(), # "T"
col_skip(), # "N"
col_skip(), # "other_nucl"
col_skip() # "length"
))
colnames(TF.motifs.CG) = c("TFBSID","TF","CG")
......@@ -220,39 +212,23 @@ nPermutationsSkipped = 0
for (fileCur in par.l$files_input_TF_allMotives) {
# Log 2 fold-changes from the particular permutation
TF.motifs.ori = read_tsv(fileCur, col_names = TRUE,
col_types = list(
col_character(), # "TF",
col_character(), # "TFBSID"
col_double() # "log2FoldChange", important to have double here as col_number would not parse numbers in scientific notation correctly
)
)
TF.motifs.ori = read_tidyverse_wrapper(fileCur, type = "tsv", col_names = TRUE,
col_types = list(
col_character(), # "TF",
col_character(), # "TFBSID"
col_double() # "log2FoldChange", important to have double here as col_number would not parse numbers in scientific notation correctly
))
if (nrow(problems(TF.motifs.ori)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(TF.motifs.ori), capture = TRUE)
stop("Error when parsing the file ", fileCur, ", see errors above")
}
if (nrow(TF.motifs.ori) == 0) {
message = paste0("The file ", fileCur, " is empty. Something went wrong before. Make sure the previous steps succeeded.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (ncol(TF.motifs.ori) != 3) {
message = paste0("The file ", fileCur, " does not have 3 columns. Something is wrong with the number of permutations. We recommend restarting the pipeline from the DiffPeaks step.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
colnames(TF.motifs.ori) = c("TF", "TFBSID", "log2FoldChange")
permutationCur = as.numeric(gsub(".*perm([0-9]+).tsv.gz", '\\1', fileCur))
TF.motifs.ori$permutation = permutationCur
if (permutationCur > 0) {
if (permutationCur > 0 & (permutationCur %% 10 == 0 | permutationCur == par.l$nPermutations)) {
flog.info(paste0("Running permutation ", permutationCur))
} else {
flog.info(paste0("Running for real data ", permutationCur))
flog.info(paste0("Running for real data "))
}
......@@ -289,8 +265,6 @@ for (fileCur in par.l$files_input_TF_allMotives) {
nBinsAll = length(levels(TF.motifs.all$CG.bins))
nCol = ncol(perm.l[[TFCur]])
# TODO: TFBSID needed?
flog.info(paste0(" Found ", nrow(TF.motifs.all) - nrow(TF.motifs.all.unique), " duplicated TFBS across all TF."))
# TODO: Optimize as in dev TF.motifs.all.unique = TF.motifs.all.unique[which(TF.motifs.all.unique$TF != TFCur & is.finite(TF.motifs.all.unique$log2FoldChange)),]
......
## README: prepare Permutations, to make it more efficient
start.time <- Sys.time()
#########################
# LIBRARY AND FUNCTIONS #
#########################
library("checkmate")
assertClass(snakemake, "Snakemake")
assertDirectoryExists(snakemake@config$par_general$dir_scripts)
source(paste0(snakemake@config$par_general$dir_scripts, "/functions.R"))
initFunctionsScript(packagesReq = NULL, minRVersion = "3.1.0", warningsLevel = 1, disableScientificNotation = TRUE)
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "boot"), verbose = FALSE)
# methods needed here because Rscript does not load this package automatically, see http://stackoverflow.com/questions/19468506/rscript-could-not-find-function
########################################################################
# SAVE SNAKEMAKE S4 OBJECT THAT IS PASSED ALONG FOR DEBUGGING PURPOSES #
########################################################################
# Use the following line to load the Snakemake object to manually rerun this script (e.g., for debugging purposes)
# Replace {outputFolder} and {TF} correspondingly.
# snakemake = readRDS("{outputFolder}/LOGS_AND_BENCHMARKS/binningTF.{TF}.R.rds")
createDebugFile(snakemake)
###################
#### PARAMETERS ###
###################
par.l = list()
par.l$verbose = TRUE
par.l$log_minlevel = "INFO"
par.l$minNoDatapoints = 5
# Used for plotting
par.l$includePlots = FALSE
#####################
# VERIFY PARAMETERS #
#####################
assertClass(snakemake, "Snakemake")
## INPUT ##
assertList(snakemake@input, min.len = 1)
assertSubset(names(snakemake@input), c("", "sampleDataR", "nucContent", "motifes"))
par.l$file_input_metadata = snakemake@input$sampleDataR
assertFileExists(par.l$file_input_metadata, access = "r")
par.l$file_input_nucContentGenome = snakemake@input$nucContent
assertFileExists(par.l$file_input_nucContentGenome , access = "r")
par.l$files_input_TF_allMotives = snakemake@input$motifes
for (fileCur in par.l$files_input_TF_allMotives) {
assertFileExists(fileCur, access = "r")
}
## OUTPUT ##
assertList(snakemake@output, min.len = 1)
assertSubset(names(snakemake@output), c("", "permResults", "summary"))
par.l$file_output_permResults = snakemake@output$permResults
par.l$file_output_summary = snakemake@output$summary
## CONFIG ##
assertList(snakemake@config, min.len = 1)
par.l$nPermutations = snakemake@config$par_general$nPermutations
assertIntegerish(par.l$nPermutations, lower = 0, len = 1)
par.l$nBins = snakemake@config$par_general$nCGBins
assertIntegerish(par.l$nBins, lower = 1, upper = 100, len = 1)
par.l$nBootstraps = as.integer(snakemake@config$par_general$nBootstraps)
assertIntegerish(par.l$nBootstraps, len = 1)
## WILDCARDS ##
assertList(snakemake@wildcards, min.len = 1)
assertSubset(names(snakemake@wildcards), c("", "TF"))
par.l$TF = snakemake@wildcards$TF
assertCharacter(par.l$TF, len = 1, min.chars = 1)
## LOG ##
assertList(snakemake@log, min.len = 1)
par.l$file_log = snakemake@log[[1]]
allDirs = c(dirname(par.l$file_output_permResults),
dirname(par.l$file_output_summary),
dirname(par.l$file_log)
)
testExistanceAndCreateDirectoriesRecursively(allDirs)
TFCur = par.l$TF
# TODO: perm.l with all TF instead of TFCur
perm.l = list()
calculateVariance = par.l$nPermutations == 0
if (calculateVariance) {
boostrapResults.l = list()
boostrapResults.l[[TFCur]] = list()
}
output.global.TFs = tribble(~permutation, ~TF, ~weighted_meanDifference, ~weighted_CD, ~TFBS, ~weighted_Tstat, ~variance)
perm.l[[TFCur]] = tribble(~permutation, ~bin, ~meanDifference, ~nDataAll, ~nDataBin, ~ratio_TFBS, ~cohensD, ~variance, ~df, ~pvalue, ~Tstat)
summaryCov.df = tribble(~permutation, ~bin1, ~bin2, ~weight1, ~weight2, ~cov)
######################
# FINAL PREPARATIONS #
######################
startLogger(par.l$file_log, par.l$log_minlevel, removeOldLog = TRUE)
printParametersLog(par.l)
# Function for the bootstrap
ttest <- function(x, d, all) {
statistical.test = t.test(all, x[d])
return(statistical.test$statistic[[1]])
}
if (calculateVariance && par.l$nBootstraps < 1000) {
flog.warn(paste0("The value for nBootstraps is < 1000. We strongly recommend using a higher value in order to reliably estimate the statistical variance."))
}
sampleData.l = readRDS(par.l$file_input_metadata)
if (length(sampleData.l) == 0) {
message = "Length of sampleData.l list is 0 but it has to be at least 1. Rerun the rule DiffPeaks."
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
# Adjust the number of permutations in case less have been computed
if (par.l$nPermutations + 1 < length(sampleData.l)) {
message = paste0("In the output objects, more permutations seem to be stored. They will be ignored and the currently specified value of nPermutations will be used")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
} else if (par.l$nPermutations + 1 > length(sampleData.l)) {
valueNew = length(sampleData.l) - 1
message = paste0("The value of the parameter nPermutations differs from what is saved in the output objects. The value of nPermutations will be adjusted to ", valueNew)
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
par.l$nPermutations = valueNew
par.l$files_input_TF_allMotives = par.l$files_input_TF_allMotives[1:(par.l$nPermutations+1)]
}
####################
# READ NUC CG FILE #
####################
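# Log the file that is actually read here (fileCur still holds the last motif file from the loop above)
flog.info(paste0("Reading and processing file ", par.l$file_input_nucContentGenome))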
TF.motifs.CG = read_tsv(par.l$file_input_nucContentGenome, col_names = TRUE,
col_types = list(
col_skip(), # "chr"
col_skip(), # "MSS"
col_skip(), # "MES"
col_character(), # "TFBSID"
col_character(), # "TF",
col_skip(), # "AT"
col_double(), # "CG"
col_skip(), # "A"
col_skip(), # "C"
col_skip(), # "G"
col_skip(), # "T"
col_skip(), # "N"
col_skip(), # "other_nucl"
col_skip() # "length"
)
)
if (nrow(problems(TF.motifs.CG)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(TF.motifs.CG), capture = TRUE)
stop("Error when parsing the file ", par.l$file_input_nucContentGenome, ", see errors above")
}
colnames(TF.motifs.CG) = c("TFBSID","TF","CG")
# Rows with a missing CG content cannot be binned and are removed
rowsNA = is.na(TF.motifs.CG$CG)
nRowsNA = sum(rowsNA)
if (nRowsNA > 0) {
# The columns chr, MSS and MES are skipped when reading the file, so report the TFBSID instead
message <- paste0("The file ", par.l$file_input_nucContentGenome, " contains ", nRowsNA, " rows out of ", nrow(TF.motifs.CG), " with a missing value for the CG content, which most likely results from an assembly discordance between the BAM files and the specified fasta file. These regions will be removed in subsequent steps. The TFBSID of the first 10 are printed here for debugging purposes: ")
message = paste0(message, paste0(head(TF.motifs.CG$TFBSID[rowsNA], 10), collapse = ", "))
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
# Drop all rows with a missing CG value (a negative index of nRowsNA would remove only a single row)
TF.motifs.CG = TF.motifs.CG[!rowsNA,]
}
TF.motifs.CG = TF.motifs.CG %>%
mutate(CG.identifier = paste0(TF,":" ,TFBSID)) %>%
dplyr:: select(-one_of("TFBSID"))
CGBins = seq(0,1, 1/par.l$nBins)
#########################
# READ ALL MOTIVES FILE #
#########################
################
# PERMUTATIONS #
################
for (fileCur in par.l$files_input_TF_allMotives) {
# Log 2 fold-changes from the particular permutation
TF.motifs.ori = read_tsv(fileCur, col_names = TRUE,
col_types = list(
col_character(), # "TF",
col_character(), # "TFBSID"
col_double() # "log2FoldChange", important to have double here as col_number would not parse numbers in scientific notation correctly
)
)
if (nrow(problems(TF.motifs.ori)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(TF.motifs.ori), capture = TRUE)
stop("Error when parsing the file ", fileCur, ", see errors above")
}
if (nrow(TF.motifs.ori) == 0) {
message = paste0("The file ", fileCur, " is empty. Something went wrong before. Make sure the previous steps succeeded.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (ncol(TF.motifs.ori) != 3) {
message = paste0("The file ", fileCur, " does not have 3 columns. Something is wrong with the number of permutations. We recommend restarting the pipeline from the DiffPeaks step.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
colnames(TF.motifs.ori) = c("TF", "TFBSID", "log2FoldChange")
permutationCur = as.numeric(gsub(".*perm([0-9]+).tsv.gz", '\\1', fileCur))
TF.motifs.ori$permutation = permutationCur
if (permutationCur > 0) {
flog.info(paste0("Running permutation ", permutationCur))
} else {
flog.info(paste0("Running for real data ", permutationCur))
}
#Filter permutations in the original files that the user does not want anymore
TF.motifs.ori = TF.motifs.ori %>%
dplyr::filter(permutation <= par.l$nPermutations) %>%
mutate(CG.identifier = paste0(TF,":",TFBSID)) %>%
dplyr::select(-one_of("TF"))
#########
# MERGE #
#########
# TODO: full join necessary?
TF.motifs.all = TF.motifs.ori %>%
full_join(TF.motifs.CG, by = c("CG.identifier")) %>%
mutate(CG.bins = cut(CG, breaks = CGBins, labels = paste0(round(CGBins[-1] * 100,0),"%"), include.lowest = TRUE)) %>%
dplyr::select(-one_of("CG.identifier", "CG"))
# EDIT: RANDOMIZE TF labels column
TF.motifs.all$TF = sample(TF.motifs.all$TF, length(TF.motifs.all$TF), replace = FALSE)
# Not needed anymore, delete
rm(TF.motifs.ori)
# remove duplicated TFBS from different TFs to use in the permutations
TF.motifs.all.unique = TF.motifs.all[!duplicated(TF.motifs.all[,c("permutation", "TFBSID")]),]
uniqueBins = unique(TF.motifs.all$CG.bins)
nBins = length(uniqueBins)
nCol = ncol(perm.l[[TFCur]])
# TODO: TFBSID needed?
flog.info(paste0(" Found ", nrow(TF.motifs.all) - nrow(TF.motifs.all.unique), " duplicated TFBS across all TF."))
# TODO: Optimize as in dev TF.motifs.all.unique = TF.motifs.all.unique[which(TF.motifs.all.unique$TF != TFCur & is.finite(TF.motifs.all.unique$log2FoldChange)),]