Transcription factor (TF) activity constitutes an important readout of cellular signalling pathways and thus for assessing regulatory differences across conditions. However, current technologies lack the ability to simultaneously assess activity changes for multiple TFs, and surprisingly little is known about whether a TF acts as a repressor or an activator. To this end, we introduce the widely applicable genome-wide method diffTF to assess differential TF binding activity and to classify TFs as activators or repressors by integrating any type of genome-wide chromatin data with RNA-Seq data and in-silico predicted TF binding sites.
Transcription factors (TFs) regulate many cellular processes and can therefore serve as readouts of the signaling and regulatory state. Yet for many TFs, the mode of action—repressing or activating transcription of target genes—is unclear. Here, we present diffTF (`https://git.embl.de/grp-zaugg/diffTF <https://git.embl.de/grp-zaugg/diffTF>`_) to calculate differential TF activity (basic mode) and classify TFs into putative transcriptional activators or repressors (classification mode). In basic mode, it combines genome-wide chromatin accessibility/activity with putative TF binding sites that, in classification mode, are integrated with RNA-seq. We apply diffTF to compare (1) mutated and unmutated chronic lymphocytic leukemia patients and (2) two hematopoietic progenitor cell types. In both datasets, diffTF recovers most known biology and finds many previously unreported TFs. It classifies almost 40% of TFs based on their mode of action, which we validate experimentally. Overall, we demonstrate that diffTF recovers known biology, identifies less well-characterized TFs, and classifies TFs into transcriptional activators or repressors.
For a graphical summary of the idea, see the section :ref:`workflow`.
We also put the paper on *bioRxiv*; please see the section :ref:`citation` for details.
The paper is open access and available online; please see the section :ref:`citation` for details.
.. _exampleDataset:
...
...
@@ -44,15 +43,25 @@ Citation
If you use this software, please cite the following reference:
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rassmussen, Holly Giles, Peter-Martin Bruch, Sascha Dietrich, Wolfgang Huber, Kristian Helin & Judith B. Zaugg. *Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: diffTF*. 2019. in review.
Ivan Berest*, Christian Arnold*, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rasmussen, Holly Giles, Peter-Martin Bruch, Wolfgang Huber, Sascha Dietrich, Kristian Helin, Judith B. Zaugg. *Quantification of Differential Transcription Factor Activity and Multiomics-Based Classification into Activators and Repressors: diffTF*. 2019. Cell Reports 29(10), P3147-3159.E12.
Open Access. DOI:https://doi.org/10.1016/j.celrep.2019.10.106
We also put the paper on *bioRxiv*, please read all methodological details here:
`Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/12/01/368498>`_.
.. We also put the paper on *bioRxiv*, please read all methodological details here:
.. `Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/12/01/368498>`_.
.. _changelog:
Change log
============================
Version 1.5 (2019-12-03)
- *diffTF* has been published in Cell Reports! See the section :ref:`citation` for details.
- raw and adjusted p-values for the permutation-based approach can no longer be 0. We now use the approach described `here <https://genomicsclass.github.io/book/pages/permutation_tests.html>`_. In a nutshell, the smallest possible p-value is now 1/(*nPermutations* + 1), where *nPermutations* denotes the number of permutations; running more permutations therefore lowers the minimum p-value.
- for plotting TF densities, a fixed bandwidth of 0.1 was used before. We have removed this, and the bandwidth is now determined automatically. We noticed that in some cases the fixed bandwidth may lead to a smoothing of the density curves, whereas without it the densities look more rugged. As we do not want to introduce visual artifacts, we decided to remove it.
- various minor code improvements, particularly related to the AR classification
- small fixes in the Snakefile
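
The p-value change in this version can be illustrated with a minimal sketch. *diffTF* itself is implemented in R; the Python function below is not taken from the pipeline. Its name ``permutation_p_value`` and the two-sided comparison on absolute values are assumptions made only for illustration. It shows why adding 1 to both the numerator and the denominator makes 1/(*nPermutations* + 1) the smallest attainable p-value.

```python
def permutation_p_value(observed, permuted):
    """Permutation p-value with the +1 correction (hypothetical helper).

    observed: test statistic computed on the real data.
    permuted: test statistics computed on the permuted data.
    """
    n_permutations = len(permuted)
    if n_permutations == 0:
        raise ValueError("At least one permutation is required.")
    # Count permutations that are at least as extreme as the real data
    # (two-sided comparison; an assumption made for this sketch).
    n_extreme = sum(1 for value in permuted if abs(value) >= abs(observed))
    # The observed statistic is counted as one member of the null
    # distribution, so the result can never be exactly 0.
    return (n_extreme + 1) / (n_permutations + 1)


# No permuted statistic is as extreme as the observed one:
# the p-value equals the minimum 1 / (nPermutations + 1).
p_min = permutation_p_value(5.0, [0.1] * 999)
```

With 999 permutations the floor is 1/1000; with 99 permutations it would be 1/100, which is why a higher number of permutations allows smaller p-values.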
Version 1.4 (2019-09-24)
- Various small improvements to the user experience
message=paste0("The file ",fileCur," does not have 3 columns. Something is wrong with the number of permutations. We recommend restarting the pipeline from the DiffPeaks step.")
flog.info(paste0("Running for real data ",permutationCur))
flog.info(paste0("Running for real data "))
}
...
...
@@ -289,8 +265,6 @@ for (fileCur in par.l$files_input_TF_allMotives) {
nBinsAll=length(levels(TF.motifs.all$CG.bins))
nCol=ncol(perm.l[[TFCur]])
# TODO: TFBSID needed?
flog.info(paste0(" Found ",nrow(TF.motifs.all)-nrow(TF.motifs.all.unique)," duplicated TFBS across all TF."))
# TODO: Optimize as in dev TF.motifs.all.unique = TF.motifs.all.unique[which(TF.motifs.all.unique$TF != TFCur & is.finite(TF.motifs.all.unique$log2FoldChange)),]
message=paste0("There is a problem with the specified design formula (parameter designContrast): The corresponding design matrix has fewer rows. This usually means that there are missing values in one of the specified variables. The problem comes from the following lines in the summary file: ",paste0(missingRows,collapse=","),".")
if fileCur.endswith(" ") or fileCur.startswith(" "):
raise IOError("File \"" + fileCur + "\" starts or ends with a white space, this should be corrected in the file " + config["samples"]["summaryFile"] + ".")
if not os.path.isfile(fileCur):
raise IOError("File \"" + fileCur + "\" (defined in " + config["samples"]["summaryFile"] + ") not found.")
# Index is not needed for featureCounts
...
...
@@ -258,7 +260,11 @@ if config["peaks"]["consensusPeaks"]:
raise IOError("File \"" + filenameCur + "\" not found.")
# Check if it contains scientific notation
# The "|| true" at the end ensures that the exit status of grep is 0; a non-zero status (no match) would raise an error otherwise
raise AssertionError("File " + config["peaks"]["consensusPeaks"] + " contains at least one line with the scientific notation (e+). This will cause errors in subsequent steps. Check the file and transform all \"e+\" coordinates.")
...
...
@@ -463,7 +469,7 @@ rule intersectPeaksAndTFBS:
TFBSinPeaksMod_bed = expand('{dir}/{compType}allTFBS.peaks.extension.bed.gz', dir = TEMP_EXTENSION_DIR, compType = compType)
message: "{ruleDisplayMessage} Obtain binding sites from peaks: Intersect all TFBS files and {input.consensusPeaks}..."