Commit bfb18ef9 authored by Christian Arnold

Updated Documentation, thanks to Nicolas Descostes for feedback

parent 88946a68
......@@ -97,6 +97,7 @@ Principally, there are two ways of installing *diffTF* and the proper tools:
Read the section :ref:`docs-singularityNotes` about the ``--bind`` option and what ``/your/diffTF/path`` means here; it is actually very easy!
You can also run the example analysis with all TFs instead of only 50. For this, simply modify the ``TFs`` parameter and set it to the special word ``all``, which tells *diffTF* to use all recognized TFs instead of only a specific list (see section :ref:`parameter_TFs` for details).
4. **To run your own analysis**, modify the files ``config.json`` and ``sampleData.tsv``. See the instructions in the section `Run your own analysis`_ for more details.
5. **If your analysis finished successfully**, take a look into the ``FINAL_OUTPUT`` folder within your specified output directory, which contains the summary tables and visualization of your analysis. If you received an error, take a look in Section :ref:`docs-errors` to troubleshoot.
......@@ -161,7 +162,7 @@ Adaptations and notes when running with Singularity
2. ``--singularity-args``: You also need to make all directories that contain files referenced in the *diffTF* configuration file available within the container. By default, only the directory and subdirectories from which you start the analysis are automatically mounted inside the container. Since the *diffTF* source code is outside the ``input`` folder for the example analysis, however, at least the root directory of the Git repository has to be mounted. This is actually quite simple! Just use ``--singularity-args "--bind /your/diffTF/path"`` and replace ``/your/diffTF/path`` with the root path in which you cloned the *diffTF* Git repository (the one that has the subfolders ``example``, ``src`` etc.). If you reference additional files, simply add one or multiple directories to the bind path (use a comma to separate them). For example, if you reference the files ``/g/group1/user1/mm10.fa`` and ``/g/group2/user1/files/bla.txt`` in the configuration file, you may add ``/g/group1/user1,/g/group2/user1/files`` or even just ``/g`` to the bind path (as all files you reference are within ``/g``).
3. ``--singularity-prefix /your/directory`` (optional): You do not have to, but you may want to add the ``--singularity-prefix`` argument to store all ``Singularity`` containers in a central place (here: ``/your/directory``) instead of the local ``.snakemake`` directory. If you intend to run multiple *diffTF* analyses in different folders, you can save space and time because the containers won't have to be downloaded each time and stored in multiple locations.
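As a rough illustration of the ``--bind`` rule described in item 2, the following sketch derives the bind path from the files referenced in a configuration. The helper name ``build_bind_arg`` is hypothetical and not part of *diffTF* or *Snakemake*; it only assembles the string in the form given above.

```python
from pathlib import PurePosixPath


def build_bind_arg(difftf_root, referenced_files):
    """Assemble a '--bind dir1,dir2,...' string (hypothetical helper).

    The diffTF root directory comes first, followed by the parent
    directory of every referenced file, without duplicates.
    """
    dirs = [str(PurePosixPath(difftf_root))]
    for path in referenced_files:
        parent = str(PurePosixPath(path).parent)
        if parent not in dirs:
            dirs.append(parent)
    return "--bind " + ",".join(dirs)


# Paths taken from the example in the text above
bind = build_bind_arg(
    "/your/diffTF/path",
    ["/g/group1/user1/mm10.fa", "/g/group2/user1/files/bla.txt"],
)
print(f'--singularity-args "{bind}"')
```

Binding a common parent such as ``/g`` instead would work equally well, as noted above; the sketch simply lists each directory explicitly.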
Please read the following additional notes and warnings related to ``Singularity``:
......
......@@ -115,41 +115,45 @@ Summary
Details
The root output directory where all output is stored.
.. _parameter_regionExtension:
PARAMETER ``regionExtension``
PARAMETER ``maxCoresPerRule``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Integer >= 0. Default 100. Target region extension in base pairs.
Integer > 0. Default 16. Maximum number of cores to use for rules that support multithreading.
Details
Specifies the number of base pairs each target region (from the peaks file) should be extended in both 5’ and 3’ direction.
This currently affects only rules involving *featureCounts* - that is, *intersectPeaksAndBAM*, while for rule *intersectTFBSAndBAM*, the number of cores is hard-coded to 4. When running *Snakemake* locally, each rule will use at most this number of cores, while in a cluster setting, this value refers to the maximum number of CPUs an individual job / rule will occupy. If the node the job is executed on has fewer cores, then the maximum number of cores on the node will be taken.
.. _parameter_maxCoresPerRule:
.. _parameter_dir_TFBS_sorted:
PARAMETER ``maxCoresPerRule``
PARAMETER ``dir_TFBS_sorted``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Integer > 0. Default 16. Maximum number of cores to use for rules that support multithreading.
Logical. true or false. Default false. Are the files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) already pre-sorted?
Details
This currently affects only rules involving *featureCounts* - that is, *intersectPeaksAndBAM*, while for rule *intersectTFBSAndBAM*, the number of cores is hard-coded to 4. When running *Snakemake* locally, each rule will use at most this number of cores, while in a cluster setting, this value refers to the maximum number of CPUs an individual job / rule will occupy. If the node the job is executed on has fewer cores, then the maximum number of cores on the node will be taken.
If set to true, no additional sorting will be done, saving computation time because the rule *sortTFBSParallel* is not executed. Note that sorting is assumed to be according to the chromosome (first column) and start position (second column), essentially invoking *sort -k1,1 -k2,2n*. If set to false, all files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) will be sorted.
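Before setting ``dir_TFBS_sorted`` to true, it may help to verify that a file really is in the expected order. The sketch below is not part of *diffTF*; the function name ``is_bed_sorted`` is hypothetical, and it assumes tab-separated lines without a header and a byte-wise (C locale) comparison of chromosome names, which may differ from the collation your ``sort`` uses.

```python
def is_bed_sorted(lines):
    """Return True if lines are ordered by chromosome, then numeric start.

    Approximates the order produced by `sort -k1,1 -k2,2n`, assuming
    byte-wise comparison of the first column (an assumption, see above).
    """
    previous = None
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        fields = line.rstrip("\n").split("\t")
        key = (fields[0].encode(), int(fields[1]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


sorted_bed = "chr1\t100\t120\nchr1\t500\t520\nchr2\t50\t70\n"
unsorted_bed = "chr2\t50\t70\nchr1\t100\t120\n"

print(is_bed_sorted(sorted_bed.splitlines()))
print(is_bed_sorted(unsorted_bed.splitlines()))
```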
.. _parameter_dir_TFBS_sorted:
.. _parameter_regionExtension:
PARAMETER ``dir_TFBS_sorted``
PARAMETER ``regionExtension``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Logical. true or false. Default false. Are the files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) already pre-sorted?
Integer >= 0. Default 100. Target region extension in base pairs.
Details
If set to true, no additional sorting will be done, saving computation time because the rule *sortTFBSParallel* is not executed. Note that sorting is assumed to be according to the chromosome (first column) and start position (second column), essentially invoking *sort -k1,1 -k2,2n*. If set to false, all files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) will be sorted.
Specifies the number of base pairs each target region (from the peaks file) should be extended in both 5’ and 3’ direction.
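To make the effect of ``regionExtension`` concrete, here is a minimal sketch of a symmetric extension. The function ``extend_region`` is hypothetical; clamping the start at 0 is an assumption made here for illustration, and chromosome ends are not handled. How *diffTF* treats these edge cases is not stated in this section.

```python
def extend_region(start, end, extension=100):
    """Extend a region by `extension` bp on both sides (illustrative only)."""
    if extension < 0:
        raise ValueError("extension must be >= 0")
    # Assumption: coordinates never drop below 0
    return max(0, start - extension), end + extension


print(extend_region(1000, 1500))     # default extension of 100 bp
print(extend_region(50, 200))        # start is clamped at 0
print(extend_region(1000, 1500, 0))  # 0 leaves the region unchanged
```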
.. _parameter_maxCoresPerRule:
.. _parameter_comparisonType:
......@@ -205,6 +209,8 @@ Details
.. note:: Importantly, if the variable of interest is continuous-valued (i.e., marked as being integer or numeric), then the reported log2 fold change is per unit of change of that variable. That is, in the final circular plot, TFs displayed on the left side have a negative slope per unit of change of that variable, while TFs on the right side have a positive one.
.. _parameter_nPermutations:
......@@ -245,6 +251,20 @@ Details
.. warning:: If bootstraps are used, it is recommended to use a reasonably large number. We recommend a value of 1,000 and found that higher numbers do not add much benefit but instead only increase running time unnecessarily.
.. _parameter_nGCBins:
PARAMETER ``nGCBins``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
Integer > 0. Default 10. Number of GC bins for the binning step.
Details
This parameter sets the number of GC bins that are used during the binning step. The default is to split the data into 10 bins (0-10% GC content, 11-20%, ..., 91-100%), for each of which the significance is calculated independently (see Methods). Too many bins may result in bins being skipped due to an insufficient number of TFBS for that particular bin and TF, while too few bins may introduce GC-specific biases when summarizing the signal across all TFBS.
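The binning idea can be sketched as follows. This is an illustration only, not the *diffTF* implementation: the function ``gc_bin`` is hypothetical, bins are equal-width and zero-based, and the handling of values that fall exactly on a bin boundary is an assumption.

```python
def gc_bin(sequence, n_bins=10):
    """Return the zero-based GC bin of a sequence (illustrative only).

    Uses integer arithmetic to avoid floating-point edge cases; a
    sequence with 100% GC is placed in the last bin.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    return min(gc_count * n_bins // len(seq), n_bins - 1)


print(gc_bin("ATATATAT"))            # 0% GC
print(gc_bin("GCGCATAT"))            # 50% GC
print(gc_bin("GGCC"))                # 100% GC
print(gc_bin("GCGCATAT", n_bins=4))  # fewer, wider bins
```

With fewer bins, each bin collects more TFBS but covers a wider GC range, which mirrors the trade-off described above.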
.. _parameter_TFs:
......@@ -380,11 +400,12 @@ Details
SECTION ``additionalInputFiles``
--------------------------------------------
.. _parameter_refGenome_fasta:
PARAMETER ``refGenome_fasta``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
String. Default ‘hg19.fasta’. Path to the reference genome *fasta* file.
......@@ -435,7 +456,7 @@ Summary
String. Default “”. Path to the file with RNA-Seq counts.
Details
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with normalized RNA-Seq counts. It does not matter whether the values have been variance-stabilized or not, as long as values across samples are comparable. Also, consider filtering lowly expressed genes. For guidance, you may want to read `Question 4 here <https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`_.
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if ``RNASeqIntegration`` (:ref:`parameter_RNASeqIntegration`) is set to true, specify the path to a tab-separated file with normalized RNA-Seq counts. It does not matter whether the values have been variance-stabilized or not, as long as values across samples are comparable. Also, consider filtering lowly expressed genes. For guidance, you may want to read `Question 4 here <https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html>`_.
The first line must be used for labeling the samples, with column names being identical to the sample names as specified in the sample summary table (``summaryFile``, :ref:`parameter_summaryFile`). If you have RNA-Seq data for only a subset of the input samples, this is no problem - the classification will then naturally only be based on the subset. The first column must be named ENSEMBL and it must contain ENSEMBL IDs (e.g., *ENSG00000028277*) without dots. The IDs are then matched to the IDs from the file as specified in ``HOCOMOCO_mapping`` (:ref:`parameter_HOCOMOCO_mapping`).
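The format requirements above can be checked before starting an analysis. The following sketch is not part of *diffTF*; the function ``check_rnaseq_table`` and the sample names are hypothetical, and it only tests the three rules stated here (first column named ENSEMBL, column names known from the sample table, IDs without dots).

```python
import csv
import io


def check_rnaseq_table(handle, sample_names):
    """Return a list of format problems found in a tab-separated counts table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    problems = []
    if header[0] != "ENSEMBL":
        problems.append("first column must be named ENSEMBL")
    # A subset of the samples is fine, but unknown columns are reported
    unknown = [name for name in header[1:] if name not in sample_names]
    if unknown:
        problems.append("columns not in sample table: " + ", ".join(unknown))
    for row in reader:
        if row and "." in row[0]:
            problems.append("ID contains a dot: " + row[0])
    return problems


# Hypothetical example data with one versioned (dotted) ID
example = io.StringIO(
    "ENSEMBL\tsampleA\tsampleB\n"
    "ENSG00000028277\t10.5\t12.1\n"
    "ENSG00000000003.14\t3.0\t4.2\n"
)
problems = check_rnaseq_table(example, {"sampleA", "sampleB", "sampleC"})
print(problems)
```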
......
......@@ -6,8 +6,7 @@ Transcription factor (TF) activity constitutes an important readout of cellular
For a graphical summary of the idea, see the section :ref:`workflow`.
We also put the paper on *bioRxiv*, please read all methodological details here:
`Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF <https://www.biorxiv.org/content/early/2018/07/13/368498>`_.
We also put the paper on *bioRxiv*; please see the section :ref:`citation` for details.
Help, contribute and contact
......@@ -17,6 +16,8 @@ If you have questions or comments, feel free to contact us. We will be happy to
If you have questions, doubts, ideas or problems, please use the `Bitbucket Issue Tracker <https://bitbucket.org/chrarnold/diffTF>`_. We will respond in a timely manner.
.. _citation:
Citation
============================
......
......@@ -11,7 +11,6 @@
"nPermutations": 100,
"nBootstraps": 0,
"nCGBins": 10,
"TFs": "all",
"TFs": "CTCF,CEBPB,SNAI2,CEBPA,UBIP1,CEBPG,CEBPD,ZFX,AP2D,PAX5.S,SNAI1,ZEB1,SP4,MBD2,IRF1,MECP2,PAX5.D,SP3,NFIA.C,SP1.A,IRF7,MYF6,NRF1,DBP,MAZ,NKX28,DLX2,GATA1,P53,ZN143,AIRE,NR2C2,HMGA1,FUBP1,TEAD3,OVOL1,HXD4,KLF1,RXRG,HNF1B,ZIC3,HNF1A,NANOG.S,GFI1,PO3F1,NR2C1,ELF5,TF65.C,NFAC3,TEAD1",
"dir_scripts": "../../src/R",
"RNASeqIntegration": true
......