Commit 46454588 authored by Christian Arnold

Documentation update

parent 6b749988
.. _docs-quickstart:
============================================================
Try it out now!
============================================================
The following quick start briefly summarizes the necessary steps to use our pipeline:
1. Install the necessary tools (Snakemake, samtools, bedtools, and Subread).
.. note:: All tools require Python 3.
We recommend installing them via conda, in which case the installation is as easy as
.. note:: You do not need to uninstall other Python installations or packages to use conda. Even if you already have a system Python, another Python installation (e.g., from the macOS Homebrew package manager), or globally installed pip packages such as pandas and NumPy, you do not need to uninstall, remove, or change any of them before using conda.
If you want to install the tools manually and outside of the conda framework, see the following instructions for each of the tools: `snakemake <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>`_, `samtools <http://www.htslib.org/download>`_, `bedtools <http://bedtools.readthedocs.io/en/latest/content/installation.html>`_, `Subread <http://subread.sourceforge.net>`_.
2. Clone the Git repository:
.. code-block:: Bash
.. _docs-prerequisites:
============================================================
Prerequisites
============================================================
This section lists the required software and how to install them. As outlined in Section :ref:`docs-quickstart`, the easiest way is to install all of them via ``conda``. However, it is of course also possible to install the tools separately.
--------------------------
Snakemake
--------------------------
Please ensure that you have at least version 4.3 installed. In principle, there are `multiple ways to install Snakemake <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>`_. We recommend installing it, along with all the other required software, via conda.
----------------------------
samtools, bedtools, Subread
----------------------------
In addition, `samtools <http://www.htslib.org/download>`_, `bedtools <http://bedtools.readthedocs.io>`_ and `Subread <http://subread.sourceforge.net>`_ are needed to run *diffTF*. We recommend installing them, along with all the other required software, via conda.
--------------------------
R and R packages
--------------------------
.. _docs-runOwnAnalysis:
============================================================
Run your own analysis
============================================================
Running your own analysis is almost as easy as running the example analysis. Carefully read and follow these steps and notes:
1. Copy the files ``config.json`` and ``startAnalysis.sh`` to a directory of your choice.
2. Modify the file ``config.json`` accordingly. For example, we strongly recommend running the analysis for all TF instead of just 50 as for the example analysis. For this, simply change the parameter “TFs” to “all”. See Section :ref:`configurationFile` for details about the meaning of the parameters. Do not delete or rename any parameters or sections.
3. Create a tab-separated file that defines the input data, analogous to the file ``sampleData.tsv`` from the example analysis, and refer to it in the file ``config.json`` (parameter ``summaryFile``).
4. Adapt the file ``startAnalysis.sh`` if necessary (the exact command-line call to Snakemake and the various Snakemake-related parameters).
5. Since running the pipeline might be computationally demanding, read Section :ref:`timeMemoryRequirements` and decide on which machine to run the pipeline. In most cases, we recommend running *diffTF* in a cluster environment. The pipeline is written in Snakemake, and we strongly suggest also reading Section :ref:`workingWithPipeline` to get a basic understanding of how the pipeline works.
.. _docs-details:
.. _workflow:
Workflow
************************************************************
- TF-specific list of TFBS (see :ref:`parameter_dir_TFBS`)
- mapping table (see :ref:`parameter_HOCOMOCO_mapping`)
.. _configurationFile:
General configuration file
==============================
Summary
String. The path to the directory where the R scripts for running the pipeline are stored.
Details
.. warning:: The folder name must be ``R``, and it has to be located in the same folder as the ``Snakefile``.
.. _parameter_RNASeqIntegration:
FILE ``{comparisonType}.TF_vs_peak_distribution.tsv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
This summary table contains various results regarding the TFs, their log2 fold change distributions across all TFBS, and differences between all TFBS and the peaks.
Details
Columns are as follows:
- *TF*: name of the TF
- *permutation*: The number of the permutation.
- *Pos_l2FC*, *Mean_l2FC*, *Median_l2FC*, *sd_l2FC*, *Mode_l2FC*, *skewness_l2FC*: fraction of positive values, mean, median, standard deviation, mode value and Bickel's measure of skewness of the log2 fold change distribution across all TFBS
- *pvalue_raw* and *pvalue_adj*: raw and adjusted (FDR) p-values of the t-test
- *T_statistic*: the value of the T statistic from the t-test
- *TFBS_num*: number of TFBS
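To make the column description concrete, the sketch below parses a table with this layout using only the Python standard library and pulls out TFs passing a (placeholder) adjusted p-value cutoff. The two data rows are invented for illustration; in a real analysis you would open the ``{comparisonType}.TF_vs_peak_distribution.tsv`` file instead of the inline string:

```python
import csv
import io

# Toy stand-in for the real file; the rows are made up.
example = io.StringIO(
    "TF\tpvalue_adj\tTFBS_num\n"
    "CTCF\t0.001\t25000\n"
    "GATA1\t0.2\t800\n"
)
reader = csv.DictReader(example, delimiter="\t")
# Keep TFs whose adjusted p-value passes a placeholder 0.05 cutoff.
significant = [row["TF"] for row in reader if float(row["pvalue_adj"]) < 0.05]
print(significant)  # ['CTCF']
```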
FOLDER ``TEMP``
=============================================
Stores temporary and intermediate files. Since they are usually not relevant for the user, they are explained only very briefly here.
Subfolder ``SortedBAM``
------------------------------
- ``{TF}_TFBS.sorted.bed`` for each TF ``{TF}``: Produced in rule ``sortPWM``. Coordinate-sorted version of the input TFBS.
- ``{comparisonType}.allTFBS.peaks.bed.gz``: Produced in rule ``intersectPeaksAndTFBS``. BED file containing all TFBS from all TF that overlap with the peaks before motif extension
.. _workingWithPipeline:
Working with *diffTF* and FAQs
************************************************************
General remarks
==============================
`diffTF` is programmed as a Snakemake pipeline, which offers many advantages to the user because each step can easily be modified, parts of the pipeline can be rerun, and running the pipeline on different systems is easy with minimal modifications. However, with great flexibility comes a price: the learning curve to work with the pipeline might be a bit higher, especially if you have no Snakemake experience. For a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable.
Simply put, the Snakemake pipeline runs various *rules*. Each *rule* can be thought of as a single task, such as sorting a file or running an R script, that has, among other features, an input and an output. You can see in the ``Snakefile`` what these rules are and what they do. Importantly, each rule has a name, which is also displayed during execution. Different rules are connected through their input and output files, so that the output of one rule is the input of a subsequent rule, thereby creating *dependencies*, which ultimately leads to the directed acyclic graph (*DAG*) shown in Section :ref:`workflow`. In *diffTF*, a rule is typically executed separately for each TF.
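To illustrate the rule concept, a toy Snakemake rule is sketched below. It is *not* taken from the actual ``Snakefile``; the rule name, file patterns, and command are hypothetical, but the structure (name, input, output, action) is what every rule in the pipeline looks like:

```python
# Hypothetical rule: coordinate-sort a TF-specific TFBS file.
# "input" and "output" are what connect this rule to others in the DAG.
rule sort_tfbs:
    input:
        "{TF}_TFBS.bed"
    output:
        "{TF}_TFBS.sorted.bed"
    shell:
        "sort -k1,1 -k2,2n {input} > {output}"
```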
The number of *jobs* or rules to execute in total with the pipeline can roughly be calculated as 4 * ``nTF``, where ``nTF`` stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:
1. Sorting the TFBS list
2. Calculating read counts for each TFBS within the peak regions
3. Differential accessibility analysis
4. Binning step
In addition, a few other rules are executed, but they do not add much to the overall rule count.
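As a sanity check, the estimate above can be written down as a tiny helper (hypothetical, not part of the pipeline; the constant for the remaining global rules is a placeholder, not an exact figure):

```python
def estimate_job_count(n_tf, n_other_rules=10):
    """Rough estimate of the number of Snakemake jobs in a diffTF run:
    four per-TF rules plus a small, roughly constant number of
    global rules (n_other_rules is a placeholder)."""
    return 4 * n_tf + n_other_rules

print(estimate_job_count(50))   # 210
print(estimate_job_count(640))  # 2570
```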
.. _timeMemoryRequirements:
Executing diffTF - Running times and memory requirements
===============================================================
*diffTF* is computationally demanding, and depending on the sample size and the number of peaks, a rather large amount of resources may be required to run it. In the following, we discuss various issues related to time and memory requirements and provide some general guidelines that worked well for us.
.. warning:: We generally advise running diffTF in a cluster environment. For a small analysis, running locally on your machine might work just fine (see the example analysis in the Git repository), but running time increases substantially due to the limited number of available cores.
Analysis size
---------------
We now provide a *very rough* classification of analyses into small, medium, and large with respect to sample size and number of peaks:
- Small: Fewer than 10-15 samples, number of peaks not exceeding 50,000-80,000, normal read depth per sample
- Large: Number of samples larger than, say, 20, or number of peaks clearly exceeding 100,000, or very high read depth per sample
- Medium: Anything between small and large
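These cut-offs are deliberately fuzzy, but as a sketch, a classification along the lines of the bullet points above (thresholds taken literally from the text and treated as rough heuristics; read depth is ignored for simplicity) could look like:

```python
def classify_analysis(n_samples, n_peaks):
    """Rough small/medium/large classification following the
    guidelines above; the thresholds are heuristics, not hard rules."""
    if n_samples > 20 or n_peaks > 100_000:
        return "large"
    if n_samples <= 15 and n_peaks <= 80_000:
        return "small"
    return "medium"

print(classify_analysis(8, 50_000))    # small
print(classify_analysis(18, 90_000))   # medium
print(classify_analysis(84, 120_000))  # large
```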
Memory
---------------
Some notes regarding memory:
- Disk space: Make sure you have enough space left on your device. As a guideline, an analysis with 8 samples needs around 12 GB of disk space, while a large analysis with 84 samples needs around 45 GB.
- Machine memory: Although most steps of the pipeline have a modest memory footprint, some (e.g., rule ``prepareBinning``) may need a lot of RAM during execution, depending on the analysis size. We recommend having at least 16 GB available for a large analysis (see above).
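Under the (strong, purely illustrative) assumption that disk usage grows roughly linearly with sample count, the two guideline figures above (12 GB at 8 samples, 45 GB at 84 samples) give a back-of-the-envelope estimate:

```python
def estimate_disk_gb(n_samples):
    # Linear interpolation between the two guideline points:
    # 8 samples -> ~12 GB, 84 samples -> ~45 GB. A rough heuristic only.
    return 12 + (n_samples - 8) * (45 - 12) / (84 - 8)

print(round(estimate_disk_gb(30), 1))  # 21.6
```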
Number of cores
-----------------
Some notes regarding the number of available cores:
- diffTF can be invoked in a highly parallelized manner, so the more CPUs are available, the better.
- You can use the ``--cores`` option when invoking Snakemake to specify the number of cores available for the analysis. If you specify 4 cores, for example, up to 4 rules can run in parallel (if each of them occupies only 1 core), or 1 rule can use up to 4 cores.
- We strongly recommend running diffTF in a cluster environment due to the massive parallelization. With Snakemake, this is straightforward. Simply do the following:
- Write a cluster configuration file that specifies which resources each rule needs. For guidance and user convenience, we provide different cluster configuration files for a small and a large analysis. See the folder ``src/clusterConfigurationTemplates`` for examples. Note that these are rough estimates only. See the `Snakemake documentation <http://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#cluster-configuration>`_ for details on how to use cluster configuration files.
- Invoke Snakemake with one of the available cluster modes, which will depend on your cluster system. We used ``--cluster`` and tested the pipeline extensively with *LSF/BSUB* and *SLURM*. For more details, see the `Snakemake documentation <http://snakemake.readthedocs.io/en/latest/executable.html#cluster-execution>`_.
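For orientation, a minimal cluster configuration in the JSON format Snakemake understands might look like the sketch below. The rule name and resource keys are hypothetical placeholders; use the templates in ``src/clusterConfigurationTemplates`` as the actual starting point:

```json
{
    "__default__": {
        "memory": "4000",
        "nCPUs": "1"
    },
    "diffPeaks": {
        "memory": "16000",
        "nCPUs": "4"
    }
}
```

Values from such a file can then be referenced in the ``--cluster`` submission string (e.g., ``{cluster.memory}``), as described in the Snakemake documentation.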
Total running time
--------------------
Some notes regarding the total running time:
- The total running time is very difficult to estimate beforehand and depends on many parameters, most importantly the number of samples, their read depth, the number of peaks, and the number of TF included in the analysis.
- For a small analysis such as the example analysis in the Git repository, running times are roughly 30 minutes with 2 cores for 50 TF and a few hours with all 640 TF.
- For a large analysis, running time will be up to a day or so when executed on a cluster machine.
Frequently asked questions
==============================
Here are a few typical use cases; we will extend this list in the future as the need arises:
1. I received an error, and the pipeline did not finish.
As explained in Section :ref:`docs-errors`, you first have to identify and fix the error. Rerunning then becomes trivial: just restart Snakemake, and it will pick up where it left off, at the step that produced the error.
2. I want to rerun a specific part of the pipeline only.
.. _docs-quickstart:
============================================================
Try it out now!
============================================================
The following quick start briefly summarizes the necessary steps to use our pipeline:
1. Install the necessary tools (Snakemake, samtools, bedtools, and Subread).
.. note:: Note that all tools require Python 3.
We recommend installing them via conda, in which case the installation is as easy as
......@@ -20,6 +20,7 @@ The following quick start briefly summarizes the necessary steps to use our pipe
.. note:: You do not need to uninstall other Python installations or packages in order to use conda. Even if you already have a system Python, another Python installation from a source such as the macOS Homebrew package manager and globally installed packages from pip such as pandas and NumPy, you do not need to uninstall, remove, or change any of them before using conda.
If you want to install the tools manually and outside of the conda framework, see the following instructions for each of the tools: `snakemake <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>`_, `samtools <http://www.htslib.org/download>`_, `bedtools <http://bedtools.readthedocs.io/en/latest/content/installation.html>`_, `Subread <http://subread.sourceforge.net>`_.
2. Clone the Git repository:
.. code-block:: Bash
......@@ -57,26 +58,22 @@ The following quick start briefly summarizes the necessary steps to use our pipe
.. _docs-prerequisites:
============================================================
Prerequisites
============================================================
This section lists the required software and how to install them. As outlined in Section :ref:`docs-quickstart`, the easiest way is to install all of them via ``conda``. However, it is of course also possible to install the tools separately.
--------------------------
Snakemake
--------------------------
Please ensure that you have at least version 4.3 installed. Principally, there are `multiple ways to install Snakemake <http://snakemake.readthedocs.io/en/stable/getting_started/installation.html>`_. We recommend installing it, along with all the other required software, via conda.
----------------------------
samtools, bedtools, Subread
----------------------------
In addition, `samtools <http://www.htslib.org/download>`_, `bedtools <http://bedtools.readthedocs.io>`_ and `Subread <http://subread.sourceforge.net>`_ are needed to run *diffTF*. We recommend installing them, along with all the other required software, via conda.
--------------------------
R and R packages
--------------------------
......@@ -92,18 +89,13 @@ A working ``R`` installation is needed and a number of packages from either CRAN
.. _docs-runOwnAnalysis:
============================================================
Run your own analysis
============================================================
Running your own analysis is almost as easy as running the example analysis. Carefully read and follow the following steps and notes:
1. Copy the files ``config.json`` and ``startAnalysis.sh`` to a directory of your choice.
2. Modify the file ``config.json`` accordingly. For example, we strongly recommend running the analysis for all TF instead of just 50 as for the example analysis. For this, simply change the parameter “TFs” to “all”. See Section 4 (TODO) for details about the meaning of the parameters. Do not delete or rename any parameters or sections.
2. Modify the file ``config.json`` accordingly. For example, we strongly recommend running the analysis for all TF instead of just 50 as for the example analysis. For this, simply change the parameter “TFs” to “all”. See Section :ref:`configurationFile` for details about the meaning of the parameters. Do not delete or rename any parameters or sections.
3. Create a tab-separated file that defines the input data, in analogy to the file ``sampleData.tsv`` from the example analysis, and refer to that in the file ``config.json`` (parameter ``summaryFile``)
4. Adapt the file ``startAnalysis.sh`` if necessary (the exact command line call to Snakemake and the various Snakemake-related parameters)
.. note::
- For Snakemake to run properly, the R folder with the scripts has to be in the same folder as the Snakefile.
- Since running the pipeline might be computationally demanding, make sure you have enough space left on your device. As a guideline, analysis with 8 samples need around 12 GB of disk space, while a large analysis with 84 samples needs around 45 GB. Also, adjust the number of available cores accordingly. The pipeline can be invoked in a highly parallelized manner, so the more cores are available, the better!
- The pipeline is written in Snakemake, so for a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable. The same holds true for running the pipeline in a cluster setting. We recommend using a proper cluster configuration file in addition. For guidance and user convenience, we provide different cluster configuration files for a small (up to 10-15 samples) and large (>15 samples) analysis. See the folder ``src/clusterConfigurationTemplates`` for examples. Note that the sample number guidelines above are very rough estimates only. See the `Snakemake documentation <http://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#cluster-configuration>`_ for details for how to use cluster configuration files.
5. Since running the pipeline might be computationally demanding, read Section :ref:`timeMemoryRequirements` and decide on which machine to run the pipeline. In most cases, we recommend running *diffTF* in a cluster environment. The pipeline is written in Snakemake, and we strongly suggest to also read Section :ref:`workingWithPipeline` to get a basic understanding of how the pipeline works.
.. _docs-details:
.. _workflow:
Workflow
************************************************************
......@@ -43,6 +43,7 @@ In addition, the following files are need, all of which we provide already for h
- TF-specific list of TFBS (see :ref:`parameter_dir_TFBS`)
- mapping table (see :ref:`parameter_HOCOMOCO_mapping`)
.. _configurationFile:
General configuration file
==============================
......@@ -163,6 +164,9 @@ PARAMETER ``dir_scripts``
Summary
String. The path to the directory where the R scripts for running the pipeline are stored.
Details
.. warning:: The folder name must be ``R``, and it has to be located in the same folder as the ``Snakefile``.
.. _parameter_RNASeqIntegration:
......@@ -391,13 +395,14 @@ Details
FILE ``{comparisonType}.TF_vs_peak_distribution.tsv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Summary
TODO
This summary table contains various results regarding TFs, their log2 fold change distribution across all TFBS and differences between all TFBS and the peaks
Details
Columns are as follows:
- *TF*: name of the TF
- *permutation*: The number of the permutation.
- *Pos_l2F*C, *Mean_l2FC*, *Median_l2FC*, *sd_l2FC*, *Mode_l2FC*, *skewness_l2FC*: fraction of positive values, mean, median, standard deviation, mode value and Bickel's measure of skewness of the log2 fold change distribution across all TFBS
- *Pos_l2FC*, *Mean_l2FC*, *Median_l2FC*, *sd_l2FC*, *Mode_l2FC*, *skewness_l2FC*: fraction of positive values, mean, median, standard deviation, mode value and Bickel's measure of skewness of the log2 fold change distribution across all TFBS
- *pvalue_raw* and *pvalue_adj*: raw and adjusted (fdr) p-value of the t-test
- *T_statistic*: the value of the T statistic from the t-test
- *TFBS_num*: number of TFBS
Stores results related to the user-specified extension size:
- ``{TF}_TFBS.sorted.bed`` for each TF ``{TF}``: Produced in rule ``sortPWM``. Coordinate-sorted version of the input TFBS.
- ``{comparisonType}.allTFBS.peaks.bed.gz``: Produced in rule ``intersectPeaksAndTFBS``. BED file containing all TFBS from all TFs that overlap with the peaks before motif extension.
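To take a quick look at these intermediate files, standard command-line tools are sufficient. In the following sketch, the comparison name is a made-up placeholder; substitute the ``{comparisonType}`` value of your own analysis:

.. code-block:: Bash

    # Hypothetical comparison name; replace with your own
    comparisonType="mutant.vs.wildtype"

    # Peek at the first TFBS that overlap the peaks
    zcat "${comparisonType}.allTFBS.peaks.bed.gz" | head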
.. _workingWithPipeline:
Working with *diffTF* and FAQs
************************************************************
General remarks
==============================
`diffTF` is programmed as a Snakemake pipeline, which offers many advantages to the user because each step can easily be modified, parts of the pipeline can be rerun, and running the pipeline on different systems is easy with minimal modifications. However, with great flexibility comes a price: the learning curve to work with the pipeline might be a bit higher, especially if you have no Snakemake experience. For a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable.
Simply put, the Snakemake pipeline runs various *rules*. Each *rule* can be thought of as a single task, such as sorting a file or running an R script, that has, among other features, an input and an output. You can see in the ``Snakefile`` what these rules are and what they do. Importantly, each rule has a name, which is also displayed during execution. Different rules are connected through their input and output files, so that the output of one rule is the input for a subsequent rule, thereby creating *dependencies*, which ultimately leads to the directed acyclic graph (*DAG*) that you have seen in Section :ref:`workflow`. In *diffTF*, a rule is typically executed separately for each TF.
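Snakemake can list these rules and visualize the DAG directly. Assuming Snakemake (and, for the image, Graphviz) is installed and you are in the directory containing the ``Snakefile``, a dry run and a DAG export look like this:

.. code-block:: Bash

    # List all jobs that would be executed, without actually running them
    snakemake --dryrun

    # Render the DAG of jobs as a PDF (requires Graphviz's "dot")
    snakemake --dag | dot -Tpdf > dag.pdf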
The number of *jobs* or rules to execute in total with the pipeline can roughly be calculated as 4 * ``nTF``, where ``nTF`` stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:
1. Sorting the TFBS list
2. Calculating read counts for each TFBS within the peak regions
3. Differential accessibility analysis
4. Binning step
In addition, a few other rules are executed, but they do not add much to the overall rule count.
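As a back-of-the-envelope check of this formula, take a hypothetical full analysis with all 640 TFs:

.. code-block:: Bash

    # Rough job count: four TF-specific rules per TF
    nTF=640                  # hypothetical number of TFs in the analysis
    nJobs=$((4 * nTF))
    echo "Expect roughly ${nJobs} TF-specific jobs"   # prints: Expect roughly 2560 TF-specific jobs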
.. _timeMemoryRequirements:
Executing *diffTF* - running times and memory requirements
===============================================================
*diffTF* is computationally demanding, and depending on the sample size and the number of peaks, a rather high amount of resources may be required to run it. In the following, we discuss various issues related to time and memory requirements, and we provide some general guidelines that worked well for us.
.. warning:: We generally advise running *diffTF* in a cluster environment. For a small analysis, running locally on your machine might work just fine (see the example analysis in the Git repository), but running time increases substantially due to the limited number of available cores.
Analysis size
---------------
We now provide a *very rough* classification into small, medium and large with respect to the sample size and the number of peaks:
- Small: Fewer than 10-15 samples, number of peaks not exceeding 50,000-80,000, normal read depth per sample
- Large: More than roughly 20 samples, or a number of peaks clearly exceeding 100,000, or a very high read depth per sample
- Medium: Anything between small and large
Memory
---------------
Some notes regarding memory:
- Disk space: Make sure you have enough space left on your device. As a guideline, an analysis with 8 samples needs around 12 GB of disk space, while a large analysis with 84 samples needs around 45 GB.
- Machine memory: Although most steps of the pipeline have a modest memory footprint, depending on the analysis size, some others (e.g., rule ``prepareBinning``) may need a lot of RAM during execution. We recommend having at least 16 GB available for a large analysis (see above).
Number of cores
-----------------
Some notes regarding the number of available cores:
- diffTF can be invoked in a highly parallelized manner, so the more CPUs are available, the better.
- you can use the ``--cores`` option when invoking Snakemake to specify the number of cores that are available for the analysis. If you specify 4 cores, for example, up to 4 rules can be run in parallel (if each of them occupies only 1 core), or 1 rule can use up to 4 cores.
- we strongly recommend running *diffTF* in a cluster environment due to the massive parallelization. With Snakemake, it is easy to run *diffTF* in a cluster setting. Simply do the following:
- write a cluster configuration file that specifies which resources each rule needs. For guidance and user convenience, we provide cluster configuration files for a small and a large analysis; see the folder ``src/clusterConfigurationTemplates`` for examples. Note that these are rough estimates only. See the `Snakemake documentation <http://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#cluster-configuration>`_ for details on how to use cluster configuration files.
- invoke Snakemake with one of the available cluster modes, which will depend on your cluster system. We used ``--cluster`` and tested the pipeline extensively with *LSF/BSUB* and *SLURM*. For more details, see the `Snakemake documentation <http://snakemake.readthedocs.io/en/latest/executable.html#cluster-execution>`_.
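Putting these points together, invocations might look as follows. This is a sketch only: the configuration file name, the placeholder keys (``nCPUs``, ``memory``, ``maxTime``), and the scheduler options are assumptions that must match your own cluster configuration file and scheduler:

.. code-block:: Bash

    # Local run with up to 4 cores
    snakemake --snakefile Snakefile --cores 4

    # SLURM example: resource values are filled in per rule from the cluster
    # configuration file via the {cluster.*} placeholders
    snakemake --snakefile Snakefile --jobs 100 \
        --cluster-config src/clusterConfigurationTemplates/cluster.largeAnalysis.json \
        --cluster "sbatch -n {cluster.nCPUs} --mem {cluster.memory} -t {cluster.maxTime}"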
Total running time
--------------------
Some notes regarding the total running time:
- the total running time is very difficult to estimate beforehand and depends on many parameters, most importantly the number of samples, their read depth, the number of peaks, and the number of TFs included in the analysis.
- for a small analysis such as the example analysis in the Git repository, running times are roughly 30 minutes with 2 cores for 50 TFs and a few hours with all 640 TFs.
- for a large analysis, running time will be up to a day or so when executed on a cluster.
Frequently asked questions
==============================
Here are a few typical use cases, which we will extend regularly as the need arises:
1. I received an error, and the pipeline did not finish.
As explained in Section :ref:`docs-errors`, you first have to identify and fix the error. Rerunning is then easy: simply restart Snakemake, and it will pick up where it left off, at the step that produced the error.
2. I want to rerun a specific part of the pipeline only.
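A common way to do this (a sketch; the rule name below is one example from the ``Snakefile``) is to use Snakemake's ``--forcerun`` or ``--until`` options:

.. code-block:: Bash

    # Rerun the rule "intersectPeaksAndTFBS" and everything downstream of it
    snakemake --forcerun intersectPeaksAndTFBS

    # Or run the pipeline only up to (and including) that rule
    snakemake --until intersectPeaksAndTFBS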