Commit 32ae9b3e authored by Christian Arnold

Minor name changes and doc updates

parent fe5394fb
......@@ -704,16 +704,18 @@ Working with *diffTF* and FAQs
General remarks
==============================
`diffTF` is programmed as a Snakemake pipeline, which offers many advantages to the user because each step can easily be modified, parts of the pipeline can be rerun, and running the pipeline on different systems is easy with minimal modifications. However, with great flexibility comes a price: the learning curve to work with the pipeline might be a bit higher, especially if you have no Snakemake experience. For a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable.
`diffTF` is programmed as a Snakemake pipeline. Snakemake is a bioinformatics workflow manager in which workflows are described in a human-readable, Python-based language. It offers many advantages to the user: each step can easily be modified, parts of the pipeline can be rerun, and workflows can be seamlessly scaled to server, cluster, grid, and cloud environments with no or only minimal modifications to the workflow definition. However, great flexibility comes at a price: the learning curve for working with the pipeline may be a bit steeper, especially if you have no Snakemake experience. For a deeper understanding and for troubleshooting errors, some knowledge of Snakemake is invaluable.
Simply put, the Snakemake pipeline runs various *rules*. Each *rule* can be thougt of as a single task, such as sorting a file, running an R script, etc that has, among other features, an input and an output. You can see in the ``Snakefile`` what these rules are and what they do. Importantly, each rule has a name, which is also displayed during execution. Different rules are connected through their input and output files, so that the output of one rule is the input for a subsequent rule, thereby creasting *dependencies*, whichn ultimately leads to the directed acyclic graph (*DAG*) that you have seen in Section :ref:`workflow`. In diffTF, a rule is typically executed separately for each TF.
Simply put, Snakemake executes various *rules*. Each *rule* can be thought of as a single *recipe* or task, such as sorting a file or running an R script. Each rule has, among other features, a name, an input, an output, and the command that is executed. You can see in the ``Snakefile`` what these rules are and what they do. During execution, the rule name is displayed, so you know exactly which step the pipeline is executing at any given moment. Different rules are connected through their input and output files, so that the output of one rule becomes the input of a subsequent rule, thereby creating *dependencies*, which ultimately leads to the directed acyclic graph (*DAG*) that describes the whole workflow. You have seen such a graph in Section :ref:`workflow`.
The number of *jobs* or rules to execute in total with the pipeline can roughly be calculated as 4 * ``nTF``, where ``nTF`` stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:
In diffTF, a rule is typically executed separately for each TF. One example of a particular rule instance is sorting the TFBS list for the TF CTCF.
1. Sorting the TFBS list
2. Calulating read counts for each TFBS within the peak regions
3. Differential accessibility analysis
4. Binning step
In diffTF, the total number of *jobs* or rules to execute can roughly be calculated as 4 * ``nTF``, where ``nTF`` stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:
1. Sorting the TFBS list (rule ``sortTFBS``)
2. Calculating read counts for each TFBS within the peak regions (rule ``intersectTFBSAndBAM``)
3. Differential accessibility analysis (rule ``analyzeTF``)
4. Binning step (rule ``binningTF``)
In addition, a few other rules are executed, but they do not add much to the overall rule count.
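To illustrate the rule concept, here is a minimal, simplified sketch of a Snakemake rule. It is modeled on the ``sortTFBS`` rule from the ``Snakefile``, but the file paths and the sort command shown here are illustrative assumptions, not the pipeline's actual definition::

    rule sortTFBS:
        input:
            bed = "input/{TF}_TFBS.bed"
        output:
            bedSorted = "output/{TF}_TFBS.sorted.bed"
        shell:
            "sort -k1,1 -k2,2n {input.bed} > {output.bedSorted}"

Snakemake substitutes the ``{TF}`` wildcard for each TF, so a single rule definition like this generates one job per TF, which is why the total job count scales with ``nTF``.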
......
......@@ -942,14 +942,15 @@
<span id="workingwithpipeline"></span><h1>Working with <em>diffTF</em> and FAQs<a class="headerlink" href="#working-with-difftf-and-faqs" title="Permalink to this headline"></a></h1>
<div class="section" id="general-remarks">
<h2>General remarks<a class="headerlink" href="#general-remarks" title="Permalink to this headline"></a></h2>
<p><cite>diffTF</cite> is programmed as a Snakemake pipeline, which offers many advantages to the user because each step can easily be modified, parts of the pipeline can be rerun, and running the pipeline on different systems is easy with minimal modifications. However, with great flexibility comes a price: the learning curve to work with the pipeline might be a bit higher, especially if you have no Snakemake experience. For a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable.</p>
<p>Simply put, the Snakemake pipeline runs various <em>rules</em>. Each <em>rule</em> can be thougt of as a single task, such as sorting a file, running an R script, etc that has, among other features, an input and an output. You can see in the <code class="docutils literal"><span class="pre">Snakefile</span></code> what these rules are and what they do. Importantly, each rule has a name, which is also displayed during execution. Different rules are connected through their input and output files, so that the output of one rule is the input for a subsequent rule, thereby creasting <em>dependencies</em>, whichn ultimately leads to the directed acyclic graph (<em>DAG</em>) that you have seen in Section <a class="reference internal" href="#workflow"><span class="std std-ref">Workflow</span></a>. In diffTF, a rule is typically executed separately for each TF.</p>
<p>The number of <em>jobs</em> or rules to execute in total with the pipeline can roughly be calculated as 4 * <code class="docutils literal"><span class="pre">nTF</span></code>, where <code class="docutils literal"><span class="pre">nTF</span></code> stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:</p>
<p><cite>diffTF</cite> is programmed as a Snakemake pipeline. Snakemake is a bioinformatics workflow manager in which workflows are described in a human-readable, Python-based language. It offers many advantages to the user: each step can easily be modified, parts of the pipeline can be rerun, and workflows can be seamlessly scaled to server, cluster, grid, and cloud environments with no or only minimal modifications to the workflow definition. However, great flexibility comes at a price: the learning curve for working with the pipeline may be a bit steeper, especially if you have no Snakemake experience. For a deeper understanding and for troubleshooting errors, some knowledge of Snakemake is invaluable.</p>
<p>Simply put, Snakemake executes various <em>rules</em>. Each <em>rule</em> can be thought of as a single <em>recipe</em> or task, such as sorting a file or running an R script. Each rule has, among other features, a name, an input, an output, and the command that is executed. You can see in the <code class="docutils literal"><span class="pre">Snakefile</span></code> what these rules are and what they do. During execution, the rule name is displayed, so you know exactly which step the pipeline is executing at any given moment. Different rules are connected through their input and output files, so that the output of one rule becomes the input of a subsequent rule, thereby creating <em>dependencies</em>, which ultimately leads to the directed acyclic graph (<em>DAG</em>) that describes the whole workflow. You have seen such a graph in Section <a class="reference internal" href="#workflow"><span class="std std-ref">Workflow</span></a>.</p>
<p>In diffTF, a rule is typically executed separately for each TF. One example of a particular rule instance is sorting the TFBS list for the TF CTCF.</p>
<p>In diffTF, the total number of <em>jobs</em> or rules to execute can roughly be calculated as 4 * <code class="docutils literal"><span class="pre">nTF</span></code>, where <code class="docutils literal"><span class="pre">nTF</span></code> stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:</p>
<ol class="arabic simple">
<li>Sorting the TFBS list</li>
<li>Calulating read counts for each TFBS within the peak regions</li>
<li>Differential accessibility analysis</li>
<li>Binning step</li>
<li>Sorting the TFBS list (rule <code class="docutils literal"><span class="pre">sortTFBS</span></code>)</li>
<li>Calculating read counts for each TFBS within the peak regions (rule <code class="docutils literal"><span class="pre">intersectTFBSAndBAM</span></code>)</li>
<li>Differential accessibility analysis (rule <code class="docutils literal"><span class="pre">analyzeTF</span></code>)</li>
<li>Binning step (rule <code class="docutils literal"><span class="pre">binningTF</span></code>)</li>
</ol>
<p>In addition, a few other rules are executed, but they do not add much to the overall rule count.</p>
</div>
......
......@@ -178,8 +178,8 @@ suffixTFBS = '_TFBS.bed'
allTF = []
if config["par_general"]["TFs"] == "all":
PWM_FILES = os.popen("ls " + config["additionalInputFiles"]["dir_TFBS"]).readlines()
for TFCur in PWM_FILES:
TFBS_FILES = os.popen("ls " + config["additionalInputFiles"]["dir_TFBS"]).readlines()
for TFCur in TFBS_FILES:
if not os.path.basename(TFCur.replace('\n', '')).endswith(suffixTFBS):
continue
TFCurBasename = os.path.basename(TFCur.replace('\n', '').replace(suffixTFBS, ''))
......@@ -297,7 +297,7 @@ rule filterSexChromosomesAndSortPeaks:
overlapPattern = "overlaps.bed.gz"
rule sortPWM:
rule sortTFBS:
input:
flag = ancient(rules.checkParameterValidity.output.flag),
bed = config["additionalInputFiles"]["dir_TFBS"] + "/{TF}" + suffixTFBS
......@@ -366,16 +366,16 @@ rule intersectPeaksAndBAM:
# TF-specific part:
rule intersectPeaksAndPWM:
rule intersectPeaksAndTFBS:
input:
consensusPeaks = rules.filterSexChromosomesAndSortPeaks.output.consensusPeaks_sorted,
#allPWMs = expand('{dir}/{compType}{TF}_TFBS.sorted.bed.gz', dir = TEMP_DIR, compType = compType, TF = allTF)
allPWMs = expand('{dir}/{compType}{TF}_TFBS.sorted.bed', dir = TEMP_DIR, compType = compType, TF = allTF)
#allTFBS = expand('{dir}/{compType}{TF}_TFBS.sorted.bed.gz', dir = TEMP_DIR, compType = compType, TF = allTF)
allTFBS = expand('{dir}/{compType}{TF}_TFBS.sorted.bed', dir = TEMP_DIR, compType = compType, TF = allTF)
output:
TFBSinPeaks_bed = expand('{dir}/{compType}allPWM.peaks.bed.gz', dir = TEMP_DIR, compType = compType),
TFBSinPeaksMod_bed = expand('{dir}/{compType}allPWM.peaks.extension.bed.gz', dir = TEMP_EXTENSION_DIR, compType = compType)
TFBSinPeaks_bed = expand('{dir}/{compType}allTFBS.peaks.bed.gz', dir = TEMP_DIR, compType = compType),
TFBSinPeaksMod_bed = expand('{dir}/{compType}allTFBS.peaks.extension.bed.gz', dir = TEMP_EXTENSION_DIR, compType = compType)
log:
message: "{ruleDisplayMessage} Obtain binding sites from peaks: Intersect files {input.allPWMs} and {input.consensusPeaks}..."
message: "{ruleDisplayMessage} Obtain binding sites from peaks: Intersect files {input.allTFBS} and {input.consensusPeaks}..."
threads: 1
params:
extension = config["par_general"]["regionExtension"],
......@@ -384,7 +384,7 @@ rule intersectPeaksAndPWM:
""" ulimit -n {params.ulimitMax} &&
bedtools intersect \
-a {input.consensusPeaks} \
-b {input.allPWMs} \
-b {input.allTFBS} \
-wa -wb \
-sorted \
-filenames \
......@@ -394,10 +394,10 @@ rule intersectPeaksAndPWM:
rule intersectTFBSAndBAM:
input:
bed = rules.intersectPeaksAndPWM.output.TFBSinPeaksMod_bed,
bed = rules.intersectPeaksAndTFBS.output.TFBSinPeaksMod_bed,
allBAMs = expand('{dir}/{allBasenamesBAM}.bam', dir = TEMP_BAM_DIR, allBasenamesBAM = allBamFilesBasename)
output:
saf = temp(expand('{dir}/{compType}{{TF}}.allPWM.peaks.extension.saf', dir = TEMP_EXTENSION_DIR, compType = compType)),
saf = temp(expand('{dir}/{compType}{{TF}}.allTFBS.peaks.extension.saf', dir = TEMP_EXTENSION_DIR, compType = compType)),
BAMOverlapRaw = temp(TF_DIR + "/{TF}/" + extDir + "/" + compType + "{TF}.allBAMs.overlaps.bed"),
BAMOverlap = TF_DIR + "/{TF}/" + extDir + "/" + compType + "{TF}.allBAMs.overlaps.bed.gz"
log:
......