Version 1.2.5, see Changelog for details

Commit 927941be authored 6 years ago by Christian Arnold
--- a/VERSION
+++ b/VERSION
-1.2.4
+1.2.5
--- a/docs/chapter1.rst
+++ b/docs/chapter1.rst
@@ -166,4 +166,4 @@ Adaptations and notes when running with Singularity

 Please read the following additional notes and warnings related to ``Singularity``:

- .. warning:: If you use ``Singularity`` version 3, make sure you have at least version 3.0.2 installed or the latest pull from version 3.0.1, as there was an issue with Snakemake and particular ``Singularity`` versions. For more details, see `here <https://bitbucket.org/snakemake/snakemake/issues/1017/snakemake-process-suspended-upon-execution>`_.
+- .. warning:: If you use ``Singularity`` version 3, make sure you have at least version 3.0.3 installed, as there was an issue with Snakemake and particular ``Singularity`` versions. For more details, see `here <https://bitbucket.org/snakemake/snakemake/issues/1017/snakemake-process-suspended-upon-execution>`_.
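
The version requirement in the updated warning can be checked programmatically. Below is a minimal sketch, assuming the installed version is available as a string that contains a `major.minor.patch` number; the function name `singularity_version_ok` and the constant `MIN_SINGULARITY_3` are hypothetical and not part of diffTF or Snakemake.

```python
import re

# Minimum version named in the warning for the Singularity 3 series.
MIN_SINGULARITY_3 = (3, 0, 3)


def singularity_version_ok(version_text):
    """Return False only for Singularity 3.x versions older than 3.0.3.

    Hypothetical helper: version_text is any string that contains a
    major.minor.patch number, for example "3.0.3".
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no major.minor.patch version found in: " + version_text)
    version = tuple(int(part) for part in match.groups())
    if version[0] != 3:
        # The warning only concerns version 3, so other series are not judged here.
        return True
    return version >= MIN_SINGULARITY_3
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "3.0.10" as older than "3.0.3".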
--- a/docs/chapter2.rst
+++ b/docs/chapter2.rst
@@ -127,18 +127,6 @@ Details
  This currently affects only rules involving *featureCounts* - that is, *intersectPeaksAndBAM* while for rule *intersectTFBSAndBAM*, the number of cores is hard-coded to 4. When running *Snakemake* locally, each rule will use at most this number of cores, while in a cluster setting, this value refers to the maximum number of CPUs an individual job / rule will occupy. If the node the job is executed on has fewer cores, then the maximum number of cores on the node will be taken.


-.. _parameter_dir_TFBS_sorted:
-
-PARAMETER ``dir_TFBS_sorted``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Summary
-    Logical. true or false. Default false. Are the files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) already pre-sorted?
-
-Details
-  If set to true, no additional sorting will be done, saving computation time because the rule *sortTFBSParallel* is not executed. Note that sorting is assumed to be according to the chromosome (first column) and start position (second column), essentially invoking *sort -k1,1 -k2,2n*.  If set to false, all files in ``dir_TFBS`` (:ref:`parameter_dir_TFBS`) will be sorted.
-
-
 .. _parameter_regionExtension:


@@ -444,7 +432,6 @@ Details

  However, you may also manually create these files to include additional TF of your choice or to be more or less stringent with the predicted TFBS. For this, you only need PWMs for the TF of interest and then a motif prediction tool such as *FIMO* or *MOODS*.

-  Also see the parameter ``dir_TFBS_sorted`` (:ref:`parameter_dir_TFBS_sorted`) to specify whether the files are already sorted or not.

 .. _parameter_RNASeqCounts:

@@ -594,7 +581,7 @@ Summary
  Only present if no consensus peak file was provided (``consensusPeaks``, :ref:`parameter_consensusPeaks`). Produced in rule ``filterSexChromosomesAndSortPeaks``. Generated consensus peaks, before filtering (see below).

 Details
-  Filtered consensus peaks (removal of peaks from one of the following chromosomes: chrX, chrY, chrM, chrUn\*, \*random*, \*hap|_gl\*
+  Filtered consensus peaks (removal of peaks from one of the following chromosomes: chrX, chrY, chrM, chrUn\*, and all contig names that do not start with "chr" such as \*random* or \*hap|_gl\*).


 FILE ``{comparisonType}.allBams.peaks.overlaps.bed.gz``
@@ -803,7 +790,7 @@ Stores results related to the user-specified extension size (``regionExtension``
 - ``conditionComparison.rds``: Produced in rule ``DiffPeaks``. Stores the condition comparison as a string. Some steps in diffTF need this file as input.
 - ``{comparisonType}.motifs.coord.permutation{perm}.bed.gz`` and ``{comparisonType}.motifs.coord.nucContent.permutation{perm}.bed.gz`` for each permutation ``{perm}``: Produced in rule ``calcNucleotideContent``, and needed subsequently for the binning. Temporary and result file of *bedtools nuc*, respectively. The latter contains the GC content for all TFBS.
 - ``{comparisonType}.checkParameterValidity.done``: temporary flag file
- ``{TF}_TFBS.sorted.bed`` for each TF ``{TF}``: Produced in rule ``sortTFBSParallel``. Coordinate-sorted version of the input TFBS.
+- ``{TF}_TFBS.sorted.bed`` for each TF ``{TF}``: Produced in rule ``sortTFBSParallel``. Coordinate-sorted version of the input TFBS. Only "regular" chromosomes starting with "chr" are kept, while sex chromosomes (chrX, chrY), chrM and unassembled contigs such as chrUn are additionally removed.
 - ``{comparisonType}.allTFBS.peaks.bed.gz``: Produced in rule ``intersectPeaksAndTFBS``. *BED* file containing all TFBS from all TF that overlap with the peaks before motif extension

 .. _workingWithPipeline:

--- a/docs/projectInfo.rst
+++ b/docs/projectInfo.rst
@@ -53,6 +53,9 @@ We also put the paper on *bioRxiv*, please read all methodological details here:

 Change log
 ============================
+Version 1.2.5 (2019-03-13)
+  - Updated the TFBS_hg38_FIMO_HOCOMOCOv11 archive one more time to exclude non-assembled contigs such as HLA*. To make the pipeline more stable for such edge cases, the parameter ``dir_TFBS_sorted`` has been removed, and sorting and filtering of chromosomes is now always performed. In both the consensus peak files and the TFBS bed files, only chromosomes that start with ``chr`` and are neither sex chromosomes (``chrX`` or ``chrY``) nor ``chrM`` are kept. If you want to keep sex chromosomes in your analysis (which we do not recommend), simply edit the Snakefile and remove the "chrX" and "chrY" occurrences in the two filtering rules.
+
 Version 1.2.4 (2019-03-04)
  - Fixed an issue with ``checkParameterValidity.R`` that caused an error message when loading TFBS files with a numeric score.  Thanks to Scott Berry for pointing it out.
  - Updated the TFBS_hg38_FIMO_HOCOMOCOv11 archive. The bed files are now properly pre-sorted
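
The filtering and sorting described in the 1.2.5 changelog entry can be illustrated outside the shell. The following is a minimal Python sketch, not the pipeline's implementation (the Snakefile uses `grep` and `sort`). The function name `filter_and_sort_bed`, the constant `EXCLUDED_PREFIXES`, and the example records are made up for illustration. It assumes tab-separated BED lines with an integer start column and uses plain string comparison for chromosome names, which only approximates `sort -k1,1 -k2,2n` under a C locale.

```python
# Prefixes excluded by the second grep in the two filtering rules.
EXCLUDED_PREFIXES = ("chrX", "chrY", "chrM", "chrUn")


def filter_and_sort_bed(lines, excluded_prefixes=EXCLUDED_PREFIXES):
    """Keep records starting with "chr" that match no excluded prefix,
    then sort by chromosome name and numeric start position."""
    kept = []
    for line in lines:
        record = line.rstrip("\n")
        if not record.startswith("chr"):
            continue  # corresponds to: grep ^chr
        if record.startswith(tuple(excluded_prefixes)):
            continue  # corresponds to: grep -v "^chrX\|^chrY\|^chrM\|^chrUn"
        kept.append(record)

    def sort_key(record):
        fields = record.split("\t")
        return (fields[0], int(fields[1]))

    return sorted(kept, key=sort_key)


# Invented example records, not taken from any diffTF file.
example = [
    "chr2\t500\t600",
    "chrX\t100\t200",
    "HLA-A*01:01\t10\t20",
    "chr1\t900\t950",
    "chr1\t50\t80",
    "chrUn_gl000220\t5\t9",
    "chr10\t1\t5",
    "chr1_gl000191_random\t7\t12",
]
result = filter_and_sort_bed(example)
```

In this sketch, a record such as `chr1_gl000191_random` is retained because it starts with `chr` and matches none of the four excluded prefixes; the same appears to hold for the `grep` pattern shown in the Snakefile diff. Passing `excluded_prefixes=("chrM", "chrUn")` mirrors the changelog's suggestion for keeping sex chromosomes.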

--- a/src/Snakefile
+++ b/src/Snakefile
@@ -143,11 +143,14 @@ except KeyError:
  print("The parameter \"coresPerRule\" in section \"par_general\" has not been defined. Jobs/rules with multithreading support will use the default of 16 cores.")
  threadsMax = 16

-try:
-  TFBSSorted = config["par_general"]["dir_TFBS_sorted"]
-except KeyError:
-  print("The parameter \"dir_TFBS_sorted\" in section \"par_general\" has not been defined. Presorted BED files will be assumed.")
-  TFBSSorted = True
+# This has been disabled to catch reported edge cases from version 1.2.5 onwards
+# try:
+#   TFBSSorted = config["par_general"]["dir_TFBS_sorted"]
+# except KeyError:
+#   TFBSSorted = False
+
+# Changed from version 1.2.5 onwards
+TFBSSorted = False

 # Increase ulimit -n for analysis with high number of TF and/or input files. The standard value of 1024 may not be enough.
 ulimitMax = 4096
@@ -343,6 +346,8 @@ def retrieveConsensusPeakFile (par_consensusPeaks):
    else:
        return rules.checkParameterValidity.output.consPeaks

+# Sort the consensus peak file and only include "regular" chromosomes denoted with "chr"
+# that are neither sex chromosomes nor chrM
 rule filterSexChromosomesAndSortPeaks:
    input:
        consensusPeaks = retrieveConsensusPeakFile(config["peaks"]["consensusPeaks"])
@@ -352,13 +357,12 @@ rule filterSexChromosomesAndSortPeaks:
    threads: 1
    singularity: "shub://chrarnold/Singularity_images:difftf_conda"
    shell: """
-            grep -v "^chrX\|^chrY\|^chrM\|^chrUn\|random\|hap\|_gl" {input.consensusPeaks} | sort -k1,1 -k2,2n > {output.consensusPeaks_sorted}
+            grep ^chr {input.consensusPeaks} | grep -v "^chrX\|^chrY\|^chrM\|^chrUn"  | sort -k1,1 -k2,2n > {output.consensusPeaks_sorted}
           """


 overlapPattern = "overlaps.bed.gz"

-# This rule is only run when parameter TFBSSorted = False
 rule sortTFBSParallel:
    input:
        flag = ancient(rules.checkParameterValidity.output.flag),
@@ -369,7 +373,7 @@ rule sortTFBSParallel:
    threads: 1
    run:
        for fi,fo in zip(input.allBed, output.allBedSorted):
-          shell("sort -k1,1 -k2,2n {fi}  > {fo}")
+          shell("grep ^chr {fi} | grep -v \"^chrX\|^chrY\|^chrM\|^chrUn\" | sort -k1,1 -k2,2n > {fo}")


 def getBamFileFromBasename(basename):
@@ -441,6 +445,7 @@ rule intersectPeaksAndBAM:

 # TF-specific part:

+# TFBSSorted is currently set to False in all cases to avoid potential issues with unassembled contigs
 def retrieveInputFilesTFBS (TFBSSorted):

    if not TFBSSorted:
@@ -457,7 +462,7 @@ rule intersectPeaksAndTFBS:
        TFBSinPeaksMod_bed = expand('{dir}/{compType}allTFBS.peaks.extension.bed.gz', dir = TEMP_EXTENSION_DIR, compType = compType)
    message: "{ruleDisplayMessage} Obtain binding sites from peaks: Intersect all TFBS files and {input.consensusPeaks}..."
    threads: 1
-    singularity: "shub://chrarnold/Singularity_images:difftf_conda"
+    #singularity: "shub://chrarnold/Singularity_images:difftf_conda"
    params:
        extension = config["par_general"]["regionExtension"],
        ulimitMax = ulimitMax