Skip to content
Snippets Groups Projects
Commit 1abe6265 authored by Christian Arnold's avatar Christian Arnold
Browse files

Version 1.1.5, see Changelog

parent a43fd127
No related branches found
No related tags found
No related merge requests found
1.1.4
1.1.5
......@@ -987,7 +987,10 @@ We here provide a list of some of the errors that can happen and that users repo
Segmentation fault
...
This unfortunate message points to a problem with your R and R libraries installation and has per se nothing to do with *diffTF*. At least one of the installed libraries has an issue. We advise to reinstall *Bioconductor* in such a case, and ask someone who is experienced with this to help you. Unfortunately, this issue is so general that we cannot provide any specific solutions as this type of error is very general. To troubleshoot and identify exactly which library or function causes this, you may run the R script that failed in debug mode and go through it line by line. See the next section for more details.
.. note:: This particular message may also be related to an incompatibility of the *DiffBind* and *DESeq2* libraries. See the changelog for details, as this has been addressed in version 1.1.5.
More generally, however, such messages point to a problem with your R and R libraries installation and have per se nothing to do with *diffTF*. In such cases, we advise to reinstall the latest version of *Bioconductor* and ask someone who is experienced with this to help you. Unfortunately, this issue is so general that we cannot provide any specific solutions. To troubleshoot and identify exactly which library or function causes this, you may run the R script that failed in debug mode and go through it line by line. See the next section for more details.
Fixing the error
......
......@@ -31,14 +31,21 @@ We also put the paper on *bioRxiv*, please read all methodological details here:
Change log
============================
Version 1.1.5 (2018-08-14)
- optimized ``checkParameterValidity.R`` script, only TFBS files for TFs included in the analysis are now checked
- addressed an R library compatibility issue independent of *diffTF* that users reported. In some cases, for particular versions of R and Bioconductor, R exited with a *segfault* (memory not mapped) error in the ``checkParameterValidity.R`` that seems to be caused by the combination of *DiffBind* and *DESeq2*. Specifically, when *DiffBind* is loaded *before* *DESeq2*, R crashes with a segmentation fault upon exiting, whereas loading *DiffBind* *after* *DESeq2* causes no issue. If there are further issues, please let us know. Thanks to Gyan Prakash Mishra, who first reported this.
- fixed an issue when the number of peaks is very small so that some TFs have no overlapping TFBS at all in the peak regions. This caused the rule ``intersectTFBSAndBAM`` to exit with an error due to grep's policy of returning exit code 1 if no matches are returned (thanks to Jonas Ungerbeck, again).
- removed the ``--timestamp`` option in the helper script ``startAnalysis.sh`` because this option has been removed for Snakemake >5.2.1
- Documentation updates
Version 1.1.4 (2018-08-09)
- minor, updated the checkParameterValidity.R script and the documentation (one package was not mentioned)
- minor, updated the ``checkParameterValidity.R`` script and the documentation (one package was not mentioned)
Version 1.1.3 (2018-08-06)
- minor, fixed a small issue in the Volcano plot (legends wrong and background color in the plot was not colored properly)
Version 1.1.2 (2018-08-03)
- fixed a bug that made the ``3.analyzeTF`` script fail in case when the number of permutations has been changed throughout the analysis or when the value is higher than the actual maximum number (thanks to Jonas Ungerbeck)
- fixed a bug that made the ``3.analyzeTF.R`` script fail in case when the number of permutations has been changed throughout the analysis or when the value is higher than the actual maximum number (thanks to Jonas Ungerbeck)
Version 1.1.1 (2018-08-01)
- Documentation updates (referenced the bioRxiv paper, extended the section about errors)
......
......@@ -8,7 +8,7 @@
"conditionComparison": "GMP,MPP",
"designContrast": "~ conditionSummary",
"designVariableTypes": "conditionSummary:factor",
"nPermutations": 70,
"nPermutations": 100,
"nBootstraps": 0,
"nCGBins": 10,
"TFs": "CTCF,CEBPB,SNAI2,CEBPA,UBIP1,CEBPG,CEBPD,ZFX,AP2D,PAX5.S,SNAI1,ZEB1,SP4,MBD2,IRF1,MECP2,PAX5.D,SP3,NFIA.C,SP1.A,IRF7,MYF6, NRF1,DBP,MAZ,NKX28,DLX2,GATA1,P53,ZN143,AIRE,NR2C2,HMGA1,FUBP1,TEAD3,OVOL1,HXD4,KLF1,RXRG,HNF1B,ZIC3,HNF1A,NANOG.S,GFI1,PO3F1,NR2C1,ELF5,TF65.C,NFAC3,TEAD1",
......
......@@ -17,4 +17,4 @@ echo "# INCREASING THE NUMBER OF CORES SPEEDS UP THE ANALYSIS #"
echo "#########################################################"
# Real run, using 2 cores
snakemake --snakefile ../../src/Snakefile --cores 2 --configfile config.json --timestamp
snakemake --snakefile ../../src/Snakefile --cores 2 --configfile config.json
......@@ -181,21 +181,29 @@ if (nrow(problems(overlapsAll.df)) > 0) {
stop("Error when parsing the file ", fileCur, ", see errors above")
}
colnames(overlapsAll.df) = c("annotation", "chr","MSS","MES", "strand","length", colnamesNew)
overlapsAll.df = overlapsAll.df %>%
dplyr::mutate(TFBSID = paste0(chr,":", MSS, "-",MES),
mean = apply(dplyr::select(overlapsAll.df, one_of(colnamesNew)), 1, mean),
peakID = sapply(strsplit(overlapsAll.df$annotation, split = "_", fixed = TRUE),"[[", 1)) %>%
dplyr::distinct(TFBSID, .keep_all = TRUE) %>%
dplyr::select(-one_of("length"))
if (nrow(overlapsAll.df) > 0) {
colnames(overlapsAll.df) = c("annotation", "chr","MSS","MES", "strand","length", colnamesNew)
overlapsAll.df = overlapsAll.df %>%
dplyr::mutate(TFBSID = paste0(chr,":", MSS, "-",MES),
mean = apply(dplyr::select(overlapsAll.df, one_of(colnamesNew)), 1, mean),
peakID = sapply(strsplit(overlapsAll.df$annotation, split = "_", fixed = TRUE),"[[", 1)) %>%
dplyr::distinct(TFBSID, .keep_all = TRUE) %>%
dplyr::select(-one_of("length"))
skipTF = FALSE
} else {
skipTF = TRUE
}
nTFBS = nrow(overlapsAll.df)
skipTF = FALSE
......
......@@ -282,6 +282,9 @@ for (fileCur in par.l$files_input_TF_allMotives) {
flog.info(paste0(" Found ", nrow(TF.motifs.all) - nrow(TF.motifs.all.unique), " duplicated TFBS across all TF."))
# TODO: Optimize as in dev TF.motifs.all.unique = TF.motifs.all.unique[which(TF.motifs.all.unique$TF != TFCur & is.finite(TF.motifs.all.unique$log2FoldChange)),]
nRowsTF = nrow(TF.motifs.all[which(TF.motifs.all$TF == TFCur),])
......
start.time <- Sys.time()
#########################
# LIBRARY AND FUNCTIONS #
#########################
......@@ -55,6 +54,8 @@ fastaFile = snakemake@config$additionalInputFiles$refGenome_fasta
assertFileExists(fastaFile)
assertDirectoryExists(dirname(fastaFile), access = "w")
allTFs = strsplit(snakemake@config$par_general$TFs, ",")[[1]]
par.l$nPermutations = snakemake@config$par_general$nPermutations
assertIntegerish(par.l$nPermutations, lower = 0, len = 1)
......@@ -62,6 +63,9 @@ assertIntegerish(par.l$nPermutations, lower = 0, len = 1)
par.l$nBootstraps = as.integer(snakemake@config$par_general$nBootstraps)
assertIntegerish(par.l$nBootstraps, len = 1)
par.l$designFormula = snakemake@config$par_general$designContrast
checkAndLogWarningsAndErrors(par.l$designFormula, checkCharacter(par.l$designFormula, len = 1, min.chars = 3))
par.l$file_input_sampleData = snakemake@config$samples$summaryFile
checkAndLogWarningsAndErrors(par.l$file_input_sampleData, checkFileExists(par.l$file_input_sampleData, access = "r"))
......@@ -86,23 +90,24 @@ printParametersLog(par.l)
# LOAD ALL LIBRARIES #
######################
# createConsensusPeaks.R
checkAndLoadPackages(c("tidyverse", "futile.logger", "DiffBind", "checkmate", "stats"), verbose = FALSE)
# Step 1
# TODO: First loading DESeq2 before DiffBind seems to prevent the segfault
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "DiffBind", "checkmate", "stats"), verbose = FALSE)
# diffPeaks.R
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "csaw", "checkmate", "limma", "tools"), verbose = FALSE)
# Step 2
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "csaw", "checkmate", "limma", "tools", "geneplotter", "RColorBrewer"), verbose = FALSE)
# analyzeTF.R
# Step 3
checkAndLoadPackages(c("tidyverse", "futile.logger", "DESeq2", "vsn", "modeest", "checkmate", "limma", "geneplotter", "RColorBrewer", "tools"), verbose = FALSE)
# summary1.R
# Step 4
checkAndLoadPackages(c("tidyverse", "futile.logger", "modeest", "checkmate", "ggrepel"), verbose = FALSE)
# binningTF.R
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "boot"), verbose = FALSE)
# Step 5
checkAndLoadPackages(c("tidyverse", "futile.logger", "checkmate", "tools", "methods", "boot"), verbose = FALSE)
# summaryFinal.R
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "DESeq2", "matrixStats", "ggrepel", "checkmate", "tools", "grDevices", "locfdr", "pheatmap"), verbose = FALSE)
# Step 6
checkAndLoadPackages(c("tidyverse", "futile.logger", "lsr", "ggrepel", "checkmate", "tools", "methods", "grDevices", "pheatmap"), verbose = TRUE)
......@@ -127,23 +132,40 @@ sampleData.df = read_tsv(par.l$file_input_sampleData, col_names = TRUE, col_type
# Check the sample table
nDistValues = length(unique(sampleData.df$conditionSummary))
if (nDistValues != 2) {
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
message = paste0("The column 'conditionSummary' must contain exactly 2 different values, but ", nDistValues, " were found.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
if (!testSubset(unique(sampleData.df$conditionSummary), conditionComparison)) {
message = paste0("The elements specified in 'conditionComparison' in the config file must be a subset of the values in the column 'conditionSummary' in the sample file")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
message = paste0("The elements specified in 'conditionComparison' in the config file must be a subset of the values in the column 'conditionSummary' in the sample file")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
# Check the design formula
designFormula = convertToFormula(par.l$designFormula, colnames(sampleData.df))
designMatrix = model.matrix(designFormula, data = sampleData.df)
if (nrow(designMatrix) < nrow(sampleData.df)) {
missingRows = setdiff(1:nrow(sampleData.df), as.integer(row.names(designMatrix)))
message = paste0("There is a problem with the specified design formula (parameter designContrast): The corresponding design matrix has fewer rows. This usually means that there are missing values in one of the specified variables. The problem comes from the following lines in the summary file: ", paste0(missingRows, collapse = ","), ".")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
dummyMatrix = matrix(nrow = 100, ncol = nrow(sampleData.df), sample(1:100, 100 * nrow(sampleData.df), replace = TRUE))
TF.DESeq.obj <- DESeqDataSetFromMatrix(countData = dummyMatrix,
colData = sampleData.df,
design = designFormula)
# Build the fasta index. Requires write access to the folder where the fasta is stored (limitation of samtools faidx)
fastaIndex = paste0(fastaFile, ".fai")
if (!file.exists(fastaIndex)) {
flog.info(paste0("Running samtools faidx for fasta file to generate fasta index"))
indexFa(fastaFile)
flog.info(paste0("Running samtools faidx for fasta file to generate fasta index"))
indexFa(fastaFile)
} else {
flog.info(paste0("Fasta index already found"))
flog.info(paste0("Fasta index already found"))
}
indexes.df = as.data.frame(scanFaIndex(fastaFile))
......@@ -151,34 +173,34 @@ indexes.df$seqnames = as.character(indexes.df$seqnames)
# Check individually for each BAM file
for (bamCur in sampleData.df$bamReads) {
bamHeader = scanBamHeader(bamCur)
chrBAM.df = as.data.frame(bamHeader[[1]]$targets)
chrBAM.df$seqnames = rownames(chrBAM.df)
colnames(chrBAM.df) = c("width", "seqnames")
merged.df = full_join(chrBAM.df, indexes.df, by = "seqnames")
# Filter chromosomes and retain only those we keep in the pipeline
discardMatches = grepl("^chrX|^chrY|^chrM|^chrUn|random|hap|_gl", merged.df$seqnames, perl = TRUE)
if (length(discardMatches) > 0) {
chrNamesDiscarded = paste0(merged.df$seqnames[discardMatches], collapse = " ", sep = "\n")
flog.info(paste0("Discard the following chromosomes for compatibility comparisons because they are filtered in later steps anyway:\n ", chrNamesDiscarded))
merged.df = merged.df[!discardMatches,]
}
mismatches = merged.df$seqnames[which(merged.df$width.x != merged.df$width.y)]
if (length(mismatches) > 0) {
chrStr = paste0(mismatches, collapse = ",")
message = paste0("Chromosome lengths differ between the fasta file ", fastaFile, " and the BAM file ", bamCur, " for chromosome(s) ", chrStr)
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
} else {
flog.info(paste0("Reference fasta file ", fastaFile, " and BAM file ", bamCur, " are fully compatible."))
}
bamHeader = scanBamHeader(bamCur)
chrBAM.df = as.data.frame(bamHeader[[1]]$targets)
chrBAM.df$seqnames = rownames(chrBAM.df)
colnames(chrBAM.df) = c("width", "seqnames")
merged.df = full_join(chrBAM.df, indexes.df, by = "seqnames")
# Filter chromosomes and retain only those we keep in the pipeline
discardMatches = grepl("^chrX|^chrY|^chrM|^chrUn|random|hap|_gl", merged.df$seqnames, perl = TRUE)
if (length(discardMatches) > 0) {
chrNamesDiscarded = paste0(merged.df$seqnames[discardMatches], collapse = " ", sep = "\n")
flog.info(paste0("Discard the following chromosomes for compatibility comparisons because they are filtered in later steps anyway:\n ", chrNamesDiscarded))
merged.df = merged.df[!discardMatches,]
}
mismatches = merged.df$seqnames[which(merged.df$width.x != merged.df$width.y)]
if (length(mismatches) > 0) {
chrStr = paste0(mismatches, collapse = ",")
message = paste0("Chromosome lengths differ between the fasta file ", fastaFile, " and the BAM file ", bamCur, " for chromosome(s) ", chrStr)
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
} else {
flog.info(paste0("Reference fasta file ", fastaFile, " and BAM file ", bamCur, " are fully compatible."))
}
}
flog.info(paste0("Check peak files..."))
......@@ -199,13 +221,13 @@ if (file_peaks != "") {
flog.info(paste0("Peak file contains ", nrow(peaks.df), " peaks."))
if (nrow(peaks.df) > 100000) {
message = paste0("The number of peaks is very high, subsequent steps may be slow, particularly in the prepareBinning and binningTF steps. Make sure the preparingBinning step has enough memory available. We recommend at least 50 GB. Alternatively, consider decreasing the number of peaks for improved performance.")
message = paste0("The number of peaks is very high, subsequent steps may be slow, particularly in the prepareBinning and binningTF steps. Make sure the preparingBinning step has enough memory available. We recommend at least 50 GB. Alternatively, consider decreasing the number of peaks for improved performance.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
if (snakemake@config$par_general$nPermutations > 5) {
message = paste0("In addition to the high number of peaks, more than 5 permutations have been selected, which will further increase running times and memory footprint. Consider decreasing the number of peaks for improved performance.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
if (snakemake@config$par_general$nPermutations > 5) {
message = paste0("In addition to the high number of peaks, more than 5 permutations have been selected, which will further increase running times and memory footprint. Consider decreasing the number of peaks for improved performance.")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
}
}
}
if (ncol(peaks.df) < 3) {
......@@ -221,21 +243,21 @@ if (file_peaks != "") {
assertIntegerish(peaks.df$X3 - peaks.df$X2, lower = 0)
if (ncol(peaks.df) == 3) {
peaks.df$X4 = "TODO"
peaks.df$X5 = 0
peaks.df$X4 = "TODO"
peaks.df$X5 = 0
} else if (ncol(peaks.df) == 4) {
peaks.df$X5 = 0
peaks.df$X5 = 0
} else if (ncol(peaks.df) == 5) {
} else {
message = paste0("More than 5 columns found. Only the first 5 are needed, the remaining ones will be dropped for subsequent analysis...")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
peaks.df = peaks.df[,1:5]
message = paste0("More than 5 columns found. Only the first 5 are needed, the remaining ones will be dropped for subsequent analysis...")
checkAndLogWarningsAndErrors(NULL, message, isWarning = TRUE)
peaks.df = peaks.df[,1:5]
}
# Overwrite original annotation, we use our own from here on
peaks.df$X4 = paste0(peaks.df$X1, ":", peaks.df$X2, "-", peaks.df$X3)
colnamesFinal = c("chr", "start", "end", "annotation", "score")
colnames(peaks.df) = colnamesFinal
......@@ -243,7 +265,7 @@ if (file_peaks != "") {
rowsBefore = nrow(peaks.df)
peaks.df = dplyr::distinct(peaks.df, annotation, .keep_all = TRUE)
nRowsFiltered = rowsBefore - nrow(peaks.df)
if (par.l$verbose & nRowsFiltered > 0) flog.info(paste0("Filtered ", nRowsFiltered, " non-unique positions out of ", rowsBefore, " from peaks."))
......@@ -257,16 +279,38 @@ if (file_peaks != "") {
flog.info(paste0("Check TF-specific TFBS files..."))
TFs = createFileList(TFBS_dir, "_TFBS.bed", verbose = FALSE)
useAllTFs = FALSE
if (length(allTFs) == 1)
if (allTFs == "all") {
useAllTFs = TRUE
}
if (length(TFs) == 0) {
message = paste0("No files with pattern _TFBS.bed found in directory ", TFBS_dir, " as specified by the parameter \"dir_TFBS\"")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
if (useAllTFs) {
TFs = createFileList(TFBS_dir, "_TFBS.bed", verbose = FALSE)
if (length(TFs) == 0) {
message = paste0("No files with pattern _TFBS.bed found in directory ", TFBS_dir, " as specified by the parameter \"dir_TFBS\"")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
allTFs = sapply(strsplit(basename(TFs), "_"), "[[", 1)
stopifnot(length(allTFs > 0))
}
for (TFCur in TFs) {
for (TFCur in allTFs) {
tableCur.df = read_tsv(TFCur, col_names = FALSE, col_types = cols())
TFCur = gsub(pattern = " ",replacement = "",TFCur)
flog.info(paste0(" Checking TF ", TFCur, "..."))
fileCur = paste0(TFBS_dir, "/", TFCur, "_TFBS.bed")
if (!file.exists(fileCur)) {
message = paste0("File ", fileCur, " does not exist even though the TF ", TFCur, " has been specified")
checkAndLogWarningsAndErrors(NULL, message, isWarning = FALSE)
}
tableCur.df = read_tsv(fileCur, col_names = FALSE, col_types = cols())
if (nrow(problems(tableCur.df)) > 0) {
flog.fatal(paste0("Parsing errors: "), problems(tableCur.df), capture = TRUE)
stop("Parsing errors for file ", TFCur, ". See the log file for more information")
......@@ -280,8 +324,11 @@ for (TFCur in TFs) {
}
flog.info("Found ", length(TFs), " TF PWM bed files.")
flog.info(paste0("Done checking."))
# Write to dummy file
cat("DONE", file = par.l$output_flag)
.printExecutionTime(start.time)
flog.info("Session info: ", sessionInfo(), capture = TRUE)
\ No newline at end of file
......@@ -471,7 +471,6 @@ rule intersectPeaksAndTFBS:
"""
rule intersectTFBSAndBAM:
input:
bed = rules.intersectPeaksAndTFBS.output.TFBSinPeaksMod_bed,
......@@ -482,7 +481,7 @@ rule intersectTFBSAndBAM:
BAMOverlap = TF_DIR + "/{TF}/" + extDir + "/" + compType + "{TF}.allBAMs.overlaps.bed.gz",
saf = temp(expand('{dir}/{compType}{{TF}}.allTFBS.peaks.extension.saf', dir = TEMP_EXTENSION_DIR, compType = compType))
log:
message: "{ruleDisplayMessage} Intersect file {input.bed} against all BAM files..."
message: "{ruleDisplayMessage} Intersect file {input.bed} against all BAM files for TF {wildcards.TF}..."
threads: 4
params:
pairedEnd = pairedEndOptions,
......@@ -491,20 +490,27 @@ rule intersectTFBSAndBAM:
ulimitMax = ulimitMax
shell:
""" ulimit -n {params.ulimitMax} &&
zgrep "{wildcards.TF}_TFBS\." {input.bed} | awk 'BEGIN {{ OFS = "\\t" }} {{print $4"_"$2"-"$3,$1,$2,$3,$6}}' | sort -u -k1,1 >{output.saf} &&
featureCounts \
-F SAF \
-T {threads} \
{params.readFiltering} \
{params.pairedEnd} \
-a {output.saf} \
-s 0 \
{params.multiOverlap} \
-o {output.BAMOverlapRaw} \
{input.allBAMs} &&
zgrep "{wildcards.TF}_TFBS\." {input.bed} | awk 'BEGIN {{ OFS = "\\t" }} {{print $4"_"$2"-"$3,$1,$2,$3,$6}}' | sort -u -k1,1 >{output.saf} || true &&
if [[ $(wc -l <{output.saf}) -eq "0" ]]; then
touch {output.BAMOverlapRaw}
echo "No TFBS found, skip featureCounts..."
else
featureCounts \
-F SAF \
-T {threads} \
{params.readFiltering} \
{params.pairedEnd} \
-a {output.saf} \
-s 0 \
{params.multiOverlap} \
-o {output.BAMOverlapRaw} \
{input.allBAMs}
fi &&
gzip -f < {output.BAMOverlapRaw} > {output.BAMOverlap}
"""
name_plots = PEAKS_DIR + "/" + compType + "diagnosticPlots.peaks.pdf"
rule DiffPeaks:
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment