Workflow¶
We put the paper on bioRxiv, please read all methodological details here: Quantification of differential transcription factor activity and multiomic-based classification into activators and repressors: diffTF.
The workflow and conceptual idea behind diffTF is illustrated by the following three Figures. First, we give a high-level conceptual overview and a biological motivation:
We now show which rules are executed by Snakemake for a specific example (see the caption of the image):
diffTF is implemented as a Snakemake pipeline. For a gentle introduction about Snakemake, see Section Working with diffTF and FAQs. As you can see, the workflow consists of the following steps or rules:
checkParameterValidity
: R script that checks whether the specified peak file has the correct format, whether the provided fasta file and the BAM files are compatible, and other checksproduceConsensusPeaks
: R script that generates the consensus peaks if none are providedfilterSexChromosomesAndSortPeaks
: Filters various chromosomes 8sex, unassembled ones, contigs, etc) from the peak file.sortTFBSParallel
: Sort the TFBS lists by positionresortBAM
: Sort the BAM file for optimized processing (only run if data are paired-end)intersectPeaksAndBAM
: Count all reads for peak regions across all input filesintersectPeaksAndTFBS
: Intersect all TFBS with peak regions to retain only TFBS in peak regionsintersectTFBSAndBAM
: Count all reads from all TFBS across all input files in a TF-specific mannerDiffPeaks
: R script that performs a differential accessibility analysis for the peak regions as well as sample permutationsanalyzeTF
: R script that performs a TF-specific differential accessibility analysissummary1
: R script that summarizes the previous script for all TFsconcatenateMotifs
andconcatenateMotifsPerm
: Concatenates previous results from either real or permuted data (TFBS motives)calcNucleotideContent
: Calculates the GC content for all TFBSbinningTF
: R script that performs the binning approach in a TF-specific mannersummaryFinal
: R script that summarizes the analysis and calculates final statisticscleanUpLogFiles
: Cleans up theLOGS_AND_BENCHMARKS
directory (mostly relevant if run in cluster mode)
Input¶
Summary¶
As input for diffTF for your own analysis, the following data are needed:
- BAM file with aligned reads for each sample (see PARAMETER summaryFile)
- genome reference fasta that has been used to produce the BAM files (see PARAMETER refGenome_fasta)
- Optionally: corresponding RNA-Seq data (see PARAMETER RNASeqCounts)
In addition, the following files are need, all of which we provide already for human hg19, hg38 and mouse mm10:
- TF-specific list of TFBS (see PARAMETER dir_TFBS)
- mapping table (see PARAMETER HOCOMOCO_mapping)
General configuration file¶
To run the pipeline, a configuration file that defines various parameters of the pipeline is required.
Note
Please note the following important points:
- the name of this file is irrelevant, but it must be in the right format (JSON) and it must be referenced correctly when calling Snakemake (via the
--configfile
parameter). We recommend naming itconfig.json
- neither section nor parameter names must be changed.
- For parameters that specify a path, both absolute and relative paths are possible. We recommend specifying an absolute path. Relative paths must be specified relative to the Snakemake working directory.
- For parameters that specify a directory, there should be no trailing slash.
In the following, we explain all parameters in detail, organized by section names.
SECTION par_general
¶
PARAMETER outdir
¶
- Summary
- String. Default “output”. Root output directory.
- Details
- The root output directory where all output is stored.
PARAMETER regionExtension
¶
- Summary
- Integer >= 0. Default 100. Target region extension in base pairs.
- Details
- Specifies the number of base pairs each target region (from the peaks file) should be extended in both 5’ and 3’ direction.
PARAMETER maxCoresPerRule
¶
- Summary
- Integer > 0. Default 16. Maximum number of cores to use for rules that support multithreading.
- Details
- This affects currently only rules involving featureCounts - that is, intersectPeaksAndBAM while for rule intersectTFBSAndBAM, the number of cores is hard-coded to 4. When running Snakemake locally, each rule will use at most this number of cores, while in a cluster setting, this value refers to the maximum number of CPUs an individual job / rule will occupy. If the node the job is executed on has fewer nodes, then the maximum number of cores on the node will be taken.
PARAMETER dir_TFBS_sorted
¶
- Summary
- Logical. true or false. Default false. Are the files in
dir_TFBS
(PARAMETER dir_TFBS) already pre-sorted? - Details
- If set to true, no additional sorting will be done, saving computation time because the rule sortTFBSParallel is not executed. Note that sorting is assumed to be according to the chromosome (first column) and start position (second column), essentially invoking sort -k1,1 -k2,2n. If set to false, all files in
dir_TFBS
(PARAMETER dir_TFBS) will be sorted.
PARAMETER comparisonType
¶
- Summary
- String. Default “”.
- Details
- This parameter helps to organize complex analysis for which multiple different types of comparisons should be done. Set it to a short but descriptive name that summarizes the type of comparison you are making or the types of cells you compare. The value of this parameter appears as prefix in most output files created by the pipeline. It may also be empty.
PARAMETER conditionComparison
¶
- Summary
- String. Default “”. Specifies the two conditions you want to compare.
- Details
This parameter specifies the contrast you are making in diffTF. Two conditions have to be specified, separated by a comma. For example, if you want to compare GMP and MPP samples, the parameter should be “GMP,MPP”. Both conditions have to be present in the column “conditionSummary” in the sample file table (see parameter
summaryFile
(PARAMETER summaryFile)).Note
The order of the two conditions matters. The condition specified first is the reference condition. For the “GMP,MPP” example, all log2 fold-changes will be the log2fc of MPP as compared to GMP. That means that a positive log2 fold-change means it is higher in MPP as compared to GMP. This is particularly relevant for the allMotifs output file.
PARAMETER designContrast
¶
- Summary
- String. Default conditionSummary. Design formula for the differential accessibility analysis.
- Details
- This important parameter defines the actual contrast that is done in the differential analysis. That is, which groups of samples are being compared? Examples include mutant vs wild type, mutated vs. unmutated, etc. The last element in the formula must always be conditionSummary, which defines the two groups that are being compared. This name is currently hard-coded and required by the pipeline. Our pipeline allows including additional variables to model potential confounding variables, like gender, batches etc. For each additional variable that is part of the formula, a corresponding and identically named column in the sample summary file must be specified. For example, for an analysis that also includes the batch number of the samples, you may specify this as “~ Treatment + conditionSummary”.
PARAMETER designVariableTypes
¶
- Summary
- String. Default conditionSummary:factor. The data types of all elements listed in
designContrast
(PARAMETER designContrast). - Details
Names must be separated by commas, spaces are allowed and will be eliminated automatically. The data type must be specified with a “:”, followed by either “numeric”, “integer”, “logical”, or “factor”. For example, if
designContrast
(PARAMETER designContrast) is specified as “~ Treatment + conditionSummary”, the corresponding types might be “Treatment:factor, conditionSummary:factor”. If a data type is specified as either “logical” or “factor”, the variable will be treated as a discrete variable with a finite number of distinct possibilities (something like batch, for example). conditionSummary is usually specified as factor because you want to make a pairwise comparison of exactly two conditions. If conditionSummary is specified as “integer” or “numeric”, however, the variable is treated as continuously-scaled, which changes the interpretation of the results, see the note below.Note
Importantly, if the variable of interest is continuous-valued (i.e., marked as being integer or numeric), then the reported log2 fold change is per unit of change of that variable. That is, in the final circular plot, TFs displayed in the left side have a negative slope per unit of change of that variable, while TFs at the right side have a positive one.
PARAMETER nPermutations
¶
- Summary
- Integer >= 0. Default 50. The number of random sample permutations.
- Details
If set to a value > 0, in addition to the real data, the sample conditions as specified in the sample table will be randomly permuted nPermutations times. This is the recommended way of computing statistical significances for each TF. Note that the maximum number of possible permutations is limited by the number of samples and can be computed with the binomial coefficient n over k. For example, if you have n = 8 samples in total and they split up in the two conditions/groups as k = 5 / k = 3, the total number of permutations is 8 over 5 or 8 over 3 (they are both identical). We generally recommend setting this value to high values such as 1,000. If the value is set to a number higher than the number of possible permutations, it will be adjusted automatically to the maximum number of permutations as determined by the binomial coefficient.
If set to 0, an alternative way of computing significances that is not based on permutations is performed. First, in the CG normalization step, a Welch Two Sample t-test is performed for each bin and the overall significance by treating the T-statistics as z-scores is calculated, which allows to summarize them across the bins and convert them to one p-value per TF. For this conversion of z-scores per bin to p-value an estimate of the variance of the T-scores is approximated (see the publication for details). This procedure reduces the dependency of the p-value on the sample size (since the number of TFBS can range between a few dozen and multiple tens of thousands depending on the TF).
Note
If set to a value > 0, the parameter
nBootstraps
(PARAMETER nBootstraps) is ignored.Note
While using permutations is the recommended approach for assessing statistical significance, in some cases it might be useful to use the alternative approach: If the number of samples is small or the groups show a very uneven distributions, the number of possible permutations is very small and therefore also the permutation-based approach might not accurately assess significance.
Note
The running time of the pipeline increases with the number of permutations.
Warning
Do not change the value of this parameter after (parts of) the pipeline have been run, some steps may fail due to this change. If you really need to change the value, rerun the pipeline from the diffPeaks step onwards.
PARAMETER nBootstraps
¶
- Summary
- Integer >= 0. Default 1,000. The number of bootstrap for estimating the variance of the TF-specific T scores in the CG binning step.
- Details
To properly estimate the variance of the T scores for each TF in the CG binning step, we employ a bootstrap approach using the boot library in R with a user-adjustable number of bootstrap replicates (default 1,000), with resampling the bin-specific data and then performing the t-test against the full sample as described above. We then calculate the variance of the bootstrapped T scores for each bin. For more details, see the methods of the publication.
Note
Only relevant if the parameter
nPermutations
(PARAMETER nPermutations) is set to 0. If both are set to 0, an error is thrown.Warning
If bootstraps are used, it is highly recommended to use a large number of bootstraps. We recommend at least a value of 1,000.
PARAMETER TFs
¶
- Summary
- String. Default “all”. Either “all” or a comma-separated list of TF names of TFs to include. If set to “all”, all TFs that are found in the directory as specified in
dir_TFBS
(PARAMETER dir_TFBS) will be used. - Details
If the analysis should be restricted to a subset of TFs, list the names of the TF to include in a comma-separated manner here.
Note
For each TF
{TF}
, a corresponding file{TF}_TFBS.bed
needs to be present in the directory that is specified bydir_TFBS
(PARAMETER dir_TFBS).Warning
We strongly recommending running diffTF with as many TF as possible due to our statistical model that we use that compares against a background model.
PARAMETER dir_scripts
¶
- Summary
- String. The path to the directory where the R scripts for running the pipeline are stored.
- Details
Warning
The folder name must be
R
, and it has to be located in the same folder as theSnakefile
.
PARAMETER RNASeqIntegration
¶
- Summary
- Logical. true or false. Default false. Should RNA-Seq data be integrated into the pipeline?
- Details
- If set to true, RNA-Seq counts as specified in
RNASeqCounts
(PARAMETER RNASeqCounts) will be used to classify each TF into either “activator”, “repressor”, “unknown”, or “not-expressed” for the final circular visualization and the summary table.
SECTION samples
¶
PARAMETER summaryFile
¶
- Summary
- String. Default “samples.tsv”. Path to the sample metadata file.
- Details
- Path to a tab-separated file that summarizes the input data. See the section Input metadata and the example file for how this file should look like.
PARAMETER pairedEnd
¶
- Summary
- Logical. true or false. Default true. Is the data paired-end? If single-end, set to false.
- Details
- Both paired-end and single-end data can be run with diffTF.
SECTION peaks
¶
PARAMETER consensusPeaks
¶
- Summary
- String. Default “” (empty). Path to the consensus peak file.
- Details
If set to the empty string “”, the pipeline will generate a consensus peaks out of the peak files from each individual sample. For this, you need to provide the following two things:
- a peak file for each sample in the metadata file in the column peaks, see the section Input metadata for details.
- The format of the peak files, as specified in
peakType
(PARAMETER peakType)
If a file is provided, it must be a valid BED file with at least 3 columns:
- tab-separated columns
- no column names in the first row
- Columns 1 to 3:
- Chromosome
- Start position
- End position
- Optional:
- Identifier (will be made unique for each if this is not the case already)
- Score
- Strand
PARAMETER peakType
¶
- Summary
- String. Default
narrow
. File format of the individual, sample-specific peak files. Only relevant if no consensus peak file has been provided (i.e., the PARAMETER consensusPeaks is empty). - Details
Only needed if no consensus peak set has been provided. All individual peak files must be in the same format. We recommend the
narrow
format (files ending in.narrowPeak
) that is a direct output from MACS2, but other formats are supported. See the help for DiffBind dba for a full list of supported formats, the most common ones include:bed
: .bed file; peak score is in fifth columnnarrow
: narrowPeaks file (from MACS2)
PARAMETER minOverlap
¶
- Summary
- Integer >= 0 or Float between 0 and 1. Default 2. Minimum overlap for peak files for a peak to be considered into the consensus peak set. Corresponds to the
minOverlap
argument in the dba function of DiffBind. Only relevant if no consensus peak file has been provided (i.e.,consensusPeaks
, PARAMETER consensusPeaks, is empty). - Details
- Only include peaks in at least this many peak sets in the main binding matrix. If set to a value between zero and one, peak will be included from at least this proportion of peak sets. For more information, see the
minOverlap
argument in the dba function of DiffBind (see here).
SECTION additionalInputFiles
¶
PARAMETER refGenome_fasta
¶
- Summary
- String. Default ‘hg19.fasta’. Path to the reference genome fasta file.
Details
Warning
You need write access to the directory in which the fasta file is stored, make sure this is the case or copy the fasta file to a different directory. The reason is that the pipeline produces a fasta index file, which is put in the same directory as the corresponding fasta file. This is a limitation of samtools faidx and not our pipeline.
Note
This file has to be in concordance with the input data; that is, the exact same genome assembly version must be used. In the first step of the pipeline, this is checked explicitly, and any mismatches will result in an error.
PARAMETER dir_TFBS
¶
- Summary
- String. Path to the directory where the TF-specific files for TFBS results are stored.
- Details
Each TF {TF} has to have one BED file, in the format {TF}.bed. Each file must be a valid BED6 file with 6 columns, as follows:
- chromosome
- start
- end
- ID (or sequence)
- score or any other numeric column
- strand
For user convenience, we provide such sorted files as described in the publication as a separate download:
- hg19: For a pre-compiled list of 620 human TF with in-silico predicted TFBS based on the HOCOMOCO 10 database and PWMScan for hg19, download this file:
- hg38: For a pre-compiled list of 771 human TF with in-silico predicted TFBS based on the HOCOMOCO 11 database and FIMO from the MEME suite1 for hg38, download this file:
- mm10: For a pre-compiled list of 423 mouse TF with in-silico predicted TFBS based on the HOCOMOCO 10 database and PWMScan for mm10, download this file:
However, you may also manually create these files to include additional TF of your choice or to be more or less stringent with the predicted TFBS. For this, you only need PWMs for the TF of interest and then a motif prediction tool such as FIMO or MOODS.
Also see the parameter
dir_TFBS_sorted
(PARAMETER dir_TFBS_sorted) to specify whether the files are already sorted or not.
PARAMETER RNASeqCounts
¶
- Summary
- String. Default “”. Path to the file with RNA-Seq counts.
- Details
If no RNA-Seq data is included, set to the empty string “”. Otherwise, if
RNASeqIntegration
(PARAMETER RNASeqIntegration) is set to true, specify the path to a tab-separated file with normalized RNA-Seq counts. It does not matter whether the values have been variance-stabilized or not, as long as values across samples are comparable. Also, consider filtering lowly expressed genes. For guidance, you may want to read Question 4 here.The first line must be used for labeling the samples, with column names being identical to the sample names as specific in the sample summary table (
summaryFile
, PARAMETER summaryFile). If you have RNA-Seq data for only a subset of the input samples, this is no problem - the classification will then naturally only be based on the subset. The first column must be named ENSEMBL and it must contain ENSEMBL IDs (e.g., ENSG00000028277) without dots. The IDs are then matched to the IDs from the file as specified inHOCOMOCO_mapping
(PARAMETER HOCOMOCO_mapping).
PARAMETER HOCOMOCO_mapping
¶
- Summary
- String. Path to the TF-Gene translation table.
- Details
If RNA-Seq integration shall be used, a translation table to associate TFs and ENSEMBL genes is needed. For convenience, we provide such a translation table compatible with the pre-provided TFBS lists. Specifically, for each of the currently three TFBS lists, we provide corresponding translation tables for:
- hg19 with HOCOMOCO 10
- hg38 with HOCOMOCO 11
- mm10 with HOCOMOCO 10
If you want to create your own version, check the example translation tables and construct one with an identical structure.
Input metadata¶
This file summarizes the data and corresponding available metadata that should be used for the analysis. The format is flexible and may contain additional columns that are ignored by the pipeline, so it can be used to capture all available information in a single place. Importantly, the file must be saved as tab-separated, the exact name does not matter as long as it is correctly specified in the configuration file.
Warning
Make sure that the line endings are correct. Different operating systems use different characters to mark the end of line, and the line ending character must be compatible with the operationg system in which you run diffTF. For example, if you created the file in MAC, but you run it in a Linux environment (e.g., a cluster system), you may have to convert line endings to make them compatible with Linux. For more information, see here .
It must contain at least contain the following columns (the exact names do matter):
sampleID
: The ID of the samplebamReads
: path to the BAM file corresponding to the sample.Warning
All BAM files must meet SAM format specifications. You may use the program ValidateSamFile from the Picard tools to check and identify problems with your file. Chromosome names must have a “chr” as prefix, otherwise diffTF may crash.
peaks
: absolute path to the sample-specific peak file, in the format as given bypeakType
(PARAMETER peakType). Only needed if no consensus peak file is provided.conditionSummary
: String with an arbitrary condition name that defines which condition the sample belongs to. There must be only exactly two different conditions across all samples (e.g., mutated and unmutated, day0 and day10, …). In addition, the two conditions must match the ones specified in the parameterconditionComparison
(PARAMETER conditionComparison).if applicable, all additional variables from the design formula except
conditionSummary
must also be present as a separate column.
Warning
Do not change the samples data after you started an analysis. You may introduce inconsistencies that will result in error messages. If you need to alter the sample data, we strongly advise to recalculate all steps in the pipeline.
Output¶
The pipeline produces quite a large number of output files, only some of which are however relevant for the regular user.
Note
In the following, the directory structure and the files are briefly outlined. As some directory or file names depend on specific parameters in the configuration file, curly brackets will be used to denote that the filename depends on a particular parameter or name. For example, {comparisonType}
and {regionExtension}
refer to comparisonType
(PARAMETER comparisonType) and regionExtension
( PARAMETER regionExtension) as specified in the configuration file.
Most files have one of the following file formats:
- .bed.gz (gzipped bed file)
- .tsv.gz (tab-separated value, text file with tab as column separators, gzipped)
- .rds (binary R format, read into with the function
readRDS
) - .pdf (PDF format)
- .log (text format)
FOLDER FINAL_OUTPUT
¶
In this folder, the final output files are stored. Most users want to examine the files in here for further analysis.
Sub-folder extension{regionExtension}
¶
Stores results related to the user-specified extension size (regionExtension
, PARAMETER regionExtension)
Note
In all output files, in the column permutation
, 0 always refers to the non-permuted, real data, while permutations > 0 reflect real permutations.
FILE {comparisonType}.allMotifs.tsv.gz
¶
- Summary
- Summary table for each TFBS
- Details
Columns are as follows:
- permutation: The number of the permutation. This will always be 0, so it can ignored essentially in this file.
- TF: name of the TF
- chr, MSS, MES, strand, TFBSID: Genomic location and identifier of the (extended) TFBS
- peakID: Genomic location and annotation of the overlapping peak region
- l2FC, pval, pval_adj: Results from the limma or DESeq2 analysis, see the respective documentation for details (see below for links and further explanation). These column names are shared between limma and DESeq2. l2FC are interpreted as described in the parameter
conditionComparison
( PARAMETER conditionComparison) - DESeq_baseMean, DESeq_ldcSE, DESeq_stat: Results from the DESeq2 analysis, see the DESeq2 documentation for details (e.g., ?DESeq2::results). If DESeq2 was not run for calculating log2 fold-changes (i.e., if the value for the parameter
nPermutations
( PARAMETER regionExtension) is >0), these columns are set to NA. - limma_avgExpr, limma_B, limma_t_stat: Results from the limma analysis, see the limma documentation for details (e.g., ??topTable). If limma was not run (i.e., if the value for the parameter
nPermutations
( PARAMETER regionExtension) is 0), these columns are set to NA.
FILE {comparisonType}.TF_vs_peak_distribution.tsv.gz
¶
- Summary
- This summary table contains various results regarding TFs, their log2 fold change distribution across all TFBS and differences between all TFBS and the peaks
- Details
- See the description of the file
{TF}.{comparisonType}.summary.rds
. This file aggregates the data for all TF and adds the following additional columns: - pvalue_adj: adjusted (fdr aka BH) p-value (based on pvalue_raw) - Diff_mean, Diff_median, Diff_mode, Diff_skew: Difference of the mean, median, mode, and skewness between the log2 fold-change distribution across all TFBS and the peaks, respectively
FILE {comparisonType}.summary.tsv.gz
¶
- Summary
- The final summary table that is also used for the final circular visualization.
- Details
The columns are as follows:
- TF: name of the TF
- weighted_meanDifference: the weighted mean difference of the real and background distribution across all CG bins. This value is the basis for the final calculation of the x-axis position for the circular plot.
- TFBS: The number of TF binding sites for the particular TF that overlap with the peaks
- fdr: the local FDR value that is derived from comparing the observed values against the permuted ones
- classification: RNA-Seq classification (either activator, undetermined, repressor or not-expressed)
FILE {comparisonType}.summary.circular.pdf
¶
- Summary
- The final (circular) visualization of the diffTF results.
- Details
The PDF contains multiple pages and iterates over either one or two different parameters: - 1. Significance threshold based on adjusted p-values (0.001, 0.01, 0.05, 0.1 and 0.2.) - 2. If RNA classification is integrated, different combinations of categories are shown for each individual significance threshold (1: activator-undetermined-repressor-not-expressed, 2: activator-undetermined-repressor, 3: activator-repressor).
For each variant, the values on the x-axis denote the effect size (weighted mean difference). TFs that are similarly active between the two conditions are close to 0, while higher values indicate a higher differential TF activity between the two conditions and are therefore away from 0. The y-axis (radial position) denotes the statistical significance (adjusted p-values). The significance threshold is indicated as red circular line. TFs that pass the significance threshold are labeled and, if RNA-Seq data is integrated, colored according to their predicted role (see above).
FILE {comparisonType}.diagnosticPlots.pdf
¶
- Summary
- Various diagnostic plots for the final TF activity values. TODO UPDATE
- Details
- If the number of permutations is larger than 0, the first three pages show various versions of the permuted weighted_meanDifference values and how they relate to the real ones. Permutation 0, as used everywhere throughout the pipeline, contains the real values, while any permutation > 0 refers to an actual permutation. Page 1 shows real and permuted values, page 2 only permuted ones, page 3 a density plot of the real values with the permutation thresholds as dashed lines, inside of which TFs are not labeled as they fall within the permutation and therefore noise area. The next page shows various diagnostic plots from the locfdr package to estimate the distribution median, while the remaining plots show histograms of all relevant columns in the final output table for different sets of TFs depending on a specific FDR threshold.
FOLDER PEAKS
¶
Stores peak-associated files.
FILES {comparisonType}.consensusPeaks.filtered.sorted.bed
¶
- Summary
- Only present if no consensus peak file was provided (
consensusPeaks
, PARAMETER consensusPeaks). Produced in rulefilterSexChromosomesAndSortPeaks
. Generated consensus peaks, before filtering (see below). - Details
- Filtered consensus peaks (removal of peaks from one of the following chromosomes: chrX, chrY, chrM, chrUn*, *random*, *hap|_gl*
FILE {comparisonType}.allBams.peaks.overlaps.bed.gz
¶
- Summary
- Produced in rule
intersectPeaksAndBAM
. Counts for each consensus peak with each of the input BAM files. - Details
- No details provided yet.
FILE {comparisonType}.sampleMetadata.rds
¶
- Summary
- Produced in rule
DiffPeaks
. Stores data for the input data (similar to the input sample table), for both the real data and the permutations. - Details
- No details provided yet.
FILE {comparisonType}.peaks.rds
¶
- Summary
- Produced in rule
DiffPeaks
. Stores all peaks that will be used in the analysis. - Details
- No details provided yet.
FILE {comparisonType}.peaks.tsv.gz
¶
- Summary
- Produced in rule
DiffPeaks
. Stores the results of the differential accessibility analysis for the peaks. - Details
- No details provided yet.
FILE {comparisonType}.normFacs.rds
¶
- Summary
- Produced in rule
DiffPeaks
. Gene-specific normalization factors for each sample and peak. - Details
- This file is produces after the differential accessibility analysis for the peaks. The normalization factors are used for the TF-specific differential accessibility analysis.
FILES {comparisonType}.diagnosticPlots.peaks.pdf
¶
- Summary
- Produced in rule
DiffPeaks
. Various diagnostic plots for the differential accessibility peak analysis for the real data - Details
The pages are as follows:
- MA plots
- density plots of normalized and non-normalized counts
- mean-average plots (average of the log-transformed counts vs the fold-change per peak) for each of the sample pairs
- mean SD plots (row standard deviations versus row means)
FILE {comparisonType}.DESeq.object.rds
¶
- Summary
- Produced in rule
DiffPeaks
. The DESeq2 object from the differential accessibility peak analysis. - Details
- If the number of permutations (parameter
nPermutations
(PARAMETER nPermutations) is set to 0, DESeq is fully run, otherwise the objects does only contain the counts and metadata but no results slot.
FOLDER TF-SPECIFIC
¶
Stores TF-specific files. For each TF {TF}
, a separate sub-folder {TF}
is created by the pipeline. Within this folder, the following structure is created:
Sub-folder extension{regionExtension}
¶
FILES {TF}.{comparisonType}.allBAMs.overlaps.bed.gz
and {TF}.{comparisonType}.allBAMs.overlaps.bed.summary
¶
- Summary
- Overlap and featureCounts summary file of read counts across all TFBS for all input BAM files.
- Details
- For more details, see the documentation of featureCounts.
FILE {TF}.{comparisonType}.output.tsv.gz
¶
- Summary
- Produced in rule
analyzeTF
. A summary table for the limma analysis. - Details
- See the file
{comparisonType}.allMotifs.tsv.gz
in theFINAL_OUTPUT
folder for a column description.
FILE {TF}.{comparisonType}.outputPerm.tsv.gz
¶
- Summary
- Produced in rule
analyzeTF
. A subset of the file{TF}.{comparisonType}.output.tsv.gz
that stores only the necessary permutation-specific results for subsequent steps. - Details
- This file has the following columns (see the description for the file
{TF}.{comparisonType}.output.tsv.gz
for details): - TF - TFBSID - log2fc_perm columns, which store the permutation-specific log2 fold-changes of the particular TFBS. Permutation 0 refers to the real data
FILE {TF}.{comparisonType}.summary.rds
¶
- Summary
- Produced in rule
analyzeTF
. A summary table for the log2 fold-changes across all TFBS limma results. - Details
- This file summarizes the TF-specific results for the differential analysis and has the following columns: - TF: name of the TF - permutation: The number of the permutation. - Pos_l2FC, Mean_l2FC, Median_l2FC, sd_l2FC, Mode_l2FC, skewness_l2FC: fraction of positive values, mean, median, standard deviation, mode value and Bickel’s measure of skewness of the log2 fold change distribution across all TFBS - pvalue_raw and pvalue_adj: raw and adjusted (fdr aka BH) p-value of the t-test - T_statistic: the value of the T statistic from the t-test - TFBS_num: number of TFBS
FILES {TF}.{comparisonType}.diagnosticPlots.pdf
¶
- Summary
- Produced in rule
analyzeTF
. Various diagnostic plots for the differential accessibility TFBS analysis for the real data. - Details
- See the description of the file
{comparisonType}.diagnosticPlots.peaks.pdf
in thePEAKS
folder, which has an identical structure. Here, the second last page shows a density plot of the log2 fold-changes for the specific pairwise condition that the user selected, separately for the peaks only and across all TFBS from the specific TF. The last page shows the same but in a cumulative representation.
FILE {TF}.{comparisonType}.permutationResults.rds
¶
- Summary
- Produced in rule
binningTF
. Contains a data frame that stores the results of bin-specific results. - Details
- No details provided yet.
FILE {TF}.{comparisonType}.permutationSummary.tsv.gz
¶
- Summary
- Produced in rule
binningTF
. A final summary table that summarizes the results across bins by calculating weighted means. - Details
- The data of this table are used for the final visualization.
FILE {TF}.{comparisonType}.covarianceResults.rds
¶
- Summary
- Produced in rule
binningTF
. Contains a data frame that stores the results of the pairwise bin covariances and the bin-specific weights. - Details
Note
Covariances are only computed for the real data but not the permuted ones.
FOLDER LOGS_AND_BENCHMARKS
¶
Stores various log and error files.
*.log
files from R scripts: Each log file is produced by the corresponding R script and contains debugging information as well as warnings and errors:1.produceConsensusPeaks.R.log
2.DiffPeaks.R.log
3.analyzeTF.{TF}.R.log
for each TF{TF}
4.summary1.R.log
5.binningTF.{TF}.log
for each TF{TF}
6.summaryFinal.R.log
*.log
summary files: Summary logs for user convenience, produced at very end of the pipeline only. They should contain all errors and warnings from the pipeline run.all.errors.log
all.warnings.log
FOLDER TEMP
¶
Stores temporary and intermediate files. Since they are usually not relevant for the user, they are explained only very briefly here.
Sub-folder SortedBAM
¶
Stores sorted versions of the original BAMs that are optimized for fast count retrieval using featureCounts. Only present if data are paired-end.
{basenameBAM}.bam
for each input BAM file: Produced in ruleresortBAM
. Resorted BAM file
Sub-folder extension{regionExtension}
¶
Stores results related to the user-specified extension size (regionExtension
, PARAMETER regionExtension)
{comparisonType}.allTFBS.peaks.bed.gz
: Produced in ruleintersectPeaksAndTFBS
. BED file containing all TFBS from all TF that overlap with the peaks after motif extensionconditionComparison.rds
: Produced in ruleDiffPeaks
. Stores the condition comparison as a string. Some steps in diffTF need this file as input.{comparisonType}.motifs.coord.permutation{perm}.bed.gz
and{comparisonType}.motifs.coord.nucContent.permutation{perm}.bed.gz
for each permutation{perm}
: Produced in rulecalcNucleotideContent
, and needed subsequently for the binning. Temporary and result file of bedtools nuc, respectively. The latter contains the GC content for all TFBS.{comparisonType}.checkParameterValidity.done
: temporary flag file{TF}_TFBS.sorted.bed
for each TF{TF}
: Produced in rulesortTFBSParallel
. Coordinate-sorted version of the input TFBS.{comparisonType}.allTFBS.peaks.bed.gz
: Produced in ruleintersectPeaksAndTFBS
. BED file containing all TFBS from all TF that overlap with the peaks before motif extension
Working with diffTF and FAQs¶
General remarks¶
diffTF is programmed as a Snakemake pipeline. Snakemake is a bioinformatics workflow manager that uses workflows that are described via a human readable, Python based language. It offers many advantages to the user because each step can easily be modified, parts of the pipeline can be rerun, and workflows can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition or only minimal modifications. However, with great flexibility comes a price: the learning curve to work with the pipeline might be a bit higher, especially if you have no Snakemake experience. For a deeper understanding and troubleshooting errors, some knowledge of Snakemake is invaluable.
Simply put, Snakemake executes various rules. Each rule can be thought of as a single recipe or task such as sorting a file, running an R script, etc. Each rule has, among other features, a name, an input, an output, and the command that is executed. You can see in the Snakefile
what these rules are and what they do. During the execution, the rule name is displayed, so you know exactly at which step the pipeline is at the given moment. Different rules are connected through their input and output files, so that the output of one rule becomes the input for a subsequent rule, thereby creating dependencies, which ultimately leads to the directed acyclic graph (DAG) that describes the whole workflow. You have seen such a graph in Section Workflow.
In diffTF, a rule is typically executed separately for each TF. One example for a particular rule is sorting the TFBS list for the TF CTCF.
In diffTF, the total number of jobs or rules to execute can roughly be approximated as 3 * nTF
, where nTF
stands for the number of TFs that are included in the analysis. For each TF, four rules are executed:
- Calculating read counts for each TFBS within the peak regions (rule
intersectTFBSAndBAM
) - Differential accessibility analysis (rule
analyzeTF
) - Binning step (rule
binningTF
)
In addition, one rule per permuation is executed, so an additional nPermutations
rules are performed. Lastly, a few other rules are executed that however do not add up much more to the overall rule count.
Executing diffTF - Running times and memory requirements¶
diffTF can be computationally demanding depending on the sample size and the number of peaks. In the following, we discuss various issues related to time and memory requirements and we provide some general guidelines that worked well for us.
Warning
We generally advise to run diffTF in a cluster environment. For small analysis, a local analysis on your machine might work just fine (see the example analysis in the Git repository), but running time increases substantially due to limited amount of available cores.
Analysis size¶
We now provide a very rough classification into small, medium and large with respect to the sample size and the number of peaks:
- Small: Fewer than 10-15 samples, number of peaks not exceeding 50,000-80,000, normal read depth per sample
- Large: Number of samples larger than say 20 or number of peaks clearly exceeds 100,000, or very high read depth per sample
- Medium: Anything between small and large
Memory¶
Some notes regarding memory:
- Disk space: Make sure you have enough space left on your device. As a guideline, analysis with 8 samples need around 12 GB of disk space, while a large analysis with 84 samples needs around 45 GB. The number of permutations also has an influence on the (temporary) required storage and a high number of permutations (> 500) may substantially increase the memory footprint. Note that most space is occupied in the TEMP folder, which can be deleted after an analysis has been run successfully. We note, however, that rerunning (parts of) the analysis will require regenerating files from the TEMP folder, so only delete the folder or files if you are sure that you do not need them anymore.
- Machine memory: Although most steps of the pipeline have a modest memory footprint of less than 4 GB or so, depending on the analysis size, some may need 10+ GB of RAM during execution. We therefore recommend having at least 10 GB available for large analysis (see above).
Number of cores¶
Some notes regarding the number of available cores:
- diffTF can be invoked in a highly parallelized manner, so the more CPUs are available, the better.
- you can use the
--cores
option when invoking Snakemake to specify the number of cores that are available for the analysis. If you specify 4 cores, for example, up to 4 rules can be run in parallel (if each of them occupies only 1 core), or 1 rule can use up to 4 cores. - we strongly recommend running diffTF in a cluster environment due to the massive parallelization. With Snakemake, it is easy to run diffTF in a cluster setting. Simply do the following:
- write a cluster configuration file that specifies which resources each rule needs. For guidance and user convenience, we provide different cluster configuration files for a small and large analysis. See the folder
src/clusterConfigurationTemplates
for examples. Note that these are rough estimates only. See the *Snakemake* documentation for details for how to use cluster configuration files. - invoke Snakemake with one of the available cluster modes, which will depend on your cluster system. We used
--cluster
and tested the pipeline extensively with LSF/BSUB and SLURM. For more details, see the *Snakemake* documentation
- write a cluster configuration file that specifies which resources each rule needs. For guidance and user convenience, we provide different cluster configuration files for a small and large analysis. See the folder
Total running time¶
Some notes regarding the total running time:
- the total running time is very difficult to estimate beforehand and depends on many parameters, most importantly the number of samples, their read depth, the number of peaks, and the number of TF included in the analysis.
- for small analysis such as the example analysis in the Git repository, running times are roughly 30 minutes with 2 cores for 50 TF and a few hours with all 640 TF.
- for large analysis, running time will be up to a day or so when executed on a cluster machine
Running diffTF in a cluster environment¶
If diffTF should be run in a cluster environment, the changes are minimal due to the flexibility of Snakemake. You only need to change the following:
- create a cluster configuration file in JSON format. See the files in the clusterConfigurationTemplates folder for examples. In a nutshell, this file specifies the computational requirements and job details for each job that is run via Snakemake.
- invoke Snakemake with a cluster parameter. As an example, we use the following for our SLURM cluster:
snakemake -s path/to/Snakefile --reason --configfile path/to/configfile --latency-wait 30 --notemp --printshellcmds --rerun-incomplete --timestamp --cores 16 --keep-going --jobs 400 --cluster-config path/to/clusterconfigfile --cluster " sbatch -p {cluster.queueSLURM} -J {cluster.name} -A {cluster.group} -C {cluster.nodes} --cpus-per-task {cluster.nCPUs} --mem {cluster.memory} --time {cluster.maxTime} -o "{cluster.output}" -e "{cluster.error}" --mail-type=None --parsable " --local-cores 1
Note that the –cluster string is the only part that has to be adjusted for your cluster system.
Frequently asked questions¶
Here a few typical use cases, which we will extend regularly in the future if the need arises:
- I received an error, and the pipeline did not finish.
As explained in Section Handling errors, you first have to identify and fix the error. Rerunning then becomes trivially easy: just restart Snakemake, it will start off where it left off: at the step that produced that error.
- I want to rerun a specific part of the pipeline only.
This common scenario is also easy to solve: Just invoke Snakemake with--forcerun {rulename}
, where{rulename}
is the name of the rule as defined in the Snakefile. Snakemake will then rerun the specified run and all parts downstream of the rule. If you want to avoid rerunning downstream parts (think carefully about it, as there might be changes from the rerunning that might have consequences for downstream parts also), you can combine--forcerun
with--until
and specify the same rule name for both.
- I want to modify the workflow.
Simply add or modify rules to the Snakefile, it is as easy as that.
- I want to change the value of a parameter.
TODO
If you feel that a particular use case is missing, let us know and we will add it here!
Handling errors¶
Error types¶
Errors occur during the Snakemake run can principally be divided into:
- Temporary errors (often when running in a cluster setting)
- might occur due to temporary problems such as bad nodes, file system issues or latencies
- rerunning usually fixes the problem already. Consider using the option
--restart-times
in Snakemake.
- Permanent errors
- indicates a real error related to the specific command that is executed
- rerunning does not fix the problem as they are systematic (such as a missing tool, a library problem in R)
Identify the cause¶
To troubleshoot errors, you have to first locate the exact error. Depending on how you run Snakemake (i.e., in a cluster setting or not), check the following places:
- in locale mode: the Snakemake output appears on the console. Check the output before the line “Error in rule”, and try to identify what went wrong. Errors from R script should in addition be written to the corresponding R log files in the in the
LOGS_AND_BENCHMARKS
directory. - in cluster mode: either error, output or log file of the corresponding rule that threw the error in the
LOGS_AND_BENCHMARKS
directory. If you are unsure in which file to look, identify the rule name that caused the error and search for files that contain the rule name in it.
In both cases, you can check the log file that is located in .snakemake/log
. Identify the latest log file (check the date), and then either open the file or use something along the lines of:
grep -C 5 "Error in rule" .snakemake/log/2018-07-25T095519.371892.snakemake.log
This is particularly helpful if the Snakemake output is long and you have troubles identifying the exact step in which an error occurred.
Common errors¶
We here provide a list of some of the errors that can happen and that users reported to us. This list will be extended whenever a new problem has been reported.
- R related problems
Many errors are R related. R and Bioconductor use a quite complex system of libraries and dependencies, and you may receive errors that are related to R, Bioconductor, or specific libraries.
*** caught segfault *** ... Segmentation fault ...This unfortunate message points to a problem with your R and R libraries installation and has per se nothing to do with diffTF. At least one of the installed libraries has an issue. We advise to reinstall Bioconductor in such a case, and ask someone who is experienced with this to help you. Unfortunately, this issue is so general that we cannot provide any specific solutions as this type of error is very general. To troubleshoot and identify exactly which library or function causes this, you may run the R script that failed in debug mode and go through it line by line. See the next section for more details.
Fixing the error¶
General guidelines¶
After locating the error, fix it accordingly. We here provide some guidelines of different error types that may help you fixing the errors you receive:
- Errors related to erroneous input: These errors are easy to fix, and the error message should be indicative. If not, please let us know, and we improve the error message in the pipeline.
- Errors of technical nature: Errors related to memory, missing programs, R libraries etc can be fixed easily by making sure the necessary tools are installed and by executing the pipeline in an environment that provides the required technical requirements. For example, if you receive a memory-related error, try to increase the available memory. In a cluster setting, adjust the cluster configuration file accordingly by either increasing the default memory or (preferably) or by overriding the default values for the specific rule.
- Errors related to Snakemake: In rare cases, the error can be due to Snakemake (corrupt metadata, missing files, etc). If you suspect this to be the case, you may delete the hidden
.snakemake
directory in the folder from which you started the analysis. Snakemake will regenerate it the next time you invoke it then. - Errors related to the input data: Error messages that indicate the problem might be located in the data are more difficult to fix, and we cannot provide guidelines here. Feel free to contact us.
Debugging R scripts to identify the cause of an error¶
If an R script fails with a technical error such as caught segfault
(a segmentation fault), you may want to identify the library or function call that causes the message in order to figure out which library to reinstall. To do so, open the R script that fails in RStudio, and execute the script line by line until you identify the line that causes the issue. Importantly, read the instructions in the section at the beginning of the script that is called SAVE SNAKEMAKE S4 OBJECT THAT IS PASSED ALONG FOR DEBUGGING PURPOSES
. Briefly, you simply have to make the snakemake object available in your R workspace, which contains all necessary information to execute the R script properly. Normally, Snakemake automatically loads that when executing a script. To do so, simply execute the line that is pasted there in R, it is something like this:
snakemake = readRDS("{outputFolder}/LOGS_AND_BENCHMARKS/0.checkParameters.R.rds")
Replace {outputFolder}
by the folder you used for the analysis, and adjust the 0.checkParameters
part also accordingly. Essentially, you just have to provide the path to the corresponding file that is located in the LOGS_AND_BENCHMARKS
subdirectly within the specified output directory.
Rerunning Snakemake¶
After fixing the error, rerun Snakemake. Snakemake will continue at the point at which the error message occurred, without rerunning already successfully computed previous steps (unless specified otherwise).