Update README.md

parent e7c7bb11
# oral-fecal-transmission-public
# Analysis Code: Extensive transmission of microbes along the gastrointestinal tract
This repository contains data processing and analysis code underlying the study, "Extensive transmission of microbes along the gastrointestinal tract", [eLife 2019](https://elifesciences.org/articles/42693).
Below, the general workflow of the analyses is outlined, putting individual scripts in the repo into context.
## Data
The study used salivary and fecal metagenomes from several public and newly generated datasets. See the [methods section](https://elifesciences.org/articles/42693#s3) in the original study for an overview of the various cohorts that were used. All raw data is available via the European Nucleotide Archive:
* `FJ-CTR` (FijiCOMP): https://www.ebi.ac.uk/ena/browser/view/PRJNA217052
* `CN-RA`: https://www.ebi.ac.uk/ena/browser/view/PRJEB6997
* `LU-T1D`: https://www.ebi.ac.uk/ena/browser/view/PRJNA289586 (newly generated data added under the same project accession)
* `FR-CRC` & `DE-CTR`: https://www.ebi.ac.uk/ena/browser/view/PRJEB6070 https://www.ebi.ac.uk/ena/browser/view/PRJEB8347 & https://www.ebi.ac.uk/ena/browser/view/PRJEB28422 (newly generated data)
A full list of samples and the metadata used in the study are available as [table S1](https://doi.org/10.7554/eLife.42693.015) in the published study.
Taxon phenotypic and other metadata were obtained from the `PATRIC` db, and further amended by manual curation. It is available as [table S2](https://doi.org/10.7554/eLife.42693.016) in the published study.
## Data Processing
The analysis workflow is described in detail in the published study. Here, the various processing steps are only outlined and linked to the respective code snippets.
### Preprocessing: Read Filtering & Mapping
Raw reads for all samples were filtered, truncated by quality and then mapped to representative genomes for species-level clusters of the [`proGenomes`](http://progenomes1.embl.de) database. The `NGless` workflow script that was used is available as [sh/Rep9alignment.ngl](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/sh/Rep9alignment.ngl).
This resulted in one `bam` file per biological sample, sorted and filtered to uniquely mapping reads (mapping to only one location in the reference) across all reference genomes.
### SNV Calling & Filtering
The tool [`metaSNV`](http://metasnv.embl.de) (v1) was used to call Single Nucleotide Variants across metagenomic samples for each reference microbial genome. As an input, `metaSNV` expects a list of `bam` files to consider for processing, as well as indexed `bed` files for the underlying genome reference set.
`metaSNV` was run with the following highly permissive parameters:
* `-c 10`: minimum coverage of 10 reads at genomic position, across all samples
* `-t 2`: minimum number of non-reference reads to call SNV
* `-p 0.001`: minimum frequency to call "population-level" SNV
* `-d 2500`: maximum depth for pileup (passed as parameter `d` to `samtools mpileup`)
The resulting raw SNV tables were split into per-taxon (per-genome) subtables and filtered using the script [R/job.prepare.data.SNV.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/job.prepare.data.SNV.R) to only contain taxa satisfying the following criteria *in ≥10 samples*:
* horizontal genome coverage (breadth) ≥ 0.05
* average vertical genome coverage (depth) ≥ 0.25
* mOTU abundance ≥ 10^-6
'Informative' SNVs for transmission detection were defined as follows:
* coverage ≥1 at focal position (putative SNV) in ≥10 salivary and ≥10 fecal samples
* SNV coverage ≥1 at focal position in ≥1 salivary and ≥1 fecal sample
Information across per-taxon runs is then integrated using the script [consolidate.prepare.data.SNV.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/consolidate.prepare.data.SNV.R).
From the resulting per-taxon tables, allele incidences and frequencies across samples were calculated. SNV tables were further filtered differentially for the various downstream analyses.
### Quantification of Oral-Fecal Transmission
The rationale and formulas used to compute oral-fecal "transmission scores" per tested subject are described in detail in the published study. The following scripts are required to run these computations:
* [R/job.prepare.detect_transmission.alleles.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/job.prepare.detect_transmission.alleles.R) performs potentially memory intensive pre-processing steps (preparation of allele tables per taxon, pre-computation of per-cohort allele frequency tables, etc.)
* [R/job.detect_transmission.alleles.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/job.detect_transmission.alleles.R) run per taxon using pre-processed allele (SNV) tables; this script does the "heavy lifting" and applies the transmission scoring algorithm described in the published study
* [R/consolidate.detect_transmission.SNV.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/consolidate.detect_transmission.SNV.R) integrates outputs of the above scripts across all tested taxa
While the above scripts serve to *detect* transmission signatures (in the sense of a test for significance of SNV overlap), the quantification of the relative contributions of oral and fecal strain populations in a subject was performed separately, using a different rationale (see paper):
* [R/job.quantify_transmission.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/job.quantify_transmission.R) to quantify the fraction of fecal strain populations attributable to oral strains, and vice versa (run per taxon)
* [R/consolidate.quantify_transmission.SNV.R](https://git.embl.de/tschmidt/oral-fecal-transmission-public/-/blob/master/R/consolidate.quantify_transmission.SNV.R) to consolidate these results across all taa
### Longitudinal Strain Stability and Directionality of Transmission
For a subset of data, longitudinal replicates were available. These were used to
* establish the longitudinal stability of strains per species in the mouth and gut (scripts `job.establish_temporal_stability.R` and `consolidate.establish_temporal_stability.R`)
* establish the directionality of transmission (mouth -> gut vs gut -> mouth, in the sense of a significance call) (scripts `job.establish_directionality.R` and `consolidate.establish_directionality.R`)
* establish transmission rates (mouth -> gut vs gut -> mouth, quantifying the transmission per time) (scripts `job.establish_transmission_rates.R` and `consolidate.establish_transmission_rates.R`)
## Analyses & Generation of Manuscript Figures
Using the data generated by running the above scripts, data analysis and generation of manuscript figures (sub-panels) was performed using the code in folder `Rmd`.
Analysis code for a study on oral-fecal microbial transmission.
\ No newline at end of file
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment