6.77 KB
Newer Older
Thomas Sebastian Schmidt's avatar
Thomas Sebastian Schmidt committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
# Analysis Code: Extensive transmission of microbes along the gastrointestinal tract

This repository contains data processing and analysis code underlying the study, "Extensive transmission of microbes along the gastrointestinal tract", [eLife 2019](

Below, the general workflow of the analyses is outlined, putting individual scripts in the repo into context.

## Data

The study used salivary and fecal metagenomes from several public and newly generated datasets. See the [methods section](  in the original study for an overview of the various cohorts that were used. All raw data is available via the European Nucleotide Archive:

* `FJ-CTR` (FijiCOMP):
* `CN-RA`:
* `LU-T1D`: (newly generated data added under the same project accession)
* `FR-CRC` & `DE-CTR`: & (newly generated data)

A full list of samples and the metadata used in the study are available as [table S1]( in the published study.

Taxon phenotypic and other metadata were obtained from the `PATRIC` db, and further amended by manual curation. It is available as [table S2]( in the published study.

## Data Processing

The analysis workflow is described in detail in the published study. Here, the various processing steps are only outlined and linked to the respective code snippets.

### Preprocessing: Read Filtering & Mapping

Raw reads for all samples were filtered, truncated by quality and then mapped to representative genomes for species-level clusters of the [`proGenomes`]( database. The `NGless` workflow script that was used is available as [sh/Rep9alignment.ngl](

This resulted in one `bam` file per biological sample, sorted and filtered to uniquely mapping reads (mapping to only one location in the reference) across all reference genomes.

### SNV Calling & Filtering

The tool [`metaSNV`]( (v1) was used to call Single Nucleotide Variants across metagenomic samples for each reference microbial genome. As an input, `metaSNV` expects a list of `bam` files to consider for processing, as well as indexed `bed` files for the underlying genome reference set.

`metaSNV` was run with the following highly permissive parameters:

* `-c 10`: minimum coverage of 10 reads at genomic position, across all samples
* `-t 2`: minimum number of non-reference reads to call SNV
* `-p 0.001`: minimum frequency to call "population-level" SNV
* `-d 2500`: maximum depth for pileup (passed as parameter `d` to `samtools mpileup`)

The resulting raw SNV tables were split into per-taxon (per-genome) subtables and filtered using the script [R/]( to only contain taxa satisfying the following criteria *in ≥10 samples*:

* horizontal genome coverage (breadth) ≥ 0.05
* average vertical genome coverage (depth) ≥ 0.25
* mOTU abundance ≥ 10^-6

'Informative' SNVs for transmission detection were defined as follows:

* coverage ≥1 at focal position (putative SNV) in ≥10 salivary and ≥10 fecal samples
* SNV coverage ≥1 at focal position in ≥1 salivary and ≥1 fecal sample

Information across per-taxon runs is then integrated using the script [](

From the resulting per-taxon tables, allele incidences and frequencies across samples were calculated. SNV tables were further filtered differentially for the various downstream analyses.

### Quantification of Oral-Fecal Transmission

The rationale and formulas used to compute oral-fecal "transmission scores" per tested subject are described in detail in the published study. The following scripts are required to run these computations:

* [R/job.prepare.detect_transmission.alleles.R]( performs potentially memory intensive pre-processing steps (preparation of allele tables per taxon, pre-computation of per-cohort allele frequency tables, etc.)
* [R/job.detect_transmission.alleles.R]( run per taxon using pre-processed allele (SNV) tables; this script does the "heavy lifting" and applies the transmission scoring algorithm described in the published study
* [R/consolidate.detect_transmission.SNV.R]( integrates outputs of the above scripts across all tested taxa

While the above scripts serve to *detect* transmission signatures (in the sense of a test for significance of SNV overlap), the quantification of the relative contributions of oral and fecal strain populations in a subject was performed separately, using a different rationale (see paper):

* [R/job.quantify_transmission.R]( to quantify the fraction of fecal strain populations attributable to oral strains, and vice versa (run per taxon)
* [R/consolidate.quantify_transmission.SNV.R]( to consolidate these results across all taa

### Longitudinal Strain Stability and Directionality of Transmission

For a subset of data, longitudinal replicates were available. These were used to

* establish the longitudinal stability of strains per species in the mouth and gut (scripts `job.establish_temporal_stability.R` and `consolidate.establish_temporal_stability.R`)
* establish the directionality of transmission (mouth -> gut vs gut -> mouth, in the sense of a significance call) (scripts `job.establish_directionality.R` and `consolidate.establish_directionality.R`)
* establish transmission rates (mouth -> gut vs gut -> mouth, quantifying the transmission per time) (scripts `job.establish_transmission_rates.R` and `consolidate.establish_transmission_rates.R`)

## Analyses & Generation of Manuscript Figures

Using the data generated by running the above scripts, data analysis and generation of manuscript figures (sub-panels) was performed using the code in folder `Rmd`.

Thomas Sebastian Schmidt's avatar
Thomas Sebastian Schmidt committed