Commit bd4bcd4e authored by Alessio Milanese's avatar Alessio Milanese
Browse files

add practical from previous year

parent 265911cf
In this practical we are going to:
# Practical 1: Quality control and read trimming
Objectives of the practical:
- Familiarise with fastq files, how to look at them and what they represent;
- Learn how to understand if the reads in your files are of good quality;
- Learn how to remove and trim low quality reads.
Required tools:
- [trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) (install with [conda](https://anaconda.org/bioconda/trimmomatic))
- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (install with [conda](https://anaconda.org/bioconda/fastqc))
Required file:
- [https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_1.fastq](https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_1.fastq) (~22 Mb)
- [https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_2.fastq](https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_2.fastq) (~22 Mb)
## Exercises
1. Open the terminal, create a directory in the `~/Desktop` and download the two required files with `wget`.
<details><summary>SOLUTION</summary>
<p>
```
cd ~/Desktop
mkdir practical_2
cd practical_2
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_1.fastq
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/raw_reads_2.fastq
```
</p>
2. Look at the fastq files, how are they structured?
<details><summary>SOLUTION</summary>
<p>
When we work with metagenomic data we usually have two fastq files produced by
the Illumina sequencer:
- a file containing the forward reads
- a file containing the reverse reads
Usually the prefix of the file name is the same, and we have `_1_` for the file
with forward reads and `_2_` for the file with reverse reads, example:
```
HYG3LBGXC_261_1_19s00-sample119s004346_1_sequence.txt.gz
HYG3LBGXC_261_1_19s00-sample119s004346_2_sequence.txt.gz
```
A fastq file contains 4 lines for each read, with the following information:
| Line | Description |
| ------ | ------ |
| 1 | A line starting with `@` and the read id |
| 2 | The DNA sequence |
| 3 | A line starting with `+` and sometimes the same information as in line 1 |
| 4 | A string of characters that represents the quality score (same number of characters as in line 2) |
We can have a look at the first read (4 lines) with `head -n 4 raw_reads_1.fastq`:
```
@read1
CTCTAGCAGATACTCTCCCTATATGAACTCATGGGGGCGGGGATGCCCGTCCTGTGTAACAATAAAAAATAACCTTGATGAGGGCGGATAGATCCTACCT
+
BBBFFFF<FBFFFFFFIIIIBFBFFBFIIIIFFFIIIIIFFBFFFFFB77BBFFBBBBBBBBBBFFFB7<0BBBB<BBBBFBBBFFF<<7BBBFFFBBBB
```
Each character in the fourth line can be converted to a quality score ([Phred-33](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm)) from 1 to 40:
```
Character: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
| | | | |
Quality score: 0........10........20........30........40
```
And, for each quality score there is an associated probability for correctly calling a base:
| Quality Score | Probability of incorrect base call | Base call accuracy |
| ------ | ------ | ------ |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
</p>
3. Use `fastqc` ([link](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) to check the quality of the reads in the two files.
<details><summary>SOLUTION</summary>
<p>
Run on the terminal:
```
fastqc raw_reads_1.fastq
fastqc raw_reads_2.fastq
```
We can open the [html file for the reverse reads](https://www.embl.de/download/zeller/metaG_course/pics/raw_reads_2_fastqc.html). There a bunch of graphs, but probably the most interesting is:
![alt text](https://www.embl.de/download/zeller/metaG_course/pics/average_quality_reverse_read_raw.png)
This view shows an overview of the range of quality values across all bases at each position in the FastQ file.
For each position a BoxWhisker type plot is drawn with:
- median as a red line,
- the inter-quartile range (25-75%) as a yellow box,
- upper and lower whiskers represent the 10% and 90% points,
- the blue line represents the mean quality.
The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.
</p>
4. Have a look at the first four reads, which reads would you remove?
<details><summary>SOLUTION</summary>
<p>
If you have a look at the first four reads with `head -n 16 raw_reads_1.fastq` and `head -n 16 raw_reads_2.fastq`, you can see that:
- `@read1` in the forward file looks fine, while in the reverse file only 22 nucleotides have been sequenced:
```
@read1
GCTCGATGGAGAGGGAAGGTT
+
BBBFFFFBFFFFFBBFFFIFF
```
- `@read54` in the forward file has only 4 sequenced nucleotides and all other are N's:
```
@read54
NATTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
#AAA################################################################################################
```
- `@read315` in the forward file has only 2 low quality nucleotides at the end:
```
@read315
TTGTAGTGCTTGTCGAATTCGACGTAGTATTTGCCTACGAGGTGGTCGCCCTTCATGCCCGACGTGGCGGGCGTTTCGCCGTTGCCGTAGAGTTTCCANN
+
BBBFFFFFFFFFFIIIIIIIIIIIFFIFIIIIIIIIIIIIIIFIIIIIIIIIIIFFFFFFFBBBBFFFFFFB<0<7<B<B<<BB<B<7<<0000B<<<##
```
- `@read337` is of low quality in both files.
We would like to remove `@read1` reverse and `@read54` forward because they are of low quality. On the other hand, `@read315` forward looks fine, except for the last 2 nucleotides.
Hence, we would like to trim only the low quality nucleotides at the end (or beginning) of the read.
For `@read337`, we would like to remove both forward and reverse.
</p>
5. Use `trimmomatic` ([trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)) to remove short reads (<40 bp), low quality reads and remove low quality bases at the end/beginning of the reads. What files are produced by trimmomatic?
<details><summary>SOLUTION</summary>
<p>
Use the following command:
```bash
trimmomatic PE raw_reads_1.fastq raw_reads_2.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 -baseout filtered
```
Which produces 4 files:
- `filtered_1P`, containing the forward reads that pass the filter and have a mate (in `filtered_2P`);
- `filtered_1U`, containing the forward reads that pass the filter and do not have a mate (the paired reverse read didn't pass the filter)
- `filtered_2P`, containing the reverse reads that pass the filter and have a mate (in `filtered_1P`);
- `filtered_2U`, containing the reverse reads that pass the filter and do not have a mate (the paired forward read didn't pass the filter)
</p>
6. Check the filtered reverse reads with `fastqc`, is the read quality improved?
<details><summary>SOLUTION</summary>
<p>
If we run again `fastQC` on the reverse paired end reads after filtering (`filtered_2P`), with:
```bash
fastqc filtered_2P
```
we obtain the following [html file](https://www.embl.de/download/zeller/metaG_course/pics/filtered_2P_fastqc.html). Now, we have better quality reads:
![alt text](https://www.embl.de/download/zeller/metaG_course/pics/compared_fastqc.png)
</p>
## References and additional information
1. [Intro to FastQC with explanation of a FASTA file](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/02_assessing_quality.html)
2. [Per base sequence quality fastQC plot](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)
3. [Trimmomatic manual](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf)
# Practical 2: Demonstration of mOTUs for taxonomic profiling
Objectives of the practical:
- Learn how to create a taxonomic profile for a (quality filtered) metagenomic sample using mOTUs2;
- Familiarise with the parameters and the output of a taxonomic profiling tool;
- Learn how to merge many taxonomic profile files (all the profiles from your study) into one file.
Required tools:
- [mOTUs2](https://github.com/motu-tool/mOTUs_v2) (install with [conda](https://anaconda.org/bioconda/motus))
Required file:
- filtered fastq files created in the previous practical
- [https://www.embl.de/download/zeller/metaG_course/motus_profiles/tax_profiles.tar.gz](https://www.embl.de/download/zeller/metaG_course/motus_profiles/tax_profiles.tar.gz) (~1 Mb)
## Taxonomic profiling
The majority of microbiome studies rely on an accurate identification of the microbes and quantification their abundance in the sample under study, a process called taxonomic profiling.
![alt text](https://www.embl.de/download/zeller/metaG_course/pics/taxonomic_profiling.png)
We would like to save the profile in a file like:
```
Bacteroides_vulgatus 0.34
Prevotella_copri 0.16
Eubacterium_rectale 0.10
...
```
And, if we pull together many samples:
```
sample_1 sample_2
Bacteroides_vulgatus 0.34 0.01
Prevotella_copri 0.16 0.42
Eubacterium_rectale 0.10 0.00
```
We use [mOTUs2](https://github.com/motu-tool/mOTUs_v2) to create taxonomic profiles of metagenomic samples.
## Exercises
1. Use `motus` (manual: [link](https://github.com/motu-tool/mOTUs_v2#simple-examples)) to create a profile from the files created by trimmomatic.
<details><summary>SOLUTION</summary>
<p>
You can download the 4 files that you created with `trimmomatic` in the previous practical with (in case you don't have them):
```
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/filtered_1P
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/filtered_1U
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/filtered_2P
wget https://www.embl.de/download/zeller/metaG_course/fastq_files/filtered_2U
```
You can profile the sample with:
```
motus profile -f filtered_1P -r filtered_2P -s filtered_1U,filtered_2U -o test_sample.motus
```
Which produces (`head test_sample.motus`):
```
# git tag version 2.5.1 | motus version 2.5.1 | map_tax 2.5.1 | gene database: nr2.5.1 | calc_mgc 2.5.1 -y insert.scaled_counts -l 75 | calc_motu 2.5.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_2.5.1 meta_mOTU_2.5.1
# call: python motus profile -f filtered_1P -r filtered_2P -s filtered_1U,filtered_2U -o test_sample.motus
#consensus_taxonomy unnamed sample
Leptospira alexanderi [ref_mOTU_v25_00001] 0.0000000000
Leptospira weilii [ref_mOTU_v25_00002] 0.0000000000
Chryseobacterium sp. [ref_mOTU_v25_00004] 0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v25_00005] 0.0000000000
Chryseobacterium indologenes [ref_mOTU_v25_00006] 0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v25_00007] 0.0000000000
Chryseobacterium jejuense [ref_mOTU_v25_00008] 0.0000000000
```
In the profile, `ref_mOTU` represent species with a reference genome sequence in NCBI. While `meta_mOTU` represent species that have not been sequenced yet (they do not have a representative genome in NCBI).
the `-1` at the end of the file (check with `tail test_sample.motus`), represents species that we know to be present, but are not able to profile. The `-1` is useful to have a correct evaluation of the relative abundance of all species (more info [here](https://github.com/motu-tool/mOTUs_v2/wiki/Explain-the-resulting-profile#-1)).
</p>
2. How many species are detected? How many are reference species and how many are unknown species?
<details><summary>SOLUTION</summary>
<p>
We can have a look at the identified species that are different from zero with `cat test_sample.motus | grep -v "0.0000000000"`:
```
# git tag version 2.5.1 | motus version 2.5.1 | map_tax 2.5.1 | gene database: nr2.5.1 | calc_mgc 2.5.1 -y insert.scaled_counts -l 75 | calc_motu 2.5.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_2.5.1 meta_mOTU_2.5.1
# call: python /Users/milanese/Dropbox/PhD/mOTUsv2_tool/2.5.1/mOTUs_v2/motus profile -f filtered_1P -r filtered_2P -s filtered_1U,filtered_2U -o test_sample.motus -t 2
#consensus_taxonomy unnamed sample
Proteobacteria sp. [ref_mOTU_v25_00095] 0.0001743716
Eggerthella lenta [ref_mOTU_v25_00719] 0.0001584466
Ruminococcus bromii [ref_mOTU_v25_00853] 0.0040007632
```
With `cat test_sample.motus | grep -v "0.0000000000" | grep -v "#" | wc -l`, we can see that we have 139 detected species.
To count the number of ref-mOTUs you can use:
```
cat test_sample.motus | grep -v "0.0000000000" | grep -c "ref_mOTU_"
```
and for meta-mOTUs:
```
cat test_sample.motus | grep -v "0.0000000000" | grep -c "meta_mOTU_"
```
</p>
3. Can you change some parameters in `motus` to profile more or less species?
<details><summary>SOLUTION</summary>
<p>
In the mOTUs wiki page there are more information: [link](https://github.com/motu-tool/mOTUs_v2/wiki/Increase-precision-or-recall).
Basically, you can use `-g` and `-l` to increase(/decrease) the specificity of the detected species. Lower values allows you to profile more species (lax thresholds); while higher values corresponds to stringent cutoffs (and hence less profiled species).
If we run:
```
motus profile -f filtered_1P -r filtered_2P -s filtered_1U,filtered_2U -o test_sample_relax.motus -t 2 -g 1 -l 40
cat test_sample_relax.motus | grep -v "0.0000000000" | grep -v "#" | wc -l
```
We obtain 218 species (79 more species than the default parameters). While, with:
```
motus profile -f filtered_1P -r filtered_2P -s filtered_1U,filtered_2U -o test_sample_stringent.motus -t 2 -g 6 -l 90
cat test_sample_stringent.motus | grep -v "0.0000000000" | grep -v "#" | wc -l
```
we profile only 86 species (53 less than using the standard parameters).
If you don't manage to run them (~4 mins each), you can download the profiles with:
```
wget https://www.embl.de/download/zeller/metaG_course/motus_profiles/test_sample/test_sample.motus
wget https://www.embl.de/download/zeller/metaG_course/motus_profiles/test_sample/test_sample_relax.motus
wget https://www.embl.de/download/zeller/metaG_course/motus_profiles/test_sample/test_sample_stringent.motus
```
</p>
4. How can you merge different motus profiles into one file?
Download the following profiles and merge them into one file.
```
wget https://www.embl.de/download/zeller/metaG_course/motus_profiles/tax_profiles.tar.gz
tar -zxvf tax_profiles.tar.gz
```
How the merged file differs from the example above?
<details><summary>SOLUTION</summary>
<p>
You can use `motus merge` (more info [here](https://github.com/motu-tool/mOTUs_v2/wiki/Merge-profiles-into-a-table)).
Note that the name of the samples is speciefied by `-n` when using `motus profile`.
You can run:
```
motus merge -d tax_profiles -o all_samples.motus
```
which looks like (`head all_samples.motus`):
```
# motus version 2.5.1 | merge 2.5.1 | info merged profiles: # git tag version 2.5.1 | motus version 2.5.1 | map_tax 2.5.1 | gene database: nr2.5.1 | calc_mgc 2.5.1 -y insert.scaled_counts -l 75 | calc_motu 2.5.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_2.5.1 meta_mOTU_2.5.1
# call: python /Users/milanese/Dropbox/PhD/mOTUsv2_tool/2.5.1/mOTUs_v2/motus merge -d tax_profiles -o all_samples.motus
#consensus_taxonomy VF_264 VF_47 VF_77 VF_90
Leptospira alexanderi [ref_mOTU_v25_00001] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Leptospira weilii [ref_mOTU_v25_00002] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Chryseobacterium sp. [ref_mOTU_v25_00004] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Chryseobacterium gallinarum [ref_mOTU_v25_00005] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Chryseobacterium indologenes [ref_mOTU_v25_00006] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v25_00007] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Chryseobacterium jejuense [ref_mOTU_v25_00008] 0.0000000000 0.0000000000 0.0000000000 0.0000000000
```
Note that that if you want to load this table into another tool (for example R), you would like to skip the first 2 rows.
</p>
5. (bonus) Profile the unfiltered fastq files from the test sample and compare them to the filtered one.
<details><summary>SOLUTION</summary>
<p>
We can profile the unfiltered reads with:
```
motus profile -f raw_reads_1.fastq -r raw_reads_2.fastq -t 2 -n unfiltered -o unfiltered.motus
```
Let's have a look at the species that are profiled by both:
```
motus merge -i unfiltered.motus,test_sample.motus | grep -v "0.0000000000" | grep -v "#"
```
Which results in:
```
Proteobacteria sp. [ref_mOTU_v25_00095] 0.0002007051 0.0001743716
Eggerthella lenta [ref_mOTU_v25_00719] 0.0001496362 0.0001584466
Ruminococcus bromii [ref_mOTU_v25_00853] 0.0037746076 0.0040007632
Bacteroides rodentium/uniformis [ref_mOTU_v25_00855] 0.0212321086 0.0214878111
Anaerostipes hadrus [ref_mOTU_v25_00857] 0.0001321414 0.0001399217
Roseburia faecis [ref_mOTU_v25_00859] 0.0003954637 0.0004187479
Roseburia inulinivorans [ref_mOTU_v25_00860] 0.0001544263 0.0001639940
```
The two profiles are really similar. mOTUs is filtering the reads internally based on how good they map to the marker gene sequences; hence trimming and filtering the reads before will not affect much the profiles. For other analysis (like building metagenome-assembled genomes) trimming the reads improve the result.
</p>
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment