Each character in the fourth line can be converted to a quality score ([Phred-33](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm)) from 1 to 40:
Each quality score is associated with a probability that the base was called incorrectly, and hence with a base call accuracy:
| Quality Score | Probability of incorrect base call | Base call accuracy |
| ------ | ------ | ------ |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
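
As a rough sketch of this conversion (not part of the original exercise), the shell functions below turn a Phred+33 character into its quality score by subtracting 33 from its ASCII code, and compute the error probability as 10^(-Q/10). The function names `phred_score` and `error_prob` are made up for this illustration.

```bash
# Convert one Phred+33 quality character to its quality score.
# (hypothetical helper, named here only for illustration)
phred_score() {
    local ascii
    # A leading single quote makes printf print the character's ASCII code
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

# Probability of an incorrect base call for a quality score Q: 10^(-Q/10)
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_score 'I'   # prints 40
phred_score 'B'   # prints 33
error_prob 20     # prints 0.0100
```

For example, the `F` characters that are common in the reads of this exercise have ASCII code 70, which corresponds to a quality score of 37.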
</p>
3. Use `fastqc` ([link](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) to check the quality of the reads in the two files.
<details><summary>SOLUTION</summary>
<p>
Run on the terminal:
```
fastqc raw_reads_1.fastq
fastqc raw_reads_2.fastq
```
We can open the [html file for the reverse reads](https://www.embl.de/download/zeller/metaG_course/pics/raw_reads_2_fastqc.html). There are a bunch of graphs, but probably the most interesting is the per base sequence quality plot:
This view shows an overview of the range of quality values across all bases at each position in the FastQ file.
For each position a BoxWhisker type plot is drawn with:
- median as a red line,
- the inter-quartile range (25-75%) as a yellow box,
- upper and lower whiskers represent the 10% and 90% points,
- the blue line represents the mean quality.
The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.
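
To get a feeling for what the blue line (mean quality per position) summarises, here is a minimal sketch that computes it with `awk`. It is not how `fastqc` is implemented; it assumes Phred+33 encoding and FASTQ records of exactly four lines, and both the function name `mean_quality_per_position` and the two demo reads are made up for illustration.

```bash
# Hypothetical two-read FASTQ, created only so the sketch runs on its own
demo=$(mktemp)
cat > "$demo" <<'EOF'
@r1
ACGT
+
IIF#
@r2
ACGT
+
IFF#
EOF

# Print "position <TAB> mean quality" for every position in the reads
mean_quality_per_position() {
    awk '
    # Lookup table from printable ASCII character to its code
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    # Every fourth line of a FASTQ record is the quality string
    NR % 4 == 0 {
        n = length($0)
        if (n > max) max = n
        for (p = 1; p <= n; p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            cnt[p]++
        }
    }
    END { for (p = 1; p <= max; p++) printf "%d\t%.1f\n", p, sum[p] / cnt[p] }
    ' "$1"
}

mean_quality_per_position "$demo"
# prints:
# 1	40.0
# 2	38.5
# 3	37.0
# 4	2.0
```

On the course data you could run `mean_quality_per_position raw_reads_2.fastq` and compare the drop towards the end of the reads with the plot.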
</p>
4. Have a look at the first four reads. Which reads would you remove?
<details><summary>SOLUTION</summary>
<p>
If you have a look at the first four reads with `head -n 16 raw_reads_1.fastq` and `head -n 16 raw_reads_2.fastq`, you can see that:
- `@read1` in the forward file looks fine, while in the reverse file only 21 nucleotides have been sequenced:
```
@read1
GCTCGATGGAGAGGGAAGGTT
+
BBBFFFFBFFFFFBBFFFIFF
```
- `@read54` in the forward file has only 4 sequenced nucleotides and all others are N's:
We would like to remove `@read1` reverse and `@read54` forward because they are of low quality. On the other hand, `@read315` forward looks fine, except for the last 2 nucleotides.
Hence, we would like to trim only the low quality nucleotides at the end (or beginning) of the read.
For `@read337`, we would like to remove both forward and reverse.
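
Instead of inspecting the reads by eye, the same checks (read length and number of `N`s) can be sketched with `awk`. This is only an illustration: the function name `read_stats` and the two demo reads are hypothetical, and FASTQ records are assumed to span exactly four lines.

```bash
# Hypothetical demo reads, created only so the sketch runs on its own
demo=$(mktemp)
cat > "$demo" <<'EOF'
@readA
ACGTACGT
+
IIIIIIII
@readB
ACNNNNNN
+
II######
EOF

# Print "read id <TAB> read length <TAB> number of N's" for every read
read_stats() {
    awk '
    NR % 4 == 1 { id = substr($1, 2) }      # header line, drop the leading @
    NR % 4 == 2 {
        seq = $0
        n = gsub(/N/, "", seq)              # gsub returns the number of N removed
        printf "%s\t%d\t%d\n", id, length($0), n
    }' "$1"
}

read_stats "$demo"
# prints:
# readA	8	0
# readB	8	6
```

Running `read_stats raw_reads_1.fastq | head -n 4` on the course data should list the same first four reads discussed above.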
</p>
5. Use `trimmomatic` ([trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)) to remove short reads (<40 bp) and low quality reads, and to trim low quality bases at the beginning/end of the reads. What files are produced by trimmomatic?
<details><summary>SOLUTION</summary>
<p>
Use the following command:
```bash
trimmomatic PE raw_reads_1.fastq raw_reads_2.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 -baseout filtered
```
This produces 4 files:
- `filtered_1P`, containing the forward reads that pass the filter and have a mate (in `filtered_2P`);
- `filtered_1U`, containing the forward reads that pass the filter and do not have a mate (the paired reverse read didn't pass the filter);
- `filtered_2P`, containing the reverse reads that pass the filter and have a mate (in `filtered_1P`);
- `filtered_2U`, containing the reverse reads that pass the filter and do not have a mate (the paired forward read didn't pass the filter).
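
A quick way to see how many reads ended up in each file is to count the lines and divide by four. The sketch below assumes uncompressed FASTQ files with exactly four lines per record; `count_reads` and the demo file are hypothetical names used only for illustration.

```bash
# Number of reads in an uncompressed FASTQ file (4 lines per read)
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical two-read file, created only so the sketch runs on its own
demo=$(mktemp)
cat > "$demo" <<'EOF'
@r1
ACGT
+
IIII
@r2
TTGA
+
FFFF
EOF

count_reads "$demo"   # prints 2

# On the trimmomatic output you could loop over the four files:
# for f in filtered_1P filtered_1U filtered_2P filtered_2U; do
#     printf '%s\t%s\n' "$f" "$(count_reads "$f")"
# done
```

Since every read in `filtered_1P` has its mate in `filtered_2P`, these two files should contain the same number of reads.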
</p>
6. Check the filtered reverse reads with `fastqc`. Has the read quality improved?
<details><summary>SOLUTION</summary>
<p>
If we run `fastqc` again on the reverse paired-end reads after filtering (`filtered_2P`), with:
```bash
fastqc filtered_2P
```
we obtain the following [html file](https://www.embl.de/download/zeller/metaG_course/pics/filtered_2P_fastqc.html). Now, we have better quality reads:
The majority of microbiome studies rely on an accurate identification of the microbes and quantification of their abundance in the sample under study, a process called taxonomic profiling.
In the profile, `ref_mOTU` entries represent species with a reference genome sequence in NCBI, while `meta_mOTU` entries represent species that have not been sequenced yet (they do not have a representative genome in NCBI).
The `-1` at the end of the file (check with `tail test_sample.motus`) represents species that we know to be present, but are not able to profile. The `-1` is useful to have a correct evaluation of the relative abundance of all species (more info [here](https://github.com/motu-tool/mOTUs_v2/wiki/Explain-the-resulting-profile#-1)).
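
The three kinds of rows can be summarised with a small `awk` sketch. It rests on assumptions about the profile layout that you should check against your own file: tab-separated columns, header lines starting with `#`, the abundance in the last column, and the mOTU identifier (containing `ref_mOTU` or `meta_mOTU`) in the first column. The function name `summarize_profile`, the species names, the identifiers and the abundances below are all hypothetical.

```bash
# Hypothetical mini profile, written with printf so the tabs are explicit
demo=$(mktemp)
printf '%s\t%s\n' \
    '#consensus_taxonomy' 'demo_sample' \
    'Species A [ref_mOTU_v25_00001]' '0.5000000000' \
    'Species B [ref_mOTU_v25_00002]' '0.0000000000' \
    'Unknown species [meta_mOTU_v25_12345]' '0.2000000000' \
    '-1' '0.3000000000' > "$demo"

# Count detected (non-zero) ref_mOTU and meta_mOTU rows and report the -1 row
summarize_profile() {
    awk -F '\t' '
    BEGIN { ref = 0; meta = 0; minus_one = 0 }
    /^#/ { next }                           # skip header lines
    $NF + 0 > 0 {
        if ($1 == "-1")              minus_one = $NF
        else if ($1 ~ /ref_mOTU/)    ref++
        else if ($1 ~ /meta_mOTU/)   meta++
    }
    END {
        printf "ref_mOTU\t%d\nmeta_mOTU\t%d\nminus_one\t%s\n", ref, meta, minus_one
    }' "$1"
}

summarize_profile "$demo"
# prints:
# ref_mOTU	1
# meta_mOTU	1
# minus_one	0.3000000000
```

In this made-up example, 30% of the relative abundance belongs to species that are present but cannot be profiled.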
</p>
2. How many species are detected? How many are reference species and how many are unknown species?
<details><summary>SOLUTION</summary>
<p>
We can have a look at the species with a non-zero abundance with `cat test_sample.motus | grep -v "0.0000000000"`:
```
# git tag version 2.5.1 | motus version 2.5.1 | map_tax 2.5.1 | gene database: nr2.5.1 | calc_mgc 2.5.1 -y insert.scaled_counts -l 75 | calc_motu 2.5.1 -k mOTU -C no_CAMI -g 3 | taxonomy: ref_mOTU_2.5.1 meta_mOTU_2.5.1
```
3. Can you change some parameters in `motus` to profile more or fewer species?
<details><summary>SOLUTION</summary>
<p>
The mOTUs wiki page has more information: [link](https://github.com/motu-tool/mOTUs_v2/wiki/Increase-precision-or-recall).
Basically, you can use `-g` and `-l` to increase (or decrease) the specificity of the detected species. Lower values allow you to profile more species (lax thresholds), while higher values correspond to stringent cutoffs (and hence fewer profiled species).
The two profiles are really similar. mOTUs filters the reads internally based on how well they map to the marker gene sequences; hence trimming and filtering the reads beforehand will not affect the profiles much. For other analyses (like building metagenome-assembled genomes), trimming the reads improves the result.