Commit 247d4d56 authored by Hugo Carlos
Upload New File

parent ea9303a7
# Finding similar sequences
Finding similar protein and nucleotide sequences has been one of the
greatest challenges in the age of bioinformatics. In many situations where we
have a piece of DNA or a protein, the very first step towards knowing more about it
involves finding similar pieces of DNA or protein based on sequence similarity.
The basic premise of all DNA sequence similarity searches is quite
straightforward:
- Given a particular nucleotide sequence, find all genes/transcripts with similar sequences in a particular database.
For protein sequences, the premise is quite similar:
- Given a particular amino acid sequence, find all peptides or proteins with similar sequences in a particular database.
The main tool that has been used over the past 20 years for this purpose
is **BLAST**. In this tutorial we are going to look at standard BLAST for
nucleotide and protein sequences, as well as a BLAST variant called PSI-BLAST.
Lastly, we'll take a look at some newer tools based on hidden
Markov models: HMMER and HHPred.
## BLAST
**BLAST** stands for "basic local alignment search tool", and is a basic tool for
searching for local alignments (as you might have guessed from the acronym).
Its standing as one of the most important tools in bioinformatics history
becomes pretty clear when we take a look at the list of the 100 highest cited
publications:
- **12:** "Basic local alignment search tool", Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. *J. Mol. Biol.* 215, 403–410 (1990). -- 38,380 citations
- **14:** "Gapped BLAST and PSI-BLAST: A new generation of protein database search programs", Altschul, S. F. et al. *Nucleic Acids Res.* 25, 3389–3402 (1997). -- 36,410 citations
(source: [The top 100 papers: Nature News & Comment][nature_top_100])
BLAST can be used to find similar nucleotide as well as protein sequences and
rank them from most to least similar based on a similarity score.
As the name implies, BLAST is designed for "local" alignments, i.e. finding
a segment of one sequence that is similar to a segment of another sequence.
Most proteins are modular, and often contain one or more functional domains.
Finding proteins (or genes) that have similar domains gives a good indication
that these proteins share some functional similarity. Proteins (or genes) that
are "locally" similar for their entire sequences are likely to be homologous,
and functionally equivalent.
In 'sequence similarity search' lingo, we typically speak of two things:
- **query:** the sequence we have and want to know more about.
- **subject** or **database:** the set of sequences we want to retrieve similar sequences from.
**BLAST** refers to a suite of programs to generate local alignments and
calculate significance scores.
This includes the two main versions of BLAST we will discuss today:
- **BLASTP**: for proteins
- **BLASTN**: for nucleotides
There are many other BLAST searches: megaBLAST, BLASTX, TBLASTN, PSI-BLAST,
RPSBLAST and DELTA-BLAST (of which we'll also look at PSI-BLAST later).
### BLAST method, in brief
It is worthwhile to have a basic understanding of how a simple BLAST works:
The **query** sequence is split into small fragments, known as **words**:
- protein sequences are split into tri-mers
- nucleotide sequences are split into 11-mers
BLAST then scans the **database** (or subject) to find all words that match
between query and subject sequences:
- For protein sequences: the word pair must exceed a score threshold based on a substitution matrix.
- For nucleotide sequences: the match must be exact.
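The word-splitting and exact-match steps for nucleotides can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLAST is actually implemented: the function names are made up for this example, and the protein case (scoring word pairs against a substitution matrix) is not shown.

```python
def split_into_words(sequence: str, word_size: int) -> list[str]:
    """Return every overlapping window of length word_size."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def find_exact_seeds(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of the subject by its start position(s).
    index: dict[str, list[int]] = {}
    for pos, word in enumerate(split_into_words(subject, word_size)):
        index.setdefault(word, []).append(pos)
    # Look up each query word in that index.
    seeds = []
    for q_pos, word in enumerate(split_into_words(query, word_size)):
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


print(split_into_words("MFHPGMT", 3))             # ['MFH', 'FHP', 'HPG', 'PGM', 'GMT']
print(find_exact_seeds("ACGTAC", "TTACGTA", 4))   # [(0, 2), (1, 3)]
```

A short word size of 4 is used here only to keep the example readable; the nucleotide word size mentioned above is 11.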
When a match is found, BLAST extends it forwards and backwards to produce an
alignment. This extension continues until the score drops more than a given
threshold (the drop-off) below the best score seen so far. The sequence
similarity score is calculated from a score matrix, which defines the
similarity between different amino acids (or nucleotides).
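As a rough sketch of the extension step, the Python function below extends an exact seed to the right without gaps and stops once the running score has dropped too far below the best score seen. The function name and the match, mismatch and drop-off values are illustrative assumptions, not BLAST's actual defaults; extension to the left would mirror the same loop.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 word_size: int, match: int = 1, mismatch: int = -2,
                 x_drop: int = 3) -> tuple[int, int]:
    """Extend a seed rightwards; return (best_score, aligned_length)."""
    score = best = word_size * match   # the seed itself matches exactly
    best_len = word_size
    i = word_size
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break                      # score dropped too far: stop extending
    return best, best_len


# The seed "ACGT" at position 0 extends over three more matching bases.
print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, 4))  # (7, 7)
```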
For proteins, the similarity score is determined by a sequence similarity matrix, such as the BLOSUM62 matrix shown below:
![BLOSUM62 matrix](images/BLOSUM62.png)
(source: [https://commons.wikimedia.org/wiki/File:BLOSUM62.png])
Additionally, a gap penalty is defined: the cost of opening and of extending a gap in the sequence alignment.
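A gap penalty with separate opening and extension costs is often called an affine gap penalty. The sketch below shows the arithmetic, assuming the convention that a gap of length k costs the opening cost plus k times the extension cost. The default values of 11 and 1 are illustrative assumptions (they are commonly paired with BLOSUM62), not values taken from this tutorial.

```python
def gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Cost of a single gap of the given length under an affine penalty."""
    if length == 0:
        return 0
    return gap_open + gap_extend * length


# One long gap is cheaper than several short ones of the same total length.
print(gap_cost(3))      # 14
print(3 * gap_cost(1))  # 36
```

This is why alignments tend to contain a few longer gaps rather than many scattered single-residue gaps.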
### Exercise 1: Protein BLAST-ing BLD10
BLD10 is a protein found in *C. reinhardtii* (a small green alga) that is involved in creating the centrosome/basal body complex. It has some interesting sequence features which make it a nice example for sequence similarity searches, and we'll be using it throughout the rest of this tutorial.
Note: Much of this material has also been adapted from the [NCBI
guide][ncbi_blast_guide], which is an excellent reference for getting started
with BLAST.
#### BLAST input
The BLAST input screen contains the most important components you need to create a BLAST web search.
The top part contains information pertaining to our input sequence, or **query**:
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Protein BLAST" button to start a protein BLAST query.
![Search box](images/searchbox.jpg)
- On the top bar, we see which BLAST program we are in: blastn, blastp, blastx, tblastn, tblastx.
- In the text box, we have the **query** input section where you can enter a nucleotide/protein sequence, an accession number, or a FASTA-formatted entry.
- The "from" and "to" fields allow you to limit the search to a part of the sequence.
- It's also possible to upload a file instead of copy-pasting entries into the search box, using the 'upload file' button.
- Optionally, you can give your job a name (to find it again more quickly later).
- **Step 2** In the Search text box, enter the accession for BLD10: **EDP06416.1**
![Search set](images/searchset.jpg)
In the "search set" section, we choose which **subject** or **database** we
want to search in. There are many options to choose from.
For proteins, the most useful/common options are:
- **nr**: A non-redundant database of protein sequences from GenBank, RefSeq, SwissProt, PDB and more.
- **refseq_protein**: Protein sequences from the NCBI Reference Sequence dataset.
- **swissprot**: The Uniprot/SWISS-PROT database.
In the "search set" section we can also choose to limit our search to a single species (for example, *Homo sapiens*) or a taxonomic group (e.g. fungi).
- **Step 3** Make sure we have the **nr** database selected as our **subject**.
![Program selection](images/programselection.jpg)
In the "program selection" box, we select which type of BLAST we wish to run
(the one shown above is for protein sequences). To run a (standard) BLASTP, leave "blastp" selected and click the "BLAST" button.
- **Step 4** Make sure we have **blastp** selected as our program.
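The same choices made on the web form (query, database, program) can also be written down as request parameters. The sketch below only builds the parameter string and does not contact NCBI. The endpoint and the parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) are assumptions based on NCBI's public BLAST URL API and should be checked against its current documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API (not taken from this tutorial).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(query: str, program: str = "blastp",
                        database: str = "nr") -> str:
    """Encode the search settings from Steps 2 to 4 as a request body."""
    return urlencode({
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # Step 4: blastp
        "DATABASE": database,  # Step 3: nr
        "QUERY": query,        # Step 2: accession or sequence
    })


print(build_submit_params("EDP06416.1"))
# CMD=Put&PROGRAM=blastp&DATABASE=nr&QUERY=EDP06416.1
```

Sending this body to the endpoint would start a search, so any automated use should follow NCBI's usage guidelines.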
#### BLAST output
The top panel contains useful information about your search:
![Results Top](images/toptop.png)
The most important of these is the **RID** (request ID), which you can use to
retrieve your results later (it typically remains valid for about a day or two).
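As a small illustration of how an RID is used, the sketch below builds a results URL from one. The endpoint and the parameter names (`CMD`, `RID`, `FORMAT_TYPE`) are assumptions based on NCBI's public BLAST URL API, and `EXAMPLE0RID` is a made-up placeholder, not a real request ID.

```python
from urllib.parse import urlencode


def build_result_url(rid: str, format_type: str = "Text") -> str:
    """Build a URL that asks for the results of an earlier search."""
    base = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # assumed endpoint
    return base + "?" + urlencode({
        "CMD": "Get",                # retrieve results rather than submit
        "RID": rid,                  # the request ID shown on the results page
        "FORMAT_TYPE": format_type,  # e.g. Text or HTML
    })


print(build_result_url("EXAMPLE0RID"))
# https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=EXAMPLE0RID&FORMAT_TYPE=Text
```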
The next panel contains a graphical summary of the result:
![Graphical Summary](images/graphicsummary.png)
The top part of the "graphical summary" contains any domains identified in the
sequence (such as Pfam and InterPro domains). You can click on these for more
information. The sequences retrieved are typically called **hits**.
The bottom part of the "graphical summary" part has the more interesting
results regarding your search: A graphical representation of the top scoring hits.
Mouse over the hit for a description, or click on one to scroll down to
the hit details.
- **Question 1** Were there any homologues detected in Humans?
- **Question 2** What is the "evolutionary extent" of this protein in other species? (hint: the button "taxonomic reports" gives a nice overview)
The next part has information on all of the "hits" found:
![Descriptions](images/descriptions.png)
The descriptions section shows the hits resulting from the search, as well as
some of the scores. Of the scores, the most important one is the e-value:
- **e-value** The number of different alignments with scores equivalent to or better than the observed score that are expected to occur in a database search by chance. The smaller the e-value, the less likely it is that the hit arose by chance, and the more likely the detected protein is a homologue. A rule-of-thumb threshold of *0.0001* is often used to infer homology.
- **Question 3** Do the detected proteins appear to be true homologues to BLD10?
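To get a feel for how the e-value behaves, it helps to see how it relates to the bit score reported next to it. The sketch below uses the standard Karlin-Altschul approximation, E = m · n · 2^(−S'), where S' is the bit score and m and n are the query and database lengths. This formula is not given in this tutorial, real BLAST applies additional length corrections, and the lengths used here are hypothetical.

```python
def expected_hits(bit_score: float, query_length: int,
                  database_length: int) -> float:
    """Approximate e-value for a given bit score and search space size."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 1,000-residue query against 1,000,000 residues.
for bits in (30, 40, 50):
    print(bits, expected_hits(bits, 1_000, 1_000_000))
```

Two things follow from the formula: every extra bit of score halves the e-value, and the same alignment gets a larger (worse) e-value when the database is bigger.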
![Alignments](images/alignments.png)
The last part of the results page contains the individual alignments detected
using the BLAST search.
### Exercise 2: *C. elegans* protein BLAST
This exercise was adapted from an [NCBI Blast course][ncbi_blast_course].
The following is the product of a gene in *C. elegans* which resulted from some (hypothetical) experiment we performed.
```fasta
>C.elegans protein
MFHPGMTSQPSTSNQMYYDPLYGAEQIVQCNPMDYHQANILCGMQYFNNSHNRYPLLPQMPPQFTNDHPYDFPNVPTISTLDEASSFNGFLIPSQPSSYNNNNISCVFTPTPCTSSQASSQPPPTPTVNPTPIPPNAGAVLTTAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITAVTSNGKQPTGCVTIQRSLDGRLQVAGRKGVPHVVYARIWRWPKVSKNELVKLVQCQTSSDHPDNICINPYHYERVVSNRITSADQSLHVENSPMKSEYLGDAGVIDSCSDWPNTPPDNNFNGGFAPDQPQLVTPIISDIPIDLNQIYVPTPPQLLDNWCSIIYYELDTPIGETFKVSARDHGKVIVDGGMDPHGENEGRLCLGALSNVHRTEASEKARIHIGRGVELTAHADGNISITSNCKIFVRSGYLDYTHGSEYSSKAHRFTPNESSFTVFDIRWAYMQMLRRSRSSNEAVRAQAAAVAGYAPMSVMPAIMPDSGVDRMRRDFCTIAISFVKAWGDVYQRKTIKETPCWIEVTLHRPLQILDQLLKNSSQFGSS
```
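FASTA entries like the one above consist of a header line starting with `>` followed by one or more lines of sequence. If you want to check an entry before pasting it into a search box, a minimal reader could look like the sketch below. The function name is made up for this example, and the sequence used is only the first 20 residues of the entry above.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            name = line[1:]          # header without the leading '>'
            records[name] = ""
        elif name is not None:
            records[name] += line    # sequence may span several lines
    return records


# Truncated example: only the start of the C. elegans sequence.
example = ">C.elegans protein\nMFHPGMTSQP\nSTSNQMYYDP\n"
for header, sequence in read_fasta(example).items():
    print(header, len(sequence))     # C.elegans protein 20
```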
#### Questions
Use BLAST to answer the following questions:
- **Question 1** Which *C. elegans* protein is this one most likely to be?
- **Question 2** What is its closest homologue in another species?
- **Question 3** What is its homologue in *D. melanogaster*? (Hint: You can create another BLAST query for this).
### Exercise 3: Nucleotide BLAST
BLAST can also be used to perform searches on DNA (or transcripts).
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Nucleotide BLAST" button to start a nucleotide BLAST query.
#### BLAST input
BLAST input for nucleotide sequences is very similar to the Protein BLAST input shown previously, with a few minor differences.
The most common options for the "search set" are:
- **nr**: A non-redundant set of all GenBank + EMBL + DDBJ + PDB sequences.
- **refseq_genomic**: Nucleotide genomic sequences from the NCBI Reference Sequence Project.
- **Human G+T**: The genomic sequences plus curated and predicted RNAs from the current build of the human genome.
There are also a handful of different programs for nucleotide BLAST, the most useful of which are:
- **blastn** is a general-purpose search to align tRNA, mRNA, rRNA or genomic DNA sequences with coding and non-coding regions.
- **megablast** is a search for highly similar sequences (and is much faster).
#### Questions
Let's assume that the DNA sequence below belongs to an RNAi sequence we have
found in our lab, but we have no idea which transcript it is supposed to
target. However, we know that it's from a human cell line.
```fasta
>LostRNAi
GGTATTTGTTTTGATGCAAACCTCAATCCCTCCCCTTCTTTGAATGGTGTGCCCCACCCCGCGGGTCGCCTGCAACCTAGGCGGACGCTACCATGGCGTGAGACAGGGAGGGAAAGAAGTGTGCAGAAGGCAAGCCCGGAGGTATTTTCAAGAATGAGTATATCTCATCTTCCCGGAGGAAAAAAAAAAAGAATGGGTACGTCTGAGAATCAAATTTTGAAAGAGTGCAATGATGGGTCGTTTGATAATTTGTCGGAAAAACAATCTACCTGTTATCTAGCTTTGGGCTAGGCCATTCCAGTTCCAGACGCAGGCTGAACGTCGTGAAGCGGAAGGGGCGGGCCCGCAGGCGTCCGTGTGGTCCTCCGTGCAGCCCTCCGGCCCGAGCCGGTTCTTCCTGGTAGGAGGCGGAACTCGAATTCATTTCTCCCGCTGCCCCATCTCTTAGCTCGCGGTTGTTTCATTCCGCAGTTTCTTCCCATGCACCTGCCGCGTACCGGCCACTTTGTGCCGTACTTACGTCATCTTTTTCCTAAATCGAGGTGGCATTTACACACAGCGCCAGTGCACACAGCAAGTGCACAGGAAGATGAGTTTTGGCCCCTAACCGCTCCGTGATGCCTACCAAGTCACAGACCCTTTTCATCGTCCCAGAAACGTTTCATCACGTCTCTTCCCAGTCGATTCCCGACCCCACCTTTATTTTGATCTCCATAACCATTTTGCCTGTTGGAGAACTTCATATAGAATGGAATCAGGCTGGGCGCTGTGGCTCACGCCTGCACTTTGGGAGGCCGAGGCGGGCGGATTACTTGAGGATAGGAGTTCCAGACCAGCGTGGCCAACGTGGTGAATCCCCGTCTCTACTAAAAAATACAAAAATTAGCTGGGCGTGGTGGGTGCCTGTAATCCCAGCTATTCGGGAGGGTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGTGCCACTACACTCCAGCCTGGGCGACAAGAACGAAACTCCGTCTCAAAAAAAAGGGGGGAATCATACATTATGTGCTCATTTTTGTCGGGCTTCTGTCCTTCAATGTACTGTCTGACATTCGTTCATGTTGTATATATCAGTATTTTGCTC
```
- **Question 1** Use BLASTN to find which transcript this sequence most likely
targets. (Hint: Use **megablast** for a highly similar sequence search, and
search the **Human G+T** set, i.e. the human genome plus predicted RNAs.)
## PSI-BLAST
Position-Specific Iterative BLAST (PSI-BLAST) is an extension of BLAST
that uses iteration to create a Position-Specific Score Matrix (PSSM) for each query.
![PSI BLAST Top](images/psiblast.png)
Because it uses information about sequence diversity and conservation, it's much better at detecting remote homologues.
PSI-BLAST can be run from the same page as a normal BLAST by selecting "PSI-BLAST" in the "Program selection" section.
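To make the idea of a PSSM more concrete, the sketch below builds a simple one from a few aligned sequences: for every alignment column it counts the residues and converts the counts to log-odds scores. This is a heavily simplified illustration and not PSI-BLAST's actual procedure. It assumes a uniform background frequency and a flat pseudocount, and it leaves out sequence weighting and the iteration itself; the toy alignment is made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    pssm = []
    for column in zip(*aligned):          # walk the alignment column by column
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Toy alignment: column 1 is fully conserved, column 2 prefers K over R.
pssm = build_pssm(["MKV", "MRV", "MKI"])
print(round(pssm[0]["M"], 2), round(pssm[0]["A"], 2))
print(round(pssm[1]["K"], 2), round(pssm[1]["R"], 2))
```

Unlike a single matrix such as BLOSUM62, the score for a residue now depends on its position: a conserved position rewards its residue strongly and penalises everything else.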
## HMMER
HMMER is a sequence similarity search tool that uses Hidden Markov Models
(HMMs) to search for similar nucleotide or protein sequences. In practice it
is used much like BLAST, although it is more sensitive, especially when we
use a multiple sequence alignment as input instead of just a single sequence.
Its main strength is in detecting remote homologues.
### Exercise 4: HMMER BLD10
As we saw earlier, standard BLAST is not able to detect a human homologue of the *C. reinhardtii* protein BLD10. In this short exercise we'll use HMMER to see if this more sensitive tool can detect it.
#### HMMER input
Navigate to the HMMER search page: [http://www.ebi.ac.uk/Tools/hmmer/]
HMMER takes as input a single nucleotide or protein sequence (accession numbers or identifiers are not accepted).
- **Step 1** Retrieve the amino acid sequence corresponding to the *C. reinhardtii* protein with accession EDP06416.1 from the NCBI database, and copy-paste it into the HMMER search box.
- **Step 2** Make sure the **Reference proteomes** is selected as the **sequence database** to search in. (It should already be selected).
- **Step 3** Hit **Search** to start searching.
#### HMMER output
The HMMER output page contains all of the information on the hits HMMER detected across the reference proteomes database.
- **Question 1** What is the evolutionary range of the significant hits detected by HMMER? Is this a plant-specific protein, a eukaryotic one, or does it exist in all major branches of life?
- **Question 2** Does BLD10 have a homologue in humans, and if so, what is it called?
## HHPred
**HHPred** is part of the **HH-suite** software package, and also uses Hidden
Markov Models to perform a more sensitive search than standard BLAST. HHPred
represents both the input **query** and the **subject** sequences as HMMs (thus
the "HH" in HHPred).
Like HMMER, HHPred is among the faster and more sensitive tools that
currently exist.
### Exercise 5: HHPred
Find a protein sequence of interest to you, or alternatively use one from this tutorial. Let's use HHPred to find similar sequences across the set of known species.
- **Step 1** Navigate to the HHPred search homepage: [https://toolkit.tuebingen.mpg.de/hhpred]
- **Question 1** Does the search for similar sequences using HHPred differ from one using BLAST alone?
## Take home messages
There are many different ways to search for similar protein and nucleotide sequences. Although tools like HMMER and HHPred are very good at finding homologous genes across large evolutionary distances, this does not necessarily mean they are the best at everything. BLAST (and all of its variants) is a simple and easy-to-use tool which may simply be better for the job at hand.
[ncbi_blast_guide]: https://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf
[nature_top_100]: http://www.nature.com/news/the-top-100-papers-1.16224
[ncbi_blast_course]: https://www.ncbi.nlm.nih.gov/Class/BLAST/blast_course.short.html