# Finding similar sequences
Finding similar protein and nucleotide sequences has been one of the
greatest challenges in the age of bioinformatics. In many situations where we
have a piece of DNA or a protein, the very first step towards knowing more about it
involves finding similar pieces of DNA or protein based on sequence similarity.
The basic premise of all DNA sequence similarity searches is quite
straightforward:
- Given a particular nucleotide sequence, find all genes/transcripts with similar sequences in a particular database.
For protein sequences, the premise is much the same:
- Given a particular amino acid sequence, find all peptides or proteins with similar sequences in a particular database.
The main tool which has been used over the past 20 years for this purpose
has been **BLAST**. In this tutorial we are going to look at standard BLAST for
nucleotide and protein sequences, as well as a BLAST variant called PSI-BLAST.
Lastly we'll take a look at some newer tools that have emerged based on hidden
Markov models: HMMER and HHPred.
## BLAST
**BLAST** stands for "basic local alignment search tool", and is a basic tool for
searching for local alignments (as you might have guessed from the acronym).
Its standing as one of the most important tools in bioinformatics history
becomes pretty clear when we take a look at the list of the 100 most highly cited
publications:
- **12:** "Basic local alignment search tool", Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. *J. Mol. Biol.* 215, 403–410 (1990). -- 38,380 citations
- **14:** "Gapped BLAST and PSI-BLAST: A new generation of protein database search programs", Altschul, S. F. et al. *Nucleic Acids Res.* 25, 3389–3402 (1997). -- 36,410 citations
(source: [The top 100 papers: Nature News & Comment][nature_top_100])
BLAST can be used to find similar nucleotide as well as protein sequences and
rank them from most to least similar based on a similarity score.
As the name implies, BLAST is designed for "local" alignments, i.e.: finding
segments of a sequence that are similar to another segment of another sequence.
Most proteins are modular, and often contain one or more functional domains.
Finding proteins (or genes) that have similar domains gives a good indication
that these proteins share some functional similarity. Proteins (or genes) that
are "locally" similar along their entire sequences are likely to be homologous,
and often functionally equivalent.
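
To make the idea of a "local" alignment concrete, here is a minimal sketch of the classic Smith-Waterman scoring recurrence, which finds the best-scoring pair of segments exactly. BLAST itself is a faster heuristic and does not run this full computation; the match, mismatch and gap values below are illustrative assumptions, not BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score between a and b.

    Scoring values are toy assumptions chosen for illustration only.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere: this is what
            # makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment GATTACA (7 matches x 2) dominates the score,
# even though the flanking regions are unrelated.
print(local_alignment_score("TTTGATTACATTT", "CCGATTACACC"))  # 14
```

Because the score is clamped at zero, unrelated flanking sequence costs nothing, which is why two modular proteins sharing a single domain can still produce a strong local hit.
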
In 'sequence similarity search' lingo, we typically speak of two things:
- **query:** the sequence we have and want to know more about.
- **subject** or **database:** the set of sequences we want to retrieve similar sequences from.
**BLAST** refers to a suite of programs to generate local alignments and
calculate significance scores.
This includes the two main versions of BLAST we will discuss today:
- **BLASTP**: for protein sequences
- **BLASTN**: for nucleotide sequences
There are many other BLAST programs: megaBLAST, BLASTX, TBLASTN, PSI-BLAST,
RPSBLAST and DELTA-BLAST (of which we'll look at PSI-BLAST later).
### BLAST method, in brief
It is worthwhile to have a basic understanding of how a simple BLAST search works.
The **query** sequence is split into small fragments, known as **words**:
- protein sequences are split into 3-mers
- nucleotide sequences are split into 11-mers
BLAST then scans the **database** (or subject) to find all words that match
between query and subject sequences:
- for protein sequences: the word pair must exceed a score threshold based on a substitution matrix
- for nucleotide sequences: the match must be exact
When a match is found, BLAST extends it forwards and backwards to produce an
alignment. This extension continues until the running score drops more than a
given amount (the dropoff) below the best score seen so far. The sequence
similarity score is calculated from a score matrix, which defines the
similarity between different amino acids (or nucleotides).
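
The word-matching and extension steps above can be sketched in a few lines. This is a simplified illustration of the exact-match (nucleotide) case only: it uses ungapped extension and toy match/mismatch/dropoff values that are assumptions for the example, and the function names are hypothetical rather than part of any BLAST package.

```python
def words(seq, k):
    """Yield (offset, word) for every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def extend(query, subject, q, s, step, match, mismatch, dropoff):
    """Extend without gaps from (q, s) in one direction (step is +1 or -1).

    Returns (best_score, length) of the best-scoring extension found before
    the running score fell more than `dropoff` below the best score.
    """
    score = best = length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > dropoff:
            break
        q += step
        s += step
    return best, best_length


def seed_and_extend(query, subject, k=11, match=1, mismatch=-2, dropoff=3):
    """Return sorted (score, query_start, subject_start, length) hits."""
    # Index every word of the subject so query words can be looked up quickly.
    index = {}
    for s, word in words(subject, k):
        index.setdefault(word, []).append(s)
    hits = set()  # a set, because neighbouring seeds often extend to the same hit
    for q, word in words(query, k):
        for s in index.get(word, []):
            right, r_len = extend(query, subject, q + k, s + k, 1,
                                  match, mismatch, dropoff)
            left, l_len = extend(query, subject, q - 1, s - 1, -1,
                                 match, mismatch, dropoff)
            score = k * match + left + right
            hits.add((score, q - l_len, s - l_len, k + l_len + r_len))
    return sorted(hits, reverse=True)


# A short word size (k=4) is used here only so the toy sequences produce seeds.
print(seed_and_extend("TTACGTTGCATT", "GGGACGTTGCAGGG", k=4))  # [(8, 2, 3, 8)]
```

All five seeds inside the shared segment `ACGTTGCA` extend to the same 8-residue hit, which is why the result contains a single entry. For proteins the lookup step would instead accept any word pair scoring above a threshold in the substitution matrix.
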
For proteins, the similarity score is determined by a substitution matrix, such as the BLOSUM62 matrix shown below:
![BLOSUM62 matrix](images/BLOSUM62.png)
(source: [https://commons.wikimedia.org/wiki/File:BLOSUM62.png])
Additionally, a gap penalty is defined: the cost of opening and extending a gap in the sequence alignment.
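
As a rough illustration of how opening and extension costs combine, the sketch below scores an already-aligned pair of sequences. It assumes a simple convention (a gap of length n costs `gap_open + n * gap_extend`) and toy match/mismatch values in place of a real substitution matrix such as BLOSUM62; all parameter values are assumptions for the example.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Score two aligned rows ('-' marks a gap) with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_side = None  # which row currently has an open gap: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != gap_side:
                score -= gap_open  # paid once, when a new gap starts
                gap_side = side
            score -= gap_extend    # paid for every gapped position
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


# One gap of length 2: 8 matches (16) minus opening (5) minus 2 extensions (2).
print(score_alignment("ACGT--ACGT", "ACGTTTACGT"))  # 9
# Two separate gaps of length 1 each pay the opening cost again.
print(score_alignment("A-C-G", "ATCTG"))  # -6
```

Making the opening cost larger than the extension cost favours a few long gaps over many short ones, which matches how insertions and deletions tend to appear in real sequences.
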
### Exercise 1: Protein BLAST-ing BLD10
BLD10 is a protein found in *C. reinhardtii* (a small green alga) involved in creating the centrosome/basal body complex. It has some interesting sequence features which make it a nice example for sequence similarity searches, and we'll be using it throughout the rest of this tutorial.
Note: Much of this material has also been adapted from the [NCBI
guide][ncbi_blast_guide], which is an excellent reference for getting started
with BLAST.
#### BLAST input
The BLAST input screen contains the most important components you need to create a BLAST web search.
The top part contains information pertaining to our input sequence, or **query**:
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Protein BLAST" button to start a protein BLAST query.
![Search box](images/searchbox.jpg)
- On the top bar, we see which BLAST program we are in: blastn, blastp, blastx, tblastn, tblastx.
- In the text box, we have the **query** input section where you can enter a nucleotide/protein sequence, accession number, or FASTA-formatted entry.
- The "from" and "to" fields allow you to limit the search to a part of the sequence.
- It's also possible to upload a file instead of copy-pasting entries into the search box using the 'upload file' button.
- Optionally, you can give your job a name (to find it again more quickly later).
- **Step 2** In the Search text box, enter the accession for BLD10: **EDP06416.1**
![Search set](images/searchset.jpg)
In the "search set" section, we choose which **subject** or **database** we
want to search in. There are many options to choose from.
For proteins, the most useful/common options are:
- **nr**: A non-redundant database of protein sequences from GenBank, RefSeq, SwissProt, PDB and more.
- **refseq_protein**: Protein sequences from the NCBI Reference Sequence dataset.
- **swissprot**: The UniProt/Swiss-Prot database.
In the "search set" we can also choose to limit our search to a single species (for example, "Homo sapiens"), or a taxonomic group (e.g. "fungi").
- **Step 3** Make sure we have the **nr** database selected as our **subject**.
![Program selection](images/programselection.jpg)
In the "program selection" box, we select which type of BLAST we wish to run
(the one shown above is for protein sequences). To run a (standard) BLASTP, leave "blastp" selected and click the "BLAST" button.
- **Step 4** Make sure **blastp** is selected as our program.
#### BLAST output
The top panel contains useful information about your search:
![Results Top](images/toptop.png)
The most important of these is the **RID** (result ID), which you can use to
retrieve your results (and is typically valid for a few days).
The next panel contains a graphical summary of the result:
![Graphical Summary](images/graphicsummary.png)
The top part of the "graphical summary" contains any domains identified in the
sequence (such as Pfam and InterPro domains). You can click on these for more
information. The sequences retrieved are typically called **hits**.
The bottom part of the "graphical summary" part has the more interesting
results regarding your search: A graphical representation of the top scoring hits.
Mouse over the hit for a description, or click on one to scroll down to
the hit details.
- **Question 1** Were there any homologues detected in humans?
- **Question 2** What is the "evolutionary extent" of this protein in other species? (hint: the button "taxonomic reports" gives a nice overview)
The next part has information on all of the "hits" found:
![Descriptions](images/descriptions.png)
The descriptions section shows the hits resulting from the search, as well as
some of the scores. Of the scores, the most important one is the e-value:
- **e-value** The number of different alignments with scores equivalent to or better than the observed score that are expected to occur in a database search by chance. The smaller the e-value, the more likely the detected protein is a homologue. A rule-of-thumb threshold of *0.0001* is often used to infer homology.
- **Question 3** Do the detected proteins appear to be true homologues to BLD10?
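
To get a feel for how the e-value behaves, the sketch below uses the standard relationship between a normalised bit score S' and the expected number of chance hits, E = m · n · 2^(-S'), where m is the query length and n the total database length. BLAST applies additional corrections (effective lengths) before reporting, so the numbers on a results page will differ somewhat; the lengths used here are made-up values for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least `bit_score`."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 1,000-residue query against 1,000,000 database residues.
for bits in (30, 40, 50):
    print(bits, expected_hits(bits, 1_000, 1_000_000))
```

Two things follow directly from the formula: every extra bit of score halves the e-value, and searching a database twice as large doubles it. The same alignment can therefore look more significant in a small database than in **nr**.
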
![Alignments](images/alignments.png)
The last part of the results page contains the individual alignments detected
using the BLAST search.
### Exercise 2: C. elegans protein BLAST
This exercise was adapted from an [NCBI Blast course][ncbi_blast_course].
The following is the product of a gene in *C. elegans* which resulted from some (hypothetical) experiment we performed.
```fasta
>C.elegans protein
MFHPGMTSQPSTSNQMYYDPLYGAEQIVQCNPMDYHQANILCGMQYFNNSHNRYPLLPQMPPQFTNDHPYDFPNVPTISTLDEASSFNGFLIPSQPSSYNNNNISCVFTPTPCTSSQASSQPPPTPTVNPTPIPPNAGAVLTTAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITAVTSNGKQPTGCVTIQRSLDGRLQVAGRKGVPHVVYARIWRWPKVSKNELVKLVQCQTSSDHPDNICINPYHYERVVSNRITSADQSLHVENSPMKSEYLGDAGVIDSCSDWPNTPPDNNFNGGFAPDQPQLVTPIISDIPIDLNQIYVPTPPQLLDNWCSIIYYELDTPIGETFKVSARDHGKVIVDGGMDPHGENEGRLCLGALSNVHRTEASEKARIHIGRGVELTAHADGNISITSNCKIFVRSGYLDYTHGSEYSSKAHRFTPNESSFTVFDIRWAYMQMLRRSRSSNEAVRAQAAAVAGYAPMSVMPAIMPDSGVDRMRRDFCTIAISFVKAWGDVYQRKTIKETPCWIEVTLHRPLQILDQLLKNSSQFGSS
```
#### Questions
Use BLAST to answer the following questions:
- **Question 1** Which *C. elegans* protein is this one most likely to be?
- **Question 2** What is its closest homologue in another species?
- **Question 3** What is its homologue in *D. melanogaster*? (Hint: You can create another BLAST query for this).
### Exercise 3: Nucleotide BLAST
BLAST can also be used to perform searches on DNA (or transcripts).
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Nucleotide BLAST" button to start a nucleotide BLAST query.
#### BLAST input
BLAST input for nucleotide sequences is very similar to the Protein BLAST shown previously, with a few minor differences.
The most common options for the "search set" are:
- **nr/nt**: The nucleotide collection, a set of GenBank + EMBL + DDBJ + PDB sequences (listed as "nr/nt" in the database menu).
- **refseq_genomic**: Nucleotide genomic sequences from the NCBI Reference Sequence Project.
- **Human G+T**: The genomic sequences plus curated and predicted RNAs from the current build of the human genome.
There are also a handful of different programs for nucleotide BLAST, the most useful of which are:
- **blastn** is a general-purpose search to align tRNA, mRNA, rRNA or genomic DNA sequences with coding and non-coding regions
- **megablast** is a search for highly similar sequences (and is much faster)
#### Questions
Let's assume that the DNA sequence below belongs to an RNAi sequence we have
found in our lab, but we have no idea which transcript it is supposed to
target. However, we know that it's from a human cell line.
```
>LostRNAi
GGTATTTGTTTTGATGCAAACCTCAATCCCTCCCCTTCTTTGAATGGTGTGCCCCACCCCGCGGGTCGCCTGCAACCTAGGCGGACGCTACCATGGCGTGAGACAGGGAGGGAAAGAAGTGTGCAGAAGGCAAGCCCGGAGGTATTTTCAAGAATGAGTATATCTCATCTTCCCGGAGGAAAAAAAAAAAGAATGGGTACGTCTGAGAATCAAATTTTGAAAGAGTGCAATGATGGGTCGTTTGATAATTTGTCGGAAAAACAATCTACCTGTTATCTAGCTTTGGGCTAGGCCATTCCAGTTCCAGACGCAGGCTGAACGTCGTGAAGCGGAAGGGGCGGGCCCGCAGGCGTCCGTGTGGTCCTCCGTGCAGCCCTCCGGCCCGAGCCGGTTCTTCCTGGTAGGAGGCGGAACTCGAATTCATTTCTCCCGCTGCCCCATCTCTTAGCTCGCGGTTGTTTCATTCCGCAGTTTCTTCCCATGCACCTGCCGCGTACCGGCCACTTTGTGCCGTACTTACGTCATCTTTTTCCTAAATCGAGGTGGCATTTACACACAGCGCCAGTGCACACAGCAAGTGCACAGGAAGATGAGTTTTGGCCCCTAACCGCTCCGTGATGCCTACCAAGTCACAGACCCTTTTCATCGTCCCAGAAACGTTTCATCACGTCTCTTCCCAGTCGATTCCCGACCCCACCTTTATTTTGATCTCCATAACCATTTTGCCTGTTGGAGAACTTCATATAGAATGGAATCAGGCTGGGCGCTGTGGCTCACGCCTGCACTTTGGGAGGCCGAGGCGGGCGGATTACTTGAGGATAGGAGTTCCAGACCAGCGTGGCCAACGTGGTGAATCCCCGTCTCTACTAAAAAATACAAAAATTAGCTGGGCGTGGTGGGTGCCTGTAATCCCAGCTATTCGGGAGGGTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGTGCCACTACACTCCAGCCTGGGCGACAAGAACGAAACTCCGTCTCAAAAAAAAGGGGGGAATCATACATTATGTGCTCATTTTTGTCGGGCTTCTGTCCTTCAATGTACTGTCTGACATTCGTTCATGTTGTATATATCAGTATTTTGCTC
```
- **Question 1** Use BLASTN to find which transcript this sequence most likely
targets. (Hint: Use **megablast** for highly similar sequence search, and
search the Human genome + predicted RNAs)
## PSI-BLAST
Position-Specific Iterative BLAST (PSI-BLAST) is an extension of BLAST
which uses iteration to create a Position-Specific Scoring Matrix (PSSM) for each query.
In each round, the significant hits are aligned to the query, the PSSM is rebuilt from that alignment, and the search is repeated with the new PSSM.
![PSI BLAST Top](images/psiblast.png)
Because it uses information about sequence diversity and conservation, it is much better at detecting remote homologues.
PSI-BLAST can be run from the same page as a normal BLAST by selecting "PSI-BLAST" in the "Program selection" section.
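
The idea behind a PSSM can be sketched as follows: for each alignment column, count how often each amino acid occurs and convert that into a log-odds score against a background frequency. This is a deliberately simplified, single-round illustration. It assumes a uniform background and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and more careful pseudocounts; the toy alignment is invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from aligned sequences."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gap characters and unknown residues are simply skipped.
        residues = [row[col] for row in alignment if row[col] in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Score an ungapped sequence of the same length against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


toy_alignment = ["MKV", "MRV", "MKI", "MKV"]  # hypothetical hits from one round
pssm = build_pssm(toy_alignment)
for candidate in ("MKV", "MRI", "AAA"):
    print(candidate, round(score(pssm, candidate), 2))
```

Unlike a fixed substitution matrix, the same residue gets a different score at each position: a conserved column (the `M` here) is strongly rewarded or penalised, while a variable column is more forgiving. That position-specific information is what helps with remote homologues.
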
## HMMER
HMMER is a sequence similarity search tool that uses Hidden Markov Models
(HMMs) to search for similar nucleotide or protein sequences. In practice it
is used much like BLAST, although it is more sensitive, especially when we
use a multiple sequence alignment as input instead of just a single sequence.
Its main strength is in detecting remote homologues.
### Exercise 3: HMMER BLD10
As we saw earlier, standard BLAST is not able to detect a human homologue of the *C. reinhardtii* protein BLD10. In this short exercise we'll use HMMER to see if this more sensitive tool can detect it.
#### HMMER input
Navigate to the HMMER search page: [http://www.ebi.ac.uk/Tools/hmmer/]
HMMER takes as input a single nucleotide or protein sequence (accession numbers or identifiers are not accepted).
- **Step 1** Retrieve the amino acid sequence corresponding to the *C. reinhardtii* protein with accession EDP06416.1 from the NCBI database, and copy-paste it into the HMMER searchbox.
- **Step 2** Make sure the **Reference proteomes** is selected as the **sequence database** to search in. (It should already be selected).
- **Step 3** Hit **Search** to start searching.
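
If you prefer to retrieve the sequence for Step 1 from a script rather than through the browser, the sketch below shows one way to do it using the NCBI E-utilities `efetch` endpoint. The helper names are hypothetical, the script needs network access, and NCBI asks users to keep request rates low, so treat this as an optional convenience rather than part of the exercise.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the efetch URL for one protein record in FASTA format."""
    params = {"db": "protein", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line[1:]
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def fetch_protein(accession):
    """Download one protein record from NCBI (requires network access)."""
    with urlopen(efetch_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Example (not run automatically because it contacts NCBI):
# header, sequence = fetch_protein("EDP06416.1")
# print(header, len(sequence))
```

The sequence returned by `parse_fasta` has its line breaks removed, so it can be pasted straight into the HMMER search box.
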
#### HMMER output
The HMMER output page contains all of the information on HMMER's detected hits across the reference sequence database.
- **Question 1** What is the evolutionary range of the significant hits detected by HMMER? Is this a plant-specific protein, eukaryotic, or does it exist in all major branches of life?
- **Question 2** Does BLD10 have a homologue in humans, and if so, what is it called?
## HHPred
**HHPred** is part of the **HH-suite** software package, and also uses Hidden
Markov Models to perform a more sensitive search than standard BLAST. HHPred
represents both the input **query** and **subject** sequences as HMMs (thus
the "HH" in HHPred).
Like HMMER, HHPred is one of the faster and more sensitive tools that
currently exist.
### Exercise 4: HHPred
Find a protein sequence of interest to you, or alternatively use one from this tutorial. Let's use HHPred to find similar sequences across the set of known species.
- **Step 1** Navigate to the HHPred search homepage: [https://toolkit.tuebingen.mpg.de/hhpred]
- **Question 1** Does the search for similar sequences using HHPred differ from a search using BLAST alone?
## Take home messages
There are many different ways to search for similar protein and nucleotide sequences. Although tools like HMMER and HHPred are very good at finding homologous genes across large evolutionary distances, this does not necessarily mean they are the best at everything. BLAST (and all of its variants) is simple and easy to use, and may simply be better for the job at hand.
[ncbi_blast_guide]: https://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf
[nature_top_100]: http://www.nature.com/news/the-top-100-papers-1.16224
[ncbi_blast_course]: https://www.ncbi.nlm.nih.gov/Class/BLAST/blast_course.short.html
## Computational structure prediction of proteins
### Secondary and tertiary structure prediction
**Tool used in this course**: [Phyre2](http://www.sbg.bio.ic.ac.uk/phyre2/index.cgi)
- Phyre2 is a collection of tools to predict and analyze protein structure, function and mutations.
- It uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyse the effect of amino acid variants for a user's protein sequence.
![How does phyre work?](http://www.nature.com/nprot/journal/v10/n6/images_article/nprot.2015.053-F1.jpg)
1. A database of all the known structures is created
1. For each sequence in this database, PSI-BLAST is carried out to identify homologs
1. An HMM is created using all the homologs
1. The above steps are followed for all the sequences of the known structures, which creates a structure database
1. The query protein is subjected to PSI-BLAST and an HMM is created
1. The HMM of the query is compared to the HMMs in the database
1. The fragments of matching structures are assembled into a predicted structure
- It analyses the query sequence and generates a result page with a comprehensive analysis, called Phyre Investigator.
* We will look at the different fields of the result using working examples listed below.
**Caution**
- Any structure prediction takes a long time, so do not re-analyze any of the examples given below (use the provided results to save time)
- You can submit your own proteins as queries and you should have the prediction results by the end of the day (or tomorrow)
- You should always save the results (because who would want to wait another day to analyze the same proteins)
#### [Exercises taken from Phyre2 server](http://www.sbg.bio.ic.ac.uk/phyre2/workshops/2016/EBI/worked_examples.html)
##### Working with example to understand Phyre Investigator
For help viewing your PDB file off-line, please see the FAQ here:
http://www.sbg.bio.ic.ac.uk/phyre2/html/help.cgi?id=help/faq
###### Exercise-1
**Example using Human Globin**
````
>Globin_example
SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTLFADNQETIGYFKRLGNVSQGMANDKLRGHSITLMYALQNFIDQLDNPDDLVCVVEKFAVNHITRKISAAEFGKINGPIKKVLASKNFGDKYANAWAKLVAVVQAAL
````
[Phyre investigator](http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/ca8acc6688d7f918/summary.html)
**Walkthrough/interpretation**
Here is a simple example for Phyre investigator to get you used to the interface.
1. Scroll down to the Detailed template information section. You can see the investigator buttons on the right hand side except for rank 3 which says "view investigator results". This is because that analysis is already provided. Please don't press the other investigator buttons on the tutorial examples as multiple people running the analysis is likely to cause a mess.
2. Click on the View investigator results for the rank 3 hit c3pt8B_. This will take you to the Investigator interface. The screen is divided into 3 main horizontal sections: the Info box, The 3D structure and Analyses section, and at the bottom, the Sequence view
3. In the Analyses section, click the Quality tab and below that click the 'ProQ2 quality assessment' button. The structure will be coloured mainly orange and yellow. Look at the key to the left. This indicates most of the structure is towards the 'Good' end of the spectrum. Look at the text box near the top of the page. It gives a brief summary of what this analysis (ProQ2) does.
4. Move your mouse down towards the Sequence view area. Note how as you hover over residues in the sequence view, the corresponding residue in the 3D structure is highlighted. Clicking on a residue causes that position to 'spacefill'. You can clear that by clicking the 'clear selection' button just above the sequence view.
5. Also, hovering over a position in the sequence view displays two bar graphs on the right portion of the middle section. These graphs display the preference of a residue type in the sequence profile ('Sequence Profile' graph) and the likelihood a mutation to one of the 20 amino acids will have a phenotypic effect ('Mutations' graph).
6. In the 'Analyses' section are 3 tabs: Quality, Function, and CDD. Under the 'Quality' tab you can investigate a number of features. Try clicking the 'Ramachandran Analysis' button. A few residues will be colored green and red in the 3D structure. Also a new row will appear for the sequence analysis section. Corresponding residues will appear to those highlighted in the structure. The 'Bad' and 'Allowed' residues only appear in the loop regions. So probably not much to worry about
7. Clicking the 'Disorder' button shows similarly that loop regions and the termini are the only regions with any significant disorder
8. Let's look at the CDD tab. This tab only appears if information from the Conserved Domain Database is available for your sequence. In this case it has detected a Heme-binding site. First click 'clear selection'. Now for each residue colored red in the sequence view, click to spacefill. You should have about 11 residues in spacefill mode, coloured red. As you click on each residue, have a look at the 'Mutations' graph. In almost all cases you can see that mutating the residue to anything other than that in the query sequence is likely to have a phenotypic effect
9. Go to the 'Function' tab. Click through 'conservation', 'pocket detection' and 'mutational sensitivity', reading the text in the Info box for each analysis. Notice how the heme-binding site residues correlate well with these features.
10. Finally, click the protindb interface button to see those residues known to form an interface in the template structure.
###### Exercise-2: [A bad example](http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/d4c0b42a3223e636/summary.html)
````
>CD630_32760_[_protein=PTS_system,_mannose/fructose/sorbose_IID_component]_[protein_id=CAJ70173.1]
MTLNKLTKKELRSMFWRSFALQGAFNYERMQNLGYCYSMLPAIKKLYNKKENQAKAIERHLEIFNTTTVVVPAILGITAAMEEENANNPEFDESSISAVKTALMGPLAGIGDSLFWGTFRIIAAGVGVSLAKEGNIFGPLLFLLLYNIPAFALRIFGLKYGYQVGVNSLERIQREGLMEKIMSMATTVGLFVVGGMVATMLSITTPLKFNLNGAEVILQDILDKIIPNMLPLTFAFVIYYMLKRKVSVTKLTIGTIVTGIALHAIGLL
````
1. First note the low confidence (41%) and low coverage (16%). Immediately you know you aren't going to learn too much from this.
Note the PhyreAlarm icon. This pops up in such cases of low confidence and coverage.
2. Go down to the Sequence Analysis section. Click the button for PSI-Blast Pseudo-multiple sequence alignment. That opens a new window. One can see plenty of homologous sequences, which is good. It means the secondary structure prediction will be pretty accurate and the hidden Markov model for the sequence should be quite powerful. But the lack of any confident hits suggests maybe this is a new fold or just really remote from anything we have a structure for.
3. Look at the Secondary structure and disorder prediction. Click 'Show'. No significant disorder, confidently all alpha helices (SS confidence mainly red). Notice the gold helices? That indicates transmembrane helices. Click 'Hide' to close the secondary structure prediction panel.
4. Scroll down to Domain analysis and click Show. Only short blue and green matches, all well below any useful confidence threshold.
5. Click Hide to hide the domain analysis. Scroll down to the Detailed template information. One can see that the rank 3 and 4 hits have red boxes highlighting the 40% identity between the query protein and the template. But then look how short they are. One can often get high sequence identities purely by chance from short alignments.
6. Scroll to the very bottom of the web page. You can choose to Hide the Detailed template information if you like to make this easier. You'll see the Transmembrane helix prediction section.
7. It looks like all we can get from this run is possibly a useful TM topology prediction. The image indicates the extracellular and cytoplasmic sides of the helices and their start and stop positions. This is probably a good candidate for PhyreAlarm. Maybe a new structure will come out in the weeks ahead that we can build a model on.
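
The two numbers this walkthrough leans on, sequence identity and coverage, are easy to compute by hand for any pairwise alignment. The sketch below assumes one common definition (identity over aligned, non-gap columns; coverage as aligned query residues over the full query length). Tools differ in the exact denominators they use, so this is not necessarily how Phyre2 calculates its figures, and the short sequences are made up.

```python
def identity_and_coverage(aligned_query, aligned_template, query_length):
    """Return (percent identity, percent coverage) for a pairwise alignment.

    '-' marks a gap. Identity is computed over columns where both sequences
    have a residue; coverage is the share of the query those columns span.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    aligned = identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue
        aligned += 1
        identical += q == t
    if aligned == 0:
        return 0.0, 0.0
    return 100.0 * identical / aligned, 100.0 * aligned / query_length


# A 5-residue alignment from a hypothetical 50-residue query:
print(identity_and_coverage("MTLNK", "MTANR", 50))  # (60.0, 10.0)
```

In this toy case the identity is 60% but it rests on only five aligned positions covering 10% of the query, which is the situation step 5 warns about: a high identity over a very short alignment says little on its own.
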
###### Exercise-3: investigate a few more examples
Example using Human [Prion](http://www.uniprot.org/uniprot/P04156)
````
>sp|P04156|PRIO_HUMAN Major prion protein OS=Homo sapiens GN=PRNP PE=1 SV=1
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
````
Full html results of all homologues, models, secondary structure etc. available at:
http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/ecba3197e3730bcd/summary.html
Example using [Human Toll like receptor1](http://www.uniprot.org/uniprot/Q15399)
````
>sp|Q15399|TLR1_HUMAN Toll-like receptor 1 OS=Homo sapiens GN=TLR1 PE=1 SV=3
MTSIFHFAIIFMLILQIRIQLSEESEFLVDRSKNGLIHVPKDLSQKTTILNISQNYISELWTSDILSLSKLRILIISHNRIQYLDISVFKFNQELEYLDLSHNKLVKISCHPTVNLKHLDLSFNAFDALPICKEFGNMSQLKFLGLSTTHLEKSSVLPIAHLNISKVLLVLGETYGEKEDPEGLQDFNTESLHIVFPTNKEFHFILDVSVKTVANLELSNIKCVLEDNKCSYFLSILAKLQTNPKLSNLTLNNIETTWNSFIRILQLVWHTTVWYFSISNVKLQGQLDFRDFDYSGTSLKALSIHQVVSDVFGFPQSYIYEIFSNMNIKNFTVSGTRMVHMLCPSKISPFLHLDFSNNLLTDTVFENCGHLTELETLILQMNQLKELSKIAEMTTQMKSLQQLDISQNSVSYDEKKGDCSWTKSLLSLNMSSNILTDTIFRCLPPRIKVLDLHSNKIKSIPKQVVKLEALQELNVAFNSLTDLPGCGSFSSLSVLIIDHNSVSHPSADFFQSCQKMRSIKAGDNPFQCTCELGEFVKNIDQVSSEVLEGWPDSYKCDYPESYRGTLLKDFHMSELSCNITLLIVTIVATMLVLAVTVTSLCSYLDLPWYLRMVCQWTQTRRRARNIPLEELQRNLQFHAFISYSGHDSFWVKNELLPNLEKEGMQICLHERNFVPGKSIVENIITCIEKSYKSIFVLSPNFVQSEWCHYELYFAHHNLFHEGSNSLILILLEPIPQYSIPSSYHKLKSLMARRTYLEWPKEKSKRGLFWANLRAAINIKLTEQAKK
````
Full html results of all homologues, models, secondary structure etc. available at:
http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/6f4f568f92839199/summary.html
###### Example of intensive analysis: http://www.sbg.bio.ic.ac.uk/phyre2/phyre2_output/d4a1f7b1ec99b495/summary.html
1. Click the Interactive 3D view in JSmol link. That N-terminal blue alpha-helix (which was built ab initio) probably shouldn't be where it is. It should probably pack better - but ab initio is tricky! Also there appears to be some tangling in the red C-terminus. This is usually caused by disagreements between the input templates in that region.
2. In the summary section click on the link called 'Details' below the confidence key. This takes you to the bottom of the page of results to the Multi-template and ab initio information table. This table shows you which templates were used, what regions of your sequence they covered, and their confidence.
3. In particular, note that template d1svma_ (bottom of the list) covers a significant extra region of the query protein at the N-terminus, but is missing a sizeable segment at the C-terminus. Luckily the other templates cover this region well already. This is where using multiple-templates as a 'patchwork' can improve model coverage.
In this case, use of intensive mode has managed to model an extra 60+ residues. It's also had a fair go at the missing first 20 residues that have no template. The secondary structure prediction says this should be helix and intensive mode (or rather the Poing system) has attempted to build a helix for these residues and pack them against the rest of the structure. However, whenever ab initio modelling is concerned, please take results with a large pinch of salt.
We are working very hard on methods to avoid 'spaghettification', tangling from inconsistent templates, and better methods of template selection, including user-defined selections. These new approaches should be incorporated into Phyre2 by the end of 2016.
Intensive mode often creates excellent full length models that cannot be achieved by normal mode. The examples presented here are designed to illustrate how the process can occasionally go wrong, how to detect the problem and diagnose the cause.
**Other tools**:
Example: Human Toll like receptor 4
````
>sp|O00206|TLR4_HUMAN Toll-like receptor 4 OS=Homo sapiens GN=TLR4 PE=1 SV=2
MMSASRLAGTLIPAMAFLSCVRPESWEPCVEVVPNITYQCMELNFYKIPDNLPFSTKNLD
LSFNPLRHLGSYSFFSFPELQVLDLSRCEIQTIEDGAYQSLSHLSTLILTGNPIQSLALG
AFSGLSSLQKLVAVETNLASLENFPIGHLKTLKELNVAHNLIQSFKLPEYFSNLTNLEHL
DLSSNKIQSIYCTDLRVLHQMPLLNLSLDLSLNPMNFIQPGAFKEIRLHKLTLRNNFDSL
NVMKTCIQGLAGLEVHRLVLGEFRNEGNLEKFDKSALEGLCNLTIEEFRLAYLDYYLDDI
IDLFNCLTNVSSFSLVSVTIERVKDFSYNFGWQHLELVNCKFGQFPTLKLKSLKRLTFTS
NKGGNAFSEVDLPSLEFLDLSRNGLSFKGCCSQSDFGTTSLKYLDLSFNGVITMSSNFLG
LEQLEHLDFQHSNLKQMSEFSVFLSLRNLIYLDISHTHTRVAFNGIFNGLSSLEVLKMAG
NSFQENFLPDIFTELRNLTFLDLSQCQLEQLSPTAFNSLSSLQVLNMSHNNFFSLDTFPY
KCLNSLQVLDYSLNHIMTSKKQELQHFPSSLAFLNLTQNDFACTCEHQSFLQWIKDQRQL
LVEVERMECATPSDKQGMPVLSLNITCQMNKTIIGVSVLSVLVVSVVAVLVYKFYFHLML
LAGCIKYGRGENIYDAFVIYSSQDEDWVRNELVKNLEEGVPPFQLCLHYRDFIPGVAIAA
NIIHEGFHKSRKVIVVVSQHFIQSRWCIFEYEIAQTWQFLSSRAGIIFIVLQKVEKTLLR
QQVELYRLLSRNTYLEWEDSVLGRHIFWRRLRKALLDGKSWNPEGTVGTGCNWQEATSI
````
### Secondary structure prediction
[JPred](http://www.compbio.dundee.ac.uk/jpred/)
JPred is a protein secondary structure prediction tool. It also makes predictions on solvent accessibility and coiled-coil regions. It first searches the PDB with the query sequence to identify homologous structures. If none are available, it predicts the structure using the Jnet algorithm, a neural network secondary structure prediction algorithm that uses different types of multiple sequence alignment profiles derived from the same sequences.
###### Example using their default example protein
- Sequence length is limited to 800 aa; however, a longer sequence can be split and submitted in batch mode
- Run JPred (click `Make Prediction`)
- It immediately opens a list of PDB entries that match the query
- One can either explore those proteins individually using the PDB linkout
- Or submit the job to identify a more accurate secondary structure assignment
- The result page shows the predicted secondary structure elements
Default query using default parameters:
````
MQVWPIEGIKKFETLSYLPPLTVEDLLKQIEYLLRSKWVPCLEFSKVGFVYRENHRSPGYYDGRYWTMWKLPMFGCTDATQVLKELEEAKKAYPDAFVRIIGFDNVRQVQLISFIAYKPPGC
````
[result page](http://www.compbio.dundee.ac.uk/jpred4/results/jp_1Xqh2hJ/jp_1Xqh2hJ.results.html)
###### Long sequence of Toll-like receptor 4 split into multiple fragments (advanced search):
**Result:**
An archive with all the results can be downloaded from the following link:
http://www.compbio.dundee.ac.uk/jpred4/results/jp_batch_1478672019__YtqOUQg/jp_batch_1478672019__ALL_JOBS_ARCHIVE.tar.gz
**Results for individual queries are available from the links below:**
- TLR4_HUMAN1 [Link to results](http://www.compbio.dundee.ac.uk/jpred4/results/jp_batch_1478672019__6tFSD1_)
- TLR4_HUMAN2 [Link to results](http://www.compbio.dundee.ac.uk/jpred4/results/jp_batch_1478672019__YtqOUQg)
- TLR4_HUMAN3 [Link to results](http://www.compbio.dundee.ac.uk/jpred4/results/jp_batch_1478672019__mUJGlod)
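
Fragments like the three listed above can be prepared with a few lines of code. The sketch below cuts a sequence into consecutive, non-overlapping pieces and writes them as a multi-record FASTA string. The fragment length of 300 is a hypothetical choice that happens to give three fragments for the 839-residue TLR4 sequence; it is not necessarily how the batch job above was split.

```python
def split_sequence(name, sequence, chunk=800):
    """Cut `sequence` into consecutive pieces of at most `chunk` residues.

    Returns a list of (fragment_name, fragment) pairs named name1, name2, ...
    """
    if chunk <= 0:
        raise ValueError("chunk must be a positive number of residues")
    return [(f"{name}{i + 1}", sequence[start:start + chunk])
            for i, start in enumerate(range(0, len(sequence), chunk))]


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text wrapped at `width`."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder sequence of the same length as TLR4_HUMAN (839 aa);
# paste the real sequence here when preparing an actual batch file.
placeholder = "A" * 839
fragments = split_sequence("TLR4_HUMAN", placeholder, chunk=300)
print([(name, len(seq)) for name, seq in fragments])
```

Keep in mind that predictions near the cut points see less sequence context than they would in the intact protein, so residues close to a fragment boundary deserve extra caution.
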
### Tertiary structure prediction
1. [I-TASSER](http://zhanglab.ccmb.med.umich.edu/I-TASSER/)
Requires registration and submission permission
Check an already analysed result page for [Toll like receptor 4](http://www.uniprot.org/uniprot/O00206) using I-TASSER
Results available at: http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S298631/
2. (PS)2-v2: [Protein Structure Prediction Server](http://ps2.life.nctu.edu.tw/docs.php)
- A much faster tool compared to Phyre2, unfortunately broken at the moment.
- Combines both sequence and secondary structure information for the detection of homologous proteins with remote similarity and for the target-template alignment.
- Check an already analysed result page for [Toll like receptor 4](http://www.uniprot.org/uniprot/O00206) using (PS)2-v2
Results available at: http://140.113.239.111/~ps2v2/display_multi.php?folder=408053421
# Antibodypedia
*Antibodypedia is an open-access, curated, searchable database containing annotated and scored affinity reagents to aid users in selecting antibodies tailored to specific biological and biomedical assays.*
(Same providers as the Human Protein Atlas)
https://www.antibodypedia.com
### Question
Why does Santa Cruz torture goats?
### Examples
1. FGF13
* Click through to the NBP2-45642 and see the validation images
2. Beta-Catenin
* Click on the buttons – What do you get?
* Click on the image – What do you get?
* Do all the antibodies give similar ICC images?
* Do most antibodies work for multiple methods?
3. Look up antibodies for your favourite proteins
* Is an antibody you use in the list?
* Do you think you have used the best antibody available for your purposes?
# [EMBOSS tools for sequence analysis](http://www.ebi.ac.uk/Tools/emboss/)
## [EMBOSS explorer](http://emboss.bioinformatics.nl/cgi-bin/emboss/)
###### Official [EMBOSS tutorial](http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html) written by [Gary Williams](http://emboss.sourceforge.net/docs/emboss_tutorial/node8.html)
**Why EMBOSS?**
- Open source
- Wide range of tools for sequence analysis
- Ideal for building workflow (commandline tools)
- Accesses remote databases conveniently
The official EMBOSS suite comprises over 150 programs that are available as command-line tools; only a few of them are offered as web-based applications.
Wageningen Bioinformatics Webportal, Netherlands offers [a graphical user interface to the EMBOSS suite](http://emboss.bioinformatics.nl/cgi-bin/emboss/), which we will use today for the hands-on session (more like demo!).
## Quick Demo on EMBOSS tools
...but before that, re-use/do the Clustal Omega analysis on your set of 10 P53 sequences. (or, go down this document to use my set of sequences ;) !)
- [extractalign](http://emboss.bioinformatics.nl/cgi-bin/emboss/extractalign)
- Switch to [Mview](http://www.ebi.ac.uk/Tools/msa/mview/) to visualize consensus
- Also check [alnviz](https://toolkit.tuebingen.mpg.de/alnviz), but don't dive into it today. We will cover such visualizations tomorrow.
- Create consensus with [cons](http://emboss.bioinformatics.nl/cgi-bin/emboss/cons)
- Also check [consambig](http://emboss.bioinformatics.nl/cgi-bin/emboss/consambig): cons calculates a consensus sequence from a multiple sequence alignment. To obtain the consensus, the amino acid residue or nucleotide at each position is compared to the possible ambiguity codes using consambig. The consensus sequence uses the minimum ambiguity code match. The ambiguity characters were designed to encode positional variations found among families of related genes. Useful for DNA sequences.
- Use merger to merge two overlapping sequences. It uses a global alignment algorithm (Needleman & Wunsch) to optimally align the sequences. A merged sequence is generated from the alignment and written to the output file. Also useful in the case of DNA.
- [Dotmatcher](http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher) generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity between two sequences. All positions from the first input sequence are compared with all positions from the second input sequence using a specified substitution matrix.
- [plotcon](http://emboss.bioinformatics.nl/cgi-bin/emboss/plotcon)
- [prettyplot](http://emboss.bioinformatics.nl/cgi-bin/emboss/prettyplot): claims to present alignment with pretty formatting (?)
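The dotplot idea behind dotmatcher can be sketched in a few lines of Python. This is a minimal illustration, not dotmatcher's actual implementation: the function names `dotplot` and `render` are hypothetical, and windows are scored by plain identity (an assumption made for brevity) rather than with a substitution matrix.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) positions where windows of seq_a and seq_b match.

    Scoring is plain identity (1 per identical residue), an assumption made
    for brevity; dotmatcher itself scores windows with a substitution matrix.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window]))
            if score >= threshold:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dotplot as text: rows follow seq_a, columns follow seq_b."""
    lines = []
    for i in range(len(seq_a)):
        lines.append("".join("*" if (i, j) in dots else "." for j in range(len(seq_b))))
    return "\n".join(lines)


if __name__ == "__main__":
    # Two short zinc-finger-like fragments, chosen only as toy input.
    a = "HTGEKPYECSECGKAF"
    b = "HTGEKPYKCNECGKAF"
    print(render(a, b, dotplot(a, b)))
```

Regions of similarity show up as diagonal runs of `*`; repeated domains (as in the zinc finger example proteins) produce several parallel diagonals.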
## Example proteins
For pairwise alignment tools, we can use human p53 and zebrafish tp53:
- Human p53: [P04637](http://www.uniprot.org/uniprot/P04637.fasta)
- Zebrafish tp53: [P79734](http://www.uniprot.org/uniprot/P79734.fasta)
````
>P53_HUMAN|P04637| Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>P53_DANRE|P79734| Cellular tumor antigen p53 OS=Danio rerio GN=tp53 PE=1 SV=1
MAQNDSQEFAELWEKNLIIQPPGGGSCWDIINDEEYLPGSFDPNFFENVLEEQPQPSTLP
PTSTVPETSDYPGDHGFRLRFPQSGTAKSVTCTYSPDLNKLFCQLAKTCPVQMVVDVAPP
QGSVVRATAIYKKSEHVAEVVRRCPHHERTPDGDNLAPAGHLIRVEGNQRANYREDNITL
RHSVFVPYEAPQLGAEWTTVLLNYMCNSSCMGGMNRRPILTIITLETQEGQLLGRRSFEV
RVCACPGRDRKTEESNFKKDQETKTMAKTTTGTKRSLVKESSSATLRPEGSKKAKGSSSD
EEIFTLQVRGRERYEILKKLNDSLELSDVVPASDAEKYRQKFMTKNKKENRESSEPKQGK
KLMVKDEGRSDSD
````
For dotmatcher we can use these sequences:
- ZKSC7_HUMAN: [Q9P0L1](http://www.uniprot.org/uniprot/Q9P0L1.fasta)
- MPDZ_HUMAN: [O75970](http://www.uniprot.org/uniprot/O75970.fasta)
````
>ZKSC7_HUMAN|Q9P0L1| Zinc finger protein with KRAB and SCAN domains 7 OS=Homo sapiens GN=ZKSCAN7 PE=1 SV=2
MTTAGRGNLGLIPRSTAFQKQEGRLTVKQEPANQTWGQGSSLQKNYPPVCEIFRLHFRQL
CYHEMSGPQEALSRLRELCRWWLMPEVHTKEQILELLVLEQFLSILPGELRTWVQLHHPE
SGEEAVAVVEDFQRHLSGSEEVSAPAQKQEMHFEETTALGTTKESPPTSPLSGGSAPGAH
LEPPYDPGTHHLPSGDFAQCTSPVPTLPQVGNSGDQAGATVLRMVRPQDTVAYEDLSVDY
TQKKWKSLTLSQRALQWNMMPENHHSMASLAGENMMKGSELTPKQEFFKGSESSNRTSGG
LFGVVPGAAETGDVCEDTFKELEGQTSDEEGSRLENDFLEITDEDKKKSTKDRYDKYKEV
GEHPPLSSSPVEHEGVLKGQKSYRCDECGKAFNRSSHLIGHQRIHTGEKPYECNECGKTF
RQTSQLIVHLRTHTGEKPYECSECGKAYRHSSHLIQHQRLHNGEKPYKCNECAKAFTQSS
RLTDHQRTHTGEKPYECNECGEAFIRSKSLARHQVLHTGKKPYKCNECGRAFCSNRNLID
HQRIHTGEKPYECSECGKAFSRSKCLIRHQSLHTGEKPYKCSECGKAFNQNSQLIEHERI
HTGEKPFECSECGKAFGLSKCLIRHQRLHTGEKPYKCNECGKSFNQNSHLIIHQRIHTGE
KPYECNECGKVFSYSSSLMVHQRTHTGEKPYKCNDCGKAFSDSSQLIVHQRVHTGEKPYE
CSECGKAFSQRSTFNHHQRTHTGEKSSGLAWSVS
>MPDZ_HUMAN|O75970| Multiple PDZ domain protein OS=Homo sapiens GN=MPDZ PE=1 SV=2
MLEAIDKNRALHAAERLQTKLRERGDVANEDKLSLLKSVLQSPLFSQILSLQTSVQQLKD
QVNIATSATSNIEYAHVPHLSPAVIPTLQNESFLLSPNNGNLEALTGPGIPHINGKPACD
EFDQLIKNMAQGRHVEVFELLKPPSGGLGFSVVGLRSENRGELGIFVQEIQEGSVAHRDG
RLKETDQILAINGQALDQTITHQQAISILQKAKDTVQLVIARGSLPQLVSPIVSRSPSAA
STISAHSNPVHWQHMETIELVNDGSGLGFGIIGGKATGVIVKTILPGGVADQHGRLCSGD
HILKIGDTDLAGMSSEQVAQVLRQCGNRVKLMIARGAIEERTAPTALGITLSSSPTSTPE
LRVDASTQKGEESETFDVELTKNVQGLGITIAGYIGDKKLEPSGIFVKSITKSSAVEHDG
RIQIGDQIIAVDGTNLQGFTNQQAVEVLRHTGQTVLLTLMRRGMKQEAELMSREDVTKDA
DLSPVNASIIKENYEKDEDFLSSTRNTNILPTEEEGYPLLSAEIEEIEDAQKQEAALLTK
WQRIMGINYEIVVAHVSKFSENSGLGISLEATVGHHFIRSVLPEGPVGHSGKLFSGDELL
EVNGITLLGENHQDVVNILKELPIEVTMVCCRRTVPPTTQSELDSLDLCDIELTEKPHVD
LGEFIGSSETEDPVLAMTDAGQSTEEVQAPLAMWEAGIQHIELEKGSKGLGFSILDYQDP
IDPASTVIIIRSLVPGGIAEKDGRLLPGDRLMFVNDVNLENSSLEEAVEALKGAPSGTVR
IGVAKPLPLSPEEGYVSAKEDSFLYPPHSCEEAGLADKPLFRADLALVGTNDADLVDEST
FESPYSPENDSIYSTQASILSLHGSSCGDGLNYGSSLPSSPPKDVIENSCDPVLDLHMSL
EELYTQNLLQRQDENTPSVDISMGPASGFTINDYTPANAIEQQYECENTIVWTESHLPSE
VISSAELPSVLPDSAGKGSEYLLEQSSLACNAECVMLQNVSKESFERTINIAKGNSSLGM
TVSANKDGLGMIVRSIIHGGAISRDGRIAIGDCILSINEESTISVTNAQARAMLRRHSLI
GPDIKITYVPAEHLEEFKISLGQQSGRVMALDIFSSYTGRDIPELPEREEGEGEESELQN
TAYSNWNQPRRVELWREPSKSLGISIVGGRGMGSRLSNGEVMRGIFIKHVLEDSPAGKNG
TLKPGDRIVEVDGMDLRDASHEQAVEAIRKAGNPVVFMVQSIINRPRKSPLPSLLHNLYP
KYNFSSTNPFADSLQINADKAPSQSESEPEKAPLCSVPPPPPSAFAEMGSDHTQSSASKI
SQDVDKEDEFGYSWKNIRERYGTLTGELHMIELEKGHSGLGLSLAGNKDRSRMSVFIVGI
DPNGAAGKDGRLQIADELLEINGQILYGRSHQNASSIIKCAPSKVKIIFIRNKDAVNQMA
VCPGNAVEPLPSNSENLQNKETEPTVTTSDAAVDLSSFKNVQHLELPKDQGGLGIAISEE
DTLSGVIIKSLTEHGVAATDGRLKVGDQILAVDDEIVVGYPIEKFISLLKTAKMTVKLTI
HAENPDSQAVPSAAGAASGEKKNSSQSLMVPQSGSPEPESIRNTSRSSTPAIFASDPATC
PIIPGCETTIEISKGRTGLGLSIVGGSDTLLGAIIIHEVYEEGAACKDGRLWAGDQILEV
NGIDLRKATHDEAINVLRQTPQRVRLTLYRDEAPYKEEEVCDTLTIELQKKPGKGLGLSI
VGKRNDTGVFVSDIVKGGIADADGRLMQGDQILMVNGEDVRNATQEAVAALLKCSLGTVT
LEVGRIKAGPFHSERRPSQSSQVSEGSLSSFTFPLSGSSTSESLESSSKKNALASEIQGL
RTVEMKKGPTDSLGISIAGGVGSPLGDVPIFIAMMHPTGVAAQTQKLRVGDRIVTICGTS
TEGMTHTQAVNLLKNASGSIEMQVVAGGDVSVVTGHQQEPASSSLSFTGLTSSSIFQDDL
GPPQCKSITLERGPDGLGFSIVGGYGSPHGDLPIYVKTVFAKGAASEDGRLKRGDQIIAV
NGQSLEGVTHEEAVAILKRTKGTVTLMVLS
````
### Set of P53 proteins:
**Raw sequences**
```
>Mus musculus
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPV
APAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFFQLAKTCPVQLWVSATPPAGSRVRAMAIY
KKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCM
GGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPL
DGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSLQPRAFQALIKEESPNC
>Rattus norvegicus
MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPNSMEDLFLPQDVAELLEGPEEALQVSAPAAQEPGTEAPAP
VAPASATPWPLSSSVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSSPQQKKKP
LDGEYFTLKIRGRERFEMFRELNEALELKDARAAEESGDSRAHSSLQPRTFQALIKKESPNC
>Mastomys natalensis
LPLSQETFQRLWKLLPPEAVLSEASPNSMDNMFLSPDVVNLLEGPEEALQVSAAPAAQDPVTETPAPAAPAPATPWPLSS
FVPSQKTYQGSYGFHLGFLQSGTAKSVMCTYSPSLNKLFCQLAKTCPVQLWVSDTPPAGSRVRAMAIYKKSQHMTEVVRR
CPHHERCTDGDGLAPPQHLIRVEGNLNAEYLDDKQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTII
TLEDSSGNLLGRDSFEVRICACPGRDRRTEEENFRKKEEPCPELPLGSAKRALPTGTSASPQQKKKRLDGEYFTLKIRGR
ERFEMFRELNEALELKDARAAEELGDSRAHSSYLKTKRGQSSSHHKKPMVKKVGPDSD
>Microtus ochrogaster
MEEPQSDLSIEPPLSQETFSDLWNLLPPNNVLSTSLSVDAMEDLFLSQDVANWLEEPNEGPQMSAAASTAEDPVTEAPAP
VTPAPVTSWPLSSSVPSQKTYQGEYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVSSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLRAEYLDDRQTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDPSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPRPELPVGSTKRVLPTNTSSPQPKKKPL
DGEYFTLKIRGRERFKMFSELNEALELKDAQDANGSGDSRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Nannospalax galili
MEEQQSDLSIEPPLSQETFSDLWKLLPQNNVLSTPLSPNSMEDLLLSPEDVANWLDDPDEALQVPAAAITGDPVTETSAP
VAPPPATPWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPPLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVKRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGELCPELPPGSTKRALPTGTSSSPQPKKKP
LDGEYFTLKIRGRERFEMFRELNEALELKDTQAEKDSGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Eospalaxbaileyi
MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTSLSPNSMEDLLLSAEDVANWLDDPDDALRMPAAPVTEDPATEASAP
VAPPPATPWPLSSSVPSQKTYQGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVIVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGESCPELPPGSTKRALPTDTSSSPQPKKKP
LLDGEYFTLKIRGRERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Eospalaxcansus
MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTSLSPNSMEDLLLSAEDVANWLDDPDDALRMPAAPVTEDPTTEASAP
VAPPPATPWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVACTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGESCPELPPGSTKRALPTGTSSSPQPKKKP
LLDGEYFTLKIRGRERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Cricetulus griseus
MEEPQSDLSIELPLSQETFSDLWKLLPPNNVLSTLPSSDSIEELFLSENVTGWLEDSGGALQGVAAAAASTAEDPVTETP
APVASAPATPWPLSSSVPSYKTFQGDYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVNSTPPPGTRVRAM
AIYKKLQYMTEVVRRCPHHERSSEGDSLAPPQHLIRVEGNLHAEYLDDKQTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRICACPGRDRRTEEKNFQKKGEPCPELPPKSAKRALPTNTSSSPPPKK
KTLDGEYFTLKIRGHERFKMFQELNEALELKDAQASKGSEDNGAHSSYLKSKKGQSASRLKKLMIKREGPDSD
>Oryctolagus cuniculus
MSATAQAGPGGSQEASDPAAAMEESQSDLSLEPPLSQETFSDLWKLLPENNLLTTSLNPPVDDLLSAEDVANWLNEDPEE
GLRVPAAPAPEAPAPAAPALAAPAPATSWPLSSSVPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCP
VQLWVDSTPPPGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSVVVPYEP
PEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPCPELPPG
SSKRALPTTTTDSSPQTKKKPLDGEYFILKIRGRERFEMFRELNEALELKDAQAEKEPGGSRAHSSYLKAKKGQSTSRHK
KPMFKREGPDSD
>Carlito syrichta
MEEPQSDLSIEPLSQETFSDLWKLLPENNVLSPSLSPPVDDLILSTEDIANWFSEGPDEALRTAPAPVAPTPAASTQAAP
APGTPWPLSSSVPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQYMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDKTTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGG
MNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPCSELPPGSTKRALPTSTSSPSQPKKKPLDG
EYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHTSHLKSKKGQSTSRHKKLMFKREGPDSD
```
**Aligned by Clustal Omega**
```
CLUSTAL O(1.2.3) multiple sequence alignment
Cricetulus ---------------------MEEPQSDLSIELPLSQETFSDLWKLLPPNNVLSTL--PS
Carlito ---------------------MEEPQSDLSIE-PLSQETFSDLWKLLPENNVLSPS--LS
Microtus ---------------------MEEPQSDLSIEPPLSQETFSDLWNLLPPNNVLSTS--LS
Oryctolagus MSATAQAGPGGSQEASDPAAAMEESQSDLSLEPPLSQETFSDLWKLLPENNLLTTS--LN
Nannospalax ---------------------MEEQQSDLSIEPPLSQETFSDLWKLLPQNNVLSTP--LS
Eospalaxbaileyi ---------------------MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTS--LS
Eospalaxcansus ---------------------MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTS--LS
Mastomys --------------------------------LPLSQETFQRLWKLLPPEAVLSE---AS
Mus ------------------MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPS-----
Rattus ---------------------MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGS
*******. **:*** : :*
Cricetulus SDSIEELFL-SENVTGWLEDSGGALQGVAAAAASTAEDPVTETPAPVASAPATPWPLSSS
Carlito PP-VDDLILSTEDIANWFSEGPDE--ALRTAPAPV--APTPAASTQAAPAPGTPWPLSSS
Microtus VDAMEDLFL-SQDVANWLEEPNEG--PQMSAAASTAEDPVTEAPAPVTPAPVTSWPLSSS
Oryctolagus PP--VDDLLSAEDVANWLNEDPEE--GLRVPAAPAPEAPAPAAPALAAPAPATSWPLSSS
Nannospalax PNSMEDLLLSPEDVANWLD-DPDE--ALQVPAAAITGDPVTETSAPVAPPPATPWPLSSS
Eospalaxbaileyi PNSMEDLLLSAEDVANWLD-DPDD--ALRMPAAPVTEDPATEASAPVAPPPATPWPLSSS
Eospalaxcansus PNSMEDLLLSAEDVANWLD-DPDD--ALRMPAAPVTEDPTTEASAPVAPPPATPWPLSSS
Mastomys PNSMDNMFL-SPDVVNLLEGPEE---ALQVSAAPAAQDPVTETPAPAAPAPATPWPLSSF
Mus PHCMDDLLL-PQDVEEFFEGPSE---ALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSF
Rattus PNSMEDLFL-PQDVAELLEGPEE---ALQVS-APAAQEPGTEAPAPVAPASATPWPLSSS
: :* :: :. * * : .: * *****
Cricetulus VPSYKTFQGDYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVNSTPPPGTR
Carlito VPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTR
Microtus VPSQKTYQGEYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVSSTPPPGTR
Oryctolagus VPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGSR
Nannospalax VPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPPLNKLFCQLAKTCPVQLWVDSTPPPGTR
Eospalaxbaileyi VPSQKTYQGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTR
Eospalaxcansus VPSQKTYQGSYGFRLGFLHSGTAKSVACTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTR
Mastomys VPSQKTYQGSYGFHLGFLQSGTAKSVMCTYSPSLNKLFCQLAKTCPVQLWVSDTPPAGSR
Mus VPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFFQLAKTCPVQLWVSATPPAGSR
Rattus VPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTR
*** **::*.***:****:******* **** ***:* ************ *** *:*
Cricetulus VRAMAIYKKLQYMTEVVRRCPHHERSSEGDSLAPPQHLIRVEGNLHAEYLDDKQTFRHSV
Carlito VRAMAIYKQSQYMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDKTTFRHSV
Microtus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLRAEYLDDRQTFRHSV
Oryctolagus VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSV
Nannospalax VRAMAIYKKSQHMTEVVKRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Eospalaxbaileyi VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Eospalaxcansus VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Mastomys VRAMAIYKKSQHMTEVVRRCPHHERCTDGDGLAPPQHLIRVEGNLNAEYLDDKQTFRHSV
Mus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSV
Rattus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSV
********: *:*****:*******.::.*.************* ***:*: ******
Cricetulus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRICA
Carlito VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Microtus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRVCA
Oryctolagus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Nannospalax VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Eospalaxbaileyi IVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Eospalaxcansus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Mastomys VVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRICA
Mus VVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCA
Rattus VVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCA
:*******.**: *****:************************ *******:*****:**
Cricetulus CPGRDRRTEEKNFQKKGEPCPELPPKSAKRALPTNTSSS-PPPKKKTLDGEYFTLKIRGH
Carlito CPGRDRRTEEENFRKKGEPCSELPPGSTKRALPTSTSS-PSQPKKKPLDGEYFTLQIRGR
Microtus CPGRDRRTEEENFRKKGEPRPELPVGSTKRVLPTNTS--SPQPKKKPLDGEYFTLKIRGR
Oryctolagus CPGRDRRTEEENFRKKGEPCPELPPGSSKRALPTTTTDSSPQTKKKPLDGEYFILKIRGR
Nannospalax CPGRDRRTEEENFRKKGELCPELPPGSTKRALPTGTSSSPQPKKKP-LDGEYFTLKIRGR
Eospalaxbaileyi CPGRDRRTEEENFRKKGESCPELPPGSTKRALPTDTSSSPQPKKKPLLDGEYFTLKIRGR
Eospalaxcansus CPGRDRRTEEENFRKKGESCPELPPGSTKRALPTGTSSSPQPKKKPLLDGEYFTLKIRGR
Mastomys CPGRDRRTEEENFRKKEEPCPELPLGSAKRALPTGTSAS-PQQKKKRLDGEYFTLKIRGR
Mus CPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSAS-PPQKKKPLDGEYFTLKIRGR
Rattus CPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSS-PQQKKKPLDGEYFTLKIRGR
**********:**:** *** *:**.*** *: ** ****** *:***:
Cricetulus ERFKMFQELNEALELKDAQASKGSEDNGAHSSYLKSKKGQSASRLKKLMIKREGPDSD
Carlito ERFEMFRELNEALELKDAQAGKEPGGSRAHTSHLKSKKGQSTSRHKKLMFKREGPDSD
Microtus ERFKMFSELNEALELKDAQDANGSGDSRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Oryctolagus ERFEMFRELNEALELKDAQAEKEPGGSRAHSSYLKAKKGQSTSRHKKPMFKREGPDSD
Nannospalax ERFEMFRELNEALELKDTQAEKDSGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Eospalaxbaileyi ERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Eospalaxcansus ERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Mastomys ERFEMFRELNEALELKDARAAEELGDSRAHSSYLKTKRGQSSSHHKKPMVKKVGPDSD
Mus KRFEMFRELNEALELKDAHATEESGDSRAHSSLQPRAFQ--------ALIKEESPNC-
Rattus ERFEMFRELNEALELKDARAAEESGDSRAHSSLQPRTFQ--------ALIKKESPNC-
:**:** **********:: : . **:* :.*. .*:.
```
# Exploring the Human Protein Atlas
http://www.proteinatlas.org
Antibodies raised against most human proteins.
Rigorous purification protocol.
Used to stain human tissues and cells.
Key fact: Provides independent validation of cellular location and tissue distribution using commercial (or home produced) antibodies.
## Example proteins to explore the Atlas
1. A-kinase anchoring proteins are scaffolds for the PKA kinase.
- AKAP1
- AKAP4
- AKAP8
- AKAP12
### Questions
- Are these AKAPs all found in the same cell compartments and subcellular locations?
- What happens when you toggle the channels?
- Are these AKAPs found in all tissues?
- Are they highly expressed in cancer cells?
- Is there a PKA kinase cascade?
2. c-Myc transcription factor
### Questions
- Do all antibodies perform similarly? (Click for primary data for a summary).
- In the cerebral cortex are all cell types stained?
- Would we expect that for Myc?
- Which cells are stained in the placenta?
- Which cell compartments have Myc?
3. CTNNB1 – the key Wnt signalling transcription factor beta catenin
### Questions
- Which tissues don’t express any beta catenin?
- Which cancers don’t express any beta catenin?
- Is most of the beta catenin staining in the nucleus?
4. FGF13 - fibroblast growth factor 13
### Questions
- Is FGF13 strongly expressed in most cancer cells?
- Are there any tissues that don’t stain for FGF13?
- Growth factors are secreted and their receptors are on the cell surface: Which cellular compartments contain FGF13?
- Which cellular compartments in bronchial tissue contain FGF13?
5. Try your own favourite proteins!
Let us know if the images make sense to you…
## Other course materials
### The links below will direct you to the external files
1. *JalView*, *Chimera* and other exercises by Toby Gibson, Marc Gouw and Malvika Sharan:
https://docs.google.com/document/d/1ceyNSXCpytsG0Bih-sRIOKnf2ZRNxJTOp9jCzKn6lOY/edit?usp=sharing
2. *Cytoscape* lesson and exercises by Matt Rogon:
https://git.embl.de/rogon/introduction_to_cytoscape
3. *Unix* Course by Marc Gouw and Malvika Sharan:
https://github.com/malvikasharan/SWC_reference_material/blob/master/Unix_Shell/Unix_Shell.md
4. *String* and *Stitch* Exercises by Michael Kuhn:
https://www.dropbox.com/sh/zfad20wfa8t485j/AACQl76GoP_9x0DP55tOKh5ka?dl=0
# Finding similar sequences
Finding similar protein and nucleotide sequences has been one of the
greatest challenges in the age of bioinformatics. In many situations where we
have a piece of DNA or a protein, the very first step towards knowing more about it
involves finding similar pieces of DNA or protein based on sequence similarity.
The basic premise of all DNA sequence similarity searches is quite
straightforward:
- Given a particular nucleotide sequence, find all genes/transcripts with similar sequences in a particular database.
And for protein sequences, it is quite similar:
- Given a particular amino acid sequence, find all peptides or proteins with similar sequences in a particular database.
The main tool which has been used over the past 20 years for this purpose
has been **BLAST**. In this tutorial we are going to look at standard BLAST for
nucleotide and protein sequences, as well as a BLAST variant called PSI-BLAST.
Lastly we'll take a look at some newer tools that have emerged based on hidden
Markov models: HMMER and HHPred.
## BLAST
**BLAST** stands for "basic local alignment search tool", and is a basic tool for
searching for local alignments (as you might have guessed from the acronym).
Its standing as one of the most important tools in bioinformatics history
becomes pretty clear when we take a look at the list of the 100 highest cited
publications:
- **12:** "Basic local alignment search tool", Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. *J. Mol. Biol.* 215, 403–410 (1990). -- 38,380 citations
- **14:** "Gapped BLAST and PSI-BLAST: A new generation of protein database search programs", Altschul, S. F. et al. *Nucleic Acids Res.* 25, 3389–3402 (1997). -- 36,410 citations
(source: [The top 100 papers: Nature News & Comment][nature_top_100])
BLAST can be used to find similar nucleotide as well as protein sequences and
rank them from most to least similar based on a similarity score.
As the name implies, BLAST is designed for "local" alignments, i.e.: finding
segments of a sequence that are similar to another segment of another sequence.
Most proteins are modular, and often contain one or more functional domains.
Finding proteins (or genes) that have similar domains gives a good indication
that these proteins share some functional similarity. Proteins (or genes) that
are "locally" similar for their entire sequences are likely to be homologous,
and functionally equivalent.
In 'sequence similarity search' lingo, we typically speak of two things:
- **query:** the sequence we have and want to know more about.
- **subject** or **database:** the set of sequences we want to retrieve similar sequences from.
**BLAST** refers to a suite of programs to generate local alignments and
calculate significance scores.
This includes the two main versions of BLAST we will discuss today:
- **BLASTP** proteins
- **BLASTN** nucleotides
There are many other BLAST searches: megaBLAST, BLASTX, TBLASTN, PSI-BLAST,
RPSBLAST & DELTA-BLAST. (of which we'll look at PSI-BLAST later as well.)
### BLAST method, in brief
It is worthwhile to have a basic understanding of how a simple BLAST works:
The **query** sequence is split into small fragments, known as **words**:
- protein sequences are split into tri-mers
- nucleotide sequences are split into 11-mers
BLAST then scans the **database** (or subject) to find all identical words
between query and subject sequences.
- For protein sequences: words must exceed a score threshold based on a substitution matrix.
- For nucleotide sequences: words must match exactly.
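The word-splitting and word-matching steps above can be sketched as follows. This is a simplified illustration with hypothetical function names (`words`, `exact_seeds`); it only covers exact word matches (the nucleotide case) and does not model the scored "similar word" matching that BLAST uses for proteins.

```python
from collections import defaultdict


def words(sequence, k):
    """Yield (offset, word) for every overlapping k-mer in the sequence."""
    for offset in range(len(sequence) - k + 1):
        yield offset, sequence[offset:offset + k]


def exact_seeds(query, subject, k):
    """Return (query_offset, subject_offset) pairs where a k-mer is shared."""
    # Index the query words once, then scan the subject against that index.
    index = defaultdict(list)
    for q_off, word in words(query, k):
        index[word].append(q_off)
    seeds = []
    for s_off, word in words(subject, k):
        for q_off in index.get(word, []):
            seeds.append((q_off, s_off))
    return seeds


if __name__ == "__main__":
    # k=3 keeps the toy example readable; the text above uses 11 for nucleotides.
    print(exact_seeds("ACGTAC", "TTACGT", k=3))
```

Each seed returned here is a candidate starting point for the extension step.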
When a match is found, BLAST extends it forwards and backwards to produce an
alignment. This extension continues until the score falls below a given
threshold (dropoff). The sequence similarity score is calculated from a scoring
matrix, which defines the similarity between different amino acids (or
nucleotides).
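A minimal sketch of the extension step, in one direction only and without gaps. The function name `extend_right` and the match, mismatch and dropoff values are illustrative assumptions, and the dropoff is interpreted here as the maximum allowed fall below the best score seen so far.

```python
def extend_right(query, subject, q_start, s_start, match=1, mismatch=-2, dropoff=5):
    """Extend an ungapped alignment to the right from (q_start, s_start).

    Returns (best_score, length) of the best-scoring extension. Extension
    stops once the running score falls more than `dropoff` below the best
    score seen so far, or when either sequence ends.
    """
    score = best_score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score > dropoff:
            break
    return best_score, best_len


if __name__ == "__main__":
    print(extend_right("ACGTTTTT", "ACGTAAAA", 0, 0))
```

Extending to the left works the same way with the offsets decreasing; the two halves together give the full local alignment around a seed.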
For proteins, the similarity score is determined by a sequence similarity matrix, such as the BLOSUM62 matrix shown below:
![BLOSUM62 matrix](images/BLOSUM62.png)
(source: [https://commons.wikimedia.org/wiki/File:BLOSUM62.png])
Additionally, a gap penalty is defined: the cost of opening and extending a gap in the sequence alignment.
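To make the interplay of substitution scores and gap penalties concrete, here is a small sketch that scores an already aligned pair of sequences. The names `alignment_score` and `identity_score` are hypothetical, the toy scoring function stands in for a real matrix lookup such as BLOSUM62, and the default gap costs (11 to open, 1 per gapped position) are illustrative assumptions.

```python
def alignment_score(aligned_a, aligned_b, score_pair, gap_open=11, gap_extend=1):
    """Score two equal-length aligned strings in which '-' marks a gap.

    Each run of gap characters costs gap_open once, plus gap_extend for
    every gapped position in the run.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_in = None  # which sequence the current gap run is in: 'a', 'b' or None
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != gap_in:
                total -= gap_open  # a new gap run starts here
            total -= gap_extend
            gap_in = side
        else:
            total += score_pair(a, b)
            gap_in = None
    return total


def identity_score(a, b):
    """Toy stand-in for a substitution matrix lookup such as BLOSUM62."""
    return 4 if a == b else -1


if __name__ == "__main__":
    print(alignment_score("AC--GT", "ACTTGT", identity_score))
```

Because opening a gap is much more expensive than extending one, a single long gap is penalised less than several short ones of the same total length.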
### Exercise 1: Protein BLAST-ing BLD10
BLD10 is a protein found in *C. reinhardtii* (a small green alga) involved in creating the centrosome/basal body complex. It has some interesting sequence features which make it a nice example for sequence similarity searches, and we'll be using it throughout the rest of this tutorial.
Note: Much of this material has also been adapted from the [NCBI
guide][ncbi_blast_guide], which is an excellent reference for getting started
with BLAST.
#### BLAST input
The BLAST input screen contains the most important components you need to create a BLAST web search.
The top part contains information pertaining to our input sequence, or **query**:
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Protein BLAST" button to start a protein BLAST query.
![Search box](images/searchbox.jpg)
- On the top bar, we see which BLAST program we are in: blastn, blastp, blastx, tblastn, tblastx.
- In the text box, we have the **query** input section where you can enter a nucleotide/protein sequence, accession number, or FASTA-formatted entry.
- The "from" and "to" fields allow you to limit the search to a part of the sequence.
- It's also possible to upload a file instead of copy-pasting entries into the search box using the 'upload file' button.
- Optionally, you can give your job a name (to find it again more quickly later).
- **Step 2** In the Search text box, enter the accession for BLD10: **EDP06416.1**
![Search set](images/searchset.jpg)
In the "search set" section, we choose which **subject** or **database** we
want to search in. There are many options to choose from.
For proteins, the most useful/common options are:
- **nr**: A non-redundant database of protein sequences from GenBank, RefSeq, SwissProt, PDB and more.
- **refseq_protein**: Protein sequences from the NCBI Reference Sequence dataset.
- **swissprot**: The Uniprot/SWISS-PROT database.
In the "search set" we can also choose to limit our search to a single species (for example, "homo sapiens"), or a taxonomic group (e.g. "fungi").
- **Step 3** Make sure we have the **nr** database selected as our **subject**.
![Program selection](images/programselection.jpg)
In the "program selection" box, we select which type of BLAST we wish to run
(the one shown above is for protein sequences.) To run a (standard) BLASTP leave "blastp" selected, and hit the "BLAST" button.
- **Step 4** Make sure **blastp** is selected as our program.
#### BLAST output
The top panel contains useful information about your search:
![Results Top](images/toptop.png)
The most important of these is the **RID** (result ID), which you can use to
retrieve your results (and is typically valid for a few days).
The next panel contains a graphical summary of the result:
![Graphical Summary](images/graphicsummary.png)
The top part of the "graphical summary" contains any domains identified in the
sequence (such as Pfam and InterPro domains). You can click on these for more
information. The sequences retrieved are typically called **hits**.
The bottom part of the "graphical summary" has the more interesting
results regarding your search: a graphical representation of the top-scoring hits.
Mouse over the hit for a description, or click on one to scroll down to
the hit details.
- **Question 1** Were there any homologues detected in Humans?
- **Question 2** What is the "evolutionary extent" of this protein in other species? (hint: the button "taxonomic reports" gives a nice overview)
The next part has information on all of the "hits" found:
![Descriptions](images/descriptions.png)
The descriptions section shows the hits resulting from the search, as well as
some of the scores. Of the scores, the most important one is the e-value:
- **e-value** The number of different alignments with scores equivalent to or better than the observed score that are expected to occur in a database search by chance. The smaller the e-value, the more likely the detected protein is a homologue. A rule-of-thumb threshold of *0.0001* is often used to infer homology.
- **Question 3** Do the detected proteins appear to be true homologues to BLD10?
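The e-value described above is tied to the bit score reported in the same table by the standard relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the total length of the database. The sketch below applies it directly; the function name `expected_hits` is hypothetical, and the raw lengths are used as an approximation (BLAST itself uses length-corrected "effective" values, so its reported e-values will differ somewhat).

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value: chance alignments expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers: a 500-residue query against 100 million residues.
    for bits in (30, 40, 50, 60):
        print(bits, expected_hits(bits, 500, 10**8))
```

Two things follow from the formula: every additional bit of score halves the e-value, and the same alignment gets a larger (worse) e-value when the database searched is bigger.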
![Alignments](images/alignments.png)
The last part of the results page contains the individual alignments detected
using the BLAST search.
### Exercise 2: C. elegans protein BLAST
This exercise was adapted from an [NCBI Blast course][ncbi_blast_course].
The following is the product of a gene in *C. elegans* which resulted from some (hypothetical) experiment we performed.
```fasta
>C.elegans protein
MFHPGMTSQPSTSNQMYYDPLYGAEQIVQCNPMDYHQANILCGMQYFNNSHNRYPLLPQMPPQFTNDHPYDFPNVPTISTLDEASSFNGFLIPSQPSSYNNNNISCVFTPTPCTSSQASSQPPPTPTVNPTPIPPNAGAVLTTAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITAVTSNGKQPTGCVTIQRSLDGRLQVAGRKGVPHVVYARIWRWPKVSKNELVKLVQCQTSSDHPDNICINPYHYERVVSNRITSADQSLHVENSPMKSEYLGDAGVIDSCSDWPNTPPDNNFNGGFAPDQPQLVTPIISDIPIDLNQIYVPTPPQLLDNWCSIIYYELDTPIGETFKVSARDHGKVIVDGGMDPHGENEGRLCLGALSNVHRTEASEKARIHIGRGVELTAHADGNISITSNCKIFVRSGYLDYTHGSEYSSKAHRFTPNESSFTVFDIRWAYMQMLRRSRSSNEAVRAQAAAVAGYAPMSVMPAIMPDSGVDRMRRDFCTIAISFVKAWGDVYQRKTIKETPCWIEVTLHRPLQILDQLLKNSSQFGSS
```
#### Questions
Use BLAST to answer the following questions:
- **Question 1** Which *C. elegans* protein is this one most likely to be?
- **Question 2** What is its closest homologue in another species?
- **Question 3** What is its homologue in *D. melanogaster*? (Hint: You can create another BLAST query for this).
### Exercise 3: Nucleotide BLAST
BLAST can also be used to perform searches on DNA (or transcripts).
- **Step 1** Navigate to the NCBI BLAST homepage: [https://www.ncbi.nlm.nih.gov/blast/], and click on the "Nucleotide BLAST" button to start a nucleotide BLAST query.
#### BLAST input
BLAST input for nucleotide sequences is very similar to the Protein BLAST shown previously, with a few minor differences:
The most common options for the "search set" are:
- **nr**: A non-redundant set of all GenBank + EMBL + DDBJ + PDB sequences.
- **refseq_genomic**: Nucleotide genomic sequences from the NCBI Reference Sequence Project.
- **Human G+T**: The genomic sequences plus curated and predicted RNAs from the current build of the human genome.
There are also a handful of different programs for nucleotide BLAST, the most useful of which are:
- **blastn** is a general-purpose search to align tRNA, mRNA, rRNA or genomic DNA sequences with coding and non-coding regions
- **megablast** is a search for highly similar sequences (and is much faster)
#### Questions
Let's assume that the DNA sequence below is an RNAi sequence we have
found in our lab, but we have no idea which transcript it is supposed to
target. However, we know that it's from a human cell line.
```
>LostRNAi
GGTATTTGTTTTGATGCAAACCTCAATCCCTCCCCTTCTTTGAATGGTGTGCCCCACCCCGCGGGTCGCCTGCAACCTAGGCGGACGCTACCATGGCGTGAGACAGGGAGGGAAAGAAGTGTGCAGAAGGCAAGCCCGGAGGTATTTTCAAGAATGAGTATATCTCATCTTCCCGGAGGAAAAAAAAAAAGAATGGGTACGTCTGAGAATCAAATTTTGAAAGAGTGCAATGATGGGTCGTTTGATAATTTGTCGGAAAAACAATCTACCTGTTATCTAGCTTTGGGCTAGGCCATTCCAGTTCCAGACGCAGGCTGAACGTCGTGAAGCGGAAGGGGCGGGCCCGCAGGCGTCCGTGTGGTCCTCCGTGCAGCCCTCCGGCCCGAGCCGGTTCTTCCTGGTAGGAGGCGGAACTCGAATTCATTTCTCCCGCTGCCCCATCTCTTAGCTCGCGGTTGTTTCATTCCGCAGTTTCTTCCCATGCACCTGCCGCGTACCGGCCACTTTGTGCCGTACTTACGTCATCTTTTTCCTAAATCGAGGTGGCATTTACACACAGCGCCAGTGCACACAGCAAGTGCACAGGAAGATGAGTTTTGGCCCCTAACCGCTCCGTGATGCCTACCAAGTCACAGACCCTTTTCATCGTCCCAGAAACGTTTCATCACGTCTCTTCCCAGTCGATTCCCGACCCCACCTTTATTTTGATCTCCATAACCATTTTGCCTGTTGGAGAACTTCATATAGAATGGAATCAGGCTGGGCGCTGTGGCTCACGCCTGCACTTTGGGAGGCCGAGGCGGGCGGATTACTTGAGGATAGGAGTTCCAGACCAGCGTGGCCAACGTGGTGAATCCCCGTCTCTACTAAAAAATACAAAAATTAGCTGGGCGTGGTGGGTGCCTGTAATCCCAGCTATTCGGGAGGGTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGTGCCACTACACTCCAGCCTGGGCGACAAGAACGAAACTCCGTCTCAAAAAAAAGGGGGGAATCATACATTATGTGCTCATTTTTGTCGGGCTTCTGTCCTTCAATGTACTGTCTGACATTCGTTCATGTTGTATATATCAGTATTTTGCTC
```
- **Question 1** Use BLASTN to find which transcript this sequence most likely
targets. (Hint: Use **megablast** for a highly similar sequence search, and
search the **Human G+T** set, i.e. the human genome plus curated and predicted RNAs.)
## PSI-BLAST
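Position-Specific Iterative BLAST (PSI-BLAST) is an extension of BLAST
which uses iteration to build a Position-Specific Scoring Matrix (PSSM) for each query.
In each round, the significant hits are aligned to the query and used to update the PSSM, which then serves as the query for the next round.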
![PSI BLAST Top](images/psiblast.png)
Because it uses information about sequence diversity and conservation, it's much better at detecting remote homologues.
PSI-BLAST can be run from the same page as a normal protein BLAST by selecting "PSI-BLAST" in the "Program selection" section.
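To make the idea of a PSSM more concrete, here is a deliberately simplified sketch that turns a small, gap-free alignment into log-odds scores per position. It assumes a uniform background frequency and a flat pseudocount. Real PSI-BLAST uses sequence weighting and more elaborate pseudocounts, so treat `build_pssm` and `score` as hypothetical illustrations only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a list of per-column log-odds dictionaries from a gap-free alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of segment against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment of four related 4-residue segments (made-up data).
alignment = ["MKVL", "MRVL", "MKIL", "MKVI"]
pssm = build_pssm(alignment)

for candidate in ("MKVL", "MRIL", "WWWW"):
    print(candidate, round(score(pssm, candidate), 2))
```

Residues that are conserved in a column get a high score at that position only. This lets the profile reward a match at a conserved site more than a match at a variable one, which a single substitution matrix cannot do.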
## HMMER
HMMER is a sequence similarity search tool that uses Hidden Markov Models
(HMMs) to search for similar nucleotide or protein sequences. In practice it
is used much like BLAST, although it is more sensitive, especially when we
use a multiple sequence alignment as input instead of just a single sequence.
Its main strength is in detecting remote homologues.
### Exercise 3: HMMER BLD10
As we saw earlier, standard BLAST is not able to detect a human homologue of the *C. reinhardtii* protein BLD10. In this short exercise we'll use HMMER to see if this more sensitive tool can detect it.
#### HMMER input
Navigate to the HMMER search page: [http://www.ebi.ac.uk/Tools/hmmer/](http://www.ebi.ac.uk/Tools/hmmer/)
HMMER takes as input a single nucleotide or protein sequence (accession numbers or identifiers are not accepted).
- **Step 1** Retrieve the amino acid sequence corresponding to the *C. reinhardtii* protein with accession EDP06416.1 from the NCBI database, and copy-paste it into the HMMER search box.
- **Step 2** Make sure **Reference proteomes** is selected as the **sequence database** to search in. (It should already be selected.)
- **Step 3** Hit **Search** to start searching.
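Step 1 can also be scripted. The sketch below builds a request URL for the NCBI E-utilities `efetch` service, which returns a protein record in FASTA format. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and the download itself needs network access, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="protein"):
    """Return the efetch URL that requests a FASTA record as plain text."""
    params = {"db": database, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, database="protein"):
    """Download the FASTA record for accession (requires network access)."""
    with urlopen(build_efetch_url(accession, database)) as response:
        return response.read().decode("utf-8")


# Print the URL only; opening it in a browser should show the FASTA record.
print(build_efetch_url("EDP06416.1"))
```

Calling `fetch_fasta("EDP06416.1")` should then return the text to paste into the HMMER search box.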
#### HMMER output
The HMMER output page contains all of the information on HMMER's detected hits across the reference sequence database.
- **Question 1** What is the evolutionary range of the significant hits detected by HMMER? Is this a plant-specific protein, eukaryotic, or does it exist in all major branches of life?
- **Question 2** Does BLD10 have a homologue in humans, and if so, what is it called?
## HHPred
**HHPred** is part of the **HH-suite** software package, and also uses Hidden
Markov Models to perform a more sensitive search than standard BLAST. HHPred
represents both the input **query** and **subject** sequences as HMMs (thus
the "HH" in HHPred).
Like HMMER, HHPred is one of the fastest and most sensitive tools that
currently exist.
### Exercise 4: HHPred
Find a protein sequence that interests you, or alternatively use one from this tutorial. Let's use HHPred to find similar sequences across the set of known species.
- **Step 1** Navigate to the HHPred search homepage: [https://toolkit.tuebingen.mpg.de/hhpred](https://toolkit.tuebingen.mpg.de/hhpred)
- **Question 1** Does the search for similar sequences using HHPred differ from using BLAST alone?
## Take home messages
There are many different ways to search for similar protein and nucleotide sequences. Although tools like HMMER and HHPred are very good at finding homologous genes across large evolutionary distances, this does not necessarily mean they are the best at everything. BLAST and its variants are simple, easy-to-use tools which may simply be better for the job at hand.
[ncbi_blast_guide]: https://www.ncbi.nlm.nih.gov/blast/BLAST_guide.pdf
[nature_top_100]: http://www.nature.com/news/the-top-100-papers-1.16224
[ncbi_blast_course]: https://www.ncbi.nlm.nih.gov/Class/BLAST/blast_course.short.html
## UniProt
1. Introduction
2. Swiss-Prot (curated) vs. TrEMBL (automated)
3. Cross-references and link-outs
   * OMIM, Domains, GO, ...
### Introduction
[**UniProt**](http://www.uniprot.org) is a protein database. Its focus is on proteins that have been observed experimentally, particularly by mass spectrometry. It has a dedicated team of curators working on high-quality annotation of these proteins and makes a great **central hub** for all protein-related information. It's my first stop for any protein or gene name.
Its protein focus sets it apart from genomic databases like [**Ensembl**](http://www.ensembl.org) and the [**UCSC Genome Browser**](http://genome.ucsc.edu), which focus on gene loci, transcripts, splicing, and predicting protein sequences from nucleic acids. Ensembl is more inclusive when it comes to splice variants, but its sequences are less rigorously validated than UniProt's and it contains almost no annotation for genes and proteins.
### Swiss-Prot (curated) vs. TrEMBL (automated)
UniProtKB consists of two subsets: Swiss-Prot (the hand-curated part) and TrEMBL, which contains automatically annotated sequences yet to be looked at by the curation team. There are currently around 550,000 annotated proteins in Swiss-Prot and around 70,000,000 in TrEMBL.
Historically, there was only Swiss-Prot, but as more sequencing data flooded in, a fast way of providing at least some annotation (e.g. by transferring annotation over from homologs) was needed, and so TrEMBL was added.
For the model organisms, especially human, mouse and yeast, the manual Swiss-Prot annotation is excellent and frequently updated by the curators as well as through automated pipelines. In these organisms, all proteins are now covered.
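When working with downloaded UniProtKB FASTA files, the two subsets can be told apart from the header line, which starts with `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below assumes the usual `db|accession|entry_name` header layout. `parse_uniprot_header` is a hypothetical helper, and the TrEMBL header is a made-up placeholder rather than a real entry.

```python
def parse_uniprot_header(header):
    """Split a UniProtKB FASTA header into database, accession and entry name."""
    first_field = header.lstrip(">").split(" ", 1)[0]  # e.g. "sp|P04637|P53_HUMAN"
    db_code, accession, entry_name = first_field.split("|")
    database = {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(db_code, "unknown")
    return {"database": database, "accession": accession, "entry_name": entry_name}


headers = [
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens",
    # Made-up placeholder to show the TrEMBL prefix, not a real entry:
    ">tr|A0A000XXX0|A0A000XXX0_HUMAN Uncharacterized protein OS=Homo sapiens",
]

for header in headers:
    print(parse_uniprot_header(header))
```

Filtering on the `database` field is a quick way to restrict an analysis to the hand-curated entries.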
### Cross-references and link-outs
UniProt incorporates and links out to a huge number of protein-related databases. It can be considered an authoritative resource that covers nearly all major information sources. The information includes:
- [**Function:**](http://www.uniprot.org/uniprot/P04637#function)
  - A curated paragraph or two describing the biology of the protein. Individual statements are backed up by literature references. Also includes Gene Ontology (GO) annotation.
- [**Names & Taxonomy:**](http://www.uniprot.org/uniprot/P04637#names_and_taxonomy)
  - Alternative gene and protein names that might be in use, as well as information on the organism.
- [**Subcellular location:**](http://www.uniprot.org/uniprot/P04637#subcellular_location)
  - Nuclear/cytoplasmic etc., also broken down into isoforms since they might differ in their localisation.
- [**Pathology & Biotech:**](http://www.uniprot.org/uniprot/P04637#pathology_and_biotech)
  - Disease-associated variants and somatic mutations from [OMIM](http://omim.org) etc.
- [**PTM / Processing:**](http://www.uniprot.org/uniprot/P04637#ptm_processing)
  - Phosphorylation sites etc. largely from mass spectrometry, as well as annotation on signal peptides and other segments that are cleaved off during protein maturation.
- [**Expression:**](http://www.uniprot.org/uniprot/P04637#expression)
  - Organismal tissues and cell types where the protein is expressed. Links out to the [Human Protein Atlas](http://www.proteinatlas.org) (HPA) which assays this using antibodies and transcriptomics, as well as other resources.
- [**Interaction:**](http://www.uniprot.org/uniprot/P04637#interaction)
  - Known protein-protein or protein-DNA/RNA interactions, sometimes including information on the protein regions involved.
- [**Structure:**](http://www.uniprot.org/uniprot/P04637#structure)
  - Information on 3D structures from [PDB](http://www.rcsb.org/pdb/) obtained by X-ray crystallography, NMR or cryo-electron microscopy, as well as the regions they cover.
- [**Family & Domains:**](http://www.uniprot.org/uniprot/P04637#family_and_domains)
  - Protein domains that might be catalytic or mediate things like protein-protein interactions from Pfam, InterPro, SMART etc.
- [**Sequences:**](http://www.uniprot.org/uniprot/P04637#sequences)
  - The FASTA sequence for the protein's canonical isoform as well as a selection of variants from alternative splicing, alternative promoter usage etc. Note that Ensembl is often more comprehensive when it comes to isoforms, but it doesn't cover proteolytic processing.
- [**Cross-references:**](http://www.uniprot.org/uniprot/P04637#cross_references)
  - A comprehensive list of all major external databases that provide additional information on the protein. Also repeats all resources that have been referenced in the previous sections.
- [**Entry information:**](http://www.uniprot.org/uniprot/P04637#entry_information)
  - Gives a brief history of the protein entry within UniProt: when it was created, last updated etc.
- [**Miscellaneous:**](http://www.uniprot.org/uniprot/P04637#miscellaneous)
  - Some additional terms such as which "[proteome](http://www.uniprot.org/help/reference_proteome)" release(s) this protein is part of. These are based on mass spectrometry and are available for more than just the curated model organisms. They can provide additional confidence e.g. for TrEMBL proteins predicted from nucleic acid sequences only.
- [**Similar proteins:**](http://www.uniprot.org/uniprot/P04637#similar_proteins)
  - Links to UniProt's "UniRef" clusters at 100%, 90% and 50% sequence identity. These clusters are useful mainly to reduce bias for protein groups with many members (paralogs) in bioinformatics studies by collapsing them. If you are looking for homologs, a better place is either the phylogenomic databases under "Family & Domains" such as eggNOG or, better yet, look up the gene on [Ensembl](http://www.ensembl.org), e.g. [here](http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000141510;r=17:7661779-7687550) for p53.
What's also fantastic is that most pieces of information UniProt displays link directly to the original PubMed article. Alternatively, the source information might show that something was transferred over from another organism (e.g. mouse to human).
### Nice to know
- The "feature viewer" is a little bit hidden: it's at the top of the sidebar on the left. It provides a clear overview of all features along a protein's sequence at a glance (and allows you to expand feature categories that interest you).
- Each protein has a readable "ID" (e.g. P53_HUMAN) and a more cryptic, but stable "accession" (e.g. P04637). The ID can change as more becomes known about a protein, similar to a gene name, while the accession is always kept the same. Therefore, if you are making e.g. a table for a paper, be sure to include the accessions.
- In Swiss-Prot, there is a [star rating](http://www.uniprot.org/help/annotation_score) for each protein which gives an indication of how much evidence there is that it exists in the form described.
- Important features that UniProt doesn't include yet:
  - Intrinsically disordered regions (can be obtained from [D2P2](http://d2p2.pro))
  - Many short linear motifs (can be obtained from [ELM](http://elm.eu.org))
- For natural variation and disease-causing variants:
  - Check out the 2-minute variation video on the UniProt YouTube channel (linked below).
  - Another fantastic new resource is [ExAC](http://exac.broadinstitute.org), the Exome Aggregation Consortium. They combined exome (protein-coding region) sequences from around 60,000 humans. It's by far the biggest resource on human sequence variants to date.
- There are some really nice short tutorial videos here: [UniProt YouTube channel](https://www.youtube.com/channel/UCkCR5RJZCZZoVTQzTYY92aw)
- If you are ever in doubt about a particular term or feature (like "accession"), the [Help section](http://www.uniprot.org/help/) is really concise and excellent.
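The difference between a readable ID and a stable accession, mentioned in the list above, can also be checked automatically, for example when cleaning up a table for a paper. The sketch below uses the accession pattern as given in the UniProt help pages (to the best of our knowledge) together with a simplified, assumed pattern for entry names. `classify_identifier` is a hypothetical helper.

```python
import re

# Accession pattern as documented in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified assumption: "NAME_SPECIES" with uppercase letters and digits only.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(identifier):
    """Return 'accession', 'entry name' or 'unknown' for a UniProt-style identifier."""
    if ACCESSION_RE.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(identifier):
        return "entry name"
    return "unknown"


for identifier in ("P04637", "P53_HUMAN", "Q7LBC6", "p53"):
    print(identifier, "->", classify_identifier(identifier))
```

Anything classified as an entry name is worth converting to its accession before publication, since only the accession is guaranteed to stay the same.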
### Individual examples
- A protein you're working on?
- Annotation transferred from one species to another, "By similarity":
  - KDM3B's function from [human](http://www.uniprot.org/uniprot/Q7LBC6#function) to [mouse](http://www.uniprot.org/uniprot/Q6ZPY7#function) (note the yellow annotation source tags)
- p53 in the feature viewer:
  - The normal UniProt view is one long page: [p53 (normal view)](http://www.uniprot.org/uniprot/P04637).
  - Alternatively, you can use the "feature viewer" to get a quick overview of what's going on in the protein: [p53 (feature viewer)](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer).
  - Where does [P53_HUMAN](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer) get post-translationally modified?
  - Where do its disease mutations happen?
- Two proteins from one precursor protein:
  - See if you can find the ghrelin and obestatin peptides within the [precursor protein](http://www.uniprot.org/uniprot/Q9UBU3#showFeaturesViewer)!
    - Hint hint: Click "Molecule Processing", or [here](http://www.uniprot.org/uniprot/Q9UBU3#ptm_processing)!
- Isoforms:
  - Lamin A:
    - Has a progeria-causing [pathogenic isoform](http://www.uniprot.org/uniprot/P02545#sequences) (number 6, see note), which is produced by unusual splicing if a disease-associated point mutation is present.
    - See also the "natural" variant at residue 608. Under references, it says: "[Recurrent de novo point mutations in lamin A cause Hutchinson-Gilford progeria syndrome.](https://www.ncbi.nlm.nih.gov/pubmed/12714972)"
  - Interleukin-33:
    - Has a constitutively active isoform (number 3): [IL33_HUMAN](http://www.uniprot.org/uniprot/O95760#sequences)
  - Ankyrin-1:
    - Has a muscle-specific isoform (Mu17): [ANK1_HUMAN](http://www.uniprot.org/uniprot/P16157#function)
- A protein with too many names:
  - [FOLH1_HUMAN](http://www.uniprot.org/uniprot/Q04609#names_and_taxonomy) (Glutamate carboxypeptidase 2, or N-acetylated-alpha-linked acidic dipeptidase I, or Prostate-specific membrane antigen, or Folate hydrolase 1, or Cell growth-inhibiting gene 27 protein). Good that they're all in here, no?
- Nice examples of comprehensive subcellular localisation annotation:
  - [AKP8L_HUMAN](http://www.uniprot.org/uniprot/Q9ULX6#subcellular_location) lists four papers that describe its localisation, colocalisation with other proteins, and potential shuttling in and out of the nucleus.
  - [AKA7A_HUMAN](http://www.uniprot.org/uniprot/O43687#subcellular_location) has two isoforms with different localisations.
- Trypsin-2 expression:
  - [TRY2_HUMAN](http://www.uniprot.org/uniprot/P07478#expression) is very tissue-specific.
  - Also check out its [entry](http://www.proteinatlas.org/ENSG00000275896-PRSS3P2/tissue) in the Human Protein Atlas (linked in UniProt) for some microscopy images as well as RNA sequencing data.
- Protein-protein interactions:
  - [HAND1_MOUSE](http://www.uniprot.org/uniprot/Q64279#interaction) is a transcription factor that needs to form a homodimer to work.
- Protein domains:
  - [KDM5C_HUMAN](http://www.uniprot.org/uniprot/P41229#family_and_domains) has at least 3 domains (JmjN, ARID and then JmjC).
  - We know that it's a [histone demethylase](http://www.uniprot.org/uniprot/P41229#function) acting on H3K4me2/3.
  - To find out what the domains do, let's follow the link out to Pfam:
    - Scroll down a bit to "Family and domain databases".
    - Click "[graphical view](http://pfam.xfam.org/protein/P41229)" in the Pfam section.
  - From there, we can check out the 3 domains UniProt mentioned:
    - [JmjN](http://pfam.xfam.org/family/JmjN): Nothing much seems to be known about this one except that it occurs N-terminally of JmjC.
    - [ARID](http://pfam.xfam.org/family/ARID): A DNA-binding domain.
    - [JmjC](http://pfam.xfam.org/family/JmjC): Looks like it might be a catalytic domain.
  - Pfam lists a few more domains than UniProt did, actually:
    - [zf-C5HC2](http://pfam.xfam.org/family/zf-C5HC2): A small "zinc-finger" domain that is thought to bind DNA as well.
    - [PLU-1](http://pfam.xfam.org/family/PLU-1): A larger domain that may also play a role in DNA binding, but not much is known about it.
    - [PHD](http://pfam.xfam.org/family/PHD): This one is incredibly well annotated on Pfam compared to the others. It is a very important epigenetic "reader" domain that is thought to specifically bind trimethylated lysines in many cases. Pfam mentions it occurs in over 100 human proteins, and that it might play a role in epigenetic cross-talk with H3K9 trimethylation.
  - We know that the [catalytic residues](http://www.uniprot.org/uniprot/P41229#function) are 514 (H), 517 (D) and 602 (H). Histidine and aspartate side chains commonly coordinate metal ions, and these residues apparently chelate an iron ion (Fe2+).
  - Looking at the [feature viewer](http://www.uniprot.org/uniprot/P41229#showFeaturesViewer), we can clearly see that these are indeed in the JmjC domain, making it the catalytic lysine demethylase domain.
Just let me know if you have any questions, ideas or comments. I'm Ben Lang from the Gibson Team ([lang@embl.de](mailto:lang@embl.de))! :)