Malvika Sharan authored
Multiple sequence alignment
A multiple sequence alignment (MSA) is a method for comparing three or more biological sequences (protein, DNA, or RNA) by aligning them against each other. In practice, the query sequences are usually assumed to share an evolutionary relationship (a common ancestor). From an MSA, the distances and similarities between the sequences can be inferred, which supports phylogenetic analysis, for example of evolutionary origins.
An MSA makes it possible to visualize conserved positions, which often carry functional relevance across species, as well as mutation events: insertions and deletions appear as hyphens (gaps) in one or more sequences of the alignment, while substitutions appear as differing residues in the same column. These events can be used to estimate the rate of evolution.
MSA is also used to define a protein family by assessing the sequence conservation of protein domains and of secondary and tertiary structure elements.
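To make the idea of conserved positions and gaps concrete, here is a minimal sketch that summarizes each column of an already aligned set of sequences. The function name `column_summary` and the toy sequences are illustrative assumptions, not real p53 data and not part of any tool mentioned here.

```python
from collections import Counter

def column_summary(aligned):
    """For each column, return (most common residue, its fraction, has_gap)."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*aligned):
        # Hyphens mark gaps (insertions/deletions), not residues.
        residues = [c for c in column if c != "-"]
        has_gap = len(residues) < len(column)
        if residues:
            residue, count = Counter(residues).most_common(1)[0]
            fraction = count / len(column)
        else:
            residue, fraction = "-", 0.0
        summary.append((residue, fraction, has_gap))
    return summary

# Toy alignment (made-up sequences, for illustration only).
toy = ["MEEPQSD", "MEE-QSD", "MDEPQSE"]
for position, (residue, fraction, has_gap) in enumerate(column_summary(toy), start=1):
    if fraction == 1.0:
        label = "conserved"
    elif has_gap:
        label = "gap"
    else:
        label = "variable"
    print(position, residue, f"{fraction:.2f}", label)
```

Columns where every sequence carries the same residue are reported as conserved; columns containing a hyphen point to a likely insertion or deletion, and the remaining columns to substitutions.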
external slide with comprehensive details on algorithm
Clustal Omega for multiple sequence alignment
Clustal Omega is the current version of the MSA tools in the Clustal series. It uses a progressive alignment heuristic to build the final MSA, beginning with the most similar pair of sequences and progressing to the most distantly related.
Progressive alignment combines the pairwise alignments in two stages. In the first stage, the relationships between the sequences are represented as a tree (clustering), called a guide tree. In the second stage, the MSA is built by adding the sequences one at a time to the growing MSA, in the order given by the guide tree.
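The guide tree stage can be sketched in a few lines. The example below is only an illustration of the clustering idea, not Clustal Omega's actual implementation: it assumes a crude k-mer based distance in place of real pairwise alignment scores, and uses plain average-linkage clustering. All names (`kmer_distance`, `guide_tree`) and the toy sequences are hypothetical.

```python
from itertools import combinations

def kmer_set(seq, k=2):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=2):
    """1 - Jaccard similarity of k-mer sets (a crude stand-in for an alignment score)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def guide_tree(seqs, k=2):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    clusters = {name: [name] for name in seqs}  # tree node -> member names
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(seqs, 2)
    }

    def cluster_distance(x, y):
        pairs = [(m, n) for m in clusters[x] for n in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters, as in the first progressive step.
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))

# Toy sequences (made up for illustration).
toy = {
    "A": "MEEPQSDPSV",
    "B": "MEEPQSDLSV",
    "C": "MTAMEESQSD",
    "D": "GGGGWWWWKK",
}
print(guide_tree(toy))
```

In this toy set, A and B are the most similar pair and are joined first; the unrelated sequence D is joined last. A progressive aligner would then align sequences and sub-alignments in that same order.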
Availability:
- Clustal Omega can be used via the web interface available at http://www.ebi.ac.uk/Tools/msa/clustalo/.
Input:
- It requires protein accession IDs or protein sequences in FASTA format.
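For reference, a FASTA record is a header line starting with `>` followed by one or more lines of sequence. The sketch below reads such text into (header, sequence) pairs; `parse_fasta` is a hypothetical helper, and the accessions and sequences in the example are made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up example input (not real database entries).
example = """>seq1 example protein [Mus musculus]
MEEPQSD
PSVEPPL
>seq2 example protein [Homo sapiens]
MEEPQSDLSV
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Sequence lines belonging to one record are joined, so line wrapping in the input does not affect the result.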
What substitution matrix/default parameters are used by Clustal Omega? Clustal Omega uses the HHalign algorithm and its default settings as its core alignment engine. The algorithm is described in Söding, J. (2005) 'Protein homology detection by HMM–HMM comparison'. Bioinformatics 21, 951-960. The default transition matrix is Gonnet, gap opening penalty is 6 bits, gap extension is 1 bit.
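To illustrate what separate gap opening and gap extension penalties mean, here is a minimal sketch of an affine gap cost. It assumes one common convention (the first gap position pays the opening penalty and every further position pays the extension penalty); this is an illustration of the concept only, not a description of how HHalign scores gaps internally.

```python
def affine_gap_cost(length, gap_open=6.0, gap_extend=1.0):
    """Cost of a single gap spanning `length` positions.

    Assumed convention: open once, then extend for each additional position.
    The defaults mirror the 6 / 1 values quoted above.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

for n in (1, 2, 5):
    print(n, affine_gap_cost(n))
```

Under this scheme one long gap is cheaper than several short ones of the same total length, which reflects the idea that a single insertion or deletion event can span many residues.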
HHalign: HHalign compares two alignments with each other by pairwise alignment of HMMs. It shows the optimal alignment and all significant non-overlapping suboptimal alignments. It also generates a dotplot for which the profile-profile column score is averaged over a window of variable size. If only one alignment is entered, this is compared to itself. Used in this way, HHalign is a very sensitive repeat-identification tool.
Examples:
To obtain example sequences, we will revisit our first NCBI session and follow these steps:
- Search for P53 proteins in NCBI
- Select the P53 protein from Mus musculus
- Run BLAST on this sequence to identify its homologs
- Randomly select 10 hits (avoid multiple sequences from the same species)
- View the GenPept report, then view the summary (top left) as FASTA (text)
These sequences will be the set of queries for your MSA.
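If the selected hits are saved as FASTA text, keeping a single sequence per species can also be done with a short script. The sketch below assumes NCBI-style protein headers that end with the species name in square brackets; the helper name `one_per_species` and the accessions shown are hypothetical placeholders.

```python
import re

# Species name in square brackets at the end of the header, e.g. "[Mus musculus]".
SPECIES_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")

def one_per_species(records):
    """Keep the first (header, sequence) record seen for each species.

    Returns a list of (species, sequence) tuples in input order.
    Records without a bracketed species name are skipped.
    """
    seen = set()
    selected = []
    for header, sequence in records:
        match = SPECIES_PATTERN.search(header)
        if match is None:
            continue
        species = match.group(1)
        if species not in seen:
            seen.add(species)
            selected.append((species, sequence))
    return selected

# Made-up records (placeholder accessions and sequences).
hits = [
    ("XX_000001.1 cellular tumor antigen p53 [Mus musculus]", "MEEPQSD"),
    ("XX_000002.1 cellular tumor antigen p53 [Rattus norvegicus]", "MEDSQSD"),
    ("XX_000003.1 tumor protein p53 isoform b [Mus musculus]", "MEEPQSE"),
]
for species, sequence in one_per_species(hits):
    # Writing the species as the header gives compact labels in the alignment.
    print(f">{species.replace(' ', '_')}\n{sequence}")
```

Using the species name as the FASTA header also keeps the labels short and readable in the alignment and tree views.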
Using Clustal Omega
- Select all the query sequences (optionally, you can edit the FASTA headers to keep only the species name)
- Go to the Clustal Omega web form and paste your query sequences
- Choose output format as 'clustal w/ numbers'
- Submit your query
- Browse your output result
- Show colors
- Phylogenetic tree
- Summary: Percent Identity Matrix
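To see what a percent identity matrix expresses, the sketch below computes pairwise identities from an existing alignment. It assumes identity is counted over columns where both sequences have a residue (gap columns are ignored); the exact definition used by Clustal Omega may differ, so the numbers are not guaranteed to match the web output. Function names and toy sequences are illustrative.

```python
def percent_identity(a, b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def identity_matrix(aligned):
    """Map every (name, name) pair to its percent identity."""
    names = list(aligned)
    return {(m, n): percent_identity(aligned[m], aligned[n]) for m in names for n in names}

# Toy alignment (made-up sequences).
toy = {"s1": "MEEPQSD", "s2": "MEE-QSD", "s3": "MDEPQSE"}
matrix = identity_matrix(toy)
for m in toy:
    print(m, " ".join(f"{matrix[(m, n)]:6.2f}" for n in toy))
```

The matrix is symmetric with 100 on the diagonal; higher off-diagonal values indicate more closely related sequences.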
Optional exercise: COBALT (NCBI)
COBALT is a tool for multiple sequence alignment, integrated into the NCBI resources for sequence analysis. It aligns sequences using conserved protein domains and local sequence similarities.
- Go back to your NCBI page of P53 BLAST result
- Click on multiple alignment
- Browse the result: phylogenetic tree
- Randomly select a few sequences and go to the GenPept page
- In the 'Analyse these sequences' section, select the option 'Align sequences with COBALT'
- Browse your output result: Phylogenetic tree