Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Commits on Source (146)
Showing
with 1321 additions and 13 deletions
......@@ -12,18 +12,15 @@ Why does Santa Cruz torture goats?
### Examples
1. FGF13
* Click through to the NBP2-45642 and see the validation images
2. Beta-Catenin
* Click on the buttons – What do you get?
* Click on the image – What do you get?
* Do all the antibodies give similar ICC images?
* Do most antibodies work for multiple methods?
3. Look up antibodies for your favourite proteins
* Is an antibody you use in the list?
* Do you think you have used the best antibody available for your purposes?
......
# [EMBOSS tools for sequence analysis](http://www.ebi.ac.uk/Tools/emboss/)
## [EMBOSS explorer](http://emboss.bioinformatics.nl/cgi-bin/emboss/)
###### Official [EMBOSS tutorial](http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html) written by [Gary Williams](http://emboss.sourceforge.net/docs/emboss_tutorial/node8.html)
**Why EMBOSS?**
- Open source
- Wide range of tools for sequence analysis
- Ideal for building workflows (command-line tools)
- Accesses remote databases conveniently
The official EMBOSS suite comprises over 150 programs that are available as command-line tools; only a few of them are offered as web-based applications.
The Wageningen Bioinformatics Webportal (Netherlands) offers [a graphical user interface to the EMBOSS suite](http://emboss.bioinformatics.nl/cgi-bin/emboss/), which we will use today for the hands-on session (more like a demo!).
## Quick Demo on EMBOSS tools
...but before that, re-use (or redo) the Clustal Omega analysis on your set of 10 p53 sequences (or scroll down this document to use my set of sequences ;) !).
- [extractalign](http://emboss.bioinformatics.nl/cgi-bin/emboss/extractalign)
- Switch to [Mview](http://www.ebi.ac.uk/Tools/msa/mview/) to visualize consensus
- Also check [alnviz](https://toolkit.tuebingen.mpg.de/alnviz), but don't dive into it today. We will cover such visualizations tomorrow.
- Create consensus with [cons](http://emboss.bioinformatics.nl/cgi-bin/emboss/cons)
- Also check [consambig](http://emboss.bioinformatics.nl/cgi-bin/emboss/consambig): while cons calculates a consensus sequence from a multiple sequence alignment, consambig builds an ambiguous consensus. At each position, the amino acid residues or nucleotides are compared against the possible ambiguity codes, and the consensus uses the minimum ambiguity code that matches. The ambiguity characters were designed to encode positional variations found among families of related genes. Useful for DNA sequences.
- Use [merger](http://emboss.bioinformatics.nl/cgi-bin/emboss/merger) to merge two overlapping sequences. It uses a global alignment algorithm (Needleman & Wunsch) to optimally align the sequences. A merged sequence is generated from the alignment and written to the output file. Also useful for DNA.
- [Dotmatcher](http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher) generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity between two sequences. All positions from the first input sequence are compared with all positions from the second input sequence using a specified substitution matrix.
- [plotcon](http://emboss.bioinformatics.nl/cgi-bin/emboss/plotcon)
- [prettyplot](http://emboss.bioinformatics.nl/cgi-bin/emboss/prettyplot): claims to present alignment with pretty formatting (?)
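
To make the idea behind an "ambiguous consensus" a bit more concrete, here is a minimal Python sketch. It is not the consambig implementation, just an illustration under simple assumptions: DNA only, gaps ignored, and every column collapsed to the standard IUPAC code that covers exactly the nucleotides seen there. The function name and the toy alignment are made up for this example.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguous_consensus(aligned_seqs):
    """Return one IUPAC code per alignment column (gap characters are ignored)."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        bases = frozenset(base.upper() for base in column if base != "-")
        if not bases:
            consensus.append("-")  # column contains only gaps
        else:
            # Anything outside A/C/G/T falls back to N in this sketch.
            consensus.append(IUPAC.get(bases, "N"))
    return "".join(consensus)


# Toy alignment (hypothetical data, not from the workshop).
toy_alignment = ["ACGTA", "ACGCA", "ATG-A"]
print(ambiguous_consensus(toy_alignment))
```

A plain majority consensus (closer in spirit to cons) would instead pick the most frequent residue per column; the ambiguity version keeps the information that several bases occur at a position.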
## Example proteins
For pairwise alignment tools, we can use human p53 and zebrafish tp53:
- Human p53: [P04637](http://www.uniprot.org/uniprot/P04637.fasta)
- Zebrafish tp53: [P79734](http://www.uniprot.org/uniprot/P79734.fasta)
````
>P53_HUMAN|P04637| Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>P53_DANRE|P79734| Cellular tumor antigen p53 OS=Danio rerio GN=tp53 PE=1 SV=1
MAQNDSQEFAELWEKNLIIQPPGGGSCWDIINDEEYLPGSFDPNFFENVLEEQPQPSTLP
PTSTVPETSDYPGDHGFRLRFPQSGTAKSVTCTYSPDLNKLFCQLAKTCPVQMVVDVAPP
QGSVVRATAIYKKSEHVAEVVRRCPHHERTPDGDNLAPAGHLIRVEGNQRANYREDNITL
RHSVFVPYEAPQLGAEWTTVLLNYMCNSSCMGGMNRRPILTIITLETQEGQLLGRRSFEV
RVCACPGRDRKTEESNFKKDQETKTMAKTTTGTKRSLVKESSSATLRPEGSKKAKGSSSD
EEIFTLQVRGRERYEILKKLNDSLELSDVVPASDAEKYRQKFMTKNKKENRESSEPKQGK
KLMVKDEGRSDSD
````
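
If you are curious what a global (Needleman & Wunsch) alignment does under the hood, the following Python sketch may help. It is deliberately simplified and rests on assumptions that differ from the EMBOSS tools: a flat match/mismatch score instead of a substitution matrix, and a linear gap penalty. The function names and scoring values are chosen for illustration only. The two test fragments are the first 30 residues of the human and zebrafish sequences above.

```python
def pair_score(x, y, match, mismatch):
    """Score a single aligned residue pair (simple identity scoring)."""
    return match if x == y else mismatch


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1], match, mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1], match, mismatch)):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


human_nterm = "MEEPQSDPSVEPPLSQETFSDLWKLLPENN"      # first 30 residues of P04637
zebrafish_nterm = "MAQNDSQEFAELWEKNLIIQPPGGGSCWDI"  # first 30 residues of P79734
total, top, bottom = needleman_wunsch(human_nterm, zebrafish_nterm)
print(total)
print(top)
print(bottom)
```

Real tools score amino acid pairs with a substitution matrix (e.g. BLOSUM62) and usually distinguish gap opening from gap extension, so their alignments will look different from this toy version.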
For dotmatcher we can use these sequences:
- ZKSC7_HUMAN: [Q9P0L1](http://www.uniprot.org/uniprot/Q9P0L1.fasta)
- MPDZ_HUMAN: [O75970](http://www.uniprot.org/uniprot/O75970.fasta)
````
>ZKSC7_HUMAN|Q9P0L1| Zinc finger protein with KRAB and SCAN domains 7 OS=Homo sapiens GN=ZKSCAN7 PE=1 SV=2
MTTAGRGNLGLIPRSTAFQKQEGRLTVKQEPANQTWGQGSSLQKNYPPVCEIFRLHFRQL
CYHEMSGPQEALSRLRELCRWWLMPEVHTKEQILELLVLEQFLSILPGELRTWVQLHHPE
SGEEAVAVVEDFQRHLSGSEEVSAPAQKQEMHFEETTALGTTKESPPTSPLSGGSAPGAH
LEPPYDPGTHHLPSGDFAQCTSPVPTLPQVGNSGDQAGATVLRMVRPQDTVAYEDLSVDY
TQKKWKSLTLSQRALQWNMMPENHHSMASLAGENMMKGSELTPKQEFFKGSESSNRTSGG
LFGVVPGAAETGDVCEDTFKELEGQTSDEEGSRLENDFLEITDEDKKKSTKDRYDKYKEV
GEHPPLSSSPVEHEGVLKGQKSYRCDECGKAFNRSSHLIGHQRIHTGEKPYECNECGKTF
RQTSQLIVHLRTHTGEKPYECSECGKAYRHSSHLIQHQRLHNGEKPYKCNECAKAFTQSS
RLTDHQRTHTGEKPYECNECGEAFIRSKSLARHQVLHTGKKPYKCNECGRAFCSNRNLID
HQRIHTGEKPYECSECGKAFSRSKCLIRHQSLHTGEKPYKCSECGKAFNQNSQLIEHERI
HTGEKPFECSECGKAFGLSKCLIRHQRLHTGEKPYKCNECGKSFNQNSHLIIHQRIHTGE
KPYECNECGKVFSYSSSLMVHQRTHTGEKPYKCNDCGKAFSDSSQLIVHQRVHTGEKPYE
CSECGKAFSQRSTFNHHQRTHTGEKSSGLAWSVS
>MPDZ_HUMAN|O75970| Multiple PDZ domain protein OS=Homo sapiens GN=MPDZ PE=1 SV=2
MLEAIDKNRALHAAERLQTKLRERGDVANEDKLSLLKSVLQSPLFSQILSLQTSVQQLKD
QVNIATSATSNIEYAHVPHLSPAVIPTLQNESFLLSPNNGNLEALTGPGIPHINGKPACD
EFDQLIKNMAQGRHVEVFELLKPPSGGLGFSVVGLRSENRGELGIFVQEIQEGSVAHRDG
RLKETDQILAINGQALDQTITHQQAISILQKAKDTVQLVIARGSLPQLVSPIVSRSPSAA
STISAHSNPVHWQHMETIELVNDGSGLGFGIIGGKATGVIVKTILPGGVADQHGRLCSGD
HILKIGDTDLAGMSSEQVAQVLRQCGNRVKLMIARGAIEERTAPTALGITLSSSPTSTPE
LRVDASTQKGEESETFDVELTKNVQGLGITIAGYIGDKKLEPSGIFVKSITKSSAVEHDG
RIQIGDQIIAVDGTNLQGFTNQQAVEVLRHTGQTVLLTLMRRGMKQEAELMSREDVTKDA
DLSPVNASIIKENYEKDEDFLSSTRNTNILPTEEEGYPLLSAEIEEIEDAQKQEAALLTK
WQRIMGINYEIVVAHVSKFSENSGLGISLEATVGHHFIRSVLPEGPVGHSGKLFSGDELL
EVNGITLLGENHQDVVNILKELPIEVTMVCCRRTVPPTTQSELDSLDLCDIELTEKPHVD
LGEFIGSSETEDPVLAMTDAGQSTEEVQAPLAMWEAGIQHIELEKGSKGLGFSILDYQDP
IDPASTVIIIRSLVPGGIAEKDGRLLPGDRLMFVNDVNLENSSLEEAVEALKGAPSGTVR
IGVAKPLPLSPEEGYVSAKEDSFLYPPHSCEEAGLADKPLFRADLALVGTNDADLVDEST
FESPYSPENDSIYSTQASILSLHGSSCGDGLNYGSSLPSSPPKDVIENSCDPVLDLHMSL
EELYTQNLLQRQDENTPSVDISMGPASGFTINDYTPANAIEQQYECENTIVWTESHLPSE
VISSAELPSVLPDSAGKGSEYLLEQSSLACNAECVMLQNVSKESFERTINIAKGNSSLGM
TVSANKDGLGMIVRSIIHGGAISRDGRIAIGDCILSINEESTISVTNAQARAMLRRHSLI
GPDIKITYVPAEHLEEFKISLGQQSGRVMALDIFSSYTGRDIPELPEREEGEGEESELQN
TAYSNWNQPRRVELWREPSKSLGISIVGGRGMGSRLSNGEVMRGIFIKHVLEDSPAGKNG
TLKPGDRIVEVDGMDLRDASHEQAVEAIRKAGNPVVFMVQSIINRPRKSPLPSLLHNLYP
KYNFSSTNPFADSLQINADKAPSQSESEPEKAPLCSVPPPPPSAFAEMGSDHTQSSASKI
SQDVDKEDEFGYSWKNIRERYGTLTGELHMIELEKGHSGLGLSLAGNKDRSRMSVFIVGI
DPNGAAGKDGRLQIADELLEINGQILYGRSHQNASSIIKCAPSKVKIIFIRNKDAVNQMA
VCPGNAVEPLPSNSENLQNKETEPTVTTSDAAVDLSSFKNVQHLELPKDQGGLGIAISEE
DTLSGVIIKSLTEHGVAATDGRLKVGDQILAVDDEIVVGYPIEKFISLLKTAKMTVKLTI
HAENPDSQAVPSAAGAASGEKKNSSQSLMVPQSGSPEPESIRNTSRSSTPAIFASDPATC
PIIPGCETTIEISKGRTGLGLSIVGGSDTLLGAIIIHEVYEEGAACKDGRLWAGDQILEV
NGIDLRKATHDEAINVLRQTPQRVRLTLYRDEAPYKEEEVCDTLTIELQKKPGKGLGLSI
VGKRNDTGVFVSDIVKGGIADADGRLMQGDQILMVNGEDVRNATQEAVAALLKCSLGTVT
LEVGRIKAGPFHSERRPSQSSQVSEGSLSSFTFPLSGSSTSESLESSSKKNALASEIQGL
RTVEMKKGPTDSLGISIAGGVGSPLGDVPIFIAMMHPTGVAAQTQKLRVGDRIVTICGTS
TEGMTHTQAVNLLKNASGSIEMQVVAGGDVSVVTGHQQEPASSSLSFTGLTSSSIFQDDL
GPPQCKSITLERGPDGLGFSIVGGYGSPHGDLPIYVKTVFAKGAASEDGRLKRGDQIIAV
NGQSLEGVTHEEAVAILKRTKGTVTLMVLS
````
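
The principle of a dotplot can be sketched in a few lines of Python. This is only an illustration and not how dotmatcher is implemented: it counts identical residues in a sliding window rather than summing substitution matrix scores, and the window size and threshold are arbitrary assumptions. The demo compares a short stretch of ZKSC7_HUMAN (copied from the sequence above) with itself, so repeated zinc finger motifs should show up as off-diagonal hits.

```python
def dotplot(seq_a, seq_b, window=7, threshold=6):
    """Return the set of (i, j) window starts where at least `threshold`
    of `window` aligned residues are identical."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if matches >= threshold:
                hits.add((i, j))
    return hits


def render(hits, len_a, len_b, window=7):
    """Draw the hits as a crude text dotplot: '*' for a hit, '.' otherwise."""
    rows = []
    for i in range(len_a - window + 1):
        rows.append("".join("*" if (i, j) in hits else "."
                            for j in range(len_b - window + 1)))
    return "\n".join(rows)


# Fragment of ZKSC7_HUMAN containing two zinc finger repeats.
fragment = "HTGEKPYECSECGKAFSRSKCLIRHQSLHTGEKPYKCSECGKAFNQNSQLIEHERI"
hits = dotplot(fragment, fragment)
print(render(hits, len(fragment), len(fragment)))
```

The main diagonal is the trivial self-match; any additional diagonal lines indicate repeated regions, which is exactly what makes dotplots useful for multi-domain proteins.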
### Set of P53 proteins:
**Raw sequences**
```
>Mus musculus
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPV
APAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFFQLAKTCPVQLWVSATPPAGSRVRAMAIY
KKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCM
GGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPL
DGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSLQPRAFQALIKEESPNC
>Rattus norvegicus
MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPNSMEDLFLPQDVAELLEGPEEALQVSAPAAQEPGTEAPAP
VAPASATPWPLSSSVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSSPQQKKKP
LDGEYFTLKIRGRERFEMFRELNEALELKDARAAEESGDSRAHSSLQPRTFQALIKKESPNC
>Mastomys natalensis
LPLSQETFQRLWKLLPPEAVLSEASPNSMDNMFLSPDVVNLLEGPEEALQVSAAPAAQDPVTETPAPAAPAPATPWPLSS
FVPSQKTYQGSYGFHLGFLQSGTAKSVMCTYSPSLNKLFCQLAKTCPVQLWVSDTPPAGSRVRAMAIYKKSQHMTEVVRR
CPHHERCTDGDGLAPPQHLIRVEGNLNAEYLDDKQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTII
TLEDSSGNLLGRDSFEVRICACPGRDRRTEEENFRKKEEPCPELPLGSAKRALPTGTSASPQQKKKRLDGEYFTLKIRGR
ERFEMFRELNEALELKDARAAEELGDSRAHSSYLKTKRGQSSSHHKKPMVKKVGPDSD
>Microtus ochrogaster
MEEPQSDLSIEPPLSQETFSDLWNLLPPNNVLSTSLSVDAMEDLFLSQDVANWLEEPNEGPQMSAAASTAEDPVTEAPAP
VTPAPVTSWPLSSSVPSQKTYQGEYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVSSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLRAEYLDDRQTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDPSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPRPELPVGSTKRVLPTNTSSPQPKKKPL
DGEYFTLKIRGRERFKMFSELNEALELKDAQDANGSGDSRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Nannospalax galili
MEEQQSDLSIEPPLSQETFSDLWKLLPQNNVLSTPLSPNSMEDLLLSPEDVANWLDDPDEALQVPAAAITGDPVTETSAP
VAPPPATPWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPPLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVKRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGELCPELPPGSTKRALPTGTSSSPQPKKKP
LDGEYFTLKIRGRERFEMFRELNEALELKDTQAEKDSGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Eospalaxbaileyi
MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTSLSPNSMEDLLLSAEDVANWLDDPDDALRMPAAPVTEDPATEASAP
VAPPPATPWPLSSSVPSQKTYQGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVIVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGESCPELPPGSTKRALPTDTSSSPQPKKKP
LLDGEYFTLKIRGRERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Eospalaxcansus
MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTSLSPNSMEDLLLSAEDVANWLDDPDDALRMPAAPVTEDPTTEASAP
VAPPPATPWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVACTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSC
MGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGESCPELPPGSTKRALPTGTSSSPQPKKKP
LLDGEYFTLKIRGRERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
>Cricetulus griseus
MEEPQSDLSIELPLSQETFSDLWKLLPPNNVLSTLPSSDSIEELFLSENVTGWLEDSGGALQGVAAAAASTAEDPVTETP
APVASAPATPWPLSSSVPSYKTFQGDYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVNSTPPPGTRVRAM
AIYKKLQYMTEVVRRCPHHERSSEGDSLAPPQHLIRVEGNLHAEYLDDKQTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRICACPGRDRRTEEKNFQKKGEPCPELPPKSAKRALPTNTSSSPPPKK
KTLDGEYFTLKIRGHERFKMFQELNEALELKDAQASKGSEDNGAHSSYLKSKKGQSASRLKKLMIKREGPDSD
>Oryctolagus cuniculus
MSATAQAGPGGSQEASDPAAAMEESQSDLSLEPPLSQETFSDLWKLLPENNLLTTSLNPPVDDLLSAEDVANWLNEDPEE
GLRVPAAPAPEAPAPAAPALAAPAPATSWPLSSSVPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCP
VQLWVDSTPPPGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSVVVPYEP
PEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPCPELPPG
SSKRALPTTTTDSSPQTKKKPLDGEYFILKIRGRERFEMFRELNEALELKDAQAEKEPGGSRAHSSYLKAKKGQSTSRHK
KPMFKREGPDSD
>Carlito syrichta
MEEPQSDLSIEPLSQETFSDLWKLLPENNVLSPSLSPPVDDLILSTEDIANWFSEGPDEALRTAPAPVAPTPAASTQAAP
APGTPWPLSSSVPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQ
SQYMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDKTTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGG
MNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENFRKKGEPCSELPPGSTKRALPTSTSSPSQPKKKPLDG
EYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHTSHLKSKKGQSTSRHKKLMFKREGPDSD
```
**Aligned by Clustal Omega**
```
CLUSTAL O(1.2.3) multiple sequence alignment
Cricetulus ---------------------MEEPQSDLSIELPLSQETFSDLWKLLPPNNVLSTL--PS
Carlito ---------------------MEEPQSDLSIE-PLSQETFSDLWKLLPENNVLSPS--LS
Microtus ---------------------MEEPQSDLSIEPPLSQETFSDLWNLLPPNNVLSTS--LS
Oryctolagus MSATAQAGPGGSQEASDPAAAMEESQSDLSLEPPLSQETFSDLWKLLPENNLLTTS--LN
Nannospalax ---------------------MEEQQSDLSIEPPLSQETFSDLWKLLPQNNVLSTP--LS
Eospalaxbaileyi ---------------------MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTS--LS
Eospalaxcansus ---------------------MEEPQSDLSIEPPLSQETFSDLWKLLPQNNVLSTS--LS
Mastomys --------------------------------LPLSQETFQRLWKLLPPEAVLSE---AS
Mus ------------------MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPS-----
Rattus ---------------------MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGS
*******. **:*** : :*
Cricetulus SDSIEELFL-SENVTGWLEDSGGALQGVAAAAASTAEDPVTETPAPVASAPATPWPLSSS
Carlito PP-VDDLILSTEDIANWFSEGPDE--ALRTAPAPV--APTPAASTQAAPAPGTPWPLSSS
Microtus VDAMEDLFL-SQDVANWLEEPNEG--PQMSAAASTAEDPVTEAPAPVTPAPVTSWPLSSS
Oryctolagus PP--VDDLLSAEDVANWLNEDPEE--GLRVPAAPAPEAPAPAAPALAAPAPATSWPLSSS
Nannospalax PNSMEDLLLSPEDVANWLD-DPDE--ALQVPAAAITGDPVTETSAPVAPPPATPWPLSSS
Eospalaxbaileyi PNSMEDLLLSAEDVANWLD-DPDD--ALRMPAAPVTEDPATEASAPVAPPPATPWPLSSS
Eospalaxcansus PNSMEDLLLSAEDVANWLD-DPDD--ALRMPAAPVTEDPTTEASAPVAPPPATPWPLSSS
Mastomys PNSMDNMFL-SPDVVNLLEGPEE---ALQVSAAPAAQDPVTETPAPAAPAPATPWPLSSF
Mus PHCMDDLLL-PQDVEEFFEGPSE---ALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSF
Rattus PNSMEDLFL-PQDVAELLEGPEE---ALQVS-APAAQEPGTEAPAPVAPASATPWPLSSS
: :* :: :. * * : .: * *****
Cricetulus VPSYKTFQGDYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVNSTPPPGTR
Carlito VPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTR
Microtus VPSQKTYQGEYGFRLGFLHSGTAKSVTCTYSPSLNKLFCQLAKTCPVQLWVSSTPPPGTR
Oryctolagus VPSQKTYHGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGSR
Nannospalax VPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPPLNKLFCQLAKTCPVQLWVDSTPPPGTR
Eospalaxbaileyi VPSQKTYQGNYGFRLGFLHSGTAKSVTCTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTR
Eospalaxcansus VPSQKTYQGSYGFRLGFLHSGTAKSVACTYSPCLNKLFCQLAKTCPVQLWVDSTPPPGTR
Mastomys VPSQKTYQGSYGFHLGFLQSGTAKSVMCTYSPSLNKLFCQLAKTCPVQLWVSDTPPAGSR
Mus VPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFFQLAKTCPVQLWVSATPPAGSR
Rattus VPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTR
*** **::*.***:****:******* **** ***:* ************ *** *:*
Cricetulus VRAMAIYKKLQYMTEVVRRCPHHERSSEGDSLAPPQHLIRVEGNLHAEYLDDKQTFRHSV
Carlito VRAMAIYKQSQYMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDKTTFRHSV
Microtus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLRAEYLDDRQTFRHSV
Oryctolagus VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDRNTFRHSV
Nannospalax VRAMAIYKKSQHMTEVVKRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Eospalaxbaileyi VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Eospalaxcansus VRAMAIYKKSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRAEYLDDKHTFRHSV
Mastomys VRAMAIYKKSQHMTEVVRRCPHHERCTDGDGLAPPQHLIRVEGNLNAEYLDDKQTFRHSV
Mus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSV
Rattus VRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSV
********: *:*****:*******.::.*.************* ***:*: ******
Cricetulus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRICA
Carlito VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Microtus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDPSGNLLGRNSFEVRVCA
Oryctolagus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Nannospalax VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Eospalaxbaileyi IVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Eospalaxcansus VVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCA
Mastomys VVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRICA
Mus VVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCA
Rattus VVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCA
:*******.**: *****:************************ *******:*****:**
Cricetulus CPGRDRRTEEKNFQKKGEPCPELPPKSAKRALPTNTSSS-PPPKKKTLDGEYFTLKIRGH
Carlito CPGRDRRTEEENFRKKGEPCSELPPGSTKRALPTSTSS-PSQPKKKPLDGEYFTLQIRGR
Microtus CPGRDRRTEEENFRKKGEPRPELPVGSTKRVLPTNTS--SPQPKKKPLDGEYFTLKIRGR
Oryctolagus CPGRDRRTEEENFRKKGEPCPELPPGSSKRALPTTTTDSSPQTKKKPLDGEYFILKIRGR
Nannospalax CPGRDRRTEEENFRKKGELCPELPPGSTKRALPTGTSSSPQPKKKP-LDGEYFTLKIRGR
Eospalaxbaileyi CPGRDRRTEEENFRKKGESCPELPPGSTKRALPTDTSSSPQPKKKPLLDGEYFTLKIRGR
Eospalaxcansus CPGRDRRTEEENFRKKGESCPELPPGSTKRALPTGTSSSPQPKKKPLLDGEYFTLKIRGR
Mastomys CPGRDRRTEEENFRKKEEPCPELPLGSAKRALPTGTSAS-PQQKKKRLDGEYFTLKIRGR
Mus CPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSAS-PPQKKKPLDGEYFTLKIRGR
Rattus CPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSS-PQQKKKPLDGEYFTLKIRGR
**********:**:** *** *:**.*** *: ** ****** *:***:
Cricetulus ERFKMFQELNEALELKDAQASKGSEDNGAHSSYLKSKKGQSASRLKKLMIKREGPDSD
Carlito ERFEMFRELNEALELKDAQAGKEPGGSRAHTSHLKSKKGQSTSRHKKLMFKREGPDSD
Microtus ERFKMFSELNEALELKDAQDANGSGDSRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Oryctolagus ERFEMFRELNEALELKDAQAEKEPGGSRAHSSYLKAKKGQSTSRHKKPMFKREGPDSD
Nannospalax ERFEMFRELNEALELKDTQAEKDSGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Eospalaxbaileyi ERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Eospalaxcansus ERFEMFRELNEALELKDAQAEKESGESRAHSSYLKSKKGQSTSRHKKLMIKREGPDSD
Mastomys ERFEMFRELNEALELKDARAAEELGDSRAHSSYLKTKRGQSSSHHKKPMVKKVGPDSD
Mus KRFEMFRELNEALELKDAHATEESGDSRAHSSLQPRAFQ--------ALIKEESPNC-
Rattus ERFEMFRELNEALELKDARAAEESGDSRAHSSLQPRTFQ--------ALIKKESPNC-
:**:** **********:: : . **:* :.*. .*:.
```
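
The `*` symbols in the Clustal output mark columns where all sequences carry the same residue. As a rough sketch (not the Clustal Omega code, and ignoring the `:` and `.` symbols for similar residues), this can be reproduced in Python. The three fragments are the first 20 columns of one alignment block above (the one starting with `VVPYEPPE`); the function names are invented for this example.

```python
def fully_conserved_columns(aligned):
    """Return 0-based indexes of columns where every sequence has the same
    residue and no sequence has a gap."""
    return [i for i, column in enumerate(zip(*aligned))
            if "-" not in column and len(set(column)) == 1]


def conservation_line(aligned):
    """Build a Clustal-like marker line using only '*' and spaces."""
    conserved = set(fully_conserved_columns(aligned))
    return "".join("*" if i in conserved else " " for i in range(len(aligned[0])))


block = [
    "VVPYEPPEVGSDCTTIHYNY",  # Cricetulus
    "VVPYEPPEVGSDYTTIHYKY",  # Mastomys
    "VVPYEPPEAGSEYTTIHYKY",  # Mus
]
for row in block:
    print(row)
print(conservation_line(block))
```

Because only three of the ten sequences are used here, more columns will appear conserved than in the full alignment.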
# Exploring the Human Protein Atlas
http://www.proteinatlas.org
Antibodies raised against most human proteins.
Rigorous purification protocol.
Used to stain human tissues and cells.
Key fact: Provides independent validation of cellular location and tissue distribution using commercial (or home produced) antibodies.
## Example proteins to explore the Atlas
1. A-kinase anchoring proteins are scaffolds for the PKA kinase.
- AKAP1
- AKAP4
- AKAP8
- AKAP12
### Questions
- Are these AKAPs all found in the same cell compartments and subcellular locations?
- What happens when you toggle the channels?
- Are these AKAPs found in all tissues?
- Are they highly expressed in cancer cells?
- Is there a PKA kinase cascade?
2. c-Myc transcription factor
### Questions
- Do all antibodies perform similarly? (Click for primary data for a summary).
- In the cerebral cortex are all cell types stained?
- Would we expect that for Myc?
- Which cells are stained in the placenta?
- Which cell compartments have Myc?
3. CTNNB1 – the key Wnt signalling transcription factor beta catenin
### Questions
- Which tissues don’t express any beta catenin?
- Which cancers don’t express any beta catenin?
- Is most of the beta catenin staining in the nucleus?
4. FGF13 - fibroblast growth factor 13
### Questions
- Is FGF13 strongly expressed in most cancer cells?
- Are there any tissues that don’t stain for FGF13?
- Growth factors are secreted and their receptors are on the cell surface: Which cellular compartments contain FGF13?
- Which cellular compartments in bronchial tissue contain FGF13?
5. Try your own favourite proteins!
Let us know if the images make sense to you…
File added
File added
## UniProt
1. Introduction
2. Swiss-Prot (curated) vs. TrEMBL (automated)
3. Cross-references and link-outs
* OMIM, Domains, GO, ...
### Introduction
[**UniProt**](http://www.uniprot.org) is a protein database. Its focus is on proteins that have been observed experimentally, particularly by mass spectrometry. It has a dedicated team of curators working on high-quality annotation of these proteins and makes a great **central hub** for all protein-related information. It's my first stop for any protein or gene name.
Its protein focus sets it apart from genomic databases like [**Ensembl**](http://www.ensembl.org) and the [**UCSC Genome Browser**](http://genome.ucsc.edu), which focus on gene loci, transcripts, splicing, and predicting protein sequences from nucleic acids. Ensembl is more inclusive when it comes to splice variants, but its sequences are less rigorously validated than UniProt's and it contains almost no annotation for genes and proteins.
### Swiss-Prot (curated) vs. TrEMBL (automated)
UniProtKB consists of two subsets: Swiss-Prot (the hand-curated part) and TrEMBL, which contains automatically annotated sequences yet to be looked at by the curation team. There are currently around 550,000 annotated proteins in Swiss-Prot and around 70,000,000 in TrEMBL.
Historically, there was only Swiss-Prot, but as more sequencing data flooded in, a fast way of providing at least some annotation (e.g. by transferring annotation over from homologs) was needed, and so TrEMBL was added.
For the model organisms, especially human, mouse and yeast, the manual Swiss-Prot annotation is excellent and frequently updated by the curators as well as through automated pipelines. In these organisms, all proteins are now covered.
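
When you download sequences, you can tell the two subsets apart from the FASTA header: standard UniProt headers start with `>sp|` for Swiss-Prot and `>tr|` for TrEMBL. The small Python sketch below parses such a header. Note that the FASTA headers elsewhere in these workshop notes were edited by hand and do not follow this layout; the function name is made up, and the second example header is entirely fictional.

```python
def parse_uniprot_header(header):
    """Split a standard UniProt FASTA header into database, accession and entry name."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    database, accession, rest = header[1:].split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    section = {"sp": "Swiss-Prot (reviewed)",
               "tr": "TrEMBL (unreviewed)"}.get(database, "unknown")
    return {"section": section, "accession": accession, "entry_name": entry_name}


swissprot_header = (">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
                    "OS=Homo sapiens GN=TP53 PE=1 SV=4")
# Fictional header, for illustration only (not a real UniProt entry).
trembl_header = ">tr|X0X0X0|X0X0X0_EXAMP Example protein OS=Example organism PE=4 SV=1"

for header in (swissprot_header, trembl_header):
    print(parse_uniprot_header(header))
```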
### Cross-references and link-outs
UniProt incorporates and links out to a huge number of protein-related databases. It can be considered an authoritative resource that covers nearly all major information sources. The information includes:
- [**Function:**](http://www.uniprot.org/uniprot/P04637#function)
- A curated paragraph or two describing the biology of the protein. Individual statements are backed up by literature references. Also includes Gene Ontology (GO) annotation.
- [**Names & Taxonomy:**](http://www.uniprot.org/uniprot/P04637#names_and_taxonomy)
- Alternative gene and protein names that might be in use, as well as information on the organism.
- [**Subcellular location:**](http://www.uniprot.org/uniprot/P04637#subcellular_location)
- Nuclear/cytoplasmic etc., also broken down into isoforms since they might differ in their localisation.
- [**Pathology & Biotech:**](http://www.uniprot.org/uniprot/P04637#pathology_and_biotech)
- Disease-associated variants and somatic mutations from [OMIM](http://omim.org) etc.
- [**PTM / Processing:**](http://www.uniprot.org/uniprot/P04637#ptm_processing)
- Phosphorylation sites etc. largely from mass spectrometry, as well as annotation on signal peptides and other segments that are cleaved off during protein maturation.
- [**Expression:**](http://www.uniprot.org/uniprot/P04637#expression)
- Organismal tissues and cell types where the protein is expressed. Links out to the [Human Protein Atlas](http://www.proteinatlas.org) (HPA) which assays this using antibodies and transcriptomics, as well as other resources.
- [**Interaction:**](http://www.uniprot.org/uniprot/P04637#interaction)
- Known protein-protein or protein-DNA/RNA interactions, sometimes including information on the protein regions involved.
- [**Structure:**](http://www.uniprot.org/uniprot/P04637#structure)
- Information on 3D structures from [PDB](http://www.rcsb.org/pdb/) obtained by X-ray crystallography, NMR or cryo-electron microscopy, as well as the regions they cover.
- [**Family & Domains:**](http://www.uniprot.org/uniprot/P04637#family_and_domains)
- Protein domains that might be catalytic or mediate things like protein-protein interactions from Pfam, InterPro, SMART etc.
- [**Sequences:**](http://www.uniprot.org/uniprot/P04637#sequences)
- The FASTA sequence for the protein's canonical isoform as well as a selection of variants from alternative splicing, alternative promoter usage etc. Note that Ensembl is often more comprehensive when it comes to isoforms, but it doesn't cover proteolytic processing.
- [**Cross-references:**](http://www.uniprot.org/uniprot/P04637#cross_references)
- A comprehensive list of all major external databases that provide additional information aspects on the protein. Also repeats all resources that have been referenced in the previous sections.
- [**Entry information:**](http://www.uniprot.org/uniprot/P04637#entry_information)
- Gives a brief history of the protein entry within UniProt: when it was created, last updated etc.
- [**Miscellaneous:**](http://www.uniprot.org/uniprot/P04637#miscellaneous)
- Some additional terms such as which "[proteome](http://www.uniprot.org/help/reference_proteome)" release(s) this protein is part of. These are based on mass spectrometry and are available for more than just the curated model organisms. They can provide additional confidence e.g. for TrEMBL proteins predicted from nucleic acid sequences only.
- [**Similar proteins:**](http://www.uniprot.org/uniprot/P04637#similar_proteins)
- Links to UniProt's "UniRef" clusters at 100%, 90% and 50% sequence identity. These clusters are useful mainly to reduce bias for protein groups with many members (paralogs) in bioinformatics studies by collapsing them. If you are looking for homologs, a better place is either the phylogenomic databases under "Family & Domains" such as eggNOG or, better yet, look up the gene on [Ensembl](http://www.ensembl.org), e.g. [here](http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000141510;r=17:7661779-7687550) for p53.
What's also fantastic is that most pieces of information UniProt displays link directly to the original PubMed article. Alternatively, the source information might show that something was transferred over from another organism (e.g. mouse to human).
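
All the section links in the list above follow the same pattern: the entry URL plus an anchor such as `#function`. If you want to generate such links for your own list of accessions, a minimal Python sketch could look like this. The helper names are invented, and the URL layout is simply the one used in this document, so it is an assumption that it stays valid if the website changes.

```python
BASE_URL = "http://www.uniprot.org/uniprot/"

# Section anchors as they appear in the links above.
SECTIONS = {
    "function", "names_and_taxonomy", "subcellular_location",
    "pathology_and_biotech", "ptm_processing", "expression", "interaction",
    "structure", "family_and_domains", "sequences", "cross_references",
    "entry_information", "miscellaneous", "similar_proteins",
}


def section_url(accession, section):
    """Build a link to one section of a UniProt entry page."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    return f"{BASE_URL}{accession}#{section}"


def fasta_url(accession):
    """Build the link to the FASTA sequence of an entry."""
    return f"{BASE_URL}{accession}.fasta"


for accession in ("P04637", "P79734"):
    print(section_url(accession, "function"))
    print(fasta_url(accession))
```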
### Nice to know
- The "feature viewer" is a little bit hidden: it's at the top of the sidebar on the left. It provides a clear overview of all features along a protein's sequence at a glance (and allows you to expand feature categories that interest you).
- Each protein has a readable "ID" (e.g. P53_HUMAN) and a more cryptic, but stable "accession" (e.g. P04637). The ID can change as more becomes known about a protein, similar to a gene name, while the accession is always kept the same. Therefore, if you are making e.g. a table for a paper, be sure to include the accessions.
- In Swiss-Prot, there is a [star rating](http://www.uniprot.org/help/annotation_score) for each protein which gives an indication of how much evidence there is that it exists in the form described.
- Important features that UniProt doesn't include yet:
- Intrinsically disordered regions (can be obtained from [D2P2](http://d2p2.pro))
- Many short linear motifs (can be obtained from [ELM](http://elm.eu.org))
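
Coming back to IDs versus accessions: if you have a mixed column of identifiers in a table, a quick Python check can tell you which is which. The accession pattern below is the regular expression given in the UniProt help pages; the entry name pattern is a simplified assumption (letters and digits, an underscore, then a species code of up to five characters), and the function name is made up.

```python
import re

# Accession format as documented in the UniProt help section.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified assumption for entry names ("IDs") such as P53_HUMAN.
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(text):
    """Return 'accession', 'entry name' or 'unknown' for a UniProt-style identifier."""
    if ACCESSION.fullmatch(text):
        return "accession"
    if ENTRY_NAME.fullmatch(text):
        return "entry name"
    return "unknown"


for identifier in ("P04637", "P53_HUMAN", "Q9P0L1", "ZKSC7_HUMAN", "p53"):
    print(identifier, "->", classify_identifier(identifier))
```

A check like this only validates the format; it cannot tell you whether the entry actually exists.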
For natural variation and disease-causing variants:
- Check out the 2-minute variation video on the UniProt YouTube channel (linked below)
- Another fantastic new resource is [ExAC](http://exac.broadinstitute.org), the Exome Aggregation Consortium. They combined exome (protein-coding region) sequences from around 60,000 humans. It's by far the biggest resource on human sequence variants to date.
- There are some really nice short tutorial videos here: [UniProt YouTube channel](https://www.youtube.com/channel/UCkCR5RJZCZZoVTQzTYY92aw)
- If you are ever in doubt about a particular term or feature (like "accession"), the [Help section](http://www.uniprot.org/help/) is really concise and excellent.
### Individual examples
- A protein you're working on?
- Annotation transferred from one species to another, "By similarity":
- KDM3B's function from [human](http://www.uniprot.org/uniprot/Q7LBC6#function) to [mouse](http://www.uniprot.org/uniprot/Q6ZPY7#function) (note the yellow annotation source tags)
- p53 in the feature viewer:
- The normal UniProt view is one long page: [p53 (normal view)](http://www.uniprot.org/uniprot/P04637).
- Alternatively, you can use the "feature viewer" to get a quick overview of what's going on in the protein: [p53 (feature viewer)](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer).
- Where does [P53_HUMAN](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer) get post-translationally modified?
- Where do its disease mutations happen?
- Two proteins from one precursor protein:
- See if you can find the ghrelin and obestatin peptides within the [precursor protein](http://www.uniprot.org/uniprot/Q9UBU3#showFeaturesViewer)!
- Hint hint: Click "Molecule Processing", or [here](http://www.uniprot.org/uniprot/Q9UBU3#ptm_processing)!
- Isoforms:
- Lamin A:
- Has a progeria-causing [pathogenic isoform](http://www.uniprot.org/uniprot/P02545#sequences) (number 6, see note), which is produced by unusual splicing if a disease-associated missense SNP is present.
- See also the "natural" variant at residue 608. Under references, it says: "[Recurrent de novo point mutations in lamin A cause Hutchinson-Gilford progeria syndrome.](https://www.ncbi.nlm.nih.gov/pubmed/12714972)"
- Interleukin-33:
- Has a constitutively active isoform (number 3): [IL33_HUMAN](http://www.uniprot.org/uniprot/O95760#sequences)
- Ankyrin-1:
- Has a muscle-specific isoform (Mu17): [ANK1_HUMAN](http://www.uniprot.org/uniprot/P16157#function)
- A protein with too many names:
- [FOLH1_HUMAN](http://www.uniprot.org/uniprot/Q04609#names_and_taxonomy) (Glutamate carboxypeptidase 2, or N-acetylated-alpha-linked acidic dipeptidase I, or Prostate-specific membrane antigen, or Folate hydrolase 1, or Cell growth-inhibiting gene 27 protein). Good that they're all in here, no?
- Nice examples of comprehensive subcellular localisation annotation:
- [AKP8L_HUMAN](http://www.uniprot.org/uniprot/Q9ULX6#subcellular_location) lists four papers that describe its localisation, colocalisation with other proteins, and potential shuttling in and out of the nucleus.
- [AKA7A_HUMAN](http://www.uniprot.org/uniprot/O43687#subcellular_location) has two isoforms with different localisations.
- Trypsin-2 expression:
- [TRY2_HUMAN](http://www.uniprot.org/uniprot/P07478#expression) is very tissue-specific.
- Also check out its [entry](http://www.proteinatlas.org/ENSG00000275896-PRSS3P2/tissue) in the Human Protein Atlas (linked in UniProt) for some microscopy images as well as RNA sequencing data.
- Protein-protein interactions:
- [HAND1_MOUSE](http://www.uniprot.org/uniprot/Q64279#interaction) is a transcription factor that needs to form a homodimer to work.
- Protein domains:
- [KDM5C_HUMAN](http://www.uniprot.org/uniprot/P41229#family_and_domains) has at least 3 domains (JmjN, ARID and then JmjC).
- We know that it's a [histone demethylase](http://www.uniprot.org/uniprot/P41229#function) acting on H3K4me2/3.
- To find out what the domains do, let's follow the link out to Pfam:
- Scroll down a bit to "Family and domain databases".
- Click "[graphical view](http://pfam.xfam.org/protein/P41229)" in the Pfam section.
- From there, we can check out the 3 domains UniProt mentioned:
- [JmjN](http://pfam.xfam.org/family/JmjN): Nothing much seems to be known about this one except that it occurs N-terminally of JmjC.
- [ARID](http://pfam.xfam.org/family/ARID): A DNA-binding domain.
- [JmjC](http://pfam.xfam.org/family/JmjC): Looks like it might be a catalytic domain.
- Pfam lists a few more domains than UniProt did, actually:
- [zf-C5HC2](http://pfam.xfam.org/family/zf-C5HC2): A small "zinc-finger" domain that is thought to bind DNA as well.
- [PLU-1](http://pfam.xfam.org/family/PLU-1): A larger domain that may also play a role in DNA binding, but not much is known about it.
- [PHD](http://pfam.xfam.org/family/PHD): This one is incredibly well annotated on Pfam compared to the others. It is a very important epigenetic "reader" domain that is thought to specifically bind trimethylated lysines in many cases. Pfam mentions it occurs in over 100 human proteins, and that it might play a role in epigenetic cross-talk with H3K9 trimethylation.
- We know that the [catalytic residues](http://www.uniprot.org/uniprot/P41229#function) are 514 (H), 517 (D) and 602 (H). Their side chains can coordinate metal ions, and apparently they chelate an iron ion (Fe2+).
- Looking at the [feature viewer](http://www.uniprot.org/uniprot/P41229#showFeaturesViewer), we can clearly see that these are indeed in the JmjC domain, making it the catalytic lysine demethylase domain.
If you have any questions, ideas or comments, just let me know: I'm Ben Lang from the Gibson Team ([lang@embl.de](mailto:lang@embl.de))! :)
|**Workshop**|**Protein bioinformatics for beginners**|
|----------|:-------------:|
|**Dates**|8 - 9 November|
|**Time**|09:30 - 17:00 hrs|
|**Venue**|ATC Computer lab, EMBL Heidelberg|
|**Trainers**|Toby Gibson, Marc Gouw, Michael Kuhn, Manjeet Kumar, Benjamin Lang, Malvika Sharan|
## List of resources that will be covered in this workshop
**Part-1 Protein databases and sequence analysis**
1. Protein databases:
- [Introduction to protein databases](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/ProteinBioinfo-MalvikaSharan.pdf): Malvika
- [Quick overview of NCBI](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/protein_database.md): Malvika
- [UniProt](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/UniProt.md): Ben
- Swissprot and Trembl
- Cross-references and link-outs
- OMIM, Domains, GO, ...
2. [Study of similar sequences](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/sequence_similarity/tutorial_text.md): Marc
- BLAST
- BLASTP, BLASTN & PSI-BLAST
- HMMER
- HHPred
3. [Multiple sequence alignments](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/multiple_sequence_alignment.md): Malvika
- Clustal omega (EMBL-EBI)
- COBALT (NCBI)
4. Other resources
- [Human Protein Atlas](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/HPRexercise.md): Toby
- [Antibodypedia](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/Antibodypedia.md): Toby
- [EMBOSS toolkits](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/EMBOSS_EBI.md): Malvika
- [EMBOSS explorer](http://emboss.bioinformatics.nl/cgi-bin/emboss/)
**Part-2 Protein structure analysis**
*Lecture (Toby):* Secondary vs tertiary structure vs protein complexes
1. Protein Structures - Toby
- Structure database: PDB at [RCSB](http://www.rcsb.org/pdb/home/home.do)
- [Structure visualization](https://docs.google.com/document/d/19gtIv5fqqkEP1sJyIaCzJzMrIPKTSnT0owmk093w2C8/pub)
- Chimera
2. Structure prediction - Malvika
- [Secondary and Tertiary structure prediction](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/tertiary_structure_pred.md)
3. Protein-protein interaction - Michael
- STRING and STITCH
- Intact
- MINT
4. [Domain databases](https://docs.google.com/document/d/1v7JM9i7yANHasTdpLFZIx_oKIx5K-t0S2o5GnDbXKD0/edit): Manjeet
- [SMART](http://smart.embl-heidelberg.de/)
- [Pfam](http://pfam.xfam.org/)
5. [Prediction of transmembrane helices in proteins](https://docs.google.com/document/d/1v7JM9i7yANHasTdpLFZIx_oKIx5K-t0S2o5GnDbXKD0/edit): Manjeet
- [TMHMM](http://www.cbs.dtu.dk/services/TMHMM/)
- [IUPRED](http://iupred.enzim.hu/) and [Anchor](http://anchor.enzim.hu/)
6. Intrinsically disordered region: Marc
- [ELM](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/elm.md)
- [DisProt](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/disprot.md)
7. Motif visualization:
- [Weblogo and MEME](https://git.embl.de/sharan/protein-bioinformatics-embl-hd/blob/master/TeachingMaterials/2016/motif_visualization.md) - Malvika
- [Jalview Alignment Viewer](https://docs.google.com/document/d/1Rd7KiqndSW3xqbW_GJc6gfU1dRkjoxR00gCi97F9VMU/pub) - Toby
### Important references
- http://www.sciencedirect.com/science/book/9788131222973
- http://molbiol-tools.ca/Protein_Chemistry.htm
- http://www.ebi.ac.uk/Tools/pfa/
- https://toolkit.tuebingen.mpg.de/
- http://emboss.sourceforge.net/
### [Post workshop survey](https://www.surveymonkey.de/r/2GCN32Q)
......@@ -9,24 +9,26 @@
**Part-1 Protein databases and sequence analysis**
1. Protein databases: Malvika and Ben
- NCBI (quick overview)
- UniProt
1. Protein databases:
- Introduction to protein databases
- Quick overview of NCBI
- UniProt - Ben
- Swissprot and Trembl
- Cross-references and link-outs
- OMIM, Domains, GO, ...
2. Study of similar sequences: Marc
2. Study of similar sequences - Marc
- BLAST
- BLASTp, BLASTn, PSI-BLAST, ...
- Diamond
- HMMER
- HHPred
3. Multiple sequence alignments: Malvika
- Muscle, Clustal omega, etc.
4. Other resources: Toby & Malvika
3. Multiple sequence alignments
- Clustal omega (EMBL-EBI)
- COBALT (NCBI)
4. Other resources
- Human Protein Atlas
- Antibodypedia
- EMBOSS toolkits (EBI)
- EMBOSS toolkits
- EMBOSS dot-plot
- dotmatcher
- Pepinfo
......
# DisProt
DisProt is a collection of manually curated disordered protein regions, and
contains over 800 entries. The DisProt homepage can be found here:
http://www.disprot.org/
## Exercise 1: Browsing DisProt
Navigate to the DisProt homepage, and then to the "Browse" section to browse
the database content.
- **Question 1:** How many Rabbit proteins are annotated in DisProt?
Find the DisProt entry for (human) **DNA topoisomerase 1**.
- **Question 2:** How many disordered regions exist in this protein?
- **Question 3:** Which method was used to determine that the region between
"175 - 214" is disordered?
# Short Linear Motifs
This text was largely adapted from a [tutorial written by Holger
Dinkel][elm_tutorial] for the [EMBO Practical Course on computational analysis
of protein-protein interactions][embo_course_ppi]
[elm_tutorial]: http://aidanbudd.github.io/course_EMBO_at_TGAC_PPI_Sep2015/trainingMaterial/holgerDinkel/linear_motifs/
[embo_course_ppi]: http://aidanbudd.github.io/course_EMBO_at_TGAC_PPI_Sep2015//
## Eukaryotic Linear Motifs
Eukaryotic Linear Motifs (ELMs), also known as Short Linear Motifs
(SLiMs), are short sequences, typically found in disordered regions, that have
important roles in the function of a protein.
## The ELM database
The [ELM database][elm] is a project whose ultimate goal is to catalogue all
occurrences of ELMs and their functions in all known proteins(!).
It consists of manually annotated entries carefully curated by experts in a
particular field, working on a certain protein or a particular motif. These
annotators are responsible for contributing ELM **classes**, which represent
linear motifs with a known function, and experimentally verified **instances**
of these motifs.
- **types**: ELM distinguishes 6 types of motifs, namely LIG (ligand binding),
MOD (modification), TRG (targeting), DOC (docking), DEG (degradation) and CLV (cleavage).
- **class**: a sequence pattern of amino acids with a given function, based on
binding partner, modifying enzyme, acting peptidase or targeted subcellular
localisation. Each **class** is defined by a **regular expression**.
- **instance**: a manually annotated occurrence of a **class** in a protein,
verified by an experiment that can be cited from the literature.
[elm]: http://elm.eu.org
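Since each class is defined by a regular expression, scanning a sequence for candidate instances amounts to a regex search. The sketch below is only an illustration: `TOY_PATTERN` is a simplified, assumed pattern (not necessarily the expression currently stored in ELM for any class), and `toy_sequence` is a made-up peptide rather than a real protein.

```python
import re

# Illustrative pattern only: an assumed, simplified docking-like expression,
# not necessarily the current definition of any ELM class.
TOY_PATTERN = r"[RK].L.{0,1}[FYLIVMP]"


def find_motif(pattern, sequence):
    """Return (start, end, matched_text) per match, with 1-based inclusive positions."""
    hits = []
    for match in re.finditer(pattern, sequence):
        # match.start() is 0-based; match.end() is exclusive, so it already
        # equals the 1-based inclusive end position.
        hits.append((match.start() + 1, match.end(), match.group()))
    return hits


# Made-up sequence that happens to contain the peptide KKLMF.
toy_sequence = "AAGSKKLMFAA"
print(find_motif(TOY_PATTERN, toy_sequence))
```

A regex match on its own is only a candidate: as described above, an **instance** additionally requires experimental evidence.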
## Exercise 1: Browsing content
There are two main ways in which the ELM database content can be browsed.
Click on "ELM DB" -> "ELM Classes", or follow the link to the ELM classes page:
http://elm.eu.org/elms to browse the ELM **classes** that have been annotated.
Use the search (or side filters) to find the ELM motif: **DOC_CYCLIN_1**
- **Question 1:** What does this motif do?
- **Question 2:** How many instances are annotated in the database?
- **Question 3:** Which Gene Ontology terms is this motif associated with?
This motif was identified in P53 in the sequence: **KKLMF**
- **Question 4:** What is the starting and finishing position of this sequence
in P53?
- **Question 5:** Which experimental protocols were used to infer the existence
of this instance?
- **Question 6:** How certain are we about this annotation?
- **Question 7:** What activates P53 in the pathway to induce apoptosis?
## Exercise 2: The ELM Prediction tool
Navigate to the "ELM predictions" page.
Search protein **SRC_HUMAN** (accession P12931) for ELMs using the following parameters:
- Cell Compartment: Not specified
- Motif Probability Cutoff: 100
- Context information: (leave blank)
Some questions:
- **Question 1:** How many instances do you find?
- **Question 2:** What can you say about the globularity of the protein? Does
it have globular and/or disordered regions?
Redo the above search, this time using the following parameters:
- Cell Compartment: cytosol
- Motif Probability Cutoff: 0.01
- Context information: Homo sapiens
Some questions:
- **Question 3:** How many instances do you find now?
- **Question 4:** How many of the instances are manually annotated?
- **Question 5:** Do the structural predictors/filters (SMART, GlobPlot,
IUPRED, Secondary Structure) agree in terms of which regions are
structured/disordered?
- **Question 6:** Compare the location of the annotated instances with
structural information at hand (IUPRED, Secondary Structure).
- **Question 7:** How many detected instances were removed by the
SMART/Structure filter?
- **Question 8:** For the annotated instances, which of the ELM classes require
a phosphorylation at a certain residue of the motif? (Hint: This information
can be found in the description of the ELM class)
- **Question 9:** Which residue in SRC_HUMAN corresponds to this and can you
find evidence for a phosphorylation of this residue (using Phospho.ELM)?
## Exercise 3: The ELM Prediction tool
Search ELM using the protein name **MDM4_HUMAN** and look for the ‘USP binding motif’ **DOC_USP7_MATH_1**
- **Question 1:** How many such motif instances are found in this protein sequence?
- **Question 2:** How many of these have been experimentally validated (i.e., are manually annotated), and what are the "FP" annotations?
## Exercise 4: Switches
Use the ELM "global search box" (on the top right) to search for the class
**LIG_SH3_2**. (Just start typing, and wait for the autocomplete to finish).
Click on "LIG_SH3_2" to visit the class page.
- **Question 1:** How many switches are annotated for this class?
- **Question 2:** What is the mechanism that results in the switching event in **SYNJ2_RAT**?
# Multiple sequence alignment
A multiple sequence alignment (MSA) is a method for the comparison of three or more biological sequences (protein, DNA, or RNA) by aligning them against each other. In practice, these query sequences would share an evolutionary relationship (common ancestor). With MSA the distances and similarities between the sequences can be inferred, which facilitates the analysis of phylogenetic association such as evolutionary origins.
An MSA makes it possible to visualize conserved positions that are functionally relevant across species, as well as mutation events such as substitutions, insertions and deletions (insertions and deletions appear as hyphens in one or more of the sequences in the alignment). This in turn allows the rate of evolution to be estimated.
MSA is used to define a protein family by assessing sequence conservation of protein domains, tertiary and secondary structures.
[PDF slides](https://git.embl.de/sharan/protein-bioinformatics-nov-2016/blob/master/TeachingMaterials/Multiple_Sequence_Alignment_slides.pdf)
[external slide with comprehensive details on algorithm](http://player.slideplayer.com/17/5286187/#)
## Hands-on session on [Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/) for multiple sequence alignment
Clustal Omega is the current version of the MSA tools from the Clustal series. It uses a progressive alignment heuristic to build the final MSA, beginning with the most similar pair and progressing to the most distantly related sequences.
The progressive alignment combines all the pairwise alignments in two stages: a first stage in which the relationships between the sequences are represented as a tree (clustering), called a guide tree, and a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
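The first (clustering) stage can be illustrated with a small sketch. It rests on several assumptions: the toy sequences are invented and have equal length, the distance is simply one minus the fraction of identical positions, and clusters are joined by average linkage. Clustal Omega's own guide-tree construction is more elaborate, so this shows only the general idea of joining the most similar sequences first.

```python
from itertools import combinations


def identity_distance(a, b):
    """Distance = 1 - fraction of identical positions (equal-length toy sequences)."""
    if len(a) != len(b):
        raise ValueError("toy distance expects equal-length sequences")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - same / len(a)


def average_distance(c1, c2, seqs):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(identity_distance(seqs[a], seqs[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_tree_order(seqs):
    """Greedy average-linkage clustering; returns the clusters in joining order."""
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and join them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(pair[0], pair[1], seqs))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented sequences: seqA and seqB differ at one position, seqD is the outlier.
toy = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQK",
    "seqC": "MKSAYLAKQR",
    "seqD": "MRSGFLGRNR",
}
for left, right in guide_tree_order(toy):
    print(left, "+", right)
```

In the second stage, a progressive aligner would follow this joining order, aligning the closest pair first and adding the more distant sequences afterwards.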
**Availability:**
- Clustal Omega can be used via the web interface available at http://www.ebi.ac.uk/Tools/msa/clustalo/.
**Input:**
- It requires protein accession IDs or protein sequences in FASTA format.
[Frequently asked questions](http://www.ebi.ac.uk/Tools/msa/clustalo/help/faq.html#1)
`What substitution matrix/default parameters are used by Clustal Omega?
Clustal Omega uses the HHalign algorithm and its default settings as its core alignment engine. The algorithm is described in Söding, J. (2005) 'Protein homology detection by HMM–HMM comparison'. Bioinformatics 21, 951-960.
The default transition matrix is Gonnet, gap opening penalty is 6 bits, gap extension is 1 bit.`
HHalign:
HHalign compares two alignments with each other by pairwise alignment of HMMs. It shows the optimal alignment and all significant non-overlapping suboptimal alignments. It also generates a dotplot for which the profile-profile column score is averaged over a window of variable size. If only one alignment is entered, this is compared to itself. Used in this way, HHalign is a very sensitive repeat-identification tool.
### Examples:
To obtain example sequences, we will revisit our first session on NCBI using the following instructions:
1. Search for P53 proteins in NCBI
2. Select the P53 protein from *Mus musculus*
3. Run BLAST on this sequence to identify its homologs
4. Randomly select 10 hits (avoid multiple sequences from same species)
5. View GenPept report, and view the summary (top left) as FASTA (text)
These sequences will be the set of queries for your MSA
### Using Clustal Omega
1. Select all the query sequences (optionally, you can edit each FASTA header to keep only the species name)
2. Go to the Clustal Omega web form and paste your query sequences
3. Choose output format as 'clustal w/ numbers'
4. Submit your query
5. Browse your output result
* Show colors
* Phylogenetic tree
* Summary: Percent Identity Matrix
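The optional header edit in step 1 can be done by hand, or scripted as in the sketch below. It assumes that each header ends with the species name in square brackets, as NCBI protein FASTA headers commonly do; the accession and sequence in the example are made up, and `species_headers` is a hypothetical helper name.

```python
import re


def species_headers(fasta_text):
    """Replace each FASTA header with the species name found in trailing [brackets]."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = re.search(r"\[([^\[\]]+)\]\s*$", line)
            if match:
                # Use underscores so the name survives tools that cut at spaces.
                out.append(">" + match.group(1).replace(" ", "_"))
            else:
                # Keep the original header when no bracketed species is present.
                out.append(line)
        else:
            out.append(line)
    return "\n".join(out)


# Made-up record in the assumed header layout.
example = ">XX_000001.1 example protein [Mus musculus]\nMEEPQSD"
print(species_headers(example))
```

Short species-only headers make the alignment and the tree easier to read, but they only stay unique if you picked one sequence per species, as suggested in the examples above.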
## Optional exercise: COBALT (NCBI)
COBALT is a tool for multiple sequence alignment, integrated in the NCBI resources for sequence analysis. It aligns sequences using conserved protein domains and local sequence similarities.
1. Go back to your NCBI page of P53 BLAST result
* Click on multiple alignment
* Browse the result: phylogenetic tree
2. Randomly select a few sequences and go to the GenPept page
* In the 'Analyse these sequences', select the option 'Align sequences with COBALT'
* Browse your output result: Phylogenetic tree
## List of a few other tools for MSA
1. [T-Coffee](http://www.tcoffee.org/)
2. [UGENE](http://ugene.net/)
3. [Phylo: interactive video game](http://phylo.cs.mcgill.ca/)
4. [MUSCLE](http://www.drive5.com/muscle/)
5. [MAFFT](http://mafft.cbrc.jp/alignment/software/)
6. [MAVID](http://baboon.math.berkeley.edu/mavid/)
## MSA and MSA-related tools at EMBL-EBI
Link: http://www.ebi.ac.uk/Tools/msa/
# Proteins
## Introduction
Proteins are macromolecules made up of long chains of amino acid residues of varying lengths, whose sequences can be inferred from the nucleotide sequences of the corresponding genes. Proteins are building blocks of our body and are involved in a wide range of biological functions within organisms, including DNA replication, catalysis of metabolic reactions, response to stimuli, and interaction with other biomolecules for pathway regulation, stability, transport, localization or degradation.
## Protein databases
A biological database is an organized collection of a particular type of data compiled from a large number of scientific publications and discoveries, for example biological sequences, different -omics (transcriptomics, proteomics, metagenomics) data, specific types of annotations, structural data, chemical compounds, biological pathways etc.
Protein databases contain entries for each protein sequence from all the known proteome sets. There are a few well-known protein databases, such as the National Center for Biotechnology Information Reference Sequence project, UniProtKB/Swiss-Prot and the DNA Data Bank of Japan Amino Acid Sequence Database.
Protein records are available mainly in text formats that include sequence entries as FASTA and their corresponding annotations in XML formats. The protein entries are generally linked to external resources, allowing users to find relevant data such as literature (Pubmed), genes (NCBI, GenBank database), biological pathways (KEGG database), structures (PDB database), corresponding DNA/RNA sequences, sequence homologs, and expression and variation data.
## Hands-on sessions on protein databases
#### 1. [National Center for Biotechnology Information - NCBI](https://www.ncbi.nlm.nih.gov/)
The NCBI interface provides access to several journals and bioinformatics resources.
In this course, we will use several protein related resources of NCBI.
###### Example proteins:
* **Tumor protein P53**: a tumor suppressor protein in human, the absence of which allows many cancers to proliferate.
###### Search method:
* Text/term search in [All fields] (simply type in your query)
* Limiting the search using [filters]
- Organism [ORGN]
- Source database
- Genetic component
- Bio-chemical/physical properties etc.
* Combining multiple search criteria by boolean AND, OR, NOT
* Browsing by taxonomy (right side of the screen)
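The search options above can be combined into a single query string. The sketch below only assembles that string; it does not contact NCBI. `build_query` is a hypothetical helper, and the exact spelling of the field tags (`[All Fields]`, `[ORGN]`) is assumed to match what the NCBI search box accepts.

```python
def build_query(term, organism=None, exclude=None):
    """Combine search criteria into one query string using AND / NOT."""
    parts = [f"{term}[All Fields]"]
    if organism:
        # Quote multi-word organism names so they are treated as one phrase.
        parts.append(f'"{organism}"[ORGN]')
    query = " AND ".join(parts)
    if exclude:
        query += f" NOT {exclude}[All Fields]"
    return query


# Example: P53 entries from mouse, leaving out records that mention "partial".
print(build_query("p53", organism="Mus musculus", exclude="partial"))
```

The printed string can be pasted into the NCBI protein search box and compared with the result of setting the same filters by clicking.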
###### Select one record of your choice
* Browse the GenPept entry
- Identical proteins
- FASTA entry
- Graphical representation of the features
- Other linked data
- Articles
- Pathways
- Reference sequences
- Homologs
- Related information
- Link-outs
- Analysis options (we will explore these later)
- BLAST
- Domains
- Sequence features
- Regular expression
- Tertiary structure
- Multiple alignment by COBALT
#### 2. [UniProt Knowledgebase](https://www.ebi.ac.uk/uniprot)
- Swissprot and Trembl
- Cross-reference
- Other resources for proteins
TeachingMaterials/2016/sequence_similarity/images/BLOSUM62.png

44.8 KiB

TeachingMaterials/2016/sequence_similarity/images/alignments.png

180 KiB

TeachingMaterials/2016/sequence_similarity/images/blosum.gif

28.2 KiB

TeachingMaterials/2016/sequence_similarity/images/descriptions.png

126 KiB

TeachingMaterials/2016/sequence_similarity/images/graphic.png

104 KiB

TeachingMaterials/2016/sequence_similarity/images/graphicsummary.png

34.1 KiB

TeachingMaterials/2016/sequence_similarity/images/programselection.jpg

62.1 KiB