Commit f2158aa6 authored by Hugo Carlos

Upload New File

parent 628077a8
## UniProt
1. Introduction
2. Swiss-Prot (curated) vs. TrEMBL (automated)
3. Cross-references and link-outs
    * OMIM, Domains, GO, ...
### Introduction
[**UniProt**](http://www.uniprot.org) is a protein database. Its focus is on proteins that have been observed experimentally, particularly by mass spectrometry. It has a dedicated team of curators working on high-quality annotation of these proteins and makes a great **central hub** for all protein-related information. It's my first stop for any protein or gene name.
Its protein focus sets it apart from genomic databases like [**Ensembl**](http://www.ensembl.org) and the [**UCSC Genome Browser**](http://genome.ucsc.edu), which focus on gene loci, transcripts, splicing, and predicting protein sequences from nucleic acids. Ensembl is more inclusive when it comes to splice variants, but its sequences are less rigorously validated than UniProt's and it contains almost no functional annotation for genes and proteins.
### Swiss-Prot (curated) vs. TrEMBL (automated)
UniProtKB consists of two subsets: Swiss-Prot (the hand-curated part) and TrEMBL, which contains automatically annotated sequences yet to be looked at by the curation team. There are currently around 550,000 annotated proteins in Swiss-Prot and around 70,000,000 in TrEMBL.
Historically, there was only Swiss-Prot, but as more sequencing data flooded in, a fast way of providing at least some annotation (e.g. by transferring annotation over from homologs) was needed, and so TrEMBL was added.
For the model organisms, especially human, mouse and yeast, the manual Swiss-Prot annotation is excellent and frequently updated by the curators as well as through automated pipelines. In these organisms, all proteins are now covered.
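If you download sequences, you can tell the two subsets apart from the FASTA header alone. Here is a minimal sketch, assuming the usual UniProtKB header layout in which the first field is `sp` for Swiss-Prot and `tr` for TrEMBL. The function name `classify_header` is my own and not part of any UniProt tool.

```python
import re

# Assumed UniProtKB FASTA header layout: >db|ACCESSION|ENTRY_NAME description...
# where db is "sp" (Swiss-Prot, reviewed) or "tr" (TrEMBL, unreviewed).
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)")


def classify_header(header):
    """Return the subset, accession and entry name (ID) encoded in a FASTA header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"not a UniProtKB-style FASTA header: {header!r}")
    db, accession, entry_name = match.groups()
    subset = "Swiss-Prot" if db == "sp" else "TrEMBL"
    return {"subset": subset, "accession": accession, "entry_name": entry_name}


# Header of the human p53 entry used throughout this page.
p53 = classify_header(
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
)
print(p53["subset"], p53["accession"], p53["entry_name"])
```

Filtering a downloaded FASTA file on this field is a quick way to restrict an analysis to the hand-curated entries.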
### Cross-references and link-outs
UniProt incorporates and links out to a huge number of protein-related databases. It can be considered an authoritative resource that covers nearly all major information sources. The information includes:
- [**Function:**](http://www.uniprot.org/uniprot/P04637#function)
    - A curated paragraph or two describing the biology of the protein. Individual statements are backed up by literature references. Also includes Gene Ontology (GO) annotation.
- [**Names & Taxonomy:**](http://www.uniprot.org/uniprot/P04637#names_and_taxonomy)
    - Alternative gene and protein names that might be in use, as well as information on the organism.
- [**Subcellular location:**](http://www.uniprot.org/uniprot/P04637#subcellular_location)
    - Nuclear/cytoplasmic etc., also broken down into isoforms since they might differ in their localisation.
- [**Pathology & Biotech:**](http://www.uniprot.org/uniprot/P04637#pathology_and_biotech)
    - Disease-associated variants and somatic mutations from [OMIM](http://omim.org) etc.
- [**PTM / Processing:**](http://www.uniprot.org/uniprot/P04637#ptm_processing)
    - Phosphorylation sites etc. largely from mass spectrometry, as well as annotation on signal peptides and other segments that are cleaved off during protein maturation.
- [**Expression:**](http://www.uniprot.org/uniprot/P04637#expression)
    - Organismal tissues and cell types where the protein is expressed. Links out to the [Human Protein Atlas](http://www.proteinatlas.org) (HPA) which assays this using antibodies and transcriptomics, as well as other resources.
- [**Interaction:**](http://www.uniprot.org/uniprot/P04637#interaction)
    - Known protein-protein or protein-DNA/RNA interactions, sometimes including information on the protein regions involved.
- [**Structure:**](http://www.uniprot.org/uniprot/P04637#structure)
    - Information on 3D structures from [PDB](http://www.rcsb.org/pdb/) obtained by X-ray crystallography, NMR or cryo-electron microscopy, as well as the regions they cover.
- [**Family & Domains:**](http://www.uniprot.org/uniprot/P04637#family_and_domains)
    - Protein domains that might be catalytic or mediate things like protein-protein interactions from Pfam, InterPro, SMART etc.
- [**Sequences:**](http://www.uniprot.org/uniprot/P04637#sequences)
    - The FASTA sequence for the protein's canonical isoform as well as a selection of variants from alternative splicing, alternative promoter usage etc. Note that Ensembl is often more comprehensive when it comes to isoforms, but it doesn't cover proteolytic processing.
- [**Cross-references:**](http://www.uniprot.org/uniprot/P04637#cross_references)
    - A comprehensive list of all major external databases that provide additional information on the protein. Also repeats all resources that have been referenced in the previous sections.
- [**Entry information:**](http://www.uniprot.org/uniprot/P04637#entry_information)
    - Gives a brief history of the protein entry within UniProt: when it was created, last updated etc.
- [**Miscellaneous:**](http://www.uniprot.org/uniprot/P04637#miscellaneous)
    - Some additional terms such as which "[proteome](http://www.uniprot.org/help/reference_proteome)" release(s) this protein is part of. These are based on completely sequenced genomes and are available for more than just the curated model organisms. They can provide additional context e.g. for TrEMBL proteins predicted from nucleic acid sequences only.
- [**Similar proteins:**](http://www.uniprot.org/uniprot/P04637#similar_proteins)
    - Links to UniProt's "UniRef" clusters at 100%, 90% and 50% sequence identity. These clusters are useful mainly in bioinformatics studies, where collapsing protein groups with many members (paralogs) reduces bias. If you are looking for homologs, a better place is either the phylogenomic databases under "Family & Domains" such as eggNOG or, better yet, the gene's page on [Ensembl](http://www.ensembl.org), e.g. [here](http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000141510;r=17:7661779-7687550) for p53.
What's also fantastic is that most pieces of information UniProt displays link directly to the original PubMed article. Alternatively, the source information might show that something was transferred over from another organism (e.g. mouse to human).
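All the section links above follow one pattern: the entry URL for an accession, plus a `#section` anchor. If you want to link to the same section for a whole list of proteins, something like the sketch below might help. It simply mirrors the link pattern used on this page; the helper name `section_url` is my own, and the UniProt website may change its URL scheme, so treat the pattern as an assumption.

```python
# Base URL and section anchors exactly as they appear in the links on this page.
BASE_URL = "http://www.uniprot.org/uniprot/"
SECTIONS = (
    "function",
    "names_and_taxonomy",
    "subcellular_location",
    "pathology_and_biotech",
    "ptm_processing",
    "expression",
    "interaction",
    "structure",
    "family_and_domains",
    "sequences",
    "cross_references",
    "entry_information",
    "miscellaneous",
    "similar_proteins",
)


def section_url(accession, section):
    """Build a link to one section of a UniProt entry page."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    return f"{BASE_URL}{accession}#{section}"


# Links to the "Function" section for two of the example proteins on this page.
for accession in ("P04637", "P41229"):
    print(section_url(accession, "function"))
```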
### Nice to know
- The "feature viewer" is a little bit hidden: it's at the top of the sidebar on the left. It provides a clear overview of all features along a protein's sequence at a glance (and allows you to expand feature categories that interest you).
- Each protein has a readable "ID" (e.g. P53_HUMAN) and a more cryptic, but stable "accession" (e.g. P04637). The ID can change as more becomes known about a protein, similar to a gene name, while the accession is always kept the same. Therefore, if you are making e.g. a table for a paper, be sure to include the accessions.
- Each entry has a [star rating](http://www.uniprot.org/help/annotation_score) (the "annotation score"), which gives a heuristic indication of how much annotation the entry contains.
- Important features that UniProt doesn't include yet:
    - Intrinsically disordered regions (can be obtained from [D2P2](http://d2p2.pro))
    - Many short linear motifs (can be obtained from [ELM](http://elm.eu.org))
- For natural variation and disease-causing variants:
    - Check out the 2-minute variation video on the UniProt YouTube channel (linked below).
    - Another fantastic new resource is [ExAC](http://exac.broadinstitute.org), the Exome Aggregation Consortium. They combined exome sequences (the protein-coding parts of the genome) from around 60,000 humans. It's by far the biggest resource on human sequence variants to date.
- There are some really nice short tutorial videos here: [UniProt YouTube channel](https://www.youtube.com/channel/UCkCR5RJZCZZoVTQzTYY92aw)
- If you are ever in doubt about a particular term or feature (like "accession"), the [Help section](http://www.uniprot.org/help/) is really concise and excellent.
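To make the ID vs. accession distinction from the list above concrete, here is a small sketch that guesses which of the two a string is. The accession pattern is the regular expression as I recall it from UniProt's help pages, so please treat it as an assumption and check it there. The ID pattern is my own loose simplification (a short name, an underscore, a short species code), and `classify_identifier` is a made-up helper name.

```python
import re

# Accession pattern as I recall it from the UniProt help pages (assumption).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Loose approximation of a readable ID such as P53_HUMAN (my simplification).
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(identifier):
    """Guess whether a string is a stable accession or a readable ID."""
    if ACCESSION_RE.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(identifier):
        return "ID"
    return "unknown"


# Check the identifier column of a table before it goes into a paper.
for identifier in ("P04637", "P53_HUMAN", "Q7LBC6", "p53"):
    print(identifier, "->", classify_identifier(identifier))
```

Anything that comes back as "ID" or "unknown" is worth replacing with the accession, since only the accession is guaranteed to stay the same.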
### Individual examples
- A protein you're working on?
- Annotation transferred from one species to another, "By similarity":
    - KDM3B's function from [human](http://www.uniprot.org/uniprot/Q7LBC6#function) to [mouse](http://www.uniprot.org/uniprot/Q6ZPY7#function) (note the yellow annotation source tags)
- p53 in the feature viewer:
    - The normal UniProt view is one long page: [p53 (normal view)](http://www.uniprot.org/uniprot/P04637).
    - Alternatively, you can use the "feature viewer" to get a quick overview of what's going on in the protein: [p53 (feature viewer)](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer).
    - Where does [P53_HUMAN](http://www.uniprot.org/uniprot/P04637#showFeaturesViewer) get post-translationally modified?
    - Where do its disease mutations happen?
- Two proteins from one precursor protein:
    - See if you can find the ghrelin and obestatin peptides within the [precursor protein](http://www.uniprot.org/uniprot/Q9UBU3#showFeaturesViewer)!
        - Hint hint: Click "Molecule Processing", or [here](http://www.uniprot.org/uniprot/Q9UBU3#ptm_processing)!
- Isoforms:
    - Lamin A:
        - Has a progeria-causing [pathogenic isoform](http://www.uniprot.org/uniprot/P02545#sequences) (number 6, see note), which is produced by unusual splicing if a disease-associated point mutation is present.
        - See also the "natural" variant at residue 608. Under references, it says: "[Recurrent de novo point mutations in lamin A cause Hutchinson-Gilford progeria syndrome.](https://www.ncbi.nlm.nih.gov/pubmed/12714972)"
    - Interleukin-33:
        - Has a constitutively active isoform (number 3): [IL33_HUMAN](http://www.uniprot.org/uniprot/O95760#sequences)
    - Ankyrin-1:
        - Has a muscle-specific isoform (Mu17): [ANK1_HUMAN](http://www.uniprot.org/uniprot/P16157#function)
- A protein with too many names:
    - [FOLH1_HUMAN](http://www.uniprot.org/uniprot/Q04609#names_and_taxonomy) (Glutamate carboxypeptidase 2, or N-acetylated-alpha-linked acidic dipeptidase I, or Prostate-specific membrane antigen, or Folate hydrolase 1, or Cell growth-inhibiting gene 27 protein). Good that they're all in here, no?
- Nice examples of comprehensive subcellular localisation annotation:
    - [AKP8L_HUMAN](http://www.uniprot.org/uniprot/Q9ULX6#subcellular_location) lists four papers that describe its localisation, colocalisation with other proteins, and potential shuttling in and out of the nucleus.
    - [AKA7A_HUMAN](http://www.uniprot.org/uniprot/O43687#subcellular_location) has two isoforms with different localisations.
- Trypsin-2 expression:
    - [TRY2_HUMAN](http://www.uniprot.org/uniprot/P07478#expression) is very tissue-specific.
    - Also check out its [entry](http://www.proteinatlas.org/ENSG00000275896-PRSS3P2/tissue) in the Human Protein Atlas (linked in UniProt) for some microscopy images as well as RNA sequencing data.
- Protein-protein interactions:
    - [HAND1_MOUSE](http://www.uniprot.org/uniprot/Q64279#interaction) is a transcription factor that needs to form a homodimer to work.
- Protein domains:
    - [KDM5C_HUMAN](http://www.uniprot.org/uniprot/P41229#family_and_domains) has at least 3 domains (JmjN, ARID and then JmjC).
    - We know that it's a [histone demethylase](http://www.uniprot.org/uniprot/P41229#function) acting on H3K4me2/3.
    - To find out what the domains do, let's follow the link out to Pfam:
        - Scroll down a bit to "Family and domain databases".
        - Click "[graphical view](http://pfam.xfam.org/protein/P41229)" in the Pfam section.
    - From there, we can check out the 3 domains UniProt mentioned:
        - [JmjN](http://pfam.xfam.org/family/JmjN): Nothing much seems to be known about this one except that it occurs N-terminally of JmjC.
        - [ARID](http://pfam.xfam.org/family/ARID): A DNA-binding domain.
        - [JmjC](http://pfam.xfam.org/family/JmjC): Looks like it might be a catalytic domain.
    - Pfam lists a few more domains than UniProt did, actually:
        - [zf-C5HC2](http://pfam.xfam.org/family/zf-C5HC2): A small "zinc-finger" domain that is thought to bind DNA as well.
        - [PLU-1](http://pfam.xfam.org/family/PLU-1): A larger domain that may also play a role in DNA binding, but not much is known about it.
        - [PHD](http://pfam.xfam.org/family/PHD): This one is incredibly well annotated on Pfam compared to the others. It is a very important epigenetic "reader" domain that is thought to specifically bind trimethylated lysines in many cases. Pfam mentions it occurs in over 100 human proteins, and that it might play a role in epigenetic cross-talk with H3K9 trimethylation.
    - We know that the [catalytic residues](http://www.uniprot.org/uniprot/P41229#function) are 514 (H), 517 (D) and 602 (H). Histidine and aspartate side chains can coordinate metal ions, and here they apparently chelate an iron ion (Fe2+).
    - Looking at the [feature viewer](http://www.uniprot.org/uniprot/P41229#showFeaturesViewer), we can clearly see that these are indeed in the JmjC domain, making it the catalytic lysine demethylase domain.
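The last step of the KDM5C example, checking by eye that the catalytic residues sit inside a domain, is just an interval test. Here is a minimal sketch of it. The residue positions are the ones quoted above, but the domain boundaries in the code are placeholders I made up for illustration, not the real JmjC coordinates. Read the actual start and end from the feature viewer before drawing any conclusion. The helper name `residues_within` is also my own.

```python
def residues_within(positions, start, end):
    """Return the positions that fall inside the inclusive range start..end."""
    return [p for p in positions if start <= p <= end]


# Catalytic residue positions for KDM5C as quoted in the example above.
CATALYTIC_RESIDUES = [514, 517, 602]

# PLACEHOLDER boundaries, invented for illustration only. Replace them with
# the JmjC start and end shown in the UniProt feature viewer.
DOMAIN_START, DOMAIN_END = 450, 650

inside = residues_within(CATALYTIC_RESIDUES, DOMAIN_START, DOMAIN_END)
print(f"{len(inside)} of {len(CATALYTIC_RESIDUES)} residues fall inside the range")
```

The same check works for any feature with a start and end, for example to see which phosphorylation sites land in a particular domain.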
Just let me know if you have any questions, ideas or comments. I'm Ben Lang from the Gibson Team ([lang@embl.de](mailto:lang@embl.de))! :)