## OpenTargets data retrieval
Sudeep Sahadevan's avatar
Sudeep Sahadevan committed

The following steps were used to extract Mendelian variant associations from [OpenTargets](https://www.targetvalidation.org/) data.

### Environment set up

Download and install a recent version of the [Anaconda](https://docs.anaconda.com/anaconda/install/) environment manager.  
This folder includes a YAML config file, `OpenTargets.yaml`, which lists all the dependencies needed by the script.  
Create and activate a new environment with:

```bash
conda env create -f OpenTargets.yaml
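# activate by environment name (the "name:" field in OpenTargets.yaml), not by file name;
# the name "OpenTargets" is assumed here
conda activate OpenTargets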
```  

Preferred OS: Unix/Linux

### Usage

#### General
The `scripts` directory contains a Python script, `ot_parser.py`, which was used to parse the OpenTargets data dump for Mendelian disease association data.  

Use
```shell
python scripts/ot_parser.py -h
```
for a description of the available commands, required input files, and parameters.


### Mendelian disease associations

Use the subcommand `variant` to parse Mendelian disease associations:
```shell
python scripts/ot_parser.py variant -h
```

**Additional information**

For the current analysis, the OpenTargets evidence dump from February 2020 (release 20.02) was used (available [here](https://storage.googleapis.com/open-targets-data-releases/20.02/output/20.02_evidence_data.json.gz)).

Gene annotations were taken from GENCODE release 27, available [here](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gff3.gz).

**Parse somatic mutation variants**

```bash
python scripts/ot_parser.py variant --in /path/to/20.02_evidence_data.json.gz --associationScore 0.2 --type somatic_mutation --mendelian --email mail@mail.com --assembly GRCh38 --gff /path/to/gencode.v27.annotation.gff3.gz --out mendelian_somatic_mutations.bed
```

**Parse genetic association variants**

```bash
python scripts/ot_parser.py variant --in /path/to/20.02_evidence_data.json.gz --associationScore 0.2 --type genetic_association --mendelian --email mail@mail.com --assembly GRCh38 --gff /path/to/gencode.v27.annotation.gff3.gz --out mendelian_genetic_association.bed 
```
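
The two commands above differ only in `--type`, and both apply a score cutoff with `--associationScore`. As a rough illustration of that selection step, here is a minimal sketch. It is not the code in `ot_parser.py`. The function name `iter_evidence`, the one-JSON-object-per-line layout, the field names `type` and `scores.association_score`, and the inclusive `>=` comparison are all assumptions. The sketch does not cover the `--mendelian` filter or the coordinate lookup against the GFF file.

```python
import gzip
import json
from typing import Iterator


def iter_evidence(path: str, evidence_type: str, min_score: float) -> Iterator[dict]:
    """Yield evidence records of one type with association score >= min_score.

    Assumed (not confirmed) layout: one JSON object per line, shaped like
    {"type": "...", "scores": {"association_score": 0.5}, ...}.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("type") != evidence_type:
                continue  # wrong evidence type
            score = record.get("scores", {}).get("association_score")
            if score is not None and score >= min_score:
                yield record
```

Streaming the file line by line avoids loading the whole evidence dump into memory.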

**Combine both outputs**

```bash
cat mendelian_somatic_mutations.bed mendelian_genetic_association.bed | sort -k 1,1 -k2,2n -k3,3n > mendelian_somatic_mutations_genetic_associations_combined.bed
```
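
For environments without `cat` and `sort`, the same merge can be sketched in Python. This is an illustrative equivalent, not part of the repository. The function name `combine_bed` is hypothetical, and the sketch assumes tab-separated BED lines with no `track` or `browser` header lines. Chromosome names are compared by code point, which matches `sort` under `LC_ALL=C`; other locales may order them differently.

```python
from typing import Iterable, List


def combine_bed(*sources: Iterable[str]) -> List[str]:
    """Concatenate BED lines and sort by chromosome, then start, then end."""
    rows = []
    for source in sources:
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore empty lines
            fields = line.split("\t")
            # chromosome as text, start and end as integers; the full line
            # is kept as the final tie-breaker
            rows.append((fields[0], int(fields[1]), int(fields[2]), line))
    rows.sort()
    return [row[3] for row in rows]
```

Open file handles can be passed directly, for example `combine_bed(open("mendelian_somatic_mutations.bed"), open("mendelian_genetic_association.bed"))`.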

### Parse Experimental Factor Ontology (EFO) ID to name mappings

Use the subcommand `obo_id_name` to parse EFO ID to name mappings from the EFO `.obo` file:

```shell
python scripts/ot_parser.py obo_id_name -h
```
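
For orientation, extracting an id to name map from an `.obo` file can be sketched as follows. This is a minimal illustration, not the code in `ot_parser.py`. The function name `obo_id_to_name` is hypothetical. The sketch assumes standard OBO stanzas in which each `[Term]` block has an `id:` line followed by a `name:` line, and it ignores other stanza types such as `[Typedef]`.

```python
from typing import Dict, Iterable


def obo_id_to_name(lines: Iterable[str]) -> Dict[str, str]:
    """Map term id to term name for every [Term] stanza in OBO-formatted lines."""
    mapping: Dict[str, str] = {}
    in_term = False
    term_id = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # a new stanza starts; only [Term] stanzas are of interest
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id and line.startswith("name: "):
            mapping[term_id] = line[len("name: "):]
    return mapping
```

The function accepts any iterable of lines, so an open file handle works: `obo_id_to_name(open("/path/to/efo.obo", encoding="utf-8"))`.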

**Additional information**  
The Experimental Factor Ontology used in this analysis is version 3.17.1 (`data-version: 3.17.1`).  
The ontology in .owl format is available from the source [here](http://www.ebi.ac.uk/efo/releases/v3.17.1/efo.owl), and the .obo formatted file is available from the [EBISPOT](https://github.com/EBISPOT) GitHub repository [here](https://github.com/EBISPOT/efo/releases/download/v3.17.1/efo.obo). OpenTargets therapeutic area annotations are missing from the ontology but are present in the disease ID to name mappings, which can be downloaded from [here](https://storage.googleapis.com/open-targets-data-releases/20.02/output/20.02_disease_list.csv.gz).

**Generate the ID to name mapping file**
```shell
python scripts/ot_parser.py obo_id_name --efo /path/to/efo.obo --out efo_id_name_map.csv
# download the disease id to name file, strip double quotes, and keep the first two columns (tab-separated)
wget -O - https://storage.googleapis.com/open-targets-data-releases/20.02/output/20.02_disease_list.csv.gz | gunzip -c | perl -pe 's/\"//g' | awk 'BEGIN {FS=",";OFS="\t"}{print $1,$2}' > disease_id_name_map.csv
# merge both mappings and drop duplicate lines
cat disease_id_name_map.csv efo_id_name_map.csv | sort -u > OpenTargets_id_name_map.csv
```
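
The `perl` and `awk` steps above strip all double quotes and then split on every comma, so a disease name that itself contains a comma would be cut at that comma. A quote-aware alternative can be sketched in Python. The function names `id_name_pairs` and `merge_id_name` are hypothetical. The sketch assumes, as the `awk` call does, that the first two columns hold the id and the name.

```python
import csv
from typing import Iterable, List, Tuple


def id_name_pairs(csv_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (id, name) from the first two CSV columns, honouring quoted fields."""
    pairs = []
    for row in csv.reader(csv_lines):
        if len(row) >= 2:  # skip empty or malformed rows
            pairs.append((row[0], row[1]))
    return pairs


def merge_id_name(*tables: Iterable[Tuple[str, str]]) -> List[str]:
    """Merge (id, name) tables into sorted, de-duplicated tab-separated lines."""
    lines = {f"{ident}\t{name}" for table in tables for ident, name in table}
    return sorted(lines)
```

`merge_id_name` plays the role of `cat ... | sort -u`: duplicates across the two mapping sources collapse to a single line.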