Converting gene IDs with the UniProt ID mapping API
TL;DR
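Anyone working in bioinformatics runs into a seemingly endless number of gene ID systems.
Converting between them is tedious, but for the well-known ones UniProt's ID mapping service can do it. For example, it can convert UniProt IDs to Ensembl IDs or NCBI Entrez Gene IDs. The service is also exposed as an API, so it can be called from Python and other languages.
There is official API documentation as well, and the API is easy to use once you have read it.
Checking which IDs can be converted
The following command lists them.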
curl https://rest.uniprot.org/configure/idmapping/fields
The response is organized into groups. Each group lists the databases it contains and the name used for each one. Listing just the group names gives roughly the following.
curl https://rest.uniprot.org/configure/idmapping/fields | jq '.groups[].groupName'
"UniProt"
"Sequence databases"
"3D structure databases"
"Protein-protein interaction databases"
"Chemistry"
"Protein family/group databases"
"PTM databases"
"Genetic variation databases"
"Proteomic databases"
"Protocols and materials databases"
"Genome annotation databases"
"Organism-specific databases"
"Phylogenomic databases"
"Enzyme and pathway databases"
"Miscellaneous"
"Gene expression databases"
"Family and domain databases"
Let's look at what the "Genome annotation databases" group contains. Ensembl is in this group.
curl https://rest.uniprot.org/configure/idmapping/fields | \
jq '.groups[] | select(.groupName == "Genome annotation databases")'
{
"groupName": "Genome annotation databases",
"items": [
{
"displayName": "Ensembl",
"name": "Ensembl",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.ensembl.org/id/%id"
},
{
"displayName": "Ensembl Genomes",
"name": "Ensembl_Genomes",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "http://www.ensemblgenomes.org/id/%id"
},
{
"displayName": "Ensembl Genomes Protein",
"name": "Ensembl_Genomes_Protein",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "http://www.ensemblgenomes.org/id/%id"
},
{
"displayName": "Ensembl Genomes Transcript",
"name": "Ensembl_Genomes_Transcript",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "http://www.ensemblgenomes.org/id/%id"
},
{
"displayName": "Ensembl Protein",
"name": "Ensembl_Protein",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.ensembl.org/id/%id"
},
{
"displayName": "Ensembl Transcript",
"name": "Ensembl_Transcript",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.ensembl.org/id/%id"
},
{
"displayName": "GeneID",
"name": "GeneID",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.ncbi.nlm.nih.gov/gene/%id"
},
{
"displayName": "KEGG",
"name": "KEGG",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.genome.jp/dbget-bin/www_bget?%id"
},
{
"displayName": "PATRIC",
"name": "PATRIC",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://www.patricbrc.org/view/Feature/%id"
},
{
"displayName": "UCSC",
"name": "UCSC",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://genome.ucsc.edu/cgi-bin/hgLinkIn?resource=uniprot&id=%primaryAccession"
},
{
"displayName": "WBParaSite",
"name": "WBParaSite",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://parasite.wormbase.org/id/%id"
},
{
"displayName": "WBParaSite Transcript/Protein",
"name": "WBParaSite_Transcript-Protein",
"from": true,
"to": true,
"ruleId": 7,
"uriLink": "https://parasite.wormbase.org/id/%id"
}
]
}
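As a rough sketch of how this configuration could be used from Python, the snippet below lists the database names that can serve as a source or target. It works on a trimmed copy of the response above rather than calling the API, and the helper name `mappable_names` is made up for illustration. It assumes that the `name` field (not `displayName`) is the value to pass as `from` / `to`, which matches the `Ensembl` value used later in this post.

```python
import json

# Trimmed sample of the /configure/idmapping/fields response shown above.
SAMPLE_FIELDS = json.loads("""
{
  "groups": [
    {
      "groupName": "Genome annotation databases",
      "items": [
        {"displayName": "Ensembl", "name": "Ensembl", "from": true, "to": true},
        {"displayName": "GeneID", "name": "GeneID", "from": true, "to": true},
        {"displayName": "WBParaSite Transcript/Protein",
         "name": "WBParaSite_Transcript-Protein", "from": true, "to": true}
      ]
    }
  ]
}
""")


def mappable_names(fields, direction="from"):
    """Return {group name: [database names]} usable in the given direction.

    Hypothetical helper: `direction` is "from" or "to", matching the boolean
    flags on each item in the configuration.
    """
    if direction not in ("from", "to"):
        raise ValueError("direction must be 'from' or 'to'")
    names = {}
    for group in fields.get("groups", []):
        usable = [
            item["name"] for item in group.get("items", []) if item.get(direction)
        ]
        if usable:
            names[group["groupName"]] = usable
    return names


print(mappable_names(SAMPLE_FIELDS, "to"))
```

To run it against the live configuration, the output of the `curl` command above can be parsed with `json.loads` and passed in place of `SAMPLE_FIELDS`.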
Workflow
A conversion takes several API calls, in the following three steps.
- Submitting a job
- Polling the status of a job
- Fetching the result of a job
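The three steps above could be sketched in Python roughly as follows, using only the standard library. This is an illustration, not the official client: the function names are made up, and the status names `NEW` and `RUNNING` are an assumption based on my reading of the official Python example. Because `urllib` follows the 303 redirect automatically, polling and fetching collapse into one loop: as long as the payload contains `jobStatus` the job is still in progress, and once it does not, the payload is the result itself.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://rest.uniprot.org"


def build_run_request(from_db, to_db, ids):
    """Step 1: build the POST request that submits a mapping job."""
    form = urllib.parse.urlencode(
        {"from": from_db, "to": to_db, "ids": ",".join(ids)}
    ).encode("ascii")
    return urllib.request.Request(
        f"{API_URL}/idmapping/run", data=form, method="POST"
    )


def get_json(request_or_url):
    """Send a request and decode the JSON body (redirects are followed)."""
    with urllib.request.urlopen(request_or_url, timeout=30) as response:
        return json.load(response)


def map_ids(from_db, to_db, ids, max_tries=20, delay=3.0):
    """Submit a job, poll its status, and return the result payload."""
    job_id = get_json(build_run_request(from_db, to_db, ids))["jobId"]
    status_url = f"{API_URL}/idmapping/status/{job_id}"
    for _ in range(max_tries):
        payload = get_json(status_url)
        # Steps 2 and 3: a payload without "jobStatus" means the redirect
        # was followed and we are already looking at the result.
        if "jobStatus" not in payload:
            return payload
        # Assumption: "NEW" and "RUNNING" are the in-progress states.
        if payload["jobStatus"] not in ("NEW", "RUNNING"):
            raise RuntimeError(f"job {job_id} ended as {payload['jobStatus']}")
        time.sleep(delay)
    raise TimeoutError(f"job {job_id} did not finish after {max_tries} polls")
```

With network access, `map_ids("UniProtKB_AC-ID", "UniRef90", ["P21802", "P12345"])` would submit the same job as the bash walkthrough in the next section.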
Trying it in bash
Following the official example, we run the steps one at a time.
Step by step
- Submit a job. This one converts UniProtKB accessions to UniRef90.
curl --request POST 'https://rest.uniprot.org/idmapping/run' --form 'ids="P21802,P12345"' --form 'from="UniProtKB_AC-ID"' --form 'to="UniRef90"'
The job ID comes back as JSON.
{ "jobId": "6387ece4eb3305097b61ffb3601c18f4b7d92242" }
- Check the job status. Once the job has finished, the status endpoint redirects to the result, so look at the Location header of the response.
curl -i 'https://rest.uniprot.org/idmapping/status/6387ece4eb3305097b61ffb3601c18f4b7d92242'
HTTP/2 303
vary: accept,accept-encoding,x-uniprot-release,x-api-deployment-date
vary: User-Agent
cache-control: no-cache
content-type: application/json
access-control-allow-credentials: true
access-control-expose-headers: Link, X-Total-Results, X-UniProt-Release, X-UniProt-Release-Date, X-API-Deployment-Date
x-api-deployment-date: 24-July-2024
strict-transport-security: max-age=31536000; includeSubDomains
date: Sat, 28 Sep 2024 09:52:30 GMT
access-control-max-age: 1728000
x-uniprot-release: 2024_04
location: https://rest.uniprot.org/idmapping/uniref/results/6387ece4eb3305097b61ffb3601c18f4b7d92242
access-control-allow-origin: *
access-control-allow-methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
access-control-allow-headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
x-uniprot-release-date: 24-July-2024
{"jobStatus":"FINISHED"}
- Check the result. The status response contains a header like this:
location: https://rest.uniprot.org/idmapping/uniref/results/6387ece4eb3305097b61ffb3601c18f4b7d92242
Requesting that URL returns the result.
The response is long; here it is in full.
{
"results": [
{
"from": "P21802",
"to": {
"id": "UniRef90_P21802",
"name": "Cluster: Fibroblast growth factor receptor 2",
"updated": "2024-07-24",
"entryType": "UniRef90",
"commonTaxon": { "scientificName": "Amniota", "taxonId": 32524 },
"memberCount": 132,
"organismCount": 77,
"representativeMember": {
"memberIdType": "UniProtKB ID",
"memberId": "FGFR2_HUMAN",
"organismName": "Homo sapiens (Human)",
"organismTaxId": 9606,
"sequenceLength": 821,
"proteinName": "Fibroblast growth factor receptor 2",
"accessions": [
"P21802",
"B4DFC2",
"E7EVR6",
"E9PCR0",
"P18443",
"Q01742",
"Q12922",
"Q14300",
"Q14301",
"Q14302",
"Q14303",
"Q14304",
"Q14305",
"Q14672",
"Q14718",
"Q14719",
"Q1KHY5",
"Q86YI4",
"Q8IXC7",
"Q96KL9",
"Q96KM0",
"Q96KM1",
"Q96KM2",
"Q9NZU2",
"Q9NZU3",
"Q9UD01",
"Q9UD02",
"Q9UIH3",
"Q9UIH4",
"Q9UIH5",
"Q9UIH6",
"Q9UIH7",
"Q9UIH8",
"Q9UM87",
"Q9UMC6",
"Q9UNS7",
"Q9UQH7",
"Q9UQH8",
"Q9UQH9",
"Q9UQI0"
],
"uniref50Id": "UniRef50_P21802",
"uniref100Id": "UniRef100_P21802",
"uniparcId": "UPI000012A72A",
"sequence": {
"value": "MVSWGRFICLVVVTMATLSLARPSFSLVEDTTLEPEEPPTKYQISQPEVYVAAPGESLEVRCLLKDAAVISWTKDGVHLGPNNRTVLIGEYLQIKGATPRDSGLYACTASRTVDSETWYFMVNVTDAISSGDDEDDTDGAEDFVSENSNNKRAPYWTNTEKMEKRLHAVPAANTVKFRCPAGGNPMPTMRWLKNGKEFKQEHRIGGYKVRNQHWSLIMESVVPSDKGNYTCVVENEYGSINHTYHLDVVERSPHRPILQAGLPANASTVVGGDVEFVCKVYSDAQPHIQWIKHVEKNGSKYGPDGLPYLKVLKAAGVNTTDKEIEVLYIRNVTFEDAGEYTCLAGNSIGISFHSAWLTVLPAPGREKEITASPDYLEIAIYCIGVFLIACMVVTVILCRMKNTTKKPDFSSQPAVHKLTKRIPLRRQVTVSAESSSSMNSNTPLVRITTRLSSTADTPMLAGVSEYELPEDPKWEFPRDKLTLGKPLGEGCFGQVVMAEAVGIDKDKPKEAVTVAVKMLKDDATEKDLSDLVSEMEMMKMIGKHKNIINLLGACTQDGPLYVIVEYASKGNLREYLRARRPPGMEYSYDINRVPEEQMTFKDLVSCTYQLARGMEYLASQKCIHRDLAARNVLVTENNVMKIADFGLARDINNIDYYKKTTNGRLPVKWMAPEALFDRVYTHQSDVWSFGVLMWEIFTLGGSPYPGIPVEELFKLLKEGHRMDKPANCTNELYMMMRDCWHAVPSQRPTFKQLVEDLDRILTLTTNEEYLDLSQPLEQYSPSYPDTRSSCSSGDDSVFSPDPMPYEPCLPQYPHINGSVKT",
"length": 821,
"molWeight": 92025,
"crc64": "6CD5001C960ED82F",
"md5": "8278583234A3EDA2A192D8BB50E1FAB8"
}
},
"seedId": "A0A6J2D1E3",
"memberIdTypes": [
"UniProtKB Unreviewed (TrEMBL)",
"UniProtKB Reviewed (Swiss-Prot)",
"UniParc"
],
"members": [
"P21802",
"P21803",
"A0A6J2D1E3",
"A0A8C7ANU4",
"A0A8U0SV68",
"A0A673SQ31",
"A0A2J8PBY1",
"A1YYN9",
"A0A8C0AIU6",
"A0A8C6GPM7"
],
"organisms": [
{
"scientificName": "Homo sapiens",
"commonName": "Human",
"taxonId": 9606
},
{
"scientificName": "Mus musculus",
"commonName": "Mouse",
"taxonId": 10090
},
{
"scientificName": "Zalophus californianus",
"commonName": "California sealion",
"taxonId": 9704
},
{
"scientificName": "Neovison vison",
"commonName": "American mink",
"taxonId": 452646
},
{
"scientificName": "Mustela putorius furo",
"commonName": "European domestic ferret",
"taxonId": 9669
},
{
"scientificName": "Suricata suricatta",
"commonName": "Meerkat",
"taxonId": 37032
},
{
"scientificName": "Pan troglodytes",
"commonName": "Chimpanzee",
"taxonId": 9598
},
{
"scientificName": "Bos mutus grunniens",
"commonName": "Wild yak",
"taxonId": 30521
},
{
"scientificName": "Mus spicilegus",
"commonName": "Steppe mouse",
"taxonId": 10103
},
{
"scientificName": "Bos indicus x Bos taurus",
"commonName": "Hybrid cattle",
"taxonId": 30522
}
],
"goTerms": [
{ "goId": "GO:0005007", "aspect": "GO Molecular Function" },
{ "goId": "GO:0005524", "aspect": "GO Molecular Function" },
{ "goId": "GO:0005794", "aspect": "GO Cellular Component" },
{ "goId": "GO:0005886", "aspect": "GO Cellular Component" },
{ "goId": "GO:0008284", "aspect": "GO Biological Process" }
]
}
},
{
"from": "P12345",
"to": {
"id": "UniRef90_P12345",
"name": "Cluster: Aspartate aminotransferase, mitochondrial",
"updated": "2024-07-24",
"entryType": "UniRef90",
"commonTaxon": { "scientificName": "Eutheria", "taxonId": 9347 },
"memberCount": 26,
"organismCount": 22,
"representativeMember": {
"memberIdType": "UniProtKB ID",
"memberId": "AATM_RABIT",
"organismName": "Oryctolagus cuniculus (Rabbit)",
"organismTaxId": 9986,
"sequenceLength": 430,
"proteinName": "Aspartate aminotransferase, mitochondrial",
"accessions": ["P12345", "G1SKL2"],
"uniref50Id": "UniRef50_P00507",
"uniref100Id": "UniRef100_P12345",
"uniparcId": "UPI0001C61C61",
"sequence": {
"value": "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGYLAHAIHQVTK",
"length": 430,
"molWeight": 47409,
"crc64": "12F54284974D27A5",
"md5": "CF84DAC1BDDD05632A89E4C1F186D0D3"
}
},
"seedId": "UPI001E1B1319",
"memberIdTypes": [
"UniProtKB Unreviewed (TrEMBL)",
"UniProtKB Reviewed (Swiss-Prot)",
"UniParc"
],
"members": [
"P12345",
"A0A5F9DR01",
"A0A5S8H2S7",
"A0A091ELC8",
"A0A250YHV8",
"A0A1U7UNG8",
"A0A9B0WQM0",
"A0A1S3EVA8",
"A0A8L2Q7Q0",
"A0A2K5JB63"
],
"organisms": [
{
"scientificName": "Oryctolagus cuniculus",
"commonName": "Rabbit",
"taxonId": 9986
},
{
"scientificName": "Fukomys damarensis",
"commonName": "Damaraland mole rat",
"taxonId": 885580
},
{
"scientificName": "Castor canadensis",
"commonName": "American beaver",
"taxonId": 51338
},
{
"scientificName": "Carlito syrichta",
"commonName": "Philippine tarsier",
"taxonId": 1868482
},
{
"scientificName": "Chrysochloris asiatica",
"commonName": "Cape golden mole",
"taxonId": 185453
},
{
"scientificName": "Dipodomys ordii",
"commonName": "Ord's kangaroo rat",
"taxonId": 10020
},
{
"scientificName": "Rattus norvegicus",
"commonName": "Rat",
"taxonId": 10116
},
{
"scientificName": "Colobus angolensis palliatus",
"commonName": "Peters' Angolan colobus",
"taxonId": 336983
},
{
"scientificName": "Rhinopithecus roxellana",
"commonName": "Golden snub-nosed monkey",
"taxonId": 61622
},
{
"scientificName": "Muntiacus muntjak",
"commonName": "Barking deer",
"taxonId": 9888
}
],
"goTerms": [
{ "goId": "GO:0030170", "aspect": "GO Molecular Function" },
{ "goId": "GO:0005829", "aspect": "GO Cellular Component" },
{ "goId": "GO:0009058", "aspect": "GO Biological Process" },
{ "goId": "GO:0006520", "aspect": "GO Biological Process" }
]
}
}
]
}
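Most of the time only a few fields of this response matter. As a minimal sketch, the snippet below pulls one row per query ID out of a trimmed copy of the response above. The helper name `summarise_clusters` is made up for illustration, and the field selection is just one possible choice.

```python
import json

# Trimmed copy of the UniRef90 response above (most fields omitted).
SAMPLE = json.loads("""
{
  "results": [
    {"from": "P21802",
     "to": {"id": "UniRef90_P21802",
            "name": "Cluster: Fibroblast growth factor receptor 2",
            "memberCount": 132,
            "representativeMember": {"memberId": "FGFR2_HUMAN"}}},
    {"from": "P12345",
     "to": {"id": "UniRef90_P12345",
            "name": "Cluster: Aspartate aminotransferase, mitochondrial",
            "memberCount": 26,
            "representativeMember": {"memberId": "AATM_RABIT"}}}
  ]
}
""")


def summarise_clusters(payload):
    """Yield (query, cluster id, representative member, member count) rows."""
    for entry in payload.get("results", []):
        cluster = entry["to"]
        yield (
            entry["from"],
            cluster["id"],
            cluster["representativeMember"]["memberId"],
            cluster["memberCount"],
        )


# Print the rows as tab-separated text.
for row in summarise_clusters(SAMPLE):
    print("\t".join(str(value) for value in row))
```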
Turning it into a rough script
Extract the jobId with jq, then poll the status until the response is a redirect. Once the redirect appears, call curl again with redirects followed.
uniprot-id-mapping.sh
#!/bin/bash
# Usage: bash uniprot-id-mapping.sh <from> <to> <comma-separated ids>
from=$1
to=$2
ids=$3
# Submit the job and extract the job ID
jobId=$(curl --request POST 'https://rest.uniprot.org/idmapping/run' --form "ids=${ids}" --form "from=${from}" --form "to=${to}" | jq -r .jobId)
echo "jobId: $jobId"
# Poll until the status endpoint answers with a redirect (HTTP 303)
pollingUrl="https://rest.uniprot.org/idmapping/status/$jobId"
echo "pollingUrl: $pollingUrl"
responseCode=
for i in $(seq 5); do
    responseCode=$(curl -s -o /dev/null -w "%{http_code}" "$pollingUrl")
    echo "trial ${i}: $responseCode"
    if [ "$responseCode" -eq 303 ]; then
        break
    fi
    sleep 3
done
# Give up if the job never finished within the polling budget
if [ "$responseCode" != 303 ]; then
    echo "job did not finish in time" >&2
    exit 1
fi
# Follow the redirect and print the result
curl -L "$pollingUrl"
For example, to convert UniProt accessions to Ensembl IDs, run it as follows.
bash uniprot-id-mapping.sh UniProtKB_AC-ID Ensembl "P21802,P12345"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 433 0 52 100 381 53 390 --:--:-- --:--:-- --:--:-- 443
jobId: 14e388f2fc341e38b8c7b164d501c0f31212e732
pollingUrl: https://rest.uniprot.org/idmapping/status/14e388f2fc341e38b8c7b164d501c0f31212e732
trial 1: 303
{"results":[{"from":"P21802","to":"ENSG00000066468.24"}],"failedIds":["P12345"]}
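Note that the shape of `to` differs between targets: here it is a plain string, while in the UniRef90 result earlier it was an object with an `id` field. Also, P12345 did not map and came back under `failedIds`. A small sketch of how both cases might be normalised in Python follows. The helper name `split_results` is made up for illustration, and the values are collected into lists on the assumption that one source ID can map to more than one target.

```python
import json


def split_results(payload):
    """Return ({source id: [target ids]}, [failed ids]) from a result payload."""
    mapping = {}
    for entry in payload.get("results", []):
        target = entry["to"]
        # Some targets (e.g. UniRef90) return an object; keep only its ID.
        if isinstance(target, dict):
            target = target["id"]
        mapping.setdefault(entry["from"], []).append(target)
    return mapping, list(payload.get("failedIds", []))


# The script output shown above, pasted as-is.
OUTPUT = '{"results":[{"from":"P21802","to":"ENSG00000066468.24"}],"failedIds":["P12345"]}'

mapping, failed = split_results(json.loads(OUTPUT))
print(mapping)
print(failed)
```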
Other notes
Since this only involves sending HTTP requests, it is easy to do from other languages too. The official docs include an example written in Python.