NCBI protein database: how to get protein sequences from a specific BioProject (Python script)

Original post: 2013-11-14 13:14:35. Tags: python / ncbi / protein-database

I am trying to retrieve coding protein sequences from the NCBI database for specific BioProjects. This can be done, after a fashion, in a web browser. For instance, you can find the BioProject you are interested in and click on the associated proteins: http://www.ncbi.nlm.nih.gov/genome/proteins/994?project_id=207383, which lets you see all the proteins for BioProject "207383" and Genome "994". I would like to get those protein sequences automatically using Python.

To do that I used the NCBI "E-utilities", mainly "elink.fcgi?", which returns all the UIDs in one database (say "Protein") that are linked from a specific UID in another database (say a BioProject UID). So here is my Entrez URL request:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=bioproject&linkname=bioproject_protein&id=207383
I then obtain a list of protein UIDs, which is great since I need those for my next request, to the "efetch.fcgi?" E-utility. That request would then let me get everything I need.
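The two-step request described above (elink.fcgi to list the linked protein UIDs, then efetch.fcgi to download them) could be sketched roughly as follows. This is a minimal sketch, not the asker's actual script: the helper names (`elink_url`, `parse_link_ids`, `efetch_url`, `fetch`) are made up for illustration, the `rettype=fasta` and `retmode=text` choices are assumptions, and the base URL uses https, which NCBI now requires, instead of the http form quoted in the post.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Base URL of the E-utilities (https variant of the URL quoted in the post).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(bioproject_uid):
    """Build the elink request from the post for one BioProject UID."""
    params = {
        "dbfrom": "bioproject",
        "linkname": "bioproject_protein",
        "id": bioproject_uid,
    }
    return EUTILS + "/elink.fcgi?" + urllib.parse.urlencode(params)


def parse_link_ids(xml_text):
    """Extract the linked UIDs from an elink XML response.

    Only <Id> elements under <LinkSetDb>/<Link> are returned, so the
    query UID listed under <IdList> is not included.
    """
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//LinkSetDb/Link/Id")]


def efetch_url(protein_uids):
    """Build an efetch request for a batch of protein UIDs (FASTA output)."""
    params = {
        "db": "protein",
        "id": ",".join(protein_uids),
        "rettype": "fasta",   # assumption: plain FASTA is what is wanted
        "retmode": "text",
    }
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def fetch(url):
    """Perform the HTTP GET; kept separate so the helpers above stay offline."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

A run would then look like `ids = parse_link_ids(fetch(elink_url("207383")))` followed by `fetch(efetch_url(ids[:200]))`, where the batch size of 200 is an arbitrary choice to keep the URL short. Counting `len(ids)` is also the quickest way to compare against the number shown on the web page.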

OK, so this all works fine, BUT the number of protein UIDs I get from my "elink.fcgi?" request isn't the same as the number of proteins displayed by a manual, browser-based search. Worse, when I looked into where the discrepancy comes from, I found missing sequences as well as sequences from higher taxa (which are not linked in any way to the BioProject).

Here is an example: the first link in this post displays 4014 sequences, whereas the Python request returns 3957 protein UIDs.

I tried some other approaches, such as getting all the protein UIDs linked from a taxonomy UID. This usually gives you more sequences than wanted, since there are different BioProjects (it also gives you some duplicates with different names and the same FASTA sequence).
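For the duplicates mentioned above (different names, same sequence), one possible cleanup is to collapse records with identical sequences after download. The sketch below is only an illustration with made-up function names (`parse_fasta`, `drop_duplicate_sequences`). It assumes that "duplicate" means an exactly identical sequence string, and it keeps the first header seen for each sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_duplicate_sequences(records):
    """Keep only the first record for each distinct sequence."""
    seen = set()
    unique = []
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique
```

This does not address the extra sequences that come from other BioProjects under the same taxonomy UID; it only removes exact repeats.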

Is there a way to do this that actually works?

1 solution

I also find working with NCBI extremely frustrating. I am amazed that such a data source doesn't even provide a clean-cut way to download data. Instead, it offers some terrible cross-linking and leaves users to figure the whole thing out themselves.

My solution is from this post:

How to Download Bacterial Genomes Using the Entrez API

Be sure to change the db to "nuccore" and the rettype to "fasta_cds_aa". Also check the taxonomy ID of the downloaded FASTA file to make sure it is exactly the strain you asked for (this last one messed me up big time; a hard-learned lesson).
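As a rough illustration of the request with db set to "nuccore" and rettype set to "fasta_cds_aa", the sketch below builds the efetch URL and counts the records in the result. It is not the code from the linked post: the function names are invented here, `retmode=text` is an assumption, and the accession `NC_000000.0` in the usage note is a placeholder rather than a real record.

```python
import urllib.parse
import urllib.request

# https variant of the E-utilities base URL.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def cds_protein_url(nuccore_id):
    """Build an efetch URL returning the translated CDS features of one
    nuccore record as protein FASTA (rettype=fasta_cds_aa)."""
    params = {
        "db": "nuccore",
        "id": nuccore_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def count_fasta_records(fasta_text):
    """Count records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def download(url):
    """Perform the HTTP GET and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

Usage would be along the lines of `fasta = download(cds_protein_url("NC_000000.0"))` with a real accession in place of the placeholder, then `count_fasta_records(fasta)` to compare against the protein count shown in the browser. The taxonomy check recommended above is not covered by this sketch and still has to be done separately.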