NCBI protein database: how to get protein sequences from a specific BioProject (Python script)

I am trying to retrieve coding protein sequences from the NCBI database for specific BioProjects. This can be done, after a fashion, in a web browser. For instance, you can find the BioProject you are interested in and click on the associated proteins: http://www.ncbi.nlm.nih.gov/genome/proteins/994?project_id=207383 which shows all the proteins for BioProject "207383" and genome "994". I would like to fetch those protein sequences automatically using Python.

To do that I used the NCBI "E-utilities", mainly "elink.fcgi", which returns all the UIDs in one database (say "Protein") that are linked from a specific UID in another database (say a BioProject UID). Here is my Entrez URL request:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=bioproject&linkname=bioproject_protein&id=207383
This gives me a list of protein UIDs, which is exactly what I need for my next request, to the "efetch.fcgi" E-utility. That request then lets me download everything I need.
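The two-step workflow described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used here: the helper names (`elink_url`, `parse_linked_ids`, `efetch_url`, `fetch_protein_fasta`) and the batch size of 200 are assumptions, and the explicit `db=protein` parameter is added on top of the URL shown above. The parser assumes the usual ELink XML layout, in which linked UIDs appear as `LinkSetDb/Link/Id` elements.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(bioproject_uid):
    """Build the ELink request that maps a BioProject UID to protein UIDs."""
    params = {
        "dbfrom": "bioproject",
        "db": "protein",  # assumption: not present in the original URL
        "linkname": "bioproject_protein",
        "id": bioproject_uid,
    }
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def parse_linked_ids(xml_text):
    """Extract the linked UIDs from an ELink XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//LinkSetDb/Link/Id")]


def efetch_url(protein_uids):
    """Build the EFetch request that returns FASTA for a batch of protein UIDs."""
    params = {
        "db": "protein",
        "id": ",".join(protein_uids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def fetch_protein_fasta(bioproject_uid, batch_size=200):
    """Yield FASTA text for every protein linked to the BioProject (network access)."""
    with urlopen(elink_url(bioproject_uid)) as resp:
        uids = parse_linked_ids(resp.read())
    for start in range(0, len(uids), batch_size):
        batch = uids[start:start + batch_size]
        with urlopen(efetch_url(batch)) as resp:
            yield resp.read().decode("utf-8")
```

Calling `"".join(fetch_protein_fasta("207383"))` would then return the concatenated FASTA text; counting the UIDs returned by `parse_linked_ids` is one way to compare against the number shown on the website.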

OK, so all of this works, BUT the number of protein UIDs I get from my "elink.fcgi" request isn't the same as the number of proteins displayed by a manual, browser-based search. Worse, when I looked into where the discrepancy comes from, I found missing sequences as well as sequences from higher taxa (which are not linked to the BioProject in any way).

Here is an example: the first link in this post displays 4014 sequences, whereas the Python request returns 3957 protein UIDs.

I tried some other approaches, such as getting all the protein UIDs linked from a taxonomy UID. This usually gives more sequences than wanted, since the taxon covers several different BioProjects (it also returns some duplicates with different names but the same FASTA sequence).
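To see how many of those extra records are duplicates with different names and the same sequence, the downloaded FASTA text can be grouped by sequence. The sketch below is only an illustration; `parse_fasta` and `group_by_sequence` are hypothetical helper names, and it treats two records as duplicates only when their sequences match exactly (ignoring case).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_by_sequence(records):
    """Map each distinct sequence to the list of headers that carry it."""
    groups = {}
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return groups
```

Any sequence whose header list has more than one entry is a duplicate under a different name; keeping the first header per sequence gives a de-duplicated set.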

Is there a way to do this, one which might work?

I also find working with NCBI extremely frustrating. I am amazed that such a data source doesn't even provide a clean-cut way to download. Instead, it offers some terrible cross-linking and leaves users to figure the whole thing out themselves.

My solution comes from this post:

How to Download Bacterial Genomes Using the Entrez API

Be sure to change the db to "nuccore" and the rettype to "fasta_cds_aa". Also check the taxonomy ID of the downloaded FASTA file to make sure it is exactly the strain you asked for (this last point messed me up big time; a hard-learned lesson).
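As a rough sketch of that change, the EFetch request would look something like the code below. The function names are hypothetical and the code in the linked post is not reproduced here. `protein_ids_from_headers` assumes the FASTA headers carry a `[protein_id=...]` tag, as fasta_cds_aa output typically does; it does not perform the taxonomy check, which still has to be done separately.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_cds_aa_url(nuccore_id):
    """Build an EFetch URL returning translated CDS features for a nucleotide record."""
    params = {
        "db": "nuccore",
        "id": nuccore_id,
        "rettype": "fasta_cds_aa",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_cds_aa(nuccore_id):
    """Download the protein FASTA for one nucleotide record (network access)."""
    with urlopen(efetch_cds_aa_url(nuccore_id)) as resp:
        return resp.read().decode("utf-8")


def protein_ids_from_headers(fasta_text):
    """Collect the [protein_id=...] tags found on FASTA header lines."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = re.search(r"\[protein_id=([^\]]+)\]", line)
            if match:
                ids.append(match.group(1))
    return ids
```

The number of IDs returned by `protein_ids_from_headers(download_cds_aa(...))` can be compared with the count shown on the website for the same genome.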
