Using "findall" to find a sequence motif for a protein sequence
I have a program that needs to take user input to find a FASTA file containing a protein sequence (and give an error if the file can't be found), then scan through the sequence and find every four-letter sequence that follows these rules: start with "N", then anything but "P", then "S" or "T", then anything but "P". I have the part where an error is given if the file is not found. However, when scanning through the sequence, I am only receiving one-letter sequences.
Here is my code:
import re

userinput = input("Please provide a FASTA file.")
while userinput:
    try:
        if userinput == "0":
            break
        with open(userinput, mode = 'r') as protein:
            readprotein = protein.read()
            matches = re.findall('N[^P](S|T)[^P]', readprotein)
            for match in matches:
                print(match)
        break
    except FileNotFoundError:
        print("File not found. Please ensure that you are typing the file name exactly as it is found with the file extension.")
        userinput = input("Please provide a FASTA file. 0 to quit.")
The FASTA file that I'm working with is the HIV Type-2 proteome, and here's a small snippet:
>sp|P18096|POL_HV2BE Gag-Pol polyprotein OS=Human immunodeficiency virus type 2 subtype A (isolate BEN) OX=11714 GN=gag-pol PE=3 SV=4
MGARNSVLRGKKADELEKVRLRPGGKKKYRLKHIVWAANELDKFGLAESLLESKEGCQKI
LRVLDPLVPTGSENLKSLFNTVCVIWCLHAEEKVKDTEEAKKLAQRHLVAETGTAEKMPN
TSRPTAPPSGKRGNYPVQQAGGNYVHVPLSPRTLNAWVKLVEEKKFGAEVVPGFQALSEG
CTPYDINQMLNCVGDHQAAMQIIREIINEEAADWDSQHPIPGPLPAGQLRDPRGSDIAGT
TSTVDEQIQWMYRPQNPVPVGNIYRRWIQIGLQKCVRKYNPTNILDIKQGPKEPFQSYVD
RFYKSLRAEQTDPAVKNWMTQTLLIQNANPDCKLVLKGLGMNPTLEEMLTACQGVGGPGQ
KARLMAEALKEAMGPSPIPFAAAQQRKAIRYWNCGKEGHSARQCRAPRRQGCWKCGKPGH
IMANCPERQAGFFRVGPTGKEASQLPRDPSPSGADTNSTSGRSSSGTVGEIYAAREKAEG
AEGETIQRGDGGLAAPRAERDTSQRGDRGLAAPQFSLWKRPVVTAYIEDQPVEVLLDTGA
DDSIVAGIELGDNYTPKIVGGIGGFINTKEYKNVEIKVLNKRVRATIMTGDTPINIFGRN
ILTALGMSLNLPVAKIEPIKVTLKPGKDGPRLKQWPLTKEKIEALKEICEKMEKEGQLEE
APPTNPYNTPTFAIKKKDKNKWRMLIDFRELNKVTQDFTEIQLGIPHPAGLAKKKRISIL
DVGDAYFSIPLHEDFRQYTAFTLPAVNNMEPGKRYIYKVLPQGWKGSPAIFQYTMRQVLE
PFRKANPDVILIQYMDDILIASDRTGLEHDKVVLQLKELLNGLGFSTPDEKFQKDPPFQW
MGCELWPTKWKLQKLQLPQKDIWTVNDIQKLVGVLNWAAQIYSGIKTKHLCRLIRGKMTL
TEEVQWTELAEAELEENKIILSQEQEGYYYQEEKELEATIQKSQGHQWTYKIHQEEKILK
VGKYAKIKNTHTNGVRLLAQVVQKIGKEALVIWGRIPKFHLPVERETWEQWWDNYWQVTW
IPEWDFVSTPPLVRLTFNLVGDPIPGAETFYTDGSCNRQSKEGKAGYVTDRGKDKVKVLE
QTTNQQAELEVFRMALADSGPKVNIIVDSQYVMGIVAGQPTESENRIVNQIIEEMIKKEA
VYVAWVPAHKGIGGNQEVDHLVSQGIRQVLFLEKIEPAQEEHEKYHSIIKELTHKFGIPL
LVARQIVNSCAQCQQKGEAIHGQVNAEIGVWQMDYTHLEGKIIIVAVHVASGFIEAEVIP
QESGRQTALFLLKLASRWPITHLHTDNGPNFTSQEVKMVAWWVGIEQSFGVPYNPQSQGV
VEAMNHHLKNQISRIREQANTIETIVLMAVHCMNFKRRGGIGDMTPAERLINMITTEQEI
QFLQRKNSNFKNFQVYYREGRDQLWKGPGELLWKGEGAVIVKVGTDIKVVPRRKAKIIRD
YGGRQELDSSPHLEGAREDGEMACPCQVPEIQNKRPRGGALCSPPQGGMGMVDLQQGNIP
TTRKKSSRNTGILEPNTRKRMALLSCSKINLVYRKVLDRCYPRLCRHPNT
>sp|P18095|GAG_HV2BE Gag polyprotein OS=Human immunodeficiency virus type 2 subtype A (isolate BEN) OX=11714 GN=gag PE=3 SV=3
MGARNSVLRGKKADELEKVRLRPGGKKKYRLKHIVWAANELDKFGLAESLLESKEGCQKI
LRVLDPLVPTGSENLKSLFNTVCVIWCLHAEEKVKDTEEAKKLAQRHLVAETGTAEKMPN
TSRPTAPPSGKRGNYPVQQAGGNYVHVPLSPRTLNAWVKLVEEKKFGAEVVPGFQALSEG
CTPYDINQMLNCVGDHQAAMQIIREIINEEAADWDSQHPIPGPLPAGQLRDPRGSDIAGT
TSTVDEQIQWMYRPQNPVPVGNIYRRWIQIGLQKCVRKYNPTNILDIKQGPKEPFQSYVD
RFYKSLRAEQTDPAVKNWMTQTLLIQNANPDCKLVLKGLGMNPTLEEMLTACQGVGGPGQ
KARLMAEALKEAMGPSPIPFAAAQQRKAIRYWNCGKEGHSARQCRAPRRQGCWKCGKPGH
IMANCPERQAGFLGLGPRGKKPRNFPVTQAPQGLIPTAPPADPAAELLERYMQQGRKQRE
QRERPYKEVTEDLLHLEQRETPHREETEDLLHLNSLFGKDQ
Obviously, the error within my code lies within the "findall" function that my professor instructed me to use, and I think it may just be because I can't fully wrap my head around the use of regular expressions. What I have is re.findall('N[^P](S|T)[^P]', readprotein). I don't see why the single-letter sequences that I am getting don't even start with "N"; they are just a bunch of "T"s or "S"s. Any help is appreciated!
The problem is that (S|T) is a capturing group, and when a pattern contains a capturing group, re.findall returns only the text captured by that group rather than the whole match. You can fix it by writing (?:S|T) to make it a non-capturing group instead, so that re.findall returns the whole match.
Example:
>>> import re
>>> re.findall('N[^P](S|T)[^P]', 'NSTN')
['T']
>>> re.findall('N[^P](?:S|T)[^P]', 'NSTN')
['NSTN']
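As a rough sketch of how the fix might fit into the whole task: the character class [ST] is equivalent to (?:S|T) here and has no group for re.findall to capture. Note also that reading the file in one piece leaves the ">" header lines and the line breaks in the text, and [^P] will happily match a newline, so a motif split across two lines could be missed or matched incorrectly. The version below joins each record's sequence lines before searching. The names MOTIF, read_fasta, and find_motifs are made up for this illustration and are not part of the original program; the demo record is invented too.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
# [ST] is a character class, so there is no capturing group involved.
MOTIF = re.compile(r'N[^P][ST][^P]')


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: sequence lines of a record are joined so that
    line breaks never end up inside the text being searched.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def find_motifs(sequence):
    """Return (start_index, motif) for each non-overlapping match."""
    return [(m.start(), m.group(0)) for m in MOTIF.finditer(sequence)]


# Small made-up record; with a real file, pass the open file object
# to read_fasta instead of this list.
demo = [">demo record (invented sequence)", "MNSTN", "PSNGTA"]
for header, seq in read_fasta(demo):
    print(header, find_motifs(seq))
```

Both re.findall and re.finditer report non-overlapping matches only. If overlapping motifs matter, one common approach is to wrap the pattern in a lookahead with a capturing group, for example (?=(N[^P][ST][^P])), in which case re.findall returns the captured four-letter strings.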