
How to check if a list element contains a regex in Python?

I have a file with information about sequences. Every sequence spans several lines, and the sequences are separated by five blank lines. I want to read the file into a list, splitting it on the five newlines, so that every sequence is one element of the list. Then I want to remove the sequences that do not contain the regular expression, so that I end up with a list of only the sequences that contain the regex.

This is what I have so far. Can anyone help me further?

import re
def main():
    ReadFile()
    file = open ("filename.txt", "r")
    CreateList(file, data)
    RegEx(file, data)

def ReadFile()
    try:
        file = open ("filename.txt", "r")
    except IOError:
        print ("Can't open the file")
    except:
        print ("Something went wrong.")

def CreateList(file, data)
    data = file.readlines()
    data = data.split('\n\n\n\n\n')

def RegEx(file, data)
    regex = ("[AG].{4}GK[ST]") 
    for element in data:
        if regex not in element: 
            data.remove(element) 
    print (data) 

main()

The file looks like this:

Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) :  [occurs frequently]
   Pattern: [AG]-x(4)-G-K-[ST]
   Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371


>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT


>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT


>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS


>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS

The data should look like this (but with only the proteins that contain the regex):

['>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT',


'>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT',


'>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS',


'>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS']
import re

# Read the whole file into one string.
with open("ploop.txt") as file:
    text = file.read()

# Entries are separated by blank lines; the first chunk is the pattern header, so skip it.
proteins = text.split("\n\n")[1:]

# Keep only the entries in which the pattern occurs.
proteinsMatching = []
for protein in proteins:
    if re.search(r"[AG].{4}GK[ST]", protein):
        proteinsMatching.append(protein)

# Collect the accession code and organism of every matching entry.
toWrite = ""
for protein in proteinsMatching:
    accessionCode = re.findall(r">sp\|(.{6})", protein)[0]
    organism = re.findall(r"\n.+?\[(.+?)\]", protein)[0]
    print(accessionCode, organism)
    toWrite += accessionCode + " " + organism + "\n"

with open("results.txt", "w") as f:
    f.write(toWrite)

# Q6GZX2 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZX1 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZW0 Frog virus 3 (isolate Goorha) (FV-3)
# P32234 Drosophila melanogaster (Fruit fly)
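
The question tests `regex not in element`, which is a literal substring check rather than a regex match, and it removes items from the list while iterating over it, which can skip elements. A minimal sketch of the filtering step as a reusable function follows. The helper names `split_records` and `records_matching` and the two sample records are made up for illustration, and the sketch assumes that entries are separated by one or more blank lines:

```python
import re

# The pattern from the question, compiled once.
MOTIF = re.compile(r"[AG].{4}GK[ST]")


def split_records(text):
    """Split text at runs of blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip("\n") for part in parts if part.strip()]


def records_matching(text, pattern=MOTIF):
    """Return a new list with only the records in which the pattern occurs."""
    return [rec for rec in split_records(text) if pattern.search(rec)]


# Invented sample data: the first record contains the motif, the second does not.
sample = (
    ">sp|X00001|DEMO1  (12 aa)\n"
    "Demo protein one.  [Demo organism]\n"
    "MARPLLGKTSSV\n"
    "\n\n"
    ">sp|X00002|DEMO2  (10 aa)\n"
    "Demo protein two.  [Demo organism]\n"
    "MNNNNNNNNN\n"
)

matching = records_matching(sample)
print(len(matching))
```

Building a new list with a comprehension avoids mutating `data` during the loop, and `pattern.search` returns a match object (truthy) or `None`, so it can be used directly as the filter condition.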

Updated (again) for new requirements.

Regex1 (splitting the text file into a list of proteins): https://regex101.com/r/gU0gX5/1

Regex2 (your regex, showing that they all match): https://regex101.com/r/nZ0pD6/1
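
One reason every entry matches is that each one ends with a hit summary line such as `ArpllGKT`, which itself satisfies the pattern. The sequences are also wrapped across lines, and `.` does not match a newline by default, so a motif that straddles a line break is not found when the raw entry text is searched. If the goal is to test the residues only, one possible approach is to join the sequence lines first. This is a sketch, not part of the original answer: the helper names `residues` and `has_motif` and the sample record are hypothetical, and it assumes that sequence lines are exactly the lines consisting only of uppercase letters.

```python
import re

MOTIF = re.compile(r"[AG].{4}GK[ST]")


def residues(record):
    """Join the lines that consist only of uppercase letters (assumed to be the sequence)."""
    return "".join(
        line.strip()
        for line in record.splitlines()
        if re.fullmatch(r"[A-Z]+", line.strip())
    )


def has_motif(record, pattern=MOTIF):
    """True if the pattern occurs in the joined sequence of the record."""
    return pattern.search(residues(record)) is not None


# Invented record whose motif (AVVVVGKS) is split across two sequence lines.
record = (
    ">sp|X00003|DEMO3  (16 aa)\n"
    "Demo protein three.  [Demo organism]\n"
    "MMMMAVVV\n"
    "VGKSMMMM\n"
)

print(MOTIF.search(record) is None)  # raw text: the line break hides the motif
print(has_motif(record))             # joined residues: the motif is found
```

A description line made up of a single uppercase word would be mistaken for sequence data under this assumption, so a real parser may need a stricter rule for telling the lines apart.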
