Finding common elements between two files
I have two different files, as shown below. file1.txt is tab-delimited:
AT5G54940.1 3182
    pfam
    PF01253 SUI1#Translation initiation factor SUI1
    mf
    GO:0003743 translation initiation factor activity
    GO:0008135 translation factor activity, nucleic acid binding
    bp
    GO:0006413 translational initiation
    GO:0006412 translation
    GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
    pfam
    PF01575 MaoC_dehydratas#MaoC like domain
    mf
    GO:0016491 oxidoreductase activity
    GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
and file2.txt, which contains a list of protein names:
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
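I need a program that finds the protein names from file 1 that are present in file 2, and that also prints all the "GO:" entries associated with each such protein (if any). The hard part for me is parsing the first file; its format is odd. I tried something like the following, but other approaches are very welcome: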
import re

with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)
Desired output (any format is fine, as long as I get all the corresponding GO entries):
AT5G54940.1 GO:0003743 translation initiation factor activity
    GO:0008135 translation factor activity, nucleic acid binding
    GO:0006413 translational initiation
    GO:0006412 translation
    GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
    GO:0033989 3alpha,7alpha,
OS08T0174000-01
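Just to give you an idea of how to approach this: in the input file, the "chunk" belonging to one protein is delimited by a change from an indented line to a non-indented line. Search for this transition and you have your chunks (or "groups"). The first line of a chunk contains the protein name. All the other lines are potentially GO: lines.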
You can detect indentation with if line.startswith(" ") (depending on the format of your input file, you might need to look for "\t" instead of " ").
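def get_protein_chunks(filepath):
    """Yield each protein chunk as a list of stripped lines."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            # Accept either spaces or tabs as indentation.
            current_indented = line.startswith((" ", "\t"))
            # An indented -> non-indented transition closes the current chunk.
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    # Note: the final chunk is never yielded, because no later line triggers
    # the transition; add "if chunk: yield chunk" here to include it.


with open("file2.txt") as f:
    look_for_proteins = set(line.strip() for line in f)

for p in get_protein_chunks("file1.txt"):
    proteinname = p[0].split()[0]  # first field of the chunk's first line
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)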
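Here, a chunk is nothing but a list of stripped lines. I use a generator to extract the protein chunks from the input file. As you can see, the logic is based solely on the transition from an indented line to a non-indented line.
With the generator, you can process the data however you like. I simply print it, but you may want to put the data into a dictionary and analyze it further.
Output: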
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
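If you do want the chunks in a dictionary, the following is a minimal sketch of one way to do it, not the code that produced the output above. It differs from the generator shown earlier in two assumed ways: it starts a new record at every non-indented line (so a protein with no indented lines still gets its own record), and it also yields the last record in the file. The names iter_protein_chunks and go_terms_by_protein, and the in-memory sample, are made up for illustration.

```python
def iter_protein_chunks(lines):
    """Yield (header, body) for each protein record.

    A record starts at every non-indented, non-blank line and runs until
    the next such line, so header-only records and the last record in the
    input are included.
    """
    header, body = None, []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t":
            # Indented line: belongs to the current record, if there is one.
            if header is not None:
                body.append(line.strip())
        else:
            # Non-indented line: emit the previous record and start a new one.
            if header is not None:
                yield header, body
            header, body = line.strip(), []
    if header is not None:
        yield header, body  # the last record has no successor to trigger it


def go_terms_by_protein(lines, wanted):
    """Map each wanted protein name to the list of its GO: lines."""
    result = {}
    for header, body in iter_protein_chunks(lines):
        name = header.split()[0]
        if name in wanted:
            result[name] = [entry for entry in body if entry.startswith("GO:")]
    return result


# Hypothetical in-memory sample (tab-indented), shaped like file1.txt.
sample = [
    "AT5G54940.1\t3182\n",
    "\tpfam\n",
    "\tPF01253\tSUI1#Translation initiation factor SUI1\n",
    "\tmf\n",
    "\tGO:0003743\ttranslation initiation factor activity\n",
    "GRMZM2G158629_P02\t4996\n",
    "\tGO:0016491\toxidoreductase activity\n",
    "OS08T0174000-01\t560919\n",
]
wanted = {"AT5G54940.1", "OS05T0566300-01", "OS08T0174000-01"}

terms = go_terms_by_protein(sample, wanted)
for name in sorted(terms):
    print(name, terms[name])
```

Because the functions take any iterable of lines, an open file object can be passed in place of the sample list, for example go_terms_by_protein(open("file1.txt"), wanted).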
One option is to build a dictionary of lists, using the protein names as keys:
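#!/usr/bin/env python
import pprint

pp = pprint.PrettyPrinter()

with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)

d = {}
key = None  # protein whose GO lines are currently being collected
with open('file1.txt') as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not line[0].isspace():
            # Non-indented line: a new protein record starts here.
            # Reset key so GO lines of unlisted proteins are not misattributed.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            d[key].append(line.strip())

pp.pprint(d)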
Since you said you are not too concerned about the format, I used the pprint module to print the dictionary. The actual output is:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
                 'GO:0008135 translation factor activity, nucleic acid binding',
                 'GO:0006413 translational initiation',
                 'GO:0006412 translation',
                 'GO:0044260 cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
                       'GO:0033989 3alpha,7alpha,']}
Instead of using pprint, you can also use a loop to get output closer to what the question specifies:
with open('out.txt', 'w') as out:
    # items() works in both Python 2 and Python 3 (iteritems() is Python 2 only).
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))
out.txt:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
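Note that the proteins above come out in arbitrary order, because plain dictionaries had no guaranteed order before Python 3.7. If the order matters, or if you want the layout from the question (protein name and first GO entry on one line, further entries indented below), something along these lines might work. This is only a sketch: write_report is a made-up helper, and go_terms and names are stand-ins for the dictionary d and the names read from file2.txt.

```python
import io


def write_report(out, names, go_terms):
    """Write one block per protein, in the order given by names.

    The first GO line shares a line with the protein name (tab-separated);
    further GO lines follow on tab-indented lines. A protein with no GO
    lines is written as a bare name. Names missing from go_terms are skipped.
    """
    for name in names:
        if name not in go_terms:
            continue  # listed in file2.txt but not found in file1.txt
        terms = go_terms[name]
        if not terms:
            out.write(name + "\n")
            continue
        out.write("{}\t{}\n".format(name, terms[0]))
        for term in terms[1:]:
            out.write("\t{}\n".format(term))


# Stand-in data for illustration; in practice use the dictionary d and the
# protein names in the order they appear in file2.txt.
go_terms = {
    "AT5G54940.1": [
        "GO:0003743 translation initiation factor activity",
        "GO:0006412 translation",
    ],
    "GRMZM2G158629_P02": [
        "GO:0016491 oxidoreductase activity",
        "GO:0033989 3alpha,7alpha,",
    ],
    "OS08T0174000-01": [],
}
names = ["GRMZM2G158629_P02", "AT5G54940.1", "OS05T0566300-01", "OS08T0174000-01"]

buf = io.StringIO()  # any writable text file object works here
write_report(buf, names, go_terms)
print(buf.getvalue(), end="")
```

Driving the loop from the file2.txt order rather than from the dictionary keeps the output stable across Python versions and makes it easy to spot which requested proteins were not found.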