Finding common elements between two files
I have two different files, shown below. file1.txt is tab-delimited:
AT5G54940.1 3182
pfam
PF01253 SUI1#Translation initiation factor SUI1
mf
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
bp
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
pfam
PF01575 MaoC_dehydratas#MaoC like domain
mf
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
and file2.txt, which contains different protein names:
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
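I need to run a program that finds the protein names from file2 in file1 and also prints all the "GO:" lines associated with each protein, if there are any. The hard part for me is parsing the first file; the format is odd. I tried something like the following, but other approaches would be greatly appreciated: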
import re

with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)
Desired output (any format is fine, as long as I get all the corresponding GO terms):
AT5G54940.1 GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01
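This is just to give you an idea of how to approach the problem. The "groups" in the input file that belong to one protein are delimited by a change from an indented line to a non-indented line. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines may be GO: lines.
You can detect indentation with if line.startswith(" ") (depending on the format of the input file, you may need to look for "\t" instead of " ").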
def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein group."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            # Use "\t" instead of " " if the file is indented with tabs.
            current_indented = line.startswith(" ")
            # An indented line followed by a non-indented one ends a group.
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented


with open('file2.txt') as f:
    look_for_proteins = set(line.strip() for line in f)

for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)
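Here, a chunk is nothing more than a list of stripped lines. I use a generator to extract the protein chunks from the input file. As you can see, the logic is based only on the transition from an indented line to a non-indented line.
With the generator you can process the data however you like. I simply print it, but you may want to put the data into a dictionary and analyze it further.
Output: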
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
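Note that the generator above only yields a chunk when a non-indented line follows an indented one, so the last group in the file (here OS08T0174000-01) is never emitted, and two consecutive header lines end up in the same chunk. The following is a minimal sketch of a variant that starts a new group at every non-indented line and also yields the final group. The name iter_protein_groups and the in-memory sample lines are made up for illustration, and the sketch assumes that only protein header lines start without whitespace (space or tab):

```python
def iter_protein_groups(lines):
    """Yield one list of stripped lines per protein group."""
    group = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # A line that does not start with a space or tab is treated as a header.
        if not line[0].isspace() and group:
            yield group
            group = []
        group.append(line.strip())
    if group:
        yield group  # the last group has no header line after it


# Hypothetical sample data; with real files, pass the open file object instead.
sample = [
    "AT5G54940.1\t3182\n",
    "\tmf\n",
    "\tGO:0003743 translation initiation factor activity\n",
    "OS08T0174000-01\t560919\n",
]
wanted = {"AT5G54940.1", "OS08T0174000-01"}

go_terms = {}
for group in iter_protein_groups(sample):
    name = group[0].split()[0]
    if name in wanted:
        go_terms[name] = [l for l in group[1:] if l.startswith("GO:")]

print(go_terms)
```

With this variant, a protein that has no GO lines still appears in the result, with an empty list.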
One option is to build a dictionary of lists, using the protein name as the key:
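#!/usr/bin/env python
import pprint

pp = pprint.PrettyPrinter()

with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)

d = {}
key = None  # protein whose GO lines are currently being collected
with open('file1.txt') as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not line[0].isspace():
            # Assumes protein header lines are the only non-indented lines.
            # Resetting key here keeps GO lines of unwanted proteins out.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            d[key].append(line.strip())

pp.pprint(d)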
Since you said you are not too worried about the format, I used the pprint module to print the dictionary. The actual output is:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
'GO:0008135 translation factor activity, nucleic acid binding',
'GO:0006413 translational initiation',
'GO:0006412 translation',
'GO:0044260 cellular macromolecule metabolic process'],
'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
'GO:0033989 3alpha,7alpha,']}
Instead of pprint, you can also use a loop to get the output specified in the question:
with open('out.txt', 'w') as out:
    # items() works on both Python 2 and 3; iteritems() is Python 2 only.
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))
out.txt:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
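If a layout closer to the one in the question is preferred, with the protein name and GO term on the same tab-separated line, a sketch along the following lines might work. The function name format_report is invented for this example, the dictionary is a shortened stand-in for the one built above, and the sketch assumes that proteins without GO terms should still be listed on their own line:

```python
def format_report(names, go_by_protein):
    """Return report rows: 'protein<TAB>GO line', or just the protein name."""
    rows = []
    for name in names:
        if name not in go_by_protein:
            continue  # listed in file2 but never seen in file1
        terms = go_by_protein[name]
        if not terms:
            rows.append(name)  # protein found, but it has no GO lines
        for term in terms:
            rows.append(name + "\t" + term)
    return rows


# Hypothetical stand-ins for the contents of file2.txt and the dictionary d.
names = ["GRMZM2G158629_P02", "AT5G54940.1", "OS05T0566300-01", "OS08T0174000-01"]
d = {
    "AT5G54940.1": ["GO:0006413 translational initiation",
                    "GO:0006412 translation"],
    "GRMZM2G158629_P02": ["GO:0016491 oxidoreductase activity"],
    "OS08T0174000-01": [],
}

for row in format_report(names, d):
    print(row)
```

Iterating over the names from file2 rather than over the dictionary keeps the rows in the order of file2, regardless of how the dictionary orders its keys.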