How to remove the redundant data from a text file
I have calculated the distances between pairs of atoms and saved them in an out.txt file. The generated file looks like this:
N_TYR_A0002 O_CYS_A0037 6.12
O_CYS_A0037 N_TYR_A0002 6.12
N_ALA_A0001 O_TYR_A0002 5.34
O_TYR_A0002 N_ALA_A0001 5.34
My output file has duplicates: pairs of lines that describe the same two atoms and the same distance.
How can I remove the redundant lines?
I used this program for the distance calculation (all atoms):
from __future__ import division
from string import *
from numpy import *
import math   # needed for math.sqrt below

def eudistance(c1, c2):
    x_dist = (c1[0] - c2[0])**2
    y_dist = (c1[1] - c2[1])**2
    z_dist = (c1[2] - c2[2])**2
    return math.sqrt(x_dist + y_dist + z_dist)

infile = open('file.pdb', 'r')
text = infile.read().split('\n')
infile.close()
text.remove('')
pdbid = []
#define the pdbid (filled elsewhere with (name, coordinates) entries)
spfcord = []
for g in pdbid:
    ratom = g[0]
    ratm1 = ratom.split('_')
    ratm2 = ratm1[0]
    if ratm2 in allatoms:
        spfcord.append(g)
#print spfcord[:10]
outfile1 = open('pairdistance.txt', 'w')
for m in spfcord:
    name1 = m[0]
    cord1 = m[1]
    for n in spfcord:
        if n != '':
            name2 = n[0]
            cord2 = n[1]
            dist = eudistance(cord1, cord2)   # was euDist, which is not defined
            if 7 > dist > 2:
                #print name1, '\t', name2, '\t', dist
                distances = name1 + '\t ' + name2 + '\t ' + str(dist)
                #print distances
                outfile1.write(distances)
                outfile1.write('\n')
outfile1.close()
If you don't care about the order:
def remove_duplicates(input_file):
    with open(input_file) as fr:
        unique = {'\t'.join(sorted([a1, a2]) + [d])
                  for a1, a2, d in (line.strip().split() for line in fr)}
        for item in unique:
            yield item

if __name__ == '__main__':
    for line in remove_duplicates('out.txt'):
        print line
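For reference, the same idea can be sketched in Python 3 without touching the file system, keying each line on the unordered atom pair (the sample lines are taken from the question; using a frozenset as the key is one choice among several):

```python
def remove_duplicates(lines):
    """Yield one line per unordered (atom_1, atom_2, distance) combination."""
    seen = set()
    for line in lines:
        a1, a2, d = line.split()
        key = (frozenset((a1, a2)), d)   # order-independent key for the pair
        if key not in seen:
            seen.add(key)
            yield line

lines = [
    "N_TYR_A0002 O_CYS_A0037 6.12",
    "O_CYS_A0037 N_TYR_A0002 6.12",
    "N_ALA_A0001 O_TYR_A0002 5.34",
    "O_TYR_A0002 N_ALA_A0001 5.34",
]
print(list(remove_duplicates(lines)))
```

Unlike the set-comprehension version, this keeps the first-seen ordering of the lines.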
But it would be better simply to check name1 < name2 in your script, before computing the distance and writing the data.
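That check can be sketched as follows (a self-contained sketch: the eudistance helper and the sample spfcord data are stand-ins for the question's script, which builds spfcord from the PDB file):

```python
import math

def eudistance(c1, c2):
    """Euclidean distance between two 3D coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# Hypothetical stand-in for the question's spfcord: (name, coordinates) pairs.
spfcord = [
    ("N_TYR_A0002", (0.0, 0.0, 0.0)),
    ("O_CYS_A0037", (6.12, 0.0, 0.0)),
    ("N_ALA_A0001", (0.0, 5.34, 0.0)),
]

pairs = []
for name1, cord1 in spfcord:
    for name2, cord2 in spfcord:
        if name1 < name2:   # visit each unordered pair exactly once
            dist = eudistance(cord1, cord2)
            if 7 > dist > 2:
                pairs.append((name1, name2, round(dist, 2)))

print(pairs)
```

Because atom names are unique here, name1 < name2 also skips the self-pairing (name1 == name2) for free.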
Well, I had an idea. I'm not pretending this is the best or cleanest method, but it was fun.
import numpy as np
from StringIO import StringIO

data_in_file = """
N_TYR_A0002, O_CYS_A0037, 6.12
N_ALA_A0001, O_TYR_A0002, 5.34
P_CUC_A0001, N_TYR_A0002, 9.56
O_TYR_A0002, N_ALA_A0001, 5.34
O_CYS_A0037, N_TYR_A0002, 6.12
N_TYR_A0002, P_CUC_A0001, 9.56
"""

# Import data using numpy; any method is okay really, as we don't rely on the data being arrays
data_in_array = np.genfromtxt(StringIO(data_in_file), delimiter=",", autostrip=True,
                              dtype=[('atom_1', 'S12'), ('atom_2', 'S12'), ('distance', '<f8')])

N = len(data_in_array['distance'])
pairs = []
# For each item find the index of its repeat
for index, a1, a2 in zip(range(N), data_in_array['atom_1'], data_in_array['atom_2']):
    repeat_index = list((data_in_array['atom_2'] == a1) * (data_in_array['atom_1'] == a2)).index(True)
    pairs.append(sorted([index, repeat_index]))

# Each item is repeated, so sort and remove every other one
unique_indexs = [item[0] for item in sorted(pairs)[0:N:2]]

atom_1 = data_in_array['atom_1'][unique_indexs]
atom_2 = data_in_array['atom_2'][unique_indexs]
distance = data_in_array['distance'][unique_indexs]

for i in range(N/2):
    print atom_1[i], atom_2[i], distance[i]
#Prints
N_TYR_A0002 O_CYS_A0037 6.12
N_ALA_A0001 O_TYR_A0002 5.34
P_CUC_A0001 N_TYR_A0002 9.56
I should add that this assumes every pair appears exactly twice; a pair with no partner would break the code, though that could be handled with exception handling.
Note that I also changed the input data to use "," as the delimiter, and added another pair to make sure the ordering doesn't break the code.
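The unpaired case mentioned above can indeed be caught with exception handling, since list.index raises ValueError when no reversed entry exists. A minimal pure-Python sketch of that idea (without the numpy arrays; the sample rows, including the deliberately unpaired one, are assumptions for illustration):

```python
rows = [
    ("N_TYR_A0002", "O_CYS_A0037", 6.12),
    ("O_CYS_A0037", "N_TYR_A0002", 6.12),
    ("P_CUC_A0001", "N_ALA_A0001", 9.56),  # no reversed partner, on purpose
]

pairs = []
for index, (a1, a2, d) in enumerate(rows):
    # True where another row lists the same atoms in reversed order
    matches = [r1 == a2 and r2 == a1 for r1, r2, _ in rows]
    try:
        repeat_index = matches.index(True)
        pairs.append(sorted([index, repeat_index]))
    except ValueError:
        # unpaired row: keep it, paired with itself
        pairs.append([index, index])

# deduplicate the [index, repeat_index] pairs themselves
unique_indexes = sorted({tuple(p) for p in pairs})
print(unique_indexes)
```

The first element of each tuple in unique_indexes then points at one representative row per pair.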
Let's try to avoid generating the duplicates in the first place. Change this part of the code -
outfile1 = open('pairdistance.txt', 'w')
length = len(spfcord)
for i, m in enumerate(spfcord):
    name1 = m[0]
    cord1 = m[1]
    for n in islice(spfcord, i+1, length):
and add the import:
from itertools import islice
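Filled out, the whole loop then looks like this (a self-contained sketch: the eudistance helper and the sample spfcord data are stand-ins for the question's script):

```python
import math
from itertools import islice

def eudistance(c1, c2):
    """Euclidean distance between two 3D coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# Stand-in for the question's (name, coordinates) list.
spfcord = [
    ("N_TYR_A0002", (0.0, 0.0, 0.0)),
    ("O_CYS_A0037", (6.12, 0.0, 0.0)),
    ("N_ALA_A0001", (0.0, 5.34, 0.0)),
]

results = []
length = len(spfcord)
for i, (name1, cord1) in enumerate(spfcord):
    # islice starts after position i, so each unordered pair is visited once
    for name2, cord2 in islice(spfcord, i + 1, length):
        dist = eudistance(cord1, cord2)
        if 7 > dist > 2:
            results.append("%s\t%s\t%.2f" % (name1, name2, dist))

print(results)
```

Since the inner loop only ever sees atoms that come after position i, no (A, B) / (B, A) duplicate can be written, and no post-processing of the output file is needed.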