How to compare a string in one line with a string in the next line?
I have a file with about 16,000 lines in it. All of them have the same format. Here is a simple example of it:
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
<...>
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
I need to check whether the lines that contain the string DPPC and have the identifier 18 form a 50-line block before the identifier switches to 19, and so on.
So far, I have the following code:
cnt = 0
with open('test_file.pdb') as f1:
    with open('out','a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if "DPPC" in line:
                A = line.strip()[22:26]
                if A[i] == A [i+1]:
                    cnt = cnt + 1
                elif A[i] != A[i+1]:
                    cnt = 0
And here I am stuck. I found some examples of how to compare subsequent lines, but a similar approach did not work here. I still cannot figure out how to compare the value of A in line[i] with the value of A in line[i+1].
Try this (explanations in the comments).
data = """ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00"""
# The last code seen in the 5th column.
code = None
# The count of lines of the current code.
count = 0
for line in data.split("\n"):
    # Get the 5th column.
    c = line.split()[4]
    # The code in the 5th column changed.
    if c != code:
        # If we aren't at the start of the data, print the count
        # for the code that just ended.
        if code:
            print("{}: {}".format(code, count))
        # Remember the new code and restart the count for it.
        code = c
        count = 0
    # Count the line.
    count = count + 1
# Print the count for the last code.
print("{}: {}".format(code, count))
Output:
18: 9
19: 10
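The same run counting can also be written with itertools.groupby from the standard library, which groups consecutive items that share a key. This is only a minimal sketch: the function name block_sizes, the sample lines, and the expected value are made up for illustration, and it assumes the identifier is always the 5th whitespace-separated column.

```python
from itertools import groupby


def block_sizes(lines, column=4):
    """Return (identifier, run_length) for each run of consecutive lines
    that share the same value in the given whitespace-separated column."""
    rows = (line.split() for line in lines if line.strip())
    return [(key, sum(1 for _ in group))
            for key, group in groupby(rows, key=lambda fields: fields[column])]


# Hypothetical sample data, shortened for illustration.
sample = [
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 140 C2 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00",
]

sizes = block_sizes(sample)
print(sizes)

# Flag every block whose length differs from the expected one.
expected = 2  # assumption for this tiny sample; the question would use 50
bad = [(key, n) for key, n in sizes if n != expected]
print(bad)
```

For the real file you would pass the open file object instead of sample and set expected to 50; an empty bad list would then mean every identifier forms a 50-line block.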
Since your data appears to consist of fixed-width fields in fixed-width records, you can use the struct module to quickly break each line up into individual fields.
Parsing all the fields of each line may be overkill when you only need to process one of them, but I'm doing it this way to illustrate how it's done in case you need to do other processing. Using the struct module makes it relatively fast in any case.
Let's say the input file consisted of only the following lines of data:
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
All you need to do is remember what the value of the field was on the previous line so that it can be compared with the current one. To start the process, the first line has to be read and parsed separately, so there's a prev value to compare with on subsequent lines. Also note that the 5th field is the one indexed by [4] because the first starts at [0].
import struct
# negative widths represent ignored padding fields
fieldwidths = 4, -4, 3, -2, 2, -2, 4, -3, 2, -6, 6, -2, 6, -2, 6, -2, 4, -2, 4
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from  # a function to split line up into fields
# struct operates on bytes, so the file is opened in binary mode (Python 3)
with open('test_file.pdb', 'rb') as f1:
    prev = parse(next(f1))[4].decode()  # remember value of fifth field
    cnt = 1
    for line in f1:
        curr = parse(line)[4].decode()  # get value of fifth field
        if curr == prev:  # same as last one?
            cnt += 1
        else:
            print('{} occurred {} times'.format(prev, cnt))
            prev = curr
            cnt = 1
    print('{} occurred {} times'.format(prev, cnt))  # for last line
Output:
18 occurred 9 times
19 occurred 7 times
20 occurred 3 times
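Since the question literally asks how to compare the value on line i with the value on line i+1, here is a small sketch of that pairwise comparison using zip(ids, ids[1:]). The function name block_lengths and the sample lines are hypothetical, and the sketch assumes the identifier can be taken as the 5th whitespace-separated column rather than from a fixed character slice.

```python
def block_lengths(lines, column=4, keyword="DPPC"):
    """Return (identifier, length) for each consecutive block of keyword lines."""
    # Identifier of every line that mentions the keyword.
    ids = [line.split()[column] for line in lines if keyword in line]
    if not ids:
        return []
    # zip(ids, ids[1:]) pairs each identifier with the one on the next line;
    # index i + 1 starts a new block whenever the two differ.
    starts = [0] + [i + 1 for i, (a, b) in enumerate(zip(ids, ids[1:])) if a != b]
    ends = starts[1:] + [len(ids)]
    return [(ids[s], e - s) for s, e in zip(starts, ends)]


# Hypothetical sample data, shortened for illustration.
sample = [
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 140 C2 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 141 C3 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00",
    "ATOM 190 C2 DPPC 19 23.050 20.800 11.000 1.00 0.00",
]

print(block_lengths(sample))
```

Comparing each returned length against 50 then tells you which identifiers do not form a full block. Building the list of identifiers first avoids indexing past the end of the file, which is the problem with looking at line[i+1] inside the loop.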
You can also easily solve this with a parallel list:
# Read all non-empty lines.
data = []
with open('data.txt', 'r') as datafile:
    for line in datafile:
        line = line.strip()
        if line:
            data.append(line)
# Collect each distinct identifier (5th column) in order of first appearance.
keywordList = []
for line in data:
    line = line.split()
    if line[4] not in keywordList:
        keywordList.append(line[4])
# Count how many lines carry each identifier (total, not per consecutive run).
counterList = []
for item in keywordList:
    counter = 0
    for line in data:
        line = line.split()
        if line[4] == item:
            counter += 1
    counterList.append(counter)
for i in range(len(keywordList)):
    print("%s: %d" % (keywordList[i], counterList[i]))
But if you are familiar with dict, I would go with Lutz's answer.
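For readers who have not used a dict for this kind of counting, a minimal sketch might look like the following. It is not taken from any other answer in this thread; the function name count_by_identifier and the sample lines are invented for illustration. Like the parallel-list version, it counts the total number of lines per identifier, not the length of each consecutive run.

```python
def count_by_identifier(lines, column=4):
    """Count how many lines carry each value in the given column."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= column:
            continue  # skip blank or short lines
        key = fields[column]
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample data, shortened for illustration.
sample = [
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 140 C2 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00",
]

for key, n in count_by_identifier(sample).items():
    print("%s: %d" % (key, n))
```

This reads the data once instead of once per identifier, and on Python 3.7 or later the dict keeps the identifiers in the order they first appear.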