
How to compare a string in one line with the string in the next line?

I have a file with about 16,000 lines in it. All of them have the same format. Here is a simple example of it:

ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00

<...>

ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00

I need to check whether the lines that contain the string DPPC and have identifier 18 form a 50-line block before the identifier switches to 19, and so on.

So for now, I have the following code:

cnt = 0
with open('test_file.pdb') as f1:
    with open('out','a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
             if "DPPC" in line:
                   A = line.strip()[22:26]
                   if A[i] == A [i+1]:
                       cnt = cnt + 1
                   elif A[i] != A[i+1]:
                       cnt = 0

And here I am stuck. I found some examples of how to compare subsequent lines, but a similar approach did not work here. I still cannot figure out how to compare the value of A in line[i] with the value of A in line[i+1].

Try this (explanations in the comments).

data = """ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00"""

# The last code seen in the 5th column.
code = None

# The count of lines of the current code.
count = 0

for line in data.split("\n"):
    # Get the 5th column.
    c = line.split()[4]

    # The code in the 5th column changed.
    if c != code:
        # If we aren't at the start of the file, print the count
        # for the code that just ended.
        if code:
            print("{}: {}".format(code, count))

        # Remember the new code and restart the count for it.
        code = c
        count = 0

    # Count the line
    count = count + 1

# Print the count for the last code.
print("{}: {}".format(code, count))

Output:

18: 9
19: 10
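
The same run counting can also be expressed with itertools.groupby from the standard library, which groups consecutive items that share a key. This is only a sketch of an alternative, not part of the answer above; the `run_lengths` helper and the shortened `sample` list are made up for illustration, and the code assumes every line has at least five whitespace-separated columns.

```python
from itertools import groupby

# Hypothetical sample: three lines with identifier 18, then two with 19.
sample = [
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
]


def run_lengths(lines):
    """Return (identifier, count) for each run of consecutive equal identifiers."""
    # groupby starts a new group every time the key (5th column) changes.
    groups = groupby(lines, key=lambda line: line.split()[4])
    return [(code, sum(1 for _ in group)) for code, group in groups]


for code, count in run_lengths(sample):
    print("{}: {}".format(code, count))
```

Because groupby only merges adjacent items, an identifier that reappears later in the file would show up as a second, separate run instead of being added to the first.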

Since your data appears to be fixed-width fields in fixed-width records, you can use the struct module to quickly break each line up into individual fields.

Parsing all the fields of each line may be overkill when you only need to process one of them, but I'm doing it the way shown to illustrate how it's done in case you need to do other processing, and using the struct module makes it relatively fast in any case.

Let's say the input file consisted of only the following lines of data:

ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00
ATOM    189  C1  DPPC   20      23.050  20.800  11.000  1.00  0.00

All you need to do is remember what the value of the field was on the previous line so it can be compared with the current one. To start the process, the first line has to be read and parsed separately, so there's a prev value to compare with on subsequent lines. Also note that the 5th field is the one indexed by [4] because the first starts at [0].

import struct

# negative widths represent ignored padding fields
fieldwidths = 4, -4, 3, -2, 2, -2, 4, -3, 2, -6, 6, -2, 6, -2, 6, -2, 4, -2, 4
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from  # a function to split line up into fields

# Open in binary mode: on Python 3, struct unpacks bytes, not str.
with open('test_file.pdb', 'rb') as f1:
    prev = parse(next(f1))[4].decode()  # remember value of fifth field
    cnt = 1
    for line in f1:
        curr = parse(line)[4].decode()  # get value of fifth field
        if curr == prev:  # same as last one?
            cnt += 1
        else:
            print('{} occurred {} times'.format(prev, cnt))
            prev = curr
            cnt = 1
    print('{} occurred {} times'.format(prev, cnt))  # for last line

Output:

18 occurred 9 times
19 occurred 7 times
20 occurred 3 times
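
The question asked specifically whether each identifier forms a block of 50 lines. One way to turn the previous-value comparison into that check might look like the sketch below. The function name `find_bad_blocks`, its parameters, and the generated test data are all hypothetical. It uses str.split() instead of struct for brevity, and it assumes that lines for other residues should simply be skipped without interrupting a run.

```python
def find_bad_blocks(lines, resname="DPPC", expected=50):
    """Return (identifier, length) for every run whose length != expected."""
    bad = []
    prev = None  # identifier seen on the previous matching line
    cnt = 0      # length of the current run
    for line in lines:
        fields = line.split()
        # Skip blank lines and lines that belong to other residues.
        if len(fields) < 5 or fields[3] != resname:
            continue
        curr = fields[4]
        if curr == prev:
            cnt += 1
        else:
            # The identifier changed: check the run that just ended.
            if prev is not None and cnt != expected:
                bad.append((prev, cnt))
            prev, cnt = curr, 1
    # Check the final run, which no identifier change follows.
    if prev is not None and cnt != expected:
        bad.append((prev, cnt))
    return bad


# Hypothetical data: 50 lines for identifier 18, then only 49 for 19.
template = "ATOM    139  C1  DPPC   {:2d}      17.250  58.420  10.850  1.00  0.00"
lines = [template.format(18)] * 50 + [template.format(19)] * 49
print(find_bad_blocks(lines))
```

An empty list means every block had the expected length. To run it on a real file, the open file object can be passed directly as `lines`, since iterating over it yields one line at a time.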

You can also easily solve this with parallel lists:

# Read the non-empty lines of the file.
data = []
with open('data.txt', 'r') as datafile:
    for line in datafile:
        line = line.strip()
        if line:
            data.append(line)

# Collect each distinct identifier (5th column) in order of first appearance.
keywordList = []
for line in data:
    fields = line.split()
    if fields[4] not in keywordList:
        keywordList.append(fields[4])

# Count how many lines carry each identifier.
counterList = []
for item in keywordList:
    counter = 0
    for line in data:
        if line.split()[4] == item:
            counter += 1
    counterList.append(counter)

for i in range(len(keywordList)):
    print("%s: %d" % (keywordList[i], counterList[i]))

But if you are familiar with dict, I would go with Lutz's answer.
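
The dict-based answer referred to above is not reproduced on this page, so the following is only a rough sketch of what a dict approach could look like, not that answer's code. The `count_identifiers` helper and the `sample` list are invented for illustration. Note that it counts the total number of lines per identifier, so it cannot tell whether those lines are contiguous.

```python
def count_identifiers(lines):
    """Map each identifier (5th column) to the number of lines carrying it."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:  # skip blank or malformed lines
            continue
        counts[fields[4]] = counts.get(fields[4], 0) + 1
    return counts


# Hypothetical sample: two lines with identifier 18, one with 19.
sample = [
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    139  C1  DPPC   18      17.250  58.420  10.850  1.00  0.00",
    "ATOM    189  C1  DPPC   19      23.050  20.800  11.000  1.00  0.00",
]

for code, n in count_identifiers(sample).items():
    print("{}: {}".format(code, n))
```

A single pass over the data is enough here, whereas the parallel-list version rescans all lines once per distinct identifier.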
