Python: search a file for a string and write the whole line plus the next line to a new text file

Question

I have a very large text file (50,000+ lines) whose lines should always appear in the same sequence. In Python, I want to search the text file for each of the $INGGA lines and join each one with the subsequent $INHDT line to create a new text file. I need to do this without reading the whole file into memory, as that causes a crash every time. I can find and return the $INGGA line, but I'm not sure of the best memory-efficient way to then get the next line and join the two into a new string.

Thanks

Phil

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E $INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50 $INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F $INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C $INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D $INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D $INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61 $INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A $INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B $INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68 $INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58 $INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79 $INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D $INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56 $INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77 $INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDI
D,2.09,0.02,42.42*40 $INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F $INHDT,42.4,T*17

Answer 1

You can read the file one line at a time and write the lines you need to a new file, like this:

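import re

# Open the new file for appending ('at' = append, text mode)
with open('newfile', 'at') as nf:
    # Open the source file for reading ('rt' = read, text mode);
    # iterating over it yields one line at a time
    with open('file', 'rt') as f:
        for line in f:
            # re.match only matches at the start of the line
            if re.match(r'\$INGGA', line):
                nf.write(line)
                # Write the line that follows the $INGGA line, assumed to be
                # its $INHDT line ('' if $INGGA was the last line in the file)
                nf.write(next(f, ''))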

The mode 'at' opens a file for appending text, and 'rt' opens a file for reading text.

I tested this on a 150,000-line file and it runs well.
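The question asks for each $INGGA line to be joined with the $INHDT line after it, rather than written as two separate lines. A minimal sketch of that, assuming the two sentences always sit on consecutive lines and that a single space is an acceptable separator. The function name join_ingga_with_next and the sep parameter are illustrative and not part of the answer above:

```python
import io


def join_ingga_with_next(lines, sep=" "):
    """Yield each $INGGA line joined with the line that follows it.

    `lines` is any iterable of text lines, for example an open file,
    so only a couple of lines are held in memory at any time.
    """
    lines = iter(lines)
    for line in lines:
        if line.startswith("$INGGA"):
            # Pull the following line from the same iterator;
            # '' is used if $INGGA was the last line of the input.
            following = next(lines, "")
            yield line.rstrip("\r\n") + sep + following.rstrip("\r\n") + "\n"


# Small in-memory demonstration using sentences from the question.
sample = io.StringIO(
    "$PRDID,2.15,-0.10,31.87*6E\n"
    "$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50\n"
    "$INHDT,31.9,T*1E\n"
    "$INZDA,091124.0055,06,05,2016,,*7F\n"
)
joined = list(join_ingga_with_next(sample))
```

With real files, pass the open input file to the function and write each yielded string to the output file.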

Answer 2

I suggest using a simple regex that parses and captures the parts you care about. Here is an example:

(\$INGGA.*\n\$INHDT.*\n)

https://regex101.com/r/tK1hF0/3

As you can see at the link above, I used the "global" g setting on the regex, which tells it to capture all groups that match. Otherwise, it stops after the first match.

I also had trouble determining where the actual line breaks are in your example file, so you may need to tweak the pattern to match exactly where the breaks occur.

Here is some starter Python example code:

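import re

# Load your file here ('file' is a placeholder name).
# Note: f.read() pulls the whole file into memory.
with open('file', 'r') as f:
    test_str = f.read()

# Capture each $INGGA line together with the $INHDT line right after it
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = p.findall(test_str)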
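Because the question rules out reading the whole file into memory, one possible variation is to run the same pattern over a memory-mapped file, so that the operating system pages the data in on demand instead of Python building one large string. This is a swapped-in technique, not what the answer above does. The sketch below assumes the sentences are newline-separated, and the name copy_pairs and the temporary file names are illustrative:

```python
import mmap
import os
import re
import tempfile

# A bytes pattern is needed because a memory-mapped file exposes bytes.
PAIR = re.compile(rb"(\$INGGA.*\n\$INHDT.*\n)")


def copy_pairs(src_path, dst_path):
    """Write every $INGGA/$INHDT pair found in src_path to dst_path.

    Returns the number of pairs written.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # finditer yields matches one at a time instead of a full list.
            for match in PAIR.finditer(mm):
                dst.write(match.group(1))
                count += 1
    return count


# Demonstration on a small temporary file (contents taken from the question).
sample = (
    b"$PRDID,2.15,-0.10,31.87*6E\n"
    b"$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50\n"
    b"$INHDT,31.9,T*1E\n"
    b"$INZDA,091124.0055,06,05,2016,,*7F\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "log.txt")
    dst_file = os.path.join(tmp, "pairs.txt")
    with open(src_file, "wb") as f:
        f.write(sample)
    n_pairs = copy_pairs(src_file, dst_file)
    with open(dst_file, "rb") as f:
        result = f.read()
```

Note that mmap raises an error for an empty file, so a real script would need to handle that case separately.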

Answer 3

In the example PuTTY log you give, everything is on one line, separated by spaces. In that case you can replace each space with a newline and write the result to a new file:

cat large_file | sed 's/ /\n/g' > new_large_file

To iterate over the newline-separated file, run this:

cat new_large_file | python your_script.py

Your script receives the input line by line, so your computer should not crash.

your_script.py:

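import sys

# Most recent $INGGA sentence that has not been paired yet
INGGA_line = ""

# Read stdin one line at a time so the whole file is never held in memory
for line in sys.stdin:
    line_stripped = line.strip()
    if line_stripped.startswith("$INGGA"):
        INGGA_line = line_stripped
    elif line_stripped.startswith("$INHDT") and INGGA_line:
        # Join the stored $INGGA line with the $INHDT line that follows it
        print(INGGA_line, line_stripped)
        INGGA_line = ""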
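If sed is not available, the space-to-newline step can also be approximated in Python by reading the log in fixed-size chunks and splitting on whitespace. This is only a sketch, assuming that no sentence contains internal whitespace. The names iter_sentences and join_pairs and the chunk size are illustrative choices:

```python
import io


def iter_sentences(fileobj, chunk_size=64 * 1024):
    """Yield whitespace-separated tokens from fileobj, reading in chunks.

    Only one chunk plus one partial token is held in memory at a time,
    so this also works when the whole log is a single very long line.
    """
    leftover = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split()
        # If the chunk ended in the middle of a token, keep that tail
        # and complete it with the next chunk.
        if not chunk[-1].isspace() and parts:
            leftover = parts.pop()
        else:
            leftover = ""
        for part in parts:
            yield part
    if leftover:
        yield leftover


def join_pairs(tokens):
    """Yield '$INGGA... $INHDT...' strings from a stream of sentences."""
    ingga = None
    for token in tokens:
        if token.startswith("$INGGA"):
            ingga = token
        elif token.startswith("$INHDT") and ingga is not None:
            yield ingga + " " + token
            ingga = None


# Demonstration on a shortened, single-line log held in memory.
log = io.StringIO(
    "$PRDID,2.15,-0.10,31.87*6E "
    "$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50 "
    "$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F"
)
pairs = list(join_pairs(iter_sentences(log, chunk_size=16)))
```

The small chunk_size in the demonstration is there only to exercise the case where a sentence straddles two chunks.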

Answer 4

This answer is aimed at Python 3.

According to this other answer (and the docs), you can iterate over your file line by line in a memory-efficient way:

with open(filename, 'r') as f:
    for line in f:
        ...  # process the line here

An example of how you could fulfill your criteria:

# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True

If there are multiple '$INGGA,' lines before an '$INHDT,' line, this grabs the first one and disregards the rest. If you want only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', write both.
If you meant that you want a separate new file for each INGGA-INHDT pair, the with-statement for the target file should be nested inside the for line in sf loop instead, or the results should be buffered in a list for later storage.
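The "keep only the last '$INGGA,'" variant described above could look roughly like this. It is a sketch only: the function name write_last_ingga_pairs is made up for illustration, and it accepts any iterable of lines and any writable text file object so that it can be tried without real files:

```python
import io


def write_last_ingga_pairs(source, target):
    """Copy INGGA/INHDT pairs from source to target, keeping the last INGGA.

    source is an iterable of lines (for example an open file) and
    target is a writable text file object.
    """
    last_ingga = None
    for line in source:
        if line.startswith('$INGGA,'):
            # Remember this line, overwriting any earlier unpaired $INGGA.
            last_ingga = line
        elif line.startswith('$INHDT,') and last_ingga is not None:
            # Pair found: write both lines and start looking again.
            target.write(last_ingga)
            target.write(line)
            last_ingga = None


# Demonstration with in-memory data: two $INGGA lines precede one $INHDT.
demo_source = [
    '$INGGA,091124.00,first\n',
    '$INGGA,091124.20,second\n',
    '$INHDT,34.1,T*13\n',
    '$INZDA,091124.2055,06,05,2016,,*7D\n',
]
demo_target = io.StringIO()
write_last_ingga_pairs(demo_source, demo_target)
demo_output = demo_target.getvalue()
```

With real files, open the source and target as in the example above and pass the two file objects to the function.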

Refer to the docs for introductions to with-statements and file reading/writing.

4 answers

Answer 1: score 2, accepted, 2016-06-14 08:45:55
Answer 2: score 0, 2016-06-14 08:46:35
Answer 3: score 0, 2016-06-14 08:48:42
Answer 4: score 0, 2016-06-14 09:19:19