
Deleting prior two lines of a text file in Python where certain characters are found

I have a text file that is an ebook. Sometimes a sentence is broken up by two newlines, and I am trying to get rid of these extra newlines so that sentences are not split in the middle. The file looks like:

Here is a regular sentence.
This one is fine too.
However, this

sentence got split up.

If I hit delete twice on the keyboard, it'd fix it. Here's what I have so far:

with open("i.txt","r") as input:
    with open("o.txt","w") as output: 
        for line in input:
            line = line.strip()

            if line[0].isalpha() == True and line[0].isupper() == False:  
                # something like hitting delete twice on the keyboard
                output.write(line + "\n")
            else:
                output.write(line + "\n")

Any help would be greatly appreciated!

If you can read the entire file into memory, then a simple regex will do the trick. It replaces a sequence of newlines preceding a lowercase letter with a single space:

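import re

# Read the whole file into memory.
with open('i.txt') as f:
    text = f.read()

# Replace each run of newlines that precedes a lowercase letter with one space.
fixed = re.sub(r'\n+(?=[a-z])', ' ', text)

with open('o.txt', 'w') as f:
    f.write(fixed)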
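To see what the pattern does without touching any files, it can be applied to the sample text held in a string. This is only a minimal sketch: the names `join_broken_lines` and `sample` are made up for illustration. Note that `[a-z]` matches ASCII lowercase letters only, so a new sentence that happens to start with a lowercase letter would also be joined to the previous line.

```python
import re

def join_broken_lines(text):
    # A run of one or more newlines followed by a lowercase letter is
    # treated as a mid-sentence break and replaced by a single space.
    # The lookahead (?=[a-z]) checks the letter without consuming it.
    return re.sub(r'\n+(?=[a-z])', ' ', text)

sample = (
    "Here is a regular sentence.\n"
    "This one is fine too.\n"
    "However, this\n"
    "\n"
    "sentence got split up.\n"
)

print(join_broken_lines(sample), end="")
```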

The important difference here is that you are not in an editor but writing out lines. That means you can't 'go back' to delete things; you have to recognise that they are wrong before you write them.

In this case, the loop iterates five times. Each time you get a string ending in \n to indicate a newline, and in two of those cases you want to remove that newline. The easiest way I can see to identify those cases is that there is no full stop at the end of the line. If we check for that and strip the newline off in that case, we get the result you want:

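with open("i.txt", "r") as infile, open("o.txt", "w") as outfile:
    for line in infile:
        if line.endswith(".\n"):
            # Complete sentence: keep its newline.
            outfile.write(line)
        elif line.strip():
            # Broken line: drop the newline and join with a space.
            outfile.write(line.rstrip("\n") + " ")
        # Blank lines are skipped entirely.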
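The same end-of-line check can be tried on in-memory text with `io.StringIO`, which also makes the five iterations visible. This is a sketch under assumptions: the function name `join_unfinished_lines` is hypothetical, and this variant adds a space where two lines are joined and skips blank lines so that the words do not run together.

```python
import io

def join_unfinished_lines(infile, outfile):
    for line in infile:
        if line.endswith(".\n"):
            # The line ends a sentence: write it with its newline.
            outfile.write(line)
        elif line.strip():
            # Unfinished line: drop the newline, keep the words apart.
            outfile.write(line.rstrip("\n") + " ")
        # Blank lines fall through and are not written.

sample = (
    "Here is a regular sentence.\n"
    "This one is fine too.\n"
    "However, this\n"
    "\n"
    "sentence got split up.\n"
)

# Iterating over a file-like object yields one string per line.
for line in io.StringIO(sample):
    print(repr(line))

result = io.StringIO()
join_unfinished_lines(io.StringIO(sample), result)
print(result.getvalue(), end="")
```

A line that ends a sentence with `?` or `!` would be treated as unfinished by this check, so real ebook text may need a broader test than a trailing full stop.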

Obviously, sometimes it will not be possible to tell in advance that you need to make a change. In such cases, you will need either to make two passes over the file (the first to find where you want to make changes, the second to make them) or to store part or all of the file in memory until you know what you need to change. Note that if your files are extremely large, storing them in memory could cause problems.
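One way to sketch the "store part of the file" option is to hold back the most recent line, plus a count of any blank lines after it, and write it out only once the next non-blank line shows whether it starts with a lowercase letter. The function and variable names here are hypothetical, and the lowercase test is borrowed from the question's own attempt.

```python
import io

def join_on_lowercase(infile, outfile):
    pending = None  # last non-blank line, held back without its newline
    blanks = 0      # blank lines seen since `pending`
    for raw in infile:
        line = raw.rstrip("\n")
        if not line.strip():
            if pending is None:
                outfile.write("\n")  # leading blank lines pass through
            else:
                blanks += 1
            continue
        if pending is None:
            pending = line
        elif line.lstrip()[0].islower():
            # Continuation: join it to the held line and drop the blanks.
            pending = pending + " " + line.lstrip()
            blanks = 0
        else:
            # A new sentence starts: the held line and its blanks are final.
            outfile.write(pending + "\n" + "\n" * blanks)
            pending = line
            blanks = 0
    if pending is not None:
        outfile.write(pending + "\n" + "\n" * blanks)

sample = "However, this\n\nsentence got split up.\n\nNext paragraph.\n"
result = io.StringIO()
join_on_lowercase(io.StringIO(sample), result)
print(result.getvalue(), end="")
```

Only one logical line is held in memory at a time, so this should cope with large files, and blank lines between paragraphs survive because they are only discarded when a continuation follows.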

You can use fileinput if you just want to remove the lines in the original file:

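from __future__ import print_function  # only needed on Python 2

import fileinput

# With inplace=True, print() output is written back into the file.
for line in fileinput.input("i.txt", inplace=True):
    stripped = line.rstrip()
    if not stripped:
        continue  # drop blank lines so only one space joins the halves
    if stripped.endswith("."):
        print(line, end="")
    else:
        print(stripped, end=" ")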

Output:

Here is a regular sentence.
This one is fine too.
However, this sentence got split up.
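Because `inplace=True` overwrites the original file, it may be worth keeping a copy while experimenting. The sketch below uses the `backup` parameter of `fileinput.input` for that; the function name `join_in_place` and the `.bak` suffix are illustrative choices, not part of the answer above.

```python
import fileinput

def join_in_place(path, backup_suffix=".bak"):
    # With inplace=True, fileinput moves the original file to
    # path + backup_suffix and redirects print() into a new file at `path`.
    # Passing a non-empty backup suffix keeps that copy afterwards.
    with fileinput.input(path, inplace=True, backup=backup_suffix) as lines:
        for line in lines:
            stripped = line.rstrip()
            if not stripped:
                continue  # drop blank lines
            if stripped.endswith("."):
                print(stripped)
            else:
                print(stripped, end=" ")

# Example call (assumes a file named i.txt exists in the current directory):
# join_in_place("i.txt")
```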
