regex: cleaning text: remove everything up to a certain line

I have a text file containing The Tragedie of Macbeth. I want to clean it, and the first step is to remove everything up to the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.

I tried:

import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
    with open('removed_intro_file', 'w') as output:
        removed = re.sub(title, '', removed_intro)
        print(removed)
        output.write(removed)

The print statement doesn't print anything, so apparently it doesn't match anything. How can I use a regex over several lines? Should one instead use pointers that point to the start and end of the lines to be removed? I'd also be glad to know if there is a nicer way to solve this, maybe without using regex.

We can try reading your file line by line until hitting the target line. After that, write all subsequent lines to the output file.

filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    line = file.readline()
    # discard all lines before the Macbeth title; readline() keeps the
    # trailing newline, so strip it before comparing, and stop at EOF ('')
    while line and line.strip() != title:
        line = file.readline()
    rest = file.read()                   # read all remaining lines unchanged
with open('removed_intro_file', 'w') as output:
    output.write(line + rest)            # title line plus everything after it

This approach is probably faster and more efficient than using a regex.
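If the whole file is already in memory as one string, the same idea can be sketched without a loop by using str.partition. This is only a minimal sketch: the helper name drop_intro and the sample text are made up for illustration, and unlike the line-by-line version it matches the title anywhere in the text, not only when it is a whole line.

```python
def drop_intro(text, title):
    """Return text starting at the first occurrence of title (hypothetical helper).

    If title does not occur, the text is returned unchanged.
    """
    before, found, after = text.partition(title)
    if not found:          # partition returns an empty separator when there is no match
        return text
    return found + after   # keep the title itself and everything after it


# Made-up sample standing in for the contents of MacBeth.txt
sample = "Licence\nIntro\nThe Tragedie of Macbeth\nActus Primus.\n"
cleaned = drop_intro(sample, 'The Tragedie of Macbeth')
```

Here cleaned begins with the title line, so it can be written to removed_intro_file as is.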

Your regex only replaces title with ''. You want to remove the title and all the text before it, so match all characters (including newlines) from the beginning of the string up to and including the title. This should work (I only tested it on a sample file I wrote):

removed = re.sub(r'(?s)^.*'+re.escape(title), '', removed_intro)
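One thing to keep in mind with this pattern: .* is greedy, so if the title occurs more than once, everything up to the last occurrence is removed. A small sketch on a made-up sample string (not the real file) shows the difference between the greedy form and a non-greedy .*? variant, which stops at the first occurrence:

```python
import re

title = 'The Tragedie of Macbeth'
# Made-up sample text; the title appears twice on purpose.
sample = ("Some licence text\nMore intro\n"
          "The Tragedie of Macbeth\nActus Primus.\n"
          "The Tragedie of Macbeth ends here\n")

# (?s) makes '.' match newlines too. Greedy '.*' runs to the LAST title.
greedy = re.sub(r'(?s)^.*' + re.escape(title), '', sample)

# Non-greedy '.*?' stops at the FIRST title; count=1 limits it to one replacement.
lazy = re.sub(r'(?s)^.*?' + re.escape(title), '', sample, count=1)

print(repr(greedy))
print(repr(lazy))
```

Which variant is appropriate depends on whether the title string can appear again later in the text.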
