
Extracting repeated patterns from a text file in Python

I want to extract all the text between each occurrence of a repeating pattern in a text file. My text file, XYZ.txt, looks something like this:

Start

This is a great day

End

Start
This is another great day

End

Start
This is 3rd great day
End

I want to extract all the text between every Start and End. My output should look like this:

This is a great day
This is another great day
This is 3rd great day

I also want to save each output as a separate HTML file. The code that I am using is as follows:

import re
with open('XYZ.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()

print(text)

But the code above only prints the first match. I am not sure how to print all the values between the patterns and save them as separate HTML files. I would really appreciate any pointers.

Thank you.

re.search stops at the first match. You need to use re.findall to find all occurrences of the regex.

>>> lines
'Start\n\nThis is a great day\n\nEnd\n\nStart\nThis is another great day\n\nEnd\n\nStart\nThis is 3rd great day\nEnd\n'
>>>
>>> re.findall('This is.*day', lines)
['This is a great day', 'This is another great day', 'This is 3rd great day']
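
The pattern in that session is tied to the wording of the sample sentences. Here is a minimal sketch of a more general version that captures whatever sits between each pair of markers. It assumes the markers are the words Start and End on lines of their own. The function name extract_blocks is made up for illustration.

```python
import re

def extract_blocks(content, start='Start', end='End'):
    """Return the stripped text found between each start/end marker pair."""
    # ^ and $ (re.MULTILINE) pin the markers to whole lines, so a content
    # line such as "Endless fun" is not mistaken for the end marker.
    # (.*?) is non-greedy and re.DOTALL lets '.' match newlines.
    pattern = (r'^' + re.escape(start) + r'[ \t]*\n(.*?)^'
               + re.escape(end) + r'[ \t]*$')
    matches = re.findall(pattern, content, re.DOTALL | re.MULTILINE)
    return [block.strip() for block in matches]

lines = ('Start\n\nThis is a great day\n\nEnd\n\n'
         'Start\nThis is another great day\n\nEnd\n\n'
         'Start\nThis is 3rd great day\nEnd\n')
print(extract_blocks(lines))
# ['This is a great day', 'This is another great day', 'This is 3rd great day']
```

Because the pattern has exactly one capture group, re.findall returns just the captured text for each match rather than the whole match including the markers.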

You could use string methods and a generator expression instead of re.

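def format_file(file, start, end):
    # Read the whole file; the with block closes it afterwards.
    with open(file, 'r') as fh:
        f = fh.read()
    # Drop the start markers and newlines, then split on the end marker.
    return tuple(x for x in ''.join(f.split(start)).replace('\n', '').split(end) if x != '')

print(format_file('XYZ.txt', 'Start', 'End'))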

Or a pure generator expression:

def format_file(file, start, end):
    with open(file, 'r') as fh:
        f = fh.readlines()
    # Keep every non-blank line that is not a start or end marker.
    return tuple(x.rstrip() for x in f if x != '\n' and not x.startswith(start) and not x.startswith(end))

print(format_file('XYZ.txt', 'Start', 'End'))

I would use the readlines() function and do something like this:

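with open('XYZ.txt') as myfile:
    for line in myfile.readlines():
        # Skip the marker lines and blank lines.
        if line.strip() != 'Start' and line.strip() != 'End' and line.strip():
            # rstrip('\n') is safe even if the last line has no trailing newline.
            print(line.rstrip('\n'))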

This will give the output:

This is a great day
This is another great day
This is 3rd great day

Furthermore, this generalizes to any kind of string between 'Start' and 'End'.
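
One caveat: a line-by-line filter prints every line independently, so a block that spans several lines comes out as unrelated lines. Here is a rough sketch that keeps each block together. It assumes the markers sit on lines of their own and that blank lines inside a block can be dropped. The name iter_blocks is hypothetical.

```python
def iter_blocks(lines, start='Start', end='End'):
    """Yield the text of each block found between a start and an end marker line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            block = []  # open a new block
        elif stripped == end:
            if block is not None:
                yield '\n'.join(block)
            block = None  # close the block
        elif block is not None and stripped:
            block.append(stripped)  # non-blank line inside a block

sample = ['Start\n', '\n', 'This is a great day\n', '\n', 'End\n',
          'Start\n', 'first line\n', 'second line\n', 'End\n']
print(list(iter_blocks(sample)))
# ['This is a great day', 'first line\nsecond line']
```

The function accepts any iterable of lines, so an open file object can be passed in place of the sample list.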

If your text file looks like the one in your post, then you may not need regex; you can use a list comprehension.

You can just store all the lines you want to extract in a list.

lst = []
with open('XYZ.txt', 'r') as myfile:
    for line in myfile:
        line = line.strip()
        lst.append(line)
# Drop blank lines as well as the 'Start' and 'End' markers.
lst2 = [i for i in lst if i and i != 'Start' and i != 'End']
print(lst2)

The output:

['This is a great day', 'This is another great day', 'This is 3rd great day']
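
None of the snippets here covers the second half of the question, saving each result to its own HTML file. A minimal sketch of that step might look like the following. The function name save_as_html, the numbered file names, and the bare HTML wrapper are all assumptions made for illustration, not something the question specifies.

```python
import html
from pathlib import Path

def save_as_html(texts, out_dir='.', prefix='output'):
    """Write each string to its own numbered HTML file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(texts, start=1):
        target = out_path / f'{prefix}_{number}.html'
        # html.escape keeps characters such as < and & from being read as markup.
        body = html.escape(text)
        target.write_text(f'<html><body><p>{body}</p></body></html>\n',
                          encoding='utf-8')
        written.append(target)
    return written
```

Passing a list such as the one printed above, for example save_as_html(lst2, out_dir='html_output'), would then create html_output/output_1.html, output_2.html and output_3.html.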
