Extracting a repeating pattern from a text file in Python

Question

I want to extract all the text between a repeating pattern in a text file. My text file XYZ.txt looks something like this:

Start

This is a great day

End

Start
This is another great day

End

Start
This is 3rd great day
End

I want to extract all the text between every Start and End. My output should look like this:

This is a great day
This is another great day
This is 3rd great day

I also want to save each output as a separate HTML file. The code that I am using is as follows:

import re
with open('XYZ.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()

print(text)

But the code above only prints the first block. I am not sure how to print all the values between the markers and save them as separate HTML files. I would really appreciate any pointers.

Thank you

Answer 1

You need to use re.findall to find all occurrences of the regex.

>>> lines
'Start\n\nThis is a great day\n\nEnd\n\nStart\nThis is another great day\n\nEnd\n\nStart\nThis is 3rd great day\nEnd\n'
>>>
>>> re.findall('This is.*day', lines)
['This is a great day', 'This is another great day', 'This is 3rd great day']
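
The pattern above is tied to the wording of the sample lines. A more general sketch, assuming the markers always appear as the literal words Start and End on lines of their own, captures whatever sits between them with a group. The helper name extract_blocks is made up for illustration and is not part of the original answer.

```python
import re

def extract_blocks(content):
    # ^Start$ and ^End$ match whole marker lines (re.MULTILINE).
    # re.DOTALL lets "." cross newlines, and ".*?" is non-greedy,
    # so each match stops at the nearest End line.
    blocks = re.findall(r'^Start$(.*?)^End$', content, re.DOTALL | re.MULTILINE)
    # Trim the newlines and blank lines around each captured block.
    return [block.strip() for block in blocks]

lines = ('Start\n\nThis is a great day\n\nEnd\n\n'
         'Start\nThis is another great day\n\nEnd\n\n'
         'Start\nThis is 3rd great day\nEnd\n')
print(extract_blocks(lines))
```

Because the group captures everything between the markers, this does not depend on the text starting with "This is" or ending with "day".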

Answer 2

You could use string manipulation and generators instead of re.

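def format_file(file, start, end):
    # Read the whole file, closing it when done
    with open(file, 'r') as fh:
        f = fh.read()
    # Drop the start markers and newlines, then split on the end marker
    return tuple(x for x in ''.join(f.split(start)).replace('\n', '').split(end) if x != '')

print(format_file('XYZ.txt', 'Start', 'End'))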

Or a pure generator expression:

def format_file(file, start, end):
    with open(file, 'r') as fh:
        f = fh.readlines()
    # Keep every line that is not blank and is not a start/end marker
    return tuple(x.rstrip() for x in f if x != '\n' and not x.startswith(start) and not x.startswith(end))

print(format_file('XYZ.txt', 'Start', 'End'))

Answer 3

I would use the readlines() method and do something like this:

with open('XYZ.txt') as myfile:
    for line in myfile.readlines():
        # Skip the markers and blank lines
        if line.strip() != 'Start' and line.strip() != 'End' and line.strip():
            print(line.rstrip('\n'))

This will give the output:

This is a great day
This is another great day
This is 3rd great day

Furthermore, this generalizes to any kind of string between 'Start' and 'End'.
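
Filtering line by line treats every non-marker line as its own result, so a block that spans several lines would not stay together. A minimal sketch of a variant that groups lines per Start/End pair follows; the generator name iter_blocks and its parameters are hypothetical, and it assumes each marker sits alone on its line.

```python
def iter_blocks(lines, start='Start', end='End'):
    """Yield the text found between each start/end marker pair."""
    current = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            current = []                  # open a new block
        elif stripped == end and current is not None:
            yield '\n'.join(current)      # close the block and emit it
            current = None
        elif current is not None and stripped:
            current.append(stripped)      # keep non-blank lines inside a block

sample = ['Start\n', '\n', 'This is a great day\n', '\n', 'End\n', '\n',
          'Start\n', 'first line\n', 'second line\n', 'End\n']
print(list(iter_blocks(sample)))
```

Any iterable of lines works, so an open file object can be passed directly in place of the sample list.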

Answer 4

If your text file looks like the one in your post, then you may not need a regex; you can use a list comprehension.

You can just store all the lines you want to extract in a list.

lst = []
with open('XYZ.txt', 'r') as myfile:
    for line in myfile:
        line = line.strip()
        lst.append(line)
# Drop the markers and the empty strings left by blank lines
lst2 = [i for i in lst if i and i != 'Start' and i != 'End']
print(lst2)

The output:

['This is a great day', 'This is another great day', 'This is 3rd great day']
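
The question also asks to save each extracted string as a separate HTML file, which the snippets in this thread do not cover. A minimal sketch, assuming a list of strings like the one above: the function name save_as_html, the output_N.html naming scheme, and the bare HTML wrapper are all illustrative choices, not something specified in the question.

```python
import html
from pathlib import Path

def save_as_html(texts, out_dir='.', prefix='output'):
    """Write each string to its own HTML file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(texts, start=1):
        target = out_path / f'{prefix}_{number}.html'
        # Escape &, < and > so the text cannot break the markup.
        body = html.escape(text)
        target.write_text(f'<html><body><p>{body}</p></body></html>\n',
                          encoding='utf-8')
        written.append(target)
    return written
```

Calling save_as_html(lst2) would then write output_1.html, output_2.html and output_3.html into the current directory.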

4 answers

Answer 1
0 2016-06-06 03:53:35

Answer 2
0 2016-06-06 03:54:32

Answer 3
0 Accepted 2016-06-06 03:55:57

Answer 4
0 2016-06-06 04:32:59
