Python: extracting paragraphs from a text file using regular expressions

Question

I am using Python 3.7 and I am trying to extract some paragraphs from some text files using regex.

Here is a sample of the txt file content.

AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

Currently I am able to extract the area, date and time using regex:

area_pattern = re.compile("^AREA:((.*))")
date_pattern = re.compile("^DATE:(.*),")
time_pattern = re.compile("TIME:(.*).")

I would like to extract the paragraph that comes after the DATE/TIME line and before the next AREA line; it contains locations separated by commas. So I would be able to match the following:

1.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

2.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

3.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

4.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

If anyone could suggest a regex for this use case, as well as improvements to my current regex, I would really appreciate it. Thanks

Answer 1

You may use this regex, which has a capture group, with re.findall:

\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)

RegEx Demo

RegEx Details:

\nDATE: : Match a line break followed by the text DATE:
.*\n* : Match the rest of the line, followed by 0 or more line breaks
((?:\n.*)+?) : Capture group 1: lazily match 1 or more lines, each preceded by a line break, until the next condition is satisfied
(?=\nAREA:|\Z) : Positive lookahead asserting that a line break followed by AREA:, or the end of input, is right ahead of the current position
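
As a rough illustration of how this pattern could be used from Python, here is a minimal sketch. The `text` variable is an abbreviated copy of the sample from the question, and the names `paragraph_pattern` and `paragraphs` are made up for this example. Note that group 1 also captures the line breaks around the paragraph, so the sketch strips them.

```python
import re

# Abbreviated sample in the same layout as the file shown in the question.
text = """AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

# The regex from this answer, written as a raw string.
paragraph_pattern = re.compile(r"\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)")

# With exactly one capture group, findall returns the group 1 text of each match.
# The capture includes leading/trailing line breaks, so strip them off.
paragraphs = [block.strip() for block in paragraph_pattern.findall(text)]

for number, paragraph in enumerate(paragraphs, start=1):
    print(f"{number}.")
    print(paragraph)
    print()
```

To run this against a real file, `text` could instead be read with something like `open(path, encoding="utf-8").read()`, where `path` is whatever file is being processed.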

Answer 2

As an alternative pattern:

^DATE:.*((?:\n(?!AREA:).*)+)

^DATE:.* : Match DATE: at the start of a line, then the rest of the line
( : Open capture group 1
(?:\n(?!AREA:).*)+ : Repeat 1 or more lines that do not start with AREA:
) : Close group 1
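
A minimal sketch of using this pattern from Python follows. It assumes the whole file is searched at once, so `re.MULTILINE` is passed to let `^` match at the start of every line. The sample `notice_text` is an abbreviated copy of the question's data, and the names `locations_pattern` and `location_blocks` are invented for the example. Collapsing the wrapped lines into one line is an optional extra step, not part of the regex itself.

```python
import re

# Abbreviated sample in the same layout as the file shown in the question.
notice_text = """AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.
"""

# re.MULTILINE makes ^ match at the start of each line, not only at the start of the string.
locations_pattern = re.compile(r"^DATE:.*((?:\n(?!AREA:).*)+)", re.MULTILINE)

# Group 1 includes the blank lines around the paragraph; split() + join() drops them
# and also merges the wrapped lines into a single line.
location_blocks = [
    " ".join(block.split()) for block in locations_pattern.findall(notice_text)
]

for block in location_blocks:
    print(block)
```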

Regex demo | Python demo
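
The question also asks about improving the existing AREA, DATE and TIME patterns. One possible tightening is sketched below; it is a suggestion under assumptions, not something either answer proposed. It uses raw strings, passes `re.MULTILINE` so the patterns can be run over the whole file, stops the date at the first comma, and drops the unescaped trailing `.` in the original `TIME:(.*).`, which matches any character and so cuts the last character off the captured time. The `header` sample and the variable names `area`, `date` and `time_range` are made up for the example.

```python
import re

# Suggested rewrites of the three patterns from the question (illustrative only).
area_pattern = re.compile(r"^AREA:[ \t]*(.+)$", re.MULTILINE)
date_pattern = re.compile(r"^DATE:[ \t]*([^,\n]+),", re.MULTILINE)
time_pattern = re.compile(r"TIME:[ \t]*(.+)$", re.MULTILINE)

# One header copied from the sample in the question.
header = """AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.
"""

# search() finds the first occurrence; group(1) is the captured value
# without the label or the spaces after the colon.
area = area_pattern.search(header).group(1)
date = date_pattern.search(header).group(1)
time_range = time_pattern.search(header).group(1)

print(area)
print(date)
print(time_range)
```

Using `[ \t]*` rather than `\s*` after the label keeps the match on the same line, since `\s` would also consume line breaks.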

Answer 1
3 | Accepted | 2021-03-31 13:53:49

Answer 2
2 | 2021-03-31 15:25:52
