Regex to split text based on month abbreviations and extract the following text?

I am working on a personal project and am stuck on extracting the text that follows month abbreviations.

A sample input text is of the form:

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

I expect output of the form:

[ ("apr25, 2016\nblah blah\npow\n"), ("may22, 2017\nasdf rtys\nqwer\n"), ("jan9, 2018\npoiu\nlkjhj yertt") ]

I tried a simple regex, but it does not produce the expected output:

import re

# Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*)|(may[\w\W]*)|(jan[\w\W]*)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt', '', '')]

# Non-Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*?)|(may[\w\W]*?)|(jan[\w\W]*?)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr', '', ''), ('', 'may', ''), ('', '', 'jan')]

Can you help me produce the desired output with a Python 3 regex?

Or do I need to write custom Python 3 code to produce the desired output?

The problem was that, after matching a month abbreviation, my regex did not stop before the next month abbreviation.

I referred to the question "Python RegEx Stop before a Word" and used the tempered greedy token solution mentioned there.

import re

REGEX_MONTHS_TEXT = re.compile(r'(apr|may|jan)((?:(?!apr|may|jan)[\w\W])+)')
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

arr = REGEX_MONTHS_TEXT.findall(text)
# arr = [ ('apr', '25, 2016\nblah blah\npow\n'),  ('may', '22, 2017\nasdf rtys\nqwer\n'),  ('jan', '9, 2018\npoiu\nlkjhj yertt')]

# The pairs in arr can be joined with a list comprehension to form
# a list of singleton tuples, as expected in the original question
output = [ (x + y,) for (x, y) in arr ]
# output = [('apr25, 2016\nblah blah\npow\n',), ('may22, 2017\nasdf rtys\nqwer\n',), ('jan9, 2018\npoiu\nlkjhj yertt',)]
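As an alternative to the tempered greedy token, the same chunks can be obtained by splitting at the zero-width position in front of each date header. The following is only a sketch under a few assumptions that are not in the question: a header is taken to be a month abbreviation followed by a digit at the start of a line, the helper name `split_by_month_headers` is made up for illustration, and splitting on an empty match requires Python 3.7 or later.

```python
import re

# Only the three months from the sample text; extend the list as needed.
MONTHS = r"(?:apr|may|jan)"

# Assumption: a date header starts a line and has a digit right after the month.
# The lookahead matches an empty string, so nothing is removed by the split.
HEADER_SPLIT = re.compile(rf"(?=^{MONTHS}\d)", flags=re.MULTILINE)


def split_by_month_headers(text):
    """Return the chunks of text that begin with a month header."""
    # re.split yields an empty first element when the text starts with a header,
    # so empty strings are filtered out.
    return [chunk for chunk in HEADER_SPLIT.split(text) if chunk]


text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
chunks = split_by_month_headers(text)
print(chunks)
```

Requiring a line start and a digit means that ordinary words such as "maybe" in the body text do not start a new chunk, which the plain `apr|may|jan` lookahead would not guarantee.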

Additional resource on the tempered greedy token: "Tempered Greedy Token - What is different about placing the dot before the negative lookahead"
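To make the point of that resource concrete, here is a small sketch comparing the two orderings on a shortened, made-up sample string (not the text from the question). It uses `[\w\W]` instead of the dot so that newlines are matched, as in the answer above.

```python
import re

sample = "apr1\nmay2"

# Lookahead first: each position is checked *before* a character is consumed,
# so the match runs right up to "may".
lookahead_first = re.search(r"apr((?:(?!may)[\w\W])+)", sample).group(1)

# Character first: each consumed character must not be *followed* by "may",
# so the newline directly in front of "may" is left out of the match.
char_first = re.search(r"apr((?:[\w\W](?!may))+)", sample).group(1)

print(repr(lookahead_first))  # '1\n'
print(repr(char_first))       # '1'
```

With the lookahead placed after the character, the last character before the stop word is dropped, which is why the answer puts the negative lookahead first.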
