How can I split a string using a regular expression, to the left of the matching string?
I have the following sample text:
Performed by:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
00123795,,"TEXT:
Name: Shiloh Age:12
[Lots of text spanning multiple lines with special characters, new lines and whitespaces]
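I am trying to split it to the left of the following:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
where all of the bold text is optional (not present in every instance).
I created the following regex, but it consumes the information I want to keep, and it does not necessarily account for all of the optional bold text.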
re.split(r"[0-9]+,[0-9]+?,\"TEXT", test)
I tried adding a lookahead using ?=:
re.split(r"?=([0-9]+,[0-9]+?,\"TEXT)", test)
But this does not seem to work. Any help is greatly appreciated!
Edit: the expected output is as follows:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
...
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
...
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
...
00123795,,"TEXT:
Name: Shiloh Age:12
...
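You can capture the (optional) lines up to and including the "TEXT: line, but wrap them in a capture group. re.split will then reproduce the content of that capture group as separate entries in the returned list of chunks. You can then pair up those chunks to get the final split: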
import re
regex = re.compile(r'(?m)^((?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n)')
s = """
Performed by:
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
00123795,,"TEXT:
Name: Shiloh Age:12
"""
it = iter(regex.split(s))
# Pair the "delimiter" chunks with the successor chunks:
result = [next(it)] + [match + next(it) for match in it]
print("----\n".join(result))
The output of this code is:
Performed by:
----
ID NUMBER:
XSOR-160491"
15632894,136259874,"TEXT:
Name: John Age:80
----
XSOR-160491"
78452156,784569851,"TEXT:
Name: Sally Age:31
----
ID NUMBER:
01236589,489456156878,"TEXT:
Name: Suraj Age:56
----
00123795,,"TEXT:
Name: Shiloh Age:12
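The same blocks can also be obtained without the pairing step, by slicing the text at the start offset of every match. This is an alternative sketch rather than the approach shown above; split_before is a hypothetical helper name, and the pattern is the same one without the outer capture group, since finditer does not need it.

```python
import re

# Same header pattern as above, minus the outer capture group.
HEADER = re.compile(r'(?m)^(?:ID NUMBER:\n)?(?:XSOR-\d+"\n)?\d+,\d*,"TEXT:\n')

def split_before(text, pattern=HEADER):
    """Cut `text` just before each match of `pattern` (hypothetical helper).

    Like re.split, this yields an empty first chunk if a match starts at offset 0.
    """
    starts = [m.start() for m in pattern.finditer(text)]
    bounds = [0] + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

sample = (
    'Performed by:\n'
    'ID NUMBER:\n'
    'XSOR-160491"\n'
    '15632894,136259874,"TEXT:\n'
    'Name: John Age:80\n'
    'XSOR-160491"\n'
    '78452156,784569851,"TEXT:\n'
    'Name: Sally Age:31\n'
)

for block in split_before(sample):
    print(block, end="----\n")
```

Because the slices are taken from the original string, joining them back together reproduces the input exactly.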
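This regex is quite strict, so if the lines that start a block vary more than they do in the sample, you will have to relax the regex accordingly.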
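As a rough illustration of what relaxing the pattern could look like, the sketch below tolerates trailing spaces or tabs and Windows-style line endings on the header lines. The names EOL and relaxed, and the choice of what to tolerate, are assumptions for illustration; the sample data in the question does not require them.

```python
import re

# Hypothetical relaxed variant: each header line may end with spaces/tabs
# and either "\n" or "\r\n". Adjust to the variation actually present.
EOL = r'[ \t]*\r?\n'
relaxed = re.compile(
    r'(?m)^((?:ID NUMBER:' + EOL + r')?'
    r'(?:XSOR-\d+"' + EOL + r')?'
    r'\d+,\d*,"TEXT:' + EOL + r')'
)

# Made-up input with trailing whitespace and one "\r\n" line ending.
sample = (
    'Performed by:\n'
    'ID NUMBER: \n'
    'XSOR-160491"\r\n'
    '15632894,136259874,"TEXT:\t\n'
    'Name: John Age:80\n'
    '00123795,,"TEXT:\n'
    'Name: Shiloh Age:12\n'
)

# Pair each captured header chunk with the chunk that follows it.
it = iter(relaxed.split(sample))
blocks = [next(it)] + [header + next(it) for header in it]
for block in blocks:
    print(repr(block))
```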
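(?m) is the multiline flag (it applies to the whole regex); it makes ^ and $ match at the start and end of each line rather than only at the start and end of the text.
^ requires the match to start at the beginning of a line.
(?: )? makes a section optional without creating a so-called capture group for it.
(?:ID NUMBER:\n)? allows this optional literal line.
(?:XSOR-\d+"\n)? allows this optional line, which contains some digits (\d+).
\d+,\d*,"TEXT:\n requires a line with two numbers, of which the second is optional.
( ) wraps the whole match and is the only capture group in the regex. re.split reproduces whatever these parentheses capture as separate chunks in the returned list.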
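The effect of a capture group on re.split can be seen on a tiny input. This is a minimal sketch; the string text and the digit pattern are made up for illustration.

```python
import re

text = "a1b2c"  # made-up input for illustration

# Without a capture group the delimiters are dropped.
without_group = re.split(r"\d", text)
print(without_group)    # ['a', 'b', 'c']

# With a capture group the delimiters are kept as separate entries.
with_group = re.split(r"(\d)", text)
print(with_group)       # ['a', '1', 'b', '2', 'c']

# A non-capturing group (?: ) behaves like no group at all.
non_capturing = re.split(r"(?:\d)", text)
print(non_capturing)    # ['a', 'b', 'c']
```

This is why the optional lines in the regex above use (?: ) while only the outermost parentheses capture: each split then yields exactly one delimiter entry.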