Separate only at first instance with multiple delimiters using regex
I have some strings in the following format:
lorem ipsum, dolor sit - amet, consectetur : adipiscing elit. Praesent vitae orc
I want to split it at the first instance of each separator only, to return
['lorem ipsum',
'dolor sit',
'amet, consectetur',
'adipiscing elit. Praesent vitae orc']
Right now my output is
['lorem ipsum',
'dolor sit',
'amet',
'consectetur ',
'adipiscing elit. Praesent vitae orc']
Right now I'm using re.split(', | - |: ', txt), but it separates at all the instances in the string. Any suggestions on how I can achieve the required output?
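For reference, the behaviour described above can be reproduced with the short sketch below. It also shows, as an aside, that the `maxsplit` argument of `re.split` does not help here: it caps the total number of splits, not the number of splits per separator. The variable names are illustrative only.

```python
import re

txt = "lorem ipsum, dolor sit - amet, consectetur : adipiscing elit. Praesent vitae orc"
pattern = r", | - |: "

# Splits at every occurrence of any of the three separators.
all_splits = re.split(pattern, txt)

# maxsplit limits the total number of splits, so the second ", " is
# consumed before ": " is ever reached.
capped_splits = re.split(pattern, txt, maxsplit=3)

print(all_splits)
print(capped_splits)
```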
Edit:
I realised my question isn't clear, so as an example, if the string is
"abc: def: ijk, lmno: pqr - stu, wx"
the output should be
["abc",
"def: ijk",
"lmno: pqr",
"stu, wx"]
and not
["abc",
"def",
"ijk",
"lmno",
"pqr",
"stu",
"wx"]
If all separators have to be present at least once, then instead of using split you could capture the 4 parts in groups and use negative lookaheads with backreferences, so that each separator matches 1 of the 3 options except those already matched.
^(.*?)(, | - |: )(.*?)(?!\2)(, | - |: )(.*?)(?!\2|\4)(, | - |: )(.*)
The pattern will match
- ^ Start of string
- (.*?) Group 1, match as few characters as possible
- (, | - |: ) Group 2, match any of the listed separators
- (.*?) Group 3, match as few characters as possible
- (?!\2) Negative lookahead, assert that what is on the right is not what was matched in group 2 (pick one of the 2 remaining valid options)
- (, | - |: ) Group 4, match any of the listed separators
- (.*?) Group 5, match as few characters as possible
- (?!\2|\4) Negative lookahead, assert that what is on the right is not what was matched in group 2 or group 4 (pick the only valid option left)
- (, | - |: ) Group 6, match any of the listed separators
- (.*) Group 7, match any character as many times as possible
For example
import re

regex = r"^(.*?)(, | - |: )(.*?)(?!\2)(, | - |: )(.*?)(?!\2|\4)(, | - |: )(.*)"
test_str = ("lorem ipsum, dolor sit - amet , consectetur : adipiscing elit. Praesent vitae orc\n\n"
            "abc: def: ijk, lmno: pqr - stu, wx\n\n")

# re.search returns only the first match, i.e. the first line of test_str
matches = re.search(regex, test_str, re.MULTILINE)
if matches:
    # Odd-numbered groups hold the text parts; even-numbered groups hold the separators
    print(matches.group(1))
    print(matches.group(3))
    print(matches.group(5))
    print(matches.group(7))
Output
lorem ipsum
dolor sit
amet , consectetur
adipiscing elit. Praesent vitae orc
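As a rough sketch of how this pattern could be applied to several strings, the helper below wraps it in a function that returns the four text parts, or `None` when the three separators are not all present. The function name `split_with_all_separators` and the third sample string are assumptions made for illustration, and the parts are stripped because a separator such as ": " may be preceded by a space.

```python
import re

PATTERN = re.compile(
    r"^(.*?)(, | - |: )(.*?)(?!\2)(, | - |: )(.*?)(?!\2|\4)(, | - |: )(.*)"
)

def split_with_all_separators(text):
    """Return the four text parts, or None if a separator is missing."""
    match = PATTERN.match(text)
    if match is None:
        return None
    # Groups 1, 3, 5 and 7 hold the text; groups 2, 4 and 6 hold the separators.
    return [match.group(i).strip() for i in (1, 3, 5, 7)]

samples = [
    "lorem ipsum, dolor sit - amet, consectetur : adipiscing elit. Praesent vitae orc",
    "abc: def: ijk, lmno: pqr - stu, wx",
    "abc: def: ijk",  # hypothetical input with only one kind of separator
]
results = [split_with_all_separators(s) for s in samples]
for result in results:
    print(result)
```

The last sample returns `None`, which illustrates the stated precondition: this approach only works when every separator occurs at least once.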
You could use a small class that counts the replacements:
import re

text = "lorem ipsum, dolor sit - amet, consectetur : adipiscing elit. Praesent vitae orc"
# text = "abc: def: ijk, lmno: pqr - stu, wx"

rx = re.compile(r'[-,:]')

class Replacer:
    def __init__(self, *args, **kwargs):
        # One counter attribute per separator character
        for key in args:
            setattr(self, key, 0)
        self.needle = kwargs.get("needle")

    def __call__(self, match):
        key = match.group(0)
        # Increment the counter for this separator
        setattr(self, key, getattr(self, key, 0) + 1)
        cnt = getattr(self, key, 0)
        # Only the first occurrence is replaced by the needle
        return self.needle if cnt == 1 else key

rpl = Replacer("-", ",", ":", needle="#@#")
# Mark the first occurrence of each separator, then split on the marker
result = [item.strip() for item in re.split("#@#", rx.sub(rpl, text))]
print(result)
Which yields
['lorem ipsum', 'dolor sit', 'amet, consectetur', 'adipiscing elit. Praesent vitae orc']
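A variation on the same counting idea, sketched with the standard library only: instead of substituting a marker and splitting on it, the matches can be walked with `finditer` while tracking which separators have already been seen. The helper name `split_at_first_occurrences` is hypothetical, and the sketch assumes the separators are exactly ", ", " - " and ": " as in the question.

```python
import re

# Assumption: the separators are the three used in the question.
SEPARATOR = re.compile(r", | - |: ")

def split_at_first_occurrences(text):
    """Split text at the first occurrence of each distinct separator."""
    seen = set()
    parts = []
    start = 0
    for match in SEPARATOR.finditer(text):
        sep = match.group(0)
        if sep in seen:
            continue  # later occurrences stay inside the current part
        seen.add(sep)
        parts.append(text[start:match.start()].strip())
        start = match.end()
    parts.append(text[start:].strip())
    return parts

first = split_at_first_occurrences(
    "lorem ipsum, dolor sit - amet, consectetur : adipiscing elit. Praesent vitae orc"
)
second = split_at_first_occurrences("abc: def: ijk, lmno: pqr - stu, wx")
print(first)
print(second)
```

This version needs no marker string, so it cannot collide with text that happens to contain the marker, and it does not require every separator to be present.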
Just food for thought, and I'm not sure it's a valued answer, but maybe you can use the third-party regex module instead of re to take advantage of negative lookbehinds with non-fixed width. For example:
\s*([,:-])(?<!\1.*\1)\s*
In Python:
import regex as re
string1 = "abc: def: ijk, lmno: pqr - stu, wx"
lst1 = re.sub(r'\s*([,:-])(?<!\1.*\1)\s*', '|' , string1).split('|')
print(lst1)
Result:
['abc', 'def: ijk', 'lmno: pqr', 'stu, wx']