Split a string using regex and include pattern
I need to split a string on the degree (MSc, BSc, ...) and keep the name with the title in column 0 and the address in column 1. Note that the country code at the end, BS, matches the title.
Please find some sample data below:
Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS Taylor Street, Duncan Town BS
I want to end up with the following:
Phillipp Shuster MSc | Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc | Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS | Taylor Street, Duncan Town BS
I tried this, but it puts the title in the address column (not correct).
splitted=re.split("\s(?=(?:msc|bsc|bs)[^$])",participants, flags=re.IGNORECASE)
Phillipp Shuster | MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager | BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters | BS Taylor Street, Duncan Town BS
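The attempt above splits at the whitespace before the title because the lookahead only checks what follows the space. As a minimal sketch (not part of the original question), a lookbehind can be used instead so that the split happens at the whitespace after the title. The pattern below is an assumption for illustration: Python lookbehinds must be fixed-width, so the three-letter titles and the two-letter BS get separate lookbehinds, and maxsplit=1 keeps the trailing country code BS from being touched.

```python
import re

participants = [
    "Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE",
    "Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE",
    "Nykee Peters BS Taylor Street, Duncan Town BS",
]

# Illustrative pattern: split on whitespace that is preceded by a whole-word
# title (MSc/BSc via the first lookbehind, BS via the second).
pattern = re.compile(r'(?:(?<=\b[mb]sc)|(?<=\bbs))\s+', flags=re.IGNORECASE)

# maxsplit=1 so only the first title in each line triggers a split
rows = [pattern.split(line, maxsplit=1) for line in participants]

for name, address in rows:
    print(name, '|', address)
```

This sketch assumes every line contains a title followed by whitespace; a line without one would come back as a single-element list and the unpacking in the loop would fail.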
Instead of splitting, I would suggest the re.subn approach:
import re
data = '''Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS Taylor Street, Duncan Town BS'''
pattern = re.compile(r'^.+? (msc|bsc|bs)', flags=re.I)
for line in data.split('\n'):
    result = pattern.subn(lambda m: '{:<20s} | '.format(m.group()), line, count=1)
    print(result[0])  # subn returns (new_string, number_of_substitutions)
The output:
Phillipp Shuster MSc |  Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc       |  Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS      |  Taylor Street, Duncan Town BS
Instead of split, you can use this simple regex with 2 capture groups in findall:
reg = r'(?i)^(.*\s[BM]Sc?)\s+(.+)$'
RegEx Description:
(?i) : Ignore case mode
^ : Start
(.*\s[BM]Sc?) : Match 0+ characters up to and including BSc, BS, MS or MSc in capture group 1
\s+ : Match 1+ whitespaces
(.+) : Match 1+ characters until the end in the 2nd capture group
$ : End
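The answer gives only the pattern, so here is a minimal sketch of how it might be applied with findall to the sample data. The use of re.M (so that ^ and $ anchor at every line of a multi-line string) and the variable names data and rows are assumptions, not part of the original answer.

```python
import re

data = """Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS Taylor Street, Duncan Town BS"""

reg = r'(?i)^(.*\s[BM]Sc?)\s+(.+)$'

# With two capture groups, findall returns one (group 1, group 2) tuple per match.
# re.M lets ^ and $ match at the start and end of each line.
rows = re.findall(reg, data, flags=re.M)

for name, address in rows:
    print(f"{name} | {address}")
```

Because .* is greedy, group 1 extends to the last title-like token that is still followed by whitespace and more text, which is why the trailing country code BS on the third line ends up in the address.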
My 2c using re.sub:
import re
x = """Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS Taylor Street, Duncan Town BS"""
for y in x.split("\n"):
    print(re.sub("^(.*?(?:MS|BS)c?)(.*)", r"\1 |\2", y, flags=re.DOTALL))
Output:
Phillipp Shuster MSc | Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc | Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS | Taylor Street, Duncan Town BS
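Since the question asks for two columns rather than a joined string, the same idea can be sketched as a small helper that returns a (name with title, address) pair. The function name split_participant, the word boundaries around the title, and the fallback for lines without a title are all assumptions added for illustration.

```python
import re

# Lazy prefix up to the first whole-word title, then the rest of the line
TITLE_RE = re.compile(r'(.*?\b(?:MSc|BSc|BS)\b)\s+(.*)', flags=re.IGNORECASE)

def split_participant(line):
    """Return (name_with_title, address); hypothetical helper."""
    m = TITLE_RE.match(line)
    if m is None:
        # No title found: keep the whole line in column 0
        return line, ""
    return m.group(1), m.group(2)

x = """Phillipp Shuster MSc Grolmanstraße 6 28195 Bremen Bahnhofsvorstadt DE
Eric Jager BSc Mohrenstrasse 29 72362 Nusplingen DE
Nykee Peters BS Taylor Street, Duncan Town BS"""

pairs = [split_participant(y) for y in x.split("\n")]
for name, address in pairs:
    print(name, "|", address)
```

The word boundaries are meant to keep a title from matching inside a longer word, for example the "BS" in a surname written in capitals.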