Splitting a string using regular expressions

Question

I've been tasked with tokenizing words from a corpus using regular expressions, but I'm having trouble tokenizing abbreviations such as "e.g." or "i.e.". In particular, the one that occurs in the corpus I'm looking at appears as '(N.B.--I'.

import re

string = '(N.B.--I'
pattern = r'(\w\.){2,}'
split_p = r'((\w\.){2,})'

match = re.search(pattern, string)
print(match)

split = re.split(split_p, string)
print(split)

['(', 'N.B.', '--', 'I'] is the desired output list, but when I run it I get:

<_sre.SRE_Match object; span=(1, 5), match='N.B.'>
['(', 'N.B.', 'B.', '--I']

I believe I can split on the dashes by adding |-+ to the pattern.

However, I can't understand why this B. is repeating.

Answer 1

re.split() includes the text of all capturing groups in its result. Your inner group (\w\.) is a capturing group too, and a repeated group keeps only its last match, which here is B. Use (?:...) to create a non-capturing group around the \w\. sub-pattern instead:

split_p = r'((?:\w\.){2,})'

Demo:

>>> import re
>>> split_p = r'((?:\w\.){2,})'
>>> string = '(N.B.--I'
>>> re.split(split_p, string)
['(', 'N.B.', '--I']
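
To see where the extra B. comes from, it may help to look at what each group captured. The following is a small sketch using the string and patterns from the question; the variable names (capturing, non_capturing, groups and so on) are only illustrative.

```python
import re

string = '(N.B.--I'

# Original pattern: the outer group is group 1, the inner group is group 2.
capturing = r'((\w\.){2,})'
# Fixed pattern: the inner group is non-capturing, so only group 1 exists.
non_capturing = r'((?:\w\.){2,})'

# A repeated capturing group keeps only its last iteration ('B.').
groups = re.search(capturing, string).groups()
print(groups)  # ('N.B.', 'B.')

# re.split() inserts every captured group between the split pieces.
with_inner = re.split(capturing, string)
without_inner = re.split(non_capturing, string)
print(with_inner)     # ['(', 'N.B.', 'B.', '--I']
print(without_inner)  # ['(', 'N.B.', '--I']
```

Group 2 holds B. because each repetition of (\w\.) overwrites the previous capture, and re.split() then emits both groups for the single match.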

Next, if you want to split on repeating dashes, just add an alternative pattern with |:

split_p = r'((?:\w\.){2,}|-+)'

Demo:

>>> split_p = r'((?:\w\.){2,}|-+)'
>>> re.split(split_p, string)
['(', 'N.B.', '', '--', 'I']

This gives an empty string in between because there are 0 characters between the N.B. split point and the -- split point; you'd have to filter those out again.
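
One way to do that filtering, as a minimal sketch, is a list comprehension that drops empty strings. The name tokens is just an illustrative choice.

```python
import re

string = '(N.B.--I'
split_p = r'((?:\w\.){2,}|-+)'

# Keep only the non-empty pieces; '' is falsy, so `if tok` drops it.
tokens = [tok for tok in re.split(split_p, string) if tok]
print(tokens)  # ['(', 'N.B.', '--', 'I']
```

This produces the list the question asked for. Note that it removes every empty string, including any that re.split() puts at the start or end of the result when the input begins or ends with a match.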
