Splitting a string using a regular expression

Question

I've been tasked with tokenizing words from a corpus using regular expressions, but I'm having trouble tokenizing abbreviations such as "e.g." or "i.e.". In particular, the one that occurs in the corpus I'm looking at appears as '(N.B.--I'.

import re

string = '(N.B.--I'
pattern = r'(\w\.){2,}'
split_p = r'((\w\.){2,})'

match = re.search(pattern, string)
print(match)

split = re.split(split_p, string)
print(split)

['(', 'N.B.', '--', 'I'] is the desired output list; however, when I run the code I get:

<_sre.SRE_Match object; span=(1, 5), match='N.B.'>
['(', 'N.B.', 'B.', '--I']

I believe I can split on the dashes by adding |-+ to the pattern.

However, I can't understand why the B. is repeated.

Answer 1

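re.split() includes the text of every capturing group in its result. Your pattern has two: the outer group, which captures 'N.B.', and the inner (\w\.) group, which keeps only its last repetition, 'B.'. Use (?:...) to make the group around the \w\. sub-pattern non-capturing instead: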

split_p = r'((?:\w\.){2,})'

Demo:

>>> import re
>>> split_p = r'((?:\w\.){2,})'
>>> string = '(N.B.--I'
>>> re.split(split_p, string)
['(', 'N.B.', '--I']
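
The stray 'B.' can also be traced by inspecting the match groups directly. The following is a minimal sketch that reuses the pattern and string from the question; the variable names (nested, fixed, groups, fixed_groups) are only illustrative. It shows that a repeated capturing group keeps just its last repetition, which is the extra item re.split() then returns.

```python
import re

string = '(N.B.--I'

# Original pattern: outer group 1 plus a repeated inner group 2.
nested = re.search(r'((\w\.){2,})', string)
groups = nested.groups()
print(groups)  # ('N.B.', 'B.')

# Non-capturing inner group: only the outer group remains.
fixed = re.search(r'((?:\w\.){2,})', string)
fixed_groups = fixed.groups()
print(fixed_groups)  # ('N.B.',)
```

Since re.split() emits one list item per capturing group for each match, dropping the inner group from the captures removes the duplicate.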

Next, if you want to split on repeating dashes, just add an alternative pattern with | :

split_p = r'((?:\w\.){2,}|-+)'

Demo:

>>> split_p = r'((?:\w\.){2,}|-+)'
>>> re.split(split_p, string)
['(', 'N.B.', '', '--', 'I']

This gives an empty string between 'N.B.' and '--' because there are 0 characters between the end of the N.B. match and the start of the -- match; you'd have to filter those empty strings out afterwards.
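
One way to do that filtering, sketched here with a list comprehension (the name tokens is just for illustration), is to keep only the non-empty pieces:

```python
import re

split_p = r'((?:\w\.){2,}|-+)'
string = '(N.B.--I'

# Drop the empty strings that re.split leaves between adjacent matches.
tokens = [tok for tok in re.split(split_p, string) if tok]
print(tokens)  # ['(', 'N.B.', '--', 'I']
```

The same filter also removes the empty string that re.split() produces when a match sits at the very start or end of the input.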

1 answer

solution1
0 ACCEPTED 2017-04-16 20:02:44
