
Splitting a string using a regular expression

I've been tasked with tokenizing words from a corpus using regular expressions, but I'm having trouble tokenizing abbreviations such as "e.g." or "i.e.". In particular, the one that occurs in the corpus I'm looking at appears as '(N.B.--I'.

import re

string = '(N.B.--I'
pattern = r'(\w\.){2,}'      # two or more "letter + dot" pairs
split_p = r'((\w\.){2,})'    # same pattern, wrapped in an outer capturing group

match = re.search(pattern, string)
print(match)

split = re.split(split_p, string)
print(split)

['(', 'N.B.', '--', 'I'] is the desired output list; however, when I run it I get:

<_sre.SRE_Match object; span=(1, 5), match='N.B.'>
['(', 'N.B.', 'B.', '--I']

I believe I can split the dashes with |-+

However, I can't understand why this B. is repeating.

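re.split() includes the text of every capturing group in its result. Your pattern has two: the outer group, which matches 'N.B.', and the inner (\w\.) group, which is repeated and so keeps only its last iteration, 'B.'. Use (?:...) to create a non-capturing group around the \w\. sub-pattern instead: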

split_p = r'((?:\w\.){2,})'

Demo:

>>> import re
>>> split_p = r'((?:\w\.){2,})'
>>> string = '(N.B.--I'
>>> re.split(split_p, string)
['(', 'N.B.', '--I']
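
To see where the stray 'B.' comes from, here is a small sketch that inspects both groups of the original pattern directly. The variable names (outer, inner, with_inner_group, without_inner_group) are only illustrative and are not part of the question:

```python
import re

string = '(N.B.--I'

# The original pattern has two capturing groups: the outer one wraps the
# whole abbreviation, the inner one is repeated by {2,}.
match = re.search(r'((\w\.){2,})', string)
outer = match.group(1)   # everything the outer group matched
inner = match.group(2)   # a repeated group keeps only its last iteration
print(outer, inner)

# re.split() inserts the text of every capturing group between the pieces,
# so the inner group's last iteration shows up as an extra item.
with_inner_group = re.split(r'((\w\.){2,})', string)
without_inner_group = re.split(r'((?:\w\.){2,})', string)
print(with_inner_group)
print(without_inner_group)
```

Making the inner group non-capturing leaves only one capturing group, so each delimiter appears once in the result.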

Next, if you want to split on repeating dashes, just add an alternative pattern with | :

split_p = r'((?:\w\.){2,}|-+)'

Demo:

>>> split_p = r'((?:\w\.){2,}|-+)'
>>> re.split(split_p, string)
['(', 'N.B.', '', '--', 'I']

This gives an empty string in between because there are 0 characters between the N.B. split point and the -- split point; you'd have to filter those empty strings out afterwards.
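
One way to do that filtering, as a minimal sketch: wrap the split in a list comprehension that drops empty strings. The function name tokenize is hypothetical, chosen here for illustration only:

```python
import re

def tokenize(text):
    """Split on abbreviations and dash runs, keeping them as tokens."""
    split_p = r'((?:\w\.){2,}|-+)'
    # re.split() yields '' wherever two delimiters are adjacent (or where
    # one sits at the very start or end), so drop the empty strings.
    return [tok for tok in re.split(split_p, text) if tok]

tokens = tokenize('(N.B.--I')
print(tokens)
```

For the sample string this should give ['(', 'N.B.', '--', 'I'], matching the desired output from the question.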
