Find and split on certain characters that follow a word

Question

I'm trying to use regular expressions to split text on punctuation, but only when the punctuation follows a word and precedes a space or the end of the string.

I've tried ([a-zA-Z])([,;.-])(\s|$)

But when I split with it in Python, the result includes the last character of the word as a separate item.

I want to split it like this:

text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']

Any help would be greatly appreciated!

Answer 1

It seems you want to tokenize the text. Try nltk:

http://text-processing.com/demo/tokenize/

from nltk.tokenize import TreebankWordTokenizer
text = 'Mr.Smith is a professor at Harvard, and is a great guy.'  # sentence from the question
splits = TreebankWordTokenizer().tokenize(text)

Answer 2

You may use

re.findall(r'\w+(?:\.\w+)*|[^\w\s]', s)

See the regex demo.

Details

\w+(?:\.\w+)* - one or more word characters, followed by zero or more occurrences of a dot followed by one or more word characters
| - or
[^\w\s] - any character other than a word character or a whitespace character.

Python demo:

import re
rx = r"\w+(?:\.\w+)*|[^\w\s]"
s = "Mr.Smith is a professor at Harvard, and is a great guy."
print(re.findall(rx, s))

Output: ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']

This approach can be made more precise. E.g., matching numbers and letter-only words as tokens, and treating underscores as punctuation:

re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', s)

See the regex demo.
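
The refined pattern is not exercised in the demo above, so here is a minimal sketch of how it might behave. The sample sentence and the names TOKEN_RX and tokenize are made up for illustration; only the pattern itself comes from this answer.

```python
import re

# Pattern from the answer: a signed/decimal number, OR a letter-only word
# (optionally joined to further letter-only words by dots), OR any single
# character that is neither a word character nor whitespace, OR an underscore.
TOKEN_RX = re.compile(r"[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_")

def tokenize(s):
    """Return all tokens found in s (hypothetical helper)."""
    return TOKEN_RX.findall(s)

# Illustrative input, not taken from the question.
sample = "Mr.Smith paid -3.5 for item_2, ok."
print(tokenize(sample))
# ['Mr.Smith', 'paid', '-3.5', 'for', 'item', '_', '2', ',', 'ok', '.']
```

Note that a sign is attached to the number that follows it, so with this pattern an expression like "3-2" would come out as '3' and '-2'. Whether that is desirable depends on the text being tokenized.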

Answer 3

You can first split on ([.,](?=\s)|\s) and then filter out empty or blank strings:

In [16]: list(filter(lambda s: not re.match(r'\s*$', s), re.split(r'([.,](?=\s)|\s)', 'Mr.Smith is a professor at Harvard, and is a great guy.')))
Out[16]: 
['Mr.Smith',
 'is',
 'a',
 'professor',
 'at',
 'Harvard',
 ',',
 'and',
 'is',
 'a',
 'great',
 'guy.']
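
In the output above the final period stays attached to 'guy.', because the lookahead (?=\s) cannot match at the end of the string. A possible variation, sketched here under the assumption that the punctuation set [,;.-] from the question is the one wanted, adds a lookbehind for a word character and lets the lookahead also accept the end of the string. The names SPLIT_RX and split_tokens are hypothetical.

```python
import re

# Split on runs of whitespace, OR on a punctuation mark that directly follows
# a word character and is followed by whitespace or the end of the string.
# Only the punctuation is captured, so re.split keeps it and drops whitespace.
SPLIT_RX = re.compile(r"\s+|(?<=\w)([,;.-])(?=\s|$)")

def split_tokens(text):
    """Split text into words and trailing punctuation (hypothetical helper)."""
    # re.split yields None for the group when the whitespace branch matched,
    # and '' between adjacent separators; both are falsy and filtered out.
    return [part for part in SPLIT_RX.split(text) if part]

text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
print(split_tokens(text))
# ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
```

The dot in 'Mr.Smith' is left alone because it is followed by a letter rather than by whitespace or the end of the string.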

3 answers

Answer 1
2 Accepted 2019-08-08 20:43:15

Answer 2
2 2019-08-08 20:43:16

Answer 3
1 2019-08-08 20:44:40
