Find and split on certain characters that follow words
I'm trying to use regular expressions to split text on punctuation, but only when the punctuation follows a word and precedes a space or the end of the string.
I've tried ([a-zA-Z])([,;.-])(\s|$)
But when I split with it in Python, the result includes the last character of the word.
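For reference, a minimal sketch of what this likely looks like, assuming the pattern is passed to re.split (the original call is not shown, so the variable names here are illustrative). Because the letter before the punctuation is matched inside a capturing group, it is consumed by the match and comes back as a separate list item, so the first piece ends in 'Harvar' and 'd' follows on its own:

```python
import re

text = 'Mr.Smith is a professor at Harvard, and is a great guy.'

# The letter before the punctuation sits in a capturing group, so it is
# consumed by the match and returned as its own item in the result.
parts = re.split(r'([a-zA-Z])([,;.-])(\s|$)', text)
print(parts)
```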
I want to split it like this:
text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
Any help would be greatly appreciated!
It seems you want to tokenize the text. Try nltk (online demo: http://text-processing.com/demo/tokenize/):
from nltk.tokenize import TreebankWordTokenizer
splits = TreebankWordTokenizer().tokenize(text)
You may use
re.findall(r'\w+(?:\.\w+)*|[^\w\s]', s)
See the regex demo.
Details
\w+(?:\.\w+)* - 1+ word chars followed by 0 or more occurrences of a dot followed by 1+ word chars
| - or
[^\w\s] - any char other than a word char or a whitespace char.
Python demo:
import re
rx = r"\w+(?:\.\w+)*|[^\w\s]"
s = "Mr.Smith is a professor at Harvard, and is a great guy."
print(re.findall(rx, s))
Output: ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
This approach can be made more precise, e.g. by matching letter-only words and numbers as separate token types and treating underscores as punctuation:
re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', s)
See the regex demo.
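To see how the refined pattern treats numbers and underscores, here is a small sketch that wraps it in a helper. The names TOKEN_RX, tokenize and sample are illustrative and not part of the answer above; only the pattern itself is taken from it.

```python
import re

# Alternatives, tried left to right at each position:
#   1. an optionally signed integer or decimal number
#   2. a run of letters, optionally joined to further letter runs by dots
#   3. any single char that is neither a word char nor whitespace
#   4. a literal underscore (a word char, so alternative 3 skips it)
TOKEN_RX = re.compile(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_')

def tokenize(s):
    """Return all non-overlapping tokens found in s, in order."""
    return TOKEN_RX.findall(s)

sample = "Mr.Smith paid -3.5 dollars for foo_bar."
tokens = tokenize(sample)
print(tokens)
# ['Mr.Smith', 'paid', '-3.5', 'dollars', 'for', 'foo', '_', 'bar', '.']
```

Here "-3.5" stays a single token, while "foo_bar" is split into three because the underscore is excluded from the letter class and matched on its own.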
You can first split on ([.,](?=\s)|\s) and then filter out empty or blank strings:
In [16]: filter(lambda s: not re.match(r'\s*$', s), re.split(r'([.,](?=\s)|\s)', 'Mr.Smith is a professor at Harvard, and is a great guy.'))
Out[16]:
['Mr.Smith',
'is',
'a',
'professor',
'at',
'Harvard',
',',
'and',
'is',
'a',
'great',
'guy.']
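Note that in the output above the final period stays attached ('guy.'), because the lookahead (?=\s) needs a whitespace character after the punctuation. The session also relies on filter returning a list, which is Python 2 behavior. A possible Python 3 variant, offered as a sketch rather than the answerer's own code, extends the lookahead to (?=\s|$) and filters with a list comprehension:

```python
import re

text = 'Mr.Smith is a professor at Harvard, and is a great guy.'

# Split on '.' or ',' only when followed by whitespace or end of string,
# or on a single whitespace char. The capturing group keeps the
# delimiters in the result so punctuation survives as tokens.
parts = re.split(r'([.,](?=\s|$)|\s)', text)

# Drop the empty strings and whitespace-only items left by the split.
tokens = [p for p in parts if p.strip()]
print(tokens)
# ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and',
#  'is', 'a', 'great', 'guy', '.']
```

The character class [.,] is kept from the answer; the question's pattern also listed ';' and '-', which could be added to the class if those should be split off too.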