Find and split on certain characters that follow words

I'm trying to use regular expressions to split text on punctuation, but only when the punctuation follows a word and precedes a space or the end of the string.

I've tried ([a-zA-Z])([,;.-])(\s|$)

But when I split with this pattern in Python, the match also consumes the last character of the word.
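To make the problem concrete, here is a minimal sketch on a shortened sample string. The second pattern is an assumption added for illustration, not something from the question: it replaces the capturing groups around the punctuation with a lookbehind and a lookahead, so the neighbouring letter and space are checked but not consumed.

```python
import re

sample = 'Harvard, and'

# The pattern from the question: the letter before the punctuation is
# captured, so it is part of the match and gets cut off the word.
consuming = re.split(r'([a-zA-Z])([,;.-])(\s|$)', sample)

# Assumed alternative: lookarounds check the surrounding characters
# without consuming them, so only the punctuation itself is matched.
lookaround = re.split(r'(?<=[a-zA-Z])([,;.-])(?=\s|$)', sample)

print(consuming)   # ['Harvar', 'd', ',', ' ', 'and']
print(lookaround)  # ['Harvard', ',', ' and']
```

The lookaround version still leaves the whitespace attached to the next piece, so on its own it does not produce the word-by-word list shown in the question.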

I want to split it like this:

text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']

Any help would be greatly appreciated!

It seems you want to tokenize the text. Try NLTK:

http://text-processing.com/demo/tokenize/

from nltk.tokenize import TreebankWordTokenizer
splits = TreebankWordTokenizer().tokenize(text)

You may use

re.findall(r'\w+(?:\.\w+)*|[^\w\s]', s)

See the regex demo.

Details

  • \w+(?:\.\w+)* - 1+ word chars followed by 0 or more occurrences of a dot followed by 1+ word chars
  • | - or
  • [^\w\s] - any char other than a word char or a whitespace char.

Python demo:

import re
rx = r"\w+(?:\.\w+)*|[^\w\s]"
s = "Mr.Smith is a professor at Harvard, and is a great guy."
print(re.findall(rx, s))

Output: ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']

This approach can be made more precise. E.g. tokenizing letter-only words and numbers separately, and treating underscores as punctuation:

re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', s)

See the regex demo.
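As a quick check of how the refined pattern behaves, here is a small sketch. The sample sentence and the variable names are made up for illustration; the pattern itself is the one given above.

```python
import re

# Alternatives, tried in order:
#   [+-]?\d*\.?\d+              optionally signed integer or decimal number
#   [^\W\d_]+(?:\.[^\W\d_]+)*   letter-only words, optionally joined by dots
#   [^\w\s]                     any single punctuation character
#   _                           an underscore, treated as punctuation
refined_rx = r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_'

sample = "Pi is 3.14, snake_case costs -5."
refined_tokens = re.findall(refined_rx, sample)
print(refined_tokens)
# ['Pi', 'is', '3.14', ',', 'snake', '_', 'case', 'costs', '-5', '.']
```

Because the number alternative comes first, a leading sign is kept with the digits, while the sentence-final dot still comes out as its own token.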

You can first split on ([.,](?=\s)|\s) and then filter out empty or blank strings:

In [16]: list(filter(lambda s: not re.match(r'\s*$', s), re.split(r'([.,](?=\s)|\s)', 'Mr.Smith is a professor at Harvard, and is a great guy.')))
Out[16]: 
['Mr.Smith',
 'is',
 'a',
 'professor',
 'at',
 'Harvard',
 ',',
 'and',
 'is',
 'a',
 'great',
 'guy.']
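Note that the last token in this output is still 'guy.', because the lookahead (?=\s) requires whitespace after the punctuation and there is none at the end of the string. A possible variant, sketched here as an assumption and not part of the original answer, extends the lookahead to (?=\s|$) and uses a list comprehension for the filtering (the helper name split_tokens is hypothetical):

```python
import re

def split_tokens(text):
    """Split on whitespace and on '.'/',' followed by whitespace or end of string."""
    # The capturing group makes re.split keep the delimiters in the result.
    parts = re.split(r'([.,](?=\s|$)|\s)', text)
    # Drop the empty strings and whitespace-only delimiters.
    return [p for p in parts if p.strip()]

tokens = split_tokens('Mr.Smith is a professor at Harvard, and is a great guy.')
print(tokens)
# ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
```

The dot inside 'Mr.Smith' is left alone because it is followed by a letter and not by whitespace or the end of the string.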
