How to split a sentence into words with some exceptions
I am working on a text classification project, and I need to split a sentence into words so I can calculate the probability of it being positive or negative. The problem is the word "not": whenever it appears, it turns a sentence that was supposed to be positive into a negative one, but my system still categorizes the sentence as positive, which is wrong.
My idea is to find a way to split the sentence into words with one exception: 'not' should stay attached to the word that follows it.
For example, given "she is not beautiful", instead of getting
"she", "is", "not", "beautiful"
I want to get
"she", "is", "not beautiful"
You can use re.split with a negative lookbehind for the word "not":
import re
mystr = "she is not beautiful"
# Raw string so that \s reaches the regex engine unchanged
re.split(r"(?<!not)\s", mystr)
# ['she', 'is', 'not beautiful']
The regular expression pattern is:
(?<!not): negative lookbehind for "not", so a whitespace character that directly follows "not" is not used as a split point
\s: any whitespace character
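One caveat with the pattern above: the lookbehind only checks the three characters before the whitespace, so it also suppresses the split after any word that merely ends in "not", such as "cannot" or "knot". A minimal sketch of one way around that is to add a word boundary inside the lookbehind. The helper name split_keep_not, the whitespace normalization, and the case-insensitive flag are assumptions added for illustration; they are not part of the original answer.

```python
import re

def split_keep_not(sentence):
    """Split on whitespace, but keep a standalone 'not' with the next word."""
    # Collapse runs of whitespace so every gap is exactly one space;
    # otherwise the lookbehind would only protect the first space after 'not'.
    normalized = " ".join(sentence.split())
    # \b requires 'not' to start at a word boundary, so 'cannot' and 'knot'
    # are split normally. IGNORECASE (an assumption) also covers 'Not'.
    return re.split(r"(?<!\bnot)\s", normalized, flags=re.IGNORECASE)

print(split_keep_not("she is not beautiful"))
print(split_keep_not("I cannot tie a knot today"))
```

With the original pattern, the second sentence would yield 'cannot tie' and 'knot today' as single items; with the word boundary they are split like any other words.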
You can also try the following:
1. Split the text by 'not'.
2. Take the first element of the resulting list, split it into words, and add those words to the list that will be returned.
3. For each of the other elements from step 1, split it into words and prepend 'not ' to the first word.
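import re

def my_seperator(text):
    # Split on the standalone word 'not' so words like 'nothing' stay intact
    parts = re.split(r'\bnot\b', text.strip())
    # Words before the first 'not' are kept as they are
    my_text = parts[0].split()
    for part in parts[1:]:
        temp_text = part.split()
        if temp_text:
            # Attach 'not' to the word that follows it
            my_text.append('not ' + temp_text[0])
            my_text = my_text + temp_text[1:]
        else:
            # 'not' had no following word (e.g. at the end of the text)
            my_text.append('not')
    return my_text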
>>> my_seperator('she is not beautiful . but not that she is ugly. Maybe she is not my type')
['she', 'is', 'not beautiful', '.', 'but', 'not that', 'she', 'is', 'ugly.', 'Maybe', 'she', 'is', 'not my', 'type']
That said, as @pault mentioned, a regular expression is the way to go.
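As a rough illustration of the regular-expression route, the same tokens can be collected in a single pass with re.findall instead of splitting. This is only a sketch: the names NOT_PHRASE and tokenize_with_not are hypothetical, and matching "Not" case-insensitively is an assumption rather than something either answer specifies.

```python
import re

# First alternative: a standalone 'not' plus the next non-space run.
# Second alternative: any other run of non-space characters.
NOT_PHRASE = re.compile(r"\bnot\s+\S+|\S+", flags=re.IGNORECASE)

def tokenize_with_not(text):
    """Return whitespace-separated tokens, keeping 'not' with the next word."""
    tokens = NOT_PHRASE.findall(text)
    # Normalize inner whitespace so 'not   good' becomes 'not good'.
    return [" ".join(token.split()) for token in tokens]

print(tokenize_with_not("she is not beautiful . but not that she is ugly."))
```

Because the first alternative requires whitespace right after "not", words such as "nothing" fall through to the second alternative and stay whole, and a trailing "not" with no following word is returned on its own.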