How to split sentence into words with some exceptions

I am working on a text classification project, and I need to split a sentence into words so that I can calculate the probability of it being positive or negative. The problem is the word "not": whenever it appears, it turns a sentence that was supposed to be positive into a negative one, but my system still classifies the sentence as positive, which is wrong.

My idea is to find a way to split the sentence into words, with an exception for "not".

For example, "she is not beautiful"

Instead of getting "she", "is", "not", "beautiful"

I want to get "she", "is", "not beautiful"

You can use re.split with a negative lookbehind for the word "not":

import re

mystr = "she is not beautiful"
# Split on whitespace unless it is immediately preceded by "not"
print(re.split(r"(?<!not)\s", mystr))
# ['she', 'is', 'not beautiful']

The regular expression pattern is:

  • (?<!not) : negative lookbehind; the match is rejected if it is immediately preceded by "not"
  • \s : any whitespace character
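One caveat: the lookbehind only checks the three characters before the whitespace, so a word that merely ends in "not" (such as "knot") is also glued to the next word. A minimal sketch of a stricter variant follows. The names NOT_AWARE_SPACE and split_words are made up for illustration, and the added word boundary and case-insensitive flag are assumptions about what the asker wants, not part of the original answer.

```python
import re

# Hypothetical variant of the pattern above: \b requires "not" to start at a
# word boundary, and IGNORECASE also catches "Not" at the start of a sentence.
NOT_AWARE_SPACE = re.compile(r"(?<!\bnot)\s", re.IGNORECASE)

def split_words(sentence):
    """Split on single whitespace characters, except right after the word 'not'."""
    return NOT_AWARE_SPACE.split(sentence)

print(split_words("she is not beautiful"))  # ['she', 'is', 'not beautiful']
print(split_words("tie a knot here"))       # ['tie', 'a', 'knot', 'here']
```

With the original pattern, the second call would return 'knot here' as a single token.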

You can also try the following:

  1. Split the text by 'not'.

  2. Take the first element of the resulting list, split it into words, and add them to the list that will be returned.

  3. For each remaining element from step 1, split it into words and prepend 'not' to the first word.

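import re

def my_seperator(text):
    # Split on the standalone word "not" (not on substrings such as "nothing")
    parts = re.split(r"\bnot\b", text.strip())
    # Words before the first "not" are kept as they are
    my_text = parts[0].split()
    for part in parts[1:]:
        words = part.split()
        if words:
            # Glue "not" onto the word that follows it
            my_text.append("not " + words[0])
            my_text.extend(words[1:])
        else:
            # "not" was the last word, or was directly followed by another "not"
            my_text.append("not")
    return my_text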

>>> my_seperator('she is not beautiful . but not that she is ugly. Maybe she is not my type')
['she', 'is', 'not beautiful', '.', 'but', 'not that', 'she', 'is', 'ugly.', 'Maybe', 'she', 'is', 'not my', 'type']

That said, as @pault mentioned, a regular expression is the better way to go.
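For comparison, the same result can be had by matching tokens instead of splitting on separators. This is only a sketch, not taken from either answer: the function name split_keep_not is hypothetical, and treating "Not" case-insensitively is an assumption.

```python
import re

# Try "not" plus the following word first; otherwise take any run of
# non-whitespace characters as an ordinary token.
TOKEN = re.compile(r"\bnot\s+\S+|\S+", re.IGNORECASE)

def split_keep_not(text):
    """Return the words of text, keeping 'not' attached to the word after it."""
    return TOKEN.findall(text)

print(split_keep_not("she is not beautiful"))  # ['she', 'is', 'not beautiful']
```

Because the first alternative needs whitespace right after "not", words such as "nothing" or "knot" are left alone, and a trailing "not" with nothing after it comes back as its own token.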
