Machine Learning Model Only Predicting Mode in Data Set

I am trying to do sentiment analysis on text. I have 909 phrases commonly used in emails, and I scored each one out of ten for how angry it sounds in isolation.

Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:

# Imports inferred from the names used in the code below
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

Now, I define both columns as 'phrases' and 'anger':

df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']

Subsequently, I split this data such that 20% is used for testing and 80% is used for training:

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

Now, I convert the words in x_train to numerical data using TfidfVectorizer:

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='en')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))

Now, I convert x_traincv to an array:

a = x_traincv.toarray()

I also convert x_testcv to a numerical array:

x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()

Now, I have:

mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
b=np.array(y_test)
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))

However, an example of the results I get is:

Bring it back [0]
It is greatly appreciatd when [0]
Apologies in advance [0]
Can you please [0]
See you then [0]
I hope this email finds you well. [0]
Thanks in advance [0]
I am sorry to inform [0]
You're absolutely right [0]
I am deeply regretful [0]
Shoot me through [0]
I'm looking forward to [0]
As I already stated [0]
Hello [0]
We expect all students [0]
If it's not too late [0]

and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all rows scored '0' from the .csv file, the new modal value (a 10) became the only prediction for my sentences.

Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach?

Two things. First, you are fitting the MultinomialNB on the test set. In your loop you have mnb.fit(x_testcv,y_test), but you should do mnb.fit(x_traincv,y_train).
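As a minimal sketch of that first point, the snippet below fits the model once on training features and then predicts all test rows in a single call, instead of refitting inside the loop. The tiny count matrices and labels are hypothetical stand-ins for the vectorized phrases and anger scores; only the variable names are borrowed from the question.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy features: two word-count columns,
# with labels 0 (calm) and 10 (angry).
x_traincv = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
y_train = np.array([0, 0, 10, 10])
x_testcv = np.array([[1, 0], [0, 1]])
y_test = np.array([0, 10])

mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)          # fit once, on the training data only
predictions = mnb.predict(x_testcv)  # predict every test row in one call

# Mean squared error between predicted and true scores
mse = np.mean((predictions - y_test) ** 2)
print(predictions, mse)
```

Note that MultinomialNB is a classifier, so each distinct score is treated as a separate class label rather than as a point on a numeric scale.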

Second, when pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method, so that both sets share the vocabulary learned from training.
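A short sketch of the second point, using made-up phrases rather than the real data: the vectorizer learns its vocabulary from the training phrases, and the test phrases are only transformed, so both matrices have the same columns. For comparison it also refits a second vectorizer on the test phrases, which yields a different feature space. The sketch assumes stop_words='english', since as far as I know that is the only built-in string value scikit-learn accepts for this parameter ('en' is not).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical phrases standing in for x_train and x_test
train_phrases = [
    "this is unacceptable",
    "absolutely outrageous behaviour",
    "greatly appreciated",
    "apologies in advance",
]
test_phrases = ["outrageous and unacceptable", "apologies appreciated"]

vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
X_train = vectorizer.fit_transform(train_phrases)  # learns the vocabulary
X_test = vectorizer.transform(test_phrases)        # reuses that vocabulary

# Refitting on the test phrases builds a separate, smaller vocabulary,
# so its columns no longer line up with the training features.
refit = TfidfVectorizer(analyzer="word", stop_words="english")
X_test_refit = refit.fit_transform(test_phrases)

print(X_train.shape, X_test.shape, X_test_refit.shape)
```

With matching columns, a model fitted on X_train can be applied to X_test directly.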
