How should I reformat my data for sklearn.naive_bayes.GaussianNB?

Question

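I have a dataset of users. Each user has a gender attribute, a color attribute (their favorite color), and so on. For each color, I counted the total number of users of one gender who like that color and put the results in a single list: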

features_train = [['indigo', 2341], ['yellow', 856], ['lavender', 690], ['yellowgreen', 1208], ['indigo', 565], ['yellow', 103], ['lavender', 571], ['yellowgreen', 234] ...]

In a second list, for each element of the first list, I record which gender that element represents:

labels_train = [0, 0, 0, 0, 1, 1, 1, 1, ...]

Now I have a third list with colors, features_test = ['yellow', 'red', ...], and I need to predict the gender.

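I have to use the naive_bayes.GaussianNB function from sklearn. I will have more properties for users, but I am using only color and gender to explain my problem. I found an official example, but I can't understand how I should reformat my datasets to work with it. Should I convert the colors to some numeric representation, such as [[0, 2341], [1, 856]], or should I use some other sklearn function to do that?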

import numpy as np
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
print(clf.predict(features_test))

Answer 1

To perform "machine learning" on text documents with scikit-learn, you first need to turn the text content into numerical feature vectors.

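The most intuitive way to do this is the bag-of-words representation. You can indeed solve this by reformatting your datasets the way you described.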

Assuming that both your "X" and "y" are 1-D, I suggest using the LabelEncoder from scikit-learn to convert your text classes into a set of numerical feature vectors.
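To make the encoding step concrete, here is a minimal sketch of what LabelEncoder does with a 1-D list of color names. The variable names (colours_train, encoded, unseen_failed) are illustrative and not from the question; the color values are the ones that appear in the question's training list.

```python
from sklearn.preprocessing import LabelEncoder

# Color names taken from the question's training data (1-D list)
colours_train = ['indigo', 'yellow', 'lavender', 'yellowgreen']

le = LabelEncoder()
le.fit(colours_train)

# classes_ holds the sorted unique values; a value's index is its integer code
print(le.classes_.tolist())   # ['indigo', 'lavender', 'yellow', 'yellowgreen']

encoded = le.transform(['yellow', 'indigo'])
print(encoded.tolist())       # [2, 0]

# inverse_transform maps integer codes back to the original strings
print(le.inverse_transform(encoded).tolist())

# A value that was not present during fit cannot be encoded
try:
    le.transform(['red'])
    unseen_failed = False
except ValueError:
    unseen_failed = True
print(unseen_failed)          # True
```

Note that a color such as 'red', which appears in the question's features_test but not in the training list, makes transform raise a ValueError, so unseen values have to be handled separately.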

See below:

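from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
# Use separate encoders so that fitting one does not overwrite the other
feature_le = preprocessing.LabelEncoder()
label_le = preprocessing.LabelEncoder()

# features_train and features_test are assumed to be 1-D lists of color names.
# Fit the feature encoder and return encoded features;
# transform() raises ValueError for values not seen during fit.
features_train_num = feature_le.fit_transform(features_train)
features_test_num  = feature_le.transform(features_test)

# Fit the label encoder and return encoded labels
labels_train_num   = label_le.fit_transform(labels_train)
# labels_test_num  = label_le.transform(labels_test)  # only if test labels exist

# GaussianNB expects 2-D input of shape (n_samples, n_features)
clf.fit(features_train_num.reshape(-1, 1), labels_train_num)
predicted = clf.predict(features_test_num.reshape(-1, 1))
print(label_le.inverse_transform(predicted))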
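The question's actual training rows are 2-D ([color, count]) while the test list contains only colors, including one ('red') that never appears in training. One possible way to reconcile this, sketched below, is to one-hot encode the color with OneHotEncoder(handle_unknown='ignore') and pass the per-row user counts to GaussianNB.fit as sample_weight. Treating the counts as sample weights is an assumption about what the data means, not something stated in the question or in the answer above, and the names colours_train, counts_train, X_train and X_test are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Data copied from the question: [color, number of users] per row
features_train = [['indigo', 2341], ['yellow', 856], ['lavender', 690], ['yellowgreen', 1208],
                  ['indigo', 565], ['yellow', 103], ['lavender', 571], ['yellowgreen', 234]]
labels_train = [0, 0, 0, 0, 1, 1, 1, 1]
features_test = ['yellow', 'red']

# Split each row into the color (the feature) and the count (assumed to be a weight)
colours_train = np.array([[row[0]] for row in features_train])  # shape (8, 1)
counts_train = np.array([row[1] for row in features_train])     # shape (8,)

# One-hot encode the colors; unseen colors become an all-zero row instead of an error
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(colours_train).toarray()
X_test = encoder.transform(np.array(features_test).reshape(-1, 1)).toarray()

# Each training row counts as many times as there are users behind it
clf = GaussianNB()
clf.fit(X_train, labels_train, sample_weight=counts_train)

predicted = clf.predict(X_test)
print(predicted)
```

One-hot encoding avoids implying an order between colors, which integer codes from LabelEncoder would do. For the unseen color 'red' the model only sees an all-zero vector, so its prediction should be read with caution.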

Solution 1
1 accepted 2017-06-05 10:13:15
