使用scikit-learn處理分類特征

Question

我在做什么？

我正在使用隨機森林解決分類問題。 我有一組固定長度（長10個字符）的字符串，它們代表DNA序列。 DNA字母由4個字母組成，即A ， C ， G ， T 。

這是我的原始數據的示例：

ATGCTACTGA
ACGTACTGAT
AGCTATTGTA
CGTGACTAGT
TGACTATGAT

每個DNA序列都帶有描述真實生物學反應的實驗數據； 該分子被認為引發了生物反應（1）或沒有（0）。

問題：

訓練集包括分類（標稱）特征和數字特征。 它具有以下結構：

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

我使用DictVectorizer類成功創建了分類器，以對名義特征進行編碼，但是在對測試數據進行預測時遇到了問題。

下面是到目前為止完成的代碼的簡化版本：

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()

clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)


# The following part fails.
test_set =   {
  'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
  'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
  'mass':370.2, 'temp':70.0}
vec = DictVectorizer()
test = vec.fit_transform(test_set).toarray()
print clf.predict_proba(test)

結果，我得到一個錯誤：

ValueError: Number of features of the model must  match the input. 
Model n_features is 20 and  input n_features is 12

Answer 1

您應該使用創建火車數據集的相同DictVectorizer對象來transform test_set ：

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()

clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)


# The following part fails.
test_set =   {
  'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
  'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
  'mass':370.2, 'temp':70.0}

test = vec.transform(test_set).toarray()
print clf.predict_proba(test)

使用scikit-learn處理分類特征

問題描述

1 個解決方案

解決方案1
3 已采納 2014-01-26 10:52:19

使用scikit-learn處理分類特征

問題描述

1 個解決方案

解決方案1 3 已采納 2014-01-26 10:52:19

解決方案1
3 已采納 2014-01-26 10:52:19