Handling categorical features using scikit-learn
What am I doing?
I am solving a classification problem using Random Forests. I have a set of strings of a fixed length (10 characters long) that represent DNA sequences. The DNA alphabet consists of 4 letters, namely A, C, G, T.
Here's a sample of my raw data:
ATGCTACTGA
ACGTACTGAT
AGCTATTGTA
CGTGACTAGT
TGACTATGAT
Each DNA sequence comes with experimental data describing a real biological response; the molecule was seen to elicit a biological response (1), or not (0).
Problem:
The training set consists of both categorical (nominal) and numerical features. It has the following structure:
training_set = [
{'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T',
'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A',
'mass':370.2, 'temp':70.0},
{'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A',
'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T',
'mass':400.3, 'temp':67.2},
]
target = [1, 0]
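For readers starting from the raw strings, the per-position dictionaries above could be built with a small helper. This is only a sketch: the function name `sequence_to_features` and its `mass` and `temp` arguments are hypothetical and do not appear in the original post.

```python
def sequence_to_features(seq, mass, temp):
    """Turn a DNA string into a dict with one nominal feature per position."""
    # Keys p1..pN mirror the naming used in training_set above.
    features = {'p%d' % (i + 1): base for i, base in enumerate(seq)}
    # Numerical measurements are stored unchanged.
    features['mass'] = mass
    features['temp'] = temp
    return features


row = sequence_to_features('ATGCTACTGA', 370.2, 70.0)
print(row['p1'], row['p2'], row['mass'])
```

Applied to the first raw sequence, this yields the same dictionary as the first entry of training_set.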
I can successfully create the classifier, using the DictVectorizer class to encode the nominal features, but I'm having problems when performing predictions on my testing data.
Below is a simplified version of my code so far:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
training_set = [
{'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T',
'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A',
'mass':370.2, 'temp':70.0},
{'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A',
'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T',
'mass':400.3, 'temp':67.2},
]
target = [1, 0]
vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()
clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)
# The following part fails.
test_set = {
'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T',
'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A',
'mass':370.2, 'temp':70.0}
vec = DictVectorizer()
test = vec.fit_transform(test_set).toarray()
print(clf.predict_proba(test))
As a result, I got an error:
ValueError: Number of features of the model must match the input.
Model n_features is 20 and input n_features is 12
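The two numbers in the error message can be reproduced by fitting one vectorizer on the training data and a separate one on the test sample, then counting the columns each one learned. This is a sketch for illustration; the variable names `train_vec` and `test_vec` are assumptions, and the test sample is wrapped in a list here.

```python
from sklearn.feature_extraction import DictVectorizer

training_set = [
    {'p1': 'A', 'p2': 'T', 'p3': 'G', 'p4': 'C', 'p5': 'T',
     'p6': 'A', 'p7': 'C', 'p8': 'T', 'p9': 'G', 'p10': 'A',
     'mass': 370.2, 'temp': 70.0},
    {'p1': 'A', 'p2': 'C', 'p3': 'G', 'p4': 'T', 'p5': 'A',
     'p6': 'C', 'p7': 'T', 'p8': 'G', 'p9': 'A', 'p10': 'T',
     'mass': 400.3, 'temp': 67.2},
]
test_set = [training_set[0]]  # same values as the test sample in the question

# Each vectorizer creates one column per (position, letter) pair it sees,
# plus one column per numerical feature.
train_vec = DictVectorizer().fit(training_set)
test_vec = DictVectorizer().fit(test_set)

n_train = len(train_vec.feature_names_)
n_test = len(test_vec.feature_names_)
print(n_train, n_test)  # 20 12
```

The training data contains two different letters at eight of the ten positions, so it produces more one-hot columns than a single test sample can.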
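You should use the same DictVectorizer object that created the training dataset to transform test_set. A new vectorizer fitted on test_set alone learns only the 12 features present there, not the 20 the model was trained on, so call transform rather than fit_transform: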
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
training_set = [
{'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T',
'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A',
'mass':370.2, 'temp':70.0},
{'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A',
'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T',
'mass':400.3, 'temp':67.2},
]
target = [1, 0]
vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()
clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)
# Reuse the vectorizer fitted on the training set instead of creating a new one.
test_set = {
'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T',
'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A',
'mass':370.2, 'temp':70.0}
test = vec.transform(test_set).toarray()
print(clf.predict_proba(test))
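One way to make it harder to refit the vectorizer by accident is to bundle it with the classifier in a scikit-learn Pipeline, so that fitting and prediction always go through the same fitted DictVectorizer. This is a sketch of an alternative, not part of the original answer; the name `model`, `n_estimators=100`, and `random_state=0` are assumptions chosen for a quick, repeatable run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

training_set = [
    {'p1': 'A', 'p2': 'T', 'p3': 'G', 'p4': 'C', 'p5': 'T',
     'p6': 'A', 'p7': 'C', 'p8': 'T', 'p9': 'G', 'p10': 'A',
     'mass': 370.2, 'temp': 70.0},
    {'p1': 'A', 'p2': 'C', 'p3': 'G', 'p4': 'T', 'p5': 'A',
     'p6': 'C', 'p7': 'T', 'p8': 'G', 'p9': 'A', 'p10': 'T',
     'mass': 400.3, 'temp': 67.2},
]
target = [1, 0]

# The pipeline fits the vectorizer once, on the training data,
# and reuses that fitted vectorizer for every later prediction.
model = make_pipeline(
    DictVectorizer(sparse=False),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(training_set, target)

test_set = [training_set[0]]  # a list containing one sample dict
proba = model.predict_proba(test_set)
print(proba.shape)  # (1, 2): one sample, two classes
```

With this arrangement there is no separate `vec` object to recreate, so the feature columns seen at prediction time always match those seen during training.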