I am trying to build a tree classifier with the scikit-learn package, but I am having trouble getting the input into the format the classifier expects.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
#import dataset
data = pd.read_table('Data/Breast.csv')
data.head(10)
X=data[['clump_thickness','shape_uniformity','marginal_adhesion','epithelial_size','bare_nucleoli','bland_chromatin','normal_nucleoli','mitoses']]
X_train = X.values
Y = data[['class']]
Y_train = Y.values
model = DecisionTreeClassifier()
model
model.fit(X_train,Y_train)
But I get the following error message:
ValueError Traceback (most recent call last)
<ipython-input-215-ffa49499a3bf> in <module>()
----> 1 model.fit(X_train,Y_train)
c:\users\tobias\appdata\local\programs\python\python36\lib\site-packages\sklearn\tree\tree.py
in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
788 sample_weight=sample_weight,
789 check_input=check_input,
--> 790 X_idx_sorted=X_idx_sorted)
791 return self
792
c:\users\tobias\appdata\local\programs\python\python36\lib\site-packages\sklearn\tree\tree.py
in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
114 random_state = check_random_state(self.random_state)
115 if check_input:
--> 116 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
117 y = check_array(y, ensure_2d=False, dtype=None)
118 if issparse(X):
c:\users\tobias\appdata\local\programs\python\python36\lib\site-packages\sklearn\utils\validation.py
in check_array(array, accept_sparse, dtype, order, copy,
force_all_finite, ensure_2d, allow_nd, ensure_min_samples,
ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: could not convert string to float: '?'
What am I doing wrong? I can see that X.values has dtype object.
Try this to make sure you are passing integers to the classifier. If the cast fails because your data set contains strings or categorical values, or if another issue shows up, I'll edit this answer with the solution:
cols = ['clump_thickness','shape_uniformity','marginal_adhesion','epithelial_size','bare_nucleoli','bland_chromatin','normal_nucleoli','mitoses']
for col in cols:
    data[col] = data[col].astype('int')
X_train = data[cols]
Y_train = data[['class']]
model = DecisionTreeClassifier()
model.fit(X_train,Y_train)
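The '?' in the error message suggests that at least one column uses '?' as a placeholder for missing values, in which case astype('int') would raise on the same value. Here is a minimal sketch of one way to handle that, assuming the placeholder is the only non-numeric content. The small DataFrame is a hypothetical stand-in for the CSV, and putting the '?' in bare_nucleoli is an assumption made for illustration. Dropping the affected rows is only one option; imputing a value is another.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the CSV: one column holds a '?' placeholder,
# so pandas stores that column as strings (dtype object).
data = pd.DataFrame({
    'clump_thickness': [5, 5, 3, 6, 4, 8],
    'bare_nucleoli': ['1', '10', '?', '4', '1', '10'],
    'class': [2, 2, 2, 2, 2, 4],
})
cols = ['clump_thickness', 'bare_nucleoli']

# Convert every feature column to numbers; anything that cannot be
# parsed (such as '?') becomes NaN instead of raising an error.
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce')

# Drop the rows that now have a missing feature value.
clean = data.dropna(subset=cols)

X_train = clean[cols].values
Y_train = clean['class'].values

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, Y_train)
```

On the real data, data[cols].isna().sum() after the conversion shows how many values per column were not numeric, which helps decide between dropping rows and imputing.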