Why am I getting the same predicted value for every sample?

I am trying to build a decision tree using scikit-learn, but I keep getting the same value as the prediction for every sample.

le = preprocessing.LabelEncoder()

def labelEncoder(df, col_name):
    df[[col_name]] = le.fit_transform(df[[col_name]])

labelEncoder(dfr, "Gender")
labelEncoder(dfr, "Subscription Tenure Type")
labelEncoder(dfr, "Located Region")
labelEncoder(dfr, "Attrition")
labelEncoder(dfr, "Type of subscription")
labelEncoder(dfr, "Genre")

# Splitting the data into test and train
feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
               "Located Region", "Average Hours of watching(Weekly)", "Attrition",
               "Web channle utilization", "Mobile Channel Utilization"]]
labels = dfr[["Genre"]]

clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                  max_depth=3, min_samples_leaf=9, min_samples_split=2, splitter='random')

clf_gini.fit(feature_train, labels_train)
y_pred = clf_gini.predict(feature_test)

print(list(y_pred))

The following is some sample data.

| User Id | Genre | Rating | Gender | Age | Subscription year | Subscription Tenure Type | Type of subscription | Located Region | Average Hours of watching(Weekly) | Attrition | Web channle utilization | Mobile Channel Utilization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Romance | 4 | Female | 51 | 2000 | Annual | Individual | R3 | 7 | Yes | 89 | 11 |
| 2 | Action | 4.769230769 | Female | 42 | 2004 | 6 Months | Individual | R6 | 13 | No | 88 | 12 |
| 2 | Adventure | 4.909090909 | Female | 42 | 2004 | 6 Months | Individual | R6 | 13 | No | 88 | 12 |
| 2 | Comedy | 4.2 | Female | 42 | 2004 | 6 Months | Individual | R6 | 13 | No | 88 | 12 |
| 2 | Crime | 5 | Female | 42 | 2004 | 6 Months | Individual | R6 | 13 | No | 88 | 12 |
| 2 | Drama | 4.2 | Female | 42 | 2004 | 6 Months | Individual | R6 | 13 | No | 88 | 12 |

There are a few issues with the code snippet you provided:

  • You are predicting with svm instead of clf_gini.
  • The code that actually splits the dataset into train and test sets is missing.
  • Did you apply the same transformation to both the train and test sets?

If this does not answer your question, could you please provide some more details?
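The second and third points above can be sketched roughly as follows. This is only an illustration under assumptions: the toy values are made up (only the column names come from the question), and the use of train_test_split, OrdinalEncoder and a Pipeline is one possible way to keep the encoding identical for both sets, not necessarily what the original code did.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names mirror the question, values are invented.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Located Region": ["R3", "R6", "R6", "R3", "R1", "R1", "R3", "R6"],
    "Age": [51, 42, 38, 27, 33, 45, 29, 60],
    "Genre": ["Romance", "Action", "Romance", "Action",
              "Romance", "Action", "Romance", "Action"],
})

categorical = ["Gender", "Located Region"]
X = df[categorical + ["Age"]]
y = df["Genre"]

# Explicit train/test split (the step missing from the question's snippet).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The encoder is fitted on the training rows only; the pipeline then reuses
# that same fitted mapping when predicting on the test rows.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      categorical)],
    remainder="passthrough",  # numeric columns such as Age pass through unchanged
)
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=100)),
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(list(y_pred))
```

Wrapping the encoder and the tree in one pipeline means the test set can never be encoded with a different mapping than the training set, which is the concern behind the third point.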

The following example code works:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

arr = [[1, 'Romance', 4, 'Female', 51, 2000, 'Annual', 'Individual', 'R3', 7, 'Yes', 89, 11],
       [2, 'Action', 4.7, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12],
       [2, 'Adventure', 4.9, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12],
       [2, 'Comedy', 4.2, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12],
       [2, 'Crime', 5, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12],
       [2, 'Drama', 4.2, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12]]

headers = ['User Id', 'Genre', 'Rating', 'Gender', 'Age', 'Subscription year',
           'Subscription Tenure Type', 'Type of subscription', 'Located Region',
           'Average Hours of watching(Weekly)', 'Attrition',
           'Web channle utilization', 'Mobile Channel Utilization']

dfr = pd.DataFrame(arr, columns=headers)

# Encode one categorical column as integers (a fresh encoder per column,
# applied to a 1-D Series as LabelEncoder expects)
def labelEncoder(df, col_name):
    df[col_name] = LabelEncoder().fit_transform(df[col_name])

labelEncoder(dfr, "Gender")
labelEncoder(dfr, "Subscription Tenure Type")
labelEncoder(dfr, "Located Region")
labelEncoder(dfr, "Attrition")
labelEncoder(dfr, "Type of subscription")
labelEncoder(dfr, "Genre")

feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
               "Located Region", "Average Hours of watching(Weekly)", "Attrition",
               "Web channle utilization", "Mobile Channel Utilization"]]
labels = dfr["Genre"]

# Note: a node is only split if it holds at least 2 * min_samples_leaf rows,
# so with min_samples_leaf=9 this tiny training set yields a single leaf.
clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                  max_depth=3, min_samples_leaf=9, min_samples_split=2, splitter='random')

# Create the train / test split: the last row is held out for testing.
# iloc[-1:] (not iloc[-1]) keeps the test set two-dimensional, as predict() requires.
feature_train, feature_test = feature.iloc[:-1], feature.iloc[-1:]
labels_train, labels_test = labels.iloc[:-1], labels.iloc[-1:]

clf_gini.fit(feature_train, labels_train)
y_pred = clf_gini.predict(feature_test)

print(list(y_pred))
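If every prediction is still identical after the points above are addressed, it may be worth checking whether the tree was able to split at all. The sketch below is only a diagnostic idea, not something the original poster confirmed: it uses made-up one-feature data to show that min_samples_leaf=9 (the value from the question) prevents any split when fewer than 18 training rows are available, which leaves a single leaf that always predicts one class.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up, perfectly separable data: 12 rows, one feature, two classes.
X = np.arange(12).reshape(-1, 1)
y = (X[:, 0] >= 6).astype(int)

# Same min_samples_leaf as in the question: each side of a split would need
# at least 9 rows, which is impossible with only 12 rows in total.
constrained = DecisionTreeClassifier(min_samples_leaf=9, random_state=100).fit(X, y)

# Default leaf size for comparison.
relaxed = DecisionTreeClassifier(min_samples_leaf=1, random_state=100).fit(X, y)

for name, clf in [("min_samples_leaf=9", constrained), ("min_samples_leaf=1", relaxed)]:
    pred = clf.predict(X)
    print(name, "| leaves:", clf.get_n_leaves(), "| distinct predictions:", np.unique(pred))
```

The same two checks, get_n_leaves() and np.unique(y_pred), can be run on the real classifier. Another thing to look at is the data itself: in the sample rows, user 2 has identical feature values for five different genres, so those features alone cannot distinguish between them.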
