
Decision tree algorithm on feature set

I'm trying to predict the number of updates ('sys_mod_count') based on the text description ('eng').

I have predefined 'sys_mod_count' into two classes: 1 if >= 17, and 0 if < 17.

But I want to remove this condition, as this value is not available at decision time in the real world.

I'm thinking of using a decision tree / random forest method to train a classifier on the feature set.


import pandas as pd
from sklearn import model_selection, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer


def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the classifier on the training dataset
    classifier.fit(feature_vector_train, label)
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    return predictions


df_3 = pd.read_csv('processedData.csv', sep=";")
st_new = df_3[['sys_mod_count', 'eng', 'ger']].copy()
st_new['updates_binary'] = st_new['sys_mod_count'].apply(lambda x: 1 if x >= 17 else 0)
st_org = st_new[['eng', 'updates_binary']]
st_org = st_org.dropna(axis=0, subset=['eng'])  # drop rows with a missing 'eng' description

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    st_org['eng'], st_org['updates_binary'],
    stratify=st_org['updates_binary'], test_size=0.20)

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(st_org['eng'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# Naive Bayes on word-level TF-IDF vectors
predictions = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", metrics.accuracy_score(valid_y, predictions))


This seems to be a threshold-setting problem - you would like to set a threshold at which a certain classification is made. No supervised classifier can set that threshold for you: without training data that already carries binary class labels you cannot train the classifier, and to create that training data you need to pick a threshold in the first place. It's a chicken-and-egg problem.

If you have some way of identifying which binary label is correct, then you can vary the threshold and measure the error, similar to what is suggested here. Then you can either run a classifier on the binary labels derived from that threshold, or run a regressor on sys_mod_count and convert its predictions to binary using the identified threshold.
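
As a rough illustration of the regressor route, here is a minimal sketch, assuming the st_new data frame and the TF-IDF variables from the question's code; the RandomForestRegressor choice and the threshold of 17 are placeholders, not a recommendation:

from sklearn.ensemble import RandomForestRegressor

# Regress on the raw update count instead of a pre-binarised label.
train_counts = st_new.loc[train_x.index, 'sys_mod_count']
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(xtrain_tfidf, train_counts)

# Convert the continuous predictions to binary with the chosen threshold.
threshold = 17  # placeholder; in practice, pick it by sweeping candidate values and measuring error
pred_binary = (reg.predict(xvalid_tfidf) >= threshold).astype(int)
print("Regressor + threshold accuracy:", metrics.accuracy_score(valid_y, pred_binary))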

The above approach does not work if you have no way to identify what the correct binary label should be. In that case, the problem you are trying to solve is to create a boundary between points based on the value of your sys_mod_count variable. This is unsupervised learning, so techniques like clustering will be helpful here. You can cluster your data into two clusters based on the distance of the points from each other, and then label each cluster; that cluster label becomes your binary label.
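
A minimal sketch of that clustering idea, assuming the st_new data frame from the question and that sys_mod_count has no missing values; KMeans with two clusters is just one possible choice:

import numpy as np
from sklearn.cluster import KMeans

# Cluster the raw update counts into two groups; the cluster id becomes the binary label.
counts = st_new['sys_mod_count'].to_numpy().reshape(-1, 1)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(counts)

# Make label 1 correspond to the cluster with the higher mean count.
high_cluster = int(np.argmax(kmeans.cluster_centers_.ravel()))
st_new['cluster_label'] = (kmeans.labels_ == high_cluster).astype(int)

# The midpoint between the two centroids plays the role of the hand-picked threshold of 17.
print("Implied threshold:", kmeans.cluster_centers_.ravel().mean())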
