
Data Types and Machine Learning Algorithms in Sklearn

Original post: 2016-01-14 20:32:29 7 1 python/ pandas/ machine-learning/ scikit-learn

Does anyone know if the data type of a variable plays a (negative) role when running a machine learning algorithm in scikit-learn?

Here's a little background that may influence responses to this question: I have a 299-variable dataset where the output variable is a dummy variable. This will be a classification problem and I would like to try different options like logistic regression and tree-based models. When I imported my dataset with pandas, I noticed that some of the variables were assigned a data type of int64 when, in fact, they are categorical variables. Is this going to be a problem for the machine learning algorithm? Please forgive me if this is a silly question... I am still relatively new to the machine learning world, and while I have not seen anything in the literature on this topic, I did want to make sure I don't go off track before I even start.

1 Answer

Yes, it will be a problem in scikit-learn, because scikit-learn does not support categorical features directly. It will end up treating those integer values as a numeric feature, and will not behave as you might hope. It does support re-encoding them in a numeric form (see here), however that is sub-optimal compared to using a library and algorithms that naturally support both numeric and categorical features.
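
As a rough illustration of the re-encoding mentioned above, here is a minimal sketch that one-hot encodes an integer-coded categorical column so that the model does not read its codes as ordered quantities. The toy DataFrame and the column names `region` and `income` are made up for this example and are not from the asker's dataset. Using `OneHotEncoder` is one possible way to do the re-encoding, not necessarily the one the answer linked to.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "region" holds category codes that pandas
# reads as plain integers, "income" is a genuinely numeric feature.
df = pd.DataFrame({
    "region": [1, 2, 3, 2],
    "income": [40.0, 55.5, 61.2, 48.9],
})

# One-hot encode the integer-coded column so no ordering or distance
# between the codes is implied.
encoder = OneHotEncoder(handle_unknown="ignore")
region_onehot = encoder.fit_transform(df[["region"]]).toarray()

# Build one indicator column per distinct code found during fitting.
onehot_df = pd.DataFrame(
    region_onehot,
    columns=[f"region_{code}" for code in encoder.categories_[0]],
    index=df.index,
)

# Final feature matrix: the numeric column plus the indicator columns.
X = pd.concat([df[["income"]], onehot_df], axis=1)
print(X)
```

The resulting `X` can then be passed to an estimator such as logistic regression or a tree-based model. With many high-cardinality columns the number of indicator columns can grow quickly, which is part of why the answer calls this approach sub-optimal.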