Feature Selection from a Mixed Dataset

Posted 2021-05-19 19:30:51 · Tags: python, feature-selection

I am new to data science.

I have a data set that contains both numerical and string data. Interestingly, both types of data are relevant to the outcome. How do I choose the relevant features from the data set?

Should I use LabelEncoder to convert the string data to numerical values and then continue with correlation analysis? Am I on the right path? Is there a better way to solve this problem?

2 Answers

It's kind of a cop-out, but you could simply use a random forest and happily mix numerical and categorical data. Encoding with LabelEncoder or OneHotEncoder would allow you to use a wider variety of algorithms.
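As a minimal sketch of the random forest route, assuming scikit-learn and pandas are available: the toy data, the column names (`age`, `city`, `bought`) and the choice of `OrdinalEncoder` are illustrative assumptions, not part of the answer. scikit-learn's forests expect numeric input, so the string column is mapped to integer codes first; the fitted forest's importances then give a rough ranking of both kinds of feature.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one numeric column, one string column, one target.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 41, 38],
    "city": ["paris", "rome", "paris", "oslo", "rome", "oslo", "paris", "rome"],
    "bought": [0, 1, 0, 1, 1, 1, 1, 0],
})
X, y = df[["age", "city"]], df["bought"]

# Map the string column to integer codes; the numeric column passes through.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["city"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X, y)

# ColumnTransformer emits the listed transformer's columns first, then the remainder.
feature_names = ["city", "age"]
importances = pd.Series(
    model.named_steps["rf"].feature_importances_, index=feature_names
)
print(importances.sort_values(ascending=False))
```

With only eight rows the ranking itself means little; the point is that numeric and encoded categorical columns end up on the same importance scale.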

You can encode categorical variables with label encoding if the available values have a meaningful ordering, making sure that ordering is retained in the encoding. See here for an example.
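One way to keep the ordering, sketched here under assumptions: the `size` column and its order are hypothetical, and `OrdinalEncoder` with an explicit `categories` list is used instead of `LabelEncoder`, because `LabelEncoder` assigns codes in sorted (alphabetical) order, which would put `large` before `small`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category: small < medium < large.
df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})
size_order = ["small", "medium", "large"]

# Passing the order explicitly makes the integer codes follow it.
encoder = OrdinalEncoder(categories=[size_order])
df["size_code"] = encoder.fit_transform(df[["size"]]).ravel()
print(df)
```

Here `small`, `medium` and `large` map to 0, 1 and 2, so comparisons on `size_code` agree with the intended order.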

If there's no ordering (or working out a meaningful one is too much work), you can use one-hot encoding. This, however, will enlarge the feature set in proportion to the number of distinct values of the feature in the dataset.
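A small sketch of that growth, using `pandas.get_dummies` on a hypothetical `city` column (the data and column names are assumptions for illustration): one string column with three distinct values becomes three indicator columns.

```python
import pandas as pd

# Hypothetical data: "city" has three distinct values and no natural order.
df = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "city": ["paris", "rome", "paris", "oslo"],
})

# Each distinct value of "city" becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["city"])
n_distinct = df["city"].nunique()
print(encoded.columns.tolist())
print(f"{n_distinct} distinct values -> {encoded.shape[1] - 1} indicator columns")
```

A column with thousands of distinct values would add thousands of columns in the same way, which is where one-hot encoding starts to hurt.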

If one-hot results in a very large feature set and the categorical string data are natural language words, you may want to use a pretrained embedding.

Either way, you can then concatenate the encoded categorical column(s) to the continuous feature set and proceed with learning and feature selection.
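The concatenate-then-select step could look roughly like this. The data is hypothetical, and `SelectKBest` with the ANOVA F-score (`f_classif`) is just one possible selection method chosen for the sketch; the answer does not name a specific one.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical mixed data: two continuous columns, one string column, one target.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 41, 38],
    "income": [30, 52, 61, 75, 40, 80, 58, 49],
    "city": ["paris", "rome", "paris", "oslo", "rome", "oslo", "paris", "rome"],
    "bought": [0, 1, 0, 1, 1, 1, 1, 0],
})
y = df["bought"]

# Encode the categorical column, then concatenate it to the continuous features.
continuous = df[["age", "income"]]
encoded_city = pd.get_dummies(df["city"], prefix="city").astype(float)
X = pd.concat([continuous, encoded_city], axis=1)

# Keep the k highest-scoring columns (k=3 is an arbitrary choice here).
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

Note that selection operates on individual encoded columns, so a one-hot encoded feature can be kept only in part, for example one city indicator but not the others.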