
Feature Selection from a Mixed Dataset

I am a newbie in the data science domain.

I have a data set that contains both numerical and string data. Interestingly, both types of data are relevant to the outcome. How do I choose the relevant features from the data set?

Should I use LabelEncoder to convert the string data to numerical and then continue with correlation analysis? Am I on the right path? Is there a better way to solve this problem?

Kind of a cop-out, but you could simply use a random forest and happily mix numerical and categorical data. Encoding with LabelEncoder or OneHotEncoder would allow you to use a wider variety of algorithms.

You can encode categorical variables with label encoding if the available values have a meaningful ordering, as long as you make sure that ordering is retained in the encoding. See here for an example.
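As a minimal sketch of order-preserving encoding, assuming scikit-learn and pandas are available: the `size` column and its values are made up for illustration. Note that scikit-learn's LabelEncoder sorts classes alphabetically and is documented for target labels, so this sketch swaps in OrdinalEncoder with an explicit category list to keep the intended order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a natural order: small < medium < large.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Listing the categories explicitly fixes the mapping
# small -> 0, medium -> 1, large -> 2.
# Left to itself, the encoder would sort them alphabetically
# (large, medium, small) and the intended order would be lost.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

The encoded column can then be treated like any other numeric feature, for example in a correlation analysis.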

If there is no ordering (or working out a meaningful one is too much work), you can use one-hot encoding. This, however, will increase the size of the feature set in proportion to the number of distinct values of the feature in the dataset.
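A small sketch of one-hot encoding, assuming pandas; the `color` and `price` columns are invented for illustration. It also shows the growth in feature count: one column with three distinct values becomes three columns.

```python
import pandas as pd

# Hypothetical data: one unordered categorical column, one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# Each distinct value of "color" becomes its own indicator column;
# the numeric column is left untouched.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())
print(encoded.shape)  # 1 numeric column + 3 indicator columns
```

With a high-cardinality column (say, thousands of distinct strings) the same call would add thousands of columns, which is the cost described above.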

If one-hot encoding results in a very large feature set and the categorical string data are natural language words, you may want to use a pretrained embedding.

Either way, you can then concatenate the encoded categorical column(s) to the continuous feature set and proceed with learning and feature selection.
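The concatenation step could look roughly like the sketch below, assuming pandas and scikit-learn. The toy data, the column names, and the choice of a random forest's `feature_importances_` as the selection criterion are all assumptions for illustration, not the only way to do it. scikit-learn's random forest expects numeric input, so the string column is one-hot encoded first.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two numeric features, one string feature, one target.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "income": [30, 45, 80, 90, 60, 35, 100, 70],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "bought": [0, 0, 1, 1, 1, 0, 1, 1],
})

# Encode the categorical column, then concatenate it to the numeric columns.
numeric = df[["age", "income"]]
categorical = pd.get_dummies(df[["city"]], dtype=float)
X = pd.concat([numeric, categorical], axis=1)
y = df["bought"]

# Fit a random forest and rank the features by impurity-based importance.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Features with very low importance are candidates for removal. On a real data set it is worth checking the ranking on held-out data, since impurity-based importances can favor features with many distinct values.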
