
Convert text columns into numbers in sklearn

I'm new to data analytics. I'm trying some models in Python with sklearn. I have a dataset in which some of the columns contain text values, like the one below:

[Image: Dataset]

Is there a way to convert these column values into numbers in pandas or sklearn? Is assigning numbers to these values the right approach? And what if a new string shows up in the test data?

Please advise.

Consider using label encoding. It transforms categorical data by assigning each category an integer between 0 and num_of_categories - 1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# apply() calls fit_transform on each column in turn, so every column gets its own 0..n-1 codes
encoded_series = df.apply(le.fit_transform)

encoded_series:

   letter
0       0
1       1
2       2
3       3
4       0
5       2
6       0
7       3
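On the question of new strings in the test data: to my knowledge, a fitted LabelEncoder has no option for unknown values and raises a ValueError when transform sees a string that was absent during fit. A minimal sketch, with made-up training and test values (train_values and test_values are my own names, not from the question):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only
train_values = ['a', 'b', 'c', 'd']
test_values = ['a', 'e']  # 'e' never appears in the training data

le = LabelEncoder()
le.fit(train_values)

# classes_ holds the learned categories; the position of each one is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

unseen_error = None
try:
    le.transform(test_values)
except ValueError as exc:
    # The encoder cannot map 'e' to any known class, so it raises
    unseen_error = exc
    print("transform failed:", exc)
```

If unseen strings are expected, this is a reason to fit the encoder on the training data only and decide up front how unknown values should be handled.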

You can convert them into integer codes by using the categorical datatype.

column = column.astype('category')
column_encoded = column.cat.codes
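One caveat with this approach is that calling astype('category') separately on the training and test columns can give the same string different codes. A possible way around it, sketched here with made-up columns (train_col and test_col are hypothetical), is to reuse the training categories when encoding the test column. Values outside those categories then get the code -1:

```python
import pandas as pd

# Hypothetical columns, for illustration only
train_col = pd.Series(['red', 'green', 'blue', 'green'])
test_col = pd.Series(['green', 'purple'])  # 'purple' is not in the training data

# Learn the categories from the training column
train_cat = train_col.astype('category')
train_codes = train_cat.cat.codes

# Reuse the training categories so the same string always gets the same code;
# values outside those categories become missing, which pandas encodes as -1
test_cat = pd.Categorical(test_col, categories=train_cat.cat.categories)
test_codes = pd.Series(test_cat.codes, index=test_col.index)

print(train_codes.tolist())  # [2, 1, 0, 1]
print(test_codes.tolist())   # [1, -1]
```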

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split the categories apart again.
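To illustrate the point about trees, here is a small sketch on synthetic data of my own (not from the question). The target depends on which category a row belongs to, not on the numeric order of the codes, and the trees can still isolate each code with threshold splits. The max_depth=3 setting is an assumption that happens to be enough for four categories:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: four category codes, with a target unrelated to code order
codes = np.tile([0, 1, 2, 3], 25)
X = codes.reshape(-1, 1)
y = codes % 2  # categories 1 and 3 are positive, 0 and 2 are negative

model = GradientBoostingClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Each code ends up in its own region of the tree splits
per_code_prediction = model.predict(np.array([[0], [1], [2], [3]])).tolist()
train_accuracy = model.score(X, y)
print(per_code_prediction, train_accuracy)
```

A linear model given the same integer codes would treat them as an ordered quantity, which is why this reasoning is specific to tree-based models.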

I think it would be better to use OrdinalEncoder if you want to transform feature columns, because it is meant for categorical features (LabelEncoder is meant for labels). It can also handle values not seen in training, and it encodes multiple features at the same time. An example:

from sklearn.preprocessing import OrdinalEncoder

features = ["city", "age", ...]
# Fit on the training data only; categories not seen during fit are encoded as -1
encoder = OrdinalEncoder(
    handle_unknown='use_encoded_value',
    unknown_value=-1
).fit(train[features])
train[features] = encoder.transform(train[features])
test[features] = encoder.transform(test[features])
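For a version that runs end to end, here is a minimal sketch with two tiny made-up DataFrames. The column names, values, and the train_encoded/test_encoded variables are hypothetical. It assumes scikit-learn 0.24 or later, where handle_unknown='use_encoded_value' is available:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data, for illustration only
train = pd.DataFrame({"city": ["Paris", "Lima", "Oslo"], "size": ["S", "M", "L"]})
test = pd.DataFrame({"city": ["Lima", "Cairo"], "size": ["M", "S"]})  # "Cairo" is unseen

features = ["city", "size"]
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[features])  # categories are learned from the training data only

train_encoded = encoder.transform(train[features])
test_encoded = encoder.transform(test[features])

print(encoder.categories_)  # learned categories per column, in sorted order
print(test_encoded)         # "Cairo" is encoded as -1
```

The encoder returns a float array by default, so the codes appear as 0.0, 1.0 and so on.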

More in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
