
Factorize columns in two different DataFrames

I have two DataFrames, and each of them has a categorical column col. I want to replace all the categories with numbers, so I decided to do it this way:

df1['col'] = pd.factorize(df1['col'])[0]

Now the question is: how can I encode df2['col'] in the same way? And how can I also encode categories that are present in df2['col'] but not in df1['col']?

You need a LabelEncoder:

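from sklearn.preprocessing import LabelEncoder

# Fit the encoder on df1, then reuse the same fitted encoder for df2.
enc = LabelEncoder()
df1['col'] = enc.fit_transform(df1['col'])
# transform raises a ValueError if df2['col'] contains labels not seen in df1['col'].
df2['col'] = enc.transform(df2['col'])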
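
For readers who want to stay with pd.factorize from the question, here is a pandas-only sketch of the same idea: factorize df1['col'] once, keep the unique values it returns, and look df2['col'] up in them. The sample data is made up for illustration. Index.get_indexer returns -1 for values that do not occur in df1['col'], so unseen categories are flagged rather than raising an error.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df1 = pd.DataFrame({'col': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'col': ['c', 'a', 'd']})

# factorize returns integer codes plus the unique values,
# numbered in order of first appearance.
codes1, uniques = pd.factorize(df1['col'])

# Look up each df2 value in the same uniques; values absent from df1 get -1.
codes2 = uniques.get_indexer(df2['col'])

df1['col'] = codes1
df2['col'] = codes2
```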

For unseen labels, this may be a solution:

enc = LabelEncoder()
enc.fit(df1['col'])
# Shift the fitted codes up by 1 so that 0 stays free for unseen labels.
diz_map = dict(zip(enc.classes_, enc.transform(enc.classes_)+1))

# Labels that occur in df2 but not in df1 all map to 0.
for i in set(df2['col']).difference(df1['col']):
    diz_map[i] = 0

df1['col'] = [diz_map[i] for i in df1['col'].values]
df2['col'] = [diz_map[i] for i in df2['col'].values]

This maps all the unseen values in df2['col'] to 0, while the labels known from df1['col'] get codes starting at 1.
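
If each unseen category should get its own code instead of sharing 0, one option is to factorize both columns together. This is a minimal sketch with made-up data, not part of the answer above. Because pd.factorize numbers values in order of first appearance and df1 comes first in the concatenation, df1 receives the same codes it would get on its own, and values that occur only in df2 continue the numbering.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df1 = pd.DataFrame({'col': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'col': ['c', 'a', 'd', 'e']})

# Stack both columns (df1 first) and factorize them in one pass.
combined = pd.concat([df1['col'], df2['col']], ignore_index=True)
codes, uniques = pd.factorize(combined)

# Split the codes back into the two frames by position.
n1 = len(df1)
df1['col'] = codes[:n1]
df2['col'] = codes[n1:]
```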
