Python pandas: labeling categorical values based on legend dataframe
I have a big dataset (2m rows, 70 variables) with many categorical variables. All categorical variables are coded as numbers (e.g., see df1):
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all the explanations, looking like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me from having to look up the meaning of each value in df2 every time. I found some solutions for replacing values by hand, but I am looking for an automatic way of doing this.
Thank you
You could use a dictionary generated from df2, like this:
First, generate some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)
df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']
If you want to replace a single variable, do something like this:
genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var']==variable][['value','label']].values)
    df1[variable+'_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows this takes about 1 second, so it should be reasonably fast.
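One thing to watch with the dictionary lookup above: `varDict[x]` raises a `KeyError` if df1 contains a code that has no entry in df2. A minimal sketch of a more forgiving variant follows, assuming you would rather keep the original code (as a string) when no label exists. The tiny frames and the unlabeled code 3 are made up for illustration.

```python
import pandas as pd

# Illustrative data: code 3 deliberately has no label in df2
df1 = pd.DataFrame({"obs": [1, 2, 3], "gender": [1, 2, 3]})
df2 = pd.DataFrame({
    "var": ["gender", "gender"],
    "value": [1, 2],
    "label": ["male", "female"],
})

# Build the {code: label} dictionary for one variable
subset = df2.loc[df2["var"] == "gender"]
gender_dict = dict(zip(subset["value"], subset["label"]))

# Series.map with a dict gives NaN for codes missing from the dict
# instead of raising KeyError
mapped = df1["gender"].map(gender_dict)

# Fall back to the original code (as a string) where no label was found
df1["gender_name"] = mapped.fillna(df1["gender"].astype(str))
```

If you would rather see the gaps explicitly, skip the `fillna` step and check `mapped.isna()` instead.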
Create a mapper dictionary from df2 using groupby:
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
Now map the values in df1, using each outer key of the dictionary as the column name and its inner dictionary as the mapper:
for col in df1.columns:
    if col in d:
        df1[col] = df1[col].map(d[col])
You get:
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
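As a variation on the same idea, the nested dictionary can also be built by iterating over the groups directly, and the labeled columns can be stored with the `category` dtype, which may help with memory on a frame with millions of rows. This is only a sketch: the small frames mirror the question's example, and `labeled` is a name introduced here for illustration.

```python
import pandas as pd

# Small frames mirroring the example in the question
df1 = pd.DataFrame({
    "obs": [1, 2, 3, 4],
    "gender": [1, 1, 2, 1],
    "job": [1, 2, 2, 1],
})
df2 = pd.DataFrame({
    "Var": ["gender", "gender", "job", "job"],
    "Value": [1, 2, 1, 2],
    "Label": ["male", "female", "blue collar", "white collar"],
})

# One {code: label} dict per variable, built by looping over the groups
d = {var: dict(zip(g["Value"], g["Label"])) for var, g in df2.groupby("Var")}

# Work on a copy so the numeric codes in df1 stay available
labeled = df1.copy()
for col, mapper in d.items():
    if col in labeled.columns:
        # Map codes to labels, then store the result as a categorical column
        labeled[col] = labeled[col].map(mapper).astype("category")
```

Looping over `d.items()` rather than `df1.columns` only touches the variables that actually have a legend in df2.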