Python pandas: labeling categorical values based on legend dataframe

I have a big dataset (2m rows, 70 variables) with many categorical variables. All categorical variables are coded as numbers (e.g. see df1).

df1:
   obs  gender  job
    1     1       1
    2     1       2
    3     2       2
    4     1       1

I have another data frame with all the explanations, which looks like this:

df2:
Var:     Value:   Label:
gender     1      male
gender     2      female
job        1      blue collar
job        2      white collar

Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me from always having to look up the meaning of a value in df2. I found some solutions that replace values by hand, but I am looking for an automatic way of doing this.

Thank you

You could use a dictionary generated from df2, like this:

First, generate some dummy data:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)

df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']

If you want to replace a single variable, something like this works:

genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])

And if you'd like to replace a bunch of variables:

colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var']==variable][['value','label']].values)
    df1[variable+'_name'] = df1[variable].apply(lambda x: varDict[x])

For a million rows it takes about 1 second, so it should be reasonably fast.
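As a variation on the loop above, `Series.map` accepts a dictionary directly and returns NaN for any code without an entry, whereas `varDict[x]` inside `apply` raises a `KeyError`. The following is only a minimal sketch, assuming the same `var`/`value`/`label` column names as the dummy data; the gender code 3 is a made-up unmapped value added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'obs': [1, 2, 3, 4],
    'gender': [1, 1, 2, 3],   # 3 is a hypothetical code with no label in df2
    'job': [1, 2, 2, 1],
})
df2 = pd.DataFrame({
    'var': ['gender', 'gender', 'job', 'job'],
    'value': [1, 2, 1, 2],
    'label': ['male', 'female', 'blue collar', 'white collar'],
})

# Loop over the variables described in df2 rather than over df1's columns,
# so columns without a legend (such as 'obs') are skipped automatically.
for variable in df2['var'].unique():
    if variable not in df1.columns:
        continue
    rows = df2[df2['var'] == variable]
    var_dict = dict(zip(rows['value'], rows['label']))
    # map() looks each code up in var_dict; codes without a label become NaN
    df1[variable + '_name'] = df1[variable].map(var_dict)

print(df1)
```

Whether NaN or an exception is preferable for unknown codes depends on how clean the legend is; NaN makes gaps easy to count afterwards with `isna().sum()`.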

Create a mapper dictionary from df2 using groupby:

d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()

{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}

Now map the values in df1, using each outer key of the dictionary as the column name and the inner dictionary as the mapper:

for col in df1.columns:
    if col in d.keys():
        df1[col] = df1[col].map(d[col])

You get:

    obs gender  job
0   1   male    blue collar
1   2   male    white collar
2   3   female  white collar
3   4   male    blue collar
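For reference, the two steps can be put together into one self-contained script on the question's four-row example. This is a sketch rather than the answer's exact code: it builds the same nested dictionary with a dict comprehension over the groups instead of `groupby(...).apply(...)`, and it assumes df2's columns are named `Var`, `Value` and `Label` (without the trailing colons shown in the question).

```python
import pandas as pd

df1 = pd.DataFrame({
    'obs': [1, 2, 3, 4],
    'gender': [1, 1, 2, 1],
    'job': [1, 2, 2, 1],
})
df2 = pd.DataFrame({
    'Var': ['gender', 'gender', 'job', 'job'],
    'Value': [1, 2, 1, 2],
    'Label': ['male', 'female', 'blue collar', 'white collar'],
})

# Outer key: variable name; inner dict: code -> label for that variable
d = {
    var: dict(zip(group['Value'], group['Label']))
    for var, group in df2.groupby('Var')
}

# Replace the codes in place for every column that has a legend entry
for col, mapper in d.items():
    if col in df1.columns:
        df1[col] = df1[col].map(mapper)

print(df1)
```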
