Multiple Histograms from Pandas DataFrame with overlay

I've had the hardest time figuring this out. I have a dataframe with multiple categorical fields, and I want to plot each of them as a histogram with the target variable (Income) overlaid. I had hoped to use pandas for the histograms and simply iterate over all the fields, but when I try to plot just Race with Income overlaid, the legend says None and I can't get the Income values to stack on one another.

Below is a sample dataframe similar to mine, along with the latest thing I have tried.

import pandas as pd

exampledf = {'Race': ['Black', 'White', 'Asian', 'White',
                      'White', 'Asian', 'White', 'White',
                      'White', 'Black', 'White', 'Asian'],
             'Income': ['>=50k', '>=50k', '>=50k', '>=50k',
                        '>=50k', '<50k', '<50k', '>=50k',
                        '>=50k', '>=50k', '<50k', '>=50k'],
             'Gender': ['M', 'F', 'F', 'F',
                        'M', 'M', 'M', 'M',
                        'M', 'M', 'M', 'M']}
exampledf = pd.DataFrame(exampledf)
# latest attempt (this is the call that does not give the chart I want)
exampledf.groupby(['Income','Race']).size().plot(x=exampledf['Race'], kind='bar', color=['r','b'], logy=False, legend=True)

The way you are calling plot is not correct. You don't pass an x variable for a bar plot in pandas; it automatically uses the index for the x axis. However, because you have a multi-index, it is probably not going to give you the chart you want.
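To see why the multi-index gets in the way, it can help to inspect what the groupby call returns before plotting. The following is only a small sketch using the Race and Income columns from the question; it shows that the result is a Series keyed by (Income, Race) pairs, so a bar plot would draw one bar per pair rather than one group per race.

```python
import pandas as pd

exampledf = pd.DataFrame({
    'Race': ['Black', 'White', 'Asian', 'White',
             'White', 'Asian', 'White', 'White',
             'White', 'Black', 'White', 'Asian'],
    'Income': ['>=50k', '>=50k', '>=50k', '>=50k',
               '>=50k', '<50k', '<50k', '>=50k',
               '>=50k', '>=50k', '<50k', '>=50k'],
})

# Grouping on two keys returns a Series indexed by (Income, Race) pairs.
counts = exampledf.groupby(['Income', 'Race']).size()

print(counts.index.nlevels)  # number of index levels
print(list(counts.index))    # each tuple would become one x-axis label
```

Combinations that never occur in the data (here, Black with <50k) are simply absent from this Series, which is another reason the bars do not line up per race.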

To create a bar chart of race vs income, you need race as the index (rows), income as the columns, and the count as the values. You don't want groupby, you want to pivot your data. In this case, you want to use .pivot_table.

This will create a new dataframe with race as the index (the x-values for pandas .plot) and the different incomes as the columns (the y-values for .plot).

# 'size' counts the rows for each (Race, Income) pair; no value column is needed
pt = exampledf[['Race','Income']].pivot_table(index='Race', columns='Income',
                                              aggfunc='size', fill_value=0)
# output of pt:
# Income  <50k  >=50k
# Race
# Asian      1      2
# Black      0      2
# White      2      5

# make the plot
pt.plot.bar()

Here is the image using IPython. The defaults in Jupyter Notebook look better.

[image]
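The question also asked for stacked bars and for a way to iterate over every categorical field. One possible sketch, not part of the original answer, swaps in pd.crosstab to build the same kind of count table for each field and passes stacked=True to the bar plot. The names target, fields, and tables are illustrative choices, and the subplot layout is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

exampledf = pd.DataFrame({
    'Race': ['Black', 'White', 'Asian', 'White',
             'White', 'Asian', 'White', 'White',
             'White', 'Black', 'White', 'Asian'],
    'Income': ['>=50k', '>=50k', '>=50k', '>=50k',
               '>=50k', '<50k', '<50k', '>=50k',
               '>=50k', '>=50k', '<50k', '>=50k'],
    'Gender': ['M', 'F', 'F', 'F',
               'M', 'M', 'M', 'M',
               'M', 'M', 'M', 'M'],
})

target = 'Income'  # variable to overlay on every chart
fields = [col for col in exampledf.columns if col != target]

# One subplot per categorical field (layout is an arbitrary choice).
fig, axes = plt.subplots(1, len(fields),
                         figsize=(5 * len(fields), 4), squeeze=False)

tables = {}
for ax, field in zip(axes[0], fields):
    # Rows: categories of the field; columns: Income levels; values: counts.
    tables[field] = pd.crosstab(exampledf[field], exampledf[target])
    # stacked=True puts the Income levels on top of one another.
    tables[field].plot.bar(stacked=True, ax=ax, title=field)
    ax.set_ylabel('count')

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to display or save the figure.
```

Because the columns of each table are the Income levels, the legend is labelled with those levels instead of None.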

The answer from James using pure pandas is likely what you're looking for, but I've been turning more and more to altair for visualization from DataFrames because of its amazing simplicity.

You can get something like what you want just by assigning your frame columns to dimensions in the chart:

from altair import Chart

# 'count()' is the spelling in Altair 2 and later; Altair 1.x used 'count(*)'
Chart(exampledf).mark_bar(
).encode(
    y='Race',
    x='count()',
    color='Income'
)

[image]

or:

# 'count()' is the spelling in Altair 2 and later; Altair 1.x used 'count(*)'
Chart(exampledf).mark_bar(
).encode(
    column='Race',
    y='count()',
    x='Income'
)

[image]