Seaborn countplot with normalized y axis per group

I was wondering if it is possible to create a Seaborn count plot that, instead of showing actual counts on the y-axis, shows the relative frequency (percentage) within each group (as specified with the hue parameter).

I sort of fixed this with the following approach, but I can't imagine this is the easiest way:

# Plot percentage of occupation per income class
grouped = df.groupby(['income'], sort=False)
occupation_counts = grouped['occupation'].value_counts(normalize=True, sort=False)

occupation_data = [
    {'occupation': occupation, 'income': income, 'percentage': percentage*100} for 
    (income, occupation), percentage in dict(occupation_counts).items()
]

df_occupation = pd.DataFrame(occupation_data)

p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation)
_ = plt.setp(p.get_xticklabels(), rotation=90)  # Rotate labels

I might be confused. The difference between your output and the output of

occupation_counts = (df.groupby(['income'])['occupation']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('occupation'))
p = sns.barplot(x="occupation", y="percentage", hue="income", data=occupation_counts)
_ = plt.setp(p.get_xticklabels(), rotation=90)  # Rotate labels

is, it seems to me, only the order of the columns.


And you seem to care about that, since you pass sort=False. But then, in your code the order is determined purely by chance (and the order in which the dictionary is iterated even changes from run to run with Python 3.5).
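
One way to take chance out of the ordering is to compute the percentages in pandas and then fix both category orders explicitly. The following is a minimal sketch on made-up toy data (the rows and the names `occupation_order` and `income_order` are illustrative assumptions, not part of the original question):

```python
import pandas as pd

# Toy data standing in for the question's DataFrame (values are made up).
df = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Sales", "Tech", "Sales", "Tech", "Tech", "Craft"],
})

# Percentage of each occupation within its income class.
percentages = (
    df.groupby("income")["occupation"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percentage")
    .reset_index()
)

# Fix both orders explicitly instead of relying on iteration order.
occupation_order = sorted(df["occupation"].unique())
income_order = sorted(df["income"].unique())
```

These lists can then be handed to the plot, for example `sns.barplot(x="occupation", y="percentage", hue="income", data=percentages, order=occupation_order, hue_order=income_order)`, so the bar order no longer depends on how the data happened to be iterated.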

With newer versions of seaborn you can do the following:

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)

df = sns.load_dataset('titanic')
df.head()

x,y = 'class', 'survived'

(df
.groupby(x)[y]
.value_counts(normalize=True)
.mul(100)
.rename('percent')
.reset_index()
.pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))



Update

If you also want the percentages printed on the bars, you can do the following:

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
df.head()

x,y = 'class', 'survived'

df1 = df.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()

g = sns.catplot(x=x,y='percent',hue=y,kind='bar',data=df1)
g.ax.set_ylim(0,100)

for p in g.ax.patches:
    txt = str(p.get_height().round(2)) + '%'
    txt_x = p.get_x() 
    txt_y = p.get_height()
    g.ax.text(txt_x,txt_y,txt)
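
The loop above anchors each label at the bar's left edge. If centred labels are preferred, the position and text can be derived from the bar geometry. This is only a sketch; the helper name `bar_label` and its `decimals` parameter are hypothetical and not part of seaborn or matplotlib:

```python
import math


def bar_label(x, width, height, decimals=1):
    """Return (x, y, text) for a label centred above a bar.

    Returns None for bars without a height (NaN), which can occur
    when a category/hue combination has no data.
    """
    if height is None or math.isnan(height):
        return None
    return x + width / 2, height, f"{height:.{decimals}f}%"
```

Inside the loop it could be used as `label = bar_label(p.get_x(), p.get_width(), p.get_height())`, followed by `g.ax.text(*label, ha="center", va="bottom")` whenever `label` is not None.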


You can use the Dexplot library to count values and to normalize over any variable to get relative frequencies.

Pass the count function the name of the variable you would like to count and it will automatically produce a bar plot of the counts of all unique values. Use split to subdivide the counts by another variable. Notice that Dexplot automatically wraps the x-tick labels.

import dexplot as dxp

dxp.count('occupation', data=df, split='income')


Use the normalize parameter to normalize the counts over any variable (or over a combination of variables, given as a list). You can also use True to normalize over the grand total of counts.

dxp.count('occupation', data=df, split='income', normalize='income')


It boggled my mind that Seaborn doesn't provide anything like this out of the box.

Still, it was pretty easy to tweak the source code to get what you wanted. The following code, with the function percentageplot(x, hue, data), works just like sns.countplot, but norms each bar per group (i.e. it divides each green bar's value by the sum of all green bars).

In effect, it turns a plain sns.countplot (hard to interpret because Apple and Android have different N) into a percentage plot (normed so that bars reflect the proportion of the total for Apple vs. Android).

Hope this helps!

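# Note: this relies on private seaborn internals (_CategoricalPlotter), so it is
# tied to the seaborn version it was written against and may break on other releases.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from seaborn import utils
from seaborn.algorithms import bootstrap
from seaborn.categorical import _CategoricalPlotter, remove_na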

class _CategoricalStatPlotter(_CategoricalPlotter):

    @property
    def nested_width(self):
        """A float with the width of plot elements when hue nesting is used."""
        return self.width / len(self.hue_names)

    def estimate_statistic(self, estimator, ci, n_boot):

        if self.hue_names is None:
            statistic = []
            confint = []
        else:
            statistic = [[] for _ in self.plot_data]
            confint = [[] for _ in self.plot_data]

        for i, group_data in enumerate(self.plot_data):
            # Option 1: we have a single layer of grouping
            # --------------------------------------------

            if self.plot_hues is None:

                if self.plot_units is None:
                    stat_data = remove_na(group_data)
                    unit_data = None
                else:
                    unit_data = self.plot_units[i]
                    have = pd.notnull(np.c_[group_data, unit_data]).all(axis=1)
                    stat_data = group_data[have]
                    unit_data = unit_data[have]

                # Estimate a statistic from the vector of data
                if not stat_data.size:
                    statistic.append(np.nan)
                else:
                    statistic.append(estimator(stat_data, len(np.concatenate(self.plot_data))))

                # Get a confidence interval for this estimate
                if ci is not None:

                    if stat_data.size < 2:
                        confint.append([np.nan, np.nan])
                        continue

                    boots = bootstrap(stat_data, func=estimator,
                                      n_boot=n_boot,
                                      units=unit_data)
                    confint.append(utils.ci(boots, ci))

            # Option 2: we are grouping by a hue layer
            # ----------------------------------------

            else:
                for j, hue_level in enumerate(self.hue_names):
                    if not self.plot_hues[i].size:
                        statistic[i].append(np.nan)
                        if ci is not None:
                            confint[i].append((np.nan, np.nan))
                        continue

                    hue_mask = self.plot_hues[i] == hue_level
                    group_total_n = (np.concatenate(self.plot_hues) == hue_level).sum()
                    if self.plot_units is None:
                        stat_data = remove_na(group_data[hue_mask])
                        unit_data = None
                    else:
                        group_units = self.plot_units[i]
                        have = pd.notnull(
                            np.c_[group_data, group_units]
                            ).all(axis=1)
                        stat_data = group_data[hue_mask & have]
                        unit_data = group_units[hue_mask & have]

                    # Estimate a statistic from the vector of data
                    if not stat_data.size:
                        statistic[i].append(np.nan)
                    else:
                        statistic[i].append(estimator(stat_data, group_total_n))

                    # Get a confidence interval for this estimate
                    if ci is not None:

                        if stat_data.size < 2:
                            confint[i].append([np.nan, np.nan])
                            continue

                        boots = bootstrap(stat_data, func=estimator,
                                          n_boot=n_boot,
                                          units=unit_data)
                        confint[i].append(utils.ci(boots, ci))

        # Save the resulting values for plotting
        self.statistic = np.array(statistic)
        self.confint = np.array(confint)

        # Rename the value label to reflect the estimation
        if self.value_label is not None:
            self.value_label = "{}({})".format(estimator.__name__,
                                               self.value_label)

    def draw_confints(self, ax, at_group, confint, colors,
                      errwidth=None, capsize=None, **kws):

        if errwidth is not None:
            kws.setdefault("lw", errwidth)
        else:
            kws.setdefault("lw", mpl.rcParams["lines.linewidth"] * 1.8)

        for at, (ci_low, ci_high), color in zip(at_group,
                                                confint,
                                                colors):
            if self.orient == "v":
                ax.plot([at, at], [ci_low, ci_high], color=color, **kws)
                if capsize is not None:
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_low, ci_low], color=color, **kws)
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_high, ci_high], color=color, **kws)
            else:
                ax.plot([ci_low, ci_high], [at, at], color=color, **kws)
                if capsize is not None:
                    ax.plot([ci_low, ci_low],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)
                    ax.plot([ci_high, ci_high],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)

class _BarPlotter(_CategoricalStatPlotter):
    """Show point estimates and confidence intervals with bars."""

    def __init__(self, x, y, hue, data, order, hue_order,
                 estimator, ci, n_boot, units,
                 orient, color, palette, saturation, errcolor, errwidth=None,
                 capsize=None):
        """Initialize the plotter."""
        self.establish_variables(x, y, hue, data, orient,
                                 order, hue_order, units)
        self.establish_colors(color, palette, saturation)
        self.estimate_statistic(estimator, ci, n_boot)

        self.errcolor = errcolor
        self.errwidth = errwidth
        self.capsize = capsize

    def draw_bars(self, ax, kws):
        """Draw the bars onto `ax`."""
        # Get the right matplotlib function depending on the orientation
        barfunc = ax.bar if self.orient == "v" else ax.barh
        barpos = np.arange(len(self.statistic))

        if self.plot_hues is None:

            # Draw the bars
            barfunc(barpos, self.statistic, self.width,
                    color=self.colors, align="center", **kws)

            # Draw the confidence intervals
            errcolors = [self.errcolor] * len(barpos)
            self.draw_confints(ax,
                               barpos,
                               self.confint,
                               errcolors,
                               self.errwidth,
                               self.capsize)

        else:

            for j, hue_level in enumerate(self.hue_names):

                # Draw the bars
                offpos = barpos + self.hue_offsets[j]
                barfunc(offpos, self.statistic[:, j], self.nested_width,
                        color=self.colors[j], align="center",
                        label=hue_level, **kws)

                # Draw the confidence intervals
                if self.confint.size:
                    confint = self.confint[:, j]
                    errcolors = [self.errcolor] * len(offpos)
                    self.draw_confints(ax,
                                       offpos,
                                       confint,
                                       errcolors,
                                       self.errwidth,
                                       self.capsize)

    def plot(self, ax, bar_kws):
        """Make the plot."""
        self.draw_bars(ax, bar_kws)
        self.annotate_axes(ax)
        if self.orient == "h":
            ax.invert_yaxis()

def percentageplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None,
              orient=None, color=None, palette=None, saturation=.75,
              ax=None, **kwargs):

    # Estimator calculates required statistic (proportion)        
    estimator = lambda x, y: (float(len(x))/y)*100 
    ci = None
    n_boot = 0
    units = None
    errcolor = None

    if x is None and y is not None:
        orient = "h"
        x = y
    elif y is None and x is not None:
        orient = "v"
        y = x
    elif x is not None and y is not None:
        raise TypeError("Cannot pass values for both `x` and `y`")
    else:
        raise TypeError("Must pass values for either `x` or `y`")

    plotter = _BarPlotter(x, y, hue, data, order, hue_order,
                          estimator, ci, n_boot, units,
                          orient, color, palette, saturation,
                          errcolor)

    plotter.value_label = "Percentage"

    if ax is None:
        ax = plt.gca()

    plotter.plot(ax, kwargs)
    return ax

You can get countplot-style bars with a custom height (along the y axis) by using sns.barplot and its estimator keyword.

ax = sns.barplot(x="x", y="x", data=df, estimator=lambda x: len(x) / len(df) * 100)

The above code snippet is from https://github.com/mwaskom/seaborn/issues/1027

That issue has a whole discussion about how to show percentages in a countplot. This answer is based on the thread linked above.

In the context of your specific problem, you can probably do something like this:

ax = sb.barplot(x='occupation', y='some_numeric_column', data=raw_data, estimator=lambda x: len(x) / len(raw_data) * 100, hue='income')
ax.set(ylabel="Percent")

The above code worked for me (on a different dataset with different attributes). Note that you need to pass some numeric column for y; otherwise it raises an error: "ValueError: Neither the x nor y variable appears to be numeric."
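
One caveat worth checking: an estimator that divides by the length of the whole DataFrame yields each bar's share of all rows, not its share within its hue group. The difference can be seen in a small pandas sketch on made-up toy data (the rows and variable names below are illustrative assumptions):

```python
import pandas as pd

# Toy data standing in for raw_data (values are made up).
df = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Sales", "Tech", "Sales", "Tech", "Tech", "Craft"],
})

# Share of ALL rows: the equivalent of len(x) / len(raw_data) * 100.
share_of_total = pd.crosstab(
    df["occupation"], df["income"], normalize="all"
) * 100

# Share within each income class: each column sums to 100.
share_within_income = pd.crosstab(
    df["occupation"], df["income"], normalize="columns"
) * 100

print(share_of_total.round(1))
print(share_within_income.round(1))
```

If the per-group percentage is what is wanted, the second table is the one to plot, for instance after reshaping it to long format with `stack()` and `reset_index()`.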

You could do this with sns.histplot by setting the following properties:

  • stat = 'density' (this makes the y-axis show density rather than count)
  • common_norm = False (this normalizes each density independently)

See the simple example below:

import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')

ax = sns.histplot(x = df['class'], hue=df['survived'], multiple="dodge", 
                  stat = 'density', shrink = 0.8, common_norm=False)
