Seaborn countplot with normalized y axis per group
I was wondering if it is possible to create a Seaborn count plot that, instead of actual counts on the y-axis, shows the relative frequency (percentage) within each group (as specified with the hue parameter).
I sort of fixed this with the following approach, but I can't imagine this is the easiest approach:
# Plot percentage of occupation per income class
grouped = df.groupby(['income'], sort=False)
occupation_counts = grouped['occupation'].value_counts(normalize=True, sort=False)
occupation_data = [
    {'occupation': occupation, 'income': income, 'percentage': percentage*100}
    for (income, occupation), percentage in dict(occupation_counts).items()
]
df_occupation = pd.DataFrame(occupation_data)
p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation)
_ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels
I might be confused. The difference between your output and the output of
occupation_counts = (df.groupby(['income'])['occupation']
                     .value_counts(normalize=True)
                     .rename('percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('occupation'))
p = sns.barplot(x="occupation", y="percentage", hue="income", data=occupation_counts)
_ = plt.setp(p.get_xticklabels(), rotation=90)  # Rotate labels
is, it seems to me, only the order of the columns.
And you seem to care about that, since you pass sort=False. But then, in your code the order is determined purely by chance (and the order in which the dictionary is iterated even changes from run to run with Python 3.5).
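If the bar order matters, one option is to pin it explicitly instead of relying on dictionary or groupby ordering. The sketch below uses a small made-up DataFrame (the values are illustrative, not the question's data) and builds explicit order lists; x_order and hue_order are hypothetical names, and the idea is that they can be passed to the order and hue_order parameters of sns.barplot.

```python
import pandas as pd

# Toy data standing in for the question's df (values are made up).
df = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Sales", "Tech", "Sales", "Tech", "Tech", "Admin"],
})

# Percentage of each occupation within its income class.
occupation_counts = (
    df.groupby("income")["occupation"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percentage")
    .reset_index()
)

# Explicit, reproducible orderings (here simply alphabetical).
x_order = sorted(df["occupation"].unique())
hue_order = sorted(df["income"].unique())

# These lists can then be handed to seaborn, for example:
# sns.barplot(x="occupation", y="percentage", hue="income",
#             data=occupation_counts, order=x_order, hue_order=hue_order)
print(occupation_counts)
```

Any other ordering (for instance by overall frequency) works the same way, as long as the lists are computed deterministically.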
With newer versions of seaborn you can do the following:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
df = sns.load_dataset('titanic')
df.head()
x, y = 'class', 'survived'
(df
 .groupby(x)[y]
 .value_counts(normalize=True)
 .mul(100)
 .rename('percent')
 .reset_index()
 .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar'))
If you also want the percentages printed on the bars, you can do the following:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
df.head()
x,y = 'class', 'survived'
df1 = df.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()
g = sns.catplot(x=x,y='percent',hue=y,kind='bar',data=df1)
g.ax.set_ylim(0,100)
# Label each bar with its height, placed at the top-left corner of the bar
for p in g.ax.patches:
    txt = str(round(p.get_height(), 2)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.ax.text(txt_x, txt_y, txt)
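Note that the two snippets above normalize within each class (the x variable), whereas the question asks for normalization within each hue level. As a sketch of the difference, pd.crosstab with normalize="index" or normalize="columns" gives both views; the tiny DataFrame here is made up so that no dataset download is needed.

```python
import pandas as pd

# Made-up sample in the spirit of the titanic columns used above.
df = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Rows sum to 100: share of each outcome within a class (the x variable).
within_class = pd.crosstab(df["class"], df["survived"], normalize="index") * 100

# Columns sum to 100: share of each class within an outcome (the hue variable).
within_survived = pd.crosstab(df["class"], df["survived"], normalize="columns") * 100

# Long format, suitable for sns.barplot or sns.catplot(kind="bar").
tidy = within_survived.stack().rename("percent").reset_index()
print(tidy)
```

Which of the two tables is the right one depends on the comparison you want the reader to make: bars that sum to 100 within each x position, or within each color.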
You can use the library Dexplot to do counting as well as normalizing over any variable to get relative frequencies.
Pass the count function the name of the variable you would like to count and it will automatically produce a bar plot of the counts of all unique values. Use split to subdivide the counts by another variable. Notice that Dexplot automatically wraps the x-tick labels.
dxp.count('occupation', data=df, split='income')
Use the normalize parameter to normalize the counts over any variable (or combination of variables with a list). You can also use True to normalize over the grand total of counts.
dxp.count('occupation', data=df, split='income', normalize='income')
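To make the two normalization modes concrete, here is a plain-Python sketch of the arithmetic. It is not Dexplot's implementation, and the records are made up; it only contrasts dividing each count by its income group's total with dividing by the grand total.

```python
from collections import Counter

# Made-up (income, occupation) records for illustration.
records = [
    ("<=50K", "Sales"), ("<=50K", "Sales"), ("<=50K", "Tech"),
    ("<=50K", "Admin"), (">50K", "Tech"), (">50K", "Tech"),
]

pair_counts = Counter(records)                             # count per (income, occupation)
income_totals = Counter(income for income, _ in records)   # count per income group
grand_total = len(records)

# Normalized within each income group: every group sums to 1.
per_group = {
    pair: n / income_totals[pair[0]] for pair, n in pair_counts.items()
}

# Normalized over the grand total: all bars together sum to 1.
overall = {pair: n / grand_total for pair, n in pair_counts.items()}

print(per_group[("<=50K", "Sales")])  # 2 of the 4 "<=50K" records
print(overall[("<=50K", "Sales")])    # 2 of all 6 records
```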
It boggled my mind that Seaborn doesn't provide anything like this out of the box.
Still, it was pretty easy to tweak the source code to get what you wanted. The following code, with the function "percentageplot(x, hue, data)", works just like sns.countplot, but norms each bar per group (i.e. divides each green bar's value by the sum of all green bars).
In effect, it turns this (hard to interpret because of the different N for Apple vs. Android): [image: sns.countplot] into this (normed so that bars reflect the proportion of the total for Apple vs. Android): [image: percentageplot]
Hope this helps!!
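# Note: this relies on private seaborn internals, so these imports may not
# be available in recent seaborn releases.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from seaborn import utils
from seaborn.algorithms import bootstrap
from seaborn.categorical import _CategoricalPlotter, remove_na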
class _CategoricalStatPlotter(_CategoricalPlotter):

    @property
    def nested_width(self):
        """A float with the width of plot elements when hue nesting is used."""
        return self.width / len(self.hue_names)

    def estimate_statistic(self, estimator, ci, n_boot):
        if self.hue_names is None:
            statistic = []
            confint = []
        else:
            statistic = [[] for _ in self.plot_data]
            confint = [[] for _ in self.plot_data]

        for i, group_data in enumerate(self.plot_data):
            # Option 1: we have a single layer of grouping
            # --------------------------------------------
            if self.plot_hues is None:
                if self.plot_units is None:
                    stat_data = remove_na(group_data)
                    unit_data = None
                else:
                    unit_data = self.plot_units[i]
                    have = pd.notnull(np.c_[group_data, unit_data]).all(axis=1)
                    stat_data = group_data[have]
                    unit_data = unit_data[have]
                # Estimate a statistic from the vector of data
                if not stat_data.size:
                    statistic.append(np.nan)
                else:
                    statistic.append(estimator(stat_data, len(np.concatenate(self.plot_data))))
                # Get a confidence interval for this estimate
                if ci is not None:
                    if stat_data.size < 2:
                        confint.append([np.nan, np.nan])
                        continue
                    boots = bootstrap(stat_data, func=estimator,
                                      n_boot=n_boot,
                                      units=unit_data)
                    confint.append(utils.ci(boots, ci))
            # Option 2: we are grouping by a hue layer
            # ----------------------------------------
            else:
                for j, hue_level in enumerate(self.hue_names):
                    if not self.plot_hues[i].size:
                        statistic[i].append(np.nan)
                        if ci is not None:
                            confint[i].append((np.nan, np.nan))
                        continue
                    hue_mask = self.plot_hues[i] == hue_level
                    group_total_n = (np.concatenate(self.plot_hues) == hue_level).sum()
                    if self.plot_units is None:
                        stat_data = remove_na(group_data[hue_mask])
                        unit_data = None
                    else:
                        group_units = self.plot_units[i]
                        have = pd.notnull(
                            np.c_[group_data, group_units]
                        ).all(axis=1)
                        stat_data = group_data[hue_mask & have]
                        unit_data = group_units[hue_mask & have]
                    # Estimate a statistic from the vector of data
                    if not stat_data.size:
                        statistic[i].append(np.nan)
                    else:
                        statistic[i].append(estimator(stat_data, group_total_n))
                    # Get a confidence interval for this estimate
                    if ci is not None:
                        if stat_data.size < 2:
                            confint[i].append([np.nan, np.nan])
                            continue
                        boots = bootstrap(stat_data, func=estimator,
                                          n_boot=n_boot,
                                          units=unit_data)
                        confint[i].append(utils.ci(boots, ci))

        # Save the resulting values for plotting
        self.statistic = np.array(statistic)
        self.confint = np.array(confint)
        # Rename the value label to reflect the estimation
        if self.value_label is not None:
            self.value_label = "{}({})".format(estimator.__name__,
                                               self.value_label)

    def draw_confints(self, ax, at_group, confint, colors,
                      errwidth=None, capsize=None, **kws):
        if errwidth is not None:
            kws.setdefault("lw", errwidth)
        else:
            kws.setdefault("lw", mpl.rcParams["lines.linewidth"] * 1.8)
        for at, (ci_low, ci_high), color in zip(at_group,
                                                confint,
                                                colors):
            if self.orient == "v":
                ax.plot([at, at], [ci_low, ci_high], color=color, **kws)
                if capsize is not None:
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_low, ci_low], color=color, **kws)
                    ax.plot([at - capsize / 2, at + capsize / 2],
                            [ci_high, ci_high], color=color, **kws)
            else:
                ax.plot([ci_low, ci_high], [at, at], color=color, **kws)
                if capsize is not None:
                    ax.plot([ci_low, ci_low],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)
                    ax.plot([ci_high, ci_high],
                            [at - capsize / 2, at + capsize / 2],
                            color=color, **kws)
class _BarPlotter(_CategoricalStatPlotter):
    """Show point estimates and confidence intervals with bars."""

    def __init__(self, x, y, hue, data, order, hue_order,
                 estimator, ci, n_boot, units,
                 orient, color, palette, saturation, errcolor, errwidth=None,
                 capsize=None):
        """Initialize the plotter."""
        self.establish_variables(x, y, hue, data, orient,
                                 order, hue_order, units)
        self.establish_colors(color, palette, saturation)
        self.estimate_statistic(estimator, ci, n_boot)
        self.errcolor = errcolor
        self.errwidth = errwidth
        self.capsize = capsize

    def draw_bars(self, ax, kws):
        """Draw the bars onto `ax`."""
        # Get the right matplotlib function depending on the orientation
        barfunc = ax.bar if self.orient == "v" else ax.barh
        barpos = np.arange(len(self.statistic))
        if self.plot_hues is None:
            # Draw the bars
            barfunc(barpos, self.statistic, self.width,
                    color=self.colors, align="center", **kws)
            # Draw the confidence intervals
            errcolors = [self.errcolor] * len(barpos)
            self.draw_confints(ax,
                               barpos,
                               self.confint,
                               errcolors,
                               self.errwidth,
                               self.capsize)
        else:
            for j, hue_level in enumerate(self.hue_names):
                # Draw the bars
                offpos = barpos + self.hue_offsets[j]
                barfunc(offpos, self.statistic[:, j], self.nested_width,
                        color=self.colors[j], align="center",
                        label=hue_level, **kws)
                # Draw the confidence intervals
                if self.confint.size:
                    confint = self.confint[:, j]
                    errcolors = [self.errcolor] * len(offpos)
                    self.draw_confints(ax,
                                       offpos,
                                       confint,
                                       errcolors,
                                       self.errwidth,
                                       self.capsize)

    def plot(self, ax, bar_kws):
        """Make the plot."""
        self.draw_bars(ax, bar_kws)
        self.annotate_axes(ax)
        if self.orient == "h":
            ax.invert_yaxis()
def percentageplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None,
                   orient=None, color=None, palette=None, saturation=.75,
                   ax=None, **kwargs):
    # Estimator calculates required statistic (proportion)
    estimator = lambda x, y: (float(len(x))/y)*100
    ci = None
    n_boot = 0
    units = None
    errcolor = None
    if x is None and y is not None:
        orient = "h"
        x = y
    elif y is None and x is not None:
        orient = "v"
        y = x
    elif x is not None and y is not None:
        raise TypeError("Cannot pass values for both `x` and `y`")
    else:
        raise TypeError("Must pass values for either `x` or `y`")
    plotter = _BarPlotter(x, y, hue, data, order, hue_order,
                          estimator, ci, n_boot, units,
                          orient, color, palette, saturation,
                          errcolor)
    plotter.value_label = "Percentage"
    if ax is None:
        ax = plt.gca()
    plotter.plot(ax, kwargs)
    return ax
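The essential change relative to the stock plotter is the denominator: each cell's count is divided by the total count of its hue level (group_total_n). A minimal NumPy sketch of just that arithmetic follows; plot_hues and hue_names here are hypothetical stand-ins for the plotter's internal attributes, filled with made-up values.

```python
import numpy as np

# Hypothetical stand-ins for the plotter's per-x-group hue arrays.
plot_hues = [
    np.array(["Apple", "Android", "Android"]),
    np.array(["Apple", "Apple", "Android", "Android", "Android"]),
]
hue_names = ["Apple", "Android"]

# Same estimator as in percentageplot: cell size over hue-level total, in percent.
estimator = lambda x, y: (float(len(x)) / y) * 100

statistic = []
for group_hues in plot_hues:
    row = []
    for hue_level in hue_names:
        hue_mask = group_hues == hue_level
        # Total number of observations with this hue across all x groups.
        group_total_n = (np.concatenate(plot_hues) == hue_level).sum()
        row.append(estimator(group_hues[hue_mask], group_total_n))
    statistic.append(row)

statistic = np.array(statistic)  # rows: x groups, columns: hue levels
print(statistic)
```

Each column of the resulting array sums to 100, which is what makes bars of the same color comparable across groups of different size.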
You can emulate a count plot with sns.barplot and supply your own estimator for the height of the bar (along the y axis) by using the estimator keyword.
ax = sns.barplot(x="x", y="x", data=df, estimator=lambda x: len(x) / len(df) * 100)
The above code snippet is from https://github.com/mwaskom/seaborn/issues/1027
They have a whole discussion about how to provide percentages in a countplot. This answer is based on the same thread linked above.
In the context of your specific problem, you can probably do something like this:
ax = sb.barplot(x='occupation', y='some_numeric_column', data=raw_data, estimator=lambda x: len(x) / len(raw_data) * 100, hue='income')
ax.set(ylabel="Percent")
The above code worked for me (on a different dataset with different attributes). Note that you need to put in some numeric column for y, or else it gives an error: "ValueError: Neither the x nor y variable appears to be numeric."
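One caveat worth checking: because the estimator divides by len(raw_data), each bar is a share of the whole dataset rather than of its own income group. The sketch below emulates that estimator with a plain pandas groupby (raw_data and its values are made up) and contrasts it with a per-income denominator.

```python
import pandas as pd

# Made-up data; column names follow the snippet above.
raw_data = pd.DataFrame({
    "income": ["<=50K"] * 4 + [">50K"] * 2,
    "occupation": ["Sales", "Sales", "Tech", "Admin", "Tech", "Tech"],
})

cell_sizes = raw_data.groupby(["occupation", "income"]).size()

# What the lambda computes per (occupation, income) cell: cell size / all rows.
share_of_all = cell_sizes / len(raw_data) * 100

# Per-income denominator instead: cell size / rows in that income group.
income_sizes = raw_data.groupby("income").size()
share_within_income = cell_sizes.div(income_sizes, level="income").mul(100)

print(share_of_all)
print(share_within_income)
```

If per-group percentages are what you need, precomputing a table like share_within_income and plotting it with a plain barplot is probably the simpler route.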
You could do this with sns.histplot by setting the following properties:
stat = 'density' (this will make the y-axis the density rather than count)
common_norm = False (this will normalize each density independently)
See the simple example below:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
ax = sns.histplot(x=df['class'], hue=df['survived'], multiple="dodge",
                  stat='density', shrink=0.8, common_norm=False)
Building on this answer, I found that using "probability" worked best.
Taken from the sns.histplot documentation on the "stat" parameter:
Aggregate statistic to compute in each bin.
count: show the number of observations in each bin
frequency: show the number of observations divided by the bin width
probability or proportion: normalize such that bar heights sum to 1
percent: normalize such that bar heights sum to 100
density: normalize such that the total area of the histogram equals 1
import seaborn as sns
df = sns.load_dataset('titanic')
ax = sns.histplot(
    x=df['class'], hue=df['survived'], multiple="dodge",
    stat='probability', shrink=0.5, common_norm=False
)