Let ggplot2 histogram show classwise percentages on the y-axis

Question

library(ggplot2)
data = diamonds[, c('carat', 'color')]
data = data[data$color %in% c('D', 'E'), ]

I would like to compare the histogram of carat across colors D and E, and use the classwise percentage on the y-axis. The solutions I have tried are as follows:

Solution 1:

ggplot(data=data, aes(carat, fill=color)) +  geom_histogram(aes(y=..density..), position='dodge', binwidth = 0.5) + ylab("Percentage") +xlab("Carat")


This is not quite right, since the y-axis shows the height of the estimated density.

Solution 2:

 ggplot(data=data, aes(carat, fill=color)) +  geom_histogram(aes(y=(..count..)/sum(..count..)), position='dodge', binwidth = 0.5) + ylab("Percentage") +xlab("Carat")


This is also not what I want, because the denominator used to calculate the ratio on the y-axis is the total count of D + E.

Is there a way to display the classwise percentages with ggplot2's stacked histogram? That is, instead of showing (# of obs in bin)/count(D+E) on the y-axis, I would like it to show (# of obs in bin)/count(D) and (# of obs in bin)/count(E) respectively for the two color classes. Thanks.

Answer 1

Calculating from stats

You can scale the bars by group using the special stat variables group and count, using group to select subsets of count.

If you have ggplot2 3.3.0 or newer, you can use the after_stat function to access these special variables:

ggplot(data, aes(carat, fill=color)) +
  geom_histogram(
    aes(y=after_stat(c(
      count[group==1]/sum(count[group==1]),
      count[group==2]/sum(count[group==2])
    )*100)),
    position='dodge',
    binwidth=0.5
  ) +
  ylab("Percentage") + xlab("Carat")

[Figure: ggplot of carat versus percentage, with two sets of bars, each showing the percentage within its own color, as desired]
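The call above hard-codes groups 1 and 2. As a minimal sketch (not part of the original answer), the same scaling can be written without listing the groups, assuming base R's ave() is used to give each bin the total count of its own group. The variable name p is only for illustration.

```r
library(ggplot2)

# Same subset as in the question: diamonds of color D or E only
data <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]

p <- ggplot(data, aes(carat, fill = color)) +
  geom_histogram(
    # ave(count, group, FUN = sum) returns, for every bin, the total count
    # of the group that bin belongs to, so each bar is divided by its own
    # group's total rather than by the overall total
    aes(y = after_stat(count / ave(count, group, FUN = sum) * 100)),
    position = "dodge",
    binwidth = 0.5
  ) +
  ylab("Percentage") + xlab("Carat")

p
```

This should behave the same as the two-group version for this data, and it keeps working if more colors are kept in the subset.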

Using older versions of ggplot

In earlier versions, this is more cumbersome. If you have at least 3.0, you can wrap the stat() function around each individual variable reference; in pre-3.0 versions, you have to surround them with two dots instead:

aes(y=c(
  ..count..[..group..==1]/sum(..count..[..group..==1]),
  ..count..[..group..==2]/sum(..count..[..group..==2])
)*100),

Yeah, but what are all these stats?

For more details on where these variables come from, the computed variables are documented alongside the stat function being used. For example, the help page for stat_bin(), which is geom_histogram's default stat, has this Computed variables section:

Computed variables:

count: number of points in bin
density: density of points in bin, scaled to integrate to 1
ncount: count, scaled to maximum of 1
ndensity: density, scaled to maximum of 1
width: widths of bins

Beyond that, you can use ggplot_build() to inspect all the stats generated for any given plot:

> p = ggplot(data, [...etc...])
> ggplot_build(p)
$data
$data[[1]]
        fill           y count      x  xmin xmax      density       ncount
1  #440154FF  1.50553506   102 -0.125 -0.25 0.00 0.0301107011 0.0224323730
2  #440154FF 67.11439114  4547  0.375  0.25 
[...snip...]
       ndensity flipped_aes PANEL group ymin        ymax colour size linetype
1  0.0224323730       FALSE     1     1    0  1.50553506     NA  0.5        1
2  1.0000000000       FALSE     1     1    0 67.11439114     NA  0.5        1
[...snip...]
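As a rough sketch of how that inspection can double as a sanity check, the snippet below rebuilds the after_stat plot from this answer (an assumption about what p stands for in the session above) and sums the y values within each group. If the scaling is classwise, each group should total 100, up to floating-point error. The names built and group_totals are illustrative.

```r
library(ggplot2)

data <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]

p <- ggplot(data, aes(carat, fill = color)) +
  geom_histogram(
    aes(y = after_stat(c(
      count[group == 1] / sum(count[group == 1]),
      count[group == 2] / sum(count[group == 2])
    ) * 100)),
    position = "dodge",
    binwidth = 0.5
  )

# Computed data for the first (and only) layer
built <- ggplot_build(p)$data[[1]]

# Look at the columns relevant to the percentage calculation
head(built[, c("group", "xmin", "xmax", "count", "y")])

# Sum of the plotted percentages within each group
group_totals <- tapply(built$y, built$group, sum)
group_totals
```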

Answer 2

It seems that binning the data outside of ggplot2 is the way to go. But I would still be interested to see if there is a way to do it within ggplot2.

library(dplyr)
breaks = seq(0,4,0.5)

data$carat_cut = cut(data$carat, breaks = breaks)

data_cut = data %>%
  group_by(color, carat_cut) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))

ggplot(data=data_cut, aes(x = carat_cut, y=freq*100, fill=color)) + geom_bar(stat="identity",position="dodge") + scale_x_discrete(labels = breaks) +  ylab("Percentage") +xlab("Carat")

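For checking the numbers without any plotting, here is a minimal base R sketch of the same binning idea. It assumes the same breaks as above and uses table() with prop.table() in place of the dplyr pipeline; the names counts and pct are illustrative.

```r
library(ggplot2)  # loaded only for the diamonds dataset

data <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]
breaks <- seq(0, 4, 0.5)

# Cross-tabulate color against carat bin; droplevels() removes the
# unused color levels (F to J) left over from the full dataset
counts <- table(droplevels(data$color), cut(data$carat, breaks = breaks))

# margin = 1 divides each row by its own row total, i.e. by the number
# of diamonds of that color, which is the classwise denominator
pct <- prop.table(counts, margin = 1) * 100

round(pct, 2)
```

Each row of pct is one color, so the row sums should be 100.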

Answer 3

Fortunately, in my case, Rorschach's answer worked perfectly. I came here looking to avoid the solution proposed by Megan Halbrook, which is the one I was using until I realized it is not a correct solution.

Adding a density line to the histogram automatically changes the y-axis to frequency density, not to percentage. The values of frequency density are equivalent to percentages only if binwidth = 1.

Googling: To draw a histogram, first find the class width of each category. The area of the bar represents the frequency, so to find the height of the bar, divide the frequency by the class width. This is called frequency density. https://www.bbc.co.uk/bitesize/guides/zc7sb82/revision/9

Below is an example, where the left panel shows percentage and the right panel shows density on the y-axis.

library(ggplot2)
library(gridExtra)

TABLE <- data.frame(vari = c(0,1,1,2,3,3,3,4,4,4,5,5,6,7,7,8))

## selected binwidth
bw <- 2

## plot using count
plot_count <- ggplot(TABLE, aes(x = vari)) + 
   geom_histogram(aes(y = ..count../sum(..count..)*100), binwidth = bw, col =1) 
## plot using density
plot_density <- ggplot(TABLE, aes(x = vari)) + 
   geom_histogram(aes(y = ..density..), binwidth = bw, col = 1)

## visualize together
grid.arrange(ncol = 2, grobs = list(plot_count,plot_density))

## visualize the values
data_count <- ggplot_build(plot_count)
data_density <- ggplot_build(plot_density)

## using ..count../sum(..count..) the values of the y axis are the same as 
## density * binwidth * 100. This is because density shows the "frequency density".
## all.equal() is used instead of == to tolerate floating-point rounding.
all.equal(data_count$data[[1]]$y, data_count$data[[1]]$density * bw * 100)
## using ..density.. the values of the y axis are the "frequency densities".
all.equal(data_density$data[[1]]$y, data_density$data[[1]]$density)


## manually calculated percentage for each range of the histogram. Note that
## geom_histogram uses right-closed intervals.
min_range_of_intervals <- data_count$data[[1]]$xmin

for(i in min_range_of_intervals)
  cat(paste("Values >",i,"and <=",i+bw,"involve a percent of",
            sum(TABLE$vari>i & TABLE$vari<=(i+bw))/nrow(TABLE)*100),"\n")

# Values > -1 and <= 1 involve a percent of 18.75 
# Values > 1 and <= 3 involve a percent of 25 
# Values > 3 and <= 5 involve a percent of 31.25 
# Values > 5 and <= 7 involve a percent of 18.75 
# Values > 7 and <= 9 involve a percent of 6.25
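The relationship checked above (percentage = density * binwidth * 100) also suggests a sketch for the grouped case in the question. It assumes that stat_bin() computes density separately for each group, and that the computed variable width holds the bin width, so that density * width is the within-group proportion. This is offered as an illustration, not as the method of any answer here; p_pct is an illustrative name.

```r
library(ggplot2)

data <- diamonds[diamonds$color %in% c("D", "E"), c("carat", "color")]

p_pct <- ggplot(data, aes(carat, fill = color)) +
  geom_histogram(
    # density is count / (group total * bin width), so multiplying by the
    # bin width recovers count / group total for each color
    aes(y = after_stat(density * width * 100)),
    position = "dodge",
    binwidth = 0.5
  ) +
  ylab("Percentage") + xlab("Carat")

p_pct
```

Unlike plotting density directly, this does not depend on the bin width being 1.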

Answer 4

When I tried Rorschach's answer, it did not work for me, for reasons that were not readily apparent. I wanted to comment, though, that if you are open to adding density lines to the histogram, doing so will automatically change the y-axis to percent.

For example, I have a count of "doses" by a binary outcome (0,1).

This code produces the following graph:

ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
  geom_histogram(binwidth=1, alpha=.5, position='identity')

But when I add a density plot to my ggplot code and set y=..density.., I get this plot with percent on the y-axis:

ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
  geom_histogram(aes(y=..density..), binwidth=1, alpha=.5, position='identity') +
  geom_density(alpha=.2)

This is kind of a workaround to your original question, but I thought I would share.

4 answers

Answer 1 (score 12, accepted, 2015-07-03 08:18:35)
Answer 2 (score 9, 2015-07-03 07:18:23)
Answer 3 (score 2, 2021-07-22 15:36:00)
Answer 4 (score 1, 2021-03-16 22:37:34)
