
Making a barplot for the distribution of a continuous variable

I'm making a barplot to visualize the distribution of a continuous variable, e.g. the distribution of listing prices. I generated some sample data and made a barplot with ggplot2.

library(dplyr)
library(ggplot2)

a <- rnorm(100, 1000, 1000)
d <- as.data.frame(a)
d <- d %>%
    mutate(b = cut(a, breaks = seq(min(a), max(a), 500))) %>%
    group_by(b) %>%
    summarize(count = n())
ggplot(data = d, aes(x = b, y = count)) +
    geom_bar(stat = 'identity') +
    theme(axis.text.x = element_text(angle = 90, size = 5, face = 'bold'))


My questions are:

  • How can I format the x-axis labels so that, for example, 1.22e+03 becomes 1220?

  • Why does the last bin become NA?

I know I can just use geom_histogram for this data, but I want the flexibility to cut the continuous variable into bins myself for some highly skewed data. Any help is very much appreciated. Thanks in advance.

Both issues come from cut(). You should read ?cut.

To avoid scientific notation in the class labels, use the dig.lab argument. In your example, cut(a, breaks=seq(min(a),max(a), 500), dig.lab = 6L) seems to be enough.
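To see the effect of dig.lab in isolation, here is a minimal base R sketch. The values in x and brks are made up for illustration and are not taken from the question's data; they are chosen so that the breaks need four significant digits.

```r
# Hypothetical values: the breaks have four significant digits
x <- c(1225, 1500, 1725)
brks <- c(1220, 1720, 2220)

# Default dig.lab = 3: breaks are formatted with 3 significant digits,
# so four-digit numbers fall back to scientific notation
default_labels <- levels(cut(x, breaks = brks))

# dig.lab = 6 allows up to 6 significant digits before switching notation
plain_labels <- levels(cut(x, breaks = brks, dig.lab = 6L))

print(default_labels)  # "(1.22e+03,1.72e+03]" "(1.72e+03,2.22e+03]"
print(plain_labels)    # "(1220,1720]" "(1720,2220]"
```

If the data can reach more than six digits, a larger dig.lab would be needed.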

NAs appear for two reasons, both linked to your breaks argument. First, by default the intervals built by cut() are open on the left, so the first break itself is excluded and the observation where a == min(a) becomes NA. To overcome this, use include.lowest = TRUE.
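The left-open behaviour is easy to check on a tiny vector. The sketch below uses made-up values where the smallest observation coincides with the first break, which is the situation created by starting the breaks at min(a).

```r
# Hypothetical values: the minimum equals the first break
x <- c(10, 20, 30)
brks <- c(10, 20, 30)

# Default intervals are (10,20] and (20,30], so 10 belongs to neither
na_default <- is.na(cut(x, breaks = brks))
print(na_default)      # TRUE FALSE FALSE

# include.lowest = TRUE closes the first interval on the left: [10,20]
closed <- cut(x, breaks = brks, include.lowest = TRUE)
print(levels(closed))  # "[10,20]" "(20,30]"
print(anyNA(closed))   # FALSE
```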

Second, your highest values will be dropped because seq(min(a), max(a), 500) steps up from min(a) in increments of 500 and stops at the last value that does not exceed max(a), so it generally does not reach max(a). To overcome this, make sure the sequence ends at or above max(a). Rounding only the upper limit, as in ceiling(max(a) / 500) * 500, is not always enough, because the steps still start from min(a) and may stop short of that limit; rounding both ends to multiples of 500, with floor(min(a) / 500) * 500 as the start, guarantees that the breaks span the data.

Therefore, this should work:

d <- as.data.frame(a) %>%   # start again from the raw values; d was overwritten by the summary
  mutate(b = cut(a, breaks = seq(floor(min(a) / 500) * 500,
                                 ceiling(max(a) / 500) * 500, 500),
                 include.lowest = TRUE,
                 dig.lab = 6L)) %>%
  group_by(b) %>%
  summarize(count = n())
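As a quick sanity check on the breaks, the base R sketch below compares a sequence stepped from min(x) with one whose ends are both rounded to multiples of 500. The vector x holds made-up values, and rounding both ends is one possible way to guarantee coverage, shown here as an assumption rather than the only option.

```r
# Hypothetical values; the minimum is deliberately not a multiple of 500
x <- c(-1300, 200, 3450)

# Breaks stepped from min(x) stop at 3200, below max(x) = 3450
short <- seq(min(x), max(x), 500)
print(max(short) < max(x))  # TRUE: the largest value would get no bin

# Rounding both ends to multiples of 500 makes the breaks span the data
lo <- floor(min(x) / 500) * 500     # -1500
hi <- ceiling(max(x) / 500) * 500   #  3500
full <- seq(lo, hi, 500)

binned <- cut(x, breaks = full, include.lowest = TRUE, dig.lab = 6L)
print(anyNA(binned))        # FALSE: every value lands in a bin
```

The same check, anyNA() on the binned column, can be run on the real data before summarizing.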

