Making a barplot for the distribution of a continuous variable
I'm making a barplot to visualize the distribution of a continuous variable, e.g. the distribution of listing prices. I generated some sample data and made a barplot with ggplot2.
library(dplyr)
library(ggplot2)
a <- rnorm(100, 1000, 1000)
d <- as.data.frame(a)
d <- d %>%
  mutate(b = cut(a, breaks = seq(min(a), max(a), 500))) %>%
  group_by(b) %>%
  summarize(count = n())
ggplot(data = d, aes(x = b, y = count)) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, size = 5, face = 'bold'))
My questions are:
1. How can I format the x-axis labels so that, for example, 1.22e+03 becomes 1220?
2. Why does the last bin become NA?
I know I can just use geom_histogram for this data, but I want the flexibility to cut the continuous variable into custom bins for highly skewed data. Any help is very much appreciated. Thanks in advance.
Both issues are about cut(). You should read ?cut.
To avoid scientific notation in the class labels, use the dig.lab argument, which sets how many significant digits are used when formatting the break values (the default is 3). In your example, cut(a, breaks = seq(min(a), max(a), 500), dig.lab = 6L) seems to be enough.
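As a rough illustration of the effect of dig.lab, here is a minimal base R sketch. The break points and the values in x are hypothetical and chosen only so that two breaks need four significant digits; they are not taken from the question's data.

```r
# Hypothetical break points; 1220 and 1720 need four significant digits.
breaks <- c(0, 500, 1220, 1720)
x <- c(100, 700, 1500)

# Default dig.lab = 3: break values of 1000 or more are printed
# in scientific notation.
default_labels <- levels(cut(x, breaks))

# dig.lab = 6 allows up to six significant digits in the labels.
wide_labels <- levels(cut(x, breaks, dig.lab = 6L))

print(default_labels)
print(wide_labels)
```

Comparing the two printed vectors should show the scientific notation disappearing once dig.lab is large enough for the break values.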
The NAs appear for two reasons, both linked to your breaks argument. First, by default the intervals built by cut() are open on the left, so the first break is excluded and the observation where a == min(a) becomes NA. To overcome this, use include.lowest = TRUE.
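The left-open default can be seen in a small base R sketch. The values below are hypothetical and placed exactly on the breaks so that the boundary behaviour is visible.

```r
# Hypothetical values that sit exactly on the break points.
x <- c(10, 20, 30, 40)
breaks <- c(10, 20, 30, 40)

# Default: intervals are (10,20], (20,30], (30,40], so the value 10
# falls outside every interval and becomes NA.
open_left <- cut(x, breaks)

# include.lowest = TRUE closes the first interval: [10,20].
closed_left <- cut(x, breaks, include.lowest = TRUE)

print(is.na(open_left))
print(closed_left)
```

Only the first element is affected, which matches the single observation equal to min(a) in the question.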
Second, your highest values are dropped because seq(min(a), max(a), 500) counts up from min(a) in steps of 500 and stops at the last value that does not exceed max(a). The final break therefore usually falls below max(a), and everything above it becomes NA. Raising the upper limit to the first multiple of 500 after max(a), i.e. ceiling(max(a) / 500) * 500, is not always enough on its own: the sequence still starts at min(a), so its last value can still be smaller than max(a). Starting from a multiple of 500 as well, floor(min(a) / 500) * 500, guarantees that the breaks cover the whole range and also gives round bin edges.
Therefore, this should work (applied to the original data frame d, before it is summarized):
d <- d %>%
  mutate(b = cut(a,
                 breaks = seq(floor(min(a) / 500) * 500, ceiling(max(a) / 500) * 500, 500),
                 include.lowest = TRUE,
                 dig.lab = 6L)) %>%
  group_by(b) %>%
  summarize(count = n())
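To check where seq() actually stops, here is a small base R sketch with hypothetical values. It compares breaks that start at min(a) with breaks aligned to multiples of the step on both ends. The names step, naive and aligned are illustrative only.

```r
# Hypothetical data; min and max are not multiples of the step.
a <- c(-1234, 150, 980, 3400)
step <- 500

# Counting up from min(a): the last break is -1234 + 9 * 500 = 3266,
# which is below max(a) = 3400.
naive <- seq(min(a), max(a), step)

# Aligning both ends to multiples of the step covers the whole range.
aligned <- seq(floor(min(a) / step) * step,
               ceiling(max(a) / step) * step,
               step)

naive_bins <- cut(a, naive, include.lowest = TRUE, dig.lab = 6L)
aligned_bins <- cut(a, aligned, include.lowest = TRUE, dig.lab = 6L)

print(range(naive))
print(range(aligned))
print(anyNA(naive_bins))    # the maximum is not covered
print(anyNA(aligned_bins))  # every value gets a bin
```

For highly skewed data the same check applies to any hand-written breaks vector: as long as its first and last elements enclose range(a) and include.lowest = TRUE is set, cut() should not return NA.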