
What is wrong with my density plot in ggplot2?

I have a simple data frame with a lot of entries in it. I would like to plot a density plot of the distribution.

Quick data frame summary:

summary(rr_stats)
rr       
Min.   : 1.00  
1st Qu.:17.00  
Median :20.00  
Mean   :20.33  
3rd Qu.:23.00  
Max.   :96.00  

The first 20 entries in my df:

rr_stats[1:20,1]
[1] 30 28 29 32 32 33 28 25 35 24 28 22 30 26 22 26 23 25 23 23

When I plot this df the density plot looks rather strange:

ggplot(rr_stats, aes(x=rr)) + geom_density() + xlim(0,55)

[image]

I've done the exact same operations with another data frame with similar data, but here the plot looks much nicer:

[image]

What am I doing wrong?

(edit) The problem seems to be related to the size of the data frame. With 50,000 entries the issue is barely noticeable: [image]

But with 80,000 entries it starts to become more visible: [image]

You may just need to do a restart. When I run these commands in a new session,

rr_stats <- data.frame(rr = c(30, 28, 29, 32, 32, 33, 28, 25, 35, 24, 28, 22, 30, 26, 22, 26, 23, 25, 23, 23))
library(ggplot2)
ggplot(rr_stats, aes(x = rr)) + geom_density() + xlim(0, 55)

I get the second plot in your question, not the first:

[image]

It seems that your data are discrete. geom_density() gives you a kernel-smoothed density (i.e. you implicitly assume a continuous distribution). To see what goes wrong, I simulated a small example:

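set.seed(1)                             # for reproducibility
N <- 80000
S <- as.data.frame(rbinom(N, 55, 0.5))  # discrete (integer-valued) data
dens80000 <- density(S[, 1])            # all 80,000 observations
dens80000                               # printing shows the chosen bandwidth
dens10000 <- density(S[1:10000, 1])     # first 10,000 observations only
dens10000
par(mfrow = c(1, 2))                    # two plots side by side
plot(dens80000)
plot(dens10000)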

[image: density]

Notice how the bandwidth differs between the two estimates: the larger bandwidth gives a smoother plot. The bandwidth is calculated automatically and shrinks as the sample grows, so with N=80k it is smaller than with N=10k. Because your data are discrete, the smaller bandwidth resolves the individual integer values and produces a 'peaky' estimated density. This can of course be solved by setting a larger bandwidth, or simply by using a more appropriate plot.

plot(density(S[,1],bw=2))

[image]
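To make the shrinking bandwidth described above concrete, here is a minimal sketch that computes the default bandwidth for growing subsets of the same kind of simulated data. It assumes the default rule used by density(), bw.nrd0(); the seed and the subset sizes are arbitrary choices for illustration.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- rbinom(80000, 55, 0.5)          # discrete data, as in the simulation above

ns  <- c(1000, 10000, 80000)         # illustrative sample sizes
bws <- sapply(ns, function(n) bw.nrd0(x[seq_len(n)]))  # default bandwidth of density()

# The bandwidth scales roughly with n^(-1/5), so it keeps shrinking as n grows
print(data.frame(n = ns, bw = round(bws, 3)))
```

The values here are integers spaced one unit apart, so once the bandwidth falls well below 1 the kernels around neighbouring values overlap less and the separate peaks start to show.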

In ggplot2 you can widen the bandwidth with the adjust argument of stat_density() (a multiplier on the default bandwidth), which geom_density() passes along, e.g.:

ggplot(S, aes(x = S[, 1])) + geom_density(adjust = 2) + xlim(0, 55)

[image]

I'm not sure if there is a more elegant way to set the bandwidth in ggplot, but will look into it when I have the time.
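As a possible follow-up, here is a sketch of two options in ggplot2. It assumes a reasonably recent ggplot2 (3.3.0 or later, for after_stat()), and the column name x and the seed are made up for the example. geom_density() accepts a bw argument that sets the bandwidth directly, and for discrete data a bar chart of proportions may be the "more appropriate plot" mentioned earlier.

```r
library(ggplot2)

set.seed(1)                                   # arbitrary seed, only for reproducibility
S <- data.frame(x = rbinom(80000, 55, 0.5))   # simulated stand-in for the real data

# Option 1: set the kernel bandwidth directly (2, as in the base R call)
p_density <- ggplot(S, aes(x = x)) +
  geom_density(bw = 2) +
  xlim(0, 55)

# Option 2: for discrete values, show the share of observations at each value
p_bars <- ggplot(S, aes(x = x)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  ylab("proportion")

print(p_density)
print(p_bars)
```

Unlike adjust, which rescales the automatically chosen bandwidth, bw fixes it, so the amount of smoothing no longer depends on the number of rows.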
