I have a simple data frame with a lot of entries in it. I would like to plot a density plot of the distribution.
Quick data frame summary:
summary(rr_stats)
rr
Min. : 1.00
1st Qu.:17.00
Median :20.00
Mean :20.33
3rd Qu.:23.00
Max. :96.00
The first 20 entries in my df:
rr_stats[1:20,1]
[1] 30 28 29 32 32 33 28 25 35 24 28 22 30 26 22 26 23 25 23 23
When I plot this df the density plot looks rather strange:
ggplot(rr_stats, aes(x=rr)) + geom_density() + xlim(0,55)
I've done the exact same operations with another data frame with similar data, but here the plot looks much nicer:
What am I doing wrong?
(edit) The problem seems to be related to the size of the data frame. With 50,000 entries the issue is barely noticeable:
But with 80,000 entries it starts being more visible:
You may just need to do a restart. When I run these commands in a new session,
rr_stats <- data.frame(rr = c(30,28, 29, 32, 32, 33, 28, 25, 35, 24, 28, 22, 30, 26, 22, 26, 23, 25, 23, 23))
require(ggplot2)
ggplot(rr_stats, aes(x=rr)) + geom_density() + xlim(0,55)
I get the second plot in your question, not the first:
It seems that your data is discrete. geom_density() gives you a kernel-smoothed density (i.e. you implicitly assume a continuous distribution). To see what goes wrong I simulated a little example:
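N <- 80000
S <- as.data.frame(rbinom(N, 55, 0.5))  # discrete (integer-valued) data
dens80000 <- density(S[, 1])            # all 80,000 observations
dens80000                               # printing shows the chosen bandwidth
dens10000 <- density(S[1:10000, 1])     # first 10,000 observations only
dens10000
par(mfrow = c(1, 2))
plot(dens80000)
plot(dens10000)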
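The automatically chosen bandwidths can also be compared directly. This is a minimal sketch, assuming density() is left at its default bandwidth rule (bw.nrd0 in base R's stats package) and using a fresh simulated sample rather than your data; the names x, bw_10k and bw_80k are made up for the illustration:

```r
set.seed(1)
x <- rbinom(80000, 55, 0.5)    # simulated stand-in, not the original data

# bw.nrd0() is the rule density() uses when bw is not specified
bw_10k <- bw.nrd0(x[1:10000])  # bandwidth for the first 10,000 values
bw_80k <- bw.nrd0(x)           # bandwidth for all 80,000 values

c(n10000 = bw_10k, n80000 = bw_80k)
```

bw.nrd0 scales with n^(-1/5), so the value for the full sample should come out noticeably smaller than the one for the 10,000-observation subset.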
Notice how the bandwidth differs: the larger bandwidth chosen for the smaller sample gives a smoother plot. The bandwidth is calculated automatically and shrinks as the sample grows, so when N=80k the bandwidth is smaller than for N=10k, which in turn leads to a 'peaky' estimated density because of the discrete nature of your data. This can of course be solved by changing the bandwidth to a higher setting or simply using a plot that is more appropriate for discrete data, such as a bar chart.
plot(density(S[,1],bw=2))
or in ggplot you can use the adjust argument of stat_density(), which geom_density() passes on and which multiplies the automatically chosen bandwidth, e.g. do something like:
ggplot(S, aes(x=S[,1])) + geom_density(adjust = 2) + xlim(0,55)
I'm not sure if there is a more elegant way to set the bandwidth in ggplot, but will look into it when I have the time.
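For what it's worth, here is a sketch of both fixes in ggplot2, again on simulated data. It assumes a ggplot2 version in which geom_density() accepts a bw argument (I believe any reasonably recent release does), and the names rr_sim, freq, p_dens and p_bar are hypothetical:

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the real data: 80,000 integer-valued observations
rr_sim <- data.frame(rr = rbinom(80000, 55, 0.5))

# Option 1: kernel density with a fixed bandwidth instead of the automatic one
p_dens <- ggplot(rr_sim, aes(x = rr)) +
  geom_density(bw = 2)

# Option 2: treat the data as discrete and plot the relative frequency
# of each distinct value as a bar
freq <- as.data.frame(prop.table(table(rr_sim$rr)), stringsAsFactors = FALSE)
names(freq) <- c("rr", "prop")
freq$rr <- as.integer(freq$rr)   # table() labels come back as character

p_bar <- ggplot(freq, aes(x = rr, y = prop)) +
  geom_col(width = 1) +
  labs(x = "rr", y = "proportion")

# print(p_dens) or print(p_bar) draws the respective plot
```

The bar chart avoids the bandwidth question altogether, because nothing is smoothed: each bar is simply the share of observations with that value.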