
Kernel Density Estimate (Probability Density Function) is wrong?

I've created a histogram to show the density of the age at which serial killers first killed, and have tried to superimpose a probability density function on it. However, when I use the geom_density() function in ggplot2, I get a density function that looks far too small (area < 1). What is strange is that changing the bin width of the histogram also seems to change the density function: the smaller the bin width, the better the density function appears to fit. Does anyone have guidance on how to make this function fit better, and on why its area is so far below 1?

# Histograms for Age of First Kill:
library(ggplot2)
AFKH <- ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
  geom_histogram(aes(y=..count../sum(..count..)), show.legend = FALSE, binwidth = 3) + # density wasn't working, so had to use ..count../sum(..count..)
  scale_fill_discrete(h = c(200, 10), c = 100, l = 60) + # c =, for color, and l = for brightness, the #h = c() changes the color gradient
  theme(axis.title=element_text(size=22,face="bold"), 
        plot.title = element_text(size=30, face = "bold"),
        axis.text.x = element_text(face="bold", size=14),
        axis.text.y = element_text(face="bold", size=14)) +
  labs(title = "Age of First kill",x = "Age of First Kill", y = "Density")+
  geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7, fill = "white",lwd =1, stat = "density")
AFKH

Bin width = 1

Bin width = 3

We don't have your data set, so let's make one that's reasonably close to it:

set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)

The first thing to notice is that the density curve doesn't change. Look carefully at the y axis on your plots: the peak of the density curve stays at about 0.06 in both. It's the height of the histogram bars that changes, and the y axis rescales accordingly.
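One way to convince yourself that the curve is independent of the histogram: as far as I know, geom_density() relies on stats::density() with its default bandwidth, which never sees a bin width. The base R sketch below integrates that estimate with the trapezoid rule; the area should come out very close to 1. The names `age`, `d` and `area` are mine, and the data are the simulated values from above rather than your real data.

```r
set.seed(3)
age <- rgamma(100, 3, 0.2) + 10   # same simulated ages as df$AgeFirstKill above

d <- density(age)                 # default Gaussian kernel, bandwidth "nrd0"

# Trapezoid rule over the evaluation grid returned by density()
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area                              # close to 1
max(d$y)                          # peak height of the curve
```

Nothing in this calculation refers to bins, which is why the curve stays put while the bars move.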

The reason is that you aren't dividing the height of the histogram bars by their width. With ..count../sum(..count..) the bar heights sum to 1, so the total bar area equals the bin width rather than 1. Your y aesthetic should be ..count../sum(..count..)/binwidth to keep the area constant at 1.
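The arithmetic can be checked without any plotting. This is a base R sketch: `bar_area()` is a hypothetical helper of my own, and the bins are built by hist() rather than by ggplot2, so the bin edges may differ from the plot's (which does not affect the totals). It compares the total bar area under the two scalings.

```r
set.seed(3)
age <- rgamma(100, 3, 0.2) + 10   # same simulated ages as above

# Hypothetical helper: total bar area under each y scaling for bin width bw
bar_area <- function(x, bw) {
  breaks <- seq(floor(min(x)), ceiling(max(x)) + bw, by = bw)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  prop   <- counts / sum(counts)          # the original y: heights sum to 1
  c(unscaled = sum(prop * bw),            # area grows with the bin width
    scaled   = sum(prop / bw * bw))       # area stays at 1
}

res <- sapply(c(1, 3, 7), function(bw) bar_area(age, bw))
res
```

The `unscaled` row equals the bin width each time, while the `scaled` row stays at 1, which is the area a density curve is being compared against.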

To show this, let's wrap your plotting code in a function that allows you to specify the bin width but also takes the binwidth into account when plotting:

draw_it <- function(bw) {
  ggplot(df, aes(AgeFirstKill,fill = cut(AgeFirstKill, 100))) +
  geom_histogram(aes(y=..count../sum(..count..)/bw), show.legend = FALSE, 
                 binwidth = bw) +
  scale_fill_discrete(h = c(200, 10), c = 100, l = 60) + 
  theme(axis.title=element_text(size=22,face="bold"), 
        plot.title = element_text(size=30, face = "bold"),
        axis.text.x = element_text(face="bold", size=14),
        axis.text.y = element_text(face="bold", size=14)) +
  labs(title = "Age of First kill",x = "Age of First Kill", y = "Density") +
  geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7, 
               fill = "white",lwd =1, stat = "density")
}

And now we can do:

draw_it(bw = 1)


draw_it(bw = 3)


draw_it(bw = 7)

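If you would rather not pass the bin width twice, one possible variation is to read it from the `width` variable that stat_bin() computes. The sketch below is an alternative under stated assumptions, not part of the code above: `draw_it2()` is a made-up name, the theme and colour-scale settings are left out for brevity, and it uses after_stat(), which ggplot2 3.3.0 and later offer in place of the `..count..` notation. It may also explain why `..density..` "wasn't working" in the question: as I understand it, the fill mapping splits the data into groups and the built-in density is normalised within each group, whereas sum(count) here runs over all of them.

```r
library(ggplot2)

set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)

# Hypothetical variant of draw_it(): bar heights are divided by the bin
# width reported by stat_bin(), so bw only has to be given once
draw_it2 <- function(bw) {
  ggplot(df, aes(AgeFirstKill, fill = cut(AgeFirstKill, 100))) +
    geom_histogram(aes(y = after_stat(count / sum(count) / width)),
                   show.legend = FALSE, binwidth = bw) +
    labs(title = "Age of First kill", x = "Age of First Kill", y = "Density") +
    geom_density(aes(y = after_stat(density)), alpha = 0.7,
                 fill = "white", lwd = 1)
}

draw_it2(bw = 3)
```

To check the result numerically, layer_data(draw_it2(3), 1) returns the computed bars; summing (ymax - ymin) * (xmax - xmin) over its rows should give 1.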
