
R: Find maximum of density plot

I have data with around 25,000 rows myData with column attr having values from 0 -> 45,600. I am not sure how to make a simplified or reproducible data...

Anyway, I am plotting the density of attr like below, and I also find the attr value where density is maximum:

library(ggplot2)
d <- density(myData$attr)
peak <- which.max(d$y)   # index of the highest density value
d$x[peak]                # attr value at that peak
ggplot(myData, aes(x = attr)) +
  geom_density(color = "darkblue", fill = "lightblue") +
  geom_vline(xintercept = d$x[peak]) +
  xlab("attr")

Here is the plot I got, with the vertical line at the maximum point: [image]

Since the data is skewed, I then tried drawing the x-axis on a log scale by adding scale_x_log10() to the ggplot call. Here is the new graph: [image]

My questions now are:

1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?

2. How do I find the intercepts for the 2 new maximum points?

Finally, I attempt to convert the y-axis to count instead:

ggplot(myData, aes(x=attr)) +
  stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)+
  xlab("attr")+
  scale_x_log10()

I got the following plot: [image]

3. How do I find the count of the 2 peaks?

Why the density shapes are different

To put my comments into a fuller context, ggplot is taking the log before doing the density estimation. That changes the shape because the estimate is then computed over points that are evenly spaced on the log scale rather than on the original scale, so the binning covers different parts of the domain. For example,

(bins <- seq(1, 10, length.out = 10))
#>  [1]  1  2  3  4  5  6  7  8  9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#>  [1]  1.000000  1.291550  1.668101  2.154435  2.782559  3.593814  4.641589
#>  [8]  5.994843  7.742637 10.000000

library(ggplot2)

ggplot(data.frame(x = c(bins, bins_log), 
                  trans = rep(c('identity', 'log10'), each = 10)), 
       aes(x, y = trans, col = trans)) + 
    geom_point()

[Figure: evenly spaced bins vs. log-spaced bins]

This binning can affect the resulting density shape. For example, compare an untransformed density:

d <- density(mtcars$disp)
plot(d)

[Figure: density with linear bins]

to one which is logged beforehand:

d_log <- density(log10(mtcars$disp))
plot(d_log)

[Figure: log taken before the density estimate]

Note that the heights of the modes flip. I believe what you are asking for is the first one, but with the log transformation applied after the density estimation, i.e.

d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)

[Figure: density estimated before taking the log]

Here the modes are similar, just compressed.
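
Another way to read the shape change is as a change of variables: if Y = log10(X), the density of Y is f_Y(y) = f_X(10^y) * 10^y * ln(10), so the extra factor 10^y reweights the curve toward larger values and can move or reorder the peaks. A minimal check of that formula, assuming a log-normal X purely for illustration (the parameters are arbitrary and not taken from the question's data):

```r
# Illustrative parameters (assumptions, not from the question)
meanlog <- 2
sdlog   <- 0.5

y <- seq(0, 2, by = 0.1)   # grid on the log10 scale

# Density of Y = log10(X) from the change-of-variables formula
f_y_formula <- dlnorm(10^y, meanlog, sdlog) * 10^y * log(10)

# For log-normal X, log10(X) is normal, so the exact density is known
f_y_exact <- dnorm(y, mean = meanlog / log(10), sd = sdlog / log(10))

all.equal(f_y_formula, f_y_exact)

# The peak moves: mode of X vs. back-transformed mode of log10(X)
c(mode_x = exp(meanlog - sdlog^2), mode_log_back = 10^(meanlog / log(10)))
```

In this toy case the peak of the logged density, mapped back to the original units, sits to the right of the peak of the untransformed density, which is the same kind of shift seen in the plots.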

Moving to ggplot

When moving to ggplot, to do the density estimation before the log transformation it's easiest to do it outside of ggplot beforehand:

library(ggplot2)

d <- density(mtcars$disp)

ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) + 
    scale_x_log10()

[Figure: ggplot with density estimated before the log scale]

Finding modes

Finding the mode when there's a single one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only show you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that, everything will be 0 except the maximums and minimums, which will be -2 and 2, respectively. We can then look for which values are less than 0, which we can use to subset. (Because each diff returns a vector one element shorter than its input, with no NA padding, you'll have to add one to the indices.) Altogether, designed to work on a density object,

d <- density(mtcars$disp)

modes <- function(d){
    i <- which(diff(sign(diff(d$y))) < 0) + 1
    data.frame(x = d$x[i], y = d$y[i])
}

modes(d)
#>          x           y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658

d$x[which.max(d$y)]    # double-check
#> [1] 128.3295
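
To see the sign/diff trick one step at a time, here is the same indexing on a toy vector (the values are made up for illustration):

```r
y <- c(1, 3, 2, 5, 4)            # local maxima at positions 2 and 4

slope  <- diff(y)                #  2 -1  3 -1
dir    <- sign(slope)            #  1 -1  1 -1
change <- diff(dir)              # -2  2 -2  (-2 marks a peak, 2 a trough)

peaks <- which(change < 0) + 1   # 2 4
y[peaks]                         # 3 5
```

This is what modes() does with d$y, returning the matching d$x values alongside.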

We can add the modes to our plot, and they'll get transformed nicely:

ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) + 
    geom_vline(xintercept = modes(d)$x) +
    scale_x_log10()

[Figure: log-scale ggplot with mode lines]
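
If what you want instead are the peaks of the curve as ggplot draws it when given the raw data plus scale_x_log10() (question 2), one option is to estimate the density on the log10 values, find its modes, and back-transform them. This is a sketch that assumes density()'s defaults are close enough to what ggplot uses internally; the evaluation grid may differ slightly, so treat the locations as approximate. modes() is repeated so the snippet runs on its own, and the column name x_original is my own:

```r
modes <- function(d) {
    i <- which(diff(sign(diff(d$y))) < 0) + 1
    data.frame(x = d$x[i], y = d$y[i])
}

# Density estimated on the log10 scale
d_log <- density(log10(mtcars$disp))

m_log <- modes(d_log)
m_log$x_original <- 10^m_log$x   # peak locations back in data units
m_log
```

The x_original values can then be passed to geom_vline(xintercept = ...) on a plot that uses scale_x_log10().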

Plotting counts instead of density

To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n :

ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) + 
    geom_vline(xintercept = modes(d)$x) +
    scale_x_log10()

[Figure: log-scale ggplot of count-scaled density]

In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with a larger n and smaller domain, it is more interpretable:

d <- density(diamonds$carat, n = 2048)

ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) + 
    geom_point(data = modes(d), aes(y = y * d$n)) +
    scale_x_log10()

[Figure: count-scaled density plot for diamonds]
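
For question 3, the peak heights on the count scale follow from the same scaling. A small sketch using mtcars so that it runs without ggplot2; modes() is repeated from above and the column name count is my own:

```r
modes <- function(d) {
    i <- which(diff(sign(diff(d$y))) < 0) + 1
    data.frame(x = d$x[i], y = d$y[i])
}

d <- density(mtcars$disp)

peaks <- modes(d)
peaks$count <- peaks$y * d$n   # d$n is the number of observations
peaks
```

Note that d$n is the sample size, not the n argument of density(), which sets the number of grid points. The scaled value is a density multiplied by the number of observations (a count per unit of x), so it need not be a whole number.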


* Or 0 if the value is exactly 0, but that's unlikely here and will work fine regardless.

