R: Find maximum of density plot
I have a data frame, myData, with around 25,000 rows and a column attr whose values range from 0 to 45,600. I am not sure how to make a simplified or reproducible dataset...
Anyway, I am plotting the density of attr as shown below, and I also find the attr value where the density is at its maximum:
library(ggplot2)
# Estimate the density once and reuse it (also avoids masking base::max)
dens <- density(myData$attr)
peak <- dens$x[which.max(dens$y)]  # attr value where the density is highest
peak
ggplot(myData, aes(x = attr)) +
  geom_density(color = "darkblue", fill = "lightblue") +
  geom_vline(xintercept = peak) +
  xlab("attr")
Here is the plot I got, with the x-intercept at the maximum point:
Since the data is skewed, I then attempted to draw the x-axis on a log scale by adding scale_x_log10() to the ggplot call. Here is the new graph:
My questions now are:
1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?
2. How do I find the intercepts for the 2 new maximum points?
Finally, I attempted to convert the y-axis to count instead:
ggplot(myData, aes(x = attr)) +
  # after_stat(count) is the current spelling of the older ..count.. notation
  stat_density(aes(y = after_stat(count)), color = "black", fill = "blue", alpha = 0.3) +
  xlab("attr") +
  scale_x_log10()
I got the following plot:
3. How do I find the count of the 2 peaks?
To put my comments into a fuller context: ggplot takes the log before doing the density estimation, which changes the shape because the binning covers different parts of the domain. For example,
(bins <- seq(1, 10, length.out = 10))
#> [1] 1 2 3 4 5 6 7 8 9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#> [1] 1.000000 1.291550 1.668101 2.154435 2.782559 3.593814 4.641589
#> [8] 5.994843 7.742637 10.000000
library(ggplot2)
ggplot(data.frame(x = c(bins, bins_log),
                  trans = rep(c('identity', 'log10'), each = 10)),
       aes(x, y = trans, col = trans)) +
  geom_point()
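The same point can be checked numerically. The sketch below recreates the bins and bins_log vectors from the example above and compares the gaps between neighbouring points; it only illustrates the spacing and is not how ggplot computes anything internally.

```r
# Recreate the two sets of points from the example above
bins <- seq(1, 10, length.out = 10)
bins_log <- 10^seq(log10(1), log10(10), length.out = 10)

# Gaps between neighbouring points on the original scale
diff(bins)       # constant: every gap is 1
diff(bins_log)   # gaps grow as x grows

# On the log10 scale, the second set is the evenly spaced one
diff(log10(bins_log))   # constant: every gap is 1/9
```

So a grid that is even on the log scale puts more of its points at small values of x and fewer at large ones.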
This binning can affect the resulting density shape. For example, compare an untransformed density:
d <- density(mtcars$disp)
plot(d)
to one which is logged beforehand:
d_log <- density(log10(mtcars$disp))
plot(d_log)
Note that the relative heights of the modes flip. I believe what you are asking for is the first one, but with the log transformation applied after the density estimation, i.e.
d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)
Here the modes are similar, just compressed.
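To check the flip numerically rather than by eye, one option is to compare where the highest mode sits under each approach. This is a rough sketch; the names raw_peak and log_peak are introduced here for illustration, and the logged result is back-transformed with 10^ so that both values are in the units of disp.

```r
d <- density(mtcars$disp)              # density of the raw values
d_log <- density(log10(mtcars$disp))   # density of the logged values

# Location of the highest mode in each case, in the original units
raw_peak <- d$x[which.max(d$y)]
log_peak <- 10^d_log$x[which.max(d_log$y)]

c(raw = raw_peak, logged_first = log_peak)
```

If the two numbers differ substantially, the highest mode has moved from one cluster of values to the other, which is the flip visible in the plots.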
When moving to ggplot, to do the density estimation before the log transformation, it's easiest to do it outside of ggplot beforehand:
library(ggplot2)
d <- density(mtcars$disp)
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) +
scale_x_log10()
Finding modes when there's a single one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only show you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that, everything will be 0 except the maximums and minimums, which will be -2 and 2, respectively. We can then look for which values are less than 0, which we can use to subset. (Because diff returns a vector one element shorter than its input instead of padding it with NAs, you'll have to add one to the indices.) Altogether, designed to work on a density object,
d <- density(mtcars$disp)
modes <- function(d){
i <- which(diff(sign(diff(d$y))) < 0) + 1
data.frame(x = d$x[i], y = d$y[i])
}
modes(d)
#> x y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658
d$x[which.max(d$y)] # double-check
#> [1] 128.3295
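To see what each step of the helper does, it can be traced on a small hand-made vector. The vector y and the intermediate names below are made up for this illustration; the logic is the same diff/sign/diff chain used in modes().

```r
y <- c(1, 3, 2, 5, 4)       # local maxima at positions 2 and 4

slope <- diff(y)            #  2 -1  3 -1
direction <- sign(slope)    #  1 -1  1 -1
turns <- diff(direction)    # -2  2 -2  (negative = peak, positive = trough)

# diff() shortens the vector by one each time, hence the + 1
peaks <- which(turns < 0) + 1
peaks
#> [1] 2 4
y[peaks]
#> [1] 3 5
```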
We can add them to our plot, and they'll get transformed nicely:
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
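If what is wanted is instead the peaks of the curve ggplot draws when scale_x_log10() is added to the raw data (the two maxima in the question), one possible approach is to run the same helper on the density of the logged values and back-transform the result. This is a sketch on mtcars$disp rather than the asker's myData, and the column name x_original is introduced here for illustration. The grid density() uses by default may differ slightly from the one ggplot uses, so the positions should be treated as approximate. Values of 0 would also need to be dropped or offset first, since log10(0) is -Inf.

```r
# Same helper as in the answer: local maxima of a density object
modes <- function(d) {
  i <- which(diff(sign(diff(d$y))) < 0) + 1
  data.frame(x = d$x[i], y = d$y[i])
}

# Density of the logged values, i.e. log first, then estimate
d_log <- density(log10(mtcars$disp))

peaks_log <- modes(d_log)
peaks_log$x_original <- 10^peaks_log$x   # back to the units of the data
peaks_log
```

The x_original values can then be passed to geom_vline(xintercept = ...) on a plot that uses scale_x_log10(), in the same way as modes(d)$x above.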
To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n:
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) +
geom_vline(xintercept = modes(d)$x) +
scale_x_log10()
In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with more observations and a smaller domain, it is more interpretable:
d <- density(diamonds$carat, n = 2048)
ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) +
geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) +
geom_point(data = modes(d), aes(y = y * d$n)) +
scale_x_log10()
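For the third question, the height of each peak in count units can be read off with the same scaling. A minimal sketch, again on mtcars$disp and reusing the modes helper; the count column is a name introduced here. Because the values are n times a density, they are counts per unit of x rather than whole numbers of rows.

```r
# Same helper as in the answer: local maxima of a density object
modes <- function(d) {
  i <- which(diff(sign(diff(d$y))) < 0) + 1
  data.frame(x = d$x[i], y = d$y[i])
}

d <- density(mtcars$disp)

peaks <- modes(d)
peaks$count <- peaks$y * d$n   # same scaling as used for the y-axis above
peaks
```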
* Or 0 if the value is exactly 0, but that's unlikely here and will work fine regardless.