R: Find maximum of density plot

I have a data frame myData with around 25,000 rows; its column attr has values from 0 to 45,600. I am not sure how to make a simplified or reproducible example...

Anyway, I am plotting the density of attr as below, and I also find the attr value where the density is at its maximum:

library(ggplot2)
max <- which.max(density(myData$attr)$y)
density(myData$attr)$x[max]
ggplot(myData, aes(x=attr))+ 
  geom_density(color="darkblue", fill="lightblue")+
  geom_vline(xintercept = density(myData$attr)$x[max])+
  xlab("attr")

Here is the plot I got, with the x-intercept at the maximum point: [image]

Since the data is skewed, I then tried to draw the x-axis on a log scale by adding scale_x_log10() to the ggplot call. Here is the new graph: [image]

My questions now are:

1. Why does it have 2 maximum points now? Why is my x-intercept no longer at the maximum point(s)?

2. How do I find the intercepts for the 2 new maximum points?

Finally, I attempt to convert the y-axis to count instead:

ggplot(myData, aes(x=attr)) +
  stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)+
  xlab("attr")+
  scale_x_log10()

I got the following plot: [image]

3. How do I find the count of the 2 peaks?

Why the density shapes are different

To put my comments into a fuller context, ggplot takes the log before doing the density estimation, which causes the difference in shape because the binning covers different parts of the domain. For example,

(bins <- seq(1, 10, length.out = 10))
#>  [1]  1  2  3  4  5  6  7  8  9 10
(bins_log <- 10^seq(log10(1), log10(10), length.out = 10))
#>  [1]  1.000000  1.291550  1.668101  2.154435  2.782559  3.593814  4.641589
#>  [8]  5.994843  7.742637 10.000000

library(ggplot2)

ggplot(data.frame(x = c(bins, bins_log), 
                  trans = rep(c('identity', 'log10'), each = 10)), 
       aes(x, y = trans, col = trans)) + 
    geom_point()

[image: even bins vs. log bins]

This binning can affect the resulting density shape. For example, compare an untransformed density:

d <- density(mtcars$disp)
plot(d)

[image: linear bins]

to one which is logged beforehand:

d_log <- density(log10(mtcars$disp))
plot(d_log)

[image: log before density]

Note that the heights of the modes flip. I believe what you are asking for is the first one, but with the log transformation applied after the density estimation, i.e.,

d_x_log <- d
d_x_log$x <- log10(d_x_log$x)
plot(d_x_log)

[image: density before log]

Here the modes are similar, just compressed.

Moving to ggplot

When moving to ggplot, the easiest way to do the density estimation before the log transformation is to compute it outside of ggplot beforehand:

library(ggplot2)

d <- density(mtcars$disp)

ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'burlywood', alpha = 0.3) + 
    scale_x_log10()

[image: ggplot with density before log]

Finding modes

Finding the mode when there is only one is relatively easy; it's just d$x[which.max(d$y)]. But when you have multiple modes, that's not good enough, since it will only give you the highest one. A solution is to effectively take the derivative and look for where the slope changes from positive to negative. We can do this numerically with diff, and since we only care about whether the result is positive or negative, call sign on that to turn everything into -1 and 1.* If we call diff on that result, everything will be 0 except the maximums and minimums, which will be -2 and 2, respectively. We can then use which to find the values less than 0, and use those indices to subset. (Because diff returns a vector one element shorter than its input rather than padding with NA, you'll have to add one to the indices.) Altogether, as a function designed to work on a density object:

d <- density(mtcars$disp)

modes <- function(d){
    i <- which(diff(sign(diff(d$y))) < 0) + 1
    data.frame(x = d$x[i], y = d$y[i])
}

modes(d)
#>          x           y
#> 1 128.3295 0.003100294
#> 2 305.3759 0.002204658

d$x[which.max(d$y)]    # double-check
#> [1] 128.3295
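
To see what each step of the sign/diff trick produces, here is a minimal sketch on a toy vector. The vector y and the names slope_sign, turns and peak_idx are made up for illustration and are not part of the answer's code:

```r
# Toy vector (hypothetical data) with local maxima at positions 2 and 6.
y <- c(1, 3, 2, 0, 4, 5, 1)

slope_sign <- sign(diff(y))        # 1 where y rises, -1 where it falls
turns <- diff(slope_sign)          # -2 at a local maximum, 2 at a local minimum
peak_idx <- which(turns < 0) + 1   # shift by one to index back into y

slope_sign
#> [1]  1 -1 -1  1  1 -1
turns
#> [1] -2  0  2  0 -2
peak_idx
#> [1] 2 6
y[peak_idx]
#> [1] 3 5
```

The + 1 is needed because each call to diff drops one element, so position k in turns describes the turn at position k + 1 of y.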

We can add the modes to our plot, and they'll get transformed nicely:

ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'mistyrose', alpha = 0.3) + 
    geom_vline(xintercept = modes(d)$x) +
    scale_x_log10()

[image: logged ggplot with mode lines]
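
If you do want the peaks of the log-first curve instead (the shape you get from adding scale_x_log10() to geom_density), one possible sketch is to estimate the density on log10(x) and back-transform the peak locations. This assumes density()'s defaults are close enough to what ggplot uses internally; the evaluation grid differs, so the values may not match the drawn curve exactly. The names d_log, peaks_log and x_original are illustrative:

```r
# Same peak finder as above, repeated so this block runs on its own.
modes <- function(d) {
    i <- which(diff(sign(diff(d$y))) < 0) + 1
    data.frame(x = d$x[i], y = d$y[i])
}

# Estimate the density of the logged data, then map the peak
# locations from log10 units back to the original units.
d_log <- density(log10(mtcars$disp))
peaks_log <- modes(d_log)
peaks_log$x_original <- 10^peaks_log$x

peaks_log
```

The x_original column holds values that could be passed to geom_vline(xintercept = ...) on a plot that uses scale_x_log10().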

Plotting counts instead of density

To turn the y-axis into counts instead of density, multiply y by the number of observations, which is stored in the density object as n:

ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'thistle', alpha = 0.3) + 
    geom_vline(xintercept = modes(d)$x) +
    scale_x_log10()

[image: logged ggplot of count density]
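
As a rough sanity check on that scaling, the sketch below integrates both curves with the trapezoid rule. The helper area is hypothetical, written only for this check. Note that the scaled height is a count per unit of x, not the number of observations sitting exactly at a peak:

```r
d <- density(mtcars$disp)

# Trapezoid-rule area under a curve sampled at the points (x, y).
area <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area(d$x, d$y)          # close to 1: a density integrates to one
area(d$x, d$y * d$n)    # close to d$n, the 32 rows of mtcars
```

Under the same scaling, the heights of the peaks on the count axis would be modes(d)$y * d$n.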

In this case it looks a little silly because there are only 32 observations spread over a wide domain, but with a larger n and a smaller domain, it is more interpretable:

d <- density(diamonds$carat, n = 2048)

ggplot(data.frame(x = d$x, y = d$y * d$n), aes(x, y)) + 
    geom_density(stat = "identity", fill = 'papayawhip', alpha = 0.3) + 
    geom_point(data = modes(d), aes(y = y * d$n)) +
    scale_x_log10()

[image: diamonds count density plot]


* Or 0 if the value is exactly 0, but that's unlikely here, and the approach works fine regardless.
