How to measure area between 2 distribution curves in R / ggplot2

As a specific example, imagine x is a continuous variable between 0 and 10, the red line is the distribution of "goods", and the blue line is the distribution of "bads". I'd like to see whether there is value in incorporating this variable into a check for 'goodness', but first I'd like to quantify the amount of stuff in the areas where blue > red.

Because this is a density chart, the two curves look like they are on the same scale, but in reality my sample contains 98 times more goods than bads. That complicates things: the goal isn't just to measure the area under a curve, but to measure how much of the bad sample lies in the regions where its curve is above the red one.

I've been working to learn R, but am not even sure how to approach this one; any help is appreciated.

EDIT: sample data: http://pastebin.com/7L3Xc2KU (essentially a few million rows of that).

The graph is created with:

# color by the good/bad flag; a density plot needs only the x variable
graph <- qplot(sample_x, data=sample_data, geom="density", color=factor(bad_is_1))

The only way I can think of to do this is to calculate the area between the curves using simple trapezoids. First we manually compute the densities:

d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])

Now we create functions that interpolate between our observed density points:

f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)

Next we find the x range over which the two densities overlap:

ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
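The overlap range matters because the interpolating functions are only defined on each density's own grid. A minimal sketch, using made-up normal samples (`da`, `db`, `fa`, `fb` are illustrative names, not part of the question's data), shows that `approxfun()` with its default `rule = 1` returns `NA` outside that grid:

```r
# Toy densities on different supports (made-up data, for illustration only)
set.seed(1)
da <- density(rnorm(200, mean = 0))
db <- density(rnorm(200, mean = 3))
fa <- approxfun(da$x, da$y)
fb <- approxfun(db$x, db$y)

# Outside a curve's own grid the interpolator returns NA (default rule = 1)
fa(max(da$x) + 1)
fb(min(db$x) - 1)

# Inside the overlap, both interpolators return finite values
rng <- c(max(min(da$x), min(db$x)), min(max(da$x), max(db$x)))
mid <- mean(rng)
c(fa(mid), fb(mid))
```

Restricting the evaluation points to the overlap avoids `NA` values in the later subtraction and sum, at the cost of ignoring any tail that lies outside the overlap.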

and place 500 evenly spaced points across it:

i <- seq(min(ovrng), max(ovrng), length.out=500)

Now we calculate the vertical distance between the density curves:

h <- f0(i)-f1(i)

and using the formula for the area of a trapezoid we add up the area for the regions where h >= 0, that is, where d0 >= d1 (swap f0 and f1 in h to measure the regions where d1 > d0 instead):

# trapezoid areas, kept only where the right-hand endpoint has h >= 0
area <- sum((h[-1] + h[-length(h)]) / 2 * diff(i) * (h[-1] >= 0))
area
# [1] 0.1957627
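The question mentions roughly 98 times more goods than bads, so unit-area densities understate how dominant the goods are. The following is a sketch of a variant of the trapezoid idea, not the exact code above: it evaluates both densities on one shared grid (so no interpolation is needed and tails outside the overlap are kept) and optionally scales each curve by a weight such as the group size. The helper name `area_where_above`, its arguments, and the simulated `good`/`bad` samples are all hypothetical.

```r
# Hypothetical helper: trapezoid area of (w_hi * f_hi - w_lo * f_lo)
# over the region where that difference is positive.
area_where_above <- function(x_hi, x_lo, w_hi = 1, w_lo = 1, n = 1024) {
  # Shared grid padded by 3 bandwidths so the kernel tails are included
  pad <- 3 * max(bw.nrd0(x_hi), bw.nrd0(x_lo))
  lo <- min(x_hi, x_lo) - pad
  hi <- max(x_hi, x_lo) + pad
  d_hi <- density(x_hi, n = n, from = lo, to = hi)
  d_lo <- density(x_lo, n = n, from = lo, to = hi)
  # Clamp the difference at zero so only the "hi above lo" part contributes
  h <- pmax(w_hi * d_hi$y - w_lo * d_lo$y, 0)
  sum((h[-1] + h[-length(h)]) / 2 * diff(d_hi$x))
}

# Made-up data with a 98:1 ratio of goods to bads
set.seed(42)
good <- rnorm(9800, mean = 4, sd = 1)
bad  <- rnorm(100,  mean = 7, sd = 1)

# Unit-area densities: share of the bad curve lying above the good curve
a_unit <- area_where_above(bad, good)

# Count-scaled densities: approximate number of bads in excess of goods
a_count <- area_where_above(bad, good,
                            w_hi = length(bad), w_lo = length(good))
c(unit = a_unit, count = a_count)
```

With unit-area curves the result is a fraction between 0 and 1; with count weights it is on the scale of observations, which is closer to what the question asks for.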

We can plot the region using:

plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
# draw a vertical segment at every 5th grid point where d0 is above d1
jj <- which(h > 0 & seq_along(h) %% 5 == 0)
j <- i[jj]
segments(j, f1(j), j, f1(j)+h[jj])


Here's a way to shade the area between two density plots and calculate the magnitude of that area.

library(ggplot2)

# Create some fake data
set.seed(10)
dat = data.frame(x=c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)), 
                 group=c(rep("Bad", 1000), rep("Good", 2000)))

# Plot densities
# Use y=after_stat(count) to get counts on the vertical axis
# (the older ..count.. notation is deprecated as of ggplot2 3.4.0)
p1 = ggplot(dat) +
       geom_density(aes(x=x, y=after_stat(count), colour=group), lwd=1)

Some extra calculations to shade the area between the two density plots (adapted from another SO question):

pp1 = ggplot_build(p1)

# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group==1],
                 ymin=pp1$data[[1]]$y[pp1$data[[1]]$group==1],
                 ymax=pp1$data[[1]]$y[pp1$data[[1]]$group==2])

# We want ymax and ymin to differ only when the density of "Good" 
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]

# Shade the area between "Good" and "Bad"
p1a = p1 +  
    geom_ribbon(data=dat2, aes(x=x, ymin=ymin, ymax=ymax), fill='yellow', alpha=0.5)

Here are the two plots (p1 and p1a):

To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group (or you can continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated):

## Calculate densities for Bad and Good. 
# Use same number of points and same x-range for each group, so that the density 
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest 
# power of 2 anyway.
bad = density(dat$x[dat$group=="Bad"], 
             n=1024, from=min(dat$x), to=max(dat$x))
good = density(dat$x[dat$group=="Good"], 
             n=1024, from=min(dat$x), to=max(dat$x))

## Normalize so that densities sum to number of rows in each group

# Number of rows in each group
counts = tapply(dat$x, dat$group, length)

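# Index by name with [[ ]] to get a plain scalar; counts[1] is a length-1 array,
# and recycling it against a longer vector is deprecated in recent R versions
bad$y = counts[["Bad"]]/sum(bad$y) * bad$y
good$y = counts[["Good"]]/sum(good$y) * good$y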

## Results

# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
# [1] 1931.495  (out of 2000 total in the data frame)

# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
# [1] 317.7315  (out of 1000 total in the data frame)
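As a cross-check on the density-based numbers, one could also count the actual observations that fall in the x-regions where the count-scaled "Good" curve is above the "Bad" curve. The sketch below regenerates the same fake data so that it runs on its own; the names `good_above`, `idx`, `in_region`, and `counts_tab` are illustrative, and each observation is assigned to the grid point at the left edge of its interval, which is an approximation.

```r
# Same fake data as above, regenerated so this block is self-contained
set.seed(10)
dat <- data.frame(x = c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
                  group = c(rep("Bad", 1000), rep("Good", 2000)))

# Densities on a shared grid
rng  <- range(dat$x)
bad  <- density(dat$x[dat$group == "Bad"],  n = 1024, from = rng[1], to = rng[2])
good <- density(dat$x[dat$group == "Good"], n = 1024, from = rng[1], to = rng[2])

# Scale each curve by its group size and flag grid points where Good is higher
n_bad  <- sum(dat$group == "Bad")
n_good <- sum(dat$group == "Good")
good_above <- n_good * good$y > n_bad * bad$y

# Assign every observation to a grid interval and look up the flag there
idx <- findInterval(dat$x, good$x, all.inside = TRUE)
in_region <- good_above[idx]

# Observed counts per group, inside and outside the "Good above Bad" region
counts_tab <- table(group = dat$group, good_above = in_region)
counts_tab
```

The counts from this table should be in the same ballpark as the sums of the normalized densities, since both describe the same regions of x.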
