简体   繁体   English

如何计算R中曲线下面积的95%可信极限?

[英]how can I calculate the 95% credible limit of an area under a curve in R?

I have the following distribution: 我有以下分布:

x<-c(22.5,28.14285714,33.78571429,39.42857143,45.07142857,50.71428571,56.35714286,62,67.64285714,73.28571429,78.92857143,84.57142857,90.21428571,95.85714286,101.5,107.1428571,112.7857143,118.4285714,124.0714286,129.7142857,135.3571429,141,146.6428571,152.2857143,157.9285714,163.5714286,169.2142857,174.8571429,180.5,186.1428571,191.7857143,197.4285714,203.0714286,208.7142857,214.3571429,220,225.6428571,231.2857143,236.9285714,242.5714286,248.2142857,253.8571429,259.5,265.1428571,270.7857143,276.4285714,282.0714286,287.7142857,293.3571429,299)
y<-c(0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.00328839614285714,0.00296425985714286,0.002655899,0.00236187857142857,0.002080895,0.00181184271428571,0.00155376085714286,0.00130578928571429,0.001074706,0.000877193,0.000709397142857142,0.000567189714285714,0.000447254,0.000346858571428571,0.000263689142857143,0.000195768428571429,0.000141427,9.92657142857141e-05,6.77857142857142e-05,4.48571428571428e-05,2.86428571428571e-05,1.75142857142857e-05,1.01357142857143e-05,5.52e-06,2.78857142857142e-06,1.27285714285713e-06,5.00714285714284e-07,1.5742857142857e-07,3.29857142857142e-08,2.78857142857137e-09,1.74e-12)

plot(x,y)

I would like to find the value of x that separates an area of 0.95 under the distribution to the left and 0.05 of the area to the right (one-tailed 95% interval of credibility). 我想找到x的值,该值将分布左侧的0.95的面积与右侧的0.05的面积(可信度的95%尾巴)分开。

I guess I have to fit my empirical curve to a function and then integrate the function so I can obtain the desired value, but I don't really know from where to start. 我想我必须将经验曲线拟合到一个函数上,然后对函数进行积分,以便获得所需的值,但是我真的不知道从哪里开始。

How can this possibly be done in R? 如何在R中完成此操作?

As other answers have indicated, this is an integration under the curve problem, paired with determining where the area reaches 95% of the total area. 正如其他答案所表明的,这是曲线问题下的积分,与确定面积达到总面积的95%的位置配对。 I take a simpler approach to the integration than David's answer does. 我采用的集成方法比David的答案更简单。 Rather than interpolate a curve and integrate that, I just use the trapezoid integration rules to get the area contributed by each interval. 我没有使用插值法并对其进行积分,而只是使用梯形积分法则来获取每个区间贡献的面积。 Those individual areas are then added to get the total area. 然后将这些单个区域相加以获得总面积。 Then the index where the cumulative area surpasses 95% of the total area is found and a line can be drawn with that. 然后找到累积面积超过总面积95%的索引,并可以用它画一条线。

piece_area <- c(0, (x[-1] - x[-length(x)])*(y[-1] + y[-length(y)]) / 2)
cum_area <- cumsum(piece_area)
total_area <- cum_area[length(cum_area)]
idx095 <- min(which(cum_area > 0.95 * total_area))

abline(v = x[idx095])

在此处输入图片说明

Higher resolution of the exact point where 95% is crossed can be obtained by using more points in the original sample of the distribution. 可以通过在分布的原始样本中使用更多点来获得越过95%的精确点的更高分辨率。

It is an integration problem ( sum under the curve ). 这是一个积分问题(曲线下的总和)。 You can divide your integration into a square one + the curve one. 您可以将积分分为平方一和曲线一。 However, you can use a quick and dirty approximation via splines: 但是,可以通过样条线使用快速且脏的近似值:

y<-c(0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.003541755,0.00328839614285714,0.00296425985714286,0.002655899,0.00236187857142857,0.002080895,0.00181184271428571,0.00155376085714286,0.00130578928571429,0.001074706,0.000877193,0.000709397142857142,0.000567189714285714,0.000447254,0.000346858571428571,0.000263689142857143,0.000195768428571429,0.000141427,9.92657142857141e-05,6.77857142857142e-05,4.48571428571428e-05,2.86428571428571e-05,1.75142857142857e-05,1.01357142857143e-05,5.52e-06,2.78857142857142e-06,1.27285714285713e-06,5.00714285714284e-07,1.5742857142857e-07,3.29857142857142e-08,2.78857142857137e-09,1.74e-12)
x<-c(22.5,28.14285714,33.78571429,39.42857143,45.07142857,50.71428571,56.35714286,62,67.64285714,73.28571429,78.92857143,84.57142857,90.21428571,95.85714286,101.5,107.1428571,112.7857143,118.4285714,124.0714286,129.7142857,135.3571429,141,146.6428571,152.2857143,157.9285714,163.5714286,169.2142857,174.8571429,180.5,186.1428571,191.7857143,197.4285714,203.0714286,208.7142857,214.3571429,220,225.6428571,231.2857143,236.9285714,242.5714286,248.2142857,253.8571429,259.5,265.1428571,270.7857143,276.4285714,282.0714286,287.7142857,293.3571429,299)

sp=smooth.spline(x,y)
f = function(t)
{
    predict(sp,t)$y
}   

N=500 # this is an accuracy parameter
xBis=seq(x[1],x[length(x)],length=N)
yBis=sapply(x,f)

J = function (input)
{   # This function takes input in 1:N
    Integral = 0
    dx=(x[length(x)]-x[1])/N

    for ( j in 1: input)
{   z=xBis[j]
    Integral=Integral+ f(x[1]+z)*dx
}
J=Integral
}
######
I=J(N) # This is the value of the sum under the curve
# It should be roughly equal (given the shape of the curve) to:
index=max(which(y==y[1]))
I = (x[index]-x[1])*(y[index])*3/2
######
res=sapply(1:N,J)/I
Index5=max(which(res<=.05))
Index95=min(which(res>=.95))

x5=xBis[Index5] # This is the 5% quantile 
x95=xBis[Index95]

HTH HTH

Let me know if anything is unclear 让我知道是否有任何不清楚的地方

PS I think there are far better ways to do it .. 附言:我认为有更好的方法来做..

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM