
Efficient computation of bivariate empirical cdf in R/Fortran

Given an n*2 data matrix X, I'd like to calculate the bivariate empirical cdf at each observation, i.e. for each i in 1:n, return the proportion of observations whose 1st element is not greater than X[i,1] and whose 2nd element is not greater than X[i,2].
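
Written out as a formula (the notation is mine, assuming X_{j1} and X_{j2} denote the two columns of row j), the quantity being asked for is:

```latex
\hat{F}_n(X_{i1}, X_{i2})
  = \frac{1}{n} \sum_{j=1}^{n}
    \mathbf{1}\{\, X_{j1} \le X_{i1},\; X_{j2} \le X_{i2} \,\},
  \qquad i = 1, \dots, n .
```

Each observation counts itself, so every value is at least 1/n.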

Because of the nested search involved, it gets terribly slow for n ~ 100k, even after porting it to Fortran. Does anyone know of a better way to handle sample sizes like this?

Edit: I believe this problem is similar (in terms of complexity) to computing Kendall's tau, which is O(n^2) when done naively. In that case Knight (1966) has an algorithm that reduces it to O(n log(n)). Just wondering whether an O(n*log(n)) algorithm for the bivariate ecdf is already out there.
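
One way an O(n log(n)) approach could look is sketched below. This is not from the original post and not Knight's algorithm: it is a standard sort-and-sweep over the first coordinate with a Fenwick (binary indexed) tree over the ranks of the second coordinate. The name ecdf2d_fast is hypothetical, and the sketch assumes a numeric n x 2 matrix with no missing values.

```r
# Sketch: bivariate ECDF at each observation in O(n log n).
# ecdf2d_fast is a hypothetical name; assumes no NA values in X.
ecdf2d_fast <- function(X) {
  n <- nrow(X)
  # Rank the second coordinate; tied values share the highest rank,
  # so "rank_j <= rank_p" is equivalent to "y_j <= y_p".
  ry <- rank(X[, 2], ties.method = "max")
  ord <- order(X[, 1], X[, 2])   # sweep in increasing x
  x_sorted <- X[ord, 1]
  tree <- integer(n)             # Fenwick tree of counts indexed by y-rank
  out <- numeric(n)
  i <- 1L
  while (i <= n) {
    # Find the block of points sharing the current x value.
    j <- i
    while (j < n && x_sorted[j + 1L] == x_sorted[i]) j <- j + 1L
    idx <- ord[i:j]
    # Insert the whole block first so that points tied in x count each other.
    for (p in idx) {
      k <- ry[p]
      while (k <= n) {
        tree[k] <- tree[k] + 1L
        k <- k + (k - bitwAnd(k, k - 1L))   # add the lowest set bit
      }
    }
    # Query: how many inserted points have y-rank <= this point's rank.
    for (p in idx) {
      k <- ry[p]
      s <- 0L
      while (k > 0L) {
        s <- s + tree[k]
        k <- bitwAnd(k, k - 1L)             # clear the lowest set bit
      }
      out[p] <- s / n
    }
    i <- j + 1L
  }
  out
}
```

The explicit loops are slow in interpreted R, so for n ~ 100k the same sweep would probably be worth porting to Fortran or C. The point here is only the operation count: one sort plus n tree updates and n tree queries, each O(log n).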

Edit 2: This is the code I have in Fortran, as requested. It is called from R in the usual way, so the R code is omitted here. The code is meant for arbitrary dimensions, but for what I'm doing the bivariate case is enough.

! Calculates multivariate empirical cdf for each point
! n: number of observations
! d: dimension (>=2)
! umat: data matrix
! outvec: vector of ecdf

subroutine mecdf(n,d,umat,outvec)
    implicit none

    integer :: n, d, i, j, k, tempsum
    double precision, dimension(n) :: outvec
    double precision, dimension(n,d) :: umat
    logical :: flag

    do i = 1,n
        tempsum = 0
        do j = 1,n
            flag = .true.
            do k = 1,d
                if (umat(i,k) < umat(j,k)) then
                    flag = .false.
                    exit
                end if
            end do
            if (flag) then
                tempsum = tempsum + 1
            end if
        end do
        outvec(i) = dble(tempsum)/dble(n)
    end do
    return
end subroutine
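
For checking the Fortran output on small inputs, a brute-force R version with the same counting rule can help. The following is a minimal sketch, not part of the original post; mecdf_ref is a hypothetical helper name, and it is O(n^2) just like the subroutine above.

```r
# Brute-force reference mirroring the counting rule of mecdf
# (mecdf_ref is a hypothetical helper name, for checking only).
mecdf_ref <- function(umat) {
  umat <- as.matrix(umat)
  n <- nrow(umat)
  d <- ncol(umat)
  tu <- t(umat)                 # d x n: each column is one observation
  out <- numeric(n)
  for (i in seq_len(n)) {
    # umat[i, ] is recycled down each column of tu, so entry (k, j)
    # is TRUE when umat[j, k] <= umat[i, k].
    below <- tu <= umat[i, ]
    # Observation j counts only if all d coordinates pass the test.
    out[i] <- sum(colSums(below) == d) / n
  }
  out
}
```

For example, mecdf_ref(matrix(c(1, 2, 3, 3, 1, 2), ncol = 2)) evaluates the rows (1,3), (2,1) and (3,2), and the result can be compared element by element with outvec.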

I think my first effort was not really an ecdf, although it did map the points to the interval [0,1]. The example is a 50 x 2 matrix generated with:

#M <- matrix(runif(100), ncol=2)
M <- 
structure(c(0.0468267474789172, 0.296053855214268, 0.205678076483309, 
0.467400068417192, 0.968577065737918, 0.435642971657217, 0.929023026255891, 
0.038406387437135, 0.304360694251955, 0.964778139721602, 0.534192910650745, 
0.741682186257094, 0.0848641532938927, 0.405901980120689, 0.957696850644425, 
0.384813814423978, 0.639882878866047, 0.231505588628352, 0.271994129288942, 
0.786155494628474, 0.349499785574153, 0.279077709652483, 0.206662984099239, 
0.777465222170576, 0.705439242534339, 0.643429880728945, 0.887209519045427, 
0.0794123203959316, 0.849177583120763, 0.704594585578889, 0.736909110797569, 
0.503158083418384, 0.49449566937983, 0.408533290959895, 0.236613316927105, 
0.297427259152755, 0.0677345870062709, 0.623845702270046, 0.139933609170839, 
0.740499466424808, 0.628097783308476, 0.678438259987161, 0.186680511338636, 
0.339367639739066, 0.373212536331266, 0.976724133593962, 0.94558056560345, 
0.610417427960783, 0.887977657606825, 0.663434249348938, 0.447939050383866, 
0.755168803501874, 0.478974275058135, 0.737040047068149, 0.429466919740662, 
0.0021107573993504, 0.697435079608113, 0.444197302218527, 0.108997165458277, 
0.856855363817886, 0.891898229718208, 0.93553287582472, 0.991948011796921, 
0.630414301762357, 0.0604106825776398, 0.908968194155023, 0.0398679254576564, 
0.251426834380254, 0.235532913124189, 0.392070295521989, 0.530511683085933, 
0.319339724024758, 0.534880011575297, 0.92030712752603, 0.138276003766805, 
0.213625695323572, 0.407931711757556, 0.605797187192366, 0.424798395251855, 
0.471233424032107, 0.0105366336647421, 0.625802840106189, 0.524665891425684, 
0.0375960320234299, 0.54812005511485, 0.0105806747451425, 0.438266788609326, 
0.791981092421338, 0.363821814302355, 0.157931488472968, 0.47945317090489, 
0.906797411618754, 0.762243523262441, 0.258681379957125, 0.308056800393388, 
0.91944490163587, 0.412255838746205, 0.347220918396488, 0.68236422073096, 
0.559149842709303), .Dim = c(50L, 2L))

So the task, for each evaluation point, is a single summation of a two-part logical test over N items, which I suspect costs about 3*N operations, i.e. O(N) per point and O(N^2) over all N observations. It might be marginally faster if implemented in Rcpp, but these are already vectorized operations.

# Wrong: ecdf2d <- function(m,i,j) { ord <- rank(m[ , 1]^2+m[ , 2]^2)
#            ord[i]/nrow(m)}  # scales to [0,1] interval

# Proportion of rows of obj with both coordinates not greater than (x, y)
ecdf2d.v2 <- function(obj, x, y) sum( obj[,1] <= x & obj[,2] <= y)/nrow(obj)
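
The function above evaluates the ecdf at one point (x, y). To get a value for every observation, as the question asks, it can be mapped over the rows. The sketch below restates the single-point evaluator under a hypothetical name, ecdf2d_point, using non-strict inequalities to match the question's "not greater than", and uses small made-up data rather than the matrix M.

```r
# Single-point evaluator, restated with "<=" to match the question's definition
# (ecdf2d_point is a hypothetical name used only in this sketch).
ecdf2d_point <- function(obj, x, y) sum(obj[, 1] <= x & obj[, 2] <= y) / nrow(obj)

set.seed(42)                         # illustrative data, not the M above
pts <- matrix(runif(20), ncol = 2)   # 10 observations in 2 dimensions

# Evaluate at every row: one vectorized O(n) pass per observation, O(n^2) overall.
vals <- mapply(function(x, y) ecdf2d_point(pts, x, y), pts[, 1], pts[, 2])
```

Each point counts itself, so every entry of vals lies between 1/nrow(pts) and 1. This is still quadratic in n, so it illustrates the definition rather than solving the n ~ 100k case.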

