
Speed up R algorithm to calculate distance matrix for Hellinger distance

I am looking for a way to speed up this algorithm.

My situation is as follows. I have a dataset of 25,000 users with 6 habits each. My goal is to build a hierarchical clustering of the 25,000 users. I run this on a server with 16 cores and 128 GB RAM. Calculating the distance matrix for just 10,000 users took 3 weeks of non-stop computation on 6 cores. As you can imagine, this is too long for my research.

For each of the 6 habits I have created a probability mass function (PMF). The PMFs may differ in size (number of columns) per habit. Some habits have 10 columns, some 256, all depending on the user with the most unhabitual behavior.

The first step in my algorithm is to build a distance matrix. I use the Hellinger distance to calculate the distance, unlike some packages that use e.g. Cartesian/Manhattan distance. I do need the Hellinger distance, see https://en.wikipedia.org/wiki/Hellinger_distance
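For reference, for two discrete distributions P = (p_1, ..., p_k) and Q = (q_1, ..., q_k) with the same number of columns k, the definition on that page can be written as follows (the symbols are chosen here for illustration):

```latex
H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}
```

In other words, it is the Euclidean distance between the square-rooted probability vectors, divided by the square root of 2.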

What I have tried so far is to speed up the algorithm with a multicore process, running each of the 6 habits on a separate core. Two things that may help speed it up:

(1) A C implementation, but I have no idea how to do this (I am not a C programmer). Could you help me with a C implementation if this would be useful?

(2) Make a Cartesian product by joining the table with itself to get all row pairs, and thereafter do a row-wise calculation. The problem there is that R gives an error by default in e.g. data.table. Any suggestions for this?

Any other thoughts?

Best regards, Jurjen

HellingerDistance <- function(x){
  # takes a two-row table (one row per user, equal number of columns)
  # and calculates the Hellinger distance between the two rows
  return(sqrt(sum(((sqrt(x[1,]) - sqrt(x[2,]))^2)))/sqrt(2))
}

calculatedistances <- function(x){
  # takes a dataframe with one row per user (user name as rowname)
  # and a set of N values per user

  # first set all NA to 0
  x[is.na(x)] <- 0

  # create a 2-row matrix of row-number pairs:
  # first the diagonal (i, i), then all combinations with i < j
  D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

  # create a dataframe with hellinger distances
  B <<-data.frame(first=rownames(x)[D[1,]],
                  second=rownames(x)[D[2,]],
                  distance=apply(D, 2, function(y) HellingerDistance(x[ y,]))
  )

  # reshape dataframe into a matrix with users on x and y axis
  B<<-reshape(B, direction="wide", idvar="second", timevar="first")

  # convert wide table to distance table object
  d <<- as.dist(B[,-1], diag = FALSE)
  attr(d, "Labels") <- B[, 1]
  return(d)
}

# example for 1 habit with 100 users and a PMF of 5 columns
Habit1<-data.frame(col1=abs(rnorm(100)),
                   col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
                   col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
                   col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
                   col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))

# give all users a username same as rowname
rownames(Habit1)<- c(1:100)

# actual calculation (run after the two functions above are defined)
Result<-calculatedistances(Habit1)

I understand this is not a complete answer, but this suggestion is too long for a comment.

Here is how I would use data.table to speed up the process. As it stands, this code still does not achieve what you requested, maybe because I'm not entirely sure what you want, but hopefully it gives a clear idea of how to proceed from here.

Also, you might want to take a look at the HellingerDist function in the distrEx package to calculate the Hellinger distance.

library(data.table)

# convert Habit1 into a data.table
  setDT(Habit1)

# assign ids instead of working with rownames
  Habit1[, id := 1:100] 

# replace NAs with 0
  for (j in seq_len(ncol(Habit1)))
    set(Habit1, which(is.na(Habit1[[j]])),j,0)

# convert all values to numeric
  for (k in seq_along(Habit1)) set(Habit1, j = k, value = as.numeric(Habit1[[k]]))


# get all possible combinations of id pairs in long format
  D <- cbind(matrix(rep(1:nrow(Habit1),each=2),nrow=2),combn(1:nrow(Habit1), 2))
  D <- as.data.table(D)
  D <- transpose(D)


# add to this dataset the probability mass distribution (PMF) of each id V1 and V2
# this solution dynamically adapts to number of columns in each Habit dataset
  colnumber <- ncol(Habit1) - 1
  cols <- paste0('i.col',1:colnumber) 

  D[Habit1, c(paste0("id1_col",1:colnumber)) := mget(cols ), on=.(V1 = id)]
  D[Habit1, c(paste0("id2_col",1:colnumber)) := mget(cols ), on=.(V2 = id)]


# [STATIC] calculate hellinger distance 
D[, H := sqrt(sum(((sqrt(c(id1_col1,  id1_col2,  id1_col3,  id1_col4,   id1_col5)) - sqrt(c(id2_col1,  id2_col2,  id2_col3,  id2_col4,   id2_col5)))^2)))/sqrt(2) , by = .(V1, V2)]

Now, if you want to make this flexible to the number of columns in each habit data set:

# get names of columns
  part1 <- names(D)[names(D) %like% "id1"]
  part2 <- names(D)[names(D) %like% "id2"]

# calculate distance 
  D[, H2 := sqrt(sum(((sqrt( .SD[, ..part1] ) - sqrt( .SD[, ..part2] ))^2)))/sqrt(2) , by = .(V1,V2) ] 

Now, for a much faster distance calculation:

# change 1st colnames to avoid conflict 
  names(D)[1:2] <- c('x', 'y')

# [dynamic] calculate hellinger distance
  D[melt(D, measure = patterns("^id1", "^id2"), value.name = c("v", "f"))[
  , sqrt(sum(((sqrt( v ) - sqrt( f ))^2)))/sqrt(2), by=.(x,y)], H3 := V1,  on = .(x,y)]

# same results (identical() compares only two objects, so check pairwise)
#> all.equal(D$H, D$H2)
#> [1] TRUE
#> all.equal(D$H, D$H3)
#> [1] TRUE
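One thing to check before scaling this wide-table approach to the full data set is memory. The rough estimate below is only a sketch: it assumes 25,000 users, a habit with 256 columns, and 8-byte doubles, and the variable names are made up for illustration. It compares the joined pair table (two PMFs per pair) with a plain dist object that stores one number per pair.

```r
n_users <- 25000   # users in the full data set (from the question)
n_cols  <- 256     # widest PMF mentioned in the question

# pairs including the diagonal, as built in D above
n_pairs <- n_users * (n_users + 1) / 2

# wide table: two PMFs of n_cols doubles per pair
bytes_wide <- n_pairs * 2 * n_cols * 8

# dist object: one double per unordered pair without the diagonal
bytes_dist <- choose(n_users, 2) * 8

# sizes in GiB
c(wide_GiB = bytes_wide / 2^30, dist_GiB = bytes_dist / 2^30)
```

Under these assumptions the wide table would need more than a terabyte, while the distances themselves take only a few GiB, so for the largest habits the pairs would have to be processed in chunks rather than joined all at once.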

The first step in optimizing code is profiling. Profiling the code you provided shows that the main bottleneck is the HellingerDistance function.
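As a minimal sketch of how such a profile can be collected with base R's sampling profiler (Rprof and summaryRprof), the example below uses a small stand-in for the pairwise loop. The names toy, pair_distance and all_pairs are hypothetical and not taken from the question.

```r
# Stand-in for the pairwise loop: it indexes a data.frame row pair and
# takes square roots inside the loop, like the original code.
set.seed(1)
toy <- as.data.frame(matrix(runif(120 * 5), ncol = 5))

pair_distance <- function(x) {
  sqrt(sum((sqrt(x[1, ]) - sqrt(x[2, ]))^2)) / sqrt(2)
}

all_pairs <- function(x) {
  pairs <- combn(nrow(x), 2)
  apply(pairs, 2, function(idx) pair_distance(x[idx, ]))
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start the sampling profiler
res <- all_pairs(toy)
Rprof(NULL)                         # stop profiling

# time per function, sorted by self time
head(summaryRprof(prof_file)$by.self)
```

In a profile like this, time attributed to data.frame methods such as `[.data.frame` or Math.data.frame points at the data structure rather than the arithmetic.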

  • Improve the algorithm. Your HellingerDistance function recalculates the square roots for every pair, which wastes time because each user's square roots never change. In the improved version, calculatedistances1 first calculates the square root of x once and then uses the new HellingerDistanceSqrt to calculate the Hellinger distance. This change alone makes the code about 40% faster.

  • Improve the data structure. The x in your original calculatedistances function is a data.frame, whose row indexing carries a lot of overhead, so I convert it to a matrix with as.matrix, which makes the code faster by more than an order of magnitude.

Overall, the new calculatedistances1 is more than 70 times faster than the original version on my machine.
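The effect of the two changes can be seen in isolation with a small comparison. This is only a sketch on made-up data (df, m and pairs are illustrative names), and the timings will differ between machines.

```r
set.seed(1)
df <- as.data.frame(matrix(runif(100 * 5), ncol = 5))
m  <- sqrt(as.matrix(df))          # square roots taken once, up front

pairs <- combn(nrow(df), 2)

# data.frame rows, square roots recomputed for every pair
t_df <- system.time(
  d_df <- apply(pairs, 2, function(idx)
    sqrt(sum((sqrt(df[idx[1], ]) - sqrt(df[idx[2], ]))^2)) / sqrt(2))
)

# matrix rows, square roots already computed
t_m <- system.time(
  d_m <- apply(pairs, 2, function(idx)
    sqrt(sum((m[idx[1], ] - m[idx[2], ])^2)) / sqrt(2))
)

all.equal(d_df, d_m)                               # same distances
rbind(data.frame = t_df, matrix = t_m)[, "elapsed"]  # elapsed seconds
```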

# example for 1 habit with 100 users and a PMF of 5 columns
Habit1<-data.frame(col1=abs(rnorm(100)),
                   col2=abs(c(rnorm(20),runif(50),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),10))),
                   col3=abs(c(rnorm(30),runif(30),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))),
                   col4=abs(c(rnorm(10),runif(10),rep(0.4,20),sample(seq(0.01,0.99,by=0.01),60))),
                   col5=abs(c(rnorm(50),runif(10),rep(0.4,10),sample(seq(0.01,0.99,by=0.01),30))))

# give all users a username same as rowname
rownames(Habit1)<- c(1:100)

HellingerDistance <-function(x){
    #takes two equal sized vectors and calculates the hellinger distance between the vectors

    # hellinger distance function
    return(sqrt(sum(((sqrt(x[1,]) - sqrt(x[2,]))^2)))/sqrt(2))

}

HellingerDistanceSqrt <-function(sqrtx){
    # takes a two-row matrix of already square-rooted values and calculates the Hellinger distance between the rows

    # hellinger distance function
    return(sqrt(sum(((sqrtx[1,] - sqrtx[2,])^2)))/sqrt(2))

}

calculatedistances <- function(x){
    # takes a dataframe of user IID in the first column and a set of N values per user thereafter

    # first set all NA to 0
    x[is.na(x)] <- 0



    # create a 2-row matrix of row-number pairs:
    # first the diagonal (i, i), then all combinations with i < j
    D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

    # create a dataframe with hellinger distances
    B <<-data.frame(first=rownames(x)[D[1,]],
                    second=rownames(x)[D[2,]],
                    distance=apply(D, 2, function(y) HellingerDistance(x[ y,]))
    )


    # reshape dataframe into a matrix with users on x and y axis
    B<<-reshape(B, direction="wide", idvar="second", timevar="first")

    # convert wide table to distance table object
    d <<- as.dist(B[,-1], diag = FALSE)
    attr(d, "Labels") <- B[, 1]
    return(d)

}


calculatedistances1 <- function(x){
    # takes a dataframe of user IID in the first column and a set of N values per user thereafter

    # first set all NA to 0
    x[is.na(x)] <- 0

    x <- sqrt(as.matrix(x))



    # create a 2-row matrix of row-number pairs:
    # first the diagonal (i, i), then all combinations with i < j
    D<-cbind(matrix(rep(1:nrow(x),each=2),nrow=2),combn(1:nrow(x), 2))

    # create a dataframe with hellinger distances
    B <<-data.frame(first=rownames(x)[D[1,]],
                    second=rownames(x)[D[2,]],
                    distance=apply(D, 2, function(y) HellingerDistanceSqrt(x[ y,]))
    )


    # reshape dataframe into a matrix with users on x and y axis
    B<<-reshape(B, direction="wide", idvar="second", timevar="first")

    # convert wide table to distance table object
    d <<- as.dist(B[,-1], diag = FALSE)
    attr(d, "Labels") <- B[, 1]
    return(d)

}

# actual calculation
system.time(Result<-calculatedistances(Habit1))
system.time(Result1<-calculatedistances1(Habit1))
identical(Result, Result1)
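Going one step further, the Hellinger distance is the Euclidean distance between square-rooted vectors scaled by 1/sqrt(2), so the whole pairwise loop can be handed to base R's dist(), which is implemented in C. The sketch below is not part of the answer above: hellinger_dist is a hypothetical helper name and the data are made up.

```r
# Hellinger distance via Euclidean distance on square-rooted values
hellinger_dist <- function(x) {
  m <- as.matrix(x)
  m[is.na(m)] <- 0                      # same NA handling as calculatedistances()
  dist(sqrt(m), method = "euclidean") / sqrt(2)
}

set.seed(42)
pmf <- matrix(runif(100 * 5), ncol = 5)
pmf <- pmf / rowSums(pmf)               # normalise rows so each one is a PMF
rownames(pmf) <- seq_len(nrow(pmf))

d <- hellinger_dist(pmf)

# check one pair against the direct formula
direct <- sqrt(sum((sqrt(pmf[1, ]) - sqrt(pmf[2, ]))^2)) / sqrt(2)
stopifnot(isTRUE(all.equal(as.matrix(d)[1, 2], direct)))

# the dist object can go straight into hierarchical clustering
hc <- hclust(d, method = "average")
```

The result is already a dist object with the row names as labels, so the reshape and as.dist steps are not needed.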


 