简体   繁体   中英

How can i replace nested loop using lapply in R?

Good afternoon ,

I have developped this R function that hashes data in buckets :

#   The used packages 
    library("pacman")
    pacman::p_load(dplyr, tidyr, devtools, MASS, pracma, mvtnorm, interval, intervals) 
    pacman::p_load(sprof, RDocumentation, helpRFunctions, foreach , philentropy , Rcpp , RcppAlgos) 


  hash<-function(v,p){
  if(dot(v,p)>0) return(1) else (0)   }

  LSH_Band<-function(data,K ){

  # We retrieve numerical columns of data 
  t<-list.df.var.types(data)
  df.r<-as.matrix(data[c(t$numeric,t$Intervals)])
  n=nrow(df.r)

  # we create K*K matrice using normal law
  rn=array(rnorm(K*K,0,1),c(K,K))
  # we create K*K matrice of integers using uniform law , integrs are unique in each column
  rd=unique.array(array(unique(ceiling(runif(K*K,0,ncol(df.r)))),c(K,K)))

  buckets<-array(NA,c(K,n)) 
    for (i in 1:K) {
      for (j in 1:n) {
        buckets[i,j]<-hash(df.r[j,][rd[,i]],rn[,i])
      }
    }   
  return(buckets)   
}
> df.r
  age height salaire.1 salaire.2
1  27    180         0      5000
2  26    178         0      5000
3  30    190      7000     10000
4  31    185      7000     10000
5  31    187      7000     10000
6  38    160     10000     15000
7  39    158     10000     15000
> LSH_Band(df.r, 3 )
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    1    1    1    1    1    1
[2,]    1    1    0    0    0    0    0
[3,]    0    0    0    0    0    0    0

The dot function is the scalar product of two vectors.

  • My Lsh function takes a row of my data , then it takes a part of the obtained row using df.r[j,][rd[,i]] . df.r[j,] is j-éme row of the data.
  • rd[,i] : rd is a K*K matrix of integers between 1 and ncol(df.r) , each column of the matrix contains only unique integers.

  • rn[,i] : rn is a K*K matrix that contains values of N(0,1) law.

  • In the resulting table , observations are represented in columns . I will have k Rows. For the last row , i will compute the scalar product between df.r[j,][rd[,K]] and rn[,K] . I will obtain 1 if the scalar product is positive. rd[,K] and rn[,K] will be used only for the last row in the resulting table and for all observations in that row.

My question :

Is it to replace the loops with variables i and j by a lapply function ?

My real data will be large , this is why i'm asking this question.

Thank you !

The following is a bit too long as a comment, so here are some pointers/issues/remarks:

  1. First off, I have to say I struggle to understand what LHS_Band does. Perhaps some context would help here.

  2. I don't understand the purpose of certain functions like helpRFunctions::list.df.var.type which simply seems to return the column names of data in a list . Note also that t$Intervals returns NULL based on the sample data you give. So I'm not sure what's going on there.

  3. I don't see the point of function pracma::dot either. The dot product between two vectors can be calculated in base R using %*% . There's really no need for an additional package.

  4. Function hash can be written more compactly as

    hash <- function(v, p) +(as.numeric(v %*% p) > 0)

    This avoids the if conditional which is slow.


Notwithstanding my lack of understanding what it is you're trying to do, here are some tweaks to your code

hash <-  function(v, p) +(as.numeric(v %*% p) > 0)

LSH_Band <- function(data, K, seed = NULL) {

    # We retrieve numerical columns of data
    data <- as.matrix(data[sapply(data, is.numeric)])
    # we create K*K matrice using normal law
    if (!is.null(seed)) set.seed(seed)
    rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
    # we create K*K matrice of integers using uniform law , integrs are unique in each column
    rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
    buckets <- matrix(NA, nrow = K, ncol = nrow(data))
    for (i in 1:K) {
        buckets[i, ] <- apply(data, 1, function(row) hash(row[rd[, i]], rn[, i]))
    }
    buckets
}
  1. Always add an option to use a reproducible seed when working with random numbers. That will make debugging a lot easier.
  2. You can replace at least one for loop with apply (which when using MARGIN = 1 iterates through the rows of a matrix (or array )).
  3. I've removed all the unnecessary package dependencies, and replaced the functionality with base R functions.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM