
Efficient search & update, data tables or sparse matrix - R

I am trying to find the most efficient way to repeatedly search for combinations of two variables in a reference table. The problem arises from an implementation of a hill climbing algorithm with an annealing step size, which adds a lot of complexity.

To explain, say I have two variables, A and B, that I want to optimise. I start with 100 combinations of these variables, which I will iterate through:

set.seed(100)
A_start <- sample(1000, 10, replace = FALSE)
B_start <- sample(1000, 10, replace = FALSE)
A_B_starts <- expand.grid(A = A_start,
                          B = B_start)

head(A_B_starts)
    A   B
1 714 823
2 503 823
3 358 823
4 624 823
5 985 823
6 718 823

For each of these start combinations, I want to run their immediate neighbours through a predictive model and, if a neighbour's error is less than that of the start combination, continue in that direction. This is repeated until a maximum number of iterations is hit or the error increases (standard hill climbing). However, I do not want to recheck combinations I have already looked at, so I use a reference table to store checked combinations. Each time I generate the immediate neighbours, I check whether they are already in the reference table before running the predictive model; any that are present are simply removed. Further complexity is added because I want the step size that generates the immediate neighbours to anneal, i.e. get smaller over time. I have implemented this using data.table:

library(data.table)

max_iterations <- 1e+06
#Preallocate to max size so adding new combinations is efficient;
#max size is 100 start points times the max number of iterations allowed
ref <- data.table(A = numeric(),
                  B = numeric(),
                  key = c("A", "B"))[1:(100*max_iterations)]

ref
            A  B
        1: NA NA
        2: NA NA
        3: NA NA
        4: NA NA
        5: NA NA
       ---      
 99999996: NA NA
 99999997: NA NA
 99999998: NA NA
 99999999: NA NA
100000000: NA NA

So, this is the loop that actually works through the problem:

visited_counter <- 1L
for(start_i in 1:nrow(A_B_starts)){
   #Reset the step sizes for each start point
   step_A <- 5
   step_B <- 5
   initial_error <- get.error.pred.model(A_B_starts[start_i, ])
   A <- A_B_starts[start_i, 1]
   B <- A_B_starts[start_i, 2]
   #Add start i to checked combinations
   set(ref, i=visited_counter, j="A", value=A)
   set(ref, i=visited_counter, j="B", value=B)
   visited_counter <- visited_counter+1L
   iterations <- 1
   while(iterations<max_iterations){
      #Anneal the step size
      decay_A <- step_A / iterations
      decay_B <- step_B / iterations
      step_A <- step_A * 1/(1 + decay_A*iterations)
      step_B <- step_B * 1/(1 + decay_B*iterations)
      #Get neighbours to check
      to_visit_A <- c(A + step_A, A - step_A)
      to_visit_B <- c(B + step_B, B - step_B)
      to_visit <- setDT(expand.grid("A" = to_visit_A, "B" = to_visit_B),
                        key = c("A", "B"))
      #Now check whether any combinations have been checked before and remove them.
      #The key must be re-set inside the loop for efficient searching because
      #updating values in the data.table invalidates it
      setkey(ref, A, B)
      prev_visited <- ref[to_visit, nomatch = 0L]
      to_visit <- to_visit[!prev_visited]
      #Run the model on the remaining combinations and, if the error is
      #reduced, continue in that direction
      best_neighbour <- get.min.error.pred.model(to_visit)
      if(best_neighbour$error < initial_error){
         initial_error <- best_neighbour$error
         A <- best_neighbour$A
         B <- best_neighbour$B
      } else {
         iterations <- max_iterations
      }
      #Add all checked combinations to the reference and update the iteration count
      for(visit_i in seq_len(nrow(to_visit))){
         #set() on a key column invalidates the key of the data.table
         set(ref, i=visited_counter, j="A", value=to_visit$A[visit_i])
         set(ref, i=visited_counter, j="B", value=to_visit$B[visit_i])
         visited_counter <- visited_counter + 1L
         iterations <- iterations + 1
      }
   }
}
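
(get.error.pred.model and get.min.error.pred.model are placeholders for my actual predictive model. Any stub that returns an error per combination makes the loop runnable for testing, e.g. a made-up quadratic error surface:)

#Dummy stand-ins for the real predictive model, for testing only:
#a fake error surface with its minimum at A = 500, B = 500
get.error.pred.model <- function(comb) {
  (comb$A - 500)^2 + (comb$B - 500)^2
}
get.min.error.pred.model <- function(combs) {
  errors <- (combs$A - 500)^2 + (combs$B - 500)^2
  best <- which.min(errors)
  list(A = combs$A[best], B = combs$B[best], error = errors[best])
}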

The problem with this approach is that I have to re-set the key on every loop iteration, because a new combination has been added to ref, and this makes it very slow:

setkey(ref,A,B)
prev_visited<-ref[to_visit,nomatch=0L]
to_visit<-to_visit[!prev_visited]
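
One alternative I have been looking at is data.table's on= argument (available since v1.9.6), which performs an ad-hoc join without needing a key on ref at all. I have not benchmarked it against the keyed join, but as a sketch:

#Anti-join with on=: keep only the neighbours not already present in ref,
#without ever calling setkey(ref, A, B)
to_visit <- to_visit[!ref, on = c("A", "B")]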

Also, the reason I mention the annealing is that I had another idea: use a sparse matrix (from the Matrix package) to hold indicators of the pairs already checked, which would allow very quick lookups:

require(Matrix)
#Use a sparse matrix for efficient search and optimum RAM usage
#(row indices stand for values of A, column indices for values of B)
sparse_matrix <- sparseMatrix(i = 1:(100*1e+06),
                              j = 1:(100*1e+06))

However, since the step size is variable, i.e. A and B can take any value at increasingly small intervals, how do I initialise appropriate indices for A and B in the sparse matrix so that it captures all possible combinations checked?
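
For what it is worth, one workaround I have considered (a rough sketch with made-up helper names, not tested for speed) is to drop the matrix entirely and hash each visited pair as a string key in an environment, which gives constant-time insert/lookup and does not require knowing the possible values of A and B in advance:

#An environment used as a hash table of visited (A, B) pairs
visited <- new.env(hash = TRUE)

mark_visited <- function(A, B) {
  assign(paste(A, B, sep = "_"), TRUE, envir = visited)
}

was_visited <- function(A, B) {
  exists(paste(A, B, sep = "_"), envir = visited, inherits = FALSE)
}

mark_visited(714, 823)
was_visited(714, 823)  #TRUE
was_visited(503, 823)  #FALSE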

(Not really an answer, but too long for a comment.)

If the number of possible solutions is huge, it might be impractical or impossible to store them all. What is more, the fastest way to look up a single solution is generally a hash table; but setting up the hash table is slow, so you might not gain much (your objective function needs to be more expensive than this set-up/look-up overhead). Depending on the problem, much of this solution-storing might be a waste: the algorithm may never revisit them. An alternative might be a first-in/first-out data structure that simply stores the last N solutions visited. (Even a linear look-up of a short list may be faster than working with a repeatedly re-set-up hash table.) But in any case, I'd start by testing whether, and how often, the algorithm actually revisits a particular solution.
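
To illustrate the FIFO idea, here is a minimal sketch; the buffer size N and the helper names are made up, and a real implementation would tune N to the problem:

#A fixed-size FIFO buffer of the last N visited (A, B) pairs
N <- 1000L
recent <- matrix(NA_real_, nrow = N, ncol = 2)
pos <- 0L  #position of the most recently stored pair

remember <- function(A, B) {
  pos <<- pos %% N + 1L      #wrap around, overwriting the oldest entry
  recent[pos, ] <<- c(A, B)
}

seen_recently <- function(A, B) {
  #linear scan; fast enough for a short list
  any(recent[, 1] == A & recent[, 2] == B, na.rm = TRUE)
}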
