
R - Finding bitwise binary neighbors (flipping one bit at a time)

Is there a more effective way to match matrix rows when using large matrices? I have a vector of values that corresponds to a matrix with 2^N rows, where N is typically large, e.g., > 20. Each row is a unique combination of N binary values in {0, 1} and represents a 'position' in a decision space. That is, for N=3 the rows would be 0 0 0, 0 0 1, 0 1 0, 1 0 0, ..., 1 1 1.

I need to determine whether a position is a local maximum, i.e., whether all N neighboring positions have lower values. For example, for the position 0 0 0, the neighboring positions are 1 0 0, 0 1 0, and 0 0 1. I have coded the following solution, which does the job but runs very slowly for large N.

library(prodlim) #for row.match command

set.seed(1234)
N=10

space = as.matrix(expand.grid(rep(list(0:1), N))) #creates all combinations of 0-1 along N-dimensions

performance = replicate(2^N, runif(1, min=0, max=1)) #corresponding values for each space-row (position)

#determine whether a space position is a local maximum, that is, whether the N neighboring positions have smaller performance values


system.time({
local_peaks_pos = matrix(NA,nrow=2^N, ncol=1)
for(v in 1:2^N)
{

  for(q in 1:N)
  {
    temp_local_pos = space[v,1:N]
    temp_local_pos[q] = abs(temp_local_pos[q]-1)

    if(performance[row.match(temp_local_pos[1:N], space[,1:N])] > performance[v])
    {
      local_peaks_pos[v,1] = 0
      break
    }

  }

}
local_peaks_pos[is.na(local_peaks_pos)] = 1
})

  user  system elapsed 
   9.94    0.05   10.06 

As Gabe mentioned in his comment, you can exploit the fact that your decision space can be interpreted as single integers:

set.seed(1234L)
N <- 10L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2L ^ (0L:(N - 1L))))

is_local_max <- sapply(0L:(2^N - 1), function(i) {
  # bits that are 1 become -1 (flip the bit down), bits that are 0 become +1 (flip it up)
  multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
  multipliers[multipliers == 0L] <- 1L
  # adding/subtracting the corresponding power of two flips exactly one bit
  neighbors <- i + powers_of_two * multipliers
  # compensate that R vectors are 1-indexed
  !any(performance[neighbors + 1L] > performance[i + 1L])
})

# compensate again
local_peaks_int <- which(is_local_max) - 1L
local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
  as.integer(rev(intToBits(int)[1L:N]))
}))

> head(local_peaks_binary)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    1    0     0
[2,]    0    0    0    0    1    0    0    1    1     0
[3,]    0    0    0    0    1    1    1    1    0     0
[4,]    0    0    0    1    0    0    0    1    1     1
[5,]    0    0    0    1    0    1    0    1    0     1
[6,]    0    0    0    1    1    0    1    1    1     0

In decimal terms, multipliers holds the sign applied to each entry of powers_of_two, so that adding the products to the current integer amounts to flipping one bit of the binary representation. For example, if the original binary was 0 0 and we flip one bit to get 1 0, it is as if we added 2^1 in decimal; if it was originally 1 0 and we flip that bit to get 0 0, then we subtracted 2^1 in decimal.
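As a small illustration of that arithmetic, here is the same computation for a toy case with N = 2 and i = 2 (binary 1 0); the variable names are only for this example:

i <- 2L                                   # binary 1 0
m <- as.integer(rev(intToBits(i)[1L:2L])) * -1L
m[m == 0L] <- 1L
m                                         # -1  1: subtract 2^1, add 2^0
i + as.integer(rev(2L ^ (0L:1L))) * m     # neighbors 0 (binary 0 0) and 3 (binary 1 1)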

Each row in local_peaks_binary is a binary number from your decision space, with the least significant bit on the right. So, for example, the first local peak is decimal 4.
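As a quick check, a row can be read back into its decimal value; for the first peak above:

strtoi(paste(local_peaks_binary[1, ], collapse = ""), base = 2L)
## [1] 4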

See this question for the mapping of integers to binary.

EDIT: and if you want to do it in parallel:

library(doParallel)
set.seed(1234L)
N <- 20L
performance <- runif(2^N)
powers_of_two <- as.integer(rev(2 ^ (0:(N - 1))))

num_cores <- detectCores()
workers <- makeCluster(num_cores)
registerDoParallel(workers)

chunks <- splitIndices(length(performance), num_cores)
chunks <- lapply(chunks, "-", 1L)  # shift from 1-based indices to the 0-based integers
local_peaks_int <- foreach(chunk=chunks, .combine=c, .multicombine=TRUE) %dopar% {
  is_local_max <- sapply(chunk, function(i) {
    multipliers <- as.integer(rev(intToBits(i)[1L:N])) * -1L
    multipliers[multipliers == 0L] <- 1L
    neighbors <- i + powers_of_two * multipliers
    # compensate that R vectors are 1-indexed
    !any(performance[neighbors + 1L] > performance[i + 1L])
  })

  # return
  chunk[is_local_max]
}

local_peaks_binary <- t(sapply(local_peaks_int, function(int) {
  as.integer(rev(intToBits(int)[1L:N]))
}))

stopCluster(workers); registerDoSEQ()

The above completes in ~2.5 seconds on my system, which has 4 CPU cores.
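For reference, splitIndices (from the parallel package, loaded by doParallel) simply partitions the index range into roughly equal blocks, which are then shifted to the 0-based integers each worker handles. A toy example:

library(parallel)
splitIndices(8L, 2L)                   # list(1:4, 5:8)
lapply(splitIndices(8L, 2L), "-", 1L)  # list(0:3, 4:7)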


Here is a C++ version that uses multi-threading but, at least on my system with 4 threads, it doesn't seem faster than Gabe's Fortran version. However, when I try to run Gabe's Fortran code in a new session, I get the following error with N <- 29L : cannot allocate vector of size 4.0 Gb.

EDIT: Apparently I changed something important along the way, because after testing again, the C++ version actually seems faster.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
#include <cstddef> // size_t
#include <vector>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

class PeakFinder : public Worker
{
public:
  PeakFinder(const NumericVector& performance, vector<int>& peaks, const int N)
    : performance_(performance)
    , peaks_(peaks)
    , N_(N)
  { }

  void operator()(size_t begin, size_t end) {
    vector<int> peaks;
    for (size_t i = begin; i < end; i++) {
      bool is_local_peak = true;
      unsigned int mask = 1;
      for (int exponent = 0; exponent < N_; exponent++) {
        unsigned int neighbor = static_cast<unsigned int>(i) ^ mask; // bitwise XOR
        if (performance_[i] < performance_[neighbor]) {
          is_local_peak = false;
          break;
        }

        mask <<= 1;
      }

      if (is_local_peak)
        peaks.push_back(static_cast<int>(i));
    }

    // append this thread's results to the shared vector under a lock
    mutex_.lock();
    peaks_.insert(peaks_.end(), peaks.begin(), peaks.end());
    mutex_.unlock();
  }

private:
  const RVector<double> performance_;
  vector<int>& peaks_;
  const int N_;
  tthread::mutex mutex_;
};

// [[Rcpp::export]]
IntegerVector local_peaks(const NumericVector& performance, const int N) {
    vector<int> peaks;
    PeakFinder peak_finder(performance, peaks, N);
    // each thread call will check at least 1024 values
    parallelFor(0, performance.length(), peak_finder, 1024);

    IntegerVector result(peaks.size());
    int i = 0;
    for (int peak : peaks) {
        result[i++] = peak;
    }
    return result;
}

After saving the C++ code in local-peaks.cpp, this code:

library(Rcpp)
library(RcppParallel)

sourceCpp("local-peaks.cpp")

set.seed(1234L)
N <- 29L
performance <- runif(2^N)
system.time({
    local_peaks_int <- local_peaks(performance, N)
})

finished in ~2 seconds (without considering the time required to allocate performance).
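As a sanity check (a sketch, not part of the original timing), a handful of the reported peaks can be re-verified in plain R using bitwXor, which flips one bit at a time just like the C++ loop:

check_peak <- function(i) {
  neighbors <- bitwXor(i, 2^(0:(N - 1)))  # flip each of the N bits of i
  !any(performance[neighbors + 1] > performance[i + 1])
}
all(vapply(sample(local_peaks_int, 10L), check_peak, logical(1)))  # should be TRUE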

If you do need the binary representation, you can change local_peaks like this (see this question):

// [[Rcpp::export]]
IntegerMatrix local_peaks(const NumericVector& performance, const int N) {
  vector<int> peaks;
  PeakFinder peak_finder(performance, peaks, N);
  // each thread call will check at least 1024 values
  parallelFor(0, performance.length(), peak_finder, 1024);

  // in case you want the same order every time, #include <algorithm> and uncomment next line
  // sort(peaks.begin(), peaks.end());

  IntegerMatrix result(peaks.size(), N);
  int i = 0;
  for (int peak : peaks) {
    for (int j = 0; j < N; j++) {
      result(i, N - j - 1) = peak & 1;
      peak >>= 1;
    }

    i++;
  }

  return result;
}
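A brief usage sketch, assuming this matrix-returning version replaces the previous one in local-peaks.cpp before re-sourcing:

sourceCpp("local-peaks.cpp")       # now exports the IntegerMatrix version
local_peaks_binary <- local_peaks(performance, N)
dim(local_peaks_binary)            # one row per local peak, N columns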

Here is one solution that follows the same general structure as your example code. intToBits and packBits map to and from the binary representation of each integer (subtracting one so the integers start at zero). The inner loop flips each of the N bits to get the neighbors. On my laptop, this runs in a fraction of a second for N=10 and around a minute for N=20. The commented-out code stores some information from neighbors already tested so as not to redo the calculation; uncommenting those lines makes it run in about 35 seconds for N=20.

loc_max <- rep(1, 2^N)
for (v in 1:2^N){
  ## if (loc_max[v] == 0) next
  vbits <- intToBits(v-1)   # binary representation of v-1, least significant bit first
  for (q in 1:N){
    tmp <- vbits
    tmp[q] <- !vbits[q]     # flip bit q to get a neighboring position
    pos <- packBits(tmp, type = "integer") + 1   # back to a 1-based row index
    if (performance[pos] > performance[v]){
      loc_max[v] <- 0
      break
    ## } else {
    ##   loc_max[pos] <- 0
    }
  }
}

identical(loc_max, local_peaks_pos[, 1])
## [1] TRUE
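For reference, a small illustration of the intToBits/packBits round trip used above, with toy values v = 6 and q = 2 (so v-1 = 5, binary 101):

vbits <- intToBits(6L - 1L)          # bits of 5, least significant first
tmp <- vbits
tmp[2] <- !vbits[2]                  # flip the second bit: 101 -> 111
packBits(tmp, type = "integer") + 1  # row 8 of space, i.e. position 1 1 1
## [1] 8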

EDIT: It sounds like you need every bit of speed possible, so here is another suggestion that relies on compiled code to run significantly faster than my first example: a fraction of a second for N=20 and a bit under 20 seconds for N=29 (the largest example I could fit in my laptop's RAM).

This uses a single core; you could either parallelize it, or run it on a single core and parallelize your Monte Carlo simulations instead.

library(inline)

loopcode <-
"  integer v, q, pos
   do v = 0, (2**N)-1
      do q = 0, N-1
         if ( btest(v,q) ) then
            pos = ibclr(v, q)
         else
            pos = ibset(v, q)
         end if
         if (performance(pos) > performance(v)) then
            loc_max(v) = 0
            exit
         end if
      end do
   end do
"

loopfun <- cfunction(sig = signature(performance="numeric", loc_max="integer", n="integer"),
                     dim=c("(0:(2**n-1))", "(0:(2**n-1))", ""),
                     loopcode,
                     language="F95")

N <- 20
performance = runif(2^N, min=0, max=1)
system.time({
  floop <- loopfun(performance, rep(1, 2^N), N)
})
##  user  system elapsed
## 0.049   0.003   0.052

N <- 29
performance = runif(2^N, min=0, max=1)
system.time({
  floop <- loopfun(performance, rep(1, 2^N), N)
})
##   user  system elapsed
## 17.892   1.848  19.741

I don't think pre-computing the neighbors would help much here, since I'd guess the comparisons, which access different sections of such a large array, are the most time-consuming part.
