optimize R code for min() and sample() by group

Question

I generate a network with npeople(=80), ncomp(=4) components and I want each component to have density equal to dens(=0.2).

I want to optimize 2 lines of the code which take most of the time (especially if I want to have 5k people in the network).

the 2 lines are:


# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]

# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]

I have tried using the lapply() function, but the execution time increased - see below the line of code:

nodes[,lapply(.SD, function(p) min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]

rm(list=ls())
library(data.table)
library(intergraph)
library(igraph)
library(Matrix)
library(profvis)
library(ggplot2)

draw.var <- function(n, var1, rho, mean){
  C <- matrix(rho, nrow = 2, ncol = 2)
  diag(C) <- 1
  C <- chol(C)
  S <- rnorm(n, mean = mean)
  S <- cbind(scale(var1)[1:n],S) 
  ZS <- S %*% C
  return(ZS[,2])
}

set.seed(1123)

profvis({

    # create empty list to store data
    dt.list <- list()
    npeople <- 500
    dens <- .2
    OC.impact <- FALSE

    cor_iv_si <- .6
    cor_iv_uc <- 0
    cor_uc_oc <- 0.6
    ncomp <- 4 

    beta_oc <- 2   # observed characteristics 
    beta_uc <- 2   # unobserved characteristics 
    beta_si <- 1 


    # create data.table
    dt.people <- data.table(ego = 1:npeople)

    # draw observed characteristics 
    dt.people[, OC :=  abs(rt(npeople,2))]

    # draw unobserved variable
    dt.people[, UC := draw.var(npeople, dt.people$OC, rho = cor_uc_oc,mean = 5)]

    # set component idientifier
    dt.people$group <- cut_number(dt.people$UC, ncomp,labels = F)

    for(q in 1:ncomp){

      # subset comp
      dt.sub <- dt.people[group == q]

      # create undirected graph
      nodes <- as.data.table(t(combn(dt.sub$ego, 2)))
      setnames(nodes,c("ID","ALTERID"))

      # add attributes
      nodes <- merge(nodes,dt.people[,list(ID = ego, ID.UC = UC, ID.OC = OC)], by = "ID")
      nodes <- merge(nodes,dt.people[,list(ALTERID = ego, ALTERID.UC = UC, ALTERID.OC = OC)], by = "ALTERID")

      # calculate distance
      nodes[,d := abs(ID.UC - ALTERID.UC)]

      # estimate the appropiate density per component
      n.edges <- (dens * (npeople * (npeople - 1)))/ncomp
      n.nodes <- npeople/ncomp
      c.dens <- n.edges/(n.nodes * (n.nodes - 1))

      # estimate initial probability of tie based on distance
      coefficient <- log(c.dens / (1 - c.dens))
      alpha <- coefficient / mean(nodes$d)
      nodes[,p := exp(alpha * d) / (1 + exp(alpha * d))]

      # adjust probability to keep density
      nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]

      # simulate edges
      nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]

      # keep the edges
      nodes <- nodes[edge == 1,list(ID,ALTERID)]

      # bind the networks 
      if(q == 1){
        net <- copy(nodes)
      } else{
        net <- rbind(net,nodes)
      }

    }

    # create opposide direction
    net <- rbind(net,net[,list(ID = ALTERID, ALTERID = ID)])
  })

Answer 1

This incorporates @BenBolker and @ DavidArenburg's suggestions and also incorporates some of data.table 's tools.

Non-Equi joins

The OP code loops through each group. One part of the code also uses combn and multiple joins to get the data in the right format. Using non-equi joins, we can combine all of those steps in one data.table call

  dt_non_sub <- dt.people[dt.people, 
                          on = .(ego < ego, group = group), 
                          allow.cartesian = T, 
                          nomatch = 0L,
                          .(group, 
                            ALTERID = i.ego, ID = x.ego, 
                            ID.UC = UC, ID.OC = OC, 
                            ALTERID.OC = i.OC, ALTERID.UC = i.UC,
                            d = abs(UC - i.UC)) #added to be more efficient
                          ]
  # dt_non_sub[, d:= abs(ID.UC - ALTERID.UC)]

Vectorization

The original code was mostly slow because of two calls with by groupings. Since each call split the dataframe in around 8,000 individual groups, there were 8,000 functions calls each time. This eliminates those by using pmin as suggested by @DavidArenburg and then uses runif(N)<p as suggested by @BenBolker. My addition was that since your final result don't seem to care about p , I only assigned the edge by using {} to only return the last thing calculated in the call.

  # alpha <- coefficient / mean(nodes$d)
  dt_non_sub[,
             edge := {
               alpha = coefficient / mean(d)
               p = exp(alpha * d) / (1 + exp(alpha * d))
               p_mean = mean(p)
               p = pmin(1, p * (1/(p_mean / c.dens)))
               as.numeric(runif(.N)<p)
               }
             , by = .(group)]

  net2 <- rbindlist(dt_non_sub[edge == 1, .(group, ALTERID, ID)],
                dt_non_sub[edge == 1, .(group, ID = ALTERID, ALTERID = ID)]

One thing to note is that the vectorization is not 100% identical. Your code was recursive, each split updated the mean(node$p) for the next ID, ALTERID group. If you need that recursive part of the call, there's not much help to make it faster.

In the end, the modified code runs in 20 ms vs. the 810 ms of your original function. The results, while different, are somewhat similar in the total number of results:

Original :

net
        ID ALTERID
    1:   5      10
    2:  10      14
    3:   5      25
    4:  10      25
    5:  14      25
   ---            
48646: 498     458
48647: 498     477
48648: 498     486
48649: 498     487
48650: 498     493

Modified

net2
       group ALTERID  ID
    1:     2       4   3
    2:     2       6   4
    3:     4       7   1
    4:     4       8   7
    5:     2       9   4
   ---                  
49512:     3     460 500
49513:     3     465 500
49514:     3     478 500
49515:     3     482 500
49516:     3     497 500

optimize R code for min() and sample() by group

Question

1 answers

solution1
3 ACCPTED 2019-08-25 18:05:46

Non-Equi joins

Vectorization

optimize R code for min() and sample() by group

Question

1 answers

solution1 3 ACCPTED 2019-08-25 18:05:46

Non-Equi joins

Vectorization

solution1
3 ACCPTED 2019-08-25 18:05:46