
Storing Frequency Count of Simulation Function Output in R

I have a program where I am running a simulation function for a large number of iterations. I'm stuck, however, on what I expected to be the easiest part: figuring out how to store frequency counts of the function's results.

The simulation function itself is complicated, but it is analogous to R's sample() function. A large amount of data goes in, and the function outputs a vector containing a subset of the elements.

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){
  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    results <- ????
  }
}

run_simulation(x)

My question is: what is the best (most efficient and R-like) data structure for storing the frequency counts of the function's results inside the simulation loop? As you might be able to tell from the for loop, my background is in languages like Python, where I would create a dict keyed by tuples and increment it every time a particular combination is output:

counts[results_tuple] = counts.get(results_tuple, 0) + 1

However, there is no equivalent dict/hashmap type structure in R, and I've often found that trying to emulate other languages in R is a recipe for ugly and inefficient code. (Right now I am converting the output vector to a string and appending it to a result list that I count later with table(), but that is very memory-inefficient for a high number of iterations over a function that has a limited number of possible output vectors.)

To be clear, here is the kind of output I want:

               Result Freq
   black, pink, green    8
     blue, red, white    7
    black, pink, blue    7
   blue, green, black    5
     blue, green, red    4
   green, blue, white    3
   pink, green, white    3
   white, blue, green    1
   white, orange, red    1
yellow, black, orange    1
  yellow, blue, green    1

I don't care about the frequency of any particular element, only of the set. And I don't care about the order of the output, just the frequency.

Any advice is appreciated!

You can use a data.table (a juiced-up data.frame implementation) that uses the possible values as its key. data.tables require a specific syntax, but are very efficient.

Here is how I would go about it. Matching simulation outputs back to the keyed rows requires sorting each output, so I saved the sorted result under a new variable:

library(data.table)

x <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

run_simulation <- function(input_data, iterations = 100){

  # generate set of all possible outputs
  possible_values <- sort(input_data)  ## needed to match simulations

  # combn() seems to preserve input order
  # have to sort each column from combn() output if this is not guaranteed
  results <- as.data.table(t(combn(possible_values, 3)))
  setnames(results, c("first", "second", "third"))
  results[, count:=0]  ## initiate counts column
  setkey(results, first, second, third)  ## use index columns as table key

  for (i in 1:iterations){
    result <- sample(input_data, 3, replace=FALSE)
    result_sorted <- t(sort(result))  ## t() needed to specify it's a row
    colnames(result_sorted) <- c('first', 'second', 'third')
    result_sorted <- as.data.table(result_sorted)
    results[result_sorted, count:=count + 1]
  }
  return(results)
}

Most of the lines after the result is generated are needed to get the vector into the right format for data.table to look up the correct row. This may be overkill for a small number of possible combinations, but it should pay dividends if the possible set is larger.
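For comparison, the same "enumerate every possible outcome first, then increment" idea can be sketched in base R with a named integer vector. This is only an illustration, not the answer's method: `run_simulation_prealloc` and `colours` are hypothetical names, and lookup by name in a plain vector is not hashed, so it will not match data.table's keyed lookup on a large outcome set.

```r
colours <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

# Hypothetical helper: pre-allocate one counter per possible sorted triple
run_simulation_prealloc <- function(input_data, iterations = 100) {
  # combn() keeps elements in input order, so sorted input gives sorted triples
  possible <- combn(sort(input_data), 3, FUN = paste, collapse = ", ")
  counts <- setNames(integer(length(possible)), possible)

  for (i in seq_len(iterations)) {
    # sort the sampled triple so it matches one of the pre-built keys
    key <- paste(sort(sample(input_data, 3, replace = FALSE)), collapse = ", ")
    counts[[key]] <- counts[[key]] + 1L
  }

  data.frame(Result = names(counts), Freq = unname(counts),
             stringsAsFactors = FALSE)
}

res <- run_simulation_prealloc(colours, iterations = 100)
```

Combinations that never occur stay in the result with a count of 0, which is the same trade-off the keyed data.table makes.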

The following is a short solution using base R, which seems to give fairly quick execution times.

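 run_simulation <- function(input_data, iterations = 100){
   # one string key per iteration: the sorted sample collapsed with ", "
   Results <- replicate(iterations, paste0(sort(sample(input_data, 3, replace=FALSE)), collapse=", "))
   # tabulate the keys and return the counts visibly as a data frame
   as.data.frame(table(Results))
 }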

run_simulation(x) gives

                  Results Freq
 1     black, blue, green    2
 2    black, blue, orange    2
 3      black, blue, pink    6
 4       black, blue, red    6
 5     black, blue, white    2
 6   black, green, orange    3
 7     black, green, pink    1
 8      black, green, red    1

Benchmarking this for 100, 1,000, 10,000, and 100,000 iterations shows that the times increase linearly with the number of iterations, which seems desirable. Also, the median time for 100,000 iterations is about 2,200 milliseconds, or 2.2 seconds. You describe your simulation as complicated and using a great deal of data, so it may well be that the time spent running the simulation itself significantly exceeds the time spent in this bit of code tabulating the results.

 library(microbenchmark)

 microbenchmark(run_simulation(x,iterations=100), run_simulation(x,iterations=1000), run_simulation(x,iterations=10000), run_simulation(x,iterations=100000), times=100)

 Unit: milliseconds
                                   expr         min          lq      median          uq        max neval
    run_simulation(x, iterations = 100)    2.352262    2.447647    2.488282    2.573545   71.96314   100
    run_simulation(x, iterations = 1000)   19.161997   19.751702   20.476572   24.411885   90.42650   100
    run_simulation(x, iterations = 10000)  193.688216  208.453087  217.130138  226.166201  289.13177   100
    run_simulation(x, iterations = 1e+05) 2012.773904 2125.986609 2169.870885 2236.038487 2426.02379   100
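If holding one string per iteration in memory is the concern raised in the question, one possible variation is to tabulate in chunks and merge the partial counts, so that only one chunk of keys exists at a time. The sketch below is an assumption-laden illustration, not part of the benchmark above: `count_in_chunks`, `chunk_size`, and `colours` are hypothetical names.

```r
colours <- c("red", "blue", "yellow", "orange", "green", "black", "white", "pink")

# Hypothetical helper: tabulate chunk_size iterations at a time
count_in_chunks <- function(input_data, iterations = 100, chunk_size = 1000) {
  totals <- integer(0)  # named integer vector of running counts
  remaining <- iterations

  while (remaining > 0) {
    n <- min(chunk_size, remaining)
    keys <- replicate(n, paste(sort(sample(input_data, 3, replace = FALSE)),
                               collapse = ", "))
    chunk <- table(keys)

    # start any key seen for the first time at zero, then add this chunk
    new_keys <- setdiff(names(chunk), names(totals))
    totals[new_keys] <- 0L
    totals[names(chunk)] <- totals[names(chunk)] + as.integer(chunk)

    remaining <- remaining - n
  }

  data.frame(Result = names(totals), Freq = unname(totals),
             stringsAsFactors = FALSE)
}

res <- count_in_chunks(colours, iterations = 2500, chunk_size = 1000)
```

Peak memory for the keys is then bounded by `chunk_size` rather than by `iterations`, at the cost of one table() call per chunk.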

You could also use an environment (which does in fact use a hash table). This way you do not need to enumerate all possible outcomes of your simulation, since you are only interested in the counts anyway:

runSimulation <- function(input.size = 300L, iterations = 100L) {
   x <- paste0("E", 1L:input.size)
   results <- new.env(hash = TRUE)
   for (i in 1:iterations){
      result <- sample(x, 3, replace = FALSE)
      nam <- paste0(sort(result), collapse = ".")
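      # inherits = FALSE: look only in `results`, not in enclosing environments
      if (exists(nam, envir = results, inherits = FALSE)) {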
         results[[nam]] <- results[[nam]] + 1
      } else {
         assign(nam, 1, envir = results)
      }
   }
   l <- as.list(results)
   d <- data.frame(tuple = names(l), count = unlist(l))
   rownames(d) <- NULL
   d
}

However, timewise this is comparable to the solution using table().
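To make the parallel with the Python idiom `counts.get(key, 0) + 1` explicit, the environment can be wrapped in a small counter object. This is a minimal sketch under assumptions: `make_counter`, `increment`, and `as_data_frame` are hypothetical names chosen for illustration, not an existing R API.

```r
# Hypothetical helper: a dict-like counter backed by a hashed environment
make_counter <- function() {
  counts <- new.env(hash = TRUE, parent = emptyenv())

  increment <- function(key) {
    # equivalent of Python's counts.get(key, 0)
    current <- if (exists(key, envir = counts, inherits = FALSE)) {
      get(key, envir = counts, inherits = FALSE)
    } else {
      0
    }
    assign(key, current + 1, envir = counts)
    invisible(NULL)
  }

  as_data_frame <- function() {
    keys <- ls(counts, all.names = TRUE)
    values <- as.numeric(unlist(mget(keys, envir = counts), use.names = FALSE))
    data.frame(tuple = keys, count = values, stringsAsFactors = FALSE)
  }

  list(increment = increment, as_data_frame = as_data_frame)
}

counter <- make_counter()
for (key in c("black, blue, pink", "blue, green, red", "black, blue, pink")) {
  counter$increment(key)
}
tally <- counter$as_data_frame()
```

Inside a simulation loop, the key would be built the same way as in the answer above, for example with paste0(sort(result), collapse = ".").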
