
Combine dataframes as function output in a single dataframe in R

I would like to combine multiple dataframes, each returned by a function, into one big dataframe in R.

I am simulating data within a function, e.g.:

set.seed(123)

x <- function(){
return( data.frame( matrix(rnorm(10, 1, .5), ncol=2) ) )
}

I would like to run multiple simulations and bind the resulting dataframes together.

Attempt

set.seed(123)

x_improved <- function(sim_nr){
  df <- data.frame( matrix(rnorm(10, 1, .5), ncol=2) )  # simulate data
  sim_nr <- rep(sim_nr, length(df[,1]))                 # add reference number
  df <- cbind(df, sim_nr)                               # bind columns
  return(df)
}

list_dataframes <- lapply(c(1,2,3), x_improved)         # create list of dataframes

df <- do.call("rbind", list_dataframes)                 # convert list to dataframe

The code above does this; see "Expected output" below.

Expected output:

> df
          X1        X2 sim_nr
1  0.4660881 0.1566533      1
2  0.8910125 1.4188935      1
3  0.4869978 1.0766866      1
4  0.6355544 0.4309315      1
5  0.6874804 1.6269075      1
6  1.2132321 1.3443201      2
7  0.8524643 1.2769588      2
8  1.4475628 0.9690441      2
9  1.4390667 0.8470187      2
10 1.4107905 0.8097645      2
11 0.6526465 0.4384457      3
12 0.8960414 0.7985576      3
13 0.3673018 0.7666723      3
14 2.0844780 1.3899826      3
15 1.6039810 0.9583155      3

Question:

Is this the proper (or idiomatic R) way to address this problem? Are there more efficient (or more convenient) solutions?

Another approach would be to use an array, which can be more performant if you need to do a lot of grouping operations.

set.seed(123)
replicate(3, matrix(rnorm(10, 1, 0.5), ncol = 2))
, , 1

          [,1]      [,2]
[1,] 0.7197622 1.8575325
[2,] 0.8849113 1.2304581
[3,] 1.7793542 0.3674694
[4,] 1.0352542 0.6565736
[5,] 1.0646439 0.7771690

, , 2

          [,1]       [,2]
[1,] 1.6120409 1.89345657
[2,] 1.1799069 1.24892524
[3,] 1.2003857 0.01669142
[4,] 1.0553414 1.35067795
[5,] 0.7220794 0.76360430

, , 3

          [,1]      [,2]
[1,] 0.4660881 0.1566533
[2,] 0.8910125 1.4188935
[3,] 0.4869978 1.0766866
[4,] 0.6355544 0.4309315
[5,] 0.6874804 1.6269075
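
To make the "grouping operations" point a bit more concrete, here is a minimal sketch of per-simulation summaries on such an array. The names `sims`, `col_means` and `second_sim` are introduced only for illustration; the third dimension of the array indexes the simulation.

```r
set.seed(123)

# 5 rows x 2 columns x 3 simulations (`sims` is an illustrative name)
sims <- replicate(3, matrix(rnorm(10, 1, 0.5), ncol = 2))

# Mean of each column within each simulation: a 2 x 3 matrix
# (rows = matrix column, columns = simulation number)
col_means <- apply(sims, c(2, 3), mean)

# Extract one simulation as an ordinary 5 x 2 matrix
second_sim <- sims[, , 2]
```

Because every simulation has the same shape, the "group" is simply an array dimension, so no splitting by a `sim_nr` column is needed.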

Or, if you want a data.frame, it's often faster to do all of your rnorm simulations at once. Note that even with the seed set, this isn't an exact match: the matrix is filled column by column, so the ordering is slightly different.

set.seed(123)
n_sim <- 3
data.frame(matrix(rnorm(10 * n_sim, 1, 0.5), ncol = 2),
           sim_nr = rep(seq_len(n_sim), each = 5)
  )
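
If the exact per-simulation ordering matters, one possible workaround (a sketch, not part of the original answer) is to fill a 3-D array with the single `rnorm` call and then reorder its dimensions before flattening. The names `arr`, `stacked` and `df_all` are illustrative.

```r
set.seed(123)
n_sim <- 3

# One rnorm() call; each 5 x 2 slice holds the draws of one simulation
arr <- array(rnorm(10 * n_sim, 1, 0.5), dim = c(5, 2, n_sim))

# Reorder to (row, simulation, column) and flatten to two columns,
# which stacks the slices on top of each other
stacked <- matrix(aperm(arr, c(1, 3, 2)), ncol = 2)

df_all <- data.frame(stacked, sim_nr = rep(seq_len(n_sim), each = 5))
```

With the same seed, this should give the same rows as binding the per-simulation dataframes, since the random draws are consumed in the same order.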

One way to improve it, at least in number of lines, is to use transform, so that the function x_improved becomes a one-liner:

set.seed(123)
x_improved <- function(sim_nr){
   transform(data.frame(matrix(rnorm(10, 1,.5), ncol=2)), sim_nr = sim_nr)
}

do.call(rbind, lapply(1:3, x_improved))


#          X1         X2 sim_nr
#1  0.7197622 1.85753249      1
#2  0.8849113 1.23045810      1
#3  1.7793542 0.36746938      1
#4  1.0352542 0.65657357      1
#5  1.0646439 0.77716901      1
#6  1.6120409 1.89345657      2
#7  1.1799069 1.24892524      2
#8  1.2003857 0.01669142      2
#9  1.0553414 1.35067795      2
#10 0.7220794 0.76360430      2
#11 0.4660881 0.15665334      3
#12 0.8910125 1.41889352      3
#13 0.4869978 1.07668656      3
#14 0.6355544 0.43093153      3
#15 0.6874804 1.62690746      3

Or, depending on your use case, you could construct the whole dataframe at once.

num <- 1:3
transform(data.frame(matrix(rnorm(10 * length(num), 1,.5), ncol=2)), 
          sim_nr = rep(num, each = 10/2))

Using the purrr library:

purrr::map_df(c(1,2,3), ~data.frame(matrix(rnorm(10, 1, .5), ncol=2)), .id='sim_nr') 
#Using the x function it would be 
purrr::map_df(c(1,2,3), ~x() , .id='sim_nr')

The simplest solution is to use rbindlist from the data.table library:

> library(data.table)
> rbindlist(list_dataframes)

You can of course apply it to your list_dataframes either outside or inside of a "for" loop.
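
As a rough illustration of the "for" loop variant in base R: storing each result in a preallocated list and binding once after the loop avoids copying a growing dataframe on every iteration. The function below is a compact version of `x_improved` from the question, and `n_sim`, `results` and `df_loop` are illustrative names.

```r
# Compact version of x_improved() from the question
x_improved <- function(sim_nr) {
  df <- data.frame(matrix(rnorm(10, 1, 0.5), ncol = 2))  # simulate data
  df$sim_nr <- sim_nr                                    # add reference number
  df
}

set.seed(123)
n_sim <- 3
results <- vector("list", n_sim)     # preallocate one slot per simulation

for (i in seq_len(n_sim)) {
  results[[i]] <- x_improved(i)      # store the result, do not bind yet
}

df_loop <- do.call(rbind, results)   # single bind outside the loop
```

If data.table is installed, `rbindlist(results)` could stand in for the final `do.call(rbind, results)` line.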
