
Multiply each row of one dataframe by all rows of a second dataframe

I am struggling with this operation because my datasets are very large, so I have provided an example of what I want.

I have two dataframes.

df1 - contains sampling-derived iterations of the parameter for each variable, with the variable given as the column name (10,000 rows)

df2 - contains the actual value of each variable, with the variable given as the column name (4,000 rows)

I want a df3 that is effectively each row of df2 multiplied by every row of df1, and would therefore have 4,000 * 10,000 rows.

As a short example, I have provided minimal versions of df1 and df2. The output I am looking for is shown in df3.

df1 <- structure(list(intercept = c(3.4, 3.6, 3.7), age = c(0.08, 0.05, 
0.06), male = c(0.07, 0.06, 0.07)), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(id = structure(1:2, .Label = c("a", "b"), class = "factor"), 
intercept = c(1L, 1L), age = c(40L, 45L), male = 1:0), class = "data.frame", row.names = c(NA, 
-2L))

df3 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), intercept = c(3.4, 3.6, 3.7, 3.4, 3.6, 
3.7), age = c(3.2, 2, 2.4, 3.6, 2.25, 2.7), male = c(0.07, 0.06, 
0.07, 0, 0, 0)), class = "data.frame", row.names = c(NA, -6L))

Can somebody point me to an efficient way to do this in R?

Another idea via base R, using outer:

data.frame(id = rep(df2$id, each = nrow(df1)), 
           mapply(function(x, y)c(outer(x, y, `*`)), df1, df2[-1])
           )

which gives:

  id intercept  age male
1  a       3.4 3.20 0.07
2  a       3.6 2.00 0.06
3  a       3.7 2.40 0.07
4  b       3.4 3.60 0.00
5  b       3.6 2.25 0.00
6  b       3.7 2.70 0.00
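To see why the flattened outer() result lines up with rep(df2$id, each = nrow(df1)), it may help to look at a single column. The sketch below uses the age values from the question's example; the names age_coef, age_value, m and age_out are made up for illustration.

```r
# One column from the question's example data (names are illustrative only)
age_coef  <- c(0.08, 0.05, 0.06)  # df1$age
age_value <- c(40L, 45L)          # df2$age

# outer() returns a 3 x 2 matrix: rows follow df1, columns follow df2
m <- outer(age_coef, age_value, `*`)

# c() flattens column by column, so all df1 rows for the first df2 row
# come first, followed by all df1 rows for the second df2 row
age_out <- c(m)
age_out
```

Because R stores matrices column-major, the flattened vector is already in the "each df2 row repeated nrow(df1) times" order, which is why no reordering is needed before binding the id column.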

You can perform a row-wise Kronecker product (kr from the MGLM package) as below:

out <- data.frame(id = rep(df2$id,each=nrow(df1)),
                  t(MGLM::kr(t(df2[-1]),t(df1))))

such that

> out
  id intercept  age male
1  a       3.4 3.20 0.07
2  a       3.6 2.00 0.06
3  a       3.7 2.40 0.07
4  b       3.4 3.60 0.00
5  b       3.6 2.25 0.00
6  b       3.7 2.70 0.00
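If installing MGLM is not an option, a similar per-column result can be sketched with base R's kronecker(). This is not the MGLM call above, only an assumed stand-in for the small example, and since kronecker() is built on outer() it is essentially the same idea as the outer-based answer. The name out_base is hypothetical.

```r
# Example data from the question, rebuilt with data.frame() for brevity
df1 <- data.frame(intercept = c(3.4, 3.6, 3.7),
                  age       = c(0.08, 0.05, 0.06),
                  male      = c(0.07, 0.06, 0.07))
df2 <- data.frame(id        = factor(c("a", "b")),
                  intercept = c(1L, 1L),
                  age       = c(40L, 45L),
                  male      = 1:0)

# For each pair of matching columns, kronecker(y, x) yields
# y[1] * x, then y[2] * x, ...; c() drops the dim attribute
out_base <- data.frame(
  id = rep(df2$id, each = nrow(df1)),
  mapply(function(x, y) c(kronecker(y, x)), df1, df2[-1])
)
out_base
```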

Benchmarking (so far the approach by @Sotos is the winner):

df1 <- do.call(rbind, replicate(500, structure(
  list(intercept = c(3.4, 3.6, 3.7), age = c(0.08, 0.05, 0.06), male = c(0.07, 0.06, 0.07)),
  class = "data.frame", row.names = c(NA, -3L)), simplify = FALSE))

df2 <- do.call(rbind, replicate(100, structure(
  list(id = structure(1:2, .Label = c("a", "b"), class = "factor"),
       intercept = c(1L, 1L), age = c(40L, 45L), male = 1:0),
  class = "data.frame", row.names = c(NA, -2L)), simplify = FALSE))

library(MGLM)
library(purrr)
library(microbenchmark)  # needed for the microbenchmark() call below

f_ThomasIsCoding <- function() {
  data.frame(id = rep(df2$id,each=nrow(df1)),
                    t(MGLM::kr(t(df2[-1]),t(df1))))
}

f_tmfmnk_1 <- function() {
  map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x))
}

f_tmfmnk_2 <- function() {
  data.frame(do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x))),
             id = rep(df2$id, each = nrow(df1)))
}

f_RonakShah <- function() {
  new1 <- df1[rep(seq(nrow(df1)), nrow(df2)), ] 
  new2 <- df2[rep(seq(nrow(df2)), each = nrow(df1)),]
  out <- cbind(new2[1], new1 * new2[-1])
  rownames(out) <- NULL
  out
}

f_Sotos <- function() {
  data.frame(id = rep(df2$id, each = nrow(df1)), 
             mapply(function(x, y)c(outer(x, y, `*`)), df1, df2[-1])
  )
}

bmk <- microbenchmark(times = 20,
               unit = "relative",
               f_ThomasIsCoding(),
               f_tmfmnk_1(),
               f_tmfmnk_2(),
               f_RonakShah(),
               f_Sotos())

which gives

> bmk
Unit: relative
               expr       min        lq      mean    median       uq       max neval
 f_ThomasIsCoding()  1.186124  1.218201  1.197346  1.321731 1.042721  1.077854    20
       f_tmfmnk_1()  7.594520  7.572723  4.539698  7.297610 2.437621  3.446436    20
       f_tmfmnk_2()  9.670286 12.212220  6.583183 11.888061 3.370593  4.088534    20
      f_RonakShah() 28.918724 28.861437 16.707258 27.889563 8.403161 11.668252    20
          f_Sotos()  1.000000  1.000000  1.000000  1.000000 1.000000  1.000000    20
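The benchmark above uses 1,500 x 200 rows, far smaller than the sizes in the question. Before running any of these approaches on the full data, it may be worth estimating the size of the result. The sketch below is back-of-the-envelope arithmetic only, assuming three numeric columns stored as 8-byte doubles and ignoring the id column and any intermediate copies; all variable names are illustrative.

```r
# Sizes stated in the question
n_df1  <- 10000  # rows in df1
n_df2  <- 4000   # rows in df2
n_cols <- 3      # intercept, age, male

n_rows <- n_df1 * n_df2       # rows in df3
bytes  <- n_rows * n_cols * 8 # 8 bytes per double
gib    <- bytes / 1024^3      # size in GiB

c(rows = n_rows, GiB = round(gib, 2))
```

That comes to 40 million rows and a little under 1 GiB for the numeric columns alone, so approaches that build large intermediate objects could need several times that amount of memory.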

One option involving purrr could be:

map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x))

  intercept  age male
1       3.4 3.20 0.07
2       3.6 2.00 0.06
3       3.7 2.40 0.07
4       3.4 3.60 0.00
5       3.6 2.25 0.00
6       3.7 2.70 0.00

If the id column is also important:

data.frame(map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x)),
           id = rep(df2$id, each = nrow(df1)))

  intercept  age male id
1       3.4 3.20 0.07  a
2       3.6 2.00 0.06  a
3       3.7 2.40 0.07  a
4       3.4 3.60 0.00  b
5       3.6 2.25 0.00  b
6       3.7 2.70 0.00  b

The same with base R:

do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x)))

Or:

data.frame(do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x))),
           id = rep(df2$id, each = nrow(df1)))
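For readers less familiar with sweep(), here is a sketch of what a single iteration of the loops above does, using the first row of the example df2 (id "a"). The names row_a and scaled_a are made up for illustration.

```r
# Example df1 from the question, rebuilt with data.frame() for brevity
df1 <- data.frame(intercept = c(3.4, 3.6, 3.7),
                  age       = c(0.08, 0.05, 0.06),
                  male      = c(0.07, 0.06, 0.07))

# Values of the first df2 row, without the id column
row_a <- c(intercept = 1, age = 40, male = 1)

# MARGIN = 2 applies one statistic per column:
# every value in column j of df1 is multiplied by row_a[j]
scaled_a <- sweep(df1, 2, row_a, FUN = `*`)
scaled_a
```

Each df2 row produces one such block of nrow(df1) rows, and map_dfr() or do.call(rbind, ...) then stacks the blocks in df2 row order.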

You could repeat the rows of each dataframe based on the number of rows in the other dataframe and multiply them directly:

df1[rep(seq(nrow(df1)), nrow(df2)),] * df2[rep(seq(nrow(df2)), each = nrow(df1)),-1]

#    intercept  age male
#1         3.4 3.20 0.07
#2         3.6 2.00 0.06
#3         3.7 2.40 0.07
#1.1       3.4 3.60 0.00
#2.1       3.6 2.25 0.00
#3.1       3.7 2.70 0.00
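The pairing relies on the two different forms of rep(). As a small sketch with the example sizes (3 rows in df1, 2 rows in df2; idx1 and idx2 are illustrative names):

```r
n1 <- 3L  # nrow(df1) in the example
n2 <- 2L  # nrow(df2) in the example

# Repeating the whole sequence cycles through df1 once per df2 row
idx1 <- rep(seq_len(n1), n2)         # 1 2 3 1 2 3

# each = repeats every df2 row index n1 times before moving on
idx2 <- rep(seq_len(n2), each = n1)  # 1 1 1 2 2 2

# Row i of the result pairs df1[idx1[i], ] with df2[idx2[i], ]
cbind(idx1, idx2)
```

Together the two index vectors enumerate every (df1 row, df2 row) combination exactly once, which is why the element-wise product of the two expanded dataframes gives the full result.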

To also get the id column:

new1 <- df1[rep(seq(nrow(df1)), nrow(df2)), ] 
new2 <- df2[rep(seq(nrow(df2)), each = nrow(df1)),]
out <- cbind(new2[1], new1 * new2[-1])
rownames(out) <- NULL

out
#  id intercept  age male
#1  a       3.4 3.20 0.07
#2  a       3.6 2.00 0.06
#3  a       3.7 2.40 0.07
#4  b       3.4 3.60 0.00
#5  b       3.6 2.25 0.00
#6  b       3.7 2.70 0.00
