How to combine columns based on column name in R?

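Similar questions have been asked before, but I could not get those solutions to work for my problem.

I am trying to combine columns based on their names, and then create a matrix/data frame for each pair of variables. Hopefully my example explains it more clearly.

For example, say we have a data frame that looks like this: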

# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)

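I am applying a process (which is out of my control) that splits the categorical variables into dummy variables, one dummy variable per level. In the end I get a matrix whose columns are every pair of variables. The output looks like this: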

# create matrix of all pairs
name2 <- c('day.1', 'day.2',
           'rain.yes', 'rain.no', 'val1')

nam2 <- expand.grid(name2, name2)
# paste0() is vectorised, so no explicit loop is needed
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
     day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,]    9.242116   30.973623      0.9311719      1.943265   20.23192   3.8058106
[2,]    7.238372    6.248052     17.4839077      5.251022   11.23247   0.7162231
[3,]    1.046449   11.744293      2.4999295      3.380434   11.31048   4.2160769
     day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,]    0.766974      17.576561      9.348420  0.9030936       2.066487
[2,]    4.979445       5.406032      3.905483  5.5516888       8.371235
[3,]   13.735530       1.925034      1.250488  6.5460690       9.214908
     rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,]       10.15267          7.067098         3.527963     13.420953
[2,]       19.75727         22.259788         9.411371     10.040507
[3,]       15.76831         11.416835         8.630324      1.451295
     rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,]      4.330075    25.4360600        15.317283        0.349195     12.51062
[2,]      5.495578    11.0861832        11.256991       13.882071     30.86277
[3,]      6.680542     0.2620275         9.630859       36.926827     11.38734
     val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168   6.731429     14.124068    12.797966 41.294648
[2,] 3.69760484   6.137335      1.391675    12.639562  1.033024
[3,] 0.08002734  23.743638      6.804015     9.374034 27.107049

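As we can see above, newMat2 contains every pair of variables after the categorical variables have been split into dummy variables.

What I want to do is recombine these dummy variables into a single variable by summing the rows of the corresponding columns. My final output would be a matrix/data frame of every pair of recombined variables.

For example, consider only the variable day. It has been split into day.1 and day.2. If I recombine this variable for each pair, we get the columns day.day, day.rain, and day.val1. Doing this manually might look like this: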

day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)

In the code above, I sum (row-wise) the columns that should be combined.
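The hard-coded indices above can also be derived from the column names. The following is only a minimal sketch: it assumes every dummy column is named <variable>.<level>, that pairs are joined with ":", and that variable names contain no regex metacharacters. The helper sum_pair is hypothetical and not part of any package.

```r
# Rebuild the example matrix from the question.
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(nam2$Var2, ":", nam2$Var1)))

# Hypothetical helper: row-wise sum of every column whose left side is
# variable `a` (with or without a ".level" suffix) and whose right side
# is variable `b`.
sum_pair <- function(m, a, b) {
  pattern <- paste0("^", a, "(\\.[^:]*)?:", b, "(\\.[^:]*)?$")
  rowSums(m[, grepl(pattern, colnames(m)), drop = FALSE])
}

day.day  <- sum_pair(newMat2, "day", "day")
day.rain <- sum_pair(newMat2, "day", "rain")
day.val1 <- sum_pair(newMat2, "day", "val1")
```

Selecting by name rather than by position keeps working if the column order changes, but it still needs one call per pair.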

Desired output:

To be more explicit, if I were to recombine the whole of newMat2 by hand, it would look like this:

dfNew <- data.frame(
            day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum),
            day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
            day.val1 = apply(newMat2[,c(5,10)], 1, sum),
            rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
            rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
            rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
            val1.day = apply(newMat2[,c(21,22)], 1, sum),
            val1.rain = apply(newMat2[,c(23,24)], 1, sum), 
            val1.val1 = newMat2[,c(25)] 
)
> dfNew  
   day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203
2 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124
3 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805
  val1.val1
1 41.294648
2  1.033024
3 27.107049

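However, my real data has over 1000 columns, some with many different factor levels, so combining them manually would take a very long time. Is there a way to automate this process?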

Using tidyverse functions:

library(tidyverse)

newMat2 %>% 
  as_tibble(rownames = "id") %>% 
  pivot_longer(-id) %>% 
  mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
                        paste0, collapse = "")) %>% 
  group_by(id, name) %>% 
  summarise(value = sum(value)) %>% 
  pivot_wider()

  id    `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1          44.8      29.8        21.1       42.0        26.3        25.9       8.15        26.9       41.3 
2 2          19.2      32.0        16.8       44.7        56.8        40.9       9.83        14.0        1.03
3 3          30.7       9.06       17.9       31.9        66.6        12.8      23.8         16.2       27.1 
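To see what the mutate() step in the pipeline above does to the names: the pattern "day|rain|val1|:" keeps only the original variable names and the separator, so every dummy-level suffix is dropped. A small base R sketch of the same extraction, assuming the variable names are day, rain, and val1 as in dfOG (the three column names below are just a sample):

```r
vars <- c("day", "rain", "val1")               # colnames(dfOG) in the question
pattern <- paste(c(vars, ":"), collapse = "|") # "day|rain|val1|:"

cols <- c("day.1:rain.yes", "rain.no:val1", "val1:val1")

# Extract every match of the pattern, then glue the pieces back together.
pieces <- regmatches(cols, gregexpr(pattern, cols))
collapsed <- vapply(pieces, paste0, character(1), collapse = "")
print(collapsed)
```

One caveat: if a level label itself contains a variable name (say a level called rainy), the pattern matches inside that label too, and the collapsed name comes out wrong.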

Using split.default() from base R, with rlist only for the final column bind:

library(rlist)
df <- as.data.frame(newMat2)
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))

      day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124  1.033024
[3,] 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805 27.107049
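If the extra package is not wanted, base R's rowsum() can do the grouped sum on the transposed matrix. This is a sketch rather than a drop-in replacement: it reuses the regular expression from the answer above, so it shares the assumption that variable names consist only of lowercase letters and digits.

```r
# Rebuild the example matrix from the question.
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(nam2$Var2, ":", nam2$Var1)))

# Map each column name to its recombined pair name, e.g.
# "day.1:rain.yes" -> "day.rain".
key <- gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(newMat2))

# rowsum() sums rows by group, so transpose, sum, and transpose back.
# reorder = FALSE keeps the groups in order of first appearance.
combined <- t(rowsum(t(newMat2), group = key, reorder = FALSE))
print(colnames(combined))
```

Because the whole computation stays on a numeric matrix, this avoids converting to a data frame and splitting it into a list, which may matter with more than 1000 columns.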
