
How to combine columns based on column name in R?

Similar questions have been asked here, here, and here. However, I can't get those solutions to work for my problem.

I'm trying to combine columns based on their names and then create a matrix/data frame of every pair of variables. Hopefully my example will explain this more clearly.

For example, imagine we have a data frame like the one below:

# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)

I am applying a process (that is outside of my control) that splits the categorical variables into dummy variables (with a dummy for each level). In the end, I get a matrix where the columns are every pairwise combination of those variables. The output looks something like this:

# create matrix of all pairs
name2 <- c('day.1', 'day.2',
           'rain.yes', 'rain.no', 'val1')

nam2 <- expand.grid(name2, name2)
# paste0() is vectorised, so no explicit loop is needed
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
     day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,]    9.242116   30.973623      0.9311719      1.943265   20.23192   3.8058106
[2,]    7.238372    6.248052     17.4839077      5.251022   11.23247   0.7162231
[3,]    1.046449   11.744293      2.4999295      3.380434   11.31048   4.2160769
     day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,]    0.766974      17.576561      9.348420  0.9030936       2.066487
[2,]    4.979445       5.406032      3.905483  5.5516888       8.371235
[3,]   13.735530       1.925034      1.250488  6.5460690       9.214908
     rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,]       10.15267          7.067098         3.527963     13.420953
[2,]       19.75727         22.259788         9.411371     10.040507
[3,]       15.76831         11.416835         8.630324      1.451295
     rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,]      4.330075    25.4360600        15.317283        0.349195     12.51062
[2,]      5.495578    11.0861832        11.256991       13.882071     30.86277
[3,]      6.680542     0.2620275         9.630859       36.926827     11.38734
     val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168   6.731429     14.124068    12.797966 41.294648
[2,] 3.69760484   6.137335      1.391675    12.639562  1.033024
[3,] 0.08002734  23.743638      6.804015     9.374034 27.107049

As shown above, newMat2 contains every pair of variables after the categorical variables have been split into dummies.

What I'm trying to do is recombine those dummy variables back into a single variable by summing the appropriate columns row-wise. My final output would be a matrix/data frame of every pair of recombined variables.

For example, take just the variable day. This variable has been split into day.1 and day.2. If I recombined this variable for every pair, we would have one column each for day.day, day.rain, and day.val1. Doing this manually could look like this:

day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)

In the above code, I'm summing (row-wise) the columns that should be combined.
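
The hard-coded indices above can also be found from the column names. The following is only a sketch for the single day/rain pair. It assumes that every dummy column is named either "<variable>" or "<variable>.<level>", as in the example, and the name day_rain_cols is introduced here purely for illustration:

```r
# Rebuild the example matrix from the question (same seed, same column names).
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2

# Columns whose left-hand variable is "day" and right-hand variable is "rain".
# Assumption: names follow "<variable>" or "<variable>.<level>" on each side of ":".
day_rain_cols <- grep("^day(\\.[^:]*)?:rain(\\.[^:]*)?$", colnames(newMat2))

# rowSums() gives the same result as apply(..., 1, sum) here.
day.rain <- rowSums(newMat2[, day_rain_cols, drop = FALSE])
```

With the example data, day_rain_cols should pick out the same four columns (3, 4, 8, 9) that were listed by hand.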

Desired output:

More explicitly, if I were to recombine all of newMat2 manually, it would look like this:

dfNew <- data.frame(
            day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum),
            day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
            day.val1 = apply(newMat2[,c(5,10)], 1, sum),
            rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
            rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
            rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
            val1.day = apply(newMat2[,c(21,22)], 1, sum),
            val1.rain = apply(newMat2[,c(23,24)], 1, sum), 
            val1.val1 = newMat2[,c(25)] 
)
> dfNew  
   day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203
2 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124
3 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805
  val1.val1
1 41.294648
2  1.033024
3 27.107049

However, in my real data I have over 1000 columns, some with many different factor levels, so combining them manually would take a long time. Is there any way to automate this process?

With tidyverse functions:

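library(tidyverse)

newMat2 %>% 
  as_tibble(rownames = "id") %>%   # with no row names on the matrix, id is the row number
  pivot_longer(-id) %>%            # one row per (id, original column name)
  # keep only the original variable names and the ":" separator,
  # e.g. "day.1:rain.yes" becomes "day:rain"
  mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
                        paste0, collapse = "")) %>% 
  group_by(id, name) %>% 
  summarise(value = sum(value), .groups = "drop") %>%   # sum the dummies within each pair
  pivot_wider(names_from = name, values_from = value)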

  id    `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1          44.8      29.8        21.1       42.0        26.3        25.9       8.15        26.9       41.3 
2 2          19.2      32.0        16.8       44.7        56.8        40.9       9.83        14.0        1.03
3 3          30.7       9.06       17.9       31.9        66.6        12.8      23.8         16.2       27.1 
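
Another option is to split the columns by their recombined name with split.default() and sum each group with rowSums() (list.cbind() comes from the rlist package):

library(rlist)
df <- as.data.frame(newMat2)
# Reduce each column name to "<var1>.<var2>", e.g. "day.1:rain.yes" becomes "day.rain".
# The pattern assumes variable names contain only lowercase letters and digits.
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))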

      day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124  1.033024
[3,] 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805 27.107049
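
For reference, a similar result can be sketched in base R alone with rowsum(), which sums the rows of a matrix by group and is applied here to the transposed matrix. This is an illustrative variant, not part of the answers above. The helper to_var() and the vector vars (standing in for colnames(dfOG)) are hypothetical names, and the code assumes each dummy is named either "<variable>" or "<variable>.<level>":

```r
# Rebuild the example matrix from the question (same seed, same column names).
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3, ncol = nrow(nam2))
colnames(newMat2) <- paste0(nam2$Var2, ":", nam2$Var1)

vars <- c("day", "rain", "val1")  # stands in for colnames(dfOG)

# Hypothetical helper: map dummy names such as "rain.yes" back to "rain".
to_var <- function(x) {
  unname(vapply(x, function(nm) {
    m <- vars[nm == vars | startsWith(nm, paste0(vars, "."))]
    if (length(m) == 0) stop("no source variable found for: ", nm)
    m[which.max(nchar(m))]  # if several variables match, keep the longest name
  }, character(1)))
}

# Build one grouping key per column, e.g. "day.1:rain.yes" becomes "day.rain".
parts <- strsplit(colnames(newMat2), ":", fixed = TRUE)
key <- vapply(parts, function(p) paste(to_var(p), collapse = "."), character(1))

# rowsum() sums rows by group, so transpose, sum, and transpose back.
# reorder = FALSE keeps the groups in order of first appearance.
res <- t(rowsum(t(newMat2), group = key, reorder = FALSE))
```

Because the key is built from the known variable names rather than from a character-class pattern, this variant should also tolerate variable names with uppercase letters or underscores, as long as the "<variable>.<level>" naming assumption holds.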
