How to combine columns based on column name in R?
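Similar questions have been asked here, here, and here. However, I could not get those solutions to work for my problem.
I am trying to combine columns based on their names and then build a matrix/data frame with one column for each pair of variables. Hopefully my example explains this more clearly.
For example, suppose we have a data frame like this: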
# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)
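I am applying a process (which I have no control over) that splits the categorical variables into dummy variables, one dummy variable per level. At the end I get a matrix whose columns are all pairs of these variables. The output looks like this: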
# create matrix of all pairs
name2 <- c('day.1', 'day.2',
           'rain.yes', 'rain.no', 'val1')
nam2 <- expand.grid(name2, name2)
# build "left:right" names; paste0 is vectorised, so no loop is needed
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = .1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,] 9.242116 30.973623 0.9311719 1.943265 20.23192 3.8058106
[2,] 7.238372 6.248052 17.4839077 5.251022 11.23247 0.7162231
[3,] 1.046449 11.744293 2.4999295 3.380434 11.31048 4.2160769
day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,] 0.766974 17.576561 9.348420 0.9030936 2.066487
[2,] 4.979445 5.406032 3.905483 5.5516888 8.371235
[3,] 13.735530 1.925034 1.250488 6.5460690 9.214908
rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,] 10.15267 7.067098 3.527963 13.420953
[2,] 19.75727 22.259788 9.411371 10.040507
[3,] 15.76831 11.416835 8.630324 1.451295
rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,] 4.330075 25.4360600 15.317283 0.349195 12.51062
[2,] 5.495578 11.0861832 11.256991 13.882071 30.86277
[3,] 6.680542 0.2620275 9.630859 36.926827 11.38734
val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168 6.731429 14.124068 12.797966 41.294648
[2,] 3.69760484 6.137335 1.391675 12.639562 1.033024
[3,] 0.08002734 23.743638 6.804015 9.374034 27.107049
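As shown above, newMat2 contains a column for every pair of variables, after the categorical variables have been split into dummy variables.
What I want to do is recombine these dummy variables into a single variable by summing the rows of the corresponding columns. My final output would be a matrix/data frame with one column for each pair of recombined variables.
For example, consider just the variable day. It has been split into day.1 and day.2. If I recombine this variable for each pair, we get the columns day.day, day.rain, and day.val1. Doing this manually might look like this: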
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)
In the code above, I sum (row-wise) the columns that should be combined.
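The hard-coded column indices above can also be derived from the column names. The following is only a minimal sketch for a single pair: `pair_cols` is a hypothetical helper (it is not part of the original question), and it assumes that variable names contain no "." and that neither names nor levels contain ":".

```r
# Rebuild the example matrix from the question
set.seed(100)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(nam2$Var2, ":", nam2$Var1)))

# Hypothetical helper: TRUE for columns whose left half belongs to `lhs`
# and whose right half belongs to `rhs`, ignoring the ".level" suffix
pair_cols <- function(nms, lhs, rhs) {
  halves <- strsplit(nms, ":", fixed = TRUE)
  base <- function(x) sub("\\..*$", "", x)  # drop everything from the first "."
  vapply(halves,
         function(h) base(h[1]) == lhs && base(h[2]) == rhs,
         logical(1))
}

# Same columns as newMat2[, c(3, 4, 8, 9)], but selected by name
day.rain <- rowSums(newMat2[, pair_cols(colnames(newMat2), "day", "rain"),
                            drop = FALSE])
```

Selecting by name rather than by position keeps working if the column order of the real data differs from the toy example.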
Desired output:
To be more explicit, if I were to recombine the whole of newMat2 manually, it would look like this:
dfNew <- data.frame(
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum),
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
day.val1 = apply(newMat2[,c(5,10)], 1, sum),
rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
val1.day = apply(newMat2[,c(21,22)], 1, sum),
val1.rain = apply(newMat2[,c(23,24)], 1, sum),
val1.val1 = newMat2[,c(25)]
)
> dfNew
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203
2 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124
3 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805
val1.val1
1 41.294648
2 1.033024
3 27.107049
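However, in my real data I have more than 1000 columns, some of them with many different factor levels, so combining them manually would take a very long time. Is there a way to automate this process?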
Using tidyverse functions:
library(tidyverse)
newMat2 %>%
as_tibble(rownames = "id") %>%
pivot_longer(-id) %>%
mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
paste0, collapse = "")) %>%
group_by(id, name) %>%
summarise(value = sum(value)) %>%
pivot_wider()
id `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1 44.8 29.8 21.1 42.0 26.3 25.9 8.15 26.9 41.3
2 2 19.2 32.0 16.8 44.7 56.8 40.9 9.83 14.0 1.03
3 3 30.7 9.06 17.9 31.9 66.6 12.8 23.8 16.2 27.1
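The pipeline above relies on the original variable names in `dfOG` to build its pattern. If `dfOG` is not at hand, one possible variant (a sketch, not the answerer's code) splits each name on ":" and strips the ".level" suffix from both halves. It assumes variable names contain no "." and that neither names nor levels contain ":"; the column names `lhs` and `rhs` are introduced here purely for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Rebuild the example matrix from the question
set.seed(100)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(nam2$Var2, ":", nam2$Var1)))

res <- newMat2 %>%
  as_tibble(rownames = "id") %>%
  pivot_longer(-id) %>%
  # "day.1:rain.yes" -> lhs = "day.1", rhs = "rain.yes"
  separate(name, into = c("lhs", "rhs"), sep = ":") %>%
  # drop the ".level" suffix: "day.1" -> "day", "val1" stays "val1"
  mutate(across(c(lhs, rhs), ~ str_remove(.x, "\\..*$"))) %>%
  group_by(id, lhs, rhs) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = c(lhs, rhs), values_from = value, names_sep = ".")
```

With `names_sep = "."` the resulting columns are named `day.day`, `day.rain`, and so on, matching the naming used in the desired output.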
Another option, using split.default and rowSums from base R together with rlist::list.cbind:
library(rlist)
df <- as.data.frame(newMat2)
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124 1.033024
[3,] 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805 27.107049
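If the extra dependency on rlist is unwanted, the same grouping idea can probably be expressed in base R alone. The sketch below uses `rowsum()` on the transposed matrix; it is an alternative technique rather than the answer's method, and it assumes variable names contain no "." and that neither names nor levels contain ":". The names `grp` and `res` are illustrative.

```r
# Rebuild the example matrix from the question
set.seed(100)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(nam2$Var2, ":", nam2$Var1)))

# Strip every ".level" suffix, then turn the ":" separator into "."
# e.g. "day.1:rain.yes" -> "day:rain" -> "day.rain"
grp <- gsub("\\.[^:]*", "", colnames(newMat2))
grp <- sub(":", ".", grp, fixed = TRUE)

# rowsum() adds up rows by group, so transpose, sum, and transpose back;
# reorder = FALSE keeps groups in order of first appearance
res <- t(rowsum(t(newMat2), group = grp, reorder = FALSE))
```

Because the result stays a numeric matrix throughout, this avoids the conversion to a data frame, which may matter with more than 1000 columns.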