How to combine columns based on column name in R?
Similar questions have been asked here, here, and here. However, I can't get those solutions to work for my problem.
I'm trying to combine columns based on their names and then create a matrix/dataframe of every pair of variables. Hopefully my example will explain this more clearly.
For example, imagine we have a data frame like the one below:
# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)
I am applying a process (outside of my control) that splits the categorical variables into dummy variables (one dummy for each level). In the end, I get a matrix whose columns are every pairwise combination of variables. The output looks something like this:
# create matrix of all pairs
name2 <- c('day.1', 'day.2',
'rain.yes', 'rain.no', 'val1')
nam2 <- expand.grid(name2, name2)
# paste0() is vectorised, so no loop is needed to build the pair names
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,] 9.242116 30.973623 0.9311719 1.943265 20.23192 3.8058106
[2,] 7.238372 6.248052 17.4839077 5.251022 11.23247 0.7162231
[3,] 1.046449 11.744293 2.4999295 3.380434 11.31048 4.2160769
day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,] 0.766974 17.576561 9.348420 0.9030936 2.066487
[2,] 4.979445 5.406032 3.905483 5.5516888 8.371235
[3,] 13.735530 1.925034 1.250488 6.5460690 9.214908
rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,] 10.15267 7.067098 3.527963 13.420953
[2,] 19.75727 22.259788 9.411371 10.040507
[3,] 15.76831 11.416835 8.630324 1.451295
rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,] 4.330075 25.4360600 15.317283 0.349195 12.51062
[2,] 5.495578 11.0861832 11.256991 13.882071 30.86277
[3,] 6.680542 0.2620275 9.630859 36.926827 11.38734
val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168 6.731429 14.124068 12.797966 41.294648
[2,] 3.69760484 6.137335 1.391675 12.639562 1.033024
[3,] 0.08002734 23.743638 6.804015 9.374034 27.107049
As we can see above, newMat2 contains every pair of variables after the categorical variables have been split into dummies.
What I'm trying to do is recombine those dummy variables back into a single variable by summing the appropriate columns row-wise. My final output would be a matrix/dataframe of every pair of recombined variables.
For example, take the variable day. It has been split into day.1 and day.2. If I recombined this variable for every pair, we would have columns for day.day, day.rain, and day.val1. Doing this manually could look like this:
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)
In the above code, I'm summing, row-wise, the columns that should be combined.
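As a small illustration of the same manual step, the columns can be picked by name instead of by hard-coded position. This is only a sketch: `sum_pair` is a hypothetical helper (not part of the original post), and it assumes that each dummy column is named `variable.level` and that variable names contain no regex metacharacters.

```r
# Rebuild the example matrix from the question
name2 <- c('day.1', 'day.2', 'rain.yes', 'rain.no', 'val1')
nam2 <- expand.grid(name2, name2)
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = .1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2

# Hypothetical helper: row-wise sum of every column whose left-hand name is
# `lhs` (optionally followed by ".level") and whose right-hand name is `rhs`
sum_pair <- function(mat, lhs, rhs) {
  pattern <- paste0("^", lhs, "(\\..*)?:", rhs, "(\\..*)?$")
  rowSums(mat[, grepl(pattern, colnames(mat)), drop = FALSE])
}

day.day  <- sum_pair(newMat2, "day", "day")
day.rain <- sum_pair(newMat2, "day", "rain")
day.val1 <- sum_pair(newMat2, "day", "val1")
```

This still needs one call per pair, so it only removes the index bookkeeping, not the repetition.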
Desired output:
More explicitly, if I were to recombine all of newMat2 manually, it would look like this:
dfNew <- data.frame(
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum),
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
day.val1 = apply(newMat2[,c(5,10)], 1, sum),
rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
val1.day = apply(newMat2[,c(21,22)], 1, sum),
val1.rain = apply(newMat2[,c(23,24)], 1, sum),
val1.val1 = newMat2[,c(25)]
)
> dfNew
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203
2 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124
3 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805
val1.val1
1 41.294648
2 1.033024
3 27.107049
However, my real data has over 1000 columns, some with many different factor levels, so combining them manually would take a long time. Is there any way to automate this process?
With tidyverse functions:
library(tidyverse)
newMat2 %>%
as_tibble(rownames = "id") %>%
pivot_longer(-id) %>%
mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
paste0, collapse = "")) %>%
group_by(id, name) %>%
summarise(value = sum(value)) %>%
pivot_wider()
id `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1 44.8 29.8 21.1 42.0 26.3 25.9 8.15 26.9 41.3
2 2 19.2 32.0 16.8 44.7 56.8 40.9 9.83 14.0 1.03
3 3 30.7 9.06 17.9 31.9 66.6 12.8 23.8 16.2 27.1
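To see what the renaming step inside mutate() does, it may help to run it in isolation on a few column names. The sketch below uses a hand-written `vars` vector standing in for colnames(dfOG), and `dummy_names` is just an illustrative sample; both are assumptions made for the example.

```r
library(stringr)

vars <- c("day", "rain", "val1")                 # stands in for colnames(dfOG)
pattern <- paste(c(vars, ":"), collapse = "|")   # "day|rain|val1|:"

dummy_names <- c("day.1:rain.yes", "rain.no:val1", "val1:day.2")

# Keep only the variable names and the ":" separator, dropping the level suffixes
pieces <- str_extract_all(dummy_names, pattern)
collapsed <- vapply(pieces, paste0, character(1), collapse = "")
collapsed
# [1] "day:rain"  "rain:val1" "val1:day"
```

Because the pattern matches variable names anywhere in the string, this approach would likely misbehave if a factor level happened to contain one of the variable names, so it is worth checking the collapsed names on real data.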
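Alternatively, split the columns by their combined name and sum each group with rowSums() (rlist is used only to bind the results together):
library(rlist)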
df <- as.data.frame(newMat2)
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124 1.033024
[3,] 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805 27.107049
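If installing rlist is not an option, a similar result can probably be obtained with base R only. The sketch below reuses the same regular expression to build a grouping key and then uses rowsum() on the transposed matrix; `key` and `combined` are names chosen for this example, and the regex carries the same assumption as above (lower-case alphanumeric variable names followed by an optional `.level`).

```r
# Rebuild the example matrix from the question
name2 <- c('day.1', 'day.2', 'rain.yes', 'rain.no', 'val1')
nam2 <- expand.grid(name2, name2)
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = .1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2

# One key per column, e.g. "day.1:rain.yes" becomes "day.rain"
key <- gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(newMat2))

# rowsum() adds up rows that share a group, so transpose, sum, transpose back
combined <- t(rowsum(t(newMat2), group = key))
combined
```

rowsum() works on the matrix directly, so with over 1000 columns this avoids converting to a data frame and splitting it into a list first.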