
How to combine columns based on column name in R?

Similar questions have been asked here, here, and here. However, I could not get those solutions to work for my problem.

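I am trying to combine columns based on their names and then create a matrix/data frame for each pair of variables. Hopefully my example makes this clearer.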

For example, suppose we have a data frame like this:

# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)

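I am applying a process (which is beyond my control) that splits categorical variables into dummy variables (one dummy variable per level). In the end, I get a matrix whose columns are every pairwise combination of variables. The output looks like this: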

# create matrix of all pairs
name2 <- c('day.1', 'day.2',
           'rain.yes', 'rain.no', 'val1')

nam2 <- expand.grid(name2, name2)
newName2 <- NULL
for(i in 1:length(nam2$Var1)){
  newName2[i] <- paste0(nam2$Var2[i], ":", nam2$Var1[i])
}
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
     day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,]    9.242116   30.973623      0.9311719      1.943265   20.23192   3.8058106
[2,]    7.238372    6.248052     17.4839077      5.251022   11.23247   0.7162231
[3,]    1.046449   11.744293      2.4999295      3.380434   11.31048   4.2160769
     day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,]    0.766974      17.576561      9.348420  0.9030936       2.066487
[2,]    4.979445       5.406032      3.905483  5.5516888       8.371235
[3,]   13.735530       1.925034      1.250488  6.5460690       9.214908
     rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,]       10.15267          7.067098         3.527963     13.420953
[2,]       19.75727         22.259788         9.411371     10.040507
[3,]       15.76831         11.416835         8.630324      1.451295
     rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,]      4.330075    25.4360600        15.317283        0.349195     12.51062
[2,]      5.495578    11.0861832        11.256991       13.882071     30.86277
[3,]      6.680542     0.2620275         9.630859       36.926827     11.38734
     val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168   6.731429     14.124068    12.797966 41.294648
[2,] 3.69760484   6.137335      1.391675    12.639562  1.033024
[3,] 0.08002734  23.743638      6.804015     9.374034 27.107049

As we can see above, newMat2 contains every pair of variables after the categorical variables have been split into dummy variables.

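What I want to do is recombine these dummy variables into a single variable by summing the rows of the corresponding columns. My final output would be a matrix/data frame of each pair of recombined variables.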

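For example, consider just the variable day. It has been split into day.1 and day.2. If I recombine this variable for each pair, we would have the columns day.day, day.rain, and day.val1. Doing this manually might look like this: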

day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)

In the code above, I sum (row-wise) the columns that should be combined.
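The hard-coded column indices above can also be found from the column names. The following is only a minimal sketch for the single pair day/rain. The pattern `^day[^:]*:rain` is an assumption of this example: it relies on no other variable name starting with "day" or "rain" (for instance, a variable called "daytime" would also match).

```r
# Rebuild the example matrix from the question (vectorised version of the loop)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2

# Logical index: left-hand side starts with "day", right-hand side with "rain"
cols <- grepl("^day[^:]*:rain", colnames(newMat2))

# Row-wise sum over the selected columns; drop = FALSE keeps a matrix
# even if only one column matches
day.rain <- rowSums(newMat2[, cols, drop = FALSE])
```

Here `which(cols)` picks out the same columns as the manual `c(3,4,8,9)`, so the sum is the same as the hand-written version.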

Desired output:

To be more explicit, if I were to recombine the whole of newMat2 manually, it would look like this:

dfNew <- data.frame(
            day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum),
            day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
            day.val1 = apply(newMat2[,c(5,10)], 1, sum),
            rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
            rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
            rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
            val1.day = apply(newMat2[,c(21,22)], 1, sum),
            val1.rain = apply(newMat2[,c(23,24)], 1, sum), 
            val1.val1 = newMat2[,c(25)] 
)
> dfNew  
   day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203
2 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124
3 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805
  val1.val1
1 41.294648
2  1.033024
3 27.107049

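However, in my real data I have over 1000 columns, some of which have many different factor levels, so combining them manually would take a long time. Is there a way to automate this process?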

Using tidyverse functions:

library(tidyverse)

newMat2 %>% 
  as_tibble(rownames = "id") %>% 
  pivot_longer(-id) %>% 
  mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
                        paste0, collapse = "")) %>% 
  group_by(id, name) %>% 
  summarise(value = sum(value)) %>% 
  pivot_wider()

  id    `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1          44.8      29.8        21.1       42.0        26.3        25.9       8.15        26.9       41.3 
2 2          19.2      32.0        16.8       44.7        56.8        40.9       9.83        14.0        1.03
3 3          30.7       9.06       17.9       31.9        66.6        12.8      23.8         16.2       27.1 
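
The key step in the pipeline above is the renaming inside `mutate()`: each column name is reduced to the original variable names plus the ":" separator, so that all dummy columns of the same pair end up with the same name. The sketch below reproduces that step on three sample names using base R (`gregexpr()` and `regmatches()`) in place of `str_extract_all()` and `map_chr()`. The vector `vars` is written out by hand here and stands in for `colnames(dfOG)`. As with the original, this assumes no variable name is contained in another name or in a level label.

```r
vars <- c("day", "rain", "val1")                # stands in for colnames(dfOG)
pattern <- paste(c(vars, ":"), collapse = "|")  # "day|rain|val1|:"

nm <- c("day.1:rain.yes", "val1:day.2", "rain.no:rain.no")

# Extract every match of a variable name or ":" from each column name
pieces <- regmatches(nm, gregexpr(pattern, nm))

# Glue the pieces back together; level suffixes such as ".1" or ".yes" are dropped
collapsed <- vapply(pieces, paste0, character(1), collapse = "")
collapsed
# "day:rain" "val1:day" "rain:rain"
```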

An alternative using split.default and rlist:

library(rlist)
df <- as.data.frame(newMat2)
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))

      day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124  1.033024
[3,] 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805 27.107049
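
If installing rlist is not an option, the same grouping idea can be sketched in base R alone with `rowsum()`, which sums the rows of a matrix by group. This is a variation on the answer above, not the answerer's own code. The regular expression is slightly different from the one above and assumes that the original variable names contain neither "." nor ":".

```r
# Rebuild the example matrix from the question (vectorised version of the loop)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
nam2 <- expand.grid(name2, name2)
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2

# Map each column name to its pair of original variables,
# e.g. "day.1:rain.yes" -> "day.rain"
grp <- gsub("^([^.:]+)[^:]*:([^.:]+).*$", "\\1.\\2", colnames(newMat2))

# rowsum() groups rows, so transpose, sum by group, and transpose back
res <- t(rowsum(t(newMat2), group = grp))
res
```

The result is a 3 x 9 matrix whose columns should match dfNew from the question. Because the groups come from the column names, no column indices need to be listed by hand.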
