
R count sum of partial string matches over multiple columns

I'm working with a messy summer camp registration form. The form output looks like this:

          leaders         teen_adventure
1 camp, overnight                   <NA>
2            <NA>                   <NA>
3 camp, overnight camp, float, overnight

I'd like to generate new columns that count, for each row, how many columns contain each possible answer.

          leaders         teen_adventure camps overnights floats
1 camp, overnight                   <NA>     1          1      0
2            <NA>                   <NA>     0          0      0
3 camp, overnight camp, float, overnight     2          2      1

I feel in my bones that there's a dplyr solution, something like:

reprex %>%
  mutate(camps = sum(case_when(
    str_detect(select(., everything()), "camp") ~ 1,
    TRUE ~ 0
  )))

or maybe using across().

Here is a sample dataset:

# data
reprex <- structure(list(leaders = c("camp, overnight", NA, "camp, overnight"), 
          teen_adventure = c(NA, NA, "camp, float, overnight")), 
          row.names = c(NA, -3L), class = "data.frame")

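We can loop over the columns (map), extract the words with str_extract_all, get the frequency counts with mtabulate, bind the list elements together, and summarise the numeric columns to get the sum.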

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(qdapTools)
library(data.table)
reprex %>% 
   map_dfr(~ str_extract_all(.x, "\\w+") %>%
             mtabulate, .id = 'grp') %>%
   group_by(grp = rowid(grp)) %>% 
   summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), 
       .groups = 'drop') %>%
   select(-grp) %>% 
   bind_cols(reprex, .)

-output

#            leaders         teen_adventure camp overnight float
#1 camp, overnight                   <NA>    1         1     0
#2            <NA>                   <NA>    0         0     0
#3 camp, overnight camp, float, overnight    2         2     1
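
One thing to watch with the `"\\w+"` pattern: it treats every run of word characters as a token, so a multi-word answer would be split apart. A small sketch of the difference, assuming a hypothetical answer "day camp" that does not appear in the sample data, and splitting on the comma separator instead:

```r
library(stringr)

# "day camp" is a made-up answer, used only to show the difference
x <- "day camp, overnight"

# every run of word characters becomes its own token
by_word <- str_extract_all(x, "\\w+")[[1]]

# splitting on the comma (plus optional whitespace) keeps "day camp" whole
by_comma <- str_split(x, ",\\s*")[[1]]

print(by_word)
print(by_comma)
```

For the sample data both approaches give the same tokens, because every answer is a single word.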

One way:

library(dplyr)
library(stringr)
library(tidyr)
reprex %>%
  replace_na(list(leaders='unknown',teen_adventure='unknown'))%>%
  mutate(camp=as.numeric(str_detect(leaders, 'camp')+str_detect(teen_adventure,'camp')),
         float=as.numeric(str_detect(leaders,'float')+str_detect(teen_adventure,'float')),
         overnight=as.numeric(str_detect(leaders,'overnight')+str_detect(teen_adventure,'overnight')))

Output:

          leaders         teen_adventure camp float overnight
1 camp, overnight                unknown    1     0         1
2         unknown                unknown    0     0         0
3 camp, overnight camp, float, overnight    2     1         2
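
The per-column `str_detect()` sums above can also be written without naming each column. This is a minimal sketch, assuming every character column holds answers and that a substring match is acceptable (so "camp" would also match a hypothetical value such as "campfire"). The plural column names follow the desired output in the question.

```r
library(dplyr)
library(stringr)

reprex <- data.frame(
  leaders        = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

# For each term, test every character column and add up the TRUEs per row.
# where(is.character) skips the numeric count columns created earlier in
# the same mutate(); na.rm = TRUE makes NA cells count as "no match".
result <- reprex %>%
  mutate(
    camps      = rowSums(across(where(is.character), ~ str_detect(.x, "camp")),      na.rm = TRUE),
    overnights = rowSums(across(where(is.character), ~ str_detect(.x, "overnight")), na.rm = TRUE),
    floats     = rowSums(across(where(is.character), ~ str_detect(.x, "float")),     na.rm = TRUE)
  )

print(result)
```

Unlike the `replace_na()` version, this leaves the original NA values in place.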

A base R option

v <- unique(unlist(strsplit(na.omit(unlist(reprex)), ",\\s+")))
reprex <- cbind(
  reprex,
  do.call(
    rbind,
    lapply(
      1:nrow(reprex),
      function(k) table(factor(unlist(strsplit(na.omit(unlist(reprex[k, ])), ",\\s+")), levels = v))
    )
  )
)

This gives

          leaders         teen_adventure camp overnight float
1 camp, overnight                   <NA>    1         1     0
2            <NA>                   <NA>    0         0     0
3 camp, overnight camp, float, overnight    2         2     1
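
The key step in the base R code is tabulating each row's tokens against a fixed set of factor levels, so every row produces counts in the same order, zeros included. A small sketch of that step for single rows of the sample data (the names `row3`, `counts3`, and `counts2` are only for illustration):

```r
reprex <- data.frame(
  leaders        = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight"),
  stringsAsFactors = FALSE
)

# all distinct answers, in order of first appearance
v <- unique(unlist(strsplit(na.omit(unlist(reprex)), ",\\s+")))

# tokens of row 3, across both columns
row3 <- unlist(strsplit(na.omit(unlist(reprex[3, ])), ",\\s+"))

# factor(levels = v) forces a count for every answer, even absent ones
counts3 <- table(factor(row3, levels = v))

# row 2 is all NA, so it has no tokens and every count is zero
row2 <- unlist(strsplit(na.omit(unlist(reprex[2, ])), ",\\s+"))
counts2 <- table(factor(row2, levels = v))

print(counts3)
print(counts2)
```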

This solution works for any number of columns and values:

library(dplyr)
library(tidyr)
library(purrr)

reprex %>%
 as_tibble %>%
 # split the values by `, `
 mutate_all(strsplit, ", ") %>%
 # map over each column, then each cell, to turn it into a named vector,
 # for example the first cell: c("camp", "overnight") => c("camp" = 1, "overnight" = 1),
 # then pivot longer keyed by row number (this makes summing the values quick)
 map_dfr( function(x) x %>% map_dfr( ~ set_names(rep(1, length(.x<-.x[!is.na(.x)])), .x)) %>%
     mutate(id = row_number()) %>% 
     pivot_longer(!id) ) %>%
 # group by id and name so that values found in the same row are grouped together
 group_by(id, name) %>%
 # get the sum
 summarise_all(sum, na.rm = TRUE) %>%
 ungroup %>%
 # return the tibble to wide format
 pivot_wider %>%
 # remove the id column
 select(-id) %>%
 # add the original data.frame to it
 tibble(reprex, .)
# A tibble: 3 x 5
  leaders         teen_adventure          camp float overnight
  <chr>           <chr>                  <dbl> <dbl>     <dbl>
1 camp, overnight NA                         1     0         1
2 NA              NA                         0     0         0
3 camp, overnight camp, float, overnight     2     1         2

