R: count the sum of partial string matches over multiple columns
I'm working with a messy summer camp registration form. The form output looks like this:
          leaders         teen_adventure
1 camp, overnight                   <NA>
2            <NA>                   <NA>
3 camp, overnight camp, float, overnight
I'd like to generate new columns that sum the occurrences of each possible answer across the existing columns.
          leaders         teen_adventure camps overnights floats
1 camp, overnight                   <NA>     1          1      0
2            <NA>                   <NA>     0          0      0
3 camp, overnight camp, float, overnight     2          2      1
I feel in my bones that there is a dplyr solution for this, something like:
reprex %>%
  mutate(camps = sum(case_when(
    str_detect(select(., everything()), "camp") ~ 1,
    TRUE ~ 0
  )))
Or perhaps using across().
Here is the sample dataset:
# data
reprex <- structure(list(leaders = c("camp, overnight", NA, "camp, overnight"),
teen_adventure = c(NA, NA, "camp, float, overnight")),
row.names = c(NA, -3L), class = "data.frame")
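We can loop over the columns (map), extract the words with str_extract_all, get the frequency counts with mtabulate, bind the list elements together, and then summarise the numeric columns to get the sum: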
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(qdapTools)
library(data.table)
reprex %>%
  map_dfr(~ str_extract_all(.x, "\\w+") %>%
            mtabulate, .id = 'grp') %>%
  group_by(grp = rowid(grp)) %>%
  summarise(across(everything(), sum, na.rm = TRUE),
            .groups = 'drop') %>%
  select(-grp) %>%
  bind_cols(reprex, .)
Output:
#          leaders         teen_adventure camp overnight float
#1 camp, overnight                   <NA>    1         1     0
#2            <NA>                   <NA>    0         0     0
#3 camp, overnight camp, float, overnight    2         2     1
One way:
library(dplyr)   # mutate() and %>%
library(stringr)
library(tidyr)
reprex %>%
  # replace NA first so that str_detect() returns FALSE instead of NA
  replace_na(list(leaders = 'unknown', teen_adventure = 'unknown')) %>%
  mutate(camp = as.numeric(str_detect(leaders, 'camp') + str_detect(teen_adventure, 'camp')),
         float = as.numeric(str_detect(leaders, 'float') + str_detect(teen_adventure, 'float')),
         overnight = as.numeric(str_detect(leaders, 'overnight') + str_detect(teen_adventure, 'overnight')))
Output:
          leaders         teen_adventure camp float overnight
1 camp, overnight                unknown    1     0         1
2         unknown                unknown    0     0         0
3 camp, overnight camp, float, overnight    2     1         2
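The approach above names every column and every answer by hand. As a rough sketch of the same substring-matching idea applied to all columns at once, the base R version below may help; `count_matches` and the `answers` vector are hypothetical names introduced here, not part of the original answer. It relies on `grepl()` returning FALSE for NA input, so no NA replacement is needed.

```r
# Sample data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

# Hypothetical helper: for each row, count how many columns contain `pattern`
# as a literal substring. grepl() yields FALSE for NA cells.
count_matches <- function(df, pattern) {
  hits <- vapply(df, function(col) grepl(pattern, col, fixed = TRUE),
                 logical(nrow(df)))
  # matrix() keeps the rows-by-columns shape even for a one-row data frame
  rowSums(matrix(hits, nrow = nrow(df)))
}

# Assumed list of possible answers; adjust to the real form
answers <- c("camp", "overnight", "float")

# One count column per answer, named after the answer
counts <- sapply(answers, function(a) count_matches(reprex, a))
result <- cbind(reprex, counts)
result
```

Because this matches substrings, "camp" would also be counted inside a longer answer that merely contains it. If that matters, dropping `fixed = TRUE` and using a word-boundary pattern such as `"\\bcamp\\b"` is one possible adjustment.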
A base R option:
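# all distinct answers that appear anywhere in the data
v <- unique(unlist(strsplit(na.omit(unlist(reprex)), ",\\s+")))
reprex <- cbind(
  reprex,
  do.call(
    rbind,
    lapply(
      seq_len(nrow(reprex)),
      # tabulate the answers found in row k against the full set of levels
      function(k) table(factor(unlist(strsplit(na.omit(unlist(reprex[k, ])), ",\\s+")), levels = v))
    )
  )
)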
This gives:
          leaders         teen_adventure camp overnight float
1 camp, overnight                   <NA>    1         1     0
2            <NA>                   <NA>    0         0     0
3 camp, overnight camp, float, overnight    2         2     1
This solution works for any number of columns and values:
library(dplyr)
library(tidyr)
library(purrr)
reprex %>%
as_tibble %>%
# split the values by `, `
mutate_all(strsplit, ", ") %>%
# map over each column, then over each cell, to turn it into a named vector
# for example, the first cell: c("camp", "overnight") => c("camp"=1, "overnight"=1)
# then pivot it longer by row number (this makes summing the values quick)
map_dfr( function(x) x %>% map_dfr( ~ set_names(rep(1, length(.x<-.x[!is.na(.x)])), .x)) %>%
mutate(id = row_number()) %>%
pivot_longer(!id) ) %>%
# group by id and name so that identical values found in the same row are grouped together
group_by(id, name) %>%
# get the sum
summarise_all(sum, na.rm=T) %>%
ungroup %>%
# return the tibble to wide format
pivot_wider %>%
# remove the id column
select(-id) %>%
# add the original data.frame to it
tibble(reprex, .)
# A tibble: 3 x 5
  leaders         teen_adventure          camp float overnight
  <chr>           <chr>                  <dbl> <dbl>     <dbl>
1 camp, overnight NA                         1     0         1
2 NA              NA                         0     0         0
3 camp, overnight camp, float, overnight     2     1         2
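The split, reshape to long, and count steps described in the comments above can also be sketched in base R. This is only an illustration of the idea, not the answer's own code; the names `long` and `counts` are introduced here for the example.

```r
# Sample data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

# Build one (row id, answer) pair per comma-separated token, across all columns
long <- do.call(rbind, lapply(reprex, function(col) {
  parts <- strsplit(as.character(col), ",\\s*")
  data.frame(id = rep(seq_along(parts), lengths(parts)),
             answer = unlist(parts))
}))

# NA cells split into a single NA token; drop those
long <- long[!is.na(long$answer), ]

# Cross-tabulate row id against answer; the factor keeps rows with no answers
counts <- table(factor(long$id, levels = seq_len(nrow(reprex))), long$answer)

# Attach the counts to the original data
result <- cbind(reprex, as.data.frame.matrix(counts))
result
```

Since the possible answers are discovered from the data rather than listed up front, this sketch should also cope with additional columns or new answer values, at the cost of counting exact tokens instead of partial matches.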