R tidyr: splitting a character column of comma-separated text into multiple columns with the separate function and RegEx

Question

I have the following data frame

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))

which looks like this

                x
1             one
2        one, two
3      two, three
4 one, two, three

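I would like to split this x column into several columns, one for each distinct word found in x. Basically, I want the final result to look like this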

    one  two  three
1    1    0     0
2    1    1     0
3    0    1     1
4    1    1     1

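I think that to get this data frame I probably need the separate function provided and documented by tidyr. However, that requires knowledge of regular expressions, and I am not good with them. Can anyone help me get this data frame?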

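Important: I do not know in advance how many words there are, nor how they are spelled.

Important example

It should also work for empty strings. For example, if we have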

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))

then it should work as well.

Answer 1

Here is a base R solution

# split each string on ", " and save the pieces in a list `lst`
# (strsplit always returns a list; apply() can return a matrix when every row has the same number of words)
lst <- strsplit(as.character(df$x), ", ")

# the set of all distinct words
common <- Reduce(union, lst)

# one 0/1 row per element of `lst`, marking which words of `common` it contains
dfout <- setNames(data.frame(do.call(rbind, lapply(lst, function(x) +(common %in% x)))), common)

which gives

> dfout
  one two three
1   1   0     0
2   1   1     0
3   0   1     1
4   1   1     1
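
The question also asks about an empty string in x. As a minimal sketch, the same base R steps can be wrapped in a helper and run on that case. The name words_to_indicators and the ",\\s*" separator are assumptions made for illustration and are not part of the answer.

```r
# Hypothetical helper (not from the original answer): same steps, wrapped in a function
words_to_indicators <- function(x, sep = ",\\s*") {
  lst <- strsplit(as.character(x), sep)      # list of word vectors; "" gives character(0)
  common <- Reduce(union, lst)               # distinct words in order of first appearance
  m <- do.call(rbind, lapply(lst, function(w) +(common %in% w)))  # one 0/1 row per input
  setNames(data.frame(m), common)
}

df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three", ""))
out <- words_to_indicators(df$x)
print(out)
```

With this input the empty string simply becomes a row of zeros, because strsplit returns an empty character vector for it.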

Answer 2

Using the tidyverse, we can split the 'x' column with separate_rows, create a sequence column, and then reshape with pivot_wider from tidyr

library(dplyr)
library(tidyr)
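df %>%
   filter(!(is.na(x) | x == "")) %>%   # drop NA and empty strings
   mutate(rn = row_number()) %>%       # identifier for each original row
   separate_rows(x) %>%                # one row per word
   mutate(i1 = 1) %>%                  # indicator value to spread
   pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
   select(-rn)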
# A tibble: 4 x 3
#    one   two three
#  <dbl> <dbl> <dbl>
#1     1     0     0
#2     1     1     0
#3     0     1     1
#4     1     1     1

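In the code above, the rn column is added so that each original row keeps a distinct identifier once separate_rows has expanded the rows; otherwise, when there are duplicated elements, pivot_wider produces list output columns. The 'i1' column with value 1 is added to be used in values_from. Another option is to specify values_fn = length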
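
The values_fn = length option is only mentioned, not shown. One reading of it, sketched here under the assumption of tidyr 1.1.0 or later (where values_fn accepts a bare function and values_fill a scalar), is to drop the i1 column and count the occurrences of each word instead. The explicit sep argument is also an assumption.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"),
                 stringsAsFactors = FALSE)

res <- df %>%
  mutate(rn = row_number()) %>%            # identifier for each original row
  separate_rows(x, sep = ",\\s*") %>%      # one row per word
  pivot_wider(names_from = x, values_from = x,
              values_fn = length,          # count occurrences instead of spreading a 1
              values_fill = 0L) %>%        # words absent from a row become 0
  select(-rn)

print(res)
```

Because the cells are counts, a word that appears twice in the same row would give 2 rather than 1.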

Alternatively, we can use table after splitting the 'x' column in base R

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
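
The one-liner is dense, so here is a sketch of the same call unrolled into steps. The intermediate names (parts, long, tab, out) and the final conversion with as.data.frame.matrix are additions for illustration only.

```r
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"),
                 stringsAsFactors = FALSE)

# 1. split each string on a comma followed by whitespace
parts <- strsplit(as.character(df$x), ",\\s+")

# 2. name each list element after its row number
parts <- setNames(parts, seq_len(nrow(df)))

# 3. stack into long form: column `values` holds the word, `ind` the row number
long <- stack(parts)

# 4. cross-tabulate row number (rows) against word (columns)
tab <- table(long[2:1])

# optional: turn the contingency table into a plain data frame
out <- as.data.frame.matrix(tab)
print(out)
```

Note that table orders the word columns alphabetically (one, three, two) rather than by first appearance.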

Answer 3

You can build a pattern from your columns and use it with tidyr::extract():

library(tidyverse)
cols <- c("one","two","three")
pattern <- paste0("(",cols,")*", collapse= "(?:, )*")
df %>% 
  extract(x, into = c("one","two","three"), regex = pattern) %>%
  mutate_all(~as.numeric(!is.na(.)))
#>   one two three
#> 1   1   0     0
#> 2   1   1     0
#> 3   0   1     1
#> 4   1   1     1
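
Since the question says the words are not known in advance, cols could be derived from the data instead of being typed by hand. This is a sketch of that idea and not part of the answer; the ",\\s*" separator is an assumption.

```r
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three", ""),
                 stringsAsFactors = FALSE)

# distinct words in order of first appearance; the empty string contributes none
cols <- unique(unlist(strsplit(as.character(df$x), ",\\s*")))

# same pattern construction as in the answer
pattern <- paste0("(", cols, ")*", collapse = "(?:, )*")

print(cols)
print(pattern)
```

The resulting cols and pattern can then be passed to extract in place of the hard-coded vectors. Two caveats: the pattern assumes the words in every row appear in the same relative order as in cols, and words containing regex metacharacters would need escaping first.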


Answer 1
2 2019-12-28 20:49:27

Answer 2
1 accepted 2019-12-28 18:30:37

Answer 3
1 2019-12-28 22:38:39
