R: Split a column of words separated by spaces and commas into separate columns

Question

I am working with a very large movie dataset. The data looks like this (sample):

Example:

title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)

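If you look at the genre column, some genres are comma-separated and a few are space-separated. The word SciFi also appears in two forms: SciFi and Sci-Fi. This is the situation across my whole dataset of about 5,000 movies.

I am stuck on finding an appropriate method to get the following result:

How can I split each movie's genres into separate columns? For example, I want to split the genres of Interstellar into:

genre1 = Adventure

genre2 = Drama

genre3 = SciFi

I used the following command: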

movie_genres<-separate(movies, genre, into=c("genre1", "genre2", "genre3"))

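The command above splits the word Sci-Fi into two genres (Sci and Fi, or only Sci).

How can I remove the hyphen (-) from the word Sci-Fi throughout the genre column, so that the separate function works correctly?

Or

Is there a workaround to insert commas between the genres (in the genre column) and then split on commas only?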

Answer 1

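I usually start by "cleaning" the data. In this case, I would make the format of your genre column consistent (genres comma-separated, no trailing whitespace, ...) and then use separate.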

library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)
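movies$genre <- str_replace_all(movies$genre, ",\\s+", ",")      # drop whitespace after commas
movies$genre <- str_replace_all(movies$genre, "\\s+$", "")       # strip trailing whitespace
movies$genre <- str_replace_all(movies$genre, "\\s+", ",")       # remaining whitespace becomes a comma
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi") # unify the two spellings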
movies$genre
#> [1] "Adventure,Drama,SciFi"  "Adventure,Comedy,SciFi" "Adventure,SciFi"       
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi

Created on 2023-01-31 by the reprex package (v2.0.1)
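
As a possible shortcut for the cleaning steps above, the hyphen removal and whitespace cleanup can be done in one pass, with the delimiter handed to separate directly. This is only a sketch under a few assumptions: every genre is a single word once the hyphen is gone, the separator is always a space optionally preceded by a comma, and no movie has more than three genres. The name movies_wide is made up for the example.

```r
library(stringr)
library(tidyr)

title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies <- data.frame(title, genre)

# Remove hyphens (Sci-Fi -> SciFi), then trim and collapse whitespace
movies$genre <- str_squish(str_remove_all(movies$genre, "-"))

# Split on a space that may be preceded by a comma;
# fill = "right" pads short rows with NA without a warning
movies_wide <- separate(movies, genre,
                        into = c("genre1", "genre2", "genre3"),
                        sep = ",? ", fill = "right")
movies_wide
```

If some movies have more than three genres, into needs more names; otherwise separate drops the extra pieces with a warning.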

Answer 2

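How about a long format or list columns? Both let you filter by genre, whereas dealing with several misaligned genre columns is not much fun. For example, something like this: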

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)

# long format
movies_long <- movies %>% 
  mutate(id = row_number(), .before = 1, genre = str_remove_all(genre, "-") %>% str_squish()) %>%  
  separate_rows(genre, sep = ",? ")

movies_long
#> # A tibble: 11 × 3
#>       id title                 genre    
#>    <int> <chr>                 <chr>    
#>  1     1 Interstellar          Adventure
#>  2     1 Interstellar          Drama    
#>  3     1 Interstellar          SciFi    
#>  4     2 Back to the Future    Adventure
#>  5     2 Back to the Future    Comedy   
#>  6     2 Back to the Future    SciFi    
#>  7     3 2001: A Space Odyssey Adventure
#>  8     3 2001: A Space Odyssey SciFi    
#>  9     4 The Martian           Adventure
#> 10     4 The Martian           Drama    
#> 11     4 The Martian           SciFi

# filter by genre
movies_long %>% filter(genre == "Adventure")
#> # A tibble: 4 × 3
#>      id title                 genre    
#>   <int> <chr>                 <chr>    
#> 1     1 Interstellar          Adventure
#> 2     2 Back to the Future    Adventure
#> 3     3 2001: A Space Odyssey Adventure
#> 4     4 The Martian           Adventure

# list columns, genre column will be filled with lists of genres
movies_lst <- movies %>% 
  mutate(genre = str_remove_all(genre, "-") %>% str_squish() %>% str_split(",? ")) %>% 
  as_tibble()

movies_lst
#> # A tibble: 4 × 2
#>   title                 genre    
#>   <chr>                 <list>   
#> 1 Interstellar          <chr [3]>
#> 2 Back to the Future    <chr [3]>
#> 3 2001: A Space Odyssey <chr [2]>
#> 4 The Martian           <chr [3]>

# can be filtered with e.g. map_lgl and for output can be concatenated to a single string
movies_lst %>% filter(
  map_lgl(genre, ~ all(c("Drama", "SciFi") %in% .x))) %>% 
  mutate(genre = map_chr(genre, ~paste(.x , collapse = ", ")))
#> # A tibble: 2 × 2
#>   title        genre                  
#>   <chr>        <chr>                  
#> 1 Interstellar Adventure, Drama, SciFi
#> 2 The Martian  Adventure, Drama, SciFi

Created on 2023-01-31 with reprex v2.0.2
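
To illustrate why the long format is convenient, here is a small sketch that counts how many movies fall into each genre. It rebuilds the long table from the sample data so that it runs on its own; genre_counts is just a name picked for the example, and the same assumptions about the separators apply as above.

```r
library(dplyr)
library(tidyr)
library(stringr)

title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies <- data.frame(title, genre)

# One row per (movie, genre) pair
movies_long <- movies %>%
  mutate(genre = str_squish(str_remove_all(genre, "-"))) %>%
  separate_rows(genre, sep = ",? ")

# Number of movies per genre, most frequent first
genre_counts <- movies_long %>% count(genre, sort = TRUE)
genre_counts
```

On the full dataset, a table like this is also a quick way to spot other inconsistent spellings that still need cleaning.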

Answer 3

This approach puts each genre in its own column and expands automatically as new genres appear.

library(dplyr)
library(tidyr)

movies %>% 
  mutate(genre = strsplit(genre, ", | ")) %>% 
  rowwise() %>% 
  mutate(genre = list(sub("-", "", genre))) %>% 
  unnest(genre) %>% 
  group_by(genre) %>% 
  mutate(grp = cur_group_id()) %>% 
  arrange(grp) %>% 
  pivot_wider(names_from=grp, names_prefix="genre_", values_from=genre)
# A tibble: 4 × 5
  title                 genre_1   genre_2 genre_3 genre_4
  <chr>                 <chr>     <chr>   <chr>   <chr>  
1 Interstellar          Adventure NA      Drama   SciFi  
2 Back to the Future    Adventure Comedy  NA      SciFi  
3 2001: A Space Odyssey Adventure NA      NA      SciFi  
4 The Martian           Adventure NA      Drama   SciFi
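
A variation on the same idea, in case named columns are easier to work with than genre_1, genre_2, ...: one logical column per genre, named after the genre itself. This is a sketch rather than part of the answer above. It assumes titles are unique (otherwise add an id column first), and genre_flags and present are names made up for the example.

```r
library(dplyr)
library(tidyr)

title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies <- data.frame(title, genre)

genre_flags <- movies %>%
  # Sci-Fi -> SciFi, and drop leading/trailing whitespace
  mutate(genre = trimws(gsub("-", "", genre, fixed = TRUE))) %>%
  # one row per (movie, genre) pair
  separate_rows(genre, sep = ",? +") %>%
  mutate(present = TRUE) %>%
  # one column per genre; FALSE where a movie lacks that genre
  pivot_wider(names_from = genre, values_from = present, values_fill = FALSE)

genre_flags
```

Filtering then becomes, for example, filter(genre_flags, Drama, SciFi).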

3 solutions

Solution 1
1 accepted 2023-01-31 19:22:06

Solution 2
1 2023-01-31 19:48:14

Solution 3
1 2023-01-31 20:04:45
