Separate words in a column that are separated by space and comma into different columns in R

I am working with a very large movie dataset. The data looks like this (sample):

Example:

title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)

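If you look at the genre column, some genres are comma-separated and a few are space-separated. The term SciFi also appears in two different forms: SciFi and Sci-Fi. This is the situation throughout my whole dataset of about 5000 movies.

I am stuck on finding a suitable approach for the following: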

  1. How can I split each movie's genre string into separate genres? For example, I want to split the genre of Interstellar into:

genre1 = Adventure
genre2 = Drama
genre3 = SciFi


I used the following command:

movie_genres <- separate(movies, genre, into = c("genre1", "genre2", "genre3"))

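The command above splits the term Sci-Fi into two genres (Sci and Fi, or only Sci).

  2. How can I remove the hyphen (-) from Sci-Fi across the whole genre column so that the separate function works correctly?

Or

  3. Is there a workaround that inserts commas between the genres (in the genre column) so that they can be split on commas alone?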
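
The behaviour described in the question is consistent with the default separator of tidyr's separate(), which is the regular expression "[^[:alnum:]]+" (any run of non-alphanumeric characters). The hyphen therefore counts as a delimiter, just like commas and spaces. The following is a small base R sketch that uses strsplit() as a stand-in for separate() to show how that pattern cuts the strings; it is an illustration, not the asker's code.

```r
# Default separator of tidyr::separate(): any run of non-alphanumeric characters
default_sep <- "[^[:alnum:]]+"

genre <- c("Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

# strsplit() stands in for separate() to show where the pattern splits
pieces <- strsplit(genre, default_sep)

pieces[[1]]  # "Adventure" "Sci" "Fi"
pieces[[2]]  # "Adventure" "Drama" "Sci" "Fi"
```

With only three target columns, the second string yields four pieces, so the last one (Fi) is dropped, which would explain why some rows end up with only Sci.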

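I usually start by "cleaning" the data. In this case, I would make the format of your genre column consistent (genres comma-separated, no trailing whitespace, ...) and then use separate.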

library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s+", ",") 
movies$genre <- str_replace_all(movies$genre, "\\s+$", "") 
movies$genre <- str_replace_all(movies$genre, "\\s+", ",") 
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi"  "Adventure,Comedy,SciFi" "Adventure,SciFi"       
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi

Created on 2023-01-31 by the reprex package (v2.0.1)
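
The four str_replace_all() calls above can also be folded into one small helper. The following is a minimal base R sketch under that assumption; the function name normalize_genre is hypothetical and not part of the answer. Note that it removes every hyphen, so it would also alter any other hyphenated genre names in the full dataset.

```r
# Hypothetical helper: bring a genre string into the form "A,B,C"
normalize_genre <- function(x) {
  x <- gsub("-", "", x, fixed = TRUE)  # "Sci-Fi" -> "SciFi"
  x <- trimws(x)                       # drop leading and trailing whitespace
  gsub("[,[:space:]]+", ",", x)        # any run of commas/whitespace -> one comma
}

genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi",
           "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

cleaned <- normalize_genre(genre)
cleaned
```

The cleaned vector can then be passed to separate() with sep = "," exactly as in the answer above.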

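How about a long format or a list column? Either one lets you filter by genre, whereas dealing with multiple unaligned genre columns is not much fun. For example, something like this: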

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)

# long format
movies_long <- movies %>% 
  mutate(id = row_number(), .before = 1, genre = str_remove_all(genre, "-") %>% str_squish()) %>%  
  separate_rows(genre, sep = ",? ")

movies_long
#> # A tibble: 11 × 3
#>       id title                 genre    
#>    <int> <chr>                 <chr>    
#>  1     1 Interstellar          Adventure
#>  2     1 Interstellar          Drama    
#>  3     1 Interstellar          SciFi    
#>  4     2 Back to the Future    Adventure
#>  5     2 Back to the Future    Comedy   
#>  6     2 Back to the Future    SciFi    
#>  7     3 2001: A Space Odyssey Adventure
#>  8     3 2001: A Space Odyssey SciFi    
#>  9     4 The Martian           Adventure
#> 10     4 The Martian           Drama    
#> 11     4 The Martian           SciFi

# filter by genre
movies_long %>% filter(genre == "Adventure")
#> # A tibble: 4 × 3
#>      id title                 genre    
#>   <int> <chr>                 <chr>    
#> 1     1 Interstellar          Adventure
#> 2     2 Back to the Future    Adventure
#> 3     3 2001: A Space Odyssey Adventure
#> 4     4 The Martian           Adventure

# list columns, genre column will be filled with lists of genres
movies_lst <- movies %>% 
  mutate(genre = str_remove_all(genre, "-") %>% str_squish() %>% str_split(",? ")) %>% 
  as_tibble()

movies_lst
#> # A tibble: 4 × 2
#>   title                 genre    
#>   <chr>                 <list>   
#> 1 Interstellar          <chr [3]>
#> 2 Back to the Future    <chr [3]>
#> 3 2001: A Space Odyssey <chr [2]>
#> 4 The Martian           <chr [3]>

# can be filtered with e.g. map_lgl and for output can be concatenated to a single string
movies_lst %>% filter(
  map_lgl(genre, ~ all(c("Drama", "SciFi") %in% .x))) %>% 
  mutate(genre = map_chr(genre, ~paste(.x , collapse = ", ")))
#> # A tibble: 2 × 2
#>   title        genre                  
#>   <chr>        <chr>                  
#> 1 Interstellar Adventure, Drama, SciFi
#> 2 The Martian  Adventure, Drama, SciFi

Created on 2023-01-31 with reprex v2.0.2
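
As one more illustration of why the long format is convenient, counting movies per genre becomes a single call. This is a base R sketch that rebuilds a long table equivalent to movies_long without the tidyverse; the variable names parts and genre_counts are assumptions made for this example.

```r
title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi",
           "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

# Remove hyphens, trim whitespace, then split on "optional comma + space"
parts <- strsplit(trimws(gsub("-", "", genre, fixed = TRUE)), ",? ")

# One row per (title, genre) pair
movies_long <- data.frame(
  title = rep(title, lengths(parts)),
  genre = unlist(parts)
)

# Number of movies listed under each genre
genre_counts <- table(movies_long$genre)
genre_counts
```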

This approach puts each genre in its own column and automatically expands when a new genre appears.

library(dplyr)
library(tidyr)

movies %>% 
  mutate(genre = strsplit(genre, ", | ")) %>% 
  rowwise() %>% 
  mutate(genre = list(sub("-", "", genre))) %>% 
  unnest(genre) %>% 
  group_by(genre) %>% 
  mutate(grp = cur_group_id()) %>% 
  arrange(grp) %>% 
  pivot_wider(names_from=grp, names_prefix="genre_", values_from=genre)
# A tibble: 4 × 5
  title                 genre_1   genre_2 genre_3 genre_4
  <chr>                 <chr>     <chr>   <chr>   <chr>  
1 Interstellar          Adventure NA      Drama   SciFi  
2 Back to the Future    Adventure Comedy  NA      SciFi  
3 2001: A Space Odyssey Adventure NA      NA      SciFi  
4 The Martian           Adventure NA      Drama   SciFi
