Separate words in a column that are separated by space and comma into different columns in R
I am working on a very large movie dataset. The data looks like this (sample).
Ex:
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
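If you look at the genre column, some genres are comma-separated and a few are space-separated. The word SciFi also appears in two different forms: SciFi and Sci-Fi. This is the case across my whole dataset of about 5000 movies.
I am stuck on finding a proper method to get the following result:
genre1 = Adventure
genre2 = Drama
genre3 = SciFi
I used the following command: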
movie_genres <- separate(movies, genre, into = c("genre1", "genre2", "genre3"))
The above command splits the word Sci-Fi into two genres (Sci and Fi, or only Sci).
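A likely reason for this behavior: when `sep` is not supplied, `separate()` splits on any run of non-alphanumeric characters (its documented default is the regular expression `[^[:alnum:]]+`), so the hyphen in Sci-Fi counts as a separator too. The base R sketch below applies the same pattern with `strsplit()` purely for illustration; the variable name `pieces` is made up for this example.

```r
# Sketch: mimic the default separator of tidyr::separate() using base R.
# "[^[:alnum:]]+" matches any run of characters that are not letters or digits,
# so commas, spaces and hyphens all act as split points.
genre <- c("Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
pieces <- strsplit(genre, "[^[:alnum:]]+")

pieces[[1]]  # "Adventure" "Sci" "Fi"
pieces[[2]]  # "Adventure" "Drama" "Sci" "Fi"
```

With only three target columns, the fourth piece of the second string ("Fi") has nowhere to go, which would explain why some rows end up with just "Sci".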
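I usually start by "cleaning" the data. In this case, I would make the format of your genre column consistent (genres separated by commas, no trailing whitespace, ...) and then use separate.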
library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s+", ",")
movies$genre <- str_replace_all(movies$genre, "\\s+$", "")
movies$genre <- str_replace_all(movies$genre, "\\s+", ",")
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi" "Adventure,Comedy,SciFi" "Adventure,SciFi"
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi
Created on 2023-01-31 by the reprex package (v2.0.1)
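The same cleanup can also be sketched without stringr. This is a minimal base R variant, not part of the answer above; it assumes that no genre name itself contains a space (a value such as "Science Fiction" would be split in two), and the variable name `clean` is only illustrative.

```r
# Sketch: normalise the genre strings with base R only.
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi",
           "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

clean <- trimws(genre)                                  # drop leading/trailing whitespace
clean <- gsub("Sci-Fi", "SciFi", clean, fixed = TRUE)   # unify the two spellings
clean <- gsub("[, ]+", ",", clean)                      # any run of commas/spaces becomes one comma
clean
```

The resulting vector has the same shape as the cleaned `movies$genre` shown above, so it could be passed to `separate()` in the same way.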
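How about a long format or list columns? Both let you filter by genre, whereas handling multiple unaligned genre columns is not much fun. For example, something like this: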
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
# long format
movies_long <- movies %>%
  mutate(id = row_number(), .before = 1, genre = str_remove_all(genre, "-") %>% str_squish()) %>%
  separate_rows(genre, sep = ",? ")
movies_long
#> # A tibble: 11 × 3
#>       id title                 genre
#>    <int> <chr>                 <chr>
#>  1     1 Interstellar          Adventure
#>  2     1 Interstellar          Drama
#>  3     1 Interstellar          SciFi
#>  4     2 Back to the Future    Adventure
#>  5     2 Back to the Future    Comedy
#>  6     2 Back to the Future    SciFi
#>  7     3 2001: A Space Odyssey Adventure
#>  8     3 2001: A Space Odyssey SciFi
#>  9     4 The Martian           Adventure
#> 10     4 The Martian           Drama
#> 11     4 The Martian           SciFi
# filter by genre
movies_long %>% filter(genre == "Adventure")
#> # A tibble: 4 × 3
#>      id title                 genre
#>   <int> <chr>                 <chr>
#> 1     1 Interstellar          Adventure
#> 2     2 Back to the Future    Adventure
#> 3     3 2001: A Space Odyssey Adventure
#> 4     4 The Martian           Adventure
# list columns, genre column will be filled with lists of genres
movies_lst <- movies %>%
  mutate(genre = str_remove_all(genre, "-") %>% str_squish() %>% str_split(",? ")) %>%
  as_tibble()
movies_lst
#> # A tibble: 4 × 2
#>   title                 genre
#>   <chr>                 <list>
#> 1 Interstellar          <chr [3]>
#> 2 Back to the Future    <chr [3]>
#> 3 2001: A Space Odyssey <chr [2]>
#> 4 The Martian           <chr [3]>
# can be filtered with e.g. map_lgl and for output can be concatenated to a single string
movies_lst %>%
  filter(map_lgl(genre, ~ all(c("Drama", "SciFi") %in% .x))) %>%
  mutate(genre = map_chr(genre, ~ paste(.x, collapse = ", ")))
#> # A tibble: 2 × 2
#>   title        genre
#>   <chr>        <chr>
#> 1 Interstellar Adventure, Drama, SciFi
#> 2 The Martian  Adventure, Drama, SciFi
Created on 2023-01-31 with reprex v2.0.2
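For readers who want to see what the list-column filter does without purrr, here is a rough base R equivalent. It is only a sketch under the same assumption as above (genres never contain spaces), and `genre_list`, `wanted` and `keep` are names chosen for this illustration.

```r
# Sketch: keep movies whose genre list contains every wanted genre, base R only.
title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi",
           "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

# Remove hyphens, trim whitespace, then split on runs of commas/spaces.
genre_list <- strsplit(gsub("-", "", trimws(genre), fixed = TRUE), "[, ]+")

wanted <- c("Drama", "SciFi")
# vapply() returns one TRUE/FALSE per movie, like map_lgl() in the tidyverse version.
keep <- vapply(genre_list, function(g) all(wanted %in% g), logical(1))
title[keep]  # "Interstellar" "The Martian"
```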
This approach puts each genre in its own column and expands automatically when new genres appear.
library(dplyr)
library(tidyr)
movies %>%
  mutate(genre = strsplit(genre, ", | ")) %>%
  rowwise() %>%
  mutate(genre = list(sub("-", "", genre))) %>%
  unnest(genre) %>%
  group_by(genre) %>%
  mutate(grp = cur_group_id()) %>%
  arrange(grp) %>%
  pivot_wider(names_from = grp, names_prefix = "genre_", values_from = genre)
# A tibble: 4 × 5
  title                 genre_1   genre_2 genre_3 genre_4
  <chr>                 <chr>     <chr>   <chr>   <chr>
1 Interstellar          Adventure NA      Drama   SciFi
2 Back to the Future    Adventure Comedy  NA      SciFi
3 2001: A Space Odyssey Adventure NA      NA      SciFi
4 The Martian           Adventure NA      Drama   SciFi
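The one-column-per-genre layout above can also be approximated in base R. This is a different technique from the dplyr pipeline (a plain character matrix rather than a tibble), offered only as a sketch; `parts`, `all_genres` and `wide` are illustrative names, and genres are again assumed to contain no spaces.

```r
# Sketch: one column per distinct genre, NA where a movie lacks that genre.
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi",
           "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

parts <- strsplit(gsub("-", "", trimws(genre), fixed = TRUE), "[, ]+")
all_genres <- sort(unique(unlist(parts)))   # columns grow as new genres appear

# For each movie, keep the genre name where present and NA elsewhere;
# sapply() yields one column per movie, so transpose to get one row per movie.
wide <- t(sapply(parts, function(p) ifelse(all_genres %in% p, all_genres, NA_character_)))
colnames(wide) <- paste0("genre_", seq_along(all_genres))
wide
```

Because the columns are derived from `all_genres`, a genre that shows up later in the data simply adds another column, which is the same property the pipeline above relies on.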