简体   繁体   中英

split columns with delimiters and multiple categories in R

I have a messy table which has a single column that contains multiple category labels, separated by several delimiters. I want to us R to split that column at each delimiter, and create a new column for each category label. The methods I have seen can only split at one delimiter at a time.

My current table looks like this:

my_table = read.csv("./my_table.csv")
# > my_table
#   ID       TYPE          TEXT
# 1  1          a    blue water
# 2  2      a,b,c   fresh water
# 3  3      a;b,f   cold stream
# 4  4 f, b and c lovely sunset
# 5  5        b;c      up there

I want a table that looks like this:

#   ID A B C D          TEXT
# 1  1 a          blue water
# 2  2 a b c     fresh water
# 3  3 a b   d   cold stream
# 4  4   b c d lovely sunset
# 5  5   b c        up there

Here is what I have tried:

my_table1 <- my_table %>%
  separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
#   ID   A        B          TEXT
# 1  1   a     <NA>    blue water
# 2  2   a        b   fresh water
# 3  3 a;b        f   cold stream
# 4  4   f  b and c lovely sunset
# 5  5 b;c     <NA>      up there

my_table2 <- my_table1 %>%
  separate(A, c('A', 'C' ), ";")
# > docs2
#   ID A    C        B          TEXT
# 1  1 a <NA>     <NA>    blue water
# 2  2 a <NA>        b   fresh water
# 3  3 a    b        f   cold stream
# 4  4 f <NA>  b and c lovely sunset
# 5  5 b    c     <NA>      up there

my_table3 <- my_table2 %>%
  separate(A, c('A', 'D'), "and")
# > docs3
#   ID A    D    C        B          TEXT
# 1  1 a <NA> <NA>     <NA>    blue water
# 2  2 a <NA> <NA>        b   fresh water
# 3  3 a <NA>    b        f   cold stream
# 4  4 f <NA> <NA>  b and c lovely sunset
# 5  5 b <NA>    c     <NA>      up there

This gets me close, but the column names are off. Plus, I don't want to have to guess about where the string "b and c" ends up after a couple iterations. I have thousands of rows and maybe five or six categories. My guess is that there is a simpler way to do this.

As an alternative and to extend your tidyverse attempt, here is a solution using strsplit and unnest :

df %>%
    mutate(
        val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
    unnest() %>%
    select(-TYPE) %>%
    group_by(ID, TEXT) %>%
    mutate(n = 1:n()) %>%
    spread(n, val)
## A tibble: 5 x 5
## Groups:   ID, TEXT [5]
#     ID TEXT          `1`   `2`   `3`
#  <int> <fct>         <chr> <chr> <chr>
#1     1 blue water    a     NA    NA
#2     2 fresh water   a     b     c
#3     3 cold stream   a     b     f
#4     4 lovely sunset f     b     c
#5     5 up there      b     c     NA

Note that this is not exactly the same as your expected output. It does however match @MKR's output.


Sample data

df <- read.table(text =
    "ID       TYPE          TEXT
1  1          'a'    'blue water'
2  2      'a,b,c'   'fresh water'
3  3      'a;b,f'   'cold stream'
4  4 'f, b and c' 'lovely sunset'
5  5        'b;c'      'up there'")

The cSplit function from splitstackshape package can make problem easier to solve. An approach could be as:

library(splitstackshape)

# First use `gsub` to replace other delimiter and have only ',' delimiter. 
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)

Mod_df <- cSplit(my_table, "TYPE", sep = ",")

Mod_df
#    ID          TEXT TYPE_1 TYPE_2 TYPE_3
# 1:  1    blue water      a     NA     NA
# 2:  2   fresh water      a      b      c
# 3:  3   cold stream      a      b      f
# 4:  4 lovely sunset      f      b      c
# 5:  5      up there      b      c     NA

The tidyr::gather and spread can be used to get the format mentioned by OP as:

library(tidyr)

gather(Mod_df, key, value, -ID,-TEXT) %>% mutate_if(is.factor, as.character) %>%
  mutate(K = toupper(value)) %>%
  select(-key) %>%
  filter(!is.na(K)) %>%
  spread(K, value)
# ID          TEXT    A    B    C    F
# 1  1    blue water    a <NA> <NA> <NA>
# 2  2   fresh water    a    b    c <NA>
# 3  3   cold stream    a    b <NA>    f
# 4  4 lovely sunset <NA>    b    c    f
# 5  5      up there <NA>    b    c <NA>

Data

my_table <- read.table(text = 
"  ID       TYPE          TEXT
1  1          a    'blue water'
2  2      'a,b,c'   'fresh water'
3  3      'a;b,f'   'cold stream'
4  4 'f, b and c' 'lovely sunset'
5  5        'b;c'      'up there'",
header = TRUE, stringsAsFactors = FALSE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM