简体   繁体   中英

split character column into multiple columns

What I am trying to do is split a character column into multiple columns without losing the additional data in the df and the number of columns is variable depending on the input. I guess it's easier with an example:

df <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla", "bla"))

I want to split column c into multiple columns by sep = "\n".

I tried using seperate(df$c, "\n", 10) but it doesn't work because I use character as separator. 10 is just an idea, so that I rather have more columns than needed instead of dropping info.
I tried using str_split_fixed(df$c, "\n", 10) which works fine, but it deletes column a and b and I don't know why or how I could fix this.

Additional info: in the end I want to use the command on a list.

Edit:

df1 <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla", "bla"))
df2 <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla  \n ghfdghf \n hdhdh \n hjgfj \n td", "bla"))

map(list(df1, df2), ~.x %>% separate(c, into = paste0('c', seq_len(max(str_count(.x$c, '\n')+1))), sep = '\n', fill = 'right'))

[[1]]
  a         b   c1    c2        c3      c4
1 1       bla one   two      three    <NA>
2 2      word bla   why   morebla   helpme
3 3 otherword bla    bla      <NA>    <NA>
4 4      nice  bla  <NA>      <NA>    <NA>

[[2]]
  a         b   c1     c2        c3      c4      c5   c6
1 1       bla one    two      three    <NA>    <NA> <NA>
2 2      word bla    why   morebla   helpme    <NA> <NA>
3 3 otherword bla   bla    ghfdghf   hdhdh   hjgfj    td
4 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA> 

df <- data.frame(unlist(list))

I guess this could cause problems as number of columns are not the same within the list. Expected outcome:

  a         b   c1     c2        c3      c4      c5   c6
1 1       bla one   two      three    <NA>     <NA>  <NA>
2 2      word bla   why   morebla   helpme     <NA>  <NA>
3 3 otherword bla    bla      <NA>    <NA>     <NA>  <NA>
4 4      nice  bla  <NA>      <NA>    <NA>     <NA>  <NA>
5 1       bla one    two      three    <NA>    <NA> <NA>
6 2      word bla    why   morebla   helpme    <NA> <NA>
7 3 otherword bla   bla    ghfdghf   hdhdh   hjgfj    td
8 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA> 

If doing in tidyverse/dplyr pipe kinda syntax, you may use separate from tidyr in conjunction with stringr::str_count which does exactly as you require.

df <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla", "bla"))

library(tidyverse)
df %>% separate(c, into = paste0('c', seq_len(max(str_count(df$c, '\n')+1))), sep = '\n', fill = 'right')

  a         b   c1    c2        c3      c4
1 1       bla one   two      three    <NA>
2 2      word bla   why   morebla   helpme
3 3 otherword bla    bla      <NA>    <NA>
4 4      nice  bla  <NA>      <NA>    <NA>

For doing it on list of data.frames, do it like this

df1 <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla", "bla"))
df2 <- data.frame(a = c(1:4), b = c("bla", "word", "otherword", "nice"), c = c("one \n two \n three", "bla \n why \n morebla \n helpme", "bla \n bla  \n ghfdghf \n hdhdh \n hjgfj \n td", "bla"))

map(list(df1, df2), ~.x %>% separate(c, into = paste0('c', seq_len(max(str_count(.x$c, '\n')+1))), sep = '\n', fill = 'right'))

[[1]]
  a         b   c1    c2        c3      c4
1 1       bla one   two      three    <NA>
2 2      word bla   why   morebla   helpme
3 3 otherword bla    bla      <NA>    <NA>
4 4      nice  bla  <NA>      <NA>    <NA>

[[2]]
  a         b   c1     c2        c3      c4      c5   c6
1 1       bla one    two      three    <NA>    <NA> <NA>
2 2      word bla    why   morebla   helpme    <NA> <NA>
3 3 otherword bla   bla    ghfdghf   hdhdh   hjgfj    td
4 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA>

Further Edit in view of revised question

  1. Use map_dfr instead
map_dfr(list(df1, df2), ~.x %>% separate(c, into = paste0('c', seq_len(max(str_count(.x$c, '\n')+1))), sep = '\n', fill = 'right'))

  a         b   c1     c2        c3      c4      c5   c6
1 1       bla one    two      three    <NA>    <NA> <NA>
2 2      word bla    why   morebla   helpme    <NA> <NA>
3 3 otherword bla     bla      <NA>    <NA>    <NA> <NA>
4 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA>
5 1       bla one    two      three    <NA>    <NA> <NA>
6 2      word bla    why   morebla   helpme    <NA> <NA>
7 3 otherword bla   bla    ghfdghf   hdhdh   hjgfj    td
8 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA>
  1. But I cannot see a reason why doing it on separate items of list and then r-binding instead of first r-binding and then simply doing it without map*
df1 %>% rbind(df2) %>% separate(c, into = paste0('c', seq_len(max(str_count(.$c, '\n')+1))), sep = '\n', fill = 'right')

  a         b   c1     c2        c3      c4      c5   c6
1 1       bla one    two      three    <NA>    <NA> <NA>
2 2      word bla    why   morebla   helpme    <NA> <NA>
3 3 otherword bla     bla      <NA>    <NA>    <NA> <NA>
4 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA>
5 1       bla one    two      three    <NA>    <NA> <NA>
6 2      word bla    why   morebla   helpme    <NA> <NA>
7 3 otherword bla   bla    ghfdghf   hdhdh   hjgfj    td
8 4      nice  bla   <NA>      <NA>    <NA>    <NA> <NA>
cc = strsplit(df$c, "\n")
l = max(lengths(cc))
CC = lapply(cc, function(x) c(x, rep(NA, l-length(x))))
CC = do.call(rbind, CC)
cbind(df[c('a', 'b')], CC)

a         b    1     2         3       4
1       bla one   two      three    <NA>
2      word bla   why   morebla   helpme
3 otherword bla    bla      <NA>    <NA>
4      nice  bla  <NA>      <NA>    <NA>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM