
How to separate a column of strings into multiple columns, each containing a single char of a string, with strings of unequal length and no separator?

My data frame is this:

data.frame(stringsAsFactors=FALSE,
       A = c("1234", "abc.", "e-2.1ad"),
       B = c("5-4", "1-0", "a,d")
)

I want to separate the columns into multiple columns containing individual characters.

The other answers I found all involved a regular expression, pattern, or separator, which, as you can see, I can't use here, or convoluted solutions using sapply (which relied on position, but didn't work for me). I'm sure there's a more elegant solution out there, and I would really love a solution using tidyr if possible, but whatever does it cleanly is much appreciated.

This is what it should look like, after all is said and done:

 newdf <- data.frame(stringsAsFactors=FALSE,
      A1 = c("1", "a", "e"),
      A2 = c("2", "b", "-"),
      A3 = c("3", "c", "2"),
      A4 = c("4", ".", "."),
      A5 = c(NA, NA, "1"),
      A6 = c(NA, NA, "a"),
      A7 = c(NA, NA, "d"),
      B1 = c("5", "1", "a"),
      B2 = c("-", "-", ","),
      B3 = c("4", "0", "d")
)

And, if the answer is more than throwing a function or two at it, I would really appreciate it if you could explain how you go about it, rather than just giving the solution itself. Thank you!

Later edit: I was almost able to do it using the qdap package, but I couldn't get around it filling what should have been NAs (because of the strings' unequal lengths) with characters from the beginning of the string. This is very odd behavior that wasn't explained in the documentation; otherwise it is a very promising function.

Another strange behavior that I noticed in my lame attempts to solve this was automatically transforming from characters into factors. However, I wasn't able to pinpoint where it happens along the way.

There are a number of potential options, depending on details of what you are interested in. See @Elin's comment above regarding missing 32 in 5-432.

One possibility to consider is str_split_fixed from the stringr package:

library(stringr)
str_split_fixed("1234", "", 7)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "1"  "2"  "3"  "4"  ""   ""   ""  

An empty pattern "" splits the string into individual characters, and here the call asks for 7 pieces, returned as a character matrix (the last 3 are empty strings). Currently, when no character is available for a position, it returns an empty string rather than NA (see the related GitHub issue).

If the number of columns was based on the maximum number of characters possible for columns A and B (7 and 5 for example), one could do the following:

as.data.frame(lapply(df, function(x) str_split_fixed(x, "", n=max(nchar(x)))))

  A.1 A.2 A.3 A.4 A.5 A.6 A.7 B.1 B.2 B.3 B.4 B.5
1   1   2   3   4               5   -   4   3   2
2   a   b   c   .               1   -   0        
3   e   -   2   .   1   a   d   a   ,   d        

Note: To replace the empty strings with NA afterwards (assuming the result above has been assigned back to df):

df[df==""] <- NA

  A.1 A.2 A.3 A.4  A.5  A.6  A.7 B.1 B.2 B.3  B.4  B.5
1   1   2   3   4 <NA> <NA> <NA>   5   -   4    3    2
2   a   b   c   . <NA> <NA> <NA>   1   -   0 <NA> <NA>
3   e   -   2   .    1    a    d   a   ,   d <NA> <NA>
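Putting the steps above together, here is a minimal sketch, assuming stringr is installed and that the column names from the question (A1, A2, ...) are wanted rather than A.1, A.2. The helper name split_to_chars is made up for illustration and is not part of stringr. stringsAsFactors = FALSE is passed explicitly because, before R 4.0.0, as.data.frame converted strings to factors by default, which may be the conversion mentioned in the question.

```r
library(stringr)

df <- data.frame(
  A = c("1234", "abc.", "e-2.1ad"),
  B = c("5-432", "1-0", "a,d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one character vector into a data frame
# with one column per character position.
split_to_chars <- function(x, prefix) {
  # Character matrix with as many columns as the longest string
  m <- str_split_fixed(x, "", n = max(nchar(x)))
  # str_split_fixed pads short strings with "", so turn those into NA
  m[m == ""] <- NA_character_
  colnames(m) <- paste0(prefix, seq_len(ncol(m)))
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Apply the helper to every column, passing the column name as the prefix
pieces <- Map(split_to_chars, df, names(df))

# unname() stops cbind from prefixing the names again (A.A1, A.A2, ...)
newdf <- do.call(cbind, unname(pieces))
newdf
```

Because the width is computed per column with max(nchar(x)), nothing has to be hard-coded when the strings change.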

This is my tidyverse solution. Writing a function is new to me; any suggestions for improvement would be appreciated.

library(tidyverse)
df <- data.frame(stringsAsFactors=FALSE,
        A = c("1234", "abc.", "e-2.1ad"),
        B = c("5-432", "1-0", "a,d"))    

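# Split each string into a vector of single characters
a_split <- str_split(df$A, "")
b_split <- str_split(df$B, "")
# f1 picks the num-th character of each of the three strings (NA if a string is too short)
f1 <- function(num, s) c(s[[1]][num], s[[2]][num], s[[3]][num])
# Column A has at most 7 characters
x <- 1:7
all_a <- lapply(x, f1, a_split)
# Column B has at most 5 characters
x <- 1:5
all_b <- lapply(x, f1, b_split)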
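The code above stops at two lists of column vectors. One possible way to finish it is sketched below. Note that this is a variation and not the author's code: it swaps in base R's strsplit for str_split so it has no package dependencies, and the names nth_char and split_column are made up for illustration. It also generalises f1 so that the number of rows and the maximum string length are not hard-coded.

```r
df <- data.frame(
  A = c("1234", "abc.", "e-2.1ad"),
  B = c("5-432", "1-0", "a,d"),
  stringsAsFactors = FALSE
)

# Generalised f1: take the num-th character of every string in the list s.
# Indexing past the end of a character vector yields NA, which gives the padding.
nth_char <- function(num, s) {
  vapply(s, function(chars) chars[num], character(1))
}

# Hypothetical helper: build one data frame of single-character columns
split_column <- function(x, prefix) {
  s <- strsplit(x, "")                    # list of single-character vectors
  positions <- seq_len(max(lengths(s)))   # 1..longest string in this column
  cols <- lapply(positions, nth_char, s)  # one vector per character position
  names(cols) <- paste0(prefix, positions)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

result <- cbind(split_column(df$A, "A"), split_column(df$B, "B"))
result
```

Here each element of cols plays the role of one element of all_a or all_b, so naming the list and calling as.data.frame is all that is needed to get the wide layout.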

We can use cSplit from splitstackshape to split every character in columns A and B into a separate column:

df1 <- splitstackshape::cSplit(df, c('A', 'B'), sep = '', stripWhite = FALSE)
df1

#   A_1 A_2 A_3 A_4 A_5  A_6  A_7 B_1 B_2 B_3 B_4 B_5 B_6 B_7
#1:   1   2   3   4  NA <NA> <NA>   5   -   4   3   2  NA  NA
#2:   a   b   c   .  NA <NA> <NA>   1   -   0  NA  NA  NA  NA
#3:   e   -   2   .   1    a    d   a   ,   d  NA  NA  NA  NA

However, this gave me some additional all-NA columns for B, which can be removed using Filter:

Filter(function(x) any(!is.na(x)), df1)
#   A_1 A_2 A_3 A_4 A_5  A_6  A_7 B_1 B_2 B_3 B_4 B_5
#1:   1   2   3   4  NA <NA> <NA>   5   -   4   3   2
#2:   a   b   c   .  NA <NA> <NA>   1   -   0  NA  NA
#3:   e   -   2   .   1    a    d   a   ,   d  NA  NA

data

df <- data.frame(stringsAsFactors=FALSE,
             A = c("1234", "abc.", "e-2.1ad"),
             B = c("5-432", "1-0", "a,d"))
