
Splitting character string in R based on characters

I have a column in my dataset that contains character strings I want to split.

df = data.frame(col = c("BrBkRY","BBkRBr","YBRG","RBBk"))

This is the vector of codes that I want to split on.

sep = c("Br","Bk","R","Y","B","G")

This is what the result should look like. I built it by hand.

df2 = data.frame(col = c("BrBkRY","BBkRBr","YBRG","RBBk"), 
                 col1 = c("Br","B","Y","R"),
                 col2 = c("Bk","Bk","B","B"),
                 col3 = c("R","R","R","Bk"),
                 col4 = c("Y","Br","G",""))
df2 
     col col1 col2 col3 col4
1 BrBkRY   Br   Bk    R    Y
2 BBkRBr    B   Bk    R   Br
3   YBRG    Y    B    R    G
4   RBBk    R    B   Bk     

I was thinking of using a regex, but a regex split usually needs a separator character such as . or - and these strings have none. Moreover, I don't want to split BkB into B, k and B; I want to split it into Bk and B. Is there a package that can do this?

You can use lookahead and lookbehind to do the split with a regular expression. This expression splits at the zero-width position between any character and a capital letter. (?<=.) requires a preceding character (any character) and (?=[A-Z]) requires a following capital letter. Because both are lookarounds, neither character is part of the match, so nothing gets "sucked up" in the split.

> lst <- strsplit(as.character(df$col), '(?<=.)(?=[A-Z])', perl=TRUE)
> lst
[[1]]
[1] "Br" "Bk" "R"  "Y" 

[[2]]
[1] "B"  "Bk" "R"  "Br"

[[3]]
[1] "Y" "B" "R" "G"

[[4]]
[1] "R"  "B"  "Bk"

Then the columns can be built, for example exactly as in akrun's answer:

dfN <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
colnames(dfN)[-1] <- paste0("col", colnames(dfN)[-1])
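
Note that the capital-letter split never looks at sep; it only assumes that every code starts with a capital letter and continues with non-capitals. A minimal sketch of a sanity check, assuming the df and sep from the question (the name unknown is just illustrative):

```r
df  <- data.frame(col = c("BrBkRY", "BBkRBr", "YBRG", "RBBk"))
sep <- c("Br", "Bk", "R", "Y", "B", "G")

# Split at every position that has a character before it and a capital after it
lst <- strsplit(as.character(df$col), "(?<=.)(?=[A-Z])", perl = TRUE)

# Pieces that are not among the known codes (empty if every piece is valid)
unknown <- setdiff(unlist(lst), sep)
unknown
```

If unknown is not empty, the data contains something that is not in sep, and extracting by the codes themselves may be the safer route.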

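We can use str_extract_all to extract the components into a list, pad the list elements with NA so that they all have the same length, rbind them, and cbind the result with the original dataset. The alternatives in the pattern are tried from left to right, so the two-letter codes must come before the single letters they start with (Br and Bk before B), as they do in sep.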

library(stringr)
lst <- str_extract_all(df$col, paste(sep, collapse="|"))
dfN <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
colnames(dfN)[-1] <- paste0("col", colnames(dfN)[-1])
dfN
#     col col1 col2 col3 col4
#1 BrBkRY   Br   Bk    R    Y
#2 BBkRBr    B   Bk    R   Br
#3   YBRG    Y    B    R    G
#4   RBBk    R    B   Bk <NA>
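
For readers without stringr, the same extraction idea can be sketched in base R with gregexpr and regmatches. This is only a sketch under a few assumptions: perl = TRUE is used so that alternatives are tried in order, sep is sorted by length here so longer codes win regardless of how it was written, and the names pattern, n and mat are illustrative.

```r
df  <- data.frame(col = c("BrBkRY", "BBkRBr", "YBRG", "RBBk"))
sep <- c("Br", "Bk", "R", "Y", "B", "G")

# Put longer codes first so that "Bk" is tried before "B"
pattern <- paste(sep[order(-nchar(sep))], collapse = "|")

# Extract every code found in each string
x   <- as.character(df$col)
lst <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Pad each element with NA up to the longest one, one row per string
n   <- max(lengths(lst))
mat <- t(vapply(lst, function(p) p[seq_len(n)], character(n)))
colnames(mat) <- paste0("col", seq_len(n))

dfN <- cbind(df[1], mat)
dfN
```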

Or, as a base R option, use gsub to insert a comma in front of each code and read.csv to parse the result:

cbind(df[1], read.csv(text=sub("^,", "", gsub(paste0("(?=(",
    paste(sep, collapse="|"), "))"), ",", df$col, perl = TRUE)),  
     header=FALSE, col.names = paste0("col", 1:4), fill = TRUE))
#     col col1 col2 col3 col4
#1 BrBkRY   Br   Bk    R    Y
#2 BBkRBr    B   Bk    R   Br
#3   YBRG    Y    B    R    G
#4   RBBk    R    B   Bk     
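
The nested call is easier to follow when broken into steps. The sketch below shows the intermediate comma-separated strings and derives the number of columns from the data instead of hard-coding 1:4. The names pattern, csv, n and out are illustrative, and colClasses = "character" is an added assumption to stop read.csv from guessing column types.

```r
df  <- data.frame(col = c("BrBkRY", "BBkRBr", "YBRG", "RBBk"))
sep <- c("Br", "Bk", "R", "Y", "B", "G")

# Zero-width lookahead: matches the position in front of any code
pattern <- paste0("(?=(", paste(sep, collapse = "|"), "))")

# Insert a comma at every such position, then drop the leading comma
csv <- sub("^,", "", gsub(pattern, ",", as.character(df$col), perl = TRUE))
csv

# Number of columns needed = largest number of fields in any row
n <- max(lengths(strsplit(csv, ",", fixed = TRUE)))

out <- cbind(df[1], read.csv(text = csv, header = FALSE, fill = TRUE,
                             col.names = paste0("col", seq_len(n)),
                             colClasses = "character"))
out
```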
