I have a column in my dataset containing character strings that I want to split.
df = data.frame(col = c("BrBkRY","BBkRBr","YBRG","RBBk"))
This is the vector that I want to use to conditionally split.
sep = c("Br","Bk","R","Y","B","G")
This is what it should look like in the end. I did that by hand.
df2 = data.frame(col = c("BrBkRY","BBkRBr","YBRG","RBBk"),
col1 = c("Br","B","Y","R"),
col2 = c("Bk","Bk","B","B"),
col3 = c("R","R","R","Bk"),
col4 = c("Y","Br","G",""))
df2
     col col1 col2 col3 col4
1 BrBkRY   Br   Bk    R    Y
2 BBkRBr    B   Bk    R   Br
3   YBRG    Y    B    R    G
4   RBBk    R    B   Bk
I was thinking of using a regex, but usually you need a splitting character such as . or - for that. With a string made only of letters, I don't know how to do it. Moreover, I don't want to split BkB into B, k and B; I want to separate it into Bk and B. Is there a package that can do this?
You can use lookahead and lookbehind to do the split with a regular expression. This expression says to split at the empty position between any character and a capital letter. (?<=.) requires a preceding "any character" and (?=[A-Z]) requires a following capital. The "any character" and the capital are not actually part of the match, so they are not consumed by the split. Note that this does not use sep itself: it relies on every code starting with a capital letter and containing no other capitals, which holds for the codes in the question.
> lst <- strsplit(as.character(df$col), '(?<=.)(?=[A-Z])', perl=TRUE)
> lst
[[1]]
[1] "Br" "Bk" "R" "Y"
[[2]]
[1] "B" "Bk" "R" "Br"
[[3]]
[1] "Y" "B" "R" "G"
[[4]]
[1] "R" "B" "Bk"
Then the columns can be built, for example exactly as in akrun's answer:
dfN <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
colnames(dfN)[-1] <- paste0("col", colnames(dfN)[-1])
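The `length<-` idiom in the cbind line above is compact, so here is a minimal sketch of what the padding step does on its own. It uses a small hypothetical two-element list (not the full lst) and only base R:

```r
# Hypothetical ragged list, standing in for the result of strsplit().
pieces <- list(c("Br", "Bk", "R", "Y"), c("R", "B", "Bk"))

# Length of the longest element (4 here).
n <- max(lengths(pieces))

# `length<-`(x, n) extends x to length n, filling the new slots with NA.
padded <- lapply(pieces, `length<-`, n)

# Now every element has the same length, so rbind() gives a clean matrix.
mat <- do.call(rbind, padded)
mat
```

The second row ends in NA because its element had only three pieces; cbind with the original data frame then turns each matrix column into a data frame column.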
We can use str_extract_all to extract the components into a list, then rbind the list elements after padding them with NA so that all elements have the same length, and cbind the result with the original dataset:
library(stringr)
lst <- str_extract_all(df$col, paste(sep, collapse="|"))
dfN <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
colnames(dfN)[-1] <- paste0("col", colnames(dfN)[-1])
dfN
#     col col1 col2 col3 col4
#1 BrBkRY   Br   Bk    R    Y
#2 BBkRBr    B   Bk    R   Br
#3   YBRG    Y    B    R    G
#4   RBBk    R    B   Bk <NA>
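One thing worth keeping in mind with a pattern built by paste(sep, collapse="|"): in Perl-style regex engines the alternatives are tried left to right, so a short code listed before a longer one that starts with the same letter can win. The sep in the question already lists Br and Bk before B, so it works as given. The sketch below is not part of the answer above; it uses base R (regmatches and gregexpr with perl = TRUE) and a deliberately reordered, hypothetical sep_bad to show the effect and one way to guard against it:

```r
# Hypothetical "bad" ordering: single-letter codes come first.
sep_bad <- c("B", "R", "Y", "G", "Br", "Bk")

# Put longer codes first so "Br"/"Bk" are tried before "B".
sep_sorted <- sep_bad[order(nchar(sep_bad), decreasing = TRUE)]

pattern_bad  <- paste(sep_bad, collapse = "|")
pattern_good <- paste(sep_sorted, collapse = "|")

x <- c("BrBkRY", "RBBk")

# With the bad order, "B" matches first and the "r"/"k" are dropped.
bad  <- regmatches(x, gregexpr(pattern_bad,  x, perl = TRUE))

# With longer codes first, the two-letter codes are recovered.
good <- regmatches(x, gregexpr(pattern_good, x, perl = TRUE))

bad[[1]]   # "B" "B" "R" "Y"
good[[1]]  # "Br" "Bk" "R" "Y"
```

Sorting by decreasing nchar before collapsing is a cheap safeguard if sep comes from user input or may change later.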
Or a base R option is with read.csv and gsub:
cbind(df[1], read.csv(text=sub("^,", "", gsub(paste0("(?=(",
paste(sep, collapse="|"), "))"), ",", df$col, perl = TRUE)),
header=FALSE, col.names = paste0("col", 1:4), fill = TRUE))
#     col col1 col2 col3 col4
#1 BrBkRY   Br   Bk    R    Y
#2 BBkRBr    B   Bk    R   Br
#3   YBRG    Y    B    R    G
#4   RBBk    R    B   Bk
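The nested call above is easier to follow when the intermediate strings are inspected. As a rough sketch on two of the inputs, with the steps pulled apart into hypothetical intermediate variables (pattern, with_commas, csv_lines):

```r
sep <- c("Br", "Bk", "R", "Y", "B", "G")
x <- c("BrBkRY", "RBBk")

# Zero-width lookahead: matches the empty position just before any code.
pattern <- paste0("(?=(", paste(sep, collapse = "|"), "))")

# Insert a comma at each of those positions (including the very start).
with_commas <- gsub(pattern, ",", x, perl = TRUE)

# Drop the leading comma so each string is a plain CSV line.
csv_lines <- sub("^,", "", with_commas)
csv_lines
```

Each element of csv_lines is then a comma-separated record, which is why read.csv(text = ...) with fill = TRUE can parse it into columns and leave the missing fourth field of the shorter row empty.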