How can I split a character string in a dataframe into multiple columns

I'm working with a dataframe, one column of which contains values that are mostly numeric but may contain non-numeric entries. I would like to split this column into multiple columns. One of the new columns should contain the numeric portion of the original entry and another column should contain any non-numeric elements.

Here is a sample data frame:

df <- data.frame(ID=1:4,x=c('< 0.1','100','A 2.5', '200')) 

Here is what I would like the data frame to look like:

ID   x1   x2
1    <    0.1
2         100
3    A    2.5
4         200

One feature of the data I am currently taking advantage of is that the structure of the character strings is always as follows: the non-numeric elements (if they exist) always precede the numeric elements, and the two elements are always separated by a space.

I can use colsplit from the reshape package to split the column based on whitespace. The problem with this is that it replicates any entry that can't be split into two elements:

require(reshape)
df <- transform(df, x=colsplit(x,split=" ", names=c("x1","x2")))
df
ID  x1   x2
1   <    0.1
2   100  100
3   A    2.5
4   200  200

This is not terribly problematic as I can just do some post-processing to remove the numeric elements from column "x1."
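As a minimal sketch of that post-processing step: the data frame below is a hand-built copy of the output shown above (it does not call colsplit, and the name res is made up for illustration). It blanks any x1 entry that parses as a number.

```r
# Hand-built copy of the colsplit result shown above (illustrative only)
res <- data.frame(ID = 1:4,
                  x1 = c("<", "100", "A", "200"),
                  x2 = c("0.1", "100", "2.5", "200"),
                  stringsAsFactors = FALSE)

# An x1 entry that converts cleanly to a number is a replicated value
is_num <- !is.na(suppressWarnings(as.numeric(res$x1)))

# Blank out the replicated numbers, keeping genuine prefixes such as "<"
res$x1[is_num] <- ""
res
```

This assumes a genuine prefix never looks like a number, which holds for the sample data.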

I can also accomplish what I would like to do using strsplit inside a function:

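library(data.table)  # provides rbindlist()

# Split the entry for one ID on the space; assumes df still holds the original column x
split.fn <- function(id){
  new.val <- unlist(strsplit(as.character(df$x[df$ID==id])," "))
  if(length(new.val)==1){
    # No prefix: x1 gets the literal string "NA" as a placeholder
    return(data.frame(ID=id,x1="NA",x2=new.val))
  }else{
    return(data.frame(ID=id,x1=new.val[1],x2=new.val[2]))
  }
}
data.frame(rbindlist(lapply(unique(df$ID),split.fn)))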
ID   x1   x2
1    <    0.1
2    NA   100
3    A    2.5
4    NA   200      

but this seems cumbersome.

Basically, both options I've outlined here will work, but I suspect there is a more elegant or direct way to get the desired data frame.
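For comparison, the strsplit idea can also be written without the per-ID lookup, in base R only. This is a rough sketch that relies on the assumption stated in the question (at most one space per entry); the names parts, padded, m and out are made up for illustration.

```r
df <- data.frame(ID = 1:4, x = c("< 0.1", "100", "A 2.5", "200"),
                 stringsAsFactors = FALSE)

# Split every entry on the single space at once
parts <- strsplit(df$x, " ", fixed = TRUE)

# Left-pad one-element results with NA so every entry has two pieces
# (this would fail for entries with more than one space)
padded <- lapply(parts, function(p) c(rep(NA_character_, 2 - length(p)), p))

# Stack the pieces into a two-column character matrix
m <- do.call(rbind, padded)

out <- data.frame(ID = df$ID,
                  x1 = m[, 1],
                  x2 = as.numeric(m[, 2]),  # numeric part as a real number
                  stringsAsFactors = FALSE)
out
```

Here x1 holds a real NA rather than the string "NA", and x2 is numeric rather than character.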

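You can use separate() from tidyr. With fill = "left", entries that have only one piece are padded with NA on the left, so the number always lands in x2: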

tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
#   ID   x1  x2
# 1  1    < 0.1
# 2  2 <NA> 100
# 3  3    A 2.5
# 4  4 <NA> 200

If you absolutely need to remove the NA values, then you can do

tdy <- tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
tdy[is.na(tdy)] <- ""

and then we have

tdy
#   ID x1  x2
# 1  1  < 0.1
# 2  2    100
# 3  3  A 2.5
# 4  4    200

This does not use any packages:

transform(df,
  x1 = ifelse(grepl(" ", x), sub(" .*", "", x), NA),
  x2 = sub(".* ", "", paste(x)))

giving:

  ID     x   x1  x2
1  1 < 0.1    < 0.1
2  2   100 <NA> 100
3  3 A 2.5    A 2.5
4  4   200 <NA> 200
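The result above still carries the original x column, and x2 is a character vector. If the layout from the question is wanted, a small variation along the following lines might do. It is only a sketch: the sample data is created with stringsAsFactors = FALSE so paste() is not needed, and out is a made-up name.

```r
df <- data.frame(ID = 1:4, x = c("< 0.1", "100", "A 2.5", "200"),
                 stringsAsFactors = FALSE)

out <- transform(df,
  # Text before the space, or NA when the entry has no space
  x1 = ifelse(grepl(" ", x), sub(" .*", "", x), NA),
  # Text after the space (or the whole entry), converted to numeric
  x2 = as.numeric(sub(".* ", "", x)))

# Drop the original column so only ID, x1 and x2 remain
out$x <- NULL
out
```

as.numeric() returns NA with a warning for anything that is not a valid number, so it is worth checking x2 for unexpected NA values on real data.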
