I'm working with a dataframe, one column of which contains values that are mostly numeric but may contain non-numeric entries. I would like to split this column into multiple columns. One of the new columns should contain the numeric portion of the original entry and another column should contain any non-numeric elements.
Here is a sample data frame:
df <- data.frame(ID=1:4,x=c('< 0.1','100','A 2.5', '200'))
Here is what I would like the data frame to look like:
ID  x1  x2
 1   <  0.1
 2      100
 3   A  2.5
 4      200
One feature of the data I am currently taking advantage of is that the character strings always have the same structure: the non-numeric element (if there is one) always precedes the numeric element, and the two are always separated by a space.
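A quick way to confirm that every entry really follows this structure is to test it against a pattern before splitting. This is only a sketch: the pattern assumes the numeric part is a plain unsigned decimal (no sign, exponent, or thousands separator), which the sample data suggests but does not guarantee.

```r
# Sample values from the data frame above
x <- c("< 0.1", "100", "A 2.5", "200")

# Optional non-space prefix followed by one space, then an unsigned decimal.
# Assumption: numbers look like 100 or 0.1 (no sign, no exponent).
pattern <- "^([^ ]+ )?[0-9]+(\\.[0-9]+)?$"

ok <- grepl(pattern, x)

# Entries that break the assumed structure (none for the sample data)
x[!ok]
```

Any value listed by `x[!ok]` would need separate handling before a whitespace split can be trusted.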
I can use colsplit from the reshape package to split the column on whitespace. The problem is that it replicates any entry that can't be split into two elements:
require(reshape)
df <- transform(df, x=colsplit(x, split=" ", names=c("x1","x2")))
df
ID  x1  x2
 1   <  0.1
 2 100  100
 3   A  2.5
 4 200  200
This is not terribly problematic as I can just do some post-processing to remove the numeric elements from column "x1."
I can also accomplish what I would like to do using strsplit inside a function:
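library(data.table)  # provides rbindlist()
split.fn <- function(id){
  # split the entry for this ID on the space
  new.val <- unlist(strsplit(as.character(df$x[df$ID==id])," "))
  if(length(new.val)==1){
    # no prefix: x1 gets the string "NA" (not a missing value)
    return(data.frame(ID=id,x1="NA",x2=new.val))
  }else{
    return(data.frame(ID=id,x1=new.val[1],x2=new.val[2]))
  }
}
data.frame(rbindlist(lapply(unique(df$ID),split.fn)))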
ID  x1  x2
 1   <  0.1
 2  NA  100
 3   A  2.5
 4  NA  200
but this seems cumbersome.
Both options I've outlined here work, but I suspect there is a more elegant or direct way to get the desired data frame.
You can use separate() from tidyr:
tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
#   ID   x1  x2
# 1  1    < 0.1
# 2  2 <NA> 100
# 3  3    A 2.5
# 4  4 <NA> 200
If you absolutely need to remove the NA values, then you can do
tdy <- tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
tdy[is.na(tdy)] <- ""
and then we have
tdy
#   ID x1  x2
# 1  1  < 0.1
# 2  2    100
# 3  3  A 2.5
# 4  4    200
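Note that `tdy[is.na(tdy)] <- ""` blanks every NA in the data frame, not just those in x1, and that x2 is still a character column after the split. A more targeted variant might look like the sketch below. To keep it runnable without tidyr, `tdy` is recreated by hand here as a stand-in for the separate() result shown above.

```r
# Stand-in for the separate() result, built by hand (assumption: same shape)
tdy <- data.frame(ID = 1:4,
                  x1 = c("<", NA, "A", NA),
                  x2 = c("0.1", "100", "2.5", "200"),
                  stringsAsFactors = FALSE)

# Blank out NA only in the prefix column, leaving other columns untouched
tdy$x1[is.na(tdy$x1)] <- ""

# x2 holds strings after the split; convert when real numbers are needed
tdy$x2 <- as.numeric(tdy$x2)

str(tdy)
```

Restricting the replacement to x1 avoids silently turning a genuinely missing x2 into an empty string, which would then fail to convert to a number.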
This does not use any packages:
transform(df,
x1 = ifelse(grepl(" ", x), sub(" .*", "", x), NA),
x2 = sub(".* ", "", paste(x)))
giving:
  ID     x   x1  x2
1  1 < 0.1    < 0.1
2  2   100 <NA> 100
3  3 A 2.5    A 2.5
4  4   200 <NA> 200
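The two sub() calls work from opposite ends: `sub(" .*", "", x)` deletes everything from the first space onward, leaving the prefix, while `sub(".* ", "", x)` deletes everything up to the last space, leaving the number. If the original column should be dropped and x2 stored as numeric, the same idea can be wrapped in a small helper. This is a sketch only; `split_prefix` is a made-up name, not a function from any package.

```r
# Sample data from the question
df <- data.frame(ID = 1:4, x = c("< 0.1", "100", "A 2.5", "200"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split "prefix number" strings into two columns
split_prefix <- function(x) {
  x <- as.character(x)                       # guard against factor input
  has_prefix <- grepl(" ", x, fixed = TRUE)  # TRUE where a prefix exists
  data.frame(
    # keep the text before the first space, or NA when there is no prefix
    x1 = ifelse(has_prefix, sub(" .*", "", x), NA_character_),
    # keep the text after the last space and convert it to a number
    x2 = as.numeric(sub(".* ", "", x)),
    stringsAsFactors = FALSE
  )
}

# Combine the ID column with the two new columns, dropping the original x
result <- cbind(df["ID"], split_prefix(df$x))
result
```

as.numeric() returns NA with a warning for anything that is not a valid number, so unexpected entries show up as missing values in x2 rather than passing through unnoticed.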