How can I split a character string in a dataframe into multiple columns

I'm working with a dataframe, one column of which contains values that are mostly numeric but may contain non-numeric entries. I would like to split this column into multiple columns. One of the new columns should contain the numeric portion of the original entry and another column should contain any non-numeric elements.

Here is a sample data frame:

df <- data.frame(ID=1:4,x=c('< 0.1','100','A 2.5', '200')) 

Here is what I would like the data frame to look like:

ID   x1   x2
1    <    0.1
2         100
3    A    2.5
4         200

One feature of the data I am currently taking advantage of is that the structure of the character strings is always as follows: the non-numeric elements (if they exist) always precede the numeric elements, and the two elements are always separated by a space.

I can use colsplit from the reshape package to split the column based on whitespace. The problem with this is that it replicates any entry that can't be split into two elements:

library(reshape)
# names must be passed as a character vector: names = c("x1", "x2")
df <- transform(df, x=colsplit(x, split=" ", names=c("x1","x2")))
df
ID  x1   x2
1   <    0.1
2   100  100
3   A    2.5
4   200  200

This is not terribly problematic, as I can just do some post-processing to remove the numeric elements from column "x1".
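A minimal sketch of that post-processing, in base R only. It assumes the stated data layout (a single space separates the prefix from the number), so an entry without a space has no non-numeric part. The `res` data frame is typed in by hand to mirror the replicated output above, and the names `res` and `no_prefix` are made up for illustration.

```r
# Original data, kept as character rather than factor
df <- data.frame(ID = 1:4, x = c("< 0.1", "100", "A 2.5", "200"),
                 stringsAsFactors = FALSE)

# Hand-typed copy of the replicated result shown above (not produced by colsplit here)
res <- data.frame(ID = df$ID,
                  x1 = c("<", "100", "A", "200"),
                  x2 = c("0.1", "100", "2.5", "200"),
                  stringsAsFactors = FALSE)

# Entries with no space had nothing to split, so their x1 value is a replica
no_prefix <- !grepl(" ", df$x, fixed = TRUE)
res$x1[no_prefix] <- ""
res
```

Checking the original column for a space, rather than testing `x1 == x2`, avoids blanking a legitimate entry such as "5 5".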

I can also accomplish what I would like to do using strsplit inside a function:

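library(data.table)  # provides rbindlist()

# Split the entry for one ID on the space; returns a one-row data frame
split.fn <- function(id){
  new.val <- unlist(strsplit(as.character(df$x[df$ID==id])," "))
  if(length(new.val)==1){
    # no non-numeric prefix; note that x1 is the literal string "NA" here
    return(data.frame(ID=id,x1="NA",x2=new.val))
  }else{
    return(data.frame(ID=id,x1=new.val[1],x2=new.val[2]))
  }
}
data.frame(rbindlist(lapply(unique(df$ID),split.fn)))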
ID   x1   x2
1    <    0.1
2    NA   100
3    A    2.5
4    NA   200      

but this seems cumbersome.

Basically, both options I've outlined here will work, but I suspect there is a more elegant or direct way to get the desired data frame.

You can use separate() from tidyr:

tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
#   ID   x1  x2
# 1  1    < 0.1
# 2  2 <NA> 100
# 3  3    A 2.5
# 4  4 <NA> 200

If you absolutely need to remove the NA values, then you can do

tdy <- tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
tdy[is.na(tdy)] <- ""

and then we have

tdy
#   ID x1  x2
# 1  1  < 0.1
# 2  2    100
# 3  3  A 2.5
# 4  4    200
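Note that x2 is still a character column after the split. If the numeric portion is needed as actual numbers, a conversion along these lines should work. This is only a sketch: `tdy` is typed in by hand to match the printed result above so that it runs without tidyr.

```r
# Hand-typed copy of the tdy result printed above
tdy <- data.frame(ID = 1:4,
                  x1 = c("<", "", "A", ""),
                  x2 = c("0.1", "100", "2.5", "200"),
                  stringsAsFactors = FALSE)

# Convert the numeric portion from character to numeric
tdy$x2 <- as.numeric(tdy$x2)
str(tdy)
```

separate() also has a `convert` argument that attempts this type conversion on the new columns, which may be more convenient when the NA values in x1 can stay as they are.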

This does not use any packages:

transform(df,
  x1 = ifelse(grepl(" ", x), sub(" .*", "", x), NA),
  x2 = sub(".* ", "", paste(x)))

giving:

  ID     x   x1  x2
1  1 < 0.1    < 0.1
2  2   100 <NA> 100
3  3 A 2.5    A 2.5
4  4   200 <NA> 200
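To see how the two regular expressions in this approach behave, here is a step-by-step sketch on a plain character vector. The intermediate names `has_space` and `prefix` are made up for illustration, and the final data frame simply leaves out the original x column.

```r
x <- c("< 0.1", "100", "A 2.5", "200")

# Which entries contain a space, i.e. have a non-numeric prefix?
has_space <- grepl(" ", x)

# " .*" matches from the first space to the end, so removing it keeps the prefix
prefix <- sub(" .*", "", x)

# Keep the prefix only where a space was present, otherwise NA
x1 <- ifelse(has_space, prefix, NA)

# ".* " matches everything up to the last space, so removing it keeps the number
x2 <- sub(".* ", "", x)

data.frame(ID = seq_along(x), x1, x2)
```

Entries without a space are left unchanged by both sub() calls, which is why the ifelse() guard is needed for x1 but not for x2.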
