How can I split a column in a dataframe by regex?

Example:

ID <- c(1:5)
v1 <- c("abc1", "d2", "eF34", "GHi567", "NoNumber")
df <- data.frame(ID, v1, stringsAsFactors = FALSE)

I want to do something like the following, but with simpler code and ideally a single function. str_match_all in stringr would be good, but it requires an atomic vector. I suppose I could write a row loop, but I'd like something already vectorized. Would something in stringi be useful here?

pattern1 <- "([A-Za-z]+)([0-9]+)"
df$v2 <- sub(pattern = pattern1, replacement = "\\1", x = df$v1)
df$v3 <- sub(pattern = pattern1, replacement = "\\2", x = df$v1)

Also, I'd like to be able to handle the non-matching case in row 5, so that df$v3[5] is NA.

One option is to split the string with strsplit, using a lookaround as the pattern, and convert the 'list' output to a 'matrix' using stri_list2matrix from stringi. This pads list elements that are shorter than the longest element with NA.

library(stringi)
df[paste0('v', 2:3)] <- stri_list2matrix(strsplit(df$v1,
                  '(?<=[A-Za-z])(?=[0-9])', perl=TRUE), byrow=TRUE)

 df
 #  ID       v1       v2   v3
 #1  1     abc1      abc    1
 #2  2       d2        d    2
 #3  3     eF34       eF   34
 #4  4   GHi567      GHi  567
 #5  5 NoNumber NoNumber <NA>
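To make the padding step more concrete, here is a minimal base R sketch of the same idea without stringi. The helper name pad_to is made up for illustration; it relies on `length<-` filling the missing positions with NA.

```r
# Same example values as in the question
v1 <- c("abc1", "d2", "eF34", "GHi567", "NoNumber")

# Split at the zero-width boundary between a letter and a digit
parts <- strsplit(v1, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)

# Hypothetical helper: extend a vector to length n, padding with NA
pad_to <- function(x, n) {
  length(x) <- n
  x
}

# Pad every element to the longest length, then bind the rows into a matrix
n_max <- max(lengths(parts))
m <- do.call(rbind, lapply(parts, pad_to, n = n_max))
m
```

Rows that contain no letter-to-digit boundary, such as "NoNumber", come back from strsplit as a single piece, so their second column is NA after padding.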

Or use extract from tidyr. We can paste NA onto the strings that don't end in digits and then use extract. It also has the option to convert the column 'class' by specifying convert=TRUE.

library(tidyr)
df$v1 <- with(df, ifelse(grepl('\\d+$', v1),v1, paste0(v1,NA)) )
extract(df, v1, into=c('v2', 'v3'), '([A-Za-z]+)([0-9]+|NA)', 
             remove=FALSE, convert=TRUE)
#   ID         v1       v2  v3
#1  1       abc1      abc   1
#2  2         d2        d   2
#3  3       eF34       eF  34
#4  4     GHi567      GHi 567
#5  5 NoNumberNA NoNumber  NA

Or a base R option would be:

df[paste0('v', 2:3)] <- read.table(text=gsub('([A-Za-z]*)([0-9]*)',
    '\\1 \\2', df$v1), header=FALSE, stringsAsFactors=FALSE, fill=TRUE)

A slight modification to the pattern, plus as.numeric to coerce to the proper class:

pattern1 <- "([A-Za-z]+)([0-9]*)"
df$v2 <- sub(pattern = pattern1, replacement = "\\1", x = df$v1)
df$v3 <- as.numeric(sub(pattern = pattern1, replacement = "\\2", x = df$v1))
df

  ID       v1       v2  v3
1  1     abc1      abc   1
2  2       d2        d   2
3  3     eF34       eF  34
4  4   GHi567      GHi 567
5  5 NoNumber NoNumber  NA
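Since the question asks for a single function, the two sub() calls could be wrapped as below. This is only a sketch: split_alpha_num is a hypothetical name, not part of any package, and it assumes every value is letters optionally followed by digits.

```r
# Hypothetical helper: split a letters+digits column into a text part and
# a numeric part, using the same pattern as the two sub() calls above.
split_alpha_num <- function(x, pattern = "([A-Za-z]+)([0-9]*)") {
  text <- sub(pattern, "\\1", x)
  # An empty second group yields "", which as.numeric() turns into NA
  num <- as.numeric(sub(pattern, "\\2", x))
  data.frame(v2 = text, v3 = num, stringsAsFactors = FALSE)
}

df <- data.frame(ID = 1:5,
                 v1 = c("abc1", "d2", "eF34", "GHi567", "NoNumber"),
                 stringsAsFactors = FALSE)

# Add both new columns in one assignment
df[c("v2", "v3")] <- split_alpha_num(df$v1)
df
```

Values that do not fit the assumed shape (for example, digits before letters) would pass through sub() only partly changed, so the helper would need an extra check for such input.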

Using the development version of data.table, v1.9.5:

require(data.table) #v1.9.5
setDT(df)[, c("c1", "c2") := tstrsplit(v1, "(?<=[[:alpha:]])(?=[0-9])", perl=TRUE)]
#    ID       v1       c1  c2
# 1:  1     abc1      abc   1
# 2:  2       d2        d   2
# 3:  3     eF34       eF  34
# 4:  4   GHi567      GHi 567
# 5:  5 NoNumber NoNumber  NA

Regex borrowed from @akrun. Use type.convert=TRUE if you want c2 to be converted to numeric automatically during tstrsplit().
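For a feel of what that conversion does, here is a small base R illustration using utils::type.convert() directly on the split pieces. It assumes tstrsplit() applies the same kind of conversion per column, as the argument name suggests; the vector c2 below is typed in by hand rather than taken from data.table.

```r
# Character pieces as they would come out of the split, NA for the missing part
c2 <- c("1", "2", "34", "567", NA)

# type.convert() picks the simplest type that fits the values;
# as.is = TRUE keeps strings as character instead of making factors
c2_num <- type.convert(c2, as.is = TRUE)
class(c2_num)
```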
