How can I split a column in a dataframe by regex?
Example:
ID <- c(1:5)
v1 <- c("abc1", "d2", "eF34", "GHi567", "NoNumber")
df <- data.frame(ID, v1, stringsAsFactors = FALSE)
I want to do something like the following, but in simpler code and possibly one function. str_match_all in stringr would be good, but it requires an atomic vector. I suppose I could write a row loop, but I'd like something already vectorized. Will something in stringi be useful here?
pattern1 <- "([A-Za-z]+)([0-9]+)"
df$v2 <- sub(pattern = pattern1, replacement = "\\1", x = df$v1)
df$v3 <- sub(pattern = pattern1, replacement = "\\2", x = df$v1)
Also, I'd like to be able to handle the non-match in row 5 (sub returns the input unchanged when the pattern does not match), so that df$v3[5] ends up as NA.
One option would be to split the string with strsplit, using a lookaround as the pattern, and convert the list output to a matrix using stri_list2matrix from stringi. This pads with NA any list element that is shorter than the longest list element.
library(stringi)
df[paste0('v', 2:3)] <- stri_list2matrix(strsplit(df$v1,
'(?<=[A-Za-z])(?=[0-9])', perl=TRUE), byrow=TRUE)
df
# ID v1 v2 v3
#1 1 abc1 abc 1
#2 2 d2 d 2
#3 3 eF34 eF 34
#4 4 GHi567 GHi 567
#5 5 NoNumber NoNumber <NA>
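If stringi is not available, the padding step can be approximated in base R. This is only a sketch of the same idea, not what stri_list2matrix does internally; the names parts, n and padded are made up for the illustration.

```r
v1 <- c("abc1", "d2", "eF34", "GHi567", "NoNumber")

# Split at the zero-width position between a letter and a digit.
parts <- strsplit(v1, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)

# Indexing past the end of a character vector yields NA,
# so every element is padded to the longest length.
n <- max(lengths(parts))
padded <- t(vapply(parts, function(p) p[seq_len(n)], character(n)))
padded
```

The result is a character matrix with one row per input string, and it can be assigned to df[paste0('v', 2:3)] in the same way as above.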
Or use extract from tidyr. We can paste NA onto the strings that don't end in a numeric element and then use extract. This also has the option to convert the column class by specifying convert=TRUE.
library(tidyr)
df$v1 <- with(df, ifelse(grepl('\\d+$', v1),v1, paste0(v1,NA)) )
extract(df, v1, into=c('v2', 'v3'), '([A-Za-z]+)([0-9]+|NA)',
remove=FALSE, convert=TRUE)
# ID v1 v2 v3
#1 1 abc1 abc 1
#2 2 d2 d 2
#3 3 eF34 eF 34
#4 4 GHi567 GHi 567
#5 5 NoNumberNA NoNumber NA
Or a base R option would be
df[paste0('v', 2:3)] <- read.table(text=gsub('([A-Za-z]*)([0-9]*)',
'\\1 \\2', df$v1), header=FALSE, stringsAsFactors=FALSE, fill=TRUE)
A slight modification to the pattern, plus as.numeric for coercion to the proper class:
pattern1 <- "([A-Za-z]+)([0-9]*)"
df$v2 <- sub(pattern = pattern1, replacement = "\\1", x = df$v1)
df$v3 <- as.numeric(sub(pattern = pattern1, replacement = "\\2", x = df$v1))
df
ID v1 v2 v3
1 1 abc1 abc 1
2 2 d2 d 2
3 3 eF34 eF 34
4 4 GHi567 GHi 567
5 5 NoNumber NoNumber NA
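The two sub calls above each run the same pattern over the column. As an alternative sketch (not part of the answer above), base R's regexec and regmatches can capture both groups in one pass. The anchors ^ and $ and the names m, mat and res are assumptions added for the illustration, and the rbind step relies on every string matching the pattern.

```r
v1 <- c("abc1", "d2", "eF34", "GHi567", "NoNumber")

# regexec() reports the full match plus each capture group;
# regmatches() turns that into one character vector per input string.
m <- regmatches(v1, regexec("^([A-Za-z]+)([0-9]*)$", v1))

# Every string matches here, so each element has length 3
# (full match, letters, digits) and they stack into a matrix.
mat <- do.call(rbind, m)

res <- data.frame(v1 = v1,
                  v2 = mat[, 2],
                  v3 = as.numeric(mat[, 3]),  # "" is coerced to NA
                  stringsAsFactors = FALSE)
res
```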
Using the development version of data.table, v1.9.5:
require(data.table) #v1.9.5
setDT(df)[, c("c1", "c2") := tstrsplit(v1, "(?<=[[:alpha:]])(?=[0-9])", perl=TRUE)]
# ID v1 c1 c2
# 1: 1 abc1 abc 1
# 2: 2 d2 d 2
# 3: 3 eF34 eF 34
# 4: 4 GHi567 GHi 567
# 5: 5 NoNumber NoNumber NA
Regex borrowed from @akrun. Use type.convert=TRUE if you want c2 to be converted to numeric automatically during tstrsplit().