Select and extract different capture groups from string using regex
I would like to extract various parts of a string using regex patterns and capturing groups. I am able to filter the string using str_match_all, but I would like to be able to explicitly select one of the capturing groups defined in the regex. The problem is that using it inside a data.table does not yield the desired results.
library(data.table)

dt.test <- data.table(file_names = c("20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
                                     "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv"))
I am able to extract the various defined capturing groups using the following command:
stringi::stri_match_all(dt.test[1,file_names],regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210" "20200131" "20210228" "PROD_TEST_MF_delta" "20210228" "20210107" "20210210"
However, when accessing the resulting list and running that command inside a data.table, it assigns the value from the first row to all rows of the data.table, which is unexpected.
dt.test[,.(
file_names
, Extract.1 = unlist(stringi::stri_match_all(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)"))[3]
)]
Output:
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210228
Expected Output:
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531
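The repeated value can be reproduced outside of data.table. A minimal sketch, assuming the cause is that unlist() flattens the matches for all rows into one long vector, so that [3] returns a single string which data.table then recycles. The shortened, anchored pattern and the names x, m and flat are illustrative only:

```r
library(stringi)

x <- c("20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
       "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv")

# Shortened, illustrative pattern: only the two leading date blocks are captured
m <- stri_match_all(x, regex = "^(\\d+)_(\\d+)_")

length(m)          # one match matrix per input string
flat <- unlist(m)  # both matrices flattened into a single character vector
flat[3]            # one element, taken from the first string only
```

Because flat[3] has length one, every row receives the same value.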
Basically, I think I am looking for a way to define which capture group of the regex to extract. A not-so-elegant workaround is to split the string into columns using stringi::stri_match_first and then select the relevant column afterwards.
dt.test[,.(
file_names
, Extract.1 = stringi::stri_match_first(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
)][,.(file_names,Extract.1.V2)]
file_names Extract.1.V2
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20200131
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20200531
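The intermediate columns may not be needed at all. stri_match_first returns a character matrix with one row per input string, where column 1 is the full match and column n + 1 is capture group n, so a group can be picked by column index. A sketch under that assumption, where pat is a helper name introduced here and the unnecessary escapes before the underscores are dropped:

```r
library(data.table)
library(stringi)

dt.test <- data.table(file_names = c(
  "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
  "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv"))

pat <- "(?i)(\\d*)_(\\d*)_([A-Z]*_[A-Z]*_[A-Z]*_[A-Z]*)_(\\d*)_(\\d*)_(\\d*)"

# Column 3 of the match matrix holds capture group 2, one value per row
dt.test[, Extract.1 := stri_match_first(file_names, regex = pat)[, 3]]
```

Indexing the matrix with [, 3] keeps one value per row, unlike unlist(...)[3], which returns a single element.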
You could use the built-in function sub as follows:
dt.test[, Extract.1 := sub(".*delta_(\\d+)_.*", "\\1", file_names)]
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531
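One caveat with sub: when the pattern does not match, the input string is returned unchanged rather than NA. A small sketch of a guard for that case, assuming a hypothetical helper name extract_after_delta and a non-matching file name added purely for illustration:

```r
# Hypothetical helper: returns NA instead of the input when nothing matches
extract_after_delta <- function(x) {
  pat <- ".*delta_(\\d+)_.*"
  ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)
}

extract_after_delta(c(
  "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
  "unrelated_file.csv"  # illustrative input without a "delta_" block
))
```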
You can use the function tstrsplit from data.table to split the variable into multiple pieces and select the elements that are needed by specifying the argument keep:
dt.test[, Extract.1 := tstrsplit(file_names, "_", keep=7)]
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531
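The keep argument also accepts several positions, which allows more than one piece to be extracted in a single call. A sketch, assuming the hypothetical column names first_date and delta_date. Note that this relies on the underscore-separated layout of the file names rather than on regex capture groups:

```r
library(data.table)

dt.test <- data.table(file_names = c(
  "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
  "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv"))

# keep = c(1, 7) retains the 1st and 7th underscore-separated pieces;
# fixed = TRUE is passed on to strsplit() so "_" is treated literally
dt.test[, c("first_date", "delta_date") :=
          tstrsplit(file_names, "_", fixed = TRUE, keep = c(1, 7))]
```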