Select and extract different capture groups from a string using regular expressions

Question

I would like to extract various parts of a string using regex patterns and capturing groups. I am able to filter the string using stri_match_all, but I would like to be able to explicitly select one of the capturing groups defined in the regex. The problem is that doing this inside a data.table does not yield the desired results.

library(data.table)

dt.test <- data.table(file_names = c("20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv"
                                   , "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv" ))

I am able to extract the defined capturing groups for a single row using the following command:

stringi::stri_match_all(dt.test[1,file_names],regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
[[1]]
      [,1]                                                              [,2]       [,3]       [,4]                 [,5]       [,6]       [,7]      
 [1,] "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210" "20200131" "20210228" "PROD_TEST_MF_delta" "20210228" "20210107" "20210210"

However, when I access the resulting list and use that command inside a data.table, it assigns the value from the first row to all rows of the data.table, which is unexpected.

dt.test[,.(
file_names
 , Extract.1 = unlist(stringi::stri_match_all(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)"))[3]
)]

Output:

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210228

Expected Output:

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531

Basically, I am looking for a way to specify which capture group of the regex to extract. A not so elegant workaround is to split the string into columns using stringi::stri_match_first and then select the relevant column afterwards.

dt.test[,.(
  file_names
  , Extract.1 = stringi::stri_match_first(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
)][,.(file_names,Extract.1.V2)] 



                                                            file_names Extract.1.V2
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv     20200131
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv     20200531

Answer 1

You could use the built-in function sub as follows:

dt.test[, Extract.1 := sub(".*delta_(\\d+)_.*", "\\1", file_names)]

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531
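
The sub call above returns whichever group the backreference in the replacement points to. If the group should be chosen by its index instead, one possible base R sketch combines regexec and regmatches. The helper name extract_group and the simplified pattern below are illustrative assumptions, not part of the original answer:

```r
# Sample input taken from the question.
file_names <- c(
  "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
  "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv"
)

# Simplified variant of the question's pattern (assumption: six groups,
# with the third group made of four letter-only parts joined by "_").
pattern <- "(\\d+)_(\\d+)_([A-Za-z]+_[A-Za-z]+_[A-Za-z]+_[A-Za-z]+)_(\\d+)_(\\d+)_(\\d+)"

# Hypothetical helper: return capture group `group` for every element of `x`.
# regmatches() yields one character vector per input string, where element 1
# is the full match and element k + 1 is capture group k.
extract_group <- function(x, pattern, group) {
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(
    matches,
    function(m) if (length(m) > group) m[group + 1] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

print(extract_group(file_names, pattern, 2))
```

Because the helper returns one value per input string, it should also work on the right-hand side of := inside a data.table, for example dt.test[, Extract.1 := extract_group(file_names, pattern, 2)]. Strings that do not match yield NA.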

You can also use the function tstrsplit from data.table to split the variable into multiple parts and select the elements that are needed by specifying the argument keep:

dt.test[, Extract.1 := tstrsplit(file_names, "_", keep=7)]

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531
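
As for why the attempt in the question repeats the first row's value: stri_match_all returns a list with one matrix per file name, unlist flattens all of them into a single vector, and [3] then picks one scalar, which is recycled across every row. A minimal sketch of a vectorized alternative with stringi, assuming a simplified version of the question's pattern, selects a column of the match matrix instead:

```r
library(stringi)

# Sample input taken from the question.
file_names <- c(
  "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv",
  "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv"
)

# Simplified variant of the question's pattern (assumption: six groups).
pattern <- "(\\d+)_(\\d+)_([A-Za-z]+_[A-Za-z]+_[A-Za-z]+_[A-Za-z]+)_(\\d+)_(\\d+)_(\\d+)"

# stri_match_first_regex() returns a character matrix with one row per input
# string: column 1 is the full match and column k + 1 is capture group k.
m <- stri_match_first_regex(file_names, pattern)

# Selecting column 3 gives the second capture group for every row.
second_group <- m[, 3]
print(second_group)
```

Indexing the matrix by column keeps one result per row, so the same expression can be used as a column definition inside a data.table without the recycling problem.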
