Extracting a substring from a data frame column in R using a regular expression

Question

I am fairly new to R, so please go easy on me if this is a stupid question.

I have a data frame called foo:

> head(foo)
  Old.Clone.Name New.Clone.Name                                  File
1         A          Aa           A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2         B          Bb   B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3         C          Cc   C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4         D          Dd    D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5         E          Ee           F_mask_MF_final_IS2_SAED9-1_02.nrrd
6         F          Ff    F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd

I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:

SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02

I then want to use these codes to search another directory for other files that contain the same code.

I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.

I have tried:

library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))

but this just returns [1] NA.

Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.

Answer 1

Hello. If you read the data in as a table file, then foo[3] is a one-column data frame (a list), and str_extract expects a character vector rather than a list. You can use lapply to apply str_extract to that column; lapply returns a list holding the vector of matches.

lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))

Result:

[1] "SAEE7-1_02"  "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02"  "SAED9-1_02"
[6] "SAGP3-1_02"
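As a possible alternative to lapply, here is a minimal sketch that selects the column as a plain vector first. The data frame below is built by hand purely for illustration (three rows copied from the question), and the names pattern and codes are made up for this example.

```r
library(stringr)

# Illustrative data frame: the first three rows from the question
foo <- data.frame(
  Old.Clone.Name = c("A", "B", "C"),
  New.Clone.Name = c("Aa", "Bb", "Cc"),
  File = c("A_mask_MF_final_IS2_SAEE7-1_02.nrrd",
           "B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd",
           "C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd"),
  stringsAsFactors = FALSE
)

pattern <- "S[A-Z]{3}[0-9]{1,2}-[0-9]_02"

# foo[[3]] (equivalently foo$File) is a character vector, one string per row,
# so str_extract returns one match per row
codes <- str_extract(foo[[3]], pattern)
codes

# Optionally keep the codes next to the file names
foo$Code <- codes
```

Selecting with [[3]] or $File avoids the list wrapper, so the result is an ordinary character vector that can be passed straight to other functions.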

Answer 2

str_extract(foo[[3]],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")

seems to work. Somehow, my R gave me

Error in check_pattern(pattern, string) : could not find function "regex"

when using your original expression, which probably means my installed stringr is an older version that does not provide regex().
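For comparison, here is a small sketch of the two ways of asking for case-insensitive matching, assuming a stringr version that provides regex(). The input strings in x are hypothetical and only serve to show a lower-case code being matched.

```r
library(stringr)

# Hypothetical file names, one with a lower-case code
x <- c("a_mask_saee7-1_02.nrrd", "B_mask_SADQ15-1_02.nrrd")

# Inline flag: (?i) turns on case-insensitive matching inside the pattern
inline <- str_extract(x, "(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")

# Modifier function: regex() with ignore_case = TRUE
modifier <- str_extract(x, regex("S[A-Z]{3}[0-9]{1,2}-[0-9]_02", ignore_case = TRUE))

# Both approaches should give the same matches
identical(inline, modifier)
```

The inline flag has the advantage of not depending on regex() being available, which is what the error message above ran into.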

Answer 3

The following code reproduces what you asked for (just copy and paste it into your R console):

library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd


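# Reshape the scanned tokens into a 3-column matrix, filling row by row
foo = matrix(foo, ncol = 3, byrow = TRUE)
# Use the first row as the column names, then drop it from the data
colnames(foo) = foo[1, ]
foo = foo[-1, ]
foo
# Extract the code from the third column, which is a character vector
str_extract(foo[, 3], regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))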

The reason you get NA is subtle. For a matrix, R stores entries column by column, so foo[3] is the single entry in the 3rd row and 1st column. For a data frame, foo[3] is a one-column data frame rather than a character vector, so str_extract does not see the individual strings. To refer to the third column as a vector, use foo[,3], or convert with foo<-data.frame(foo) and use foo[[3]].
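The indexing difference can be checked on a toy example. This is only a sketch with made-up objects m and df, not the question's data, and it uses base R only.

```r
# Small illustrative objects; matrix() fills column by column by default
m <- matrix(c("a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"), ncol = 3)
df <- data.frame(m, stringsAsFactors = FALSE)

m[3]            # single element: row 3, column 1 -> "a3"
m[, 3]          # third column as a character vector -> "c1" "c2" "c3"

class(df[3])    # "data.frame": still a one-column data frame
class(df[[3]])  # "character": the column itself
```

So for a matrix the comma form m[, 3] is needed, while for a data frame either df[, 3] or df[[3]] returns the column as a vector.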

Answer 1
1 2016-04-06 15:52:34

Answer 2
0 2016-04-06 15:18:24

Answer 3
0 2016-04-06 15:41:40
