Regular expressions in R: return only the match

Question

My strings look as follows:

crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt

I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").

My regular expression is:

regex.f = "_f([[:alnum:]]+)_"

No string contains more than one part matching the pattern. Why does the following command not work?

sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")

The command only removes "_f" from the string and returns the remaining string.

Answer 1

This can easily be achieved with qdapRegex:

df <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
"crb_gdp_g_100000_16_20_fweo2_all.txt", 
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)

Answer 2

We can use sub to extract the strings: match the character f followed by one or more characters that are neither an underscore nor a digit ( [^_0-9]+ ), captured as a group ( (...) ), followed by zero or more digits ( \\d* ), then an _ and any remaining characters. The whole match is replaced with the backreference ( \\1 ) to the captured group.

sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv"  "weo" "weo"
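
Because the pattern above excludes digits from the captured group, it returns "weo" where the question asked for "weo2". A minimal sketch of a variant that keeps everything up to the next underscore, assuming the wanted part never contains an underscore itself (the name `keep_digits` is only illustrative):

```r
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
          "crb_gdp_g_100000_16_20_fweo2_all.xml",
          "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

# Capture every character after "_f" that is not an underscore,
# so trailing digits such as the "2" in "weo2" stay in the group.
keep_digits <- sub(".*_f([^_]+)_.*", "\\1", str1)
print(keep_digits)
```

Which variant is appropriate depends on whether the trailing digits are part of the identifier you want.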

data

str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
    "crb_gdp_g_100000_16_20_fweo2_all.xml",
     "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

Answer 3

My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276 , which specifically looks at extracting text between parentheses. Here the parentheses are simply replaced by _f and _ .

x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))

Or with the stringr package:

library(stringr)

str_extract(x, "(?<=_f).*?(?=_)")

Edited to start the match on _f instead of f .
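
Note that gregexpr makes regmatches return a list with one element per input string. If a plain character vector is preferred in base R, one possible sketch uses regexpr instead. Since regmatches then silently drops strings without a match, the result is written back into an NA-filled vector to keep it aligned with the input. The fifth string and the names `hit` and `out` are hypothetical additions for illustration, and [^_]+ is used in place of .*? on the assumption that the wanted part contains no underscore.

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml",
       "no_match_here.txt")  # hypothetical string without "_f"

# regexpr finds the first match per string; -1 marks "no match".
m <- regexpr("(?<=_f)[^_]+(?=_)", x, perl = TRUE)
hit <- as.vector(m) > 0

# Fill matches into an NA vector so positions line up with x.
out <- rep(NA_character_, length(x))
out[hit] <- regmatches(x, m)
print(out)
```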

NOTE

akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.

Answer 4

Update: capture the match using str_match:

library(stringr)  
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2

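Your regex does not work because it lacks a leading and a trailing .* to match the whole string: sub only replaces the part that the pattern matches, so everything outside the match is kept. You can also use \\w as a shorthand for [[:alnum:]] (note that \\w also matches the underscore, hence the lazy +? below):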

sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
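
To make the difference concrete, here is a small sketch comparing the pattern from the question with a version padded by .* on both sides (the variable names `partial` and `full` are only illustrative):

```r
x <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# Unanchored pattern: only "_ftv_" is matched and replaced by "tv",
# so the rest of the string is left in place.
partial <- sub("_f([[:alnum:]]+)_", "\\1", x)
print(partial)

# Pattern padded with .* on both sides: the whole string is matched,
# so the replacement consists of the captured group alone.
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", x)
print(full)
```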

Answer 5

We could use the package unglue:

library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
       "crb_gdp_g_100000_16_20_fweo2_all.txt", 
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

pattern <-
  "crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv"   "weo2" "weo2"

Created on 2019-10-09 by the reprex package (v0.3.0)

5 answers

Answer 1: 4 votes, 2017-07-27 12:42:16
Answer 2: 2 votes, 2017-07-27 12:21:56
Answer 3: 2 votes, 2017-07-27 12:26:43
Answer 4: 1 vote, 2017-07-27 12:48:52
Answer 5: 0 votes, 2019-10-09 12:25:58
