Regular expression in R - extract only match
My strings look as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub to extract the strings by matching the character f followed by one or more characters that are neither an underscore nor a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and any other characters. Replace the match with the backreference (\\1) to the captured group.
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
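The pattern above leaves trailing digits out of the captured group, so it yields "weo" where the question asks for "weo2". A minimal sketch of a variant that keeps the digits, assuming the wanted part is everything between "_f" and the next underscore (the name `res` is only for illustration):

```r
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
          "crb_gdp_g_100000_16_20_fweo2_all.xml",
          "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

# Capture everything after "_f" up to (not including) the next underscore;
# the surrounding .* consume the rest of the string so only the group remains
res <- sub(".*_f([^_]+)_.*", "\\1", str1)
print(res)
```

Because the leading .* is greedy, this picks the last "_f" in a string that contains several, which may or may not be what is wanted for other file names.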
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276 , which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
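Since gregexpr looks for all matches, the base call above returns a list with one element per string. When only the first match per string is needed, a sketch along the following lines, using regexpr instead, should give a plain character vector (the name `first_match` is only for illustration):

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

# regexpr() locates only the first match in each string, so regmatches()
# returns a character vector instead of a list.
# Caveat: strings without any match are dropped from the result.
first_match <- regmatches(x, regexpr("(?<=_f).*?(?=_)", x, perl = TRUE))
print(first_match)
```

The lookbehind (?<=_f) and lookahead (?=_) assert the delimiters without including them in the match, which is why no backreference is needed here.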
Edited to start the match on _f instead of f.
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
Update: capture the match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[[2]])
# weo2
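Your regex does not work because sub replaces only the part of the string that the pattern matches, so the pattern needs a leading and trailing .* to consume the rest of the string. You can also use \\w as a shorthand close to [[:alnum:]]; it additionally matches the underscore, which is why the quantifier below is lazy (+?).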
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
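To make the difference concrete, here is a small sketch comparing the pattern from the question with and without the surrounding .* (the names `s`, `partial` and `full` are only for illustration):

```r
s <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# Without .* only the matched piece "_ftv_" is replaced by the group "tv";
# everything before and after the match is left in place
partial <- sub("_f([[:alnum:]]+)_", "\\1", s)

# With .* on both sides the pattern matches the whole string,
# so the whole string is replaced by the captured group
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", s)

print(partial)
print(full)
```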
We could use the package unglue:
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)