R Regular Expression Lookbehind
I have a vector filled with strings of the following format: <year1><year2><id1><id2>
The first entries of the vector look like this:
199719982001
199719982002
199719982003
199719982003
For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001.
I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721.
I have tried doing this with the stringr package, and created the following regex:
"^\\d{4}|\\d{1}(?<=\\d{3}$)"
to pull out year1 and id1. However, when using the lookbehind I get an "invalid regular expression" error. This is a bit puzzling to me: can R not handle lookaheads and lookbehinds?
You will need to use gregexpr from the base package. This works:
> s <- "199719982001"
> gregexpr("^\\d{4}|\\d{1}(?<=\\d{3}$)",s,perl=TRUE)
[[1]]
[1] 1 12
attr(,"match.length")
[1] 4 1
attr(,"useBytes")
[1] TRUE
Note the perl=TRUE setting. Lookaheads and lookbehinds are Perl-style (PCRE) features, which base R only enables with perl=TRUE. For more details, see ?regex.
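To see the matched text rather than the match positions, the gregexpr result can be passed to regmatches, which is also part of base R. A minimal sketch, reusing the same s and pattern as above (the variable names pattern, m and matched are only illustrative):

```r
s <- "199719982001"
pattern <- "^\\d{4}|\\d{1}(?<=\\d{3}$)"

# perl = TRUE switches to the PCRE engine, which understands lookbehinds
m <- gregexpr(pattern, s, perl = TRUE)

# regmatches() returns one character vector of matches per input string
matched <- regmatches(s, m)[[1]]
print(matched)
```

This makes it easier to check which characters each alternative of the pattern actually picked up.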
Judging from the output, though, your regular expression does not catch id1: the second match is at position 12, the last digit of the string, whereas id1 sits at position 9.
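One possible way to make the second alternative land on id1 is to use a lookahead instead of a lookbehind: match a digit that is followed by exactly three more digits and then the end of the string. This is only a sketch, and it assumes every entry ends with a one-digit id1 followed by a three-digit id2, as in the sample data:

```r
s <- c("199719982001", "199719982002", "199719982003")

# year1 at the start, or the single digit that has exactly three digits after it
pattern <- "^\\d{4}|\\d(?=\\d{3}$)"

parts <- regmatches(s, gregexpr(pattern, s, perl = TRUE))

# glue year1 and id1 together for each entry
year1_id1 <- vapply(parts, paste0, character(1), collapse = "")
print(year1_id1)
```

The non-zero part of id2 would still need to be extracted separately.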
Since this is a fixed format, why not use substr? year1 is extracted using substr(s,1,4), id1 using substr(s,9,9), and id2 using as.numeric(substr(s,10,13)). In the last case I used as.numeric to get rid of the leading zeroes.
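Because substr is vectorized, the three pieces can be extracted for the whole vector at once and glued together with paste0. A small sketch on the sample data; the strings here are 12 characters long, so it stops at position 12 (substr truncates a stop position past the end of the string, so 13 behaves the same):

```r
s <- c("199719982001", "199719982002", "199719982003")

year1 <- substr(s, 1, 4)
id1   <- substr(s, 9, 9)
# characters 10 to 12 hold id2; as.numeric() drops the leading zeros
id2   <- as.numeric(substr(s, 10, 12))

result <- paste0(year1, id1, id2)
print(result)
```

For the first entry this yields 199721, the output asked for in the question.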
You can use sub.
sub("^(.{4}).{4}(.{1}).*([1-9]{1,3})$","\\1\\2\\3",s)
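Note that the greedy .* in this pattern consumes as much as it can, so the last group only ever captures the final digit. That is enough for ids such as 001, but an id2 like 012 would be reduced to 2. If stripping leading zeros is what is wanted (an assumption about the intent), a variant that consumes the zeros explicitly could look like this:

```r
s <- c("199719982001", "199719982012", "199719982120")

# 0* eats the leading zeros of id2, [0-9]+ keeps the rest
# (an id2 of 000 comes out as a single 0)
result <- sub("^(.{4}).{4}(.)0*([0-9]+)$", "\\1\\2\\3", s, perl = TRUE)
print(result)
```

The second and third entries are made up for illustration; they are not part of the original data.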