Regex to find specific pattern in R
I have a dataset like the one below:
dput(d1)
structure(list(FNUM = structure(1L, .Label = "20140824-0227", class = "factor"),
DESCRIPTION = "From : J LTo : feedback@lsd.goe.sfcc : Bcc : Sent On : Mon Apr 13 08:59:18 S 2015Subject : RE:Re: Suspect illegally modified vehiclesBody : Our Ref: BS-CT-1408-0665Date : 2-Apr-2015Our Ref: 2015/Jan/3224Date : 2-Apr-2015Thank you very much! Please conduct a thorough check on the vehicle other than the exhaust system. Warm regards,J L--------------------------------------------On Mon, 4/13/15, feedback@lsd.goe.sf <feedback@lsd.goe.sf> wrote: Subject: RE:Re: Suspect illegally modified vehicles To: jl1229@yahoo.ca Received: Monday, April 13, 2015, 8:56 AM Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015 Tel : 1800 2255 582 Fax : 6553 5329 -------------------------------------------- On Mon, 4/6/15, feedback@lsd.goe.sf <feedback@lsd.goe.sf> wrote: Subject: Suspect illegally modified vehicles To: joa@dccs.ca Received: Monday, April 6, 2015, 11:06 AM Our Ref: GCE/VS/VS/VE/F20.000.000/37661 Date : 2-Apr-2015 Tel : 1812 2235 582 Fax : 6553 5329 Dear Ms L Our records show that the vehicle bearing registration"), .Names = c("FNUM",
"DESCRIPTION"), row.names = "1", class = "data.frame")
I use the regex below to extract the Our Ref: values:
> gsub(" *(Our Ref|Date) *:? *","",regmatches(d1[1,2],gregexpr("Our Ref *:[^:]+",d1[1,2]))[[1]])
[1] "BS-CT-1408-0665" "2015/Jan/3224"
[3] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/VS/VS/VE/F20.000.000/37661"
But I only want the Our Ref: values that start with GCE. How do I limit my output to the values that begin with GCE?
Desired result:
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/VS/VS/VE/F20.000.000/37661"
Update, for the second part of the problem:
dput(d1)
structure(list(FNUM = structure(1L, .Label = "20140824-0227", class = "factor"),
DESCRIPTION = "From : J LTo : feedback@lsd.goe.sfcc : Bcc : Sent On : Mon Apr 13 08:59:18 S 2015Subject : RE:Re: Suspect illegally modified vehiclesBody : Our Ref: BS-CT-1408-0665Date : 2-Apr-2015Our Ref: 2015/Jan/3224Date : 2-Apr-2015Thank you very much! Please conduct a thorough check on the vehicle other than the exhaust system. Warm regards,J L--------------------------------------------On Mon, 4/13/15, feedback@lsd.goe.sf <feedback@lsd.goe.sf> wrote: Subject: RE:Re: Suspect illegally modified vehicles To: jl1229@yahoo.ca Received: Monday, April 13, 2015, 8:56 AM Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015 Tel : 1800 2255 582 Fax : 6553 5329 -------------------------------------------- On Mon, 4/6/15, feedback@lsd.goe.sf <feedback@lsd.goe.sf> wrote: Subject: Suspect illegally modified vehicles To: joa@dccs.ca Received: Monday, April 6, 2015, 11:06 AM Our Ref: GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc Date : 2-Apr-2015 Tel : 1812 2235 582 Fax : 6553 5329 Our Ref: GCE/CC/PCF/FB/F20.000.000/233546/SK/PW Date : 2-Apr-2015 Dear Ms L Our records show that the vehicle bearing registration "), .Names = c("FNUM",
"DESCRIPTION"), row.names = "1", class = "data.frame")
> gsub(" *(Our Ref|Date) *:? *","",regmatches(d1[1,2],gregexpr("Our Ref *:\\s+GCE[^:]+",d1[1,2]))[[1]])
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc"
[3] "GCE/CC/PCF/FB/F20.000.000/233546/SK/PW"
However, I want to limit my result to:
[1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533"
[3] "GCE/CC/PCF/FB/F20.000.000/233546"
That is, I want only the first six slash-separated fields (v1/v2/v3/v4/v5/v6). Anything after the sixth field should be removed, so that each value ends with the number that follows the fifth slash.
GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc should change to GCE/QSMO/SQSS/SQ/F20.000.000/503533
and GCE/CC/PCF/FB/F20.000.000/233546/SK/PW should change to GCE/CC/PCF/FB/F20.000.000/233546
You can add a requirement that "GCE" (preceded by whitespace) occurs before your [^:]+ part:
regmatches(d1[1,2],gregexpr("Our Ref *:\\s+GCE[^:]+",d1[1,2]))
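As a quick way to check this idea in isolation, the sketch below runs the same pattern on a shortened string. Here txt is a hypothetical stand-in for d1[1,2], not the full dataset. It also shows an alternative, assuming you would rather keep the original pattern: extract every Our Ref: value first, then filter with startsWith.

```r
# txt is a shortened, hypothetical stand-in for d1[1, 2]
txt <- "Our Ref: 2015/Jan/3224Date : 2-Apr-2015 Our Ref: GCE/VS/VS/VE/F20.000.000/38104 Date : 8-Apr-2015"

# Option 1: require GCE inside the pattern itself
m1 <- regmatches(txt, gregexpr("Our Ref *:\\s+GCE[^:]+", txt))[[1]]
gce1 <- gsub(" *(Our Ref|Date) *:? *", "", m1)

# Option 2: extract every Our Ref value, then keep those starting with GCE
m2 <- regmatches(txt, gregexpr("Our Ref *:[^:]+", txt))[[1]]
all_refs <- gsub(" *(Our Ref|Date) *:? *", "", m2)
gce2 <- all_refs[startsWith(all_refs, "GCE")]

print(gce1)
print(gce2)
```

Both options should give the same single GCE value for this input. Option 2 keeps the extraction and the filtering as separate steps, which may be easier to adjust if the prefix changes.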
EDIT: try this. You can repeat a group exactly n times with {n}:
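# Each field stops at a slash, whitespace or hyphen; without the \\s the last
# field would run on into " Date : 8" and the gsub would leave a stray digit
gsub(" *(Our Ref|Date) *:? *", "",
     regmatches(d1[1,2],
                gregexpr("Our Ref *:\\s+GCE(/[^/\\s-]+){5}",
                         d1[1,2], perl=TRUE))[[1]])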
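If the values have already been extracted, the same {n} idea can also be applied afterwards to trim each one to its first six fields. This is a minimal sketch, assuming a hypothetical vector refs that holds the extracted values; the pattern captures the first field plus five "/field" groups and discards the rest.

```r
# Hypothetical input: values already extracted, some with extra trailing fields
refs <- c("GCE/VS/VS/VE/F20.000.000/38104",
          "GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc",
          "GCE/CC/PCF/FB/F20.000.000/233546/SK/PW")

# Group 1 = first field followed by exactly five "/field" groups;
# the trailing .* swallows anything after the sixth field
trimmed <- sub("^([^/]+(/[^/]+){5}).*$", "\\1", refs, perl = TRUE)
print(trimmed)
```

Values with fewer than six fields do not match the pattern, so sub returns them unchanged.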
Here is a different approach using strsplit to split on one or more non-digit characters followed by a space (the pattern \\D+ followed by a space):
splts <- strsplit(d1$DESCRIPTION, "\\D+ ")[[1]]
splts[grep("GCE", splts)]
# [1] "GCE/VS/VS/VE/F20.000.000/38104" "GCE/QSMO/SQSS/SQ/F20.000.000/503533"
# [3] "GCE/CC/PCF/FB/F20.000.000/233546"
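To see why this split isolates the reference, it can help to look at all the pieces on a short input. The sketch below uses a shortened, hypothetical string txt rather than the real DESCRIPTION column. The approach appears to rely on the sixth field being all digits, so that the trailing "/lc Date : " is consumed as a separator.

```r
# txt is a shortened, hypothetical stand-in for d1$DESCRIPTION
txt <- "8:56 AM Our Ref: GCE/QSMO/SQSS/SQ/F20.000.000/503533/lc Date : 2-Apr-2015 Tel : 1812"

# Every run of non-digits that ends in a space acts as a separator
pieces <- strsplit(txt, "\\D+ ")[[1]]
print(pieces)

# Keep only the pieces that contain GCE
gce <- pieces[grep("GCE", pieces)]
print(gce)
```

A reference whose sixth field contains letters would not be cut at the same place, so this is worth checking against the real data before relying on it.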