Removing Custom Words From Text Variables in R
I have a data set which looks like the following:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
> dat
ID ADDRESS
1 1 EAST SS BLVD
2 2 SOUTH AA STREET
3 3 XX EAST ST
4 4 ZZ NORTH ROAD
5 5 WEST TR TRAIL
I want to remove everything from the address that is not in my list of wanted words. I am using the following code, which is incorrect and does not work.
dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
|(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
|(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
|(TRAIL$)|(CIR$)]","",dat$ADDRESS)
> dat
ID ADDRESS FEATURE
1 1 EAST SS BLVD AST SS BLVD
2 2 SOUTH AA STREET OUTH AA STREET
3 3 XX EAST ST XX EAST ST
4 4 ZZ NORTH ROAD ZZ NORTH ROAD
5 5 WEST TR TRAIL EST TR TRAIL
The output that I want is:
> dat1
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
I am not great at regex, so any help is appreciated, and any references for regex in R would be helpful.
You may use
(?xs).*\b # any 0+ chars, as many as possible, then word boundary
( # Group 1 start:
BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)? # Various words
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)? # you need to keep
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY # here
|TRAIL$|CIR$ # and here
) # Group 1 end
\b # Word boundary
.* # Rest of the string.
See the regex demo.
Here, (?x) is a free-spacing/comment/verbose modifier that allows formatting whitespace and comments inside the pattern. (?s) is a DOTALL modifier that lets . match any char including a newline (it is necessary as it is a PCRE pattern, pay attention to perl=TRUE).
The "\\1" replacement inserts the value in Group 1 back into the result, so the whole matched string is replaced by just the captured word.
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
|TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat
Output:
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
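As a rough sketch of the same idea, the alternation can also be assembled from a character vector instead of being typed by hand. The names keep_words, pattern, addresses and features are only illustrative (they are not part of the answer above), and the word list is shortened to a few entries.

```r
# Hypothetical, shortened word list; extend it with the full set as needed
keep_words <- c("BLVD", "STREET", "ST", "ROAD", "TRAIL")

# Join the words into one alternation inside a capturing group
pattern <- paste0("(?s).*\\b(", paste(keep_words, collapse = "|"), ")\\b.*")

addresses <- c("EAST SS BLVD", "SOUTH AA STREET", "XX EAST ST",
               "ZZ NORTH ROAD", "WEST TR TRAIL")

# Replace the whole string with the captured word (Group 1)
features <- sub(pattern, "\\1", addresses, perl = TRUE)
print(features)
```

Note that an address containing none of the listed words does not match the pattern at all, so sub returns it unchanged rather than as an empty string.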
You could do it like this:
#R version 3.3.2
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat
http://rextester.com/GGYN78288
https://regex101.com/r/6RcXTi/1
I guess technically, this is more exact:
"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"
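To see why the second form is more exact, here is a small sketch with a shortened keyword list. The names loose, strict and addr, and the sample address, are made up for illustration. Without the trailing \\b inside the lookahead, any word that merely starts with a keyword (STANLEY starts with ST) is protected from removal; with it, only whole-word keywords are kept.

```r
# Hypothetical sample address: STANLEY begins with the keyword ST
addr <- "STANLEY ROAD"

# Lookahead without a closing word boundary (shortened keyword list)
loose <- gsub("\\b(?!ST(?:REET)?|ROAD).+?\\b", "", addr, perl = TRUE)

# Lookahead that requires the keyword to end at a word boundary
strict <- gsub("\\b(?!(?:ST(?:REET)?|ROAD)\\b).+?\\b", "", addr, perl = TRUE)

print(loose)   # STANLEY survives here because it starts with ST
print(strict)  # only the whole-word keyword is kept
```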