Removing Custom Words From Text Variables in R

I have a data set that looks like the following:

dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))

> dat
  ID         ADDRESS
1  1    EAST SS BLVD
2  2 SOUTH AA STREET
3  3      XX EAST ST
4  4   ZZ NORTH ROAD
5  5   WEST TR TRAIL

I want to remove everything in the address that is not in my list of wanted words. I am using the following code, which is incorrect and does not work.

 dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
                |(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
                |(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
                |(TRAIL$)|(CIR$)]","",dat$ADDRESS)

> dat
  ID         ADDRESS        FEATURE
1  1    EAST SS BLVD    AST SS BLVD
2  2 SOUTH AA STREET OUTH AA STREET
3  3      XX EAST ST     XX EAST ST
4  4   ZZ NORTH ROAD  ZZ NORTH ROAD
5  5   WEST TR TRAIL   EST TR TRAIL

The output I want is:

> dat1
  ID         ADDRESS FEATURE
1  1    EAST SS BLVD    BLVD
2  2 SOUTH AA STREET  STREET
3  3      XX EAST ST      ST
4  4   ZZ NORTH ROAD    ROAD
5  5   WEST TR TRAIL   TRAIL

I am not great at regex. Any help is appreciated, and any references for regex in R would be helpful.

You may use:

(?xs).*\b        # any 0+ chars, as many as possible, then word boundary
 (               # Group 1 start:
   BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?      # Various words
   |SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?    # you need to keep
   |PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY               # here
   |TRAIL$|CIR$                                        # and here
 )               # Group 1 end
 \b              # Word boundary
 .*              # Rest of the string.

See the regex demo.

Here, (?x) is a free-spacing (verbose) modifier that allows formatting whitespace and comments inside the pattern. (?s) is a DOTALL modifier that lets . match any character, including a newline (this is necessary because it is a PCRE pattern; note the perl=TRUE argument).
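
As a rough illustration of what (?x) changes, the sketch below runs the same small pattern with and without the modifier. The two-word alternation and the sample address are only stand-ins for the full list above.

```r
# Hypothetical two-word pattern, used only to show the effect of (?x).
verbose <- "(?x) .* \\b ( ROAD | ST ) \\b .*   # spaces and this comment are ignored"
plain   <- ".* \\b ( ROAD | ST ) \\b .*"       # same text, but spaces are literal here

x <- "ZZ NORTH ROAD"
with_x    <- sub(verbose, "\\1", x, perl = TRUE)  # pattern matches, keeps group 1
without_x <- sub(plain,   "\\1", x, perl = TRUE)  # no match, input is returned as is

print(with_x)
print(without_x)
```

Without (?x), the spaces around \b become part of the pattern, so it can no longer match and the address comes back untouched.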

The "\\1" replacement inserts the value captured in Group 1 back into the resulting string.
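
To see the backreference on its own, here is a minimal sketch with a shortened, hypothetical word list. It also shows one thing worth knowing about this approach: when none of the words occur, nothing matches and the input comes back unchanged.

```r
# Shortened word list, for illustration only.
pat <- ".*\\b(ROAD|ST(?:REET)?)\\b.*"

kept      <- sub(pat, "\\1", "SOUTH AA STREET", perl = TRUE)  # whole string -> group 1
unmatched <- sub(pat, "\\1", "12 MAIN", perl = TRUE)          # no listed word present

print(kept)
print(unmatched)
```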

See the R demo:

dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
                |SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
                |PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
                |TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat

Output:

  ID         ADDRESS FEATURE
1  1    EAST SS BLVD    BLVD
2  2 SOUTH AA STREET  STREET
3  3      XX EAST ST      ST
4  4   ZZ NORTH ROAD    ROAD
5  5   WEST TR TRAIL   TRAIL
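
If the word list changes often, one option is to keep it in a character vector and assemble the pattern from it. This is only a sketch: the suffixes vector and the NA fallback for addresses with no listed word are assumptions, not part of the answer above, and unlike that pattern it does not anchor TRAIL and CIR to the end of the string.

```r
# Assumed list of words to keep; extend as needed.
suffixes <- c("BLVD", "BOULEVARD", "DR", "DRIVE", "RD", "ROAD", "PL", "PLACE",
              "SL", "CIRCLE", "CIR", "CT", "COURT", "WY", "WAY", "ST", "STREET",
              "AVE", "AVENUE", "PKWY", "PARKWAY", "LN", "LANE", "HWY",
              "HIGHWAY", "TRAIL")

# Same shape as the pattern above: anything, one listed word, anything.
pattern <- paste0(".*\\b(", paste(suffixes, collapse = "|"), ")\\b.*")

dat <- data.frame(ID = 1:5,
                  ADDRESS = c("EAST SS BLVD", "SOUTH AA STREET", "XX EAST ST",
                              "ZZ NORTH ROAD", "WEST TR TRAIL"))

# Use NA where no listed word occurs, instead of echoing the whole address.
has_word <- grepl(pattern, dat$ADDRESS, perl = TRUE)
dat$FEATURE <- ifelse(has_word, sub(pattern, "\\1", dat$ADDRESS, perl = TRUE), NA)
dat
```

Because each word is followed by \b, the order of the entries does not matter: ST cannot match inside STREET, so the longer word is still picked up.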

You could do it like this:

#R version 3.3.2 

dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat

http://rextester.com/GGYN78288

https://regex101.com/r/6RcXTi/1


I guess, technically, this is more exact:

"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"

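The difference between the two lookahead versions shows up when a word merely starts with one of the listed words. A small sketch, using the made-up address WEST STONE RD (not from the question's data):

```r
words <- "AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)"

loose  <- paste0("\\b(?!", words, ").+?\\b")          # first version
strict <- paste0("\\b(?!(?:", words, ")\\b).+?\\b")   # version with the extra \b

x <- "WEST STONE RD"  # hypothetical input: STONE begins with ST
loose_out  <- gsub(loose,  "", x, perl = TRUE)
strict_out <- gsub(strict, "", x, perl = TRUE)

print(loose_out)
print(strict_out)
```

With the first pattern, the lookahead sees ST at the start of STONE and refuses to remove the word, so STONE is left in the result. The second pattern only protects whole words, so STONE is removed and just RD remains.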