Removing Custom Words From Text Variables in R
I have a dataset that looks like this:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
> dat
ID ADDRESS
1 1 EAST SS BLVD
2 2 SOUTH AA STREET
3 3 XX EAST ST
4 4 ZZ NORTH ROAD
5 5 WEST TR TRAIL
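I want to strip everything from each address except the words on a list that I specify. I am using the code below, which is incorrect and does not work.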
dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
|(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
|(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
|(TRAIL$)|(CIR$)]","",dat$ADDRESS)
> dat
ID ADDRESS FEATURE
1 1 EAST SS BLVD AST SS BLVD
2 2 SOUTH AA STREET OUTH AA STREET
3 3 XX EAST ST XX EAST ST
4 4 ZZ NORTH ROAD ZZ NORTH ROAD
5 5 WEST TR TRAIL EST TR TRAIL
The output I want is:
> dat1
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
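I am not very good with regular expressions, so any help is appreciated. Any references on regular expressions in R would also be helpful.
You can use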
(?xs).*\b # any 0+ chars, as many as possible, then word boundary
( # Group 1 start:
BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)? # Various words
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)? # you need to keep
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY # here
|TRAIL$|CIR$ # and here
) # Group 1 end
\b # Word boundary
.* # Rest of the string.
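Here, (?x) is the free-spacing (verbose) modifier, which lets you lay out the pattern with whitespace and inline comments. (?s) is the DOTALL modifier, which lets . match any character, including a newline; it is needed because this is a PCRE pattern (note the perl=TRUE argument).
The "\\1" replacement puts the value captured in Group 1 back into the result string.
See the R demo: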
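As a small illustration of the two points above, the sketch below runs a cut-down pattern with and without (?s). The input string and the single keyword ST are made up for the example and are not part of the original data.

```r
# Illustrative input (not from the original data): the keyword sits after a newline.
x <- "12 MAIN\nST"

# With (?s), .* can cross the newline, so the whole string collapses to Group 1.
with_dotall <- sub("(?s).*\\b(ST)\\b.*", "\\1", x, perl = TRUE)

# Without (?s), .* stops at the newline, so the first line is left in place.
without_dotall <- sub(".*\\b(ST)\\b.*", "\\1", x, perl = TRUE)

print(with_dotall)
print(without_dotall)
```

The sample addresses in the question contain no newlines, so (?s) only matters if the real data does.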
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
|TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat
Output:
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
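If the keyword list is likely to change, one possible variation is to keep the words in a character vector and build the alternation with paste(). This is only a sketch of the same idea: the names keep, pattern, addresses and feature are introduced here for illustration, and the words are written out in full rather than compressed with optional groups.

```r
# Hypothetical keyword vector: the street-type words to keep.
keep <- c("BLVD", "BOULEVARD", "DRIVE", "DR", "ROAD", "RD", "PL", "PLACE",
          "SL", "CIRCLE", "CIR", "CT", "COURT", "WY", "WAY", "ST", "STREET",
          "AVE", "AVENUE", "PKWY", "PARKWAY", "LN", "LANE", "HWY", "HIGHWAY",
          "TRAIL")

# Join the words into one alternation, wrapped in word boundaries.
pattern <- paste0("(?s).*\\b(", paste(keep, collapse = "|"), ")\\b.*")

addresses <- c("EAST SS BLVD", "SOUTH AA STREET", "XX EAST ST",
               "ZZ NORTH ROAD", "WEST TR TRAIL")

# Replace the whole string with the captured keyword.
feature <- sub(pattern, "\\1", addresses, perl = TRUE)
print(feature)
```

Note that an address containing none of the keywords is returned unchanged by sub(), so it may be worth checking for matches with grepl() first if that case can occur.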
You can do it like this:
#R version 3.3.2
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat
http://rextester.com/GGYN78288
https://regex101.com/r/6RcXTi/1
Technically, I think this is more accurate:
"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"
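To see what the extra \\b inside the lookahead changes, here is a rough comparison of the two lookahead patterns on a made-up address. The keyword list is shortened to two entries and the input "STONE HILL ST" is hypothetical; neither comes from the original post.

```r
# Shortened keyword list and sample address, both made up for illustration.
kw <- "ST(?:REET)?|ROAD"
x  <- "STONE HILL ST"

# Lookahead without a trailing \\b: "STONE" begins with "ST", so it is protected too.
loose <- gsub(paste0("\\b(?!", kw, ").+?\\b"), "", x, perl = TRUE)

# Lookahead with a trailing \\b: only whole-word keywords are protected.
strict <- gsub(paste0("\\b(?!(?:", kw, ")\\b).+?\\b"), "", x, perl = TRUE)

print(loose)   # "STONEST"
print(strict)  # "ST"
```

The first form keeps any word that merely starts with a keyword, while the second keeps only words that are exactly a keyword.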