Using R, I want to extract the building, plaza or mansion names. The name always comes directly before the word that marks it as a building, mansion or plaza. Here is an example:
addresses<-c("big fake plaza, 12 this street,district, city",
"Green mansion, district, city",
"Block 7 of orange building district, city",
"98 main street block a blue plaza, city",
"blue red mansion, 46 pearl street, city")
What I want to get is
"big fake" "Green" "orange" "blue" "blue red"
The code I am currently using is
str_extract(addresses, "[[a-z]]*\\s*[[a-z]+]*\\s*(?=(building|mansion|plaza))")
Sometimes the name is two words, sometimes one. However, because the format varies, an 'a' or 'of' sometimes gets extracted as well. How can I keep extracting the two-word building names but exclude the 'a' or 'of'?
Thanks in advance
I can't really come up with a solution that handles all of it in one regex.
Here's a two-step process:
1. Extract up to two words before (building|mansion|plaza).
2. Remove (on|of|a) from the result.
vals <- stringr::str_match(addresses, "(\\w+?\\s?\\w+)\\s(building|mansion|plaza)")[, 2]
trimws(gsub('\\b(on|of|a)\\b', '', vals))
#[1] "big fake" "Green" "orange" "blue" "blue red"
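The two steps above can also be wrapped in a small function so that the keyword list and the words to drop are easy to change. This is only a sketch: the function name extract_name_two_step and its arguments are made up for illustration and are not part of stringr.

```r
# Hypothetical helper wrapping the two-step approach; the function and
# argument names are illustrative, not an existing API.
extract_name_two_step <- function(addresses,
                                  keywords  = c("building", "mansion", "plaza"),
                                  stopwords = c("on", "of", "a")) {
  keyword_alt  <- paste(keywords, collapse = "|")
  stopword_alt <- paste(stopwords, collapse = "|")

  # Step 1: capture one or two words directly before a keyword.
  pattern <- paste0("(\\w+?\\s?\\w+)\\s(", keyword_alt, ")")
  vals <- stringr::str_match(addresses, pattern)[, 2]

  # Step 2: remove stopwords that were captured as whole words, then trim.
  trimws(gsub(paste0("\\b(", stopword_alt, ")\\b"), "", vals))
}

addresses <- c("big fake plaza, 12 this street,district, city",
               "Green mansion, district, city",
               "Block 7 of orange building district, city",
               "98 main street block a blue plaza, city",
               "blue red mansion, 46 pearl street, city")

result <- extract_name_two_step(addresses)
result
#[1] "big fake" "Green" "orange" "blue" "blue red"
```

An address that contains none of the keywords comes back as NA, because str_match returns NA when there is no match.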
One option is to optionally match a first word, ruling out some of the words that are not accepted using a negative lookahead.
\b(?:(?!of|a)[a-zA-Z]+\s+)?[a-zA-Z]+\b(?=\s+(?:building|mansion|plaza)\b)
The pattern matches:
  \b                             A word boundary
  (?:                            Non-capture group
  (?!of|a)                       Negative lookahead, assert not "of" or "a" directly to the right
  [a-zA-Z]+\s+                   If the assertion is true, match 1+ times a char a-zA-Z followed by 1+ whitespace chars
  )?                             Close group and make it optional
  [a-zA-Z]+\b                    Match 1+ times a char a-zA-Z and a word boundary
  (?=                            Positive lookahead, assert what is on the right is
  \s+                            Match 1+ whitespace chars
  (?:building|mansion|plaza)\b   Match one of the alternatives
  )                              Close lookahead
addresses<-c("big fake plaza, 12 this street,district, city",
"Green mansion, district, city",
"Block 7 of orange building district, city",
"98 main street block a blue plaza, city",
"blue red mansion, 46 pearl street, city")
stringr::str_extract(addresses, "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)")
Output
[1] "big fake" "Green" "orange" "blue" "blue red"
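One caveat with the pattern above: (?!of|a) rejects any first word that merely starts with "of" or "a", not only those two words themselves. If that matters for your data, a possible variant puts a word boundary inside the lookahead, (?!(?:of|a)\b). The sketch below compares the two; the address "amber court plaza, city" is invented for illustration and is not from the question.

```r
library(stringr)

# Pattern from the answer: skips any first word that starts with "of" or "a".
pattern_prefix <- "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"

# Variant: the \\b inside the lookahead limits the exclusion to the
# whole words "of" and "a".
pattern_whole <- "\\b(?:(?!(?:of|a)\\b)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"

# Made-up address for illustration; not part of the question's data.
extra <- "amber court plaza, city"

res_prefix <- str_extract(extra, pattern_prefix)
res_whole  <- str_extract(extra, pattern_whole)

res_prefix
#[1] "court"
res_whole
#[1] "amber court"
```

On the five addresses from the question, both patterns return the same result.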
Note that [[a-z]]* should be written with single brackets, [a-z]*, if you want to optionally repeat the range a-z in the character class, and [[a-z]+]* should be [a-z]+ if you want to repeat the range 1+ times.