Extracting certain word(s) before a specific pattern, while excluding specified patterns, in R

Question

Using R, I want to extract building, plaza, or mansion names. Each name comes immediately before the word "building", "mansion", or "plaza". Here is an example:

addresses<-c("big fake plaza, 12 this street,district, city", 
"Green mansion, district, city", 
 "Block 7 of orange building  district, city",
"98 main street block a blue plaza, city",
 "blue red mansion, 46 pearl street, city")

What I want to get is

"big fake" "Green" "orange" "blue" "blue red"

The code I am currently using is

str_extract(addresses, "[[a-z]]*\\s*[[a-z]+]*\\s*(?=(building|mansion|plaza))")

Sometimes the name is two words, sometimes one. However, because the format varies, an 'a' or 'of' sometimes gets extracted as well. How do I keep extracting the two-word building names but exclude the 'a' or 'of'?

Thanks in advance

Answer 1

I can't really come up with a solution that can handle all of it in one regex.

Here's a two-step process:

1. Extract one or two words before (building|mansion|plaza).
2. Remove (on|of|a) from the extracted words.

vals <- stringr::str_match(addresses, "(\\w+?\\s?\\w+)\\s(building|mansion|plaza)")[, 2]
trimws(gsub('\\b(on|of|a)\\b', '', vals))

#[1] "big fake" "Green"    "orange"   "blue"     "blue red"
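
The two steps above can also be wrapped in a small helper so the stop-word list is easy to change. This is only a sketch: extract_names and its stop_words argument are hypothetical names, not part of the answer above, and it uses base R's regexec() with perl = TRUE instead of stringr::str_match so it runs without extra packages.

```r
addresses <- c("big fake plaza, 12 this street,district, city",
               "Green mansion, district, city",
               "Block 7 of orange building  district, city",
               "98 main street block a blue plaza, city",
               "blue red mansion, 46 pearl street, city")

# Hypothetical helper: extract_names() and stop_words are illustrative names.
extract_names <- function(x, stop_words = c("on", "of", "a")) {
  # Step 1: capture one or two words directly before the keyword.
  m <- regmatches(x, regexec("(\\w+?\\s?\\w+)\\s(building|mansion|plaza)",
                             x, perl = TRUE))
  # Element 2 of each match is the first capture group; NA when nothing matched.
  vals <- vapply(m,
                 function(g) if (length(g) >= 2) g[2] else NA_character_,
                 character(1))

  # Step 2: drop the stop words as whole words, then trim leftover spaces.
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  trimws(gsub(stop_pattern, "", vals, perl = TRUE))
}

extract_names(addresses)
#[1] "big fake" "Green"    "orange"   "blue"     "blue red"
```

An address that contains none of the keywords comes back as NA rather than being dropped, so the result stays the same length as the input.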

Answer 2

One option is to optionally match a first word, ruling out some of the words that are not accepted using a negative lookahead.

\b(?:(?!of|a)[a-zA-Z]+\s+)?[a-zA-Z]+\b(?=\s+(?:building|mansion|plaza)\b)

The pattern matches:

\b A word boundary
(?: Non capture group
- (?!of|a) Negative lookahead, assert not of or a directly to the right
- [a-zA-Z]+\s+ If the assertion is true, match 1+ times a char a-zA-Z followed by 1+ whitespace chars
)? Close group and make it optional
[a-zA-Z]+\b Match 1+ times a char a-zA-Z and a word boundary
(?= Positive lookahead, assert what is on the right is
- \s+ Match 1+ whitespace chars
- (?:building|mansion|plaza)\b Match one of the alternatives
) Close lookahead

library(stringr)

addresses<-c("big fake plaza, 12 this street,district, city", 
"Green mansion, district, city", 
 "Block 7 of orange building  district, city",
"98 main street block a blue plaza, city",
 "blue red mansion, 46 pearl street, city")
 
str_extract(addresses, "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)")

Output

[1] "big fake" "Green"    "orange"   "blue"     "blue red"

Note that [[a-z]]* should be written with single brackets, [a-z]*, if you want to optionally repeat the range a-z in the character class, and [[a-z]+]* should be [a-z]+ if you want to repeat the range 1+ times.
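
One caveat worth checking: the lookahead (?!of|a) has no word boundary, so it also rejects a first word that merely starts with "of" or "a" (for example "office" or "apple"). A possible variant is (?!(?:of|a)\b), which only rejects the whole words. The sketch below compares the two on made-up addresses that are not from the question; extract_first is a hypothetical helper built on base R's regexpr() with perl = TRUE, and the same pattern strings should also be usable with str_extract.

```r
# Hypothetical helper: first match per element, NA when there is no match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[as.vector(m) != -1L] <- regmatches(x, m)
  out
}

# Made-up test addresses (an assumption, not taken from the question).
extra <- c("apple tree plaza, city",
           "office park building, city",
           "block a blue plaza, city")

# Pattern from the answer: the lookahead rejects any word starting with of/a.
original <- "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"
# Variant: the lookahead rejects only the whole words "of" and "a".
bounded  <- "\\b(?:(?!(?:of|a)\\b)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"

extract_first(extra, original)
#[1] "tree" "park" "blue"

extract_first(extra, bounded)
#[1] "apple tree"  "office park" "blue"
```

On the five addresses from the question both patterns give the same result, so the difference only matters when a building name can start with one of the excluded words.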

2 answers

solution1
1 2021-03-25 03:56:47

solution2
1 ACCEPTED 2021-03-25 08:25:56