
Extracting certain word(s) after a specific pattern, while excluding specified patterns, in R

Using R, I want to extract the building, plaza or mansion names. Each name comes directly before the word that says whether it is a building, mansion or plaza. Here is an example:

addresses<-c("big fake plaza, 12 this street,district, city", 
"Green mansion, district, city", 
 "Block 7 of orange building  district, city",
"98 main street block a blue plaza, city",
 "blue red mansion, 46 pearl street, city")            

What I want to get is

"big fake" "Green" "orange" "blue" "blue red"

The code I am currently using is

str_extract(addresses, "[[a-z]]*\\s*[[a-z]+]*\\s*(?=(building|mansion|plaza))")

Sometimes the name is two words, sometimes one. However, because the format varies, an 'a' or 'of' sometimes gets extracted as well. How do I keep extracting the two-word building names but exclude the 'a' or 'of'?

Thanks in advance.

I can't really come up with a solution that handles all of it in one regex.

Here's a two-step process.

  1. Extract one or two words before (building|mansion|plaza).
  2. Remove (on|of|a) from the extracted words.
vals <- stringr::str_match(addresses, "(\\w+?\\s?\\w+)\\s(building|mansion|plaza)")[, 2]
trimws(gsub('\\b(on|of|a)\\b', '', vals))

#[1] "big fake" "Green"    "orange"   "blue"     "blue red"
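The two steps can also be wrapped in a small helper so that the keyword and filler lists are easy to change. This is only a minimal sketch in base R rather than stringr; the function name `extract_name` and its `keywords` and `fillers` arguments are hypothetical, introduced here for illustration.

```r
# Sample data from the question
addresses <- c("big fake plaza, 12 this street,district, city",
               "Green mansion, district, city",
               "Block 7 of orange building  district, city",
               "98 main street block a blue plaza, city",
               "blue red mansion, 46 pearl street, city")

# Hypothetical helper: step 1 captures one or two words before a keyword,
# step 2 strips filler words from the captured text.
extract_name <- function(x,
                         keywords = c("building", "mansion", "plaza"),
                         fillers  = c("on", "of", "a")) {
  kw_re   <- paste(keywords, collapse = "|")
  pattern <- paste0("(\\w+?\\s?\\w+)\\s(?:", kw_re, ")")

  # regexec() returns the full match plus capture groups for each element
  m    <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vals <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                 character(1))

  # Remove fillers only when they appear as whole words
  filler_re <- paste0("\\b(?:", paste(fillers, collapse = "|"), ")\\b")
  trimws(gsub(filler_re, "", vals, perl = TRUE))
}

extract_name(addresses)
```

Elements with no keyword come back as `NA`, which makes unmatched addresses easy to spot.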

One option is to optionally match a first word, using a negative lookahead to rule out the words that are not accepted.

\b(?:(?!of|a)[a-zA-Z]+\s+)?[a-zA-Z]+\b(?=\s+(?:building|mansion|plaza)\b)

The pattern matches:

  • \b A word boundary
  • (?: Non-capturing group
    • (?!of|a) Negative lookahead, assert that what is directly to the right is not of or a
    • [a-zA-Z]+\s+ If the assertion is true, match 1+ chars a-zA-Z followed by 1+ whitespace chars
  • )? Close the group and make it optional
  • [a-zA-Z]+\b Match 1+ chars a-zA-Z followed by a word boundary
  • (?= Positive lookahead, assert that what is to the right is
    • \s+ Match 1+ whitespace chars
    • (?:building|mansion|plaza)\b Match one of the alternatives followed by a word boundary
  • ) Close the lookahead
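One caveat with the breakdown above: (?!of|a) rejects any first word that merely starts with "of" or "a", not only those exact words. The sketch below compares the pattern with a variant that adds a word boundary inside the lookahead. The variant is a suggestion and not part of the original answer; the names `loose` and `strict` and the sample address are made up for illustration, and base R with `perl = TRUE` is used instead of stringr.

```r
# Pattern from the answer: the lookahead blocks every word starting with "of" or "a"
loose  <- "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"

# Suggested variant: the \b inside the lookahead blocks only the whole words "of" and "a"
strict <- "\\b(?:(?!(?:of|a)\\b)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)"

# Made-up address whose name starts with the letter "a"
x <- "amber rose plaza, city"

loose_hit  <- regmatches(x, regexpr(loose,  x, perl = TRUE))  # drops "amber"
strict_hit <- regmatches(x, regexpr(strict, x, perl = TRUE))  # keeps "amber rose"

# The variant still skips the standalone filler words
filler_hit <- regmatches("Block 7 of orange building", 
                         regexpr(strict, "Block 7 of orange building", perl = TRUE))
```

Whether this matters depends on the data; for the five sample addresses both patterns give the same result.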


addresses<-c("big fake plaza, 12 this street,district, city", 
"Green mansion, district, city", 
 "Block 7 of orange building  district, city",
"98 main street block a blue plaza, city",
 "blue red mansion, 46 pearl street, city")
 
library(stringr)  # provides str_extract()

str_extract(addresses, "\\b(?:(?!of|a)[a-zA-Z]+\\s+)?[a-zA-Z]+\\b(?=\\s+(?:building|mansion|plaza)\\b)")

Output

[1] "big fake" "Green"    "orange"   "blue"     "blue red"

Note that [[a-z]]* should be written with single brackets, [a-z]*, if you want to optionally repeat the range a-z in the character class, and [[a-z]+]* should be [a-z]+ if you want to repeat the range 1+ times.
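Fixing the brackets alone does not remove the filler words, which is why the lookahead is still needed. As a rough check, the sketch below runs the question's pattern with single brackets on the two problem addresses, using base R with `perl = TRUE` in place of stringr; the variable names are illustrative only.

```r
# The two addresses from the question that contain a filler word
x <- c("Block 7 of orange building  district, city",
       "98 main street block a blue plaza, city")

# The question's pattern, rewritten with single-bracket character classes
fixed <- "[a-z]*\\s*[a-z]+\\s*(?=(?:building|mansion|plaza))"

# Extract the first match per address and drop the trailing space
hits <- trimws(regmatches(x, regexpr(fixed, x, perl = TRUE)))
hits  # the fillers "of" and "a" are still part of the matches
```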
