
R Extract a word from a character string using pattern matching

I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix from a long character string. The word I want to extract always starts with the same prefix (AA), but its length varies and it does not occur at the same position in the string.

mytext1 <- "HORSE MONKEY LIZARD AA12345 SWORDFISH" # Want to return AA12345

mytext2 <- "ELEPHANT AA100 KOALA POLAR.BEAR" # Want to return AA100

mytext3 <- "CROCODILE DRAGON.FLY ANTELOPE" # Want to return NA

As an extension of this, what if there were two different patterns to match and I wanted to return a character string containing both?

mytext4 <- "TULIP AA999 DAISY BB123"
# Pattern matching to AA and BB
# Want to return AA999 BB123

Any help with this would be greatly appreciated :)

Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), then as few characters as possible, .*?, up to the next space or the end of the string, (?=$| ). Note that you can combine all the strings into a vector, and a vector will be returned. If you want all matches for each string, use str_extract_all instead of str_extract and you get a list with one vector per string. To match more than one prefix, use alternation inside a group, (AA|BB), as shown.

mytext <- c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH", # Want to return AA12345
  "ELEPHANT AA100 KOALA POLAR.BEAR", # Want to return AA100
  "AA3273 ELEPHANT KOALA POLAR.BEAR", # Want to return AA3273
  "ELEPHANT KOALA POLAR.BEAR AA5785", # Want to return AA5785
  "ELEPHANT KOALA POLAR.BEAR", # Want to return nothing
  "ELEPHANT AA12345 KOALA POLAR.BEAR AA5785" # Can return only AA12345 or both
)

library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100"   "AA3273"  "AA5785"  NA        "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#> 
#> [[2]]
#> [1] "AA100"
#> 
#> [[3]]
#> [1] "AA3273"
#> 
#> [[4]]
#> [1] "AA5785"
#> 
#> [[5]]
#> character(0)
#> 
#> [[6]]
#> [1] "AA12345" "AA5785"

"TULIP AA999 DAISY BB123" %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"

Created on 2018-04-29 by the reprex package (v0.2.0).

You can get a base R solution using sub:

> sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
> sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"
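
Two caveats apply to the sub call above: sub returns its input unchanged when the pattern does not match, so mytext3 would come back whole rather than as NA, and the greedy leading .* picks the last AA word when there are several. A minimal sketch that addresses both, assuming the first match is the one wanted; the helper name extract_first and its prefix argument are ours, not part of base R:

```r
# Hypothetical helper: first word starting with `prefix`, NA if there is none.
extract_first <- function(x, prefix = "AA") {
  word <- paste0("\\b", prefix, "\\w*")
  # Record which elements contain a match at all.
  has_match <- grepl(word, x, perl = TRUE)
  # A lazy leading .*? keeps the first match instead of the last.
  out <- sub(paste0(".*?(", word, ").*"), "\\1", x, perl = TRUE)
  # sub() leaves non-matching strings untouched, so blank them out explicitly.
  out[!has_match] <- NA_character_
  out
}

extract_first(c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "ELEPHANT AA100 KOALA POLAR.BEAR",
  "CROCODILE DRAGON.FLY ANTELOPE"
))
#> [1] "AA12345" "AA100"   NA
```

Unlike the space-based lookarounds in the stringr answer, \\b also treats punctuation as a boundary, so a word such as POLAR.AA1 would yield AA1 here.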

I like keeping things in base R whenever possible, and there is already a solution for this. What you are really looking for is the regmatches() function. From its help page (?regmatches):

Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.

To solve your specific problem:

matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl = TRUE)
regmatches(mytext1, matches)
#> [1] "AA12345"

When there is no match:

matches <- regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl = TRUE)
regmatches(mytext3, matches)
#> character(0)

If you want to avoid character(0), put your strings in a vector and run them all at once. Elements without a match are dropped from the result, so it can be shorter than the input.

alltext <- c(mytext1, mytext2, mytext3)
matches <- regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE)
regmatches(alltext, matches)
#> [1] "AA12345" "AA100"
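
If the output should stay aligned with the input, with NA where nothing matched (as the question asks for mytext3), the matches can be written back into a pre-filled vector. A minimal sketch of that idea; the variable name first_match is ours, and the inputs are copied from the question so the snippet runs on its own:

```r
# Inputs copied from the question so the snippet is self-contained.
alltext <- c(
  "HORSE MONKEY LIZARD AA12345 SWORDFISH",
  "ELEPHANT AA100 KOALA POLAR.BEAR",
  "CROCODILE DRAGON.FLY ANTELOPE"
)

# regexpr() returns -1 for elements that do not match.
m <- regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE)

# Start with all NA, then fill in only the positions that matched.
first_match <- rep(NA_character_, length(alltext))
first_match[m != -1] <- regmatches(alltext, m)
first_match
#> [1] "AA12345" "AA100"   NA
```

This relies on regmatches() returning the matched substrings in input order, one per matching element.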

And finally, if you want a one-liner:

regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl = TRUE))
#> [1] "AA12345" "AA100"
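
The extension in the question (matching both AA and BB and returning a single string) can be approached the same way by switching from regexpr to gregexpr, which finds every match, and pasting the pieces together. This is a sketch under the assumption that the matches should be joined with a single space and that a string without matches should give NA; the names pattern, hits and combined are ours:

```r
# One string with both prefixes, one with neither.
mytext4 <- c("TULIP AA999 DAISY BB123", "CROCODILE DRAGON.FLY ANTELOPE")

# Alternation in a group matches either prefix.
pattern <- "(?<=^| )(AA|BB).*?(?=$| )"

# gregexpr() finds all matches; regmatches() then returns a list,
# with character(0) for elements that have no match.
hits <- regmatches(mytext4, gregexpr(pattern, mytext4, perl = TRUE))

# Collapse each element's matches into one space-separated string.
combined <- vapply(
  hits,
  function(h) if (length(h) > 0) paste(h, collapse = " ") else NA_character_,
  character(1)
)
combined
#> [1] "AA999 BB123" NA
```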
