Extract a substring if it has an exact match in another vector

Update: the first version of this question implicitly asked how to extract a substring if it has ANY match in another vector, for which @Colonel Beauvel provided an elegant response:

This does the trick, base R:

newname = sapply(nametitle, function(u){
  bool = sapply(name, function(x) grepl(x, u))
  if(any(bool)) name[bool][1] else NA
})

newname
# John Smith, MD PhD       Jane Doe, JD
#             "John"             "Jane"

However, I did not realize that I was actually asking for a way to find exact matches until the contributed function failed for some elements of my vector. Therefore, the following is my revised question.


Say I have the following character vector of generic names and their academic degrees:

nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")

And I have a "look-up" vector of first names:

name <- c("John", "Jane", "Mark", "Steve")

What I want to do is search each element of nametitle, and if part of the element (i.e., a substring of the string) is an exact match of an element from name, then in a new vector newname, replace that element of nametitle with the matching element of name. If there is no exact match, keep the original value from nametitle.

Therefore, I'd expect the proper function to return newname with the three elements below:

[1] "John" [2] "Jane" [3] "John-Paul Jones, MS"

I've attempted the following using the function contributed above:

newname = sapply(nametitle, function(u){
  bool = sapply(name, function(x) grepl(x, u))
  if(any(bool)) name[bool][1] else NA })

This works fine for the elements "John Smith, MD PhD" and "Jane Doe, JD", but not for "John-Paul Jones, MS": that element is replaced with "John" in the new vector newname.
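The likely cause is that grepl() matches a pattern anywhere in the string, so "John" is also found inside "John-Paul". The short check below is only a sketch to illustrate this; the variable names partial and boundary are made up for the example. It also shows that a word boundary (\\b) would not help, since "-" is a non-word character.

```r
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")

# grepl() reports a match anywhere in the string, including inside "John-Paul"
partial <- grepl("John", nametitle)

# \\b does not rule out "John-Paul": "-" is a non-word character, so there is
# a word boundary between "John" and "-Paul"
boundary <- grepl("\\bJohn\\b", nametitle)

print(partial)
print(boundary)
```

Both calls flag the first and third elements, which is why the delimiter has to be whitespace (or the start or end of the string) rather than a generic word boundary.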

There may be a simple change to the original function contributed by @Colonel Beauvel that resolves this issue, but the nested sapply calls are throwing me for a loop (pun intended?). Thanks.

This does the trick, base R:

newname = sapply(nametitle, function(u){
    bool = sapply(name, function(x) grepl(x, u))
    if(any(bool)) name[bool][1] else NA
})

#>newname
#John Smith, MD PhD       Jane Doe, JD 
#            "John"             "Jane" 
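For the revised question (exact matches only, keeping the original string otherwise), one possible adaptation of the nested sapply approach is sketched below. It is not part of the original answer. The helper has_exact is a hypothetical name, and the sketch assumes that a name counts as an exact match only when it is delimited by whitespace or the start or end of the string, and that the look-up names contain no regex metacharacters.

```r
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")
name <- c("John", "Jane", "Mark", "Steve")

# Hypothetical helper: TRUE if `x` occurs in `u` delimited by whitespace or
# the start/end of the string (so "John" does not match inside "John-Paul")
has_exact <- function(x, u) {
  grepl(paste0("(?<=\\s|^)", x, "(?=\\s|$)"), u, perl = TRUE)
}

newname <- sapply(nametitle, function(u) {
  bool <- sapply(name, has_exact, u = u)
  # Fall back to the original string instead of NA when nothing matches
  if (any(bool)) name[bool][1] else u
}, USE.NAMES = FALSE)

print(newname)
```

Here USE.NAMES = FALSE drops the element names that sapply would otherwise attach, and perl = TRUE is required because the lookarounds are a PCRE feature.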

Here's an easy way. First, create a regex pattern based on your name vector:

pattern <- paste0(".*(?<=\\s|^)(", paste(name, collapse = "|"), ")(?=\\s|$).*")
# [1] ".*(?<=\\s|^)(John|Jane|Mark|Steve)(?=\\s|$).*"

With this pattern, a single sub call does the trick (perl = TRUE is needed for the lookarounds):

sub(pattern, "\\1", nametitle, perl = TRUE)
# [1] "John"                "Jane"                "John-Paul Jones, MS"
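As a regex-free alternative, the same idea can be sketched by splitting each string on whitespace and comparing whole tokens against the look-up vector. This is only an illustration under the same assumption as above (a name must be a whitespace-delimited token); the helper first_exact is a hypothetical name, not something from the answers.

```r
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")
name <- c("John", "Jane", "Mark", "Steve")

# Split each string on runs of whitespace into tokens
tokens <- strsplit(nametitle, "\\s+")

# Hypothetical helper: first token that is exactly equal to a look-up name,
# or the original string if none of the tokens match
first_exact <- function(tok, original) {
  hit <- tok[tok %in% name]
  if (length(hit) > 0) hit[1] else original
}

# Apply the helper to each (tokens, original string) pair
newname <- unname(mapply(first_exact, tokens, nametitle))

print(newname)
```

Because the comparison uses %in% rather than a pattern, names containing regex metacharacters need no escaping. One difference to keep in mind: if a string contains several look-up names, this sketch returns the first one, whereas the greedy leading .* in the sub pattern would capture the last.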
