R gsub remove word variation ONLY at end of string

I have the following vector:

a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO", 
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

and need to remove all occurrences of "SANTANDER" or its abbreviation (along with a preceding NORTE or its abbreviation, if present), but only when they appear at the end of the string.

So far I've tried the following (the comments note why each attempt fails):

gsub("(.*)( N.*DER$)", "\\1", a)       # Fails at SOCORRO
gsub("(.*)( N.*DER$| DER$)", "\\1", a) # Only removes DER at LOS PATIOS
gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", a) # Removes trailing R (??)
gsub("(.*)( N?.*DER$)", "\\1", a)  # Fails removing " NTE DE S" and "NORTE DE"

So, in particular, I'd like to know how to remove the unwanted part of the string properly. More generally, I'd like to know the right way to build regexes for this kind of situation. (My first draft of the question was "how to use OR ( | ) inside a group"; I seriously expected attempts 2 or 3 to work.)

Expected result is:

a
## [1] "SOCORRO"  "SANTANDER DE QUILICHAO"  "LOS PATIOS"  "LOS PATIOS"
sub('(\\s*\\b(NORTE\\s+DE|NTE\\s+DE))?\\s*\\b(SANTANDER|S\\s+DER)$','',a);
## [1] "SOCORRO"  "SANTANDER DE QUILICHAO"  "LOS PATIOS"  "LOS PATIOS"
  • We don't need gsub(), since we don't need to match multiple times within the same string.
  • A bracket expression matches only a single character, so it's not appropriate for this regex.
  • The dollar character is only special outside of a bracket expression; inside one it is a literal "$".
  • You seem to have tried matching both the abbreviation and the full-length word with the same regex piece. I would advise against this; they are conceptually different pieces. If a word and its abbreviation happen to share a suffix, that's coincidental, and you shouldn't build a regex around that fact. Hence I think alternations are most appropriate here.
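
To make the bracket-expression points concrete, here is a small sketch (the variable names are mine, purely for illustration). It contrasts a bracket expression with a parenthesised group, shows that "$" is literal inside brackets, and reruns attempt 3 from the question on a single string.

```r
# A bracket expression matches exactly ONE character from the listed set.
one_char <- grepl("^[DER]$", c("D", "E", "DER"))

# A parenthesised group matches the whole sequence of characters.
whole_word <- grepl("^(DER)$", c("D", "E", "DER"))

# Inside brackets, "$" is a literal dollar sign, not an end-of-string anchor.
literal_dollar <- grepl("[$]", c("US$", "USD"))

# Attempt 3 therefore strips one trailing character taken from the set
# {space, N, ., *, D, E, R, $}; here that is just the final "R".
attempt3 <- gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", "SOCORRO SANTANDER")

print(one_char)
print(whole_word)
print(literal_dollar)
print(attempt3)
```

This is why attempt 3 only ever removed a trailing "R": the greedy (.*) takes everything up to the last character that belongs to the set, and the bracket expression consumes just that one character.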

We can try

sub("(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$", "\\1", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"     

Or

sub("\\s+(N(\\S+\\s+){1,}|)\\S*DER$", "", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"  
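
On the more general question of how to test regexes like these, one option is to keep the expected output next to the input and compare each candidate pattern against it. The sketch below does that for the three patterns posted above; the list name `candidates` and its labels are only illustrative.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")
expected <- c("SOCORRO", "SANTANDER DE QUILICHAO", "LOS PATIOS", "LOS PATIOS")

# Patterns and replacements copied from the answers above.
candidates <- list(
  alternation = list(
    pattern = "(\\s*\\b(NORTE\\s+DE|NTE\\s+DE))?\\s*\\b(SANTANDER|S\\s+DER)$",
    replacement = ""
  ),
  capture = list(
    pattern = "(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$",
    replacement = "\\1"
  ),
  generic = list(
    pattern = "\\s+(N(\\S+\\s+){1,}|)\\S*DER$",
    replacement = ""
  )
)

# TRUE for every candidate whose result equals the expected vector.
passes <- vapply(candidates, function(cand) {
  identical(sub(cand$pattern, cand$replacement, a), expected)
}, logical(1))

print(passes)
```

Adding further strings to `a` and `expected` (for example, a name that merely contains "SANTANDER" in the middle) is a cheap way to see where a looser pattern such as the last one starts to over-match.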
