R gsub remove word variation ONLY at end of string
I have the following vector:
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
"LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")
and need to remove all occurrences of "SANTANDER" or its abbreviation (and a preceding NORTE or its abbreviation, if present), but only when they occur at the end of the string.
So far I've tried (the comments say why each attempt fails):
gsub("(.*)( N.*DER$)", "\\1", a) # Fails at SOCORRO
gsub("(.*)( N.*DER$| DER$)", "\\1", a) # Only removes DER at LOS PATIOS
gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", a) # Removes trailing R (??)
gsub("(.*)( N?.*DER$)", "\\1", a) # Fails removing " NTE DE S" and "NORTE DE"
So, in particular, I'd like to know how to properly remove the unwanted part of the string, but more generally I'd like to know the right way to write regexes for this kind of situation (my first wording was "how to use OR ( | ) inside a group"; I seriously expected attempts 2 or 3 to work).
Expected result is:
a
## [1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS" "LOS PATIOS"
sub('(\\s*\\b(NORTE\\s+DE|NTE\\s+DE))?\\s*\\b(SANTANDER|S\\s+DER)$','',a);
## [1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS" "LOS PATIOS"
This uses sub() rather than gsub(), since we don't need to match multiple times within the same string.
We can try
sub("(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$", "\\1", a)
#[1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS"
#[4] "LOS PATIOS"
Or
sub("\\s+(N(\\S+\\s+){1,}|)\\S*DER$", "", a)
#[1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS"
#[4] "LOS PATIOS"
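On the more general question about using OR inside a group, here is a minimal sketch (not taken from either answer) of why attempts 2 and 3 behave the way they do, and of one way to assemble the pattern from readable pieces. The names `prefix`, `name`, `pattern` and the `*_result` variables are made up for illustration, and the pattern is a slightly simplified form of the first answer's, so treat it as an assumption to check against your own data.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

# Square brackets create a character class (one character out of a set),
# not a group, so "[ DER$]" matches a single " ", "D", "E", "R" or "$".
class_result <- sub("[ DER$]", "", "S DER")   # drops only the first space
group_result <- sub("( DER$)", "", "S DER")   # drops the whole " DER"

# A leading greedy (.*) takes as much as it can, which leaves the
# alternation with the shortest tail that still matches (here " DER").
greedy_result <- sub("(.*)( N.*DER$| DER$)", "\\1", a[4])

# Hypothetical names: build the pattern from pieces and drop the leading
# (.*) so that nothing competes with the optional prefix.
prefix  <- "(\\s+(NORTE|NTE)\\s+DE)?"     # optional "NORTE DE" / "NTE DE"
name    <- "\\s+(SANTANDER|S\\s+DER)$"    # name or abbreviation at end of string
pattern <- paste0(prefix, name)
cleaned <- sub(pattern, "", a)

print(class_result)
print(group_result)
print(greedy_result)
print(cleaned)
```

With the vector from the question, `cleaned` should match the expected result shown there. Anchoring with `$` and leaving out the leading `(.*)` also means no backreference is needed in the replacement.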