
Replace exact string match with regexp in R

I have a vector of strings that needs cleaning. I have been able to clean most of it on my own, but I am having a problem with one thing.

Some strings have a sequence like '@56;' at the beginning (the numbers vary). So a string can be '@56;trousers' or '@897;trousers', and I would like to keep just 'trousers'.

I have written the following code:

gsub("[@[:digit:];]", "", 'mystring')   

but it fails in cases like:

gsub("[@[:digit:];]", "", '@34skirt') # returns 'skirt'

I would like it to return '@34skirt' in this case, because the ; is missing from the end of the prefix.

I want an exact match. Any ideas about how to do this? I have tried adding \\ and it does not work.

The [@[:digit:];] regex is a character class: it matches a single character that is either a @, a digit, or a ;. Thus, gsub removes each of those characters at any position in the string, as many times as it finds them.
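To make the character-class behaviour concrete, here is a minimal sketch. The input string is hypothetical (it is not from the question) and was chosen so that @, digits and ; also occur after the prefix:

```r
# Hypothetical input: '@', digits and ';' appear in several places.
x <- "@56;size;42@home"

# The bracket expression matches ONE character at a time, so gsub()
# deletes every '@', every digit and every ';' wherever it occurs.
class_result <- gsub("[@[:digit:];]", "", x)
class_result
## [1] "sizehome"
```

Nothing in the pattern says that the characters must appear together or in a particular order, which is why '@34skirt' loses its '@34' even without the ;.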

You may use a regex defining a sequence of characters to remove, not a character class:

@[0-9]+;


You can even tell the regex engine to remove the sequence only at the beginning of the string:

^@[0-9]+;

Sample demo:

sub("^@[0-9]+;", "", '@34skirt')     ## [1] "@34skirt"
sub("^@[0-9]+;", "", '@34;trousers') ## [1] "trousers"
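The effect of the ^ anchor can be seen by comparing the two patterns on a string where the sequence sits in the middle. This is only an illustrative sketch; the input below is a made-up value, not one of the strings from the question:

```r
# Hypothetical input with the '@digits;' sequence in the middle.
x <- "price@12;tag"

# Without the anchor, the first '@digits;' sequence is removed
# wherever it is found.
unanchored <- sub("@[0-9]+;", "", x)
unanchored
## [1] "pricetag"

# With '^', the sequence must start at the first character,
# so this string is returned unchanged.
anchored <- sub("^@[0-9]+;", "", x)
anchored
## [1] "price@12;tag"
```

Since the question only wants the prefix stripped, the anchored form is probably the safer choice.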

We can try

sub("@\\d+;", "", v1)
#[1] "mystring" "@34skirt" "trousers" "trousers"

Data:

v1 <- c('mystring', '@34skirt',  '@56;trousers', '@897;trousers') 
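As a possible way to double-check the result on the same data, grepl() can report which elements actually carry the prefix before sub() strips it. This is a sketch, not part of either answer; the names pattern, has_prefix and cleaned are introduced here for illustration, and the anchored pattern is used as an assumption about what is wanted:

```r
v1 <- c('mystring', '@34skirt', '@56;trousers', '@897;trousers')

# Anchored pattern: '@', one or more digits, then ';' at the start.
pattern <- "^@[0-9]+;"

# Which elements have the full prefix? '@34skirt' does not (no ';').
has_prefix <- grepl(pattern, v1)
has_prefix
## [1] FALSE FALSE  TRUE  TRUE

# sub() is vectorised, so the whole vector is cleaned in one call.
cleaned <- sub(pattern, "", v1)
cleaned
## [1] "mystring" "@34skirt" "trousers" "trousers"
```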
