Replacing some characters in a text column in R
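I have a dataset with a text column that contains free text along with terms that start with two letters, such as sa, followed by two digits. The letters can be anything from a to z, in either lower or upper case. A snapshot of the data looks like this: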
library(dplyr)  # provides %>% and select()

df_new <- data.frame(
  given_info = c('SA12 is given', 'he has his sa12',
                 'she will get Sr15', 'why not having an ra31',
                 'his tA23 is missing', 'pa12 is given'))
df_new %>% select(given_info)
given_info
1 SA12 is given
2 he has his sa12
3 she will get Sr15
4 why not having an ra31
5 his tA23 is missing
6 pa12 is given
I need to replace any term made up of sa (or any other combination of two letters) followed by two digits with the term document. So the result I am after is:
given_info
1 document is given
2 he has his document
3 she will get document
4 why not having an document
5 his document is missing
6 document is given
Thanks a lot in advance for your help!
We can use gsub() here, as follows:
df_new$given_info <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", df_new$given_info)
df_new
given_info
1 document is given
2 he has his document
3 she will get document
4 why not having an document
5 his document is missing
6 document is given
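If the same replacement is needed in more than one place, it can be wrapped in a small function. The following is only a sketch: the helper name replace_codes is hypothetical and not part of the original answer, and it uses gsub()'s ignore.case argument in place of the explicit [A-Za-z] class.

```r
# Hypothetical helper (not from the original answer): replaces every
# standalone "two letters + two digits" token with `replacement`.
replace_codes <- function(x, replacement = "document") {
  # ignore.case = TRUE lets [a-z] match upper- and lower-case letters alike
  gsub("\\b[a-z]{2}\\d{2}\\b", replacement, x, ignore.case = TRUE)
}

# A shortened copy of the question's data, used here for illustration
df_new <- data.frame(
  given_info = c("SA12 is given", "he has his sa12", "she will get Sr15")
)
df_new$given_info <- replace_codes(df_new$given_info)
print(df_new)
```

Whether to spell out [A-Za-z] or rely on ignore.case is mostly a matter of taste; both should match the same tokens for plain ASCII letters.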
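The regex pattern used here matches:
\b           a word boundary (what precedes is not a word character)
[A-Za-z]{2}  any two letters
\d{2}        two digits
\b           another word boundary (what follows the digits is not a word character)
The word boundaries ensure that, for example, abc12 in the text is not replaced with document. Without the word boundaries you would also get substring matches, which is probably not what you want.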
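To see what the word boundaries change, here is a small comparison. The input strings are made up for illustration, and the second one deliberately contains abc12.

```r
# Sample strings: the second one embeds the pattern inside a longer token.
x <- c("his tA23 is missing", "code abc12 here")

# With word boundaries: only standalone two-letter + two-digit tokens match.
with_b <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", x)

# Without word boundaries: "bc12" inside "abc12" matches as well.
without_b <- gsub("[A-Za-z]{2}\\d{2}", "document", x)

print(with_b)     # "abc12" is left untouched
print(without_b)  # "abc12" becomes "adocument"
```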