Replacing some characters in a text column in R

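I have a dataset with a text column that contains ordinary text plus a term that starts with two letters, such as sa, followed by digits. The letters can be any letter from a to z, in lowercase or uppercase. A snapshot of the data is as follows: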

library(dplyr)  # provides %>% and select()

df_new <- data.frame(
  given_info = c('SA12 is given', 'he has his sa12',
                 'she will get Sr15', 'why not having an ra31',
                 'his tA23 is missing', 'pa12 is given'))

df_new %>% select(given_info)

              given_info
1          SA12 is given
2        he has his sa12
3      she will get Sr15
4 why not having an ra31
5    his tA23 is missing
6          pa12 is given

I need to replace any term that consists of sa (or any other combination of two letters) followed by two digits with the term document. The result of interest is therefore:

                  given_info
1          document is given
2        he has his document
3      she will get document
4 why not having an document
5    his document is missing
6          document is given

Thanks a lot for your help in advance!

We can use gsub() here as follows:

df_new$given_info <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", df_new$given_info)
df_new

                  given_info
1          document is given
2        he has his document
3      she will get document
4 why not having an document
5    his document is missing
6          document is given

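The regex pattern used here says to match:

  • \b a word boundary (what precedes is not a word character)
  • [A-Za-z]{2} any 2 letters, lowercase or uppercase
  • \d{2} exactly 2 digits
  • \b another word boundary (what follows the digits is not a word character)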
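
The parts of the pattern listed above can be checked one at a time. Below is a minimal sketch in base R; the test strings are made up for illustration and are not part of the question's data.

```r
# Pattern taken from the gsub() call in the answer
pattern <- "\\b[A-Za-z]{2}\\d{2}\\b"

# Hypothetical test strings, each probing one part of the pattern
tests <- c("sa12",     # two letters + two digits
           "SA12",     # uppercase letters are allowed too
           "s12",      # only one letter
           "sa1",      # only one digit
           "sa123",    # a third digit, so no boundary after "12"
           "x sa12.")  # surrounded by a space and punctuation

# grepl() reports whether the pattern matches anywhere in each string
matched <- grepl(pattern, tests)
print(data.frame(tests, matched))
```

Only the first, second, and last strings should be reported as matches; the other three fail on the letter count, the digit count, or the trailing word boundary.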

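For example, the word boundaries ensure that abc12 in the text is not replaced with document. Without the word boundaries you would also get substring matches (here bc12 inside abc12), which is probably not what you want.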
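
To see the effect of the word boundaries, the two variants of the pattern can be run side by side. This is a small sketch with invented input strings (the vector x and the variable names are assumptions for illustration only):

```r
# Hypothetical inputs: one clean term and two that contain extra characters
x <- c("abc12 is given", "he has his sa12", "code sa123")

# Pattern from the answer, with word boundaries
with_boundaries <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", x)

# Same pattern with the boundaries removed
without_boundaries <- gsub("[A-Za-z]{2}\\d{2}", "document", x)

print(data.frame(x, with_boundaries, without_boundaries))
```

With the boundaries, only "he has his sa12" changes. Without them, "abc12" becomes "adocument" and "sa123" becomes "document3", because the pattern is free to match inside a longer token.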
