
How to index and gsub a string within a dataframe using regex in R

I am working on a text-cleaning pipeline in which I want to apply a list of target words and their corresponding replacement words, stored in a dataframe, to a given string (e.g., goats):

goats <- c("goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake.")

I am using a for loop to run down the list of targets and gsub each one with its corresponding replacement in the specified text (goats). I want the substitution to catch only exact word matches (e.g., banana but not bananasplit). Here's the loop:

goatclean <- goats
for (i in seq_along(swap$target)) {
    goatclean <- gsub(swap$target[i], swap$replace[i], goatclean)
}
print(goatclean)

The output of this loop is: "goats like mary maryauce. goats like linda lindaplits. goats like jane janecake."

I cannot figure out how to make gsub replace 'apples' from the dataframe only when it appears as an isolated word. I get errors when I add \s+ to the pattern:

gsub(\\s+(swap$target[i])\\s+, swap$replace[i], goatclean)

Any advice on how to get the following output? "goats like mary applesauce. goats like linda bananasplits. goats like jane cheesecake."

Thanks everyone!

Try using word boundaries ( \\b ) around the pattern:

for (i in seq_along(swap$target)) {
  goatclean <- gsub(paste0('\\b', swap$target[i], '\\b'), swap$replace[i], goatclean)
}
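
For a self-contained run, here is a minimal sketch of the same loop. The question does not show the swap dataframe, so the one below is an assumption reconstructed from the outputs quoted above (apples, bananas, cheese mapped to mary, linda, jane).

```r
# Input string from the question.
goats <- c("goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake.")

# ASSUMPTION: the original swap dataframe is not shown; this one is
# inferred from the outputs quoted in the question.
swap <- data.frame(
  target  = c("apples", "bananas", "cheese"),
  replace = c("mary", "linda", "jane"),
  stringsAsFactors = FALSE
)

goatclean <- goats
for (i in seq_along(swap$target)) {
  # Build the pattern as one string: \\b is a zero-width word boundary,
  # so "apples" matches as a whole word but not inside "applesauce".
  pattern <- paste0("\\b", swap$target[i], "\\b")
  goatclean <- gsub(pattern, swap$replace[i], goatclean)
}

print(goatclean)
# [1] "goats like mary applesauce. goats like linda bananasplits. goats like jane cheesecake."
```

The pattern has to be passed to gsub as a single character string, which is why it is assembled with paste0. Unlike \\s+, the \\b anchor does not consume the surrounding whitespace, so the spaces around each replaced word are preserved and a word followed by punctuation still matches.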


 