Replace a list of words with one unique word in R

Question

I am working on a text analysis in R and have a dataset (text corpus) with various sentences about different fruits, for example "apple", "banana", "orange", "pear", etc.

Since it is not relevant for the analysis whether someone writes about "apples" or "bananas", I want to replace all the different fruits with one specific word, for example "allfruits".

I thought about using regex, but I am facing two issues:

1) I want to avoid a separate line of code for each kind of fruit. Is there a way to define a list or a vector so that the function replaces all words in it (apple, bananas, pear, etc.) with one specific word, "allfruits"?

2) I want to avoid replacing words that are NOT a fruit but contain the same string as a fruit (e.g. the word "appletini").

Example: If I have a sentence that says "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!", I want the following result: "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

I am not sure whether it is possible to write this with the gsub function, so all help is much appreciated.

Thank you!

Answer 1

The vector allfruits can be extended to contain any words to be replaced:

allfruits = c("apple", "banana" , "orange", "pear")
replacement = "allfruits"
text = "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"

gsub(paste0("\\b(", paste0(allfruits, collapse="|"), ")[s]?\\b"), replacement, text, ignore.case = TRUE)

Returns

[1] "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

The regex:

\\b - word boundary
(", paste0(allfruits, collapse="|"), ") - all fruit names separated by | (or), wrapped in a group
[s]? - optional letter 's'
\\b - word boundary
ignore.case = TRUE - ignore case
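
Since gsub is vectorised over its input, the same pattern can be applied to a whole corpus at once. The following is only a minimal sketch: the helper name replace_words and the extra sentences are made up for illustration and are not part of the answer, and it assumes the words contain plain letters only (no regex metacharacters).

```r
# Hypothetical helper (not from the answer): wraps the same pattern in a function.
replace_words <- function(text, words, replacement) {
  # Assumption: `words` holds plain letters only, so no escaping is needed.
  pattern <- paste0("\\b(", paste0(words, collapse = "|"), ")s?\\b")
  gsub(pattern, replacement, text, ignore.case = TRUE)
}

fruits <- c("apple", "banana", "orange", "pear")

# Illustrative sentences; a real corpus would be a character vector or column.
sentences <- c(
  "Apple is my favourite fruit, appletini is my favourite drink.",
  "I also like bananas!",
  "Pears and oranges are fine, pearls are not fruit."
)

result <- replace_words(sentences, fruits, "allfruits")
print(result)
```

Here "appletini" and "pearls" are left untouched because no word boundary follows "apple" or "pear" inside those words.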

Answer 2

text <- "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"
gsub("(\\bapples?\\b)|(\\bbananas?\\b)", "allfruits", text, ignore.case = TRUE)

\\b means a word boundary, that is, the start or end of a word (next to punctuation, a space, or the start or end of the string)
| means OR
() defines a group
s? means an optional trailing s
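
To see what \\b does on each side of a word, a quick check with grepl can help. This is just an illustrative sketch; the test words are made up and not taken from the question.

```r
# Pattern for one fruit, with a word boundary on both sides.
pattern <- "\\bapples?\\b"

# Illustrative test words (assumed for this sketch).
words <- c("apple", "Apples", "appletini", "pineapple", "apple-pie")

# TRUE where the pattern matches a whole word inside the string.
matches <- grepl(pattern, words, ignore.case = TRUE)
print(setNames(matches, words))
# "apple", "Apples" and "apple-pie" match; "appletini" and "pineapple" do not.
```

Note that a hyphen counts as a non-word character, so "apple-pie" still matches. Whether that is desirable depends on the corpus.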

Answer 1: score 1, accepted, 2020-06-04 20:36:10

Answer 2: score 0, 2020-06-04 20:33:23
