Replace words in a corpus according to a dictionary data frame

Question

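I would like to replace all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column holds the word to match and the second column holds the replacement word.

I am stuck on the translate function. I saw this answer, but I could not turn it into a function that can be passed to tm_map.

Please consider the following MWE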

library(tm)

docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))

dictionary <- data.frame(word = c('first', 'second', 'text'),
                      translation = c('primo', 'secondo', 'testo'))

translate <- function(text, dictionary) {
  # Would like to replace each word of text with corresponding word in dictionary
}

corp_translated <- tm_map (corp, translate)

inspect(corp_translated)

# Expected result

# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# [[1]]
# primo testo

# [[2]]
# secondo testo

Answer 1

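I would advise against using a data.frame for the dictionary, because R's basic object (the vector) already works as a dictionary once its elements are named.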

      dict  <- c('primo', 'secondo', 'testo')
names(dict) <- c('first', 'second', 'text')
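If the dictionary already exists as a two-column data frame, as in the question, it can be converted to a named vector in one step. This is a minimal sketch; the `as.character()` calls are a precaution, on the assumption that the columns may have been created as factors (the default before R 4.0).

```r
# Dictionary data frame as given in the question.
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"))

# Named character vector: names are the words to match, values the replacements.
# as.character() guards against factor columns.
dict <- setNames(as.character(dictionary$translation),
                 as.character(dictionary$word))
```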

Then, to "translate" x, where x might be "second", you simply use:

   dict[[x]]

You don't even need a wrapper function.

If you want to translate in the opposite direction, use
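As a small illustration of the lookup above: `[[` returns a single bare value, while `[` is vectorized and can look up several words at once. The variable names `single`, `several`, and `missing_word` are only for this sketch.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

x <- "second"
single <- dict[[x]]                          # one word, returned without its name

several <- unname(dict[c("second", "text")]) # several words in one call

# With single brackets, a word that is not in the dictionary yields NA;
# dict[["unknown"]] would raise an error instead.
missing_word <- dict["unknown"]
```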

   names(dict)[dict %in% x]

Or you can flip the dictionary

         dict.flip  <- names(dict)
   names(dict.flip) <- dict
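To connect the named-vector dictionary back to the question, here is one possible sketch of the `translate` function. It assumes that words are separated by single spaces and that words missing from the dictionary should be left unchanged; neither assumption comes from the original post. The tm part is guarded so that the core function can be tried without tm installed, and it uses `VCorpus` with `content_transformer`, which is how a plain character function is usually passed to `tm_map` in recent tm versions.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

# Translate each space-separated word of every string in `text`.
# Words that are not in the dictionary are kept as they are.
translate <- function(text, dict) {
  vapply(strsplit(text, " ", fixed = TRUE), function(words) {
    hit <- words %in% names(dict)
    words[hit] <- dict[words[hit]]
    paste(words, collapse = " ")
  }, character(1))
}

translated <- translate(c("first text", "second text"), dict)
print(translated)

# Applying the same function to a corpus, if tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(c("first text", "second text")))
  corp_translated <- tm::tm_map(corp, tm::content_transformer(translate),
                                dict = dict)
}
```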

Answer 2

You can use stri_replace_all_fixed from the stringi package in combination with the tm_map function of the tm package. For example:

library(tm)
library(stringi)

docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))

word <- c('first', 'second', 'text')
tran <- c('primo', 'secondo', 'testo')

corp <- tm_map(corp, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
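One thing to keep in mind is that a fixed replacement matches substrings, so "text" inside "context" would also be replaced. If only whole words should be translated, a possible variant is `stri_replace_all_regex` with `\b` word boundaries around each pattern. This is a sketch that assumes the dictionary words contain no regex metacharacters; the `whole_words` helper name is hypothetical.

```r
library(stringi)

word <- c("first", "second", "text")
tran <- c("primo", "secondo", "testo")

# Wrap each word in \b so that only whole words are matched.
patterns <- paste0("\\b", word, "\\b")

whole_words <- function(x) {
  stri_replace_all_regex(x, patterns, tran, vectorize_all = FALSE)
}

fixed_result <- stri_replace_all_fixed("context first text", word, tran,
                                       vectorize_all = FALSE)
regex_result <- whole_words("context first text")
```

Here `fixed_result` also rewrites the "text" inside "context", while `regex_result` leaves "context" untouched. The `whole_words` function can be passed to tm_map in the same way as the anonymous function above.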

Answer 1
3 accepted 2013-12-14 06:28:24

Answer 2
3 2015-09-12 11:42:09