How to detect foreign words in a corpus?

Question

Suppose I am parsing an English corpus with the tm package and then performing the usual cleaning steps.

library(tm)
data("crude")
corpus <- crude  # crude is already a tm corpus (VCorpus), so no Source wrapper is needed

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, PlainTextDocument)

# text matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)

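How can I identify words written in a language different from that of the corpus? A similar problem is addressed for Python here, but my research did not produce any interesting results.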

Answer 1

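This is not a complete solution, but I feel it might help. I recently had to do something similar: I had to remove words from a corpus that contained Chinese characters. I ended up using a custom transformation with a regular expression to remove anything containing a character outside a-z, A-Z and 0-9.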

corpus <- tm_map(corpus, content_transformer(function(s){
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)
}))

For example, if there is a Chinese word in the text, it gets removed.

gsub(pattern = '[^a-zA-Z0-9\\s]+',
     x = 'English 象形字 Chinese',
     replacement = "",
     ignore.case = TRUE,
     perl = TRUE)

Output: "English  Chinese"

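It gets trickier if you are trying to remove words from a language like Spanish, because some letters carry accents while others do not. For example, the following does not work completely (only words that contain a non-ASCII letter between ASCII letters are caught), but maybe it is a start.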

gsub(pattern = '[a-zA-Z0-9]+[^a-zA-Z0-9\\s]+[a-zA-Z0-9]+',
     x = 'El jalapeño es caliente',
     replacement = "",
     ignore.case = TRUE,
     perl = TRUE)

Output: "El  es caliente"
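
As a possible variation on the same idea, the text can be split on whitespace and every token that contains at least one non-ASCII character dropped. This is only a sketch in base R: the helper name drop_non_ascii_tokens is hypothetical, and it assumes UTF-8 input. Like the regular expression above, it cannot catch foreign words spelled entirely with ASCII letters (such as "es" or "caliente").

```r
# Hypothetical helper (not part of tm): drop whitespace-delimited tokens
# that contain any non-ASCII character. Assumes UTF-8 encoded input.
drop_non_ascii_tokens <- function(s) {
  vapply(s, function(line) {
    tokens <- strsplit(line, "\\s+", perl = TRUE)[[1]]
    tokens <- tokens[nzchar(tokens)]  # discard empty tokens from leading whitespace
    # A token is kept only if all of its code points are below 128 (ASCII)
    is_ascii <- vapply(tokens,
                       function(tok) all(utf8ToInt(tok) < 128L),
                       logical(1))
    paste(tokens[is_ascii], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# \u00f1 is the letter n with a tilde, written as an escape to stay encoding-safe
drop_non_ascii_tokens("El jalape\u00f1o es caliente")
# [1] "El es caliente"
```

Since the function is vectorised over lines, it could presumably be plugged into the pipeline with tm_map(corpus, content_transformer(drop_non_ascii_tokens)), in the same way as the custom transformation above.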

Hope this helps!

Solution 1
1 accepted 2016-04-29 17:00:11
