Removing non-English text from a corpus in R using tm()

Question

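I am using tm() and wordcloud() for some basic data mining in R, but I am running into difficulties because my dataset contains non-English characters (even though I have tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this: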

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it is a warning, not an error, I continue to push forward.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This then yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to filtering out the non-English characters either in TextWrangler or in R, whichever is most expedient. Thanks for your help!
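One plausible reading of the error above is that some strings are declared as UTF-8 but contain bytes that are not valid UTF-8, which tolower() cannot process. The sketch below is an assumption about the cause, not a confirmed diagnosis: it builds a small example vector with Latin-1 bytes wrongly marked as UTF-8 and uses base R's validUTF8() (available from R 3.3.0) to find and drop the offending entries. The vector `x` and the name `clean` are made up for illustration.

```r
# Example data: the second element holds Latin-1 bytes (\xe7, \xe3),
# which do not form a valid UTF-8 sequence.
x <- c("Special", "satisfa\xe7\xe3o", "Happy")
Encoding(x) <- "UTF-8"   # wrongly declare the strings as UTF-8

# validUTF8() reports, per element, whether the bytes are valid UTF-8
ok <- validUTF8(x)
print(ok)

# Keep only the entries that are safe to pass on to tolower()
clean <- x[ok]
print(tolower(clean))
```

If the invalid entries should be repaired rather than dropped, re-reading the file with the encoding it was actually saved in is usually the more direct fix.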

Answer 1

Here is a way to remove words containing non-ASCII characters before creating the corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split = ", "))
# flag words with non-ASCII characters: iconv() replaces each byte it cannot
# convert with the marker string "dat2", which grepl() then finds
dat3 <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub = "dat2"))
# subset original vector of words to exclude words with non-ASCII characters
# (a logical index also works when no word is flagged)
dat4 <- dat2[!dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
library(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
Special, Happy, Sad, Potential
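The same filtering idea can be sketched without the iconv() marker trick, using a regular expression that matches any character outside the ASCII range. This is a variant under stated assumptions rather than the method from the answer above: `drop_non_ascii_words` is a hypothetical helper name, and the input is assumed to be a comma-separated string as in the example.

```r
# Hypothetical helper (not part of tm): drop every word that contains
# at least one character outside the ASCII range.
drop_non_ascii_words <- function(words) {
  # [^\x01-\x7F] matches any character that is not plain ASCII
  has_non_ascii <- grepl("[^\\x01-\\x7F]", words, perl = TRUE)
  words[!has_non_ascii]
}

# Same example text as above, written with \u escapes so the script
# itself stays ASCII-only
dat <- "Special,  satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"

# Split on commas and trim the surrounding whitespace from each word
words <- trimws(unlist(strsplit(dat, split = ",")))

kept <- drop_non_ascii_words(words)
print(paste(kept, collapse = ", "))
```

Because the index is logical, the helper returns the input unchanged when no word contains non-ASCII characters.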

Answer 2

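You can also use the stringi package. Note that it transliterates non-ASCII characters to their closest ASCII equivalents rather than removing the words.

Using the example above: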

library(stringi)
dat <- "Special,  satisfação, Happy, Sad, Potential, für"
stringi::stri_trans_general(dat, "latin-ascii")

Output:

[1] "Special,  satisfacao, Happy, Sad, Potential, fur"
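To apply the transliteration inside a tm pipeline instead of on a bare string, something like the following might work. It is a sketch that assumes tm 0.6 or later (where custom functions are wrapped in content_transformer()) and that both tm and stringi are installed; the names `to_ascii`, `corpus`, and `result` are made up for illustration.

```r
library(tm)
library(stringi)

# Same example text as above, written with \u escapes
dat <- "Special,  satisfa\u00e7\u00e3o, Happy, Sad, Potential, f\u00fcr"
corpus <- VCorpus(VectorSource(dat))

# Wrap the stringi call so tm_map() can apply it to each document
to_ascii <- content_transformer(function(x) {
  stri_trans_general(x, "Latin-ASCII")
})

corpus <- tm_map(corpus, to_ascii)
# With the text reduced to ASCII, lower-casing should no longer hit
# invalid multibyte input
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

result <- as.character(corpus[[1]])
print(result)
```

Running the transliteration before tolower() is the point of the ordering here: the later steps then only ever see ASCII text.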

Answer 1
9, accepted, 2013-08-09 19:59:50

Answer 2
0, 2019-02-15 03:21:24