What is the best way to remove non-ASCII characters from a text corpus when using Quanteda in R?

Question

I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. As a result, my corpus contains non-ASCII characters such as U+00F8.

I am using Quanteda and I have imported my text using this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?

Answer 1

Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the encoding to ASCII and replaces any character that cannot be converted (anything outside the 0-127 ASCII range) with an empty string, which removes it from the text.
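To see the effect without loading the whole corpus, here is a minimal sketch in base R on a plain character vector. The sample strings and the `is_ascii` helper are made up for illustration and are not part of Quanteda. The exact handling of `sub` can vary with the platform's iconv implementation, so it is worth checking the result on your own machine.

```r
# Hypothetical sample texts; "\u00f8" is U+00F8, the character from the question.
sample_texts <- c("K\u00f8benhavn is a city", "plain ASCII text")

# Illustrative helper (an assumption, not a Quanteda function):
# TRUE when a string contains no character outside the 1-127 range.
is_ascii <- function(x) !grepl("[^\\x01-\\x7F]", x, perl = TRUE)

# Which texts contain non-ASCII characters before cleaning?
is_ascii(sample_texts)

# Same call as in the answer: characters that cannot be converted
# to ASCII are replaced with sub = "", i.e. dropped.
cleaned <- iconv(sample_texts, from = "UTF-8", to = "ASCII", sub = "")

# After cleaning, every text should pass the check.
is_ascii(cleaned)
```

Running the same check on `texts(EUCorpus)` before and after the conversion is a quick way to confirm that all 166 documents were cleaned.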

4 2016-07-04 12:31:13
