What is the best way to remove non-ASCII characters from a text corpus when using Quanteda in R?
I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. As a result, my corpus contains non-ASCII characters such as U+00F8 (ø).
I am using Quanteda and I have imported my text using this code:
EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")
My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?
Try:
texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")
This converts the encoding to ASCII and replaces any character that cannot be translated (anything outside the 0-127 ASCII range) with an empty string, which effectively deletes it.
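To see what this does before running it on all 166 documents, here is a minimal sketch on a small character vector. The sample strings and the variable names (`x`, `ascii_only`, `had_non_ascii`) are made up for illustration and are not from the corpus above. It uses base R only, and the exact substitution behaviour of `iconv` can vary somewhat between platforms.

```r
# Hypothetical sample text: "\u00f8" is U+00F8 and "\u00e9" is U+00E9.
x <- c("K\u00f8benhavn is the capital", "caf\u00e9 culture", "plain ASCII")

# Flag the strings that contain at least one character outside the ASCII range.
had_non_ascii <- grepl("[^\x01-\x7F]", x, perl = TRUE)

# Same call as in the answer: convert to ASCII, substituting "" for
# anything that cannot be represented.
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(had_non_ascii)
print(ascii_only)
```

Note that the offending characters are dropped rather than mapped to a close ASCII letter, so a word that contained U+00F8 ends up one letter shorter. If that matters for later tokenization, it may be worth checking `had_non_ascii` first to see which documents are affected.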