
What is the best way to remove non-ASCII characters from a text corpus when using Quanteda in R?

I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. As a result, my corpus contains non-ASCII characters such as U+00F8.

I am using Quanteda and I have imported my text using this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?

Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the encoding to ASCII and replaces any non-translatable characters (those outside the 0-127 ASCII range) with an empty string, so they are simply dropped from the text.
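
To see what this does before applying it to a whole corpus, here is a minimal sketch in base R on a plain character vector. The `docs` vector is hypothetical sample data standing in for `texts(EUCorpus)`; it is not from the original question. The sketch relies on `iconv()` returning `NA` for a string it cannot convert when `sub` is not supplied, which gives a quick way to find the affected documents first.

```r
# Hypothetical sample texts standing in for texts(EUCorpus)
docs <- c("The K\u00f8benhavn summit", "Plain ASCII text")

# Without `sub`, iconv() returns NA for any string it cannot convert,
# which flags the documents that contain non-ASCII characters
has_non_ascii <- is.na(iconv(docs, from = "UTF-8", to = "ASCII"))

# With sub = "", the non-convertible bytes are dropped instead
cleaned <- iconv(docs, from = "UTF-8", to = "ASCII", sub = "")

print(has_non_ascii)
print(cleaned)
```

Note that the character is removed rather than transliterated, so a word containing U+00F8 ends up one letter shorter. If that matters for your analysis, it may be worth inspecting the flagged documents before overwriting the corpus texts.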
