Remove non-ASCII from a corpus

Question

I'm using NLTK for my project. However, if a non-ASCII word like '•' exists, NLTK cannot tokenize it. I'm using nltk.word_tokenize as the tokenizer. How do I remove such words from the entire corpus, or make the tokenizer aware of them?

Answer 1

Use the code below to remove non-ASCII characters from your corpus:

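# Python 3 version; assumes nonascii.txt is UTF-8 encoded.
# The output file must be opened in 'w' (write) mode.
with open("nonascii.txt", "r", encoding="utf-8") as ip, \
        open("ascii.txt", "w", encoding="ascii") as op:
    for line in ip:
        # "ignore" silently drops every character outside the ASCII range
        line = line.strip().encode("ascii", "ignore").decode("ascii")
        if line == "":
            continue  # skip lines that are empty after cleaning
        op.write(line + "\n")  # restore the newline removed by strip()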
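
If rewriting the corpus file is not convenient, the same idea can be applied in memory just before tokenizing. The following is a minimal sketch, not part of the original answer: the helper names `strip_non_ascii` and `ascii_tokens` are hypothetical, and `str.split()` stands in for `nltk.word_tokenize` so that the example runs without NLTK or its tokenizer data installed.

```python
def strip_non_ascii(text: str) -> str:
    """Return text with every non-ASCII character dropped."""
    return text.encode("ascii", "ignore").decode("ascii")


def ascii_tokens(tokens):
    """Keep only the tokens made up entirely of ASCII characters."""
    # str.isascii() requires Python 3.7 or later
    return [tok for tok in tokens if tok.isascii()]


sample = "NLTK • tokenizes café text"

# Option 1: clean the raw text first, then tokenize.
# (str.split() is a stand-in for nltk.word_tokenize here.)
cleaned = strip_non_ascii(sample).split()

# Option 2: tokenize first, then drop any token containing non-ASCII characters.
filtered = ascii_tokens(sample.split())
```

The two options behave differently: cleaning the text first keeps the ASCII remainder of a word (here "café" becomes "caf"), while filtering tokens drops the whole word. Which one is appropriate depends on the corpus.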

Answer 1: 5, accepted, 2014-11-04 07:32:28
