
Unable to read non-ASCII characters from CSV file

I am trying to read a CSV file that contains one sentence per line. Each sentence may contain foreign words, such as Chinese characters. I want to remove or ignore those foreign characters and return only the English (ASCII) words.

Example of what a string in the file may look like:

'小心 Careful'

Desired output: Careful

import csv
from string import ascii_letters

import nltk

ASCII_LETTERS = set(ascii_letters)

def remove_non_ascii(string):
    # Keep only the tokens that contain at least one ASCII letter.
    tokens = nltk.word_tokenize(string)
    return [word for word in tokens if any(letter in ASCII_LETTERS for letter in word)]

# job_file and output_file are path variables defined elsewhere.
with open(job_file, mode='r', encoding='utf8') as infile:
    line_reader = csv.reader(infile)
    for row in line_reader:
        new_line = remove_non_ascii(row[1])
        print(new_line)
        if row[1]:
            with open(output_file, 'a', newline='', encoding='utf8') as outfile:
                line_writer = csv.writer(outfile)
                line_writer.writerow(new_line)

This is the error I get when I run that code:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2848: invalid start byte

The error goes away if I change the encoding from utf8 to cp1252, but then the Chinese characters are converted into '????'. Is it possible to remove those unwanted characters and return only ASCII-compliant characters?

If you are only interested in the ASCII part of your input file, you can use

open(job_file, mode = 'r', encoding = 'ascii', errors = 'ignore')

This should ignore every byte that cannot be decoded as ASCII, so only the ASCII-compliant characters are returned. The Python docs for open() list more options for the errors parameter that you might want to look at.
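To illustrate the suggestion, here is a minimal, self-contained sketch. The sample rows, the temporary file name jobs.csv, and the strip() call are assumptions made for the demonstration and are not part of the original question. It writes a small UTF-8 CSV, then reads it back with encoding='ascii' and errors='ignore'.

```python
import csv
import os
import tempfile

# Hypothetical sample data: a two-column CSV whose second column mixes
# Chinese and English text, written as UTF-8.
rows = [["1", "小心 Careful"], ["2", "Hello 世界"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    job_file = os.path.join(tmp_dir, "jobs.csv")
    with open(job_file, "w", newline="", encoding="utf8") as outfile:
        csv.writer(outfile).writerows(rows)

    # Decode as ASCII and silently drop every byte that is not ASCII.
    with open(job_file, mode="r", newline="", encoding="ascii", errors="ignore") as infile:
        # strip() removes the space left behind where the Chinese text was dropped.
        cleaned = [row[1].strip() for row in csv.reader(infile)]

print(cleaned)
```

Note that errors='ignore' discards data without any warning, so it is best suited to cases where the non-ASCII content is genuinely not needed.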
