Read a CSV file and validate a column based on UTF-8 characters

I have to read a CSV file that contains the columns PersonName, age, and address, and I have to validate PersonName. The requirement is: "PersonName may only contain UTF-8 characters."

I am using Python 3.x, so I can't use the decode method after opening the file.

Please tell me how to open and read the file so that any row whose PersonName contains non-UTF-8 data is ignored and I can move on to the next line for validation.

Assuming the rest of the file requires no checking or is legal UTF-8 (which includes ASCII data), you can open the file with encoding='utf-8' and errors='replace'. This changes any bytes that are invalid in UTF-8 into the Unicode replacement character, '\ufffd'. Alternatively, to preserve the data, you can use 'surrogateescape' as the errors handler, which maps each undecodable byte to a lone surrogate code point (U+DC80 to U+DCFF) so that the original value can be restored later. You can then check for those as you go:

import csv

with open(csvname, encoding='utf-8', errors='replace', newline='') as f:
    for PersonName, age, address in csv.reader(f):
        # U+FFFD marks bytes that could not be decoded as UTF-8
        if '\ufffd' in PersonName:
            continue
        # PersonName was decoded without errors, so process the row
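To make the difference between the two error handlers concrete, here is a minimal sketch that decodes a hand-made byte string. The sample bytes are an assumption chosen for illustration: a lone 0xE9 byte (the Latin-1 encoding of 'é') is not a valid UTF-8 sequence.

```python
# Hypothetical sample: "Jos" followed by a lone 0xE9 byte (Latin-1 'é'),
# which cannot be decoded as UTF-8.
raw = b'Jos\xe9'

replaced = raw.decode('utf-8', errors='replace')
escaped = raw.decode('utf-8', errors='surrogateescape')

print(ascii(replaced))  # 'Jos\ufffd'
print(ascii(escaped))   # 'Jos\udce9'

# Only the surrogateescape form can be turned back into the original bytes.
restored = escaped.encode('utf-8', errors='surrogateescape')
print(restored == raw)  # True
```

With 'replace' the original byte is lost, so it only suits the case where bad rows are discarded; 'surrogateescape' keeps enough information to write the row back unchanged.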

Or, with surrogateescape, you can ensure that any non-UTF-8 data in the other fields (if that's possible) is restored on write:

import csv

with open(incsvname, encoding='utf-8', errors='surrogateescape', newline='') as inf,\
     open(outcsvname, 'w', encoding='utf-8', errors='surrogateescape', newline='') as outf:
    csvout = csv.writer(outf)
    for PersonName, age, address in csv.reader(inf):
        try:
            # Check for surrogate escapes, and reject PersonNames containing them
            # Most efficient way to do so is a test encode; surrogates will fail
            # to encode with default error handler
            PersonName.encode('utf-8')
        except UnicodeEncodeError:
            continue  # Had non-UTF-8, skip this row

        # PersonName was decoded without surrogate escapes, so process the row

        # You can recover the original file bytes in your code for a field with:
        #     fieldname.encode('utf-8', errors='surrogateescape')
        # Or if you're just passing data to a new file, write the same strings
        # back to a file opened with the same encoding/errors handling; the surrogates
        # will be restored to their original values:
        csvout.writerow([PersonName, age, address])
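
If only the filtering is needed, the same test-encode check could be wrapped in a small generator. This is a sketch under assumptions: the function name valid_name_rows, the choice to treat the first column as PersonName, and the sample file contents are all hypothetical and not part of the original answer.

```python
import csv
import os
import tempfile


def valid_name_rows(path):
    """Yield rows whose first field (PersonName) decoded as clean UTF-8."""
    with open(path, encoding='utf-8', errors='surrogateescape', newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            try:
                # Lone surrogates left by surrogateescape cannot be encoded
                # with the default (strict) error handler.
                row[0].encode('utf-8')
            except UnicodeEncodeError:
                continue  # PersonName had non-UTF-8 bytes, skip this row
            yield row


# Usage sketch: a throwaway file with one valid and one invalid name.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'Ana,30,Lisbon\r\nJos\xe9,41,Madrid\r\n')

rows = list(valid_name_rows(tmp.name))
os.remove(tmp.name)
print(rows)  # [['Ana', '30', 'Lisbon']]
```

Indexing the row instead of unpacking three names means a row with an unexpected number of columns does not raise ValueError; whether such rows should be accepted is a separate decision.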
