Reading a CSV file and validating a column based on UTF-8 characters

Question

I have to read a CSV file with the columns (PersonName, age, address), and I have to validate PersonName. "PersonName may only contain UTF-8 characters."

I am using python3.x, so I cannot use the decode method after opening the file.

Please tell me how to open and read the file so that I can ignore any PersonName that is not valid UTF-8 and then move on to the next line for validation.

Answer 1

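Assuming the rest of the file either needs no checking or is valid UTF-8 (ASCII data included), you can open the file with encoding='utf-8' and errors='replace'. That changes every byte that is invalid in UTF-8 into the Unicode replacement character, '\ufffd'. Alternatively, to preserve the data, you can use 'surrogateescape' as the errors handler; it represents each undecodable byte as a lone surrogate code point (U+DC80 to U+DCFF) so that the original value can be restored later. You can then check for these whenever you like: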

import csv

with open(csvname, encoding='utf-8', errors='replace', newline='') as f:
    for PersonName, age, address in csv.reader(f):
        # Any byte that was invalid UTF-8 now shows up as U+FFFD
        if '\ufffd' in PersonName:
            continue
        # PersonName was decoded without errors, so process the row here
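To try the errors='replace' check without a file on disk, here is a minimal self-contained sketch. The helper name valid_rows and the sample bytes are made up for illustration; io.TextIOWrapper is used only because it accepts the same encoding, errors and newline options as open(). One caveat: a name that legitimately contains U+FFFD, correctly encoded, would also be skipped by this test.

```python
import csv
import io

def valid_rows(raw: bytes):
    """Yield rows whose first field decoded cleanly as UTF-8 (hypothetical helper)."""
    # Wrap the bytes the same way open(..., encoding='utf-8', errors='replace') would
    text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8',
                            errors='replace', newline='')
    for row in csv.reader(text):
        # Skip blank lines and rows whose name contains the replacement character
        if not row or '\ufffd' in row[0]:
            continue
        yield row

# Second row has a stray 0xFF byte in the name; third row is valid UTF-8 ("Zoë")
sample = (b'Alice,30,1 Main St\r\n'
          b'B\xffb,41,2 Oak Ave\r\n'
          b'Zo\xc3\xab,25,3 Elm Rd\r\n')
rows = list(valid_rows(sample))
```

With this sample, only the rows for Alice and Zoë should remain in rows.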

Or, with surrogateescape, you can make sure that any non-UTF-8 data in the other fields (if that is possible) is restored when it is written out:

import csv

with open(incsvname, encoding='utf-8', errors='surrogateescape', newline='') as inf,\
     open(outcsvname, 'w', encoding='utf-8', errors='surrogateescape', newline='') as outf:
    csvout = csv.writer(outf)
    for PersonName, age, address in csv.reader(inf):
        try:
            # Check for surrogate escapes, and reject PersonNames containing them
            # Most efficient way to do so is a test encode; surrogates will fail
            # to encode with default error handler
            PersonName.encode('utf-8')
        except UnicodeEncodeError:
            continue  # Had non-UTF-8, skip this row

        # PersonName was decoded without surrogate escapes, so process the row here

        # You can recover the original file bytes in your code for a field with:
        #     fieldname.encode('utf-8', errors='surrogateescape')
        # Or if you're just passing data to a new file, write the same strings
        # back to a file opened with the same encoding/errors handling; the surrogates
        # will be restored to their original values:
        csvout.writerow([PersonName, age, address])
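The round trip that surrogateescape relies on can be seen on a single byte string. This is only an illustrative sketch: the sample bytes and the helper name has_escaped_bytes are invented here and are not part of the answer's code.

```python
# Latin-1 encoded text: 0xE9 and 0xFF are not valid UTF-8 in this position
raw = b'caf\xe9 \xff'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN
text = raw.decode('utf-8', errors='surrogateescape')

def has_escaped_bytes(s: str) -> bool:
    """Return True if s contains surrogate escapes (hypothetical helper)."""
    try:
        # A strict encode fails on lone surrogates
        s.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False

# Encoding with the same handler turns the surrogates back into the original bytes
restored = text.encode('utf-8', errors='surrogateescape')
```

Here restored should equal raw byte for byte, which is why writing the strings back through a file opened with the same handler preserves the other fields.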

Solution 1
0 2016-09-30 22:43:34
