Read a CSV file and validate a column based on UTF-8 characters
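I have to read a CSV file with the columns (PersonName, age, address), and I have to validate PersonName: "PersonName may contain only UTF-8 characters."
I'm using Python 3.x, so I can't call the decode method on the data after opening the file.
Please tell me how to open and read the file so that I can skip any PersonName that is not valid UTF-8 and then move on to the next line for validation.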
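Assuming the rest of the file either doesn't need checking or is legal UTF-8 (ASCII data included), you can open the file with encoding='utf-8' and errors='replace'. This changes any bytes that are invalid in UTF-8 to the Unicode replacement character, '\ufffd'. Alternatively, to preserve the data, you can use 'surrogateescape' as the errors handler, which represents each undecodable byte as a lone surrogate code point (U+DC80 to U+DCFF) so that the original value can be restored later. You can then check for these after the fact: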
import csv

with open(csvname, encoding='utf-8', errors='replace', newline='') as f:
    for PersonName, age, address in csv.reader(f):
        if '\ufffd' in PersonName:
            continue  # name had invalid UTF-8 bytes, skip this row
        # PersonName was decoded without errors, so process the row here
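To try the errors='replace' check without a file on disk, here is a minimal sketch that decodes in-memory bytes the same way open() would. The helper name rows_with_valid_names and the sample data are made up for illustration; they are not part of the original answer.

```python
import csv
import io


def rows_with_valid_names(raw: bytes):
    """Yield CSV rows whose first field decoded without replacement characters."""
    # Wrap the bytes so they are decoded with the same settings passed to open().
    text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8',
                            errors='replace', newline='')
    for row in csv.reader(text):
        if not row:
            continue  # blank line in the input
        if '\ufffd' in row[0]:
            continue  # the name contained at least one invalid byte
        yield row


# Hypothetical sample: the second name contains the invalid byte 0xFF,
# the third contains a correctly encoded non-ASCII character.
sample = b'Alice,30,Oak St\nB\xffb,25,Elm St\nZo\xc3\xab,41,Pine St\n'
valid = list(rows_with_valid_names(sample))
print([row[0] for row in valid])
```

One caveat with this approach: a name that legitimately contains U+FFFD, correctly encoded in the file, is rejected too, because after decoding it is indistinguishable from a replaced byte.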
Or, with surrogateescape, you can ensure that any non-UTF-8 data in the other fields is (if possible) restored on write:
import csv

with open(incsvname, encoding='utf-8', errors='surrogateescape', newline='') as inf, \
     open(outcsvname, 'w', encoding='utf-8', errors='surrogateescape', newline='') as outf:
    csvout = csv.writer(outf)
    for PersonName, age, address in csv.reader(inf):
        try:
            # Check for surrogate escapes, and reject PersonNames containing them.
            # The most efficient way to do so is a test encode; surrogates fail
            # to encode with the default (strict) error handler.
            PersonName.encode('utf-8')
        except UnicodeEncodeError:
            continue  # Had non-UTF-8 bytes, skip this row

        # PersonName was decoded without surrogate escapes, so process the row here.

        # You can recover the original file bytes in your code for a field with:
        #     fieldname.encode('utf-8', errors='surrogateescape')
        # Or, if you're just passing data to a new file, write the same strings
        # back to a file opened with the same encoding/errors handling; the
        # surrogates will be restored to their original values:
        csvout.writerow([PersonName, age, address])
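The surrogateescape round trip can be seen on a single value. This is only a sketch: the helper name is_clean_utf8 and the sample bytes are illustrative assumptions, not something from the original answer.

```python
def is_clean_utf8(field: str) -> bool:
    """Return True if the decoded field holds no surrogate escapes."""
    try:
        field.encode('utf-8')  # strict encode; lone surrogates raise
    except UnicodeEncodeError:
        return False
    return True


raw = b'caf\xe9'  # Latin-1 encoded bytes, which are not valid UTF-8

# The invalid byte 0xE9 becomes the lone surrogate U+DCE9 instead of raising.
decoded = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler turns the surrogate back into byte 0xE9.
restored = decoded.encode('utf-8', errors='surrogateescape')

print(is_clean_utf8(decoded), restored == raw)
```

Because the escaped bytes survive the round trip, rows whose other fields contain non-UTF-8 data can be written back byte for byte, while the strict test encode still flags a bad PersonName.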