Parsing Unicode characters in Unix using Python

Question

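I'm running into a strange problem when running Python code on Unix. I wrote some code that parses Unicode characters, and it works perfectly when I run it on Windows. When I run the same code on Unix, however, extra values are added and my output data is incorrect.

My source file looks like this: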

bash-4.2$ more sourcefile.csv
"ひとみ","Abràmoff","70141558"

I am using Python 3.7.

import requests
import csv
import urllib.parse


with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)


Windows output : Abr%C3%A0moff
Unix output : Abr%C3%83%C2%A0moff


Can someone please help me get rid of this extra "83%C2" on Unix?

Answer 1

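This looks like an encoding mismatch on the Unix machine. How did you create sourcefile.csv on Windows and on Unix? Are any explicit encode/decode operations involved?

As far as I understand, somewhere the UTF-8 encoded bytes are being interpreted as Unicode code points, probably something like this: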
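
One way to start answering those questions is to look at the raw bytes of the file and at the encoding Python picks by default. The following is only a diagnostic sketch; `raw_prefix` is a hypothetical helper name, not something from the question.

```python
import locale

# Encoding that open() falls back to when no encoding argument is passed.
print(locale.getpreferredencoding(False))


def raw_prefix(path, n=64):
    """Return the first n undecoded bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(n)
```

If `raw_prefix('sourcefile.csv')` shows `\xc3\xa0` where the à should be, the file itself is valid UTF-8 and the problem lies in how it is decoded. Seeing `\xc3\x83\xc2\xa0` instead would mean the file was already double-encoded when it was written.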

unicode_string = 'Abràmoff'

utf8_bytes = unicode_string.encode('utf-8')  # b'Abr\xc3\xa0moff'

mismatched_unicode_string = ''.join(chr(b) for b in utf8_bytes)  # 'AbrÃ\xa0moff'

Here is the crazy part. The bytes C3 and A0 are interpreted as the Unicode code points U+00C3 (Ã) and U+00A0 (NO-BREAK SPACE). As a result, you get:

quoted_string = urllib.parse.quote(mismatched_unicode_string)  # 'Abr%C3%83%C2%A0moff'

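It looks as if the correct character %C3%A0 had some extra bytes (%83%C2) inserted, but that is a coincidence: these are really two separate characters, %C3%83 and %C2%A0.

To reverse this, you can do: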
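
To make the two separate characters visible, each one can be percent-encoded on its own. This is a small illustration only; `quote_each` is a hypothetical helper introduced for the example.

```python
import urllib.parse

# 'Abràmoff' after its UTF-8 bytes were misread as one character per byte.
mismatched = 'Abr\u00c3\u00a0moff'


def quote_each(text):
    """Percent-encode every character separately (hypothetical helper)."""
    return [urllib.parse.quote(ch) for ch in text]


# U+00C3 and U+00A0 each become their own two-byte UTF-8 sequence.
print(quote_each(mismatched[3:5]))  # ['%C3%83', '%C2%A0']
```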

import csv
import urllib.parse

with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]
        # Treat each code point as a single byte, then decode those bytes as UTF-8.
        FRST_NM = bytes([ord(c) for c in FRST_NM]).decode('utf-8')

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)

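This gives you the expected result on Unix (and will most likely break on Windows), but honestly you should find out what is wrong with the file encoding and fix that.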
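
If the file turns out to be plain UTF-8, one likely fix is to pass the encoding to `open()` explicitly, so that the result no longer depends on each machine's locale default. The sketch below assumes the file is UTF-8; `quoted_first_names` is a hypothetical helper name.

```python
import csv
import urllib.parse


def quoted_first_names(path, encoding="utf-8"):
    """Return the second column of every row, percent-encoded (hypothetical helper)."""
    # An explicit encoding keeps open() from using the platform's locale default.
    with open(path, "r", newline="", encoding=encoding) as csv_file:
        return [urllib.parse.quote(row[1]) for row in csv.reader(csv_file, delimiter=",")]
```

Reading the sample file this way should produce `Abr%C3%A0moff` on both systems, while passing `encoding="latin-1"` reproduces the `Abr%C3%83%C2%A0moff` output from the question.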

Solution 1
0 2019-10-22 12:30:49

使用python在unix中解析unicode字符

問題描述

1 個解決方案

解決方案1 0 2019-10-22 12:30:49

解決方案1
0 2019-10-22 12:30:49