
Parsing Unicode characters on Unix using Python

I am facing a strange issue while running Python code on Unix. I have written code to parse Unicode characters, and it works perfectly when I run it on Windows. However, when I run the same code on Unix, it adds some extra values and makes my output data incorrect.

My source file looks like this:

bash-4.2$ more sourcefile.csv
"ひとみ","Abràmoff","70141558"

I am using Python 3.7.

import requests
import csv
import urllib.parse

with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)


Windows output : Abr%C3%A0moff
Unix output : Abr%C3%83%C2%A0moff


Can someone please help me get rid of this "83%C2" on Unix?

This looks like some kind of encoding mismatch on the Unix machine. How did you create sourcefile.csv on Windows and on Unix? Were any explicit encoding/decoding operations involved?
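One quick way to check for such a mismatch is to print the encoding Python falls back to when `open()` is called without an `encoding=` argument. This is only a diagnostic sketch: the question does not show the locale settings of either machine, so what it prints there is unknown. The variable name `default_open_encoding` is just illustrative.

```python
import locale
import sys

# In text mode, open() without an explicit encoding= argument uses the
# locale's preferred encoding (this is what Python 3.7 does).
default_open_encoding = locale.getpreferredencoding(False)

print("open() default encoding:", default_open_encoding)
print("stdout encoding:", sys.stdout.encoding)
```

If the two machines report different values here (for example a UTF-8 locale on one and a Latin-1 locale on the other), the same file will be decoded differently on each.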

From what I understand, UTF-8 encoded bytes were interpreted as Unicode code points somewhere, possibly like this:

unicode_string = 'Abràmoff'

utf8_bytes = unicode_string.encode('utf-8')  # b'Abr\xc3\xa0moff'

mismatched_unicode_string = ''.join(chr(b) for b in utf8_bytes)  # 'AbrÃ\xa0moff'

This is the crucial part. Bytes C3 and A0 are each interpreted as a Unicode code point: U+00C3 (Ã) and U+00A0 (NO-BREAK SPACE). As a result you get:

quoted_string = urllib.parse.quote(mismatched_unicode_string)  # 'Abr%C3%83%C2%A0moff'

This looks as if the correct character %C3%A0 had some additional bytes ( %83%C2 ) inserted in the middle, but that is a coincidence: it is actually two separate characters, %C3%83 and %C2%A0.
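To see the two separate characters directly, the Unix output from the question can be percent-decoded and each non-ASCII character named. This is a small illustrative check, not part of the original code; `unix_output` and `non_ascii` are names chosen for this sketch.

```python
import unicodedata
from urllib.parse import unquote

unix_output = "Abr%C3%83%C2%A0moff"  # the Unix output quoted in the question

# unquote() decodes the percent-escaped bytes as UTF-8 by default.
decoded = unquote(unix_output)

# Collect the code point and official name of every non-ASCII character.
non_ascii = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch))
    for ch in decoded
    if ord(ch) > 127
]

for code_point, name in non_ascii:
    print(code_point, name)
```

The loop prints two entries, one for U+00C3 and one for U+00A0, rather than a single entry for à (U+00E0).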

To reverse this you could do:

with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]
        # Treat each code point as a single byte, then decode those bytes as UTF-8.
        FRST_NM = bytes([ord(c) for c in FRST_NM]).decode('utf-8')

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)

On Unix this will give you the expected result (and will most likely break on Windows), but honestly you should figure out what is wrong with your file's encoding and fix that.
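If the file itself is valid UTF-8 and only the default encoding differs between the two machines (an assumption, since the question does not confirm either point), passing `encoding='utf-8'` to `open()` should give the same result on both. The sketch below writes a temporary stand-in for sourcefile.csv and reads it back with two encodings; `quoted_second_column` is a hypothetical helper written for this example.

```python
import csv
import os
import tempfile
import urllib.parse


def quoted_second_column(path, encoding):
    """Read a CSV file with the given encoding and URL-quote column 1 of each row."""
    with open(path, "r", newline="", encoding=encoding) as csv_file:
        return [urllib.parse.quote(row[1]) for row in csv.reader(csv_file)]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for sourcefile.csv, written explicitly as UTF-8 bytes.
    path = os.path.join(tmp_dir, "sourcefile.csv")
    with open(path, "wb") as f:
        f.write('"ひとみ","Abràmoff","70141558"\n'.encode("utf-8"))

    as_utf8 = quoted_second_column(path, "utf-8")      # explicit, platform-independent
    as_latin1 = quoted_second_column(path, "latin-1")  # simulates a mismatched default

print(as_utf8)    # ['Abr%C3%A0moff']
print(as_latin1)  # ['Abr%C3%83%C2%A0moff']
```

Reading the same bytes as Latin-1 reproduces the Unix output from the question, which suggests (but does not prove) that the Unix machine's default encoding is a single-byte one such as Latin-1.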
