Parsing Unicode characters on Unix using Python

Question

I am facing a strange issue while running Python code on Unix. I have written code to parse Unicode characters, and it works perfectly when I run it on Windows. However, when I run the same code on Unix, it adds some extra values and makes my output data incorrect.

My source file looks like this:

bash-4.2$ more sourcefile.csv
"ひとみ","Abràmoff","70141558"

I am using Python 3.7.

import requests
import csv
import urllib.parse


with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)


Windows output : Abr%C3%A0moff
Unix output : Abr%C3%83%C2%A0moff


Can someone please help me get rid of this "83%C2" on Unix?

Answer 1

This looks like an encoding mismatch on the Unix machine. How did you create sourcefile.csv on Windows and on Unix? Were any explicit encoding or decoding operations involved?
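One way to start narrowing this down is to compare the encoding Python uses by default with what the file's raw bytes actually look like. The following is only a rough sketch: report_encoding is a hypothetical helper name, and the sample bytes are built by hand on the assumption that the file on disk is UTF-8.

```python
import locale


def report_encoding(raw_bytes):
    """Return (default text encoding, whether raw_bytes is valid UTF-8)."""
    # open() falls back to this encoding when no encoding= argument is given.
    default = locale.getpreferredencoding(False)
    try:
        raw_bytes.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return default, is_utf8


# Assumption: the CSV row from the question, stored as UTF-8 ('\u00e0' is 'à').
sample = '"ひとみ","Abr\u00e0moff","70141558"\n'.encode("utf-8")
default, is_utf8 = report_encoding(sample)
print(default, is_utf8)
```

For the real file you would pass open('sourcefile.csv', 'rb').read() instead of the sample. If the bytes are valid UTF-8 but the default encoding printed on the Unix machine is something else, the mismatch is in how the file is being read.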

From what I understand, UTF-8 encoded bytes were interpreted as Unicode code points somewhere, possibly like this:

unicode_string = 'Abràmoff'

utf8_bytes = unicode_string.encode('utf-8')  # b'Abr\xc3\xa0moff'

mismatched_unicode_string = ''.join(chr(b) for b in utf8_bytes)  # 'AbrÃ\xa0moff'

This is the crazy part. Bytes C3 and A0 are each interpreted as a separate Unicode code point: U+00C3 (Ã) and U+00A0 (NO-BREAK SPACE). As a result you get:

quoted_string = urllib.parse.quote(mismatched_unicode_string)  # 'Abr%C3%83%C2%A0moff'

This looks like the correct sequence %C3%A0 with some additional bytes (%83%C2) inserted in the middle, but that is a coincidence: it is actually two separate characters, %C3%83 and %C2%A0.
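Mapping each byte to the code point with the same value is exactly what decoding as Latin-1 (ISO-8859-1) does, so the mismatch can be reproduced in isolation. This is a small sketch of that idea; whether Latin-1 is really the encoding in effect on your Unix machine is an assumption.

```python
import urllib.parse

original = 'Abr\u00e0moff'                    # 'Abràmoff'
utf8_bytes = original.encode('utf-8')         # b'Abr\xc3\xa0moff'

# Latin-1 maps every byte to the code point with the same value, which gives
# the same string as ''.join(chr(b) for b in utf8_bytes).
mojibake = utf8_bytes.decode('latin-1')       # 'AbrÃ\xa0moff'

print(urllib.parse.quote(original))           # Abr%C3%A0moff
print(urllib.parse.quote(mojibake))           # Abr%C3%83%C2%A0moff

# Encoding back to Latin-1 recovers the original bytes, so the damage is reversible.
repaired = mojibake.encode('latin-1').decode('utf-8')
```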

To reverse this you could do:

import csv
import urllib.parse

with open('sourcefile.csv', "r", newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for lines in csv_reader:
        FRST_NM = lines[1]
        # Interpret each code point as a single byte, then decode those bytes as UTF-8.
        FRST_NM = bytes([ord(c) for c in FRST_NM]).decode('utf-8')

FRST_NM1 = urllib.parse.quote(FRST_NM)
print(FRST_NM1)

On Unix this will give you the expected result (and will most likely break on Windows), but honestly you should figure out what is wrong with your file's encoding and fix that.
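If the file turns out to be UTF-8 on disk, one possible fix is to stop relying on the platform default and pass the encoding to open() explicitly. The sketch below assumes that is the case; quoted_second_column is a hypothetical helper, and the temporary file only stands in for sourcefile.csv so the example runs on its own.

```python
import csv
import os
import tempfile
import urllib.parse


def quoted_second_column(path, encoding="utf-8"):
    """Yield the URL-quoted second column of each row in the CSV at path."""
    with open(path, "r", newline="", encoding=encoding) as csv_file:
        for row in csv.reader(csv_file, delimiter=","):
            yield urllib.parse.quote(row[1])


# Stand-in for sourcefile.csv, written as UTF-8 bytes ('\u00e0' is 'à').
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sourcefile.csv")
    with open(sample_path, "wb") as sample_file:
        sample_file.write('"ひとみ","Abr\u00e0moff","70141558"\n'.encode("utf-8"))

    results = list(quoted_second_column(sample_path))
    # Reading the same bytes as Latin-1 reproduces the output from the question.
    broken = list(quoted_second_column(sample_path, encoding="latin-1"))

print(results)  # ['Abr%C3%A0moff']
print(broken)   # ['Abr%C3%83%C2%A0moff']
```

With an explicit encoding the same script should behave identically on Windows and Unix, as long as the file itself is the same on both machines.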

solution1
0 2019-10-22 12:30:49
