
Decode to UTF-8 in pandas

I have a CSV file with the content shown below. As you can see, some of the field values are not plain strings. I read the file with this command:

data = gpd.read_file('data.csv', encoding='utf8')

The CSV file:

[Image: CSV file]

Notebook:

[Image: Jupyter notebook screenshot]

As you can see, the name column is still not decoded. I tried the following command, but it did not work: the values in the column are already str, so decode() cannot be called on them.

data['name'] = data['name'].apply(lambda x:x.decode('utf8', 'strict') if not isinstance(x, str) else x)
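A minimal sketch of why that attempt changes nothing, assuming each cell holds the text of a bytes literal rather than real bytes. The sample value is hypothetical and shortened for illustration:

```python
# Hypothetical cell value, as it would look after reading the CSV:
# a str whose characters spell out a bytes literal.
value = r"b'\xd9\x85\xd9\x86'"

print(type(value))               # the value is a str, not bytes
print(hasattr(value, "decode"))  # str has no decode() method in Python 3
print(value.startswith("b'"))    # the text only looks like a bytes literal
```

Because isinstance(x, str) is true for every such cell, the lambda above always takes the else branch and returns the value unchanged.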

This works:

data['name'] = data['name'].apply(
    lambda x:x[2:-1].encode().decode("unicode_escape").encode('raw_unicode_escape').decode()
)

Step by step

In:

x = r"b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'"
print(f"x {type(x)}\n\t= {x}\n")

x = x[2:-1]
print(f"x[2:-1] {type(x)}\n\t= {x}\n")

x = x.encode()
print(f"x[2:-1].encode() {type(x)}\n\t= {x}\n")

x = x.decode("unicode_escape").encode('raw_unicode_escape')
print(f"x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape') {type(x)}\n\t= {x}\n")

x = x.decode()
print(f"x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape').decode() {type(x)}\n\t= {x}\n")

Out:

x <class 'str'>
    = b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'

x[2:-1] <class 'str'>
    = \xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8

x[2:-1].encode() <class 'bytes'>
    = b'\\xd9\\x85\\xd9\\x86\\xd8\\xaa\\xd8\\xb2\\xd9\\x87\\xd8\\xb1\\xd8\\xa7\\xd8\\xa8'

x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape') <class 'bytes'>
    = b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'

x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape').decode() <class 'str'>
    = منتزهراب
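An alternative that may be easier to read is to let Python parse the bytes literal itself with ast.literal_eval and then decode the result. This is only a sketch, not part of the original answer; the helper name decode_bytes_repr is hypothetical, and it assumes the cells contain valid Python bytes literals encoded as UTF-8:

```python
import ast

def decode_bytes_repr(value, encoding="utf-8"):
    """Turn text like "b'\\xd9\\x85'" into the string those bytes encode."""
    # Leave anything that does not look like a bytes literal untouched.
    if not (isinstance(value, str) and value[:2] in ("b'", 'b"')):
        return value
    try:
        raw = ast.literal_eval(value)  # safely parse the literal into bytes
    except (ValueError, SyntaxError):
        return value  # malformed literal: keep the original text
    return raw.decode(encoding) if isinstance(raw, bytes) else value

x = r"b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'"
print(decode_bytes_repr(x))
```

Applied to the column, this would be data['name'] = data['name'].apply(decode_bytes_repr). Unlike slicing with x[2:-1], values that are already ordinary strings or are not strings at all pass through unchanged.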
