Decode to UTF-8 in pandas

Question

I have a CSV file with the content shown below. As you can see, some of the field values are not plain strings. I read the file using this command:

data = gpd.read_file('data.csv', encoding='utf8')

The CSV file:

[image: CSV file]

Notebook:

[image: Jupyter notebook]

As you can see, the name column is still not decoded. I tried the following command, but it did not work: the values in the column are seen as str, so decode() is never called on them.

data['name'] = data['name'].apply(lambda x:x.decode('utf8', 'strict') if not isinstance(x, str) else x)

Answer 1

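This works. Each affected value is a str containing the text of a bytes literal (b'...'), so strip the b'...' wrapper and reinterpret the escape sequences as UTF-8 bytes: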

data['name'] = data['name'].apply(
    lambda x:x[2:-1].encode().decode("unicode_escape").encode('raw_unicode_escape').decode()
)
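
One caveat, since the question says only some rows are affected: x[2:-1] also trims values that are ordinary text. Below is a minimal sketch of a guarded version of the same chain. It assumes affected values always look like b'...', and the helper name decode_if_wrapped is made up for illustration.

```python
def decode_if_wrapped(x):
    # Only touch strings that look like the text of a bytes literal: b'...'
    if isinstance(x, str) and x.startswith("b'") and x.endswith("'"):
        return (
            x[2:-1]                        # drop the leading b' and trailing '
            .encode()                      # str -> bytes holding literal backslashes
            .decode("unicode_escape")      # interpret \xNN escapes as code points
            .encode("raw_unicode_escape")  # code points below 256 -> single bytes
            .decode()                      # read those bytes as UTF-8
        )
    # Plain strings and non-string values (for example NaN) pass through unchanged
    return x
```

With the DataFrame from the question, this would be applied as data['name'] = data['name'].apply(decode_if_wrapped).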

Step by step

In:

x = r"b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'"
print(f"x {type(x)}\n\t= {x}\n")

x = x[2:-1]
print(f"x[2:-1] {type(x)}\n\t= {x}\n")

x = x.encode()
print(f"x[2:-1].encode() {type(x)}\n\t= {x}\n")

x = x.decode("unicode_escape").encode('raw_unicode_escape')
print(f"x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape') {type(x)}\n\t= {x}\n")

x = x.decode()
print(f"x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape').decode() {type(x)}\n\t= {x}\n")

Out:

x <class 'str'>
    = b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'

x[2:-1] <class 'str'>
    = \xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8

x[2:-1].encode() <class 'bytes'>
    = b'\\xd9\\x85\\xd9\\x86\\xd8\\xaa\\xd8\\xb2\\xd9\\x87\\xd8\\xb1\\xd8\\xa7\\xd8\\xa8'

x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape') <class 'bytes'>
    = b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'

x[2:-1].encode().decode('unicode_escape').encode('raw_unicode_escape').decode() <class 'str'>
    = منتزهراب
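
As an alternative to the escape round trip, the standard library's ast.literal_eval can parse the b'...' text into a real bytes object, which is then decoded as UTF-8. This is a sketch rather than part of the original answer. It assumes the affected values are valid Python bytes literals, and decode_bytes_repr is a hypothetical helper name.

```python
import ast


def decode_bytes_repr(value):
    # Candidates are strings that start like a bytes literal: b'...' or b"..."
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)  # safely evaluates literals only
            if isinstance(parsed, bytes):
                return parsed.decode("utf-8")
        except (ValueError, SyntaxError):
            # Not a valid literal, or not valid UTF-8: leave the value as it is
            # (UnicodeDecodeError is a subclass of ValueError)
            return value
    return value


sample = r"b'\xd9\x85\xd9\x86\xd8\xaa\xd8\xb2\xd9\x87\xd8\xb1\xd8\xa7\xd8\xa8'"
print(decode_bytes_repr(sample))
```

On the DataFrame from the question, it would be applied as data['name'] = data['name'].apply(decode_bytes_repr).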
