How can I use unicode string as index for pd.DataFrame?

Question

I'm using Python 2.7 and have created a pandas DataFrame named my_reader with pd.read_excel(my_path, encoding="utf-8"). One of its columns is named 'Descrição'.

I have all the column names in a list named client_list.

When I try to use the names from my list to index my_reader, I get an error:

KeyError: 'Descri\xc3\xa7\xc3\xa3o'

It works fine for all the other column names, which contain only English letters. When I print an element of client_list, the name is displayed correctly:

print client_list[0]
Descrição

But inspecting the value directly shows the raw bytes:

client_list[0]
'Descri\xc3\xa7\xc3\xa3o'

So I can't use

my_reader[client_list[i]]

Any ideas?

Thanks

Answer 1

Your DataFrame was read with encoding="utf-8". Before using 'Descri\xc3\xa7\xc3\xa3o' as a key for the DataFrame, decode it with "utf-8"; the lookup then succeeds. For example:

# -*- coding: utf-8 -*-
# (the line above lets a Python 2 source file contain non-ASCII literals)
import pandas as pd
my_reader = pd.read_excel('comparison.xlsx', encoding="utf-8")
my_reader

my_reader will be:

    Col_1   Col_2   file    Descrição
0   Abc     Abk     cnl     DFSDF
1   Nck     Nck     Abk     DSFAF
2   xkl     cnl     Abc     FDAS
3   mzn     mzn     NaN     DFAS

You can use:

my_reader['Descrição'.decode('utf-8')]

This will give you the result:

0    DFSDF
1    DSFAF
2     FDAS
3     DFAS
Name: Descrição, dtype: object

The other columns can be accessed with unicode keys as well:

my_reader['Col_2'.decode("utf-8")]

Output:

0    Abk
1    Nck
2    cnl
3    mzn
Name: Col_2, dtype: object
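The calls above decode string literals. When the names are held in variables, a small helper can do the same thing. This is only a sketch: to_text is a hypothetical name, not part of pandas, and it assumes the byte strings are UTF-8 encoded. It also leaves values that are already unicode untouched, because calling .decode on non-ASCII unicode text fails in Python 2.

```python
def to_text(name, encoding="utf-8"):
    """Return name as a unicode string (hypothetical helper, not a pandas API).

    Byte strings are decoded with the given encoding (assumed to be UTF-8);
    values that are already unicode are returned unchanged.
    """
    if isinstance(name, bytes):  # bytes is an alias of str in Python 2
        return name.decode(encoding)
    return name


raw_name = b"Descri\xc3\xa7\xc3\xa3o"  # the byte string shown in the KeyError
key = to_text(raw_name)

print(key == u"Descri\xe7\xe3o")  # True: the bytes became the unicode name
print(to_text(key) == key)        # True: decoding twice is harmless
```

The result can then be used as the key, for example my_reader[to_text(client_list[0])].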

Answer 2

Your list of column names is a list of str objects (byte strings) encoded in UTF-8. But the pandas columns have unicode strings as names, so the easiest solution is to "decode" your list of column names to unicode as well.

client_list = [c.decode("utf-8") for c in client_list]

I can't see into your dataframe, but I'll wager that all columns, not just the non-ascii ones, are unicode strings. The reason the other column names don't give you trouble is that Python 2 does a lot of implicit conversions behind the scenes (and pandas probably adds some of its own). With ascii strings the mapping between str and unicode is trivial, but with non-ascii text it depends on the encoding. So just convert the entire list of names to unicode. Better yet, migrate all your text handling to unicode, as recommended for any application that sometimes deals with non-ascii data.
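To make the "trivial versus encoding-dependent" point concrete, here is a minimal sketch using the two kinds of names from the question. The variable names are made up for illustration; the b"" and u"" prefixes are used so the snippet behaves the same under Python 2.7 and Python 3.

```python
ascii_name = b"Col_2"
accented_name = b"Descri\xc3\xa7\xc3\xa3o"  # UTF-8 bytes of the accented name

# Pure-ASCII bytes decode to the same text whichever codec is used.
ascii_results = {ascii_name.decode(codec) for codec in ("ascii", "utf-8", "latin-1")}
print(len(ascii_results))  # 1

# Non-ASCII bytes give different text depending on the codec.
as_utf8 = accented_name.decode("utf-8")
as_latin1 = accented_name.decode("latin-1")
print(as_utf8 == u"Descri\xe7\xe3o")  # True
print(as_utf8 == as_latin1)           # False

# The ascii codec cannot decode them at all.
try:
    accented_name.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```

Python 2's implicit conversions use the ascii codec by default, which is why only the accented name fails to match.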

A better solution to your predicament would be to switch to Python 3. Its handling of non-ascii encodings is much more intuitive and robust; you're likely to find that your code will "just work", as it did for me under Python 3.
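For what it's worth, here is a short Python 3 only sketch of why the confusion tends to go away there: text and bytes are separate types and are never converted into each other implicitly. The variable names are just for illustration.

```python
# Python 3 only: a plain string literal is already unicode text.
name = "Descrição"
encoded = name.encode("utf-8")  # explicit conversion to bytes

print(type(name).__name__)     # str
print(type(encoded).__name__)  # bytes
print(encoded)                 # b'Descri\xc3\xa7\xc3\xa3o'

# Bytes and text never compare equal, so a mix-up shows up immediately.
print(encoded == name)                  # False
print(encoded.decode("utf-8") == name)  # True
```

A column name typed as an ordinary string literal is therefore already the same type as the unicode labels pandas uses.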

2 answers

solution1
0 2017-05-17 14:26:56

solution2
0 ACCEPTED 2017-05-17 15:01:28
