
Pandas: select string with unicode characters

I am trying to select rows by specifying the value of one of the columns. That works perfectly well as long as the selected value is pure ASCII. If, however, it contains non-ASCII characters, I cannot get it to work no matter how I encode the value.

Simplified example to illustrate the problem:

>>> from __future__ import (absolute_import, division, 
                            print_function, unicode_literals)
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
>>> df['city'] = df['city'].map(lambda x: x.encode('latin-1'))
>>> store = pd.HDFStore('test_store.h5')
>>> store.append('test_key', df, data_columns=True)
>>> store['test_key']
   id       city
0   1  Stuttgart
1   2    M�nchen

Note that the non-ASCII string is indeed properly stored:

>>> store['test_key']['city'][1]
'M\xfcnchen'
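For reference, the two encodings discussed in this post produce different bytes for the same city name. The following is a minimal sketch, assuming Python 3 and using only the standard library (no pandas or PyTables); the variable names are illustrative. It also shows that the latin-1 byte `0xfc` is not valid UTF-8 on its own, which is consistent with the `�` in the printed frame above if the terminal decodes output as UTF-8.

```python
# 'München' written with an escape so the source file stays pure ASCII
city = 'M\xfcnchen'

latin1_bytes = city.encode('latin-1')  # one byte for 'ü'
utf8_bytes = city.encode('utf-8')      # two bytes for 'ü'

print(repr(latin1_bytes))
print(repr(utf8_bytes))

# Decoding the latin-1 bytes as UTF-8 with errors='replace' substitutes
# U+FFFD (the replacement character) for the invalid byte.
shown = latin1_bytes.decode('utf-8', 'replace')
print(shown == 'M\ufffdnchen')
```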

Selecting for an ASCII value works just fine:

>>> store.select('test_key', where='city==%r' % 'Stuttgart')
   id       city
0   1  Stuttgart

But selecting for the non-ASCII value fails to return the row:

>>> store.select('test_key', where='city==%r' % 'München')
Empty DataFrame
Columns: [id, city]
Index: []

>>> store.select('test_key', where='city==%r' % 'München'.encode('latin-1'))
Empty DataFrame
Columns: [id, city]
Index: []

Clearly I am doing something wrong... How does one solve this issue?

Oddly, selection seems to work fine if the encoding is UTF-8 instead of latin-1:

from __future__ import (absolute_import, division, 
                        print_function, unicode_literals)

import pandas as pd

df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
df['city'] = df['city'].map(lambda x: x.encode('utf-8'))
store = pd.HDFStore('/tmp/test_store.h5', 'w')
store.append('test_key', df, data_columns=True)
print(store.select('test_key', where='city==%r' % 'Stuttgart'.encode('utf-8')))
#    id       city
# 0   1  Stuttgart

print(store.select('test_key', where='city==%r' % 'München'.encode('utf-8')))
#    id     city
# 1   2  München

store.close()

It looks like PyTables 3.1.1 may not support unicode columns. I'm not a PyTables user, but this bug report suggests it is a known problem postponed until version 3.2. This other issue may also be relevant.
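If the data already exists as latin-1 bytes, one option in line with the UTF-8 observation above is to transcode the values before appending them to the store. This is only a sketch, assuming Python 3; `latin1_to_utf8` is a hypothetical helper, not part of pandas or PyTables, and whether it avoids the problem on a given PyTables version has not been verified here.

```python
def latin1_to_utf8(raw):
    """Re-encode latin-1 bytes as UTF-8 bytes (hypothetical helper)."""
    return raw.decode('latin-1').encode('utf-8')


# Sample values matching the 'city' column from the question
cities_latin1 = [b'Stuttgart', b'M\xfcnchen']
cities_utf8 = [latin1_to_utf8(c) for c in cities_latin1]

print(cities_utf8)
```

With a DataFrame, the same function could be passed to `df['city'].map(...)` in the same way the question maps `x.encode('latin-1')` over the column.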
