
pandas dataframe with string index containing row 'nan' gets converted to NaN when saved to and read from HDF5

I have a simple dataframe with a string index:

>>> df = pd.DataFrame(dict(x=['a','nan', 'NA', 'na', 'NaN'],
                           y=[1,2,3,4,5])).set_index('x')
>>> df
     y
x
a    1
nan  2
NA   3
na   4
NaN  5

The index is correctly stored as strings:

>>> df.index.isna()
array([False, False, False, False, False])

However, when I save it to an HDF5 file and read it back, the index entry 'nan' is somehow changed to NaN.

>>> df.to_hdf('test.h5', key='test')
>>> df2=pd.read_hdf('test.h5')
>>> df2.index.isna()
array([False,  True, False, False, False])

Is there a way to avoid this conversion? In my actual code, the index is based on gene names and includes the Drosophila gene nan, which I don't want converted.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.45.1.el7.x86_64
Version          : #1 SMP Wed Oct 13 17:20:51 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : POSIX
LANG             : en_US.UTF-8
LOCALE           : None.None

pandas           : 1.2.3
numpy            : 1.20.2
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 52.0.0.post20210125
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.7.4
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.6.2
sqlalchemy       : 1.4.6
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.53.1
>>>

I think this is an issue with read_hdf, though I could be wrong.

One workaround is to leave x as a regular column when you save to HDF5, and set it as the index only after reading the file back:

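import pandas as pd

# Keep x as an ordinary column instead of the index.
df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
                       y=[1, 2, 3, 4, 5]))

df.to_hdf('test.h5', key='test')

# Set the index only after reading the data back.
df2 = pd.read_hdf('test.h5')
df2 = df2.set_index('x')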

Test:

>>> df2.index.isna()
array([False, False, False, False, False])
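A slightly stricter check than isna() is to compare the restored labels with the original ones position by position. This is only a sketch: the helper name altered_labels is hypothetical, and the second index below is built by hand to mimic what the question reports rather than read from a file.

```python
import numpy as np
import pandas as pd


def altered_labels(original, restored):
    """Return the original labels whose restored counterpart differs.

    Hypothetical helper: labels are compared position by position,
    so both indexes must have the same length.
    """
    original = pd.Index(original)
    restored = pd.Index(restored)
    if len(original) != len(restored):
        raise ValueError("indexes must have the same length")
    # NaN never compares equal to a string, so converted labels are reported.
    return [old for old, new in zip(original, restored) if old != new]


before = pd.Index(['a', 'nan', 'NA', 'na', 'NaN'], name='x')
# Hand-built stand-in for the index read back in the question.
after = pd.Index(['a', np.nan, 'NA', 'na', 'NaN'], name='x')

print(altered_labels(before, after))  # ['nan']
```

With a real file you would pass df.index and df2.index instead of the hand-built indexes.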

Or, if you want to keep the index on your original dataframe, reset it only at the point where you write the HDF5 file:

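import pandas as pd

df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
                       y=[1, 2, 3, 4, 5])).set_index('x')

# df itself keeps its index; only the copy written to disk has x as a column.
df.reset_index().to_hdf('test.h5', key='test')

# Restore the index after reading.
df2 = pd.read_hdf('test.h5')
df2 = df2.set_index('x')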
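If you do this in several places, the two steps can be wrapped in a pair of small helpers. This is a rough sketch rather than a definitive implementation: the function names to_hdf_keep_labels and read_hdf_keep_labels are made up for illustration, and it relies on the observation above that string columns survive the round trip unchanged. It also assumes every index level has a name, since the names are needed to rebuild the index.

```python
import pandas as pd


def to_hdf_keep_labels(df, path, key):
    """Write df with its index stored as ordinary columns.

    Hypothetical helper. Returns the index level names so the caller
    can pass them to read_hdf_keep_labels later.
    """
    index_names = list(df.index.names)
    if any(name is None for name in index_names):
        raise ValueError("name every index level before saving")
    # The original df is left untouched; only the written copy is reset.
    df.reset_index().to_hdf(path, key=key)
    return index_names


def read_hdf_keep_labels(path, key, index_names):
    """Read the frame back and rebuild the index from the saved columns."""
    return pd.read_hdf(path, key=key).set_index(index_names)
```

Usage would look like names = to_hdf_keep_labels(df, 'test.h5', 'test') followed by df2 = read_hdf_keep_labels('test.h5', 'test', names).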
