pandas DataFrame with string index containing the row 'nan' gets converted to NaN when saved to and read from HDF5
I have a simple DataFrame with a string index:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
...                        y=[1, 2, 3, 4, 5])).set_index('x')
>>> df
     y
x
a    1
nan  2
NA   3
na   4
NaN  5
The index is correctly stored as strings:
>>> df.index.isna()
array([False, False, False, False, False])
However, when I save it to an HDF5 file and read it back, the index entry 'nan' is somehow changed to NaN:
>>> df.to_hdf('test.h5', key='test')
>>> df2=pd.read_hdf('test.h5')
>>> df2.index.isna()
array([False, True, False, False, False])
Is there a way to avoid this conversion? In my actual code, the index contains the Drosophila gene nan, and I don't want it to be converted.
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.45.1.el7.x86_64
Version : #1 SMP Wed Oct 13 17:20:51 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 1.2.3
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.2
sqlalchemy : 1.4.6
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
>>>
I think this is an issue with read_hdf, though I could be wrong.
One workaround is to not set x as the index when you save to HDF5, and instead set the index to x after reading the file back:
import pandas as pd
df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
y=[1, 2, 3, 4, 5]))
df.to_hdf('test.h5', key='test')
df2 = pd.read_hdf('test.h5')
df2 = df2.set_index('x')
Test:
>>> df2.index.isna()
array([False, False, False, False, False])
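To see which labels were affected rather than only whether any were, a small comparison helper might be useful. This is a minimal sketch: the function name `labels_lost_to_nan` is hypothetical, and the `after` index is a hand-built stand-in for what the question reports reading back from HDF5, so the snippet runs without writing a file.

```python
import numpy as np
import pandas as pd


def labels_lost_to_nan(original: pd.Index, restored: pd.Index) -> list:
    """Return labels of `original` that are NaN at the same position in `restored`."""
    if len(original) != len(restored):
        raise ValueError("indexes must have the same length")
    # Positions that are missing after the round trip but were not missing before.
    newly_missing = restored.isna() & ~original.isna()
    return list(original[newly_missing])


before = pd.Index(['a', 'nan', 'NA', 'na', 'NaN'], name='x')
# Hand-built stand-in for the index from the question, where 'nan' became NaN.
after = pd.Index(['a', np.nan, 'NA', 'na', 'NaN'], name='x')

print(labels_lost_to_nan(before, after))  # ['nan']
```

In real use, `before` would be `df.index` and `after` would be `df2.index`; an empty list means no label was turned into NaN.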
Or, if you want to keep the index on your original DataFrame, reset the index only when you save to HDF5:
import pandas as pd
df = pd.DataFrame(dict(x=['a', 'nan', 'NA', 'na', 'NaN'],
y=[1, 2, 3, 4, 5])).set_index('x')
df.reset_index().to_hdf('test.h5', key='test')
df2 = pd.read_hdf('test.h5')
df2 = df2.set_index('x')
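If this pattern is needed in several places, the reset-then-restore steps could be wrapped in a pair of helpers. The names `save_with_index_as_column` and `load_with_index_from_column` are hypothetical, the sketch assumes a named index and an installed PyTables, and the behaviour has only been shown here for the pandas version listed in the question.

```python
import pandas as pd


def save_with_index_as_column(df: pd.DataFrame, path: str, key: str) -> None:
    """Write df to HDF5 with its named index moved into an ordinary column."""
    if df.index.name is None:
        raise ValueError("this sketch expects a named index")
    # reset_index() turns the index into a regular column before saving.
    df.reset_index().to_hdf(path, key=key)


def load_with_index_from_column(path: str, key: str, index_col: str) -> pd.DataFrame:
    """Read the frame back and restore index_col as the index."""
    return pd.read_hdf(path, key=key).set_index(index_col)
```

Usage would mirror the snippet above: `save_with_index_as_column(df, 'test.h5', 'test')` followed by `df2 = load_with_index_from_column('test.h5', 'test', 'x')`.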