
How to add attributes to a pandas dataframe that is stored as a group in a HDF5 file?

I have a pandas DataFrame with a multi-level index (MultiIndex), created like this:

import numpy as np
import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)
store = pd.HDFStore("df.h5")
store["df"] = df
store.close()

I would like to add attributes to df stored in the HDFStore. How can I do this? There doesn't seem to be any documentation regarding the attributes, and the group that is used to store the df is not of the same type as the HDF5 Group in the h5py module:

type(list(store.groups())[0])
Out[24]: tables.group.Group

It appears to be a PyTables group, and the only attribute-related method I could find on it is this special method, which concerns ordinary Python attributes rather than HDF5 attributes:

__setattr__(self, name, value)
 |      Set a Python attribute called name with the given value.

What I would like is to simply store a bunch of DataFrames with multidimensional indices that are "marked" by attributes in a structured way, so that I can compare them and sub-select them based on those attributes.

Basically, what HDF5 is meant to be used for, combined with MultiIndex DataFrames from pandas.

There are questions, like this one, that deal with reading HDF5 files with readers other than pandas, but they all involve DataFrames with one-dimensional indices, which makes it easy to simply dump the NumPy ndarrays and store the index separately.

I haven't gotten any answers so far, and this is what I managed to do using both the pandas and the h5py modules: pandas is used to store and read the multidimensional DataFrame, and h5py to store and read the attributes of the HDF5 group:

import numpy as np
import pandas as pd
import h5py

# Create a random multidim DataFrame
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)

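# Dump the data with pandas and close the store so it is flushed to disk
with pd.HDFStore("df.h5") as pdStore:
    pdStore["df"] = df

# Store the attribute with h5py; mode "a" is needed because
# h5py >= 3.0 opens files read-only by default
with h5py.File("df.h5", "a") as h5pyFile:
    h5pyFile["/df"].attrs["number"] = 1

# Read the data conditionally, based on the stored attribute
with h5py.File("df.h5", "r") as h5pyFile:
    number = h5pyFile["/df"].attrs["number"]

readDf = pd.DataFrame()
if number == 1:
    with pd.HDFStore("df.h5", mode="r") as pdStore:
        readDf = pdStore["/df"]

# Should print all zeros if the round trip preserved the data
print(readDf - df)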

I still don't know whether there are any issues with having both h5py and pandas keep the same .h5 file open at the same time.

Adding attributes to a group from within pandas is possible by now. I could not find out which release introduced it; I tested the snippet below with pandas 1.4.2 and Python 3.10.4. According to the HDFStore section of the pandas cookbook, the following approach can be used:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 3))
store = pd.HDFStore("test.h5")
store.put("df", df)
store.get_storer("df").attrs.my_attribute = {"A": 10}
store.close()

HDFStore can also be used as a context manager, which closes the file automatically:

with pd.HDFStore("test.h5") as store:
    store.put("df", df)
    store.get_storer("df").attrs.my_attribute = {"A": 10}
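Reading the attribute back works through the same get_storer(...).attrs accessor. The following is a minimal sketch, assuming PyTables is installed; the temporary file path and the variable names stored_attribute and stored_df are made up for this illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical temporary location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "test.h5")
df = pd.DataFrame(np.random.randn(8, 3))

# Write the DataFrame and attach the attribute to its group
with pd.HDFStore(path) as store:
    store.put("df", df)
    store.get_storer("df").attrs.my_attribute = {"A": 10}

# Reopen the file read-only and read both the attribute and the data back
with pd.HDFStore(path, mode="r") as store:
    stored_attribute = store.get_storer("df").attrs.my_attribute
    stored_df = store["df"]

print(stored_attribute)
print(stored_df.shape)
```

The attribute lives in the same file as the data, so it is still available after the store has been closed and reopened.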

Note that the attribute name can be chosen freely (data_origin in the following example) and that the value does not have to be a dictionary:

store.get_storer("df").attrs.data_origin = 'random data generation'
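For the use case in the question (several MultiIndex DataFrames "marked" by attributes and sub-selected on them), the same accessor could be used in a loop over the keys of the store. This is only a sketch under stated assumptions: the keys run_1 to run_3, the attribute name number, and the selection condition are hypothetical, and PyTables must be installed:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical temporary location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "runs.h5")

iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
mindex = pd.MultiIndex.from_product(iterables, names=["first", "second"])

# Store three MultiIndex DataFrames, each marked with a "number" attribute
# (key names and attribute name are made up for this sketch)
with pd.HDFStore(path) as store:
    for number in (1, 2, 3):
        key = f"run_{number}"
        store.put(key, pd.DataFrame(np.random.randn(8, 4), index=mindex))
        store.get_storer(key).attrs.number = number

# Load only those DataFrames whose attribute satisfies the condition
selected = {}
with pd.HDFStore(path, mode="r") as store:
    for key in store.keys():
        if store.get_storer(key).attrs.number >= 2:
            selected[key] = store[key]

print(sorted(selected))
```

Only the attributes are inspected while filtering; a DataFrame is read from disk only when its attribute matches.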
