Creating HDF5 from multiple pandas DataFrames

I have 100 pandas DataFrames stored in .pkl files in a directory on my computer. I want to go through all of them and save them in a single HDF5 file. I was planning to save all the DataFrames in one pickle file, but I have heard HDF5 is significantly better and faster.

First I was doing this:

import glob
import os
import pandas as pd

path = '/Users/srayan/Desktop/data/Pickle'
df = pd.DataFrame()
for filename in glob.glob(os.path.join(path, '*.pkl')):
    newDF = pd.read_pickle(filename)
    df = df.append(newDF)  # copies the growing DataFrame on every iteration
df.to_pickle('/Users/srayan/Desktop/data/Pickle/Merged.pkl')

But the slowest part was writing the huge DataFrame back out as a pickle. Is there any way to put this large DataFrame into an HDF5 file, or any better advice on how to combine all the pickle files into one DataFrame that can be saved?

An HDF5 file is like its own filesystem internally, and you can store as many things inside it as you like. For example:

import glob
import os
import pandas as pd

for filename in glob.glob('*.pkl'):
    df = pd.read_pickle(filename)
    key = os.path.splitext(os.path.basename(filename))[0]  # or choose another name
    df.to_hdf('merged.h5', key=key)

This will store all the DataFrames into a single HDF5 file. You can either use the old filenames as the keys in the new file, or choose some other naming convention.
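
If it helps, here is a minimal sketch of getting the data back out later; it assumes the file and keys written by the loop above, and the key name 'somefile' is just a placeholder:

import pandas as pd

# List the keys stored in the file, then load one DataFrame back.
with pd.HDFStore('merged.h5', mode='r') as store:
    print(store.keys())   # e.g. ['/somefile', '/otherfile', ...]

df = pd.read_hdf('merged.h5', 'somefile')  # 'somefile' is a placeholder key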

If you prefer the data to be concatenated into a single dataset stored in HDF5:

import glob
import pandas as pd

dfs = []
for filename in glob.glob('*.pkl'):
    dfs.append(pd.read_pickle(filename))

df = pd.concat(dfs)
key = 'all_the_things'
df.to_hdf('merged.h5', key=key)

I usually enable compression within HDF5. This doesn't make the file any harder to read, and can save a lot of disk space:

df.to_hdf('merged.h5', key=key, complib='zlib', complevel=5)
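
Reading the data back is the same with or without compression; pandas decompresses transparently, for example:

df = pd.read_hdf('merged.h5', key)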
