
Convert the whole (large) schema into HDF5

I am trying to export the whole database schema (around 20 GB) with a PostgreSQL query and turn it into a single final HDF5 file.

Because that much data does not fit in my computer's memory, I am using the chunksize argument.

First I use this function to establish the connection:

def make_connectstring(prefix, db, uname, passa, hostname, port):
    """return an sql connectstring"""
    connectstring = prefix + "://" + uname + ":" + passa + "@" + hostname + \
                    ":" + port + "/" + db
    return connectstring
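
For reference, calling it with placeholder values (the credentials and host below are made up) yields a standard SQLAlchemy URL:

# hypothetical example values; substitute your own credentials
connectstring = make_connectstring(
    prefix="postgresql+psycopg2",
    db="mydb",
    uname="user",
    passa="secret",
    hostname="localhost",
    port="5432",
)
print(connectstring)  # postgresql+psycopg2://user:secret@localhost:5432/mydb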

Then I create a temporary folder to save each HDF5 chunk file:

import tempfile

import pandas as pd
import sqlalchemy

def query_to_hdf5(connectstring, query, verbose=False, chunksize=50000):

    # server_side_cursors keeps the query streaming instead of
    # loading the whole result set into memory at once
    engine = sqlalchemy.create_engine(connectstring,
        server_side_cursors=True)

    # stream the query and write each chunk to a temporary HDF5 file
    i = 0
    paths_chunks = []
    with tempfile.TemporaryDirectory() as td:
        for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
            path = td + "/chunk" + str(i) + ".hdf5"
            df.to_hdf(path, key='data')
            if verbose:
                print("wrote", path)
            paths_chunks.append(path)
            i += 1
        # note: the temporary directory (and every chunk file in it) is
        # deleted as soon as this with-block exits, so any merge step has
        # to run inside the block


connectstring = make_connectstring(prefix, db, uname, passa, hostname, port)
query = "SELECT * FROM public.zz_ges"
query_to_hdf5(connectstring, query)

What is the best way to merge all these chunk files into one single file that represents the whole dataframe?

I tried something like this:

    df = pd.DataFrame()
    for path in paths_chunks:
        df_scratch = pd.read_hdf(path)
        df = pd.concat([df, df_scratch])
        if verbose:
            print("read", path)

However, memory usage goes up very fast, since each pd.concat copies the ever-growing dataframe. I need something more efficient.
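
One way to keep memory flat is to append each chunk file to a single on-disk HDFStore instead of concatenating in memory. A minimal sketch, assuming it runs inside the with-block (so paths_chunks still points at existing files), that each chunk was written under the key 'data', and that merged.h5 is just a hypothetical output name:

with pd.HDFStore("merged.h5", mode="w") as store:
    for path in paths_chunks:
        df_scratch = pd.read_hdf(path)      # only one chunk is in memory at a time
        store.append("data", df_scratch)    # write it to the single output file
        if verbose:
            print("merged", path)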

Update:

def make_connectstring(prefix, db, uname, passa, hostname, port):
    """return an sql connectstring"""
    connectstring = prefix + "://" + uname + ":" + passa + "@" + hostname + \
                    ":" + port + "/" + db
    return connectstring

def query_to_df(connectstring, query, verbose=False, chunksize=50000):

    engine = sqlalchemy.create_engine(connectstring,
        server_side_cursors=True)

    # stream the query and append each chunk straight to a single HDF5 file
    with pd.HDFStore('output.h5', 'w') as store:
        for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
            store.append('data', df)
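
A hypothetical call of the updated function, reading a few rows back afterwards just to check the result (output.h5 and the key 'data' come from the function above):

query_to_df(connectstring, query)

# peek at the first rows of the merged file without loading all of it
df_check = pd.read_hdf('output.h5', key='data', start=0, stop=5)
print(df_check)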

I'd suggest using an HDFStore directly; that way you can append chunks as you get them from the database, something like:

with pd.HDFStore('output.h5', 'w') as store:
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)

This is based around your existing code, so it isn't complete; let me know if it isn't clear.

Note that I'm opening the store in 'w' mode, so it will delete the file every time; otherwise append would just keep adding the same rows to the end of the table. Alternatively, you could remove the key first.
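
For example, a sketch of the remove-the-key variant, which keeps the existing file but drops the old table before appending (same hypothetical output.h5 as above):

with pd.HDFStore('output.h5', 'a') as store:   # 'a' keeps the existing file
    if 'data' in store:
        store.remove('data')                   # drop any previous table first
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)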

When you open the store you also get lots of options, e.g. which compression to use, but they don't seem to be well documented; help(pd.HDFStore) describes complevel and complib for me.
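
For instance, opening the store with compression enabled looks like this; complevel and complib are the documented HDFStore parameters, and blosc is just one of the supported libraries:

with pd.HDFStore('output.h5', 'w', complevel=9, complib='blosc') as store:
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)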
