
Optimizing pd.read_sql

conn1 = pyodbc.connect('DSN=LUDP-Training Presto',uid='*****', pwd='****', autocommit=True) 

 sql_query =        "SELECT             zsourc_sy, zmsgeo, salesorg, crm_begdat, zmcn, zrtm, crm_obj_id, zcrmprod, prod_hier, hier_type, zsoldto, zendcst, crmtgqtycv, currency, zwukrs, netvalord, zgtnper,zsub_4_t \
                    FROM                `prd_updated`.`bw_ms_zocsfs05l_udl` \
                    WHERE               zdcgflag = 'DCG' AND crm_begdat >= '20200101' AND zmsgeo IN ('AP', 'LA', 'EMEA', 'NA')"

I have to load the result of the query above into a pandas DataFrame, but the pd.read_sql call has been running for more than a couple of hours because the table has more than 10 million rows. Is there a way to speed this process up?

contract_table = pd.read_sql(sql_query,conn1)

You can pass a chunksize parameter to read_sql (docs), which turns the call into an iterator that yields DataFrames of at most the specified number of rows each.

df_iter = pd.read_sql(sql_query, conn1, chunksize=100)

for df in df_iter:                 # each df holds up to 100 rows in this example
    for row in df.itertuples():    # iterate over the rows of the chunk
        # do work here
        pass

Iterating in chunks like this is an efficient way of processing data that's too large to fit in memory all at once.
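For instance, if what you ultimately need is a summary rather than the full 10-million-row table, you can reduce each chunk as it arrives and keep only the running result. The sketch below reuses sql_query and conn1 from the question; the per-region sum of netvalord and the chunksize of 100_000 are just illustrative assumptions, not part of the original answer.

import pandas as pd

partial_sums = []
# A larger chunksize than 100 means fewer round trips to the database;
# 100_000 is an arbitrary starting point, tune it to your available memory.
for chunk in pd.read_sql(sql_query, conn1, chunksize=100_000):
    # Reduce each chunk immediately so only the small summary stays in memory.
    partial_sums.append(chunk.groupby('zmsgeo')['netvalord'].sum())

# Combine the per-chunk summaries into the final totals per region.
totals = pd.concat(partial_sums).groupby(level=0).sum()

If you do need the complete DataFrame in memory, chunking mainly helps you monitor progress and control memory spikes; the total transfer time is still bounded by how fast the database can stream 10 million rows.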

