
Optimizing pd.read_sql

conn1 = pyodbc.connect('DSN=LUDP-Training Presto',uid='*****', pwd='****', autocommit=True) 

 sql_query =        "SELECT             zsourc_sy, zmsgeo, salesorg, crm_begdat, zmcn, zrtm, crm_obj_id, zcrmprod, prod_hier, hier_type, zsoldto, zendcst, crmtgqtycv, currency, zwukrs, netvalord, zgtnper,zsub_4_t \
                    FROM                `prd_updated`.`bw_ms_zocsfs05l_udl` \
                    WHERE               zdcgflag = 'DCG' AND crm_begdat >= '20200101' AND zmsgeo IN ('AP', 'LA', 'EMEA', 'NA')"

I have to load the result of the query above into a pandas DataFrame, but the pd.read_sql call has been running for more than a couple of hours because the table has more than 10 million rows. Is there a way to speed this process up?

contract_table = pd.read_sql(sql_query,conn1)

You can pass a chunksize parameter to read_sql (docs), which turns the call into an iterator that yields DataFrames of at most the specified number of rows each.

df_iter = pd.read_sql(sql_query, conn1, chunksize=100)

for df in df_iter:                 # each df holds up to 100 rows in this example
    for row in df.itertuples():    # iterate over the rows of the chunk
        # do work here
        pass

Iterating in chunks like this is an efficient way of processing data that's too large to fit in memory all at once.
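For instance, if what you ultimately need is a summary rather than the full 10-million-row table, you can reduce each chunk as it arrives and keep only the running result. The sketch below reuses sql_query and conn1 from the question; the per-region sum of netvalord and the chunksize of 100_000 are just illustrative assumptions, not part of the original answer.

import pandas as pd

partial_sums = []
# A larger chunksize than 100 means fewer round trips to the database;
# 100_000 is an arbitrary starting point, tune it to your available memory.
for chunk in pd.read_sql(sql_query, conn1, chunksize=100_000):
    # Reduce each chunk immediately so only the small summary stays in memory.
    partial_sums.append(chunk.groupby('zmsgeo')['netvalord'].sum())

# Combine the per-chunk summaries into the final totals per region.
totals = pd.concat(partial_sums).groupby(level=0).sum()

If you do need the complete DataFrame in memory, chunking mainly helps you monitor progress and control memory spikes; the total transfer time is still bounded by how fast the database can stream 10 million rows.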

