
What is an optimal way to read huge data from an Oracle table and fetch it into a data frame

I want to read data from a table in my Oracle database and fetch it into a pandas DataFrame in Python. The table has 22 million records, and using fetchall() takes a very long time without returning any result. (The query itself runs in Oracle in 1 second.)

I have tried slicing the data with the code below, but it is still not efficient.

import cx_Oracle
import pandas as pd
from pandas import DataFrame

connect_serv = cx_Oracle.connect(user='', password='', dsn='')
cur = connect_serv.cursor()

table_row_count = 22242387
batch_size = 100000

sql = """select t.* from (select a.*, ROW_NUMBER() OVER (ORDER BY column1) as row_num from table1 a) t where t.row_num between :LOWER_BOUND and :UPPER_BOUND"""

data = []
for lower_bound in range(0, table_row_count, batch_size):
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)

I would like to know what is the proper solution to fetch this amount of data in python in a reasonable time.

It's not the query that is slow, it's the stacking of the data with data.append(row).

Try using

data.extend(cur.fetchall())

for starters. It avoids the repeated single-row appends and instead adds the entire set of rows returned by fetchall() in one call.
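
For illustration, here is a minimal sketch of how the batching loop from the question might look with extend(). The table1/column1 names, the row count, and the empty connection details are placeholders taken from the question:

import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect(user='', password='', dsn='')  # placeholder credentials/DSN
cur = conn.cursor()

table_row_count = 22242387
batch_size = 100000

sql = """select t.* from (select a.*, ROW_NUMBER() OVER (ORDER BY column1) as row_num
         from table1 a) t
         where t.row_num between :LOWER_BOUND and :UPPER_BOUND"""

data = []
for lower_bound in range(0, table_row_count, batch_size):
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    # extend() adds the whole batch in one call instead of appending row by row
    data.extend(cur.fetchall())

# build the DataFrame once at the end; column names come from the cursor description
df = pd.DataFrame(data, columns=[d[0] for d in cur.description])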

You will also have to tune the arraysize and prefetchrows parameters. I was having the same issue, and increasing arraysize resolved it. Choose the values based on the memory you have available.

Link: https://cx-oracle.readthedocs.io/en/latest/user_guide/tuning.html?highlight=arraysize#choosing-values-for-arraysize-and-prefetchrows
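
As a rough sketch of that tuning approach (assuming cx_Oracle 7.3 or later, where Cursor.prefetchrows is available; the connection details, table name, and the specific values are placeholders to adjust for your memory budget):

import cx_Oracle
import pandas as pd

conn = cx_Oracle.connect(user='', password='', dsn='')  # placeholder credentials/DSN
cur = conn.cursor()

# Larger values mean fewer network round-trips at the cost of more memory.
cur.arraysize = 10000      # rows transferred per internal fetch call
cur.prefetchrows = 10000   # rows prefetched during the execute round-trip

cur.execute("select * from table1")  # table1 as used in the question
df = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])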
