
Fastest way to read huge volume of data as dataframe from Oracle using Python

I need to read a huge amount of data from Oracle (around 1 million rows and 450 columns) and do a bulk load into Greenplum. I am using the following approach:

import io
import pandas as pd
from sqlalchemy import create_engine
import cx_Oracle   # Oracle driver used by the SQLAlchemy dialect
import psycopg2    # Greenplum/PostgreSQL driver providing copy_expert

# Source: Oracle connection via SQLAlchemy
engineor = create_engine('oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx')
sql = "select * from xxxxxx"

# Target: Greenplum; a raw DBAPI connection is needed for copy_expert
enginegp = create_engine('xxxxx@xxxxx:xxxx/xxxx')
connection = enginegp.raw_connection()

# Read from Oracle in chunks and accumulate everything as CSV in memory
output = io.StringIO()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False, mode='a')

# Rewind the buffer and bulk load into Greenplum with COPY
output.seek(0)
cur = connection.cursor()
cur.copy_expert("COPY test FROM STDIN WITH CSV NULL '' ", output)
connection.commit()
cur.close()

I have been reading the data in chunks:

for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False,mode='a')

Is there a quicker, more reliable way to read big tables from Oracle as a dataframe? The approach above works, but it isn't seamless: the connection to Oracle sometimes times out or gets killed by the DBA, so the job only succeeds some of the time, which seems fragile given the table size. I need the data as a dataframe because I have to load it into Greenplum afterwards using the copy method.
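One variation worth trying is to stream each chunk into Greenplum as soon as it is read, instead of accumulating the whole table in a single StringIO buffer. This keeps memory usage flat and shortens how long any single Oracle fetch or Greenplum COPY runs. The sketch below is only an outline under assumptions: the connection strings and the table names source_table and test are placeholders, and it assumes a postgresql+psycopg2 URL for Greenplum and that the target table already exists.

import io
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings -- substitute real credentials.
engineor = create_engine('oracle+cx_oracle://user:pwd@host:port/?service_name=svc')
enginegp = create_engine('postgresql+psycopg2://user:pwd@host:port/dbname')

sql = "select * from source_table"

connection = enginegp.raw_connection()
try:
    cur = connection.cursor()
    # COPY each chunk as soon as it arrives instead of buffering the whole table.
    for df in pd.read_sql(sql, engineor, chunksize=10000):
        buf = io.StringIO()
        df.to_csv(buf, header=False, index=False)
        buf.seek(0)
        cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", buf)
        connection.commit()  # per-chunk commit; drop if a single transaction is required
    cur.close()
finally:
    connection.close()

The trade-off of committing per chunk is that a failure mid-run leaves a partial load, so loading into a staging table (or truncating before retrying) is safer; it does not, however, remove the risk of the Oracle session itself being killed during a long read.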

Outsourcer was created specifically to do what you are trying to do, but it is written in Java.

http://www.pivotalguru.com/?page_id=20
