
Fastest way to read huge volume of data as dataframe from Oracle using Python

I need to read a huge amount of data from Oracle (around 1 million rows and 450 columns) and do a bulk load into Greenplum. I am using the following approach:

import io
import pandas as pd
from sqlalchemy import create_engine
import cx_Oracle   # Oracle driver used by the SQLAlchemy dialect
import psycopg2    # Greenplum/PostgreSQL driver providing copy_expert

# Source: Oracle connection via SQLAlchemy
engineor = create_engine('oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx')
sql = "select * from xxxxxx"

# Target: Greenplum; a raw DBAPI connection is needed for copy_expert
enginegp = create_engine('xxxxx@xxxxx:xxxx/xxxx')
connection = enginegp.raw_connection()

# Read from Oracle in chunks and accumulate everything as CSV in memory
output = io.StringIO()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False, mode='a')

# Rewind the buffer and bulk load into Greenplum with COPY
output.seek(0)
cur = connection.cursor()
cur.copy_expert("COPY test FROM STDIN WITH CSV NULL '' ", output)
connection.commit()
cur.close()

I have been reading the data in chunks:

for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False,mode='a')

Is there a quicker, more reliable way to read big tables from Oracle as a dataframe? The approach above works, but it isn't seamless: the connection to Oracle sometimes times out or gets killed by the DBA, so the job only succeeds some of the time, which seems fragile given the table size. I need the data as a dataframe because I have to load it into Greenplum afterwards using the copy method.
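One variation worth trying is to stream each chunk into Greenplum as soon as it is read, instead of accumulating the whole table in a single StringIO buffer. This keeps memory usage flat and shortens how long any single Oracle fetch or Greenplum COPY runs. The sketch below is only an outline under assumptions: the connection strings and the table names source_table and test are placeholders, and it assumes a postgresql+psycopg2 URL for Greenplum and that the target table already exists.

import io
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings -- substitute real credentials.
engineor = create_engine('oracle+cx_oracle://user:pwd@host:port/?service_name=svc')
enginegp = create_engine('postgresql+psycopg2://user:pwd@host:port/dbname')

sql = "select * from source_table"

connection = enginegp.raw_connection()
try:
    cur = connection.cursor()
    # COPY each chunk as soon as it arrives instead of buffering the whole table.
    for df in pd.read_sql(sql, engineor, chunksize=10000):
        buf = io.StringIO()
        df.to_csv(buf, header=False, index=False)
        buf.seek(0)
        cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", buf)
        connection.commit()  # per-chunk commit; drop if a single transaction is required
    cur.close()
finally:
    connection.close()

The trade-off of committing per chunk is that a failure mid-run leaves a partial load, so loading into a staging table (or truncating before retrying) is safer; it does not, however, remove the risk of the Oracle session itself being killed during a long read.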

Outsourcer was created specifically to do what you are trying to do, but it is written in Java.

http://www.pivotalguru.com/?page_id=20
