Fastest way to read huge volume of data as dataframe from Oracle using Python
I need to read a huge amount of data from Oracle (around 1 million rows and 450 columns) and bulk-load it into Greenplum. I am using the following approach:
import io

import pandas as pd
from sqlalchemy import create_engine
# cx_Oracle and psycopg2 must be installed; SQLAlchemy loads them as drivers

# Source: Oracle
engineor = create_engine('oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx')
sql = "select * from xxxxxx"

# Target: Greenplum (PostgreSQL-compatible)
enginegp = create_engine('postgresql+psycopg2://xxxxx@xxxxx:xxxx/xxxx')
connection = enginegp.raw_connection()

# Stage the whole result set as CSV in memory, then COPY it in one shot
output = io.StringIO()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False)
output.seek(0)

cur = connection.cursor()
cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", output)
connection.commit()
cur.close()
I have been reading the data in chunks:

for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False)
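If the bottleneck is the Oracle fetch itself, one lever worth checking is cx_Oracle's cursor arraysize, which controls how many rows come back per network round trip. A minimal sketch, assuming the same placeholder connection details as above and an arraysize of 10000 picked purely as a starting point to tune:

# Sketch: raise cx_Oracle's fetch arraysize through SQLAlchemy's
# oracle+cx_oracle dialect so each round trip returns more rows.
from sqlalchemy import create_engine

engineor = create_engine(
    'oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx',
    arraysize=10000,  # the driver default is far smaller; tune to your row width
)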
Is there a quicker and more seamless way to read big tables from Oracle as a dataframe? This method works, but it doesn't feel seamless: the connection to Oracle sometimes times out or is killed by the DBA, while at other times it runs successfully. It seems unreliable given the table size. I need the data as a dataframe because I have to load it into Greenplum later using the copy method.
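One variation that may help with both memory use and the timeouts: instead of accumulating the entire table in a single StringIO and copying once, COPY each chunk as soon as it is read, so nothing larger than one chunk is held in memory and the Oracle connection is drained steadily rather than sitting idle. A minimal sketch under the same assumed placeholders (target table test, engines as defined above):

import io

import pandas as pd

cur = connection.cursor()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    buf = io.StringIO()
    df.to_csv(buf, header=False, index=False)
    buf.seek(0)
    # COPY just this chunk into Greenplum, then discard the buffer
    cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", buf)
connection.commit()
cur.close()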
Outsourcer was specifically created to do what you are trying to do, but it was written in Java.

http://www.pivotalguru.com/?page_id=20