Fastest way to read huge volume of data as dataframe from Oracle using Python
I need to read a huge amount of data from Oracle (around 1 million rows and 450 columns) and bulk-load it into Greenplum. I am using the following approach:
import io

import pandas as pd
from sqlalchemy import create_engine
# cx_Oracle and psycopg2 must be installed; SQLAlchemy loads them as drivers

# Source: Oracle
engineor = create_engine('oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx')
sql = "select * from xxxxxx"

# Target: Greenplum (PostgreSQL-compatible)
enginegp = create_engine('postgresql+psycopg2://xxxxx@xxxxx:xxxx/xxxx')
connection = enginegp.raw_connection()

# Stage the whole result set as CSV in memory, then COPY it in one shot
output = io.StringIO()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False)
output.seek(0)

cur = connection.cursor()
cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", output)
connection.commit()
cur.close()
I have been reading the data in chunks:

for df in pd.read_sql(sql, engineor, chunksize=10000):
    df.to_csv(output, header=False, index=False)
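If the bottleneck is the Oracle fetch itself, one lever worth checking is cx_Oracle's cursor arraysize, which controls how many rows come back per network round trip. A minimal sketch, assuming the same placeholder connection details as above and an arraysize of 10000 picked purely as a starting point to tune:

# Sketch: raise cx_Oracle's fetch arraysize through SQLAlchemy's
# oracle+cx_oracle dialect so each round trip returns more rows.
from sqlalchemy import create_engine

engineor = create_engine(
    'oracle+cx_oracle://xxxx:xxxx@xxxxx:xxxx/?service_name=xxxxx',
    arraysize=10000,  # the driver default is far smaller; tune to your row width
)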
Is there a quicker and more seamless way to read big tables from Oracle as a dataframe? This method works, but it doesn't feel seamless: the connection to Oracle sometimes times out or is killed by the DBA, while at other times it runs successfully. It seems unreliable given the table size. I need the data as a dataframe because I have to load it into Greenplum later using the copy method.
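One variation that may help with both memory use and the timeouts: instead of accumulating the entire table in a single StringIO and copying once, COPY each chunk as soon as it is read, so nothing larger than one chunk is held in memory and the Oracle connection is drained steadily rather than sitting idle. A minimal sketch under the same assumed placeholders (target table test, engines as defined above):

import io

import pandas as pd

cur = connection.cursor()
for df in pd.read_sql(sql, engineor, chunksize=10000):
    buf = io.StringIO()
    df.to_csv(buf, header=False, index=False)
    buf.seek(0)
    # COPY just this chunk into Greenplum, then discard the buffer
    cur.copy_expert("COPY test FROM STDIN WITH CSV NULL ''", buf)
connection.commit()
cur.close()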
Outsourcer was specifically created to do what you are trying to do, but it was written in Java.

http://www.pivotalguru.com/?page_id=20