简体   繁体   中英

Batch downloading of table using cx_oracle

I need to download a large table from an oracle database into a python server, using cx_oracle to do so. However, the ram is limited on the python server and so I need to do it in a batch way.

I know already how to do generally a whole table

usr = ''
pwd = ''
tns = '(Description = ...'
orcl = cx_Oracle.connect(user, pwd, tns)
curs = orcl.cursor()
printHeader=True
tabletoget = 'BIGTABLE'
sql = "SELECT * FROM " + "SCHEMA." + tabletoget
curs.execute(sql)
data = pd.read_sql(sql, orcl)
data.to_csv(tabletoget + '.csv' 

I'm not sure what to do though to load say a batch of 10000 rows at a time and then save it off to a csv and then rejoin.

You can use cx_Oracle directly to perform this sort of batch:

curs.arraysize = 10000
curs.execute(sql)
while True:
    rows = cursor.fetchmany()
    if rows:
        write_to_csv(rows)
    if len(rows) < curs.arraysize:
        break

If you are using Oracle Database 12c or higher you can also use the OFFSET and FETCH NEXT ROWS options, like this:

offset = 0
numRowsInBatch = 10000
while True:
    curs.execute("select * from tabletoget offset :offset fetch next :nrows only",
            offset=offset, nrows=numRowsInBatch)
    rows = curs.fetchall()
    if rows:
        write_to_csv(rows)
    if len(rows) < numRowsInBatch:
        break
    offset += len(rows)

This option isn't as efficient as the first one and involves giving the database more work to do but it may be better for you depending on your circumstances.

None of these examples use pandas directly. I am not particularly familiar with that package, but if you (or someone else) can adapt this appropriately, hopefully this will help!

You can achieve your result like this. Here I am loading data to df.

import cx_Oracle
import time
import pandas

user = "test"
pw = "test"
dsn="localhost:port/TEST"

con = cx_Oracle.connect(user,pw,dsn)
start = time.time()
cur = con.cursor()
cur.arraysize = 10000
try:
    cur.execute( "select * from test_table" )
    names = [ x[0] for x in cur.description]
    rows = cur.fetchall()
    df=pandas.DataFrame( rows, columns=names)
    print(df.shape)
    print(df.head())
finally:
    if cur is not None:
        cur.close()

elapsed = (time.time() - start)
print(elapsed, "seconds")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM