
Loading huge XLS data into Oracle using Python

I have an XLS file with 3+ million records which I need to dump into an Oracle 12c DB (direct dump) using Python 2.7.

I am using the cx_Oracle package to connect to Oracle, but reading and dumping the XLS (using the openpyxl package) is extremely slow, and performance degrades further as the row count climbs into the hundreds of thousands and millions.

From a scripting standpoint I have tried two approaches:

  1. Bulk load: reading all the values into an array and then dumping them with a prepared statement (with bind variables) and cursor.executemany. This doesn't scale well to huge data.

  2. Iterative loading: inserting rows as they are fetched. Even this way has performance issues.

What options and techniques/packages can I deploy as a best practice to load this volume of data from XLS into an Oracle DB? Is it advisable to load this volume of data via scripting, or should I necessarily use an ETL tool? As of now I only have the option of Python scripting, so please address the former.
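For context on the first approach above: the usual fix for slow Python inserts is to stream the workbook (openpyxl's read-only mode) and push rows to cx_Oracle in fixed-size batches via cursor.executemany, rather than materialising everything first. A minimal sketch of that batching pattern follows; the function name, table, bind style, and batch size are illustrative assumptions, not code from the question:

```python
def load_in_batches(cursor, insert_sql, rows, batch_size=5000):
    """Send rows to the DB in fixed-size chunks via executemany.

    `rows` can be any iterable, so an openpyxl read-only worksheet
    can be streamed without holding the whole file in memory.
    Returns the total number of rows sent.
    """
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(insert_sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        cursor.executemany(insert_sql, batch)
        total += len(batch)
    return total

# Hypothetical usage against a real connection (not runnable as-is):
# conn = cx_Oracle.connect(user, password, dsn)
# ws = openpyxl.load_workbook('big.xlsx', read_only=True).active
# load_in_batches(conn.cursor(),
#                 "INSERT INTO my_table (a, b) VALUES (:1, :2)",
#                 ws.iter_rows(min_row=2, values_only=True))
```

The point is that each executemany call pays the network round-trip once per batch instead of once per row, which is where most of the time goes in naive loops.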

If you can export your Excel file to CSV, then you just need to use sqlldr to load the file into the DB.

Excel also comes with ODBC support, so you could pump data straight from Excel to Oracle, assuming you have the drivers. That said, anything that involves transforming a large amount of data in memory (from whatever Excel uses internally) and then passing it to the DB is likely to be less performant than a specialised bulk operation, which can be optimised to use less memory. Going through Python just adds another layer to the task (Excel to Python to Oracle), though it might be possible to set this up to use streams.

Basically, for high-volume data any language is going to be stressed on I/O, except C. The best approach is to use the native tools/utilities provided by the DB vendor. For Oracle, the right fit is SQL*Loader.

Refer to this link for a quick tutorial: http://www.thegeekstuff.com/2012/06/oracle-sqlldr/

Here you go… sample code that runs SQL*Loader and gets you back the return code, output, and errors:

import subprocess

# Each KEYWORD=value pair must be a single argv element,
# otherwise sqlldr sees 'CONTROL=' and the path as separate arguments.
sql_ld_command = ['sqlldr', 'uid/passwd',
                  'CONTROL=your_ctrl_file_path',
                  'DATA=your_data_file_path']

sql_ldr_proc = subprocess.Popen(sql_ld_command, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)

out, err = sql_ldr_proc.communicate()  # waits for the process to finish
retn_code = sql_ldr_proc.returncode    # 0 = success, 2 = warning, 1/3 = failure

Here are all of the steps: load the xlsx, produce a csv (tab-delimited) and a ctrl file, and load with sqlldr.

# %%
import subprocess

import pandas as pd
# %%
user = 'in_user_name'
password = 'in_password'
host = 'in_host'
database = 'in_service_name'
in_file = r"in_file.xlsx"
in_sheet_name = 'in_sheet'
tablename = 'in_table'

# %%
df = pd.read_excel(in_file, sheet_name=in_sheet_name)
print(f"Loaded {df.shape[0]} records from {in_file}")
# %%
infile = f'{tablename}.csv'
controlfile = f'{tablename}.ctrl'
# %%
df.to_csv(infile, index=False, sep='\t')
# %%
columns = df.columns.tolist()
with open(controlfile, 'w') as file:
    header = f"""OPTIONS (SKIP=1, DIRECT=TRUE)
LOAD DATA
INFILE '{infile}'
BADFILE '{tablename}.bad'
DISCARDFILE '{tablename}.dsc'
TRUNCATE
INTO TABLE {tablename}
FIELDS TERMINATED BY X'9'
TRAILING NULLCOLS
( """
    file.write(header)
    for c in columns[:-1]:
        file.write(f'{c},\n')
    file.write(f'{columns[-1]})')
# %%
sqlldr_command = f"""sqlldr USERID='{user}/{password}@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME={database})))' control={controlfile}"""
print(f"Running sqlldr. Log file: {tablename}.log")
subprocess.call(sqlldr_command, shell=True)

Automate the export of XLSX to CSV as mentioned in a previous answer. But instead of then calling a sqlldr script, create an external table that uses your sqlldr code. It will load your table from the CSV each time the table is selected from.
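An external table reuses essentially the same access parameters as the control file generated above, but wrapped in DDL so Oracle reads the CSV on every query. A hedged sketch, assuming a tab-delimited file in a DB directory object named `data_dir` and made-up column definitions:

```
CREATE TABLE in_table_ext (
  col1 VARCHAR2(100),
  col2 NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY X'9'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('in_table.csv')
);
```

From there, `INSERT INTO in_table SELECT * FROM in_table_ext` gives you a one-statement load, and replacing the CSV refreshes what the external table returns.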
