Improving database query speed with Python
Edit - I am using Windows 10
Is there a faster alternative to pd.read_sql_query for an MS SQL database?
I was using pandas to read the data and add some columns and calculations on the data. I have cut out most of the alterations now and I am basically just reading the data (1-2 million rows per day at a time; my query reads all of the data from the previous date) and saving it to a local database (Postgres).
The server I am connecting to is across the world and I have no privileges at all other than to query for the data. I want the solution to remain in Python if possible. I'd like to speed it up though and remove any overhead. Also, you can see that I am writing a file to disk temporarily and then opening it to COPY FROM STDIN. Is there a way to skip the file creation? It is sometimes over 500 MB, which seems like a waste.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])
df.to_csv('../raw/temp_table.csv', index=False)
f = open('../raw/temp_table.csv')
process_file(conn=pg_engine, table_name=table_name, file_object=f)
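On the skip-the-file question: psycopg2's copy_expert accepts any file-like object, so the CSV can be streamed through an in-memory io.StringIO buffer instead of being written to disk. A minimal sketch (copy_df and df_to_csv_buffer are hypothetical helper names, and conn is assumed to be a raw psycopg2 connection):

```python
import io

import pandas as pd


def df_to_csv_buffer(df):
    """Serialise a DataFrame to CSV in memory, ready for COPY FROM STDIN."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)  # COPY expects raw rows, no header
    buf.seek(0)  # rewind so COPY reads from the start
    return buf


def copy_df(df, conn, table_name):
    """Stream df into Postgres via COPY without a temporary file on disk."""
    with conn.cursor() as cur:
        cur.copy_expert(
            'COPY {} FROM STDIN WITH (FORMAT csv)'.format(table_name),
            df_to_csv_buffer(df),
        )
    conn.commit()
```

This trades the 500 MB temp file for memory usage roughly the size of the serialised data; for very large days you could chunk the DataFrame and COPY each chunk.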
UPDATE:
You can also try to unload the data using the bcp utility, which might be a lot faster compared to pd.read_sql(), but you will need a local installation of the Microsoft Command Line Utilities for SQL Server.
After that you can use PostgreSQL's COPY ... FROM ...
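To stay in Python, the bcp step can be driven with subprocess. A sketch that only builds the command line (bcp_export_cmd is a hypothetical helper; server, user, and password are placeholders, and -c / -t are bcp's character-mode and field-terminator flags):

```python
import subprocess


def bcp_export_cmd(query, out_path, server, user, password):
    """Build the bcp invocation that dumps a query result to a flat file."""
    return [
        'bcp', query, 'queryout', out_path,
        '-S', server, '-U', user, '-P', password,
        '-c',         # character (text) mode
        '-t', ',',    # comma field terminator
    ]


def run_bcp(query, out_path, server, user, password):
    """Run bcp; requires the SQL Server command-line utilities on PATH."""
    subprocess.run(
        bcp_export_cmd(query, out_path, server, user, password),
        check=True,
    )
```

The resulting file can then be loaded with COPY ... FROM ... on the Postgres side (or copy_expert from Python, since server-side COPY needs the file to be readable by the database server).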
OLD answer:
You can try to write your DF directly to PostgreSQL (skipping the df.to_csv(...) and f = open('../raw/temp_table.csv') parts):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])

pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
df.to_sql(table_name, pg_engine, if_exists='append', index=False)
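If plain to_sql is slow, two pandas parameters are worth trying: method='multi' (packs many rows into each INSERT statement) and chunksize (bounds the rows sent per batch). A sketch, demonstrated against an in-memory SQLite engine only because it needs no server; the same call would target pg_engine:

```python
import pandas as pd
from sqlalchemy import create_engine


def load_df(df, engine, table_name):
    """Append df using batched multi-row INSERT statements."""
    df.to_sql(table_name, engine, if_exists='append', index=False,
              method='multi', chunksize=10_000)


# demo with SQLite standing in for Postgres
engine = create_engine('sqlite://')
df = pd.DataFrame({'a': [1, 2, 3]})
load_df(df, engine, 'demo')
```

Even so, multi-row INSERTs rarely beat COPY for millions of rows, so it is worth benchmarking both paths on your data.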
Just test whether it's faster compared to COPY FROM STDIN ...