copy MySQL query result to tempfile in Python
I'm kinda new to the SQL world, but I was following a tutorial called Optimizing pandas.read_sql for Postgres. The thing is, I'm working with a big dataset, similar to the example in the tutorial, and I need a faster way to execute my query and turn it into a DataFrame. There, they use this function:
def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)  # I want to replicate this
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
And I tried to replicate it, like this:
def read_sql_tmpfile(query, connection):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        cur = connection.cursor()
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
The thing is, cursor.copy_expert comes from the psycopg2 library for PostgreSQL, and I can't find a way to do the same thing with pymysql. Is there any way to do this? What should I do? Thanks
I'm aware that the question is basically answered by wa.netech's comment. But I was interested, and the details and implications are not always obvious, so here is a tested, copy-pastable solution.
Since the output file ends up on the DB server, the solution involves handling the temp directory on the server and transferring the file to the client. For the sake of simplicity I used SSH & SFTP for this. This assumes that the SSH keys of both machines have been exchanged beforehand. The remote file transfer and handling may be easier with a Samba share or something like that.
@Nick ODell: Please give this solution a chance and run a benchmark. I'm pretty sure the copy overhead isn't significant for larger amounts of data.
def read_sql_tmpfile(query, connection):
    df = None
    # Create a unique temp directory on the server side
    cmd = "mktemp -d"
    (out_mktemp, err) = subprocess.Popen(f'ssh {username}@{db_server} "{cmd}"', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
    if err or not out_mktemp:
        return
    # remove additional whitespace around the output
    tmp_dir = out_mktemp.strip().decode()
    # The following command should be made superfluous by tweaking the group memberships
    # to grant the `mysql` user full access to the directory created by the user which
    # executes the `mktemp` command
    cmd = f"chmod 777 -R {tmp_dir}"
    res = os.system(f'ssh {username}@{db_server} "{cmd}"')
    if res:
        return
    try:
        remote_tmp_file = f'{tmp_dir}/sql_tmpfile'
        # remember: the db connection's user needs the `FILE` privilege
        # think about SQL injection: pass MySQL parameters in the query and a
        # corresponding parameters list to this function if appropriate
        copy_sql = f"{query} INTO OUTFILE '{remote_tmp_file}'"
        cur = connection.cursor()
        cur.execute(copy_sql)
        local_tmp_file = os.path.basename(remote_tmp_file)
        cmd = f"sftp {username}@{db_server}:{remote_tmp_file} {local_tmp_file}"
        res = os.system(cmd)
        if not res and os.path.isfile(local_tmp_file):
            try:
                df = pandas.read_csv(local_tmp_file)
            finally:
                # cleanup local temp file
                os.remove(local_tmp_file)
    finally:
        # cleanup remote temp dir
        cmd = f"rm -R {tmp_dir}"
        os.system(f'ssh {username}@{db_server} "{cmd}"')
    return df
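As an aside, the shell=True string building in this function is fragile if the remote path ever contains shell metacharacters. A sketch of the same remote-command step using an argument list instead (run_remote and its _argv test hook are illustrative names, not part of the original answer; echo stands in for a real ssh call so the sketch runs without a server):

```python
import subprocess

def run_remote(username, host, cmd, _argv=None):
    """Run `cmd` on `host` via ssh; return its stripped stdout, or None on failure."""
    argv = _argv if _argv is not None else ["ssh", f"{username}@{host}", cmd]
    # An argument list avoids shell quoting issues on the local side.
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0 or not proc.stdout:
        return None
    return proc.stdout.strip()

# Demonstration without a real server: `echo` stands in for ssh here.
out = run_remote("user", "db-host", "mktemp -d", _argv=["echo", "/tmp/tmp.abc123"])
```

The same helper could then run the chmod, sftp, and cleanup steps, checking each return value.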
Assuming that Nick's question is

How can I create a CSV file on the client from a MySQL table?
At a command-line prompt, do
mysql -u ... -p -h ... dbname -e '...' >localfile.csv
where the executable statement is something like
SELECT col1, col2, col3, col4
FROM mytable
Notes:
- This is done in a command-line window; Windows: cmd; *nix: some 'terminal' app.
- dbname has the effect of "USE dbname;".
- WHERE (etc.) can be included as needed.
- SHOW ... acts very much like SELECT.
- The header row can optionally be skipped via an option to mysql.

Example (without -u -p -h showing):
# mysql -e "show variables like 'max%size'" | tr '\t' ','
Variable_name,Value
max_binlog_cache_size,18446744073709547520
max_binlog_size,104857600
max_binlog_stmt_cache_size,18446744073709547520
max_heap_table_size,16777216
max_join_size,18446744073709551615
max_relay_log_size,0
To figure out which of these answers was fastest, I benchmarked each of them on a synthetic dataset. This dataset consisted of 100MB of time-series data and 500MB of text data. (Note: this is measured using Pandas, which heavily penalizes small objects versus data which can be represented in NumPy.)
I benchmarked 5 methods: naive (a baseline using read_sql()), plus tofile, sftp, pipe, and pipe_no_fcntl, which appear in the tables below.

All methods were tried seven times, in random order. In the following tables, a lower score is better.
Time series benchmark:

| Method | Time (s) | Standard Error (s) |
|---|---|---|
| pipe | 6.719870 | 0.064610 |
| pipe_no_fcntl | 7.243937 | 0.104802 |
| tofile | 7.636196 | 0.125963 |
| sftp | 9.926580 | 0.171262 |
| naive | 11.125657 | 0.470146 |
Text benchmark:

| Method | Time (s) | Standard Error (s) |
|---|---|---|
| pipe | 8.452694 | 0.217661 |
| tofile | 9.502743 | 0.265003 |
| pipe_no_fcntl | 9.620349 | 0.420255 |
| sftp | 12.189046 | 0.294148 |
| naive | 13.769322 | 0.695961 |
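As a side note, the trial methodology (several timed runs in shuffled order, reporting mean and standard error) can be sketched as a small harness. The names and toy workloads here are illustrative, not the actual benchmark code:

```python
import random
import statistics
import time

def benchmark(methods, trials=7):
    """methods: name -> zero-argument callable. Returns name -> (mean, std error)."""
    timings = {name: [] for name in methods}
    # Interleave all trials in random order so system drift affects methods equally.
    order = [name for name in methods for _ in range(trials)]
    random.shuffle(order)
    for name in order:
        start = time.perf_counter()
        methods[name]()
        timings[name].append(time.perf_counter() - start)
    return {
        name: (statistics.mean(ts), statistics.stdev(ts) / len(ts) ** 0.5)
        for name, ts in timings.items()
    }

# Toy workloads standing in for the real read-SQL methods.
results = benchmark({"fast": lambda: None, "slow": lambda: time.sleep(0.01)})
```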
This is the pipe method, which was fastest:
import os
import pandas as pd
import subprocess
import tempfile
import time
import fcntl

db_server = '...'
F_SETPIPE_SZ = 1031

def read_sql_pipe(query, database):
    args = ['mysql', f'--login-path={db_server}', database, '-B', '-e', query]
    try:
        # Run mysql and capture output
        proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    except FileNotFoundError:
        # MySQL is not installed. Raise a better error message.
        raise Exception("The mysql command is not installed. Use brew or apt to install it.") from None

    # Raise amount of CSV data buffered up to 1MB.
    # This is a Linux-only syscall.
    fcntl.fcntl(proc.stdout.fileno(), F_SETPIPE_SZ, 1 << 20)

    df = pd.read_csv(proc.stdout, delimiter='\t')

    retcode = proc.wait()
    if retcode != 0:
        raise subprocess.CalledProcessError(
            retcode, proc.args, output=proc.stdout, stderr=proc.stderr
        )
    return df
The basic idea is to use the subprocess module to invoke mysql, with the stdout of MySQL being fed to a pipe. A pipe is a file-like object, which can be passed directly to pd.read_csv(). The MySQL process creates the CSV concurrently with Pandas reading it, so this gives an advantage over methods which write the entire file before Pandas starts reading.
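The same streaming behaviour can be sketched without a database by piping any tab-separated producer into pandas (here a small Python child process stands in for the mysql client; the data is made up):

```python
import subprocess
import sys

import pandas as pd

# A child process that writes tab-separated rows to stdout,
# standing in for the mysql client in the real function.
child_code = r"print('a\tb'); print('1\t2'); print('3\t4')"
proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)

# read_csv() consumes the pipe while the child may still be writing.
df = pd.read_csv(proc.stdout, delimiter="\t")
retcode = proc.wait()
```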
A note about fcntl: fcntl is useful here because the amount of data which can be buffered in the pipe is limited to 64kB by default. I found that raising this to 1MB led to a ~10% speedup. If this is unavailable, a solution which writes the CSV to a file may outperform the pipe method.
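For illustration, a minimal Linux-only sketch of resizing a pipe this way (F_SETPIPE_SZ = 1031 comes from the answer above; F_GETPIPE_SZ = 1032 is assumed on the same basis, both from linux/fcntl.h):

```python
import fcntl
import os

# Linux-only fcntl commands for pipe capacity (values from linux/fcntl.h).
F_SETPIPE_SZ = 1031
F_GETPIPE_SZ = 1032

r, w = os.pipe()
# Pipes default to a 64 kB buffer; ask the kernel for 1 MB instead.
fcntl.fcntl(w, F_SETPIPE_SZ, 1 << 20)
capacity = fcntl.fcntl(w, F_GETPIPE_SZ)  # the size the kernel actually granted
os.close(r)
os.close(w)
```

The kernel rounds requests up to a power of two, and unprivileged processes are capped by /proc/sys/fs/pipe-max-size (1 MB by default).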
This solution is most similar to @MikeF's solution, so they get the bounty.

The dataset was generated with the following script:
import pandas as pd
import numpy as np
from english_words import get_english_words_set

np.random.seed(42)

import util

def gen_benchmark_df(data_function, limit):
    i = 0
    df = data_function(i)
    i += 1
    while df.memory_usage(deep=True).sum() < limit:
        df = pd.concat([df, data_function(i)], ignore_index=True)
        i += 1
    # Trim excess rows
    row_count = len(df.index)
    data_size_bytes = df.memory_usage(deep=True).sum()
    row_count_needed = int(row_count * (limit / data_size_bytes))
    df = df.head(row_count_needed)
    return df

def gen_ts_chunk(i):
    rows = 100_000
    return pd.DataFrame({
        'run_id': np.random.randint(1, 1_000_000),
        'feature_id': np.random.randint(1, 1_000_000),
        'timestep': np.arange(0, rows),
        'val': np.cumsum(np.random.uniform(-1, 1, rows))
    })

def gen_text_chunk(i):
    rows = 10_000
    words = list(get_english_words_set(['web2'], lower=True))
    text_strings = np.apply_along_axis(lambda x: ' '.join(x), axis=1, arr=np.random.choice(words, size=(rows, 3)))
    return pd.DataFrame({
        'id': np.arange(i * rows, (i + 1) * rows),
        'data': text_strings
    })

dataset_size = 1e8
con = util.open_engine()
timeseries_df = gen_benchmark_df(gen_ts_chunk, dataset_size)
timeseries_df.to_sql('timeseries', con=con, if_exists='replace', index=False, chunksize=10_000)

dataset_size = 5e8
text_df = gen_benchmark_df(gen_text_chunk, dataset_size)
text_df.to_sql('text', con=con, if_exists='replace', index=False, chunksize=10_000)
As mentioned in the comments, and in this answer, you are looking for SELECT ... INTO OUTFILE.

Here is a small (untested) example, based on your question:
def read_sql_tmpfile(query, connection):
    # Create a tmp file name without creating the file
    tmp_dir = tempfile.mkdtemp()
    tmp_file_name = os.path.join(tmp_dir, next(tempfile._get_candidate_names()))

    # Copy data into the temporary file (note the quotes around the file name)
    copy_sql = "{query} INTO OUTFILE '{outfile}'".format(
        query=query, outfile=tmp_file_name
    )
    cur = connection.cursor()
    cur.execute(copy_sql)

    # Read data from file
    df = pandas.read_csv(tmp_file_name)

    # Cleanup
    os.remove(tmp_file_name)
    return df
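One caveat: tempfile._get_candidate_names() is a private API and may change between Python versions. A sketch of getting an equivalent not-yet-created path with only public tempfile functions (the fixed file name is arbitrary):

```python
import os
import tempfile

def make_tmp_outfile_path():
    # mkdtemp() creates a fresh private directory, so a fixed file
    # name inside it is already unique and not yet created.
    tmp_dir = tempfile.mkdtemp()
    return os.path.join(tmp_dir, "sql_tmpfile.csv")

path = make_tmp_outfile_path()
```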
You can pretty easily write your file to /tmp, which gets cleared between reboots. You can also add your own decorator/context manager to apply similar niceties to those you get from tempfile.TemporaryFile. A quick example would be something like this...
import os

class SQLGeneratedTemporaryFile:
    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        # run your query and write to your file with the name `self.filename`
        return self.filename

    def __exit__(self, *exc):
        # psutil has no unlink(); use os.unlink() to delete the file
        os.unlink(self.filename)
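A self-contained sketch of this pattern in use, with the query step stubbed out by a plain file write so it runs without a database (the CSV contents are made up):

```python
import os
import tempfile

class SQLGeneratedTemporaryFile:
    """Delete the generated file when the `with` block exits."""
    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        # Real use: run the query with INTO OUTFILE targeting self.filename.
        # Stub: write a made-up CSV so the example runs without a database.
        with open(self.filename, "w") as f:
            f.write("id,data\n1,hello\n")
        return self.filename

    def __exit__(self, *exc):
        os.unlink(self.filename)

path = os.path.join(tempfile.mkdtemp(), "result.csv")
with SQLGeneratedTemporaryFile(path) as name:
    with open(name) as f:
        contents = f.read()
# The file is gone once the block exits; only `contents` survives.
```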