
Copy MySQL query result to tempfile in Python

I'm kinda new to the SQL world, but I was following a tutorial called Optimizing pandas.read_sql for Postgres. The thing is, I'm working with a big dataset, similar to the example in the tutorial, and I need a faster way to execute my query and turn it into a DataFrame. There, they use this function:

import pandas
import tempfile


def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
           query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)  # I want to replicate this
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df

And I tried to replicate it, like this:

def read_sql_tmpfile(query, connection):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
           query=query, head="HEADER"
        )

        cur = connection.cursor()
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df

The thing is, cursor.copy_expert comes from the psycopg2 library for PostgreSQL, and I can't find a way to do the same thing with pymysql. Is there any way to do this? What should I do? Thanks

I'm aware that the question is basically answered by wa.netech's comment. But I was interested, and the details and implications are not always obvious, so here is the tested, copy-pastable solution.

Since the output file ends up on the DB server, the solution involves handling the temp directory on the server and transferring the file to the client. For the sake of simplicity I used SSH and SFTP for this. This assumes that the SSH keys of both machines have been exchanged beforehand. The remote file transfer and handling may be easier with a Samba share or something like that.

@Nick ODell: Please give this solution a chance and do a benchmark. I'm pretty sure the copy overhead isn't significant for larger amounts of data.

import os
import subprocess

import pandas

# `username` (SSH user) and `db_server` (database host) must be defined at module level.
def read_sql_tmpfile(query, connection):
    df = None

    # Create unique temp directory on server side
    cmd = "mktemp -d"
    (out_mktemp, err) = subprocess.Popen(f'ssh {username}@{db_server} "{cmd}"', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
    if err or not out_mktemp:
        return

    # remove additional white spaces around the output
    tmp_dir = out_mktemp.strip().decode()

    # The following command should be made superfluous by tweaking the group memberships 
    # to grant `mysql` user full access to the directory created by the user which executes the `mktemp` command
    cmd = f"chmod 777 -R {tmp_dir}"
    res = os.system(f'ssh {username}@{db_server} "{cmd}"')
    if res:
        return

    try:
        remote_tmp_file = f'{tmp_dir}/sql_tmpfile'

        # remember: db-connection's user need `FILE` privilege
        # think about sql injection, pass MySql parameters in query and corresponding parameters list to this function if appropriate
        copy_sql = f"{query} INTO OUTFILE '{remote_tmp_file}'"

        cur = connection.cursor()
        cur.execute(copy_sql)

        local_tmp_file = os.path.basename(remote_tmp_file)
        cmd = f"sftp {username}@{db_server}:{remote_tmp_file} {local_tmp_file}"
        res = os.system(cmd)
        if not res and os.path.isfile(local_tmp_file):
            try:
                df = pandas.read_csv(local_tmp_file)
            finally:
                # cleanup local temp file
                os.remove(local_tmp_file)
    finally:
        # cleanup remote temp dir
        cmd = f"rm -R {tmp_dir}"
        os.system(f'ssh {username}@{db_server} "{cmd}"')

    return df
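
For reference, here is a minimal usage sketch. The pymysql connection, the username and db_server values, and the query are placeholders of my own, not part of the answer above:

import pymysql

username = 'deploy'            # SSH user on the DB server (placeholder)
db_server = 'db.example.com'   # database / SSH host (placeholder)

connection = pymysql.connect(host=db_server, user='dbuser', password='secret', database='mydb')
df = read_sql_tmpfile('SELECT col1, col2 FROM mytable', connection)
connection.close()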

Assuming that Nick's question is

How can I create a CSV file on the client from a MySQL table?

at a command-line prompt, do

mysql -u ... -p -h ... dbname -e '...' >localfile.csv

where the executable statement is something like

SELECT  col1, col2, col3, col4
    FROM mytable

Notes:

  • Windows: cmd; *nix: some 'terminal' app.
  • This is run on the client.
  • dbname has the effect of "USE dbname;".
  • The user, password, and hostname (of the server) are filled in as appropriate.
  • This assumes "tab" is a suitable delimiter for the CSV output.
  • Be careful about the nesting of quotes (escape if needed).
  • Whatever columns/expressions you desire are listed.
  • A WHERE (etc.) can be included as needed.
  • No FTP needed.
  • No Python needed.
  • SHOW ... acts very much like SELECT.
  • On *nix, "tab" could be turned into another delimiter.
  • The header line can be skipped with an option to mysql.

Example (with -u, -p, -h omitted):

# mysql  -e "show variables like 'max%size'" | tr '\t' ','
Variable_name,Value
max_binlog_cache_size,18446744073709547520
max_binlog_size,104857600
max_binlog_stmt_cache_size,18446744073709547520
max_heap_table_size,16777216
max_join_size,18446744073709551615
max_relay_log_size,0
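
If the end goal is still a pandas DataFrame, the redirected file can be loaded directly. A minimal sketch, assuming the tab-delimited output was redirected to localfile.csv as shown above:

import pandas as pd

# The mysql client's batch output is tab-separated and includes a header line
df = pd.read_csv('localfile.csv', sep='\t')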

To figure out which of these answers was fastest, I benchmarked each of them on a synthetic dataset. This dataset consisted of 100MB of time-series data and 500MB of text data. (Note: this is measured using Pandas, which heavily penalizes small objects versus data which can be represented in NumPy.)

I benchmarked 5 methods:

  • naive: The baseline of read_sql().
  • sftp: SELECT ... INTO OUTFILE, followed by an sftp call and read_csv().
  • tofile: Invoke the mysql command with -B to generate a CSV, and write that into a file.
  • pipe: Invoke the mysql command with -B to generate a CSV, and read from that pipe using read_csv(). Also use fcntl() to raise the pipe size.
  • pipe_no_fcntl: Same as before, but without fcntl.

Timings

All methods were tried seven times, in random order. In the following tables, a lower score is better.

Time series benchmark:

Method           Time (s)     Standard Error (s)
pipe              6.719870    0.064610
pipe_no_fcntl     7.243937    0.104802
tofile            7.636196    0.125963
sftp              9.926580    0.171262
naive            11.125657    0.470146

Text benchmark:

Method           Time (s)     Standard Error (s)
pipe              8.452694    0.217661
tofile            9.502743    0.265003
pipe_no_fcntl     9.620349    0.420255
sftp             12.189046    0.294148
naive            13.769322    0.695961

Winning solution

This is the pipe method, which was fastest.

import os
import pandas as pd
import subprocess
import tempfile
import time
import fcntl


db_server = '...'
F_SETPIPE_SZ = 1031


def read_sql_pipe(query, database):
    args = ['mysql', f'--login-path={db_server}', database, '-B', '-e', query]
    try:
        # Run mysql and capture output
        proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    except FileNotFoundError:
        # MySQL is not installed. Raise a better error message.
        raise Exception("The mysql command is not installed. Use brew or apt to install it.") from None

    # Raise amount of CSV data buffered up to 1MB.
    # This is a Linux-only syscall.
    fcntl.fcntl(proc.stdout.fileno(), F_SETPIPE_SZ, 1 << 20)

    df = pd.read_csv(proc.stdout, delimiter='\t')

    retcode = proc.wait()
    if retcode != 0:
        raise subprocess.CalledProcessError(
            retcode, proc.args, output=proc.stdout, stderr=proc.stderr
        )

    return df

The basic idea is to use the subprocess module to invoke mysql, with the stdout of MySQL being fed to a pipe. A pipe is a file-like object, which can be passed directly to pd.read_csv(). The MySQL process creates the CSV concurrently with Pandas reading it, so this has an advantage over the methods which write the entire file before Pandas starts reading it.

A note about fcntl: fcntl is useful here because the amount of data which can be buffered in the pipe is limited to 64kB by default. I found that raising this to 1MB led to a ~10% speedup. If this is unavailable, a solution which writes the CSV to a file may outperform the pipe method.
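
Since F_SETPIPE_SZ is Linux-only, the fcntl call can be wrapped so the function degrades gracefully to the default pipe size elsewhere. A small sketch; the helper name and the try/except guard are my additions, not part of the original answer:

import fcntl

F_SETPIPE_SZ = 1031  # same Linux pipe-resize command used in the code above

def try_raise_pipe_size(pipe_fd, size=1 << 20):
    # Best-effort: enlarge the pipe buffer to `size` bytes, or silently keep the 64kB default.
    try:
        fcntl.fcntl(pipe_fd, F_SETPIPE_SZ, size)
    except OSError:
        pass

Inside read_sql_pipe, the direct fcntl.fcntl(...) call would then be replaced by try_raise_pipe_size(proc.stdout.fileno()).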

This solution is most similar to @MikeF's solution, so they get the bounty.

Dataset

The dataset was generated with the following script.

import pandas as pd
import numpy as np
from english_words import get_english_words_set
np.random.seed(42)

import util


def gen_benchmark_df(data_function, limit):
    i = 0
    df = data_function(i)
    i += 1
    while df.memory_usage(deep=True).sum() < limit:
        df = pd.concat([df, data_function(i)], ignore_index=True)
        i += 1
    # Trim excess rows
    row_count = len(df.index)
    data_size_bytes = df.memory_usage(deep=True).sum()
    row_count_needed = int(row_count * (limit / data_size_bytes))
    df = df.head(row_count_needed)
    return df


def gen_ts_chunk(i):
    rows = 100_000
    return pd.DataFrame({
        'run_id': np.random.randint(1, 1_000_000),
        'feature_id': np.random.randint(1, 1_000_000),
        'timestep': np.arange(0, rows),
        'val': np.cumsum(np.random.uniform(-1, 1, rows))
    })


def gen_text_chunk(i):
    rows = 10_000
    words = list(get_english_words_set(['web2'], lower=True))
    text_strings = np.apply_along_axis(lambda x: ' '.join(x), axis=1, arr=np.random.choice(words, size=(rows, 3)))
    return pd.DataFrame({
        'id': np.arange(i * rows, (i + 1) * rows),
        'data': text_strings
    })



dataset_size = 1e8


con = util.open_engine()
timeseries_df = gen_benchmark_df(gen_ts_chunk, dataset_size)
timeseries_df.to_sql('timeseries', con=con, if_exists='replace', index=False, chunksize=10_000)


dataset_size = 5e8

text_df = gen_benchmark_df(gen_text_chunk, dataset_size)
text_df.to_sql('text', con=con, if_exists='replace', index=False, chunksize=10_000)
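
util.open_engine() is the author's own helper and its implementation isn't shown; to_sql() just needs an SQLAlchemy engine (or connection). A hypothetical stand-in using SQLAlchemy with the pymysql driver, with placeholder credentials:

from sqlalchemy import create_engine

def open_engine():
    # Placeholder user/password/host/database; replace with your own
    return create_engine('mysql+pymysql://user:password@db_server/benchmark')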

As mentioned in the comments, and in this answer, you are looking for SELECT ... INTO OUTFILE.

Here is a small (untested) example, based on your question:

import os
import tempfile

import pandas


def read_sql_tmpfile(query, connection):
    # Create a tmp file name without creating the file
    # (INTO OUTFILE refuses to write to an existing file)
    tmp_dir = tempfile.mkdtemp()
    tmp_file_name = os.path.join(tmp_dir, next(tempfile._get_candidate_names()))

    # Copy data into the temporary file; note that INTO OUTFILE writes on the
    # database server, so this only works as-is when client and server share a
    # filesystem, and the DB user needs the FILE privilege
    copy_sql = "{query} INTO OUTFILE '{outfile}'".format(
        query=query, outfile=tmp_file_name
    )
    cur = connection.cursor()
    cur.execute(copy_sql)

    # Read data from the file; INTO OUTFILE's default field terminator is a tab
    # and no header line is written
    df = pandas.read_csv(tmp_file_name, sep='\t', header=None)
    # Cleanup
    os.remove(tmp_file_name)
    os.rmdir(tmp_dir)
    return df

You can pretty easily write your file to /tmp, which gets cleared between reboots. You can also add your own decorator/context manager to apply similar niceties to those you get from tempfile.TemporaryFile. A quick example would be something like this...


import os


class SQLGeneratedTemporaryFile:

  def __init__(self, filename):
    self.filename = filename

  def __enter__(self):
    # run your query and write to your file with the name `self.filename`
    return self

  def __exit__(self, *exc):
    # remove the file when the `with` block is left
    os.unlink(self.filename)
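
A hypothetical usage, assuming __enter__ is filled in to run the query (for example via INTO OUTFILE or the mysql client) and to write the file before returning:

import pandas

with SQLGeneratedTemporaryFile('/tmp/query_result.csv') as tmp:
    df = pandas.read_csv(tmp.filename)
# the temp file is removed automatically when the block exits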
