Reading an SQL query into a Dask DataFrame

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a dask DataFrame with the dask.dataframe.read_sql_query function. I am new to dask and to SQLAlchemy. I first tried this:

import dask.dataframe as dd

query = "SELECT name, age, date_of_birth from customer"
df = dd.read_sql_query(sql=query, con=con_string, index_col="name", npartitions=10)

As you probably already know, this won't work because the sql parameter has to be an SQLAlchemy selectable and, more importantly, TextClause isn't supported.

I then wrapped the query in a select like this:

import dask.dataframe as dd
from sqlalchemy import sql

query = "SELECT name, age, date_of_birth from customer"
sa_query = sql.select(sql.text(query))
df = dd.read_sql_query(sql=sa_query, con=con_string, index_col="name")

This also fails, with a very weird error that I have been trying to solve. The problem is that dask needs to infer the types of the columns, and it does so by reading the first head_rows rows of the table (5 by default) and inferring the types from them. This line in the dask codebase adds a LIMIT ? to the query, which ends up being

SELECT name, age, date_of_birth from customer LIMIT param_1

The param_1 doesn't get substituted with the right value at all, 5 in this case. It then fails on the next line, https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L119 , which evaluates the SQL expression.

sqlalchemy.exc.ProgrammingError: (mariadb.ProgrammingError) You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT name, age, date_of_birth from customer 
 LIMIT ?' at line 1
[SQL: SELECT SELECT name, age, date_of_birth from customer 
 LIMIT ?]
[parameters: (5,)]
(Background on this error at: https://sqlalche.me/e/14/f405)

I can't understand why param_1 wasn't substituted with the value of head_rows. One can see from the error message that it detects there's a parameter that needs to be used for the substitution, but for some reason it doesn't actually substitute it.

Perhaps I didn't correctly create the SQLAlchemy selectable?

I can simply use pandas.read_sql and create a dask dataframe from the resulting pandas dataframe, but that defeats the purpose of using dask in the first place.

I have the following constraints:

  • I cannot change the function to accept a ready-made sqlalchemy selectable. This feature will be added to a private library used at my company, and various projects using this library do not use sqlalchemy.
  • Passing meta to the custom function is not an option because it would require the caller to create it. However, passing a meta attribute to read_sql_query and setting head_rows=0 is completely ok as long as there's an efficient way to retrieve/create it.
  • While dask-sql might work for this case, using it is not an option, unfortunately.

How can I go about correctly reading an SQL query into a dask dataframe?

The crux of the problem is this line:

sa_query = sql.select(sql.text(query))

What is happening is that select() treats the text as a column expression, so the generated SQL begins with a second SELECT keyword in front of the original query. That doubled SELECT is what causes the problem downstream.

Let's first create a test database:
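To see that the LIMIT ? placeholder is not itself at fault, here is a minimal sketch. It uses the standard-library sqlite3 module instead of MariaDB, and the customer table and its rows are made up for illustration. The same placeholder and the same (5,) parameter work on a well-formed query and fail only once the SELECT keyword is doubled.

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(f"person_{i}", 20 + i) for i in range(10)],
)

# A "?" placeholder is bound by the driver at execution time, so it
# appears verbatim in the SQL text that an error message prints.
rows = conn.execute("SELECT name, age FROM customer LIMIT ?", (5,)).fetchall()
print(len(rows))

# The doubled SELECT keyword is what the SQL parser rejects.
try:
    conn.execute("SELECT SELECT name, age FROM customer LIMIT ?", (5,))
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)
```

In other words, a "LIMIT ?" next to "[parameters: (5,)]" in a traceback is the normal way a bound parameter is displayed, not a sign that substitution failed.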

# create a test database (using https://stackoverflow.com/a/64898284/10693596)
from sqlite3 import connect

from dask.datasets import timeseries

con = "test.sql"  # same file that the sqlite:///test.sql URL below points to
db = connect(con)

# create a pandas df and store (timestamp is dropped to make sure
# that the index is numeric)
df = (
    timeseries(start="2000-01-01", end="2000-01-02", freq="1h", seed=0)
    .compute()
    .reset_index(drop=True)
)
df.to_sql("ticks", db, if_exists="replace")

Next, let's try to get things working with pandas, without sqlalchemy:

from pandas import read_sql_query

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
meta = read_sql_query(sql=query, con=con).set_index("index")

print(meta)
#          id    name         x         y
# index
# 0       998  Ingrid  0.760997 -0.381459
# 1      1056  Ingrid  0.506099  0.816477
# 2      1056   Laura  0.316556  0.046963
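The question allows passing meta to read_sql_query together with head_rows=0, provided meta can be built cheaply. One possible way, sketched below under assumptions, is to wrap the caller's query in a subquery, fetch a handful of rows with pandas, and keep only the empty, typed frame. The helper name sample_meta, the in-memory table and the sample size are hypothetical. The sketch also assumes the query has no trailing semicolon and that the sampled rows are representative of the column types.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (id INTEGER, name TEXT, x REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?)",
    [(i, f"name_{i}", i / 10) for i in range(100)],
)


def sample_meta(query, con, index_col, n=5):
    """Hypothetical helper: read n rows of `query` and return an empty
    frame with the same columns, index and inferred dtypes."""
    # Wrapping the caller's query in a subquery lets us add a LIMIT
    # without parsing or modifying the query text itself.
    sample = pd.read_sql_query(
        f"SELECT * FROM ({query}) AS q LIMIT {int(n)}", con
    )
    return sample.set_index(index_col).iloc[:0]


meta = sample_meta("SELECT id, name, x FROM ticks", conn, index_col="id")
print(meta.dtypes)
```

A frame like this could then be passed as meta=... alongside head_rows=0, so that dask skips its own type-inference query. Whether a few rows are enough to infer the right dtypes (for example, for columns that contain NULLs) depends on the data.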

Now, let's add sqlalchemy functions:

from pandas import read_sql_query
from sqlalchemy.sql import text, select

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
sa_query = select(text(query))
meta = read_sql_query(sql=sa_query, con=con).set_index("index")
# OperationalError: (sqlite3.OperationalError) near "SELECT": syntax error
# [SQL: SELECT SELECT * FROM ticks LIMIT 3]
# (Background on this error at: https://sqlalche.me/e/14/e3q8)

Note the SELECT SELECT that results from running sqlalchemy.select on an existing query. This is the source of the syntax error. How to fix this? In general, I don't think there's a safe and robust way of transforming arbitrary SQL queries into their sqlalchemy equivalent, but if this is for an application where you know that users will only run SELECT statements, you can manually sanitize the query before passing it to sqlalchemy.select:

from dask.dataframe import read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks"


def _remove_leading_select_from_query(query):
    if query.startswith("SELECT "):
        return query.replace("SELECT ", "", 1)
    else:
        return query


sa_query = select(text(_remove_leading_select_from_query(query)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index")

print(ddf)
print(ddf.head(3))
# Dask DataFrame Structure:
#                   id    name        x        y
# npartitions=1
# 0              int64  object  float64  float64
# 23               ...     ...      ...      ...
# Dask Name: from-delayed, 2 tasks
#          id    name         x         y
# index
# 0       998  Ingrid  0.760997 -0.381459
# 1      1056  Ingrid  0.506099  0.816477
# 2      1056   Laura  0.316556  0.046963
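The helper above only matches an uppercase "SELECT " at the very start of the string. If callers may pass lowercase keywords, leading whitespace or a newline after SELECT, a slightly more tolerant variant could look like the sketch below. The name strip_leading_select is hypothetical, and like the original helper this is a text-level workaround rather than a real SQL parser: queries that start with WITH, a comment or a parenthesis are returned unchanged.

```python
import re

# Matches optional leading whitespace, the SELECT keyword in any case,
# and the whitespace that follows it.
_LEADING_SELECT = re.compile(r"^\s*SELECT\s+", flags=re.IGNORECASE)


def strip_leading_select(query: str) -> str:
    """Return `query` without its first SELECT keyword.

    The query is returned unchanged if it does not start with SELECT.
    """
    return _LEADING_SELECT.sub("", query, count=1)


print(strip_leading_select("SELECT * FROM ticks"))
print(strip_leading_select("  select name, age\nFROM customer"))
```

The result can be used in place of _remove_leading_select_from_query(query) before handing the text to select(text(...)).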

