Populate SQL database with dask dataframe and dump into a file

The error and the use case can be reproduced on this colab.

I have multiple large tables that I read and analyze through Dask (dataframe). After the analysis, I would like to push them into a local database (in this case, an SQLite engine through the sqlalchemy package).

Here is some dummy data:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame([{"i": i, "s": str(i) * 2} for i in range(4)])

ddf = dd.from_pandas(df, npartitions=2)

from dask.utils import tmpfile
from sqlalchemy import create_engine

with tmpfile(
    dir="/outputs/",
    extension="db",
) as f:
    print(f)

    db = f"sqlite:///{f}"

    ddf.to_sql("test_table", db)

    engine = create_engine(
        db,
        echo=False,
    )

    print(dir(engine))
    result = engine.execute("SELECT * FROM test_table").fetchall()

print(result)

However, the tmpfile is temporary and is not stored on my local drive. I would like to dump the database onto my local drive; I could not find any argument for tmpfile that ensures it is stored as a file, nor could I figure out how to dump my engine.

Update: if I use a regular file, I encounter the following error:

    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Here is the code:

with open(
    "/outputs/hello.db", "wb"
) as f:
    print(f)

    db = f"sqlite:///{f}"

    ddf.to_sql("test_table", db, if_exists="replace")

    engine = create_engine(
        db,
        echo=False,
    )

    print(dir(engine))
    result = engine.execute("SELECT * FROM test_table").fetchall()

print(result)

If you would like to save to a regular file, there is no need to use a context manager:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame([{"i": i, "s": str(i) * 2} for i in range(4)])
ddf = dd.from_pandas(df, npartitions=2)


OUT_FILE = "test.db"
db = f"sqlite:///{OUT_FILE}"

ddf.to_sql("test_table", db)
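As for why the earlier open(...) attempt raised "unable to open database file": inside that with block, f is a file object rather than a path, so the f-string most likely embedded the object's repr into the database URL. The sketch below uses only the standard library to show the difference; the temporary directory is an assumed stand-in for /outputs/.

```python
import os
import tempfile

# Hypothetical location standing in for /outputs/hello.db
path = os.path.join(tempfile.mkdtemp(), "hello.db")

with open(path, "wb") as f:
    # Formatting the file object embeds its repr, not its path
    bad_url = f"sqlite:///{f}"
    # The path itself is what the URL needs
    good_url = f"sqlite:///{f.name}"

print(bad_url)   # sqlite:///<_io.BufferedWriter name='...hello.db'>
print(good_url)  # sqlite:///...hello.db
```

Since only the path string is needed, opening the file beforehand adds nothing; SQLite creates the file itself on first connection.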

To test that the file is saved, run:

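from sqlalchemy import create_engine, text

engine = create_engine(
    db,
    echo=False,
)

# Engine.execute() was removed in SQLAlchemy 2.0; an explicit connection
# with text() works on both 1.4 and 2.0.
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM test_table")).fetchall()

print(result)
# [(0, 0, '00'), (1, 1, '11'), (2, 2, '22'), (3, 3, '33')]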
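An SQLite database written this way is already an ordinary file on disk, so no separate dump step is needed to persist it. If a plain-text SQL dump is wanted as well, the standard library's sqlite3 module can produce one with Connection.iterdump(). The sketch below is self-contained and builds a small table by hand as an assumed stand-in for the one written by ddf.to_sql; the paths and the .sql file name are illustrative.

```python
import os
import sqlite3
import tempfile

# Hypothetical stand-in for the database file written by ddf.to_sql
db_path = os.path.join(tempfile.mkdtemp(), "test.db")

con = sqlite3.connect(db_path)
con.execute('CREATE TABLE test_table ("index" INTEGER, i INTEGER, s TEXT)')
con.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?)",
    [(n, n, str(n) * 2) for n in range(4)],
)
con.commit()

# iterdump() yields the SQL statements needed to recreate the database
dump_path = db_path + ".sql"
with open(dump_path, "w") as fh:
    for statement in con.iterdump():
        fh.write(statement + "\n")
con.close()

# Both the database file and the SQL dump now exist on disk
print(os.path.exists(db_path), os.path.exists(dump_path))
```

The resulting script can be replayed into a fresh database with Connection.executescript().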
