
Insert into PostgreSQL table from pandas with "on conflict" update

I have a pandas DataFrame that I need to store into the database. Here's my current line of code for inserting:

df.to_sql(table, con=engine, if_exists='append', index_label=index_col)

This works fine if none of the rows in df exist in my table. If a row already exists, I get this error:

sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key
value violates unique constraint "mypk"
DETAIL:  Key (id)=(42) already exists.
 [SQL: 'INSERT INTO mytable (id, owner,...) VALUES (%(id)s, %(owner)s,...']
 [parameters:...] (Background on this error at: http://sqlalche.me/e/gkpj)

and nothing is inserted.

PostgreSQL has an optional ON CONFLICT clause, which could be used to UPDATE the existing table rows. I read the entire pandas.DataFrame.to_sql manual page and I couldn't find any way to use ON CONFLICT within the DataFrame.to_sql() function.

I have considered splitting my DataFrame in two based on what's already in the db table. So now I have two DataFrames, insert_rows and update_rows, and I can safely execute

insert_rows.to_sql(table, con=engine, if_exists='append', index_label=index_col)

But then, there seems to be no UPDATE equivalent to DataFrame.to_sql(). So how do I update the table using the DataFrame update_rows?
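For reference, the split described above could look roughly like the sketch below. It assumes the key lives in the DataFrame index (as implied by index_label=index_col) and that the set of keys already present in the table has been fetched separately. The helper name split_by_existing and the existing_ids argument are hypothetical, introduced only for illustration.

```python
import pandas as pd


def split_by_existing(df, existing_ids):
    """Split df into (insert_rows, update_rows) by index membership.

    existing_ids is assumed to hold the keys already stored in the table.
    """
    # True for rows whose index value is already present in the table.
    mask = df.index.isin(list(existing_ids))
    insert_rows = df[~mask]  # new rows, safe to append with to_sql
    update_rows = df[mask]   # rows that would violate the unique constraint
    return insert_rows, update_rows


if __name__ == "__main__":
    frame = pd.DataFrame(
        {"owner": ["a", "b", "c"]},
        index=pd.Index([41, 42, 43], name="id"),
    )
    new_rows, old_rows = split_by_existing(frame, {42})
    print(new_rows)
    print(old_rows)
```

This only covers the split itself; the update half still needs its own SQL.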

@SaturnFromTitan, thanks for the reply to this old thread. That worked like magic. I would upvote, but I don't have the rep.

For those who are as new to all this as I am: you can cut and paste SaturnFromTitan's answer and call it with something like:

    df.to_sql('my_table_name',
              dbConnection,
              schema='my_schema',
              if_exists='append',
              index=False,
              method=postgres_upsert)

And that's it. The upsert works.

If you look at the to_sql docs, there's mention of a method argument that takes a callable. Creating this callable should allow you to use the Postgres clauses you need. Here's an example of a callable they mention in the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

It's pretty different from what you need, but follow the arguments passed to this callable (table, conn, keys, data_iter). They will allow you to construct a regular SQL statement.
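As a rough illustration of building a regular SQL statement from those arguments, here is a minimal sketch. The helper build_upsert_sql, the callable upsert_with_text, and the conflict columns col_a and col_b are assumptions made for this example, not part of pandas. The sketch also assumes that conn is a SQLAlchemy connection, that the identifiers need no quoting, and that at least one column lies outside the conflict target.

```python
def build_upsert_sql(table_name, keys, conflict_cols):
    """Return an INSERT ... ON CONFLICT DO UPDATE statement with :name placeholders."""
    columns = ", ".join(keys)
    placeholders = ", ".join(f":{k}" for k in keys)
    target = ", ".join(conflict_cols)
    # Every column outside the conflict target takes the value of the
    # row that was proposed for insertion (EXCLUDED).
    updates = ", ".join(
        f"{k} = EXCLUDED.{k}" for k in keys if k not in conflict_cols
    )
    return (
        f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders}) "
        f"ON CONFLICT ({target}) DO UPDATE SET {updates}"
    )


def upsert_with_text(table, conn, keys, data_iter):
    """Callable for to_sql(method=...); the conflict columns are hard-coded assumptions."""
    from sqlalchemy import text  # imported lazily so the builder works without SQLAlchemy

    sql = build_upsert_sql(table.name, keys, ["col_a", "col_b"])
    # One dict per row; SQLAlchemy runs the statement once per parameter set.
    conn.execute(text(sql), [dict(zip(keys, row)) for row in data_iter])


if __name__ == "__main__":
    print(build_upsert_sql("test_table", ["col_a", "col_b", "col_c"], ["col_a", "col_b"]))
```

It would be passed as method=upsert_with_text in the to_sql call. A table in a non-default schema would need the schema added to the table name.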

To follow up on Brendan's answer with an example, this is what worked for me:

import os
import sqlalchemy as sa
import pandas as pd
from sqlalchemy.dialects.postgresql import insert


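engine = sa.create_engine(os.getenv("DBURL"))
meta = sa.MetaData()
# MetaData.bind (meta.bind = engine) was removed in SQLAlchemy 2.0,
# so the engine is passed to reflect() directly.
meta.reflect(bind=engine, views=True)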


def upsert(table, conn, keys, data_iter):
    # Name of the unique constraint that identifies conflicting rows.
    upsert_args = {"constraint": "test_table_col_a_col_b_key"}
    for data in data_iter:
        # Map column names to this row's values.
        data = {k: data[i] for i, k in enumerate(keys)}
        # On conflict, overwrite the existing row with the new values.
        upsert_args["set_"] = data
        insert_stmt = insert(meta.tables[table.name]).values(**data)
        upsert_stmt = insert_stmt.on_conflict_do_update(**upsert_args)
        conn.execute(upsert_stmt)


if __name__ == "__main__":
    df = pd.read_csv("test_data.txt")
    # engine.begin() opens a transaction and commits it when the block exits.
    with engine.begin() as conn:
        df.to_sql(
            "test_table",
            con=conn,
            if_exists="append",
            method=upsert,
            index=False,
        )

where in this example the schema would be something like:

CREATE TABLE test_table(
    col_a text NOT NULL,
    col_b text NOT NULL,
    col_c text,
    UNIQUE (col_a, col_b)
)

If anybody wants to build on top of the answer from zdgriffith and dynamically generate the table constraint name, you can use the following query for PostgreSQL:

select distinct tco.constraint_name
from information_schema.table_constraints tco
         join information_schema.key_column_usage kcu
              on kcu.constraint_name = tco.constraint_name
                  and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = '{table.name}'
  and constraint_type = 'PRIMARY KEY';

You can then format this string to populate table.name inside the upsert() method.
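A minimal sketch of that lookup is shown below. It passes the table name as a bound parameter instead of formatting it into the string. The helper name pk_constraint_name is hypothetical, and the %(table_name)s placeholder assumes the psycopg2 driver's parameter style. conn is assumed to be a SQLAlchemy 1.4+ connection.

```python
PK_CONSTRAINT_QUERY = """
select distinct tco.constraint_name
from information_schema.table_constraints tco
         join information_schema.key_column_usage kcu
              on kcu.constraint_name = tco.constraint_name
                  and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = %(table_name)s
  and tco.constraint_type = 'PRIMARY KEY'
"""


def pk_constraint_name(conn, table_name):
    """Return the primary key constraint name of table_name (hypothetical helper)."""
    # exec_driver_sql sends the string to the driver unchanged, so the
    # placeholder style must match the driver (psycopg2 is assumed here).
    result = conn.exec_driver_sql(PK_CONSTRAINT_QUERY, {"table_name": table_name})
    # scalar_one() raises unless exactly one row comes back, for example
    # when the same table name exists in several schemas.
    return result.scalar_one()
```

Inside upsert(), the hard-coded constraint name could then be replaced with pk_constraint_name(conn, table.name).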

I also didn't need the meta.bind and meta.reflect() lines. The former is deprecated as of SQLAlchemy 1.4 (and removed in 2.0) anyway.

I know it's an old thread, but I ran into the same issue and this thread showed up in Google. None of the answers is really satisfying yet, so here's what I came up with:

My solution is pretty similar to zdgriffith's answer, but much more performant, as it executes a single multi-row statement instead of one statement per row in data_iter:

def postgres_upsert(table, conn, keys, data_iter):
    from sqlalchemy.dialects.postgresql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=f"{table.table.name}_pkey",
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)

Now you can use this custom upsert method in pandas' to_sql method like zdgriffith showed.

Please note that my upsert function uses the primary key constraint of the table and assumes it has PostgreSQL's default name, <table>_pkey. You can target another constraint by changing the constraint argument of .on_conflict_do_update.

This SO answer on a related thread explains the use of .excluded a bit more: https://stackoverflow.com/a/51935542/7066758
