Perform UPSERT operation on Postgres like pandas to_sql function using Python
Before asking this question, I read many links about the UPSERT operation on Postgres.
But my question is different from those, since the functionality I need is different. What I want is to implement something like the pandas to_sql function, which automatically creates the table from the dataframe columns and preserves the column data types.
The only drawback of to_sql is that it doesn't perform an UPSERT operation on Postgres. Is there any way to implement the expected functionality (automatically create the table based on the columns, perform an UPSERT operation, and keep the data types) by passing a dataframe to it?
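To make the drawback concrete, here is a minimal sketch (using an in-memory SQLite engine as a stand-in for a real Postgres connection; the table name demo is made up) showing that to_sql with if_exists='append' simply inserts duplicate rows instead of updating existing ones:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine as a stand-in for a real Postgres connection.
engine = create_engine('sqlite://')

df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})
df.to_sql('demo', con=engine, if_exists='append', index=False)

# Writing the same rows again does NOT upsert -- it appends duplicates.
df.to_sql('demo', con=engine, if_exists='append', index=False)

count = pd.read_sql('SELECT COUNT(*) AS n FROM demo', con=engine)['n'][0]
print(count)  # 4: the two original rows, duplicated
```

With a primary key on id, the second call would instead fail with a unique-constraint violation; either way, to_sql offers no update-on-conflict path.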
Previously implemented code using the pandas to_sql function:
from sqlalchemy import create_engine

# config_dict, Consts and logger come from the project's own
# configuration and logging modules.

class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]

    def get_connection(self) -> object:
        url_schema = Consts.POSTGRES_URL_SCHEMA.value.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e

    def save_df_to_db(self, df: object, table_name: str) -> None:
        df.to_sql(table_name, con=self.get_connection(), if_exists='append')
I have written a very generic piece of code that performs an UPSERT, which the pandas to_sql function does not support officially (as of December 2021), using a pandas dataframe in an efficient way.

By using the following code, it will update rows whose primary key already exists; otherwise it will create the table (in case the table name doesn't exist) and add the new records to it.

Code:
import os

import pandas as pd
from sqlalchemy import create_engine, Table
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.automap import automap_base

# config_dict, Consts, Directories, logger and get_query come from the
# project's own configuration, logging and helper modules.

class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]

    def get_connection(self) -> object:
        url_schema = 'postgresql://{}:{}@{}:{}/{}'.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e

    def run_query(self, query: str) -> list:
        engine = self.get_connection()
        return engine.execute(query).fetchall()

    def save_df_to_db(self, df: object, table_name: str) -> None:
        root_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
        engine = self.get_connection()
        add_primary_key_query = get_query(root_dir, Directories.COMMON.value, 'add_primary_key.sql', table_name)
        table_existence_query = get_query(root_dir, Directories.COMMON.value, 'table_existence.sql', table_name)
        if not engine.execute(table_existence_query).first()[0]:  # table does not exist
            logger.info('Create table automatically and from scratch!')
            df.to_sql(table_name, con=self.get_connection(), if_exists='append')
            engine.execute(add_primary_key_query)
        else:
            try:
                # The driver cannot serialize pandas NaT; map it to SQL NULL.
                df = df.replace({pd.NaT: None})
                df_dict = df.to_dict('records')
            except AttributeError as e:
                logger.error('Empty Dataframe!')
                raise e
            with engine.connect() as connection:
                logger.info('Table already exists!')
                base = automap_base()
                base.prepare(engine, reflect=True)
                target_table = Table(table_name, base.metadata,
                                     autoload=True, autoload_with=engine)
                # Insert in chunks of 1000 rows to keep each statement a manageable size.
                chunks = [df_dict[i:i + 1000] for i in range(0, len(df_dict), 1000)]
                for chunk in chunks:
                    stmt = insert(target_table).values(chunk)
                    # Update every non-primary-key column with the incoming (EXCLUDED) value.
                    update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
                    connection.execute(stmt.on_conflict_do_update(
                        constraint=f'{table_name}_pkey',
                        set_=update_dict)
                    )
        logger.info('Saving data is successfully done.')
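The core of the approach above is SQLAlchemy's on_conflict_do_update construct. As a standalone sketch (the users table and its columns are made up for illustration, and index_elements=['id'] is used in place of naming the constraint, which is equivalent here), the generated statement can be inspected without a live database by compiling it against the Postgres dialect:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
users = Table(
    'users', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
)

stmt = insert(users).values(id=1, name='alice')
# EXCLUDED refers to the row that failed to insert; update every
# non-primary-key column with its incoming value.
update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
stmt = stmt.on_conflict_do_update(index_elements=['id'], set_=update_dict)

sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)
```

Compiling like this is a convenient way to verify that the ON CONFLICT ... DO UPDATE SET clause covers exactly the columns you expect before running it against real data.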
Table existence query:
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'public'
AND table_name = '{}'
);
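As an alternative to shipping this raw information_schema query as a .sql file, SQLAlchemy's inspector can perform the same existence check. A minimal sketch (again using an in-memory SQLite engine as a stand-in for the Postgres engine, with a made-up demo table):

```python
from sqlalchemy import create_engine, inspect, text

engine = create_engine('sqlite://')  # stand-in for the Postgres engine
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE demo (id INTEGER PRIMARY KEY)'))

print(inspect(engine).has_table('demo'))     # True
print(inspect(engine).has_table('missing'))  # False
```

has_table also works against Postgres and accepts a schema argument, so it covers the table_schema = 'public' filter from the query above.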
Add primary key query:
ALTER TABLE {} add primary key (id);