
Perform upsert operation on Postgres like pandas to_sql function using python

Before asking this question, I had read many links about the UPSERT operation on Postgres:

But this question is different from those, because the required functionality is different. What I want is to implement something like the pandas to_sql function, which has the following features:

  • Automatically creates the table
  • Preserves the data type of each column

The only shortcoming of to_sql is that it does not perform an UPSERT operation on Postgres. Is there a way to achieve the expected functionality (automatically creating the table based on the columns, performing the UPSERT operation, and preserving the data types) just by passing a dataframe to it?
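For reference, `to_sql` accepts a custom `method` callable, which can be used to issue `INSERT ... ON CONFLICT DO UPDATE` statements (this pattern appears in the pandas I/O user guide). A minimal sketch follows; it uses SQLite's equivalent `ON CONFLICT` support so the snippet is self-contained, and the `items` table and its columns are invented for the demo. Note that the conflict target needs a primary key, so the table is created up front rather than by `to_sql`:

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.dialects.sqlite import insert  # use the postgresql dialect on Postgres


def upsert_method(table, conn, keys, data_iter):
    """Custom to_sql insertion method: INSERT ... ON CONFLICT DO UPDATE."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows)
    # Update every non-key column from the proposed (excluded) row.
    updates = {c.name: c for c in stmt.excluded if c.name != "id"}
    stmt = stmt.on_conflict_do_update(index_elements=["id"], set_=updates)
    return conn.execute(stmt).rowcount


engine = create_engine("sqlite://")  # stand-in for the Postgres engine
with engine.begin() as conn:
    # The conflict target needs a primary key, so create the table up front.
    conn.execute(text("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)"))

pd.DataFrame({"id": [1, 2], "qty": [10, 20]}).to_sql(
    "items", engine, if_exists="append", index=False, method=upsert_method)
pd.DataFrame({"id": [2, 3], "qty": [99, 30]}).to_sql(
    "items", engine, if_exists="append", index=False, method=upsert_method)

with engine.connect() as conn:
    # id=2 is updated to qty=99 rather than duplicated
    print(conn.execute(text("SELECT id, qty FROM items ORDER BY id")).fetchall())
```

This keeps most of `to_sql`'s conveniences, but it does not add a primary key to an auto-created table, which is why the answer below creates the key separately.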

Code previously implemented using the pandas to_sql function

from sqlalchemy import create_engine

# config_dict, Consts and logger are assumed to be defined elsewhere in the project
class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]
    
    def get_connection(self) -> object:
        url_schema = Consts.POSTGRES_URL_SCHEMA.value.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e


    def save_df_to_db(self, df: object, table_name: str) -> None:
        df.to_sql(table_name, con=self.get_connection(), if_exists='append')

I wrote a fairly generic piece of code that performs an UPSERT with pandas in an efficient way, even though UPSERT is not officially supported by pandas' to_sql (as of December 2021).

Using the code below, it will update records whose primary key already exists; otherwise it will create a new table (in case the table name does not exist) and add the new records to it.

Code

import os

import numpy as np
import pandas as pd
from sqlalchemy import create_engine, Table
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.automap import automap_base

# config_dict, Consts, Directories, logger and get_query are assumed to be
# defined elsewhere in the project


class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]
    
    def get_connection(self) -> object:
        url_schema = 'postgresql://{}:{}@{}:{}/{}'.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e

    def run_query(self, query: str) -> list:
        engine = self.get_connection()
        return engine.execute(query).fetchall()

    def save_df_to_db(self, df: object, table_name: str) -> None:
        root_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
        engine = self.get_connection()
        add_primary_key_query = get_query(root_dir, Directories.COMMON.value, 'add_primary_key.sql', table_name)
        table_existence_query = get_query(root_dir, Directories.COMMON.value, 'table_existence.sql', table_name)
        if not engine.execute(table_existence_query).first()[0]:  # if table does not exist
            logger.info('Create table automatically and from scratch!')
            df.to_sql(table_name, con=self.get_connection(), if_exists='append')
            engine.execute(add_primary_key_query)
        else:
            try:
                # Convert NaT values to None so they are inserted as NULL.
                # The dict form is required here: df.replace(pd.NaT, None)
                # with a scalar to_replace and value=None would pad-fill
                # instead of replacing.
                df = df.replace({pd.NaT: None})
                df_dict = df.to_dict('records')
            except AttributeError as e:
                logger.error('Empty Dataframe!')
                raise e
            with engine.connect() as connection:
                logger.info('Table already exists!')
                base = automap_base()
                base.prepare(engine, reflect=True,)
                target_table = Table(table_name, base.metadata,
                                autoload=True, autoload_with=engine,)

                chunks = [df_dict[i:i + 1000] for i in range(0, len(df_dict), 1000)]
                for chunk in chunks:
                    stmt = insert(target_table).values(chunk)
                    update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
                    connection.execute(stmt.on_conflict_do_update(
                        constraint=f'{table_name}_pkey',
                        set_=update_dict)
                    )


                logger.info('Saving data is successfully done.')
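The NaT-to-None conversion and the fixed-size chunking used above can be illustrated in isolation; the dataframe columns here are invented for the demo, and the chunk size is shrunk from 1000 to 2 so the split is visible:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "ts": [pd.Timestamp("2021-01-01"), pd.NaT, pd.Timestamp("2021-03-01")],
})

# NaT cannot be sent to Postgres as-is; map it to None so it becomes NULL.
df = df.replace({pd.NaT: None})
records = df.to_dict("records")
print(records[1]["ts"])  # None

# Split the records into fixed-size chunks (1000 in the answer; 2 here),
# so each INSERT ... ON CONFLICT statement stays reasonably small.
chunk_size = 2
chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
print(len(chunks))  # 2 chunks: one with 2 records, one with 1
```

Chunking the multi-row VALUES list keeps each statement below Postgres' parameter limits and bounds memory use for large dataframes.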

Table existence query

SELECT EXISTS (
    SELECT FROM information_schema.tables 
    WHERE  table_schema = 'public'
    AND    table_name   = '{}'
);
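As a side note, the same existence check can be done without raw SQL through SQLAlchemy's inspector, which also works across database backends. A small sketch, with SQLite as a stand-in for the Postgres engine and a made-up table name:

```python
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE my_table (id INTEGER PRIMARY KEY)"))

print(inspect(engine).has_table("my_table"))  # True
print(inspect(engine).has_table("missing"))   # False
```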

Add primary key query

ALTER TABLE {} add primary key (id);
