
Perform an UPSERT operation on Postgres, like the pandas to_sql function, using Python

Before asking this question, I read many of the existing links about UPSERT operations on Postgres.

But my question is different, because the required functionality differs. What I want is to implement something like the pandas to_sql function, which has the following features:

  • Automatically creates the table
  • Preserves the data type of each column

The only drawback of to_sql is that it does not perform an UPSERT operation on Postgres. Is there a way to achieve the desired functionality (automatically creating a table based on the columns, performing an UPSERT, and preserving the data types) by passing a DataFrame to it?
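pandas' to_sql itself has no upsert mode; what the question is really after is the `INSERT ... ON CONFLICT ... DO UPDATE` statement that Postgres provides. As a minimal sketch of those semantics, using the stdlib sqlite3 module (whose `ON CONFLICT` syntax mirrors Postgres) and a hypothetical `users` table:

```python
import sqlite3

# Hypothetical in-memory table standing in for a Postgres table with a primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 10)")

# UPSERT: insert a row, but on a primary-key conflict update the existing row
# using the incoming values exposed via the special `excluded` pseudo-table.
conn.execute(
    "INSERT INTO users VALUES (1, 'alice', 99) "
    "ON CONFLICT(id) DO UPDATE SET score = excluded.score"
)
row = conn.execute("SELECT score FROM users WHERE id = 1").fetchone()
print(row[0])  # 99
```

The same `excluded` pseudo-table appears in the accepted answer's SQLAlchemy code below.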

Code previously implemented with the pandas to_sql function

from sqlalchemy import create_engine

# config_dict, Consts, and logger are assumed to be defined elsewhere in the project.
class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]
    
    def get_connection(self) -> object:
        url_schema = Consts.POSTGRES_URL_SCHEMA.value.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e


    def save_df_to_db(self, df: object, table_name: str) -> None:
        df.to_sql(table_name, con=self.get_connection(), if_exists='append')

I wrote a fairly generic piece of code that performs an UPSERT from pandas to Postgres in an efficient way, something not officially supported (as of December 2021).

With the code below, rows whose primary keys already exist will be updated; otherwise a new table is created (if the table name does not exist) and the new records are appended to it.

Code

import os

import pandas as pd
from sqlalchemy import create_engine, Table
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.automap import automap_base

# config_dict, Consts, Directories, get_query, and logger are assumed to be
# defined elsewhere in the project.

class PostgreSQL:
    def __init__(self):
        postgres_config = config_dict[Consts.POSTGRES.value]
        self.host = postgres_config[Consts.HOST.value]
        self.port = postgres_config[Consts.PORT.value]
        self.db_name = postgres_config[Consts.DB_NAME.value]
        self.username = postgres_config[Consts.USERNAME.value]
        self.password = postgres_config[Consts.PASSWORD.value]
    
    def get_connection(self) -> object:
        url_schema = 'postgresql://{}:{}@{}:{}/{}'.format(
            self.username, self.password, self.host, self.port, self.db_name
        )
        try:
            engine = create_engine(url_schema)
            return engine
        except Exception as e:
            logger.error('Make sure you have provided correct credentials for the DB connection.')
            raise e

    def run_query(self, query: str) -> list:
        engine = self.get_connection()
        return engine.execute(query).fetchall()

    def save_df_to_db(self, df: object, table_name: str) -> None:
        root_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..')
        engine = self.get_connection()
        add_primary_key_query = get_query(root_dir, Directories.COMMON.value, 'add_primary_key.sql', table_name)
        table_existence_query = get_query(root_dir, Directories.COMMON.value, 'table_existence.sql', table_name)
        if not engine.execute(table_existence_query).first()[0]:  # if table does not exist
            logger.info('Create table automatically and from scratch!')
            df.to_sql(table_name, con=self.get_connection(), if_exists='append')
            engine.execute(add_primary_key_query)
        else:
            try:
                # pandas NaT cannot be serialized as SQL NULL by the driver;
                # map both the string "NaT" and pd.NaT itself to None first
                df = df.replace("NaT", None)
                df = df.replace({pd.NaT: None})
                df_dict = df.to_dict('records')
            except AttributeError as e:
                logger.error('Empty Dataframe!')
                raise e
            with engine.connect() as connection:
                logger.info('Table already exists!')
                base = automap_base()
                base.prepare(engine, reflect=True,)
                target_table = Table(table_name, base.metadata,
                                autoload=True, autoload_with=engine,)

                # insert in batches of 1000 rows; on a primary-key conflict,
                # overwrite every non-key column with the incoming value
                chunks = [df_dict[i:i + 1000] for i in range(0, len(df_dict), 1000)]
                for chunk in chunks:
                    stmt = insert(target_table).values(chunk)
                    update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
                    connection.execute(stmt.on_conflict_do_update(
                        constraint=f'{table_name}_pkey',
                        set_=update_dict)
                    )


                logger.info('Saving data is successfully done.')
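The 1000-row batching used in save_df_to_db is plain list slicing; as a standalone sketch (with a hypothetical `chunked` helper):

```python
def chunked(records, size=1000):
    """Slice a list of row dicts into fixed-size batches for the INSERT statements."""
    return [records[i:i + size] for i in range(0, len(records), size)]

# 2500 dummy rows -> two full batches and one partial batch
rows = [{"id": n} for n in range(2500)]
batches = chunked(rows)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Batching keeps each INSERT statement at a bounded size instead of sending one huge multi-row VALUES clause.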

Table existence query

SELECT EXISTS (
    SELECT FROM information_schema.tables 
    WHERE  table_schema = 'public'
    AND    table_name   = '{}'
);

Add primary key query

ALTER TABLE {} add primary key (id);
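The get_query helper used in save_df_to_db is not shown in the answer. A minimal sketch consistent with how it is called, assuming it loads one of the .sql templates above and substitutes the table name (both the helper and its behavior are assumptions):

```python
import os


def get_query(root_dir, subdir, filename, table_name):
    """Hypothetical sketch: read an SQL template file and format the table name in."""
    path = os.path.join(root_dir, subdir, filename)
    with open(path) as f:
        return f.read().format(table_name)
```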
