
Multi-row UPSERT (INSERT or UPDATE) from Python

I am currently executing the simple query below with Python, using pyodbc, to insert data into a SQL Server table:

import pyodbc

table_name = 'my_table'
insert_values = [(1,2,3),(2,2,4),(3,4,5)]

cnxn = pyodbc.connect(...)
cursor = cnxn.cursor()
cursor.execute(
    ' '.join([
        'insert into',
        table_name,
        'values',
        ','.join(
            [str(i) for i in insert_values]
        )
    ])
)
cursor.commit()

This should work as long as there are no duplicate keys (let's assume the first column contains the key). However, for data with duplicate keys (data already existing in the table) it will raise an error. How can I, in one go, insert multiple rows into a SQL Server table using pyodbc such that rows with duplicate keys simply get updated?

Note: There are solutions proposed for single rows of data; however, I would like to insert multiple rows at once (avoiding loops)!

This can be done using MERGE. Let's say you have a key column ID and two columns col_a and col_b (you need to specify column names in update statements); then the statement would look like this:

MERGE INTO MyTable as Target
USING (SELECT * FROM 
       (VALUES (1, 2, 3), (2, 2, 4), (3, 4, 5)) 
       AS s (ID, col_a, col_b)
      ) AS Source
ON Target.ID=Source.ID
WHEN NOT MATCHED THEN
INSERT (ID, col_a, col_b) VALUES (Source.ID, Source.col_a, Source.col_b)
WHEN MATCHED THEN
UPDATE SET col_a=Source.col_a, col_b=Source.col_b;

You can give it a try on rextester.com/IONFW62765.

Basically, I'm creating a Source table "on the fly" from the list of values you want to upsert. When you then merge the Source table with the Target, you can test the MATCHED condition (Target.ID=Source.ID) on each row (whereas you would be limited to a single row when just using a simple IF <exists> INSERT (...) ELSE UPDATE (...) condition).

In Python with pyodbc, it should probably look like this:

import pyodbc

insert_values = [(1, 2, 3), (2, 2, 4), (3, 4, 5)]
table_name = 'my_table'
key_col = 'ID'
col_a = 'col_a'
col_b = 'col_b'

cnxn = pyodbc.connect(...)
cursor = cnxn.cursor()
cursor.execute(('MERGE INTO {table_name} as Target '
                'USING (SELECT * FROM '
                '(VALUES {vals}) '
                'AS s ({k}, {a}, {b}) '
                ') AS Source '
                'ON Target.{k}=Source.{k} '
                'WHEN NOT MATCHED THEN '
                'INSERT ({k}, {a}, {b}) VALUES (Source.{k}, Source.{a}, Source.{b}) '
                'WHEN MATCHED THEN '
                'UPDATE SET {a}=Source.{a}, {b}=Source.{b};'
                .format(table_name=table_name,
                        vals=','.join([str(i) for i in insert_values]),
                        k=key_col,
                        a=col_a,
                        b=col_b)))
cursor.commit()

You can read up more on MERGE in the SQL Server docs.

Given a dataframe (df), I used the code from ksbg to upsert into a table. Note that I looked for a match on two columns (date and stationcode); you can use one. The code generates the query for any given df.

def append(df, c):
    table_name = 'ddf.ddf_actuals'

    # build the column lists used in the MERGE statement
    columns_list = df.columns.tolist()
    columns_list_query = f'({(",".join(columns_list))})'
    sr_columns_list = [f'Source.{i}' for i in columns_list]
    sr_columns_list_query = f'({(",".join(sr_columns_list))})'
    up_columns_list = [f'{i}=Source.{i}' for i in columns_list]
    up_columns_list_query = f'{",".join(up_columns_list)}'

    # render the dataframe rows as a SQL VALUES list, e.g. (1,2,3),(4,5,6)
    rows_to_insert = [row.tolist() for idx, row in df.iterrows()]
    rows_to_insert = str(rows_to_insert).replace('[', '(').replace(']', ')')[1:][:-1]

    query = f"MERGE INTO {table_name} as Target \
USING (SELECT * FROM \
(VALUES {rows_to_insert}) \
AS s {columns_list_query}\
) AS Source \
ON Target.stationcode=Source.stationcode AND Target.date=Source.date \
WHEN NOT MATCHED THEN \
INSERT {columns_list_query} VALUES {sr_columns_list_query} \
WHEN MATCHED THEN \
UPDATE SET {up_columns_list_query};"
    c.execute(query)
    c.commit()
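
For completeness, a minimal usage sketch (the dataframe contents and the connection string below are illustrative assumptions, not part of the original answer):

import pyodbc
import pandas as pd

# hypothetical dataframe whose columns match ddf.ddf_actuals,
# including the "date" and "stationcode" columns used in the ON clause
df = pd.DataFrame({'date': ['2021-01-01'],
                   'stationcode': ['S001'],
                   'value': [42]})

cnxn = pyodbc.connect(...)  # fill in your connection string
cursor = cnxn.cursor()
append(df, cursor)          # pyodbc cursors expose both execute() and commit()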

Following up on the existing answers here: because they are potentially prone to injection attacks, it's better to use parameterized queries (for mssql/pyodbc, these are the "?" placeholders). I tweaked Alexander Novas's code slightly to use dataframe rows in a parameterized version of the query with sqlalchemy:

# assuming you already have a dataframe "df" and a sqlalchemy engine called "engine"
# also assumes your dataframe columns all have the same names as the existing table
import pandas as pd   # needed for pd.NA below
import numpy as np    # needed for np.nan below

table_name_to_update = 'update_table'
table_name_to_transfer = 'placeholder_table'

# the dataframe and existing table should both have a column to use as the primary key
primary_key_col = 'id'

# replace the placeholder table with the dataframe
df.to_sql(table_name_to_transfer, engine, if_exists='replace', index=False)

# building the command terms
cols_list = df.columns.tolist()
cols_list_query = f'({(", ".join(cols_list))})'
sr_cols_list = [f'Source.{i}' for i in cols_list]
sr_cols_list_query = f'({(", ".join(sr_cols_list))})'
up_cols_list = [f'{i}=Source.{i}' for i in cols_list]
up_cols_list_query = f'{", ".join(up_cols_list)}'
    
# fill values that should be interpreted as "NULL" with None
def fill_null(vals: list) -> list:
    def bad(val):
        if isinstance(val, type(pd.NA)):
            return True
        # the list of values you want to interpret as 'NULL' should be
        # tweaked to your needs
        return val in ['NULL', np.nan, 'nan', '', '-', '?']
    return tuple(i if not bad(i) else None for i in vals)

# create the list of parameter indicators (?, ?, ?, etc...)
# and the parameters, which are the values to be inserted
params = [fill_null(row.tolist()) for _, row in df.iterrows()]
param_slots = '('+', '.join(['?']*len(df.columns))+')'
    
cmd = f'''
       MERGE INTO {table_name_to_update} as Target
       USING (SELECT * FROM
       (VALUES {param_slots})
       AS s {cols_list_query}
       ) AS Source
       ON Target.{primary_key_col}=Source.{primary_key_col}
       WHEN NOT MATCHED THEN
       INSERT {cols_list_query} VALUES {sr_cols_list_query} 
       WHEN MATCHED THEN
       UPDATE SET {up_cols_list_query};
       '''

# execute the command to merge tables
with engine.begin() as conn:
    conn.execute(cmd, params)

This method is also better if you are inserting strings with characters that aren't compatible with SQL insert text (such as apostrophes, which mess up the insert statement), since it lets the connection engine handle the parameterized values (which also makes it safer against SQL injection attacks).
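
For example, a value containing an apostrophe (such as "O'Brien") would break a statement assembled with .format, but passes through cleanly as a parameter. A minimal sketch with plain pyodbc, reusing the illustrative my_table / ID / col_a / col_b names from the first answer (the row values here are assumptions for illustration):

row = (4, "O'Brien", 7)  # illustrative values; col_a is assumed to hold text here
cursor.execute(
    'MERGE INTO my_table AS Target '
    'USING (SELECT * FROM (VALUES (?, ?, ?)) AS s (ID, col_a, col_b)) AS Source '
    'ON Target.ID=Source.ID '
    'WHEN NOT MATCHED THEN '
    'INSERT (ID, col_a, col_b) VALUES (Source.ID, Source.col_a, Source.col_b) '
    'WHEN MATCHED THEN '
    'UPDATE SET col_a=Source.col_a, col_b=Source.col_b;',
    row)
cursor.commit()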

For reference, I'm creating the engine connection using this code - you'll obviously need to adapt it to your server/database/environment and whether or not you want fast_executemany:

import urllib
import pyodbc
pyodbc.pooling = False
import sqlalchemy

terms = urllib.parse.quote_plus(
            'DRIVER={SQL Server Native Client 11.0};'
            'SERVER=<your server>;'
            'DATABASE=<your database>;'
            'Trusted_Connection=yes;'  # to log on using Windows credentials
)

url = f'mssql+pyodbc:///?odbc_connect={terms}'
engine = sqlalchemy.create_engine(url, fast_executemany=True)

EDIT: I realized that this code does not actually make use of the "placeholder" table at all; it just copies values directly from the dataframe rows by way of the parameterized command.
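
If you do want to route the rows through the staging table instead, a minimal sketch of that variant (my own assumption about how it could be done, not something the original answer does; it reuses table_name_to_update, table_name_to_transfer and the column-list variables defined above) could look like this:

# sketch only: MERGE server-side from the staging table that df.to_sql() filled,
# instead of shipping the rows again as query parameters
staging_cmd = f'''
       MERGE INTO {table_name_to_update} as Target
       USING {table_name_to_transfer} as Source
       ON Target.{primary_key_col}=Source.{primary_key_col}
       WHEN NOT MATCHED THEN
       INSERT {cols_list_query} VALUES {sr_cols_list_query}
       WHEN MATCHED THEN
       UPDATE SET {up_cols_list_query};
       '''

with engine.begin() as conn:
    conn.execute(sqlalchemy.text(staging_cmd))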
