How to use pandas.to_sql but only add rows that do not already exist

Question

I have some experience with Python but am very new to SQL. I am trying to use pandas.to_sql to add table data to my database, and I want it to check whether the data already exists before appending.

Here are my 2 DataFrames:

>>> df0.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |

>>> df1.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |     00 |
|  1 |      1 |     11 |
|  2 |      2 |     22 |

Here I use pandas to_sql:

>>> df0.to_sql(con=con, name='test_db', if_exists='append', index=False)
>>> df1.to_sql(con=con, name='test_db', if_exists='append', index=False)

Then I check the data in the database file:

>>> df_out = pd.read_sql("""SELECT * FROM test_db""", con)
>>> df_out.to_markdown()
|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  2 |      0 |      0 | # Duplicate
|  3 |      1 |     11 | # Duplicate
|  4 |      2 |     22 |

But I want my database to look like this; I do not want duplicate data added to it:

|    |   Col1 |   Col2 |
|---:|-------:|-------:|
|  0 |      0 |      0 |
|  1 |      1 |     11 |
|  3 |      2 |     22 |

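Is there an option I can set, or some code I can add, to make this happen?

Thank you!

Edit: there is SQL that pulls out only the unique rows, but what I want is for the duplicate data not to be added to the database in the first place.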

Answer 1

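Do not use to_sql; a simple query will work. Note that this is PostgreSQL syntax and assumes test_db has a primary key constraint named test_db_pkey.

from sqlalchemy import text

# Render each row as a "(v1, v2)" literal; rows that hit the primary key are skipped.
values = ','.join(str(row) for row in df0.to_records(index=False))
query = text(f"INSERT INTO test_db VALUES {values} ON CONFLICT ON CONSTRAINT test_db_pkey DO NOTHING")

# engine.begin() commits on success and closes the connection afterwards.
with self.engine.begin() as conn:
    conn.execute(query)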

Repeat this for each DataFrame, changing df0 to df1.

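For anyone who wants to try the same idea without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes that (Col1, Col2) together identify a row, so the table is created with a UNIQUE constraint on them (that constraint is my assumption, it is not in the question). It uses SQLite's `INSERT OR IGNORE`, which plays the same role as `ON CONFLICT ... DO NOTHING`, and it passes the values as bound parameters instead of formatting them into the SQL string. The helper name `insert_new_rows` is hypothetical; with pandas, the row tuples could come from `df.itertuples(index=False, name=None)`.

```python
import sqlite3


def insert_new_rows(con, rows):
    """Insert rows into test_db, skipping any that violate the UNIQUE constraint."""
    con.executemany(
        "INSERT OR IGNORE INTO test_db (Col1, Col2) VALUES (?, ?)", rows
    )
    con.commit()


con = sqlite3.connect(":memory:")
# Assumption: a (Col1, Col2) pair identifies a row.
con.execute(
    "CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER, UNIQUE (Col1, Col2))"
)

rows_df0 = [(0, 0), (1, 11)]            # stands in for df0
rows_df1 = [(0, 0), (1, 11), (2, 22)]   # stands in for df1

insert_new_rows(con, rows_df0)
insert_new_rows(con, rows_df1)

result = con.execute("SELECT Col1, Col2 FROM test_db ORDER BY Col1").fetchall()
print(result)  # [(0, 0), (1, 11), (2, 22)]
```

The constraint does the duplicate check inside the database, so the Python side never has to read the table first.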

Answer 2

For example, with my SQLite 3 database I used a temporary table:

I insert all the DataFrame data into the temporary table:

df0.to_sql(con=con, name='temptable', if_exists='append', index=False)
df1.to_sql(con=con, name='temptable', if_exists='append', index=False)

Then I copy over only the new data and drop the temporary table:

con.executescript('''
INSERT INTO test_db
SELECT DISTINCT temptable.* FROM temptable
LEFT JOIN test_db ON
   test_db.Col1 = temptable.Col1
WHERE test_db.Col1 IS NULL; -- only rows that are not yet present in 'test_db'

DROP TABLE temptable;
''')
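The staging-table approach can also be tried end to end with nothing but the standard library. The following is only a sketch under a few assumptions: the helper name `append_via_staging` and the table name `staging` are hypothetical, the rows are plain tuples standing in for the DataFrames, and the join compares both Col1 and Col2 (the answer joins on Col1 only, which is enough if Col1 alone is the key).

```python
import sqlite3


def append_via_staging(con, rows):
    """Load rows into a staging table, then copy only the unseen ones to test_db."""
    con.execute("CREATE TEMP TABLE staging (Col1 INTEGER, Col2 INTEGER)")
    con.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    con.execute(
        """
        INSERT INTO test_db (Col1, Col2)
        SELECT DISTINCT s.Col1, s.Col2
        FROM staging AS s
        LEFT JOIN test_db AS t
               ON t.Col1 = s.Col1 AND t.Col2 = s.Col2
        WHERE t.Col1 IS NULL  -- no match means the row is new
        """
    )
    con.execute("DROP TABLE staging")
    con.commit()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER)")

# Both frames go through the staging table in one batch.
append_via_staging(con, [(0, 0), (1, 11)] + [(0, 0), (1, 11), (2, 22)])
# A second run with rows that are already stored adds nothing.
append_via_staging(con, [(0, 0), (1, 11), (2, 22)])

rows_in_db = con.execute("SELECT Col1, Col2 FROM test_db ORDER BY Col1").fetchall()
print(rows_in_db)  # [(0, 0), (1, 11), (2, 22)]
```

`SELECT DISTINCT` matters here because the staging table itself holds the overlapping rows of both frames.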

Answer 3

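There are two ways:

If the table in the database is not large, read it into a DataFrame, combine Col1 and Col2 into a new column, combined_column, and save its values to a list, combined_column_list. From df0 and df1, drop the rows whose combined_column already appears in combined_column_list, then insert the remaining rows directly into the database table.

Insert df0 and df1 into a temporary table, for example one named "temp", then run the following code with Python's pymysql: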

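import pymysql

conn = pymysql.connect(host=DB_ip, user=DB_user, passwd=DB_password, db=DB_name)
cur = conn.cursor()
# Copy only the (Col1, Col2) pairs that test_db does not contain yet;
# DISTINCT avoids copying a pair twice when temp holds both df0 and df1.
temp_query = "insert into test_db (select distinct * from temp where (`Col1`, `Col2`) not in (select `Col1`, `Col2` from test_db));"
cur.execute(temp_query)
conn.commit()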

This inserts only the new data into the database table.
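The first way (filter in Python before inserting) might look roughly like the sketch below. It uses the standard-library sqlite3 module and plain tuples so that it runs anywhere; the helper name `filter_new_rows` is hypothetical, and a set of (Col1, Col2) tuples stands in for the combined_column_list described above. With pandas, the tuples could come from `df.itertuples(index=False, name=None)`.

```python
import sqlite3


def filter_new_rows(con, rows):
    """Return the rows whose (Col1, Col2) pair is not yet stored in test_db."""
    existing = set(con.execute("SELECT Col1, Col2 FROM test_db"))
    new_rows = []
    for row in rows:
        if row not in existing:
            new_rows.append(row)
            existing.add(row)  # also guards against duplicates within `rows`
    return new_rows


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER)")

# The two batches stand in for df0 and df1.
for batch in ([(0, 0), (1, 11)], [(0, 0), (1, 11), (2, 22)]):
    to_insert = filter_new_rows(con, batch)
    con.executemany("INSERT INTO test_db VALUES (?, ?)", to_insert)
    con.commit()

stored = con.execute("SELECT Col1, Col2 FROM test_db ORDER BY Col1").fetchall()
print(stored)  # [(0, 0), (1, 11), (2, 22)]
```

This reads the whole table into memory on every call, which is why the answer limits it to tables that are not large.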

Answer 4

Add this code to your function:

remove_duplicate = 'EXEC remove_duplicate'
cursor.execute(remove_duplicate)
cursor.commit()

And create the procedure in your database:

CREATE PROCEDURE remove_duplicate AS
BEGIN
    ;WITH duplicates AS (
        SELECT col1, col2,
               ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS number_of_duplicates
        FROM dbo.test_db
    )
    DELETE FROM duplicates WHERE number_of_duplicates > 1
END
go
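The procedure above is SQL Server specific. As a rough analogue that can be run locally, the sketch below does the same clean-up in SQLite through the standard-library sqlite3 module: it keeps the first copy of every (Col1, Col2) pair and deletes the rest. The use of SQLite and of `rowid` to pick the surviving row are my assumptions, not part of the answer. Like the procedure, it removes duplicates after they were inserted rather than preventing the insert.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_db (Col1 INTEGER, Col2 INTEGER)")
# State of the table after appending df0 and then df1 without any check.
con.executemany(
    "INSERT INTO test_db VALUES (?, ?)",
    [(0, 0), (1, 11), (0, 0), (1, 11), (2, 22)],
)

# Keep the lowest rowid of each (Col1, Col2) group and delete every other copy.
con.execute(
    """
    DELETE FROM test_db
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM test_db GROUP BY Col1, Col2
    )
    """
)
con.commit()

remaining = con.execute("SELECT Col1, Col2 FROM test_db ORDER BY Col1").fetchall()
print(remaining)  # [(0, 0), (1, 11), (2, 22)]
```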


Answer 1 (score 0, 2021-04-06 13:58:40)
Answer 2 (score 0, 2022-04-26 05:13:12)
Answer 3 (score 0, 2022-09-26 09:20:57)
Answer 4 (score -1, 2020-07-23 16:07:09)
