How to append a value if the primary key or id is the same in Postgres using Python
I'm trying to insert around 50 million records into PostgreSQL using a Python script. I'm completely new to both PostgreSQL and Python. I tried the code below and I'm facing one challenge: my input file contains key-value pairs like the ones shown, and if the same key appears twice in the file, I want to append the new value to the existing one. I'm not sure how to do that in Python. Can someone please help?
myfile.txt
key1 item1,product1,model1,price1|
key2 item2,product2,model2,price2|
key3 item3,product3,model3,price3|
key4 item4,product4,model4,price4|
key2 item22,product22,model22,price22|
In this case key2 has two records; while inserting into the DB, I have to append the second value to the first one.
Expected table contents:
key value
key1 item1,product1,model1,price1|
key2 item2,product2,model2,price2|item22,product22,model22,price22|
key3 item3,product3,model3,price3|
key4 item4,product4,model4,price4|
insert.py
import psycopg2

def insertToDB(fileName):
    conn = psycopg2.connect("dbname='mydb' user='testuser' host='localhost'")
    with open(fileName) as f:
        for line in f:
            key, value = line.split(' ', 1)
            cursor = conn.cursor()
            query = "INSERT INTO mytable (key,value) VALUES (%s,%s);"
            data = (key, value)
            cursor.execute(query, data)
    conn.commit()

insertToDB('myfile.txt')
I have around 50 million records, and most keys may repeat with different values. How should I handle that, and how can I write to the DB efficiently? It would be really helpful if someone could suggest improvements. Thank you!
The easiest way is to use the ON CONFLICT clause of the SQL INSERT statement. This changes your simple insert into an "upsert" (insert or update). ON CONFLICT requires PostgreSQL version 9.5 or greater, and is used like this:
query = """INSERT INTO mytable (key, value)
           VALUES (%s, %s)
           ON CONFLICT (key)
           DO UPDATE SET value = CONCAT(mytable.value, %s);"""
cursor.execute(query, (key, value, value))
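Note that ON CONFLICT (key) only works if the key column has a unique constraint or unique index; otherwise PostgreSQL rejects the statement, complaining that no constraint matches the conflict target. A minimal sketch of adding one (the constraint name here is illustrative):

```sql
-- Hypothetical DDL; adjust the constraint name to your schema.
ALTER TABLE mytable ADD CONSTRAINT mytable_key_unique UNIQUE (key);
```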
The other option is to concatenate your results before you send them to the database by refactoring your data. Here I am collecting all rows by key in a dictionary, and then joining all the values together when inserting. This way, you only have one insert per key. Here is some code to illustrate this:
from collections import defaultdict
import psycopg2

def get_records(filename):
    records = defaultdict(list)
    with open(filename) as f:
        for line in f:
            if line.strip():
                key, value = line.split(' ', 1)
                records[key].append(value)
    return records

def insert_records(records, conn):
    q = "INSERT INTO mytable (key, value) VALUES (%s, %s);"
    cursor = conn.cursor()
    for key, data in records.items():
        cursor.execute(q, (key, ''.join(data)))
    conn.commit()

conn = psycopg2.connect("dbname='mydb' user='testuser' host='localhost'")
insert_records(get_records('myfile.txt'), conn)
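With ~50 million rows, one execute() round trip per row will dominate the runtime. One way to reduce that is psycopg2's execute_values helper, which expands a single VALUES %s placeholder into many-row batches. The sketch below reuses the aggregate-in-memory idea; build_rows and bulk_insert are illustrative names, and the database call assumes psycopg2 is installed:

```python
from collections import defaultdict

def build_rows(lines):
    """Aggregate duplicate keys in memory: one (key, value) row per key."""
    records = defaultdict(list)
    for line in lines:
        if line.strip():
            key, value = line.split(' ', 1)
            records[key].append(value.rstrip('\n'))
    return [(k, ''.join(v)) for k, v in records.items()]

def bulk_insert(rows, conn, page_size=1000):
    """Insert many rows per statement instead of one INSERT per row."""
    # Imported here so build_rows stays usable without psycopg2 installed.
    from psycopg2.extras import execute_values
    with conn.cursor() as cur:
        # execute_values rewrites the single %s into multi-row VALUES
        # lists of up to page_size rows, cutting round trips drastically.
        execute_values(cur,
                       "INSERT INTO mytable (key, value) VALUES %s",
                       rows, page_size=page_size)
    conn.commit()
```

Tuning page_size trades statement size against the number of round trips; the default of 100 is often too small for a load of this size.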
If you have a very large number of records, you may exhaust memory by loading the entire file at once. Instead, you can implement a simpler algorithm that keeps track of which keys have already been read.
def insert_records(filename, conn):
    seen = set()
    cursor = conn.cursor()
    qi = "INSERT INTO mytable (key, value) VALUES (%s, %s);"
    qu = "UPDATE mytable SET value = CONCAT(value, %s) WHERE key = %s;"
    with open(filename) as f:
        for line in f:
            if line.strip():
                key, value = line.split(' ', 1)
                if key not in seen:
                    # first time we see this key, do an insert
                    seen.add(key)
                    cursor.execute(qi, (key, value))
                else:
                    # key has been processed at least once, do an update
                    cursor.execute(qu, (value, key))
    conn.commit()

conn = psycopg2.connect("dbname='mydb' user='testuser' host='localhost'")
insert_records('myfile.txt', conn)
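For a one-off bulk load of this size, it may also be worth bypassing INSERT entirely and using PostgreSQL's COPY path via psycopg2's copy_from, which is typically the fastest way to load data. This variant trades memory for speed again (it aggregates in a dictionary first) and assumes keys and values never contain tab, newline, or backslash characters, since those are COPY's default delimiters and escape; aggregate and copy_records are illustrative names:

```python
import io
from collections import defaultdict

def aggregate(filename):
    """One pass over the file, concatenating values for duplicate keys."""
    records = defaultdict(list)
    with open(filename) as f:
        for line in f:
            if line.strip():
                key, value = line.split(' ', 1)
                records[key].append(value.rstrip('\n'))
    return records

def copy_records(records, conn):
    """Stream tab-separated rows through COPY, PostgreSQL's bulk-load path."""
    buf = io.StringIO()
    for key, values in records.items():
        buf.write(f"{key}\t{''.join(values)}\n")
    buf.seek(0)
    with conn.cursor() as cur:
        # copy_from reads tab-separated lines by default.
        cur.copy_from(buf, 'mytable', columns=('key', 'value'))
    conn.commit()
```

If the aggregated data is too large for one in-memory buffer, the same idea works by flushing the buffer to copy_from in chunks of distinct keys.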