
best way to upsert 300 million entries into postgres?

I have a new csv file every day with 400 million+ entries which I need to upsert into my database (3 tables with 2 foreign keys, indexed). The majority of the entries are already in the table, in which case I need to update a column. Some entries, which are not already in the table, need to be inserted.

I tried to insert the CSV each day into a temp table (temptable) and then run:

INSERT INTO restaurants (name, food_id, street_id, datecreated, lastdayobservedopen)
SELECT DISTINCT temptable.name, typesoffood.food_id, location.street_id,
       temptable.datecreated, temptable.lastdayobservedopen
FROM temptable
INNER JOIN typesoffood ON typesoffood.food_type = temptable.food_type
INNER JOIN location ON location.street_name = temptable.street_name
ON CONFLICT ON CONSTRAINT restaurants_pk
DO UPDATE SET lastdayobservedopen = EXCLUDED.lastdayobservedopen

But it takes over 6 hrs.

Is it possible to make this faster?

Edit:

Some more details: 3 tables: restaurants(name, food_id, street_id, datecreated, lastdayobservedopen) with pk (name, street_id) and fks (food_id and street_id); typesoffood(food_id, food_type) with pk (food_id) and index on food_type; location(street_id, street_name) with pk (street_id) and index on street_name.

As for the csv file, I don't know which are new or old entries, but I do know that the majority of the entries are already in the database, which would require me to update the lastdayobserved date. The rest are to be inserted with the lastdayobserved date as today. This is supposed to help distinguish between restaurants that are no longer in operation (in which case their lastdayobserved column would not be updated) and currently operating restaurants, whose date in that column should always match today's date. Open to more efficient schema suggestions as well. Thanks to all!
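For reference, the schema described above could be written out as DDL along these lines. This is only a reconstruction from the question's description; the column types and NOT NULL constraints are assumptions, since the question does not specify them:

```sql
-- Sketch of the described schema; types are guesses, not from the question.
CREATE TABLE typesoffood (
    food_id   serial PRIMARY KEY,
    food_type text NOT NULL
);
CREATE INDEX ON typesoffood (food_type);

CREATE TABLE location (
    street_id   serial PRIMARY KEY,
    street_name text NOT NULL
);
CREATE INDEX ON location (street_name);

CREATE TABLE restaurants (
    name                text NOT NULL,
    food_id             int  NOT NULL REFERENCES typesoffood (food_id),
    street_id           int  NOT NULL REFERENCES location (street_id),
    datecreated         date,
    lastdayobservedopen date,
    CONSTRAINT restaurants_pk PRIMARY KEY (name, street_id)
);
```

The composite primary key (name, street_id) is what the ON CONFLICT clause in the query above targets via the restaurants_pk constraint.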

There is a command in SQL called bulk insert that can handle large volumes of data (note that BULK INSERT is SQL Server syntax; the Postgres equivalent is the COPY command):

bulk insert #temp
from "file location path"

If you can change your Postgres settings, you could take advantage of parallelism in Postgres. Otherwise you could at least speed up the csv upload using Postgres's bulk upload, otherwise known as the COPY command.
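Loading the daily file with COPY instead of row-by-row inserts could look like the sketch below. The file path and the column list are assumptions (they would need to match the actual CSV layout):

```sql
-- Clear out yesterday's load, then bulk-load the new CSV.
-- '/path/to/daily.csv' is a placeholder path on the database server.
TRUNCATE temptable;
COPY temptable (name, food_type, street_name, datecreated, lastdayobservedopen)
FROM '/path/to/daily.csv'
WITH (FORMAT csv, HEADER true);
```

COPY reads the file server-side; when the file lives on a client machine, psql's \copy meta-command does the same thing over the client connection.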

Without more details it's hard to give better advice.
