
Multiprocessing loading of data and committing to sqlalchemy

I'm loading data from files. I have a lot of files, so I have a few processes loading lists of files:

    import concurrent.futures

    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        for x, y, z in executor.map(load_my_file, path_list):
            ...  # merge each file's returned user/post dicts into the combined ones

load_my_file loads the data, stores "USERS" and "POSTS" in two dicts and returns them; the results are merged into one dict each for users and posts, and then I bulk commit them.

Each user may have many posts, but in the file each record is one post and one user together. This is the reason behind the dicts: so I don't get primary key duplicates on insert with sqlalchemy.
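A minimal sketch of what such a loader might look like, keying each dict by primary key so a duplicate user simply overwrites the earlier entry (parse_record and the field names here are hypothetical placeholders):

    def load_my_file(path):
        users, posts = {}, {}
        with open(path) as f:
            for line in f:
                rec = parse_record(line)  # hypothetical parser for one "user + post" record
                # dicts keyed by primary key, so repeated users collapse to a single entry
                users[rec["user_id"]] = {"id": rec["user_id"], "name": rec["user_name"]}
                posts[rec["post_id"]] = {"id": rec["post_id"],
                                         "user_id": rec["user_id"],
                                         "body": rec["post_body"]}
        return users, posts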

However, this uses up a lot of memory. I have around 1.6 million records and 600k users, and my python program is using a HUGE amount of memory (more than my 16 GB of RAM allows).

I looked into using session.merge, but it seems to query the database every time I call it, making the process extremely slow. Is there any other way around this? I want to be able to make commits within each process rather than merging everything into one big dict at the end, but I don't want to break any relationships or get primary key errors.

It's pretty strange that parallel loading of 80 local files is much faster than loading them one at a time, though I could suggest some reasons for it.

But, OK. You can import the data into a temporary denormalized table as is. After that, copy the data into the target normalized tables using SQL queries, then drop the temporary table (or truncate it if you need it on a regular basis). Also, look at your SQLAlchemy queries: merge is not the only thing that degrades performance. "Massive" inserts via add_all are not really turned into a single INSERT. You would instead use an insert query with a list of dicts; see: I'm inserting 400,000 rows with the ORM and it's really slow!
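A rough sketch of both ideas, assuming a denormalized staging table and a normalized users table (the table and column names are made up, and list_of_record_dicts stands for the merged data from the files):

    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            Integer, String, text)

    engine = create_engine("sqlite:///example.db")  # assumed connection URL
    metadata = MetaData()

    # denormalized staging table: one row per "user + post" record, as in the files
    staging = Table(
        "staging_records", metadata,
        Column("user_id", Integer),
        Column("user_name", String),
        Column("post_id", Integer),
        Column("post_body", String),
    )
    metadata.create_all(engine)

    with engine.begin() as conn:
        # a single executemany-style INSERT from a list of dicts,
        # instead of session.add_all() on thousands of ORM objects
        conn.execute(staging.insert(), list_of_record_dicts)
        # then normalize with plain SQL; SELECT DISTINCT does the deduplication
        # that the in-memory dicts were doing before
        conn.execute(text(
            "INSERT INTO users (id, name) "
            "SELECT DISTINCT user_id, user_name FROM staging_records"
        ))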

"I looked into using session.merge but it seems to query the database every time I call it"

It's even worse: it has to check whether the record exists (first query) and then insert or update it (second query). So it looks questionable to use it for processing large volumes of data.
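If the target database supports native upserts (PostgreSQL in this hedged sketch; users_table, user_rows and engine are assumed to already exist), a single statement can replace merge for a bulk load like this:

    from sqlalchemy.dialects.postgresql import insert as pg_insert

    # user_rows: list of dicts like {"id": ..., "name": ...}
    stmt = pg_insert(users_table).values(user_rows)
    stmt = stmt.on_conflict_do_nothing(index_elements=["id"])

    with engine.begin() as conn:
        conn.execute(stmt)  # rows whose "id" already exists are skipped, no per-row SELECT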
