
Python - Best approach for making over 100,000 requests and storing to database

I'm scraping data from many different requests.

Up until now, I've used multithreading and the requests library to retrieve the necessary data and then load it into an SQLite database, with approximately the following approach:

from multiprocessing.dummy import Pool  # thread-based Pool, matching the multithreading described above

p = Pool(processes=8)
for x in range(start_1, end_1):
    for y in range(start_2, end_2):
        entry_list = p.starmap(get_data, [(x, y, z) for z in range(start, end)])
        # get_data makes the request and returns a tuple of (x, y, z, data)
        for entry in entry_list:
            cur.execute('''INSERT INTO Database (attrib_1, attrib_2, attrib_3, data)
                           VALUES (?, ?, ?, ?)''', entry)

This approach is very slow (it will take days to make all of the requests on my machine). After doing a little research, I have seen that there are alternatives to multithreading for this kind of problem, such as asynchronous requests. Unfortunately, I don't know anything about this approach or whether it's appropriate, much less how to implement it.

Any advice on how to complete this task efficiently would be greatly appreciated.

Since your program is I/O bound, look at event loops. True multi-threading is broken in Python because of the global interpreter lock (GIL).

Look at asyncio (available since Python 3.4) and/or Twisted.
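To make the event-loop suggestion concrete, here is a minimal sketch using only the standard library. The network call is simulated with `asyncio.sleep` (in a real scraper you would swap in an async HTTP client such as aiohttp); the `get_data` signature, table name, and column names are illustrative, not the asker's actual code. A semaphore caps how many requests are in flight at once, and the results are written to SQLite in one batched `executemany` rather than one `execute` per row:

```python
import asyncio
import sqlite3

async def get_data(x, y, z, sem):
    async with sem:  # semaphore caps the number of in-flight requests
        await asyncio.sleep(0.01)  # stand-in for the actual HTTP request
        return (x, y, z, f"data-{x}-{y}-{z}")

async def fetch_all(coords, limit=100):
    sem = asyncio.Semaphore(limit)
    # Schedule every request at once; the event loop interleaves them
    # while each coroutine waits on I/O, instead of blocking a thread
    # per request. gather preserves the order of the inputs.
    return await asyncio.gather(*(get_data(x, y, z, sem) for x, y, z in coords))

def scrape(coords):
    entries = asyncio.run(fetch_all(coords))
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE entries (attrib_1, attrib_2, attrib_3, data)")
    # Batch all inserts in a single transaction rather than committing row by row
    cur.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", entries)
    conn.commit()
    return conn

coords = [(x, y, z) for x in range(2) for y in range(2) for z in range(5)]
conn = scrape(coords)
print(conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0])  # 20
```

With tens of thousands of URLs, raising the semaphore limit trades memory and server politeness for throughput; the database write is no longer the bottleneck once inserts are batched.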
