
Python ThreadPoolExecutor threads not finishing

I have a script for crawling pages with concurrent.futures.ThreadPoolExecutor on Python 3.8.2. Essentially it crawls a page for links, stores them in sqlite using sqlalchemy, and then moves on to the next page.

I have an issue, however, in that the script never finishes. I have verified with two print statements that all the tasks complete, but the script just hangs and never exits. Is there something I have missed about how to handle concurrency and the sqlite sessions?

from concurrent import futures

import requests
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, Column, String, exists
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base


def crawl(link):
    print('Starting: {}'.format(link))
    session = Session()
    html = requests.get(link, timeout=10)
    soup = BeautifulSoup(html.text, 'lxml')

    # the original filtered anchors by a CSS class; collect all hrefs here
    hrefs = [entry.get('href') for entry in soup.find_all('a')]
    for href in hrefs:
        data = {
            'type': 'listing',   # placeholder values; the original referenced
            'status': 'new',     # self.type / self.status, which do not exist here
            'url': href
        }
        if not session.query(exists().where(Links.url == href)).scalar():
            session.add(Links(**data))
            session.commit()

    print('Finished: {}'.format(link))

def main():
    links = ['www.link1.com', 'www.link2.com', ...]
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        the_futures = [executor.submit(crawl, link) for link in links]
        for future in the_futures:
            try:
                result = future.result()
            except Exception as e:
                print('Thread threw exception:', e)

if __name__ == "__main__":
    engine = create_engine("sqlite:///database.sql")
    Base = declarative_base()

    class Links(Base):
        __tablename__ = 'links'

        url = Column(String, primary_key=True)
        type = Column(String)
        status = Column(String)

    Base.metadata.create_all(engine)

    session_factory = sessionmaker(bind=engine)
    Session = scoped_session(session_factory)

    main()

    Session.remove()

Your call to submit should be:

future = executor.submit(crawl, link)

Not:

executor.submit(crawl(link))

In the first case you are passing submit a reference to a function plus its arguments. In the second case you are first calling the function and then passing submit the return value of that call, which appears to be None. You should then save the returned Future objects, and you can test for the completion of the threads as they occur, thus:

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    the_futures = []
    for link in links:
        future = executor.submit(crawl, link)
        the_futures.append(future)
    for future in futures.as_completed(the_futures):
        #print(future.result()) # result is None in this case
        pass

Or more "Pythonically":

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    the_futures = [executor.submit(crawl, link) for link in links]
    for future in futures.as_completed(the_futures):
        pass

Also note that I am creating the executor with a context manager, so that any necessary cleanup is done when the block terminates: a call to shutdown is made, which waits until all futures have completed (although here I am also explicitly waiting for the futures to complete before exiting the block).
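For reference, here is a minimal sketch of what the context manager buys you, assuming the same crawl function and links list as above (this is not a recommendation to drop the with statement):

executor = futures.ThreadPoolExecutor(max_workers=4)
try:
    the_futures = [executor.submit(crawl, link) for link in links]
finally:
    # what exiting the `with` block does implicitly: block until
    # every submitted task has finished
    executor.shutdown(wait=True)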

If you cared that the results are returned in the order of creation (you wouldn't in this case, since the results returned are always None):

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    for result in executor.map(crawl, links):
        #print(result) # None in this case
        pass

The above executor.map function is, however, not that convenient when you want to obtain all the results and one or more of the threads might throw an exception, because you cannot retrieve a result from any thread beyond the first one that threw (even if you wrap the retrieval in a try/except block). It is also more complicated to use when the function you are invoking takes more than one argument. So in those cases it is probably best to use futures:

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    the_futures = [executor.submit(crawl, link) for link in links]
    for future in the_futures:
        try:
            result = future.result() # could throw if the thread threw an exception
            print(result)
        except Exception as e:
            print('Thread threw exception:', e)
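On the multiple-argument point, a minimal sketch (the extra timeout parameter is purely illustrative, not part of the original code): executor.map accepts one iterable per parameter of the function, or you can bind the extra argument up front with functools.partial:

from functools import partial

def crawl(link, timeout):  # a variant of crawl taking a second argument
    ...

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # one iterable per parameter of the function...
    results = list(executor.map(crawl, links, [10] * len(links)))
    # ...or bind the extra argument up front
    results = list(executor.map(partial(crawl, timeout=10), links))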

And with all of the above, I am still not sure why your program did not terminate. One thing is sure: you were not multithreading.
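To see that last point concretely, here is a small self-contained sketch (the link strings are dummies): with executor.submit(crawl(link)), crawl runs to completion on the main thread before submit is even called, and submit then receives its None return value:

from concurrent import futures
import threading

def crawl(link):
    print(link, 'handled by', threading.current_thread().name)

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # wrong: crawl runs eagerly and prints 'MainThread'; submit(None) then
    # stores a TypeError on a future that is never examined
    executor.submit(crawl('www.link1.com'))
    # right: crawl runs on a worker thread and prints that worker's name
    executor.submit(crawl, 'www.link2.com')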
