
Python ThreadPoolExecutor threads not finishing

I have a script for crawling pages using concurrent.futures.ThreadPoolExecutor on Python 3.8.2. Essentially it crawls a page for links, stores them in SQLite using SQLAlchemy, and then moves on to the next page.

The issue, however, is that the script never finishes. I have verified with two print statements that all the tasks complete, but the script just hangs and never exits. Is there something I have missed regarding how to deal with concurrency and the SQLite sessions?

import requests
from bs4 import BeautifulSoup
from concurrent import futures
from sqlalchemy import create_engine, exists, Column, String
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base


def crawl(link):
    print('Starting: {}'.format(link))
    session = Session()
    html = requests.get(link, timeout=10)
    soup = BeautifulSoup(html.text, 'lxml')

    links = [entry.get('href') for entry in soup.find_all('a',  clazz)]
    for link in links:
        data = {
            'type': self.type,
            'status': self.status,
            'url': link
        }
        if not session.query(exists().where(Links.url == link)).scalar():
            d = Links(**data)
            session.add(d)
            session.commit()

    print('Finished: {}'.format(link))

def main():
    links = ['www.link1.com', 'www.link2', ....]
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        the_futures = [executor.submit(crawl, link) for link in links]
        for future in the_futures:
            try:
                result = future.result()
            except Exception as e:
                print('Thread threw exception:', e)

if __name__ == "__main__":
    engine = create_engine("sqlite:///database.sql")
    Base = declarative_base()

    class Links(Base):
        __tablename__ = 'links'

        url = Column(String, primary_key=True)
        type = Column(String)
        status = Column(String)

    Base.metadata.create_all(engine)

    session_factory = sessionmaker(bind=engine)
    Session = scoped_session(session_factory)

    main()

    Session.remove()

Your call to submit should be:

future = executor.submit(crawl, link)

Not:

executor.submit(crawl(link))

In the first case you are passing to submit a reference to the function and its arguments. In the second case you are first calling the function and then passing its return value to submit, which here is None because crawl has no return statement. You should also save the returned future objects so that you can check for the completion of the threads as they occur:

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    the_futures = []
    for link in links:
        future = executor.submit(crawl, link)
        the_futures.append(future)
    for future in futures.as_completed(the_futures):
        #print(future.result()) # result is None in this case
        pass

Or more "Pythonically":

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    the_futures = [executor.submit(crawl, link) for link in links]
    for future in futures.as_completed(the_futures):
        pass
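If crawl returned something useful, as_completed would hand you each future as soon as its work finishes, regardless of submission order. A minimal sketch with a stand-in worker (fetch is made up for illustration, not part of the original code):

import time
from concurrent import futures

def fetch(delay):
    # stand-in for crawl: sleep for `delay` seconds, then return it
    time.sleep(delay)
    return delay

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    the_futures = [executor.submit(fetch, d) for d in (3, 1, 2)]
    for future in futures.as_completed(the_futures):
        print(future.result())  # prints 1, 2, 3 (completion order, not submission order)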

Also note that I am creating the executor with a context manager so that any necessary cleanup is done when the block terminates: a call to shutdown is made, which waits until all futures have completed (although here I am already explicitly waiting for the futures to complete before exiting the block).
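For reference, the with block is roughly equivalent to creating the executor and calling shutdown yourself. A minimal sketch, assuming crawl and links from the code above:

executor = futures.ThreadPoolExecutor(max_workers=4)
the_futures = [executor.submit(crawl, link) for link in links]
for future in futures.as_completed(the_futures):
    pass
executor.shutdown(wait=True)  # this is what the context manager calls on exit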

If you cared that the results come back in the order the tasks were submitted (you wouldn't in this case, since the results are always None):

with futures.ThreadPoolExecutor(max_workers=4) as executor: 
    for result in executor.map(crawl, links):
        #print(result) # None in this case
        pass

The executor.map function is, however, less convenient when you want to collect all the results and one or more of the threads might throw an exception: the map iterator re-raises the exception when it reaches the failed call, so you cannot retrieve the results of any later call, even if you wrap the iteration in a try/except block. It is also more awkward to use when the function you are invoking takes more than one argument. In those cases it is usually best to work with the futures directly:

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    the_futures = [executor.submit(crawl, link) for link in links]
    for future in the_futures:
        try:
            result = future.result()  # re-raises any exception thrown by the thread
            print(result)
        except Exception as e:
            print('Thread threw exception:', e)
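To illustrate the executor.map drawback described above, here is a small self-contained sketch with a made-up worker (work and its inputs are hypothetical, not from the original code):

from concurrent import futures

def work(n):
    # hypothetical worker that fails for one input
    if n == 2:
        raise ValueError('boom')
    return n * n

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(work, [1, 2, 3])
    try:
        for result in results:
            print(result)  # prints 1, then the ValueError is re-raised here
    except Exception as e:
        print('map stopped at the first exception:', e)
        # the result for 3 can no longer be retrieved from this iterator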

And with all of the above, I am still not sure why your program did not terminate. One thing is sure: You were not multithreading.
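A quick way to check whether the work actually runs in worker threads is to log the thread name. A minimal sketch with a stand-in function (crawl_stub is hypothetical, not in the original script):

import threading
from concurrent import futures

def crawl_stub(link):
    # with executor.submit(crawl_stub, link) this prints worker-thread names such as
    # ThreadPoolExecutor-0_0; with executor.submit(crawl_stub(link)) all the work would
    # already have run in MainThread before submit is even called
    print(threading.current_thread().name, link)

links = ['www.link1.com', 'www.link2.com']
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for link in links:
        executor.submit(crawl_stub, link)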
