
How can I use threading to parse multiple webpages in Python?

Most of the time the number of webpages I have to scrape is below 100, so a simple for loop parses them in a reasonable time. But now I have to parse over 1000 webpages.

Searching for a way to do this, I found that threading might help. I have watched and read some tutorials, and I believe I understand the general logic.

I know that if I have 100 webpages, I can create 100 threads, but that's not recommended, especially for a very large number of webpages. What I haven't really figured out is how to create, for example, 5 threads with 200 webpages assigned to each.

Below is a simple code sample using threading and Selenium:

import threading
from queue import Queue  # this import was missing
from selenium import webdriver

def parse_page(page_url):
    driver = webdriver.PhantomJS()
    driver.get(page_url)  # was driver.get(url): 'url' is undefined here
    text = driver.page_source
    ..........
    return parsed_items

def threader():
    page_url = q.get()
    parse_page(page_url)
    q.task_done()  # was q.task_one(), which does not exist

urls = [.......]
q = Queue()

for x in range(len(urls)):
    t = threading.Thread(target=threader)
    t.daemon = True
    t.start()

for url in urls:  # was putting the numbers 0..19 on the queue instead of the URLs
    q.put(url)

q.join()

Another thing that I am not clear on, and it shows in the code sample above, is how to pass arguments to a thread.
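For reference, the queue-based pattern the question is reaching for can be sketched like this. It is a minimal example with `parse_page` stubbed out (the real function would drive Selenium as above), and the `worker` function, sentinel values, and example URLs are my own illustration, not part of the original code:

```python
import threading
from queue import Queue

def parse_page(page_url):
    # Stub for the real Selenium scraping logic.
    return len(page_url)

def worker(q, results):
    # Each thread repeatedly pulls a URL from the shared queue
    # until it receives the None sentinel.
    while True:
        page_url = q.get()
        if page_url is None:
            q.task_done()
            break
        results.append(parse_page(page_url))
        q.task_done()

urls = ["https://example.com/page/%d" % i for i in range(1000)]
q = Queue()
results = []

# 5 threads share one queue; each ends up handling roughly
# 200 of the 1000 URLs, without assigning chunks by hand.
threads = []
for _ in range(5):
    t = threading.Thread(target=worker, args=(q, results))
    t.start()
    threads.append(t)

for url in urls:
    q.put(url)
for _ in range(5):
    q.put(None)  # one sentinel per thread

q.join()
for t in threads:
    t.join()

print(len(results))  # 1000
```

Note that arguments are passed to a thread via the `args` tuple of `threading.Thread`, which is the part the original `threader()` was missing.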

Probably the simplest way is to use ThreadPool from the multiprocessing.pool module or, if you are on Python 3, ThreadPoolExecutor from the concurrent.futures module.

ThreadPool has (almost) the same API as the regular Pool, but uses threads instead of processes.

For example:

from multiprocessing.pool import ThreadPool

def f(i):
    return i * i

pool = ThreadPool(processes=10)
res = pool.map(f, [2, 3, 4, 5])
print(res)  # [4, 9, 16, 25]

And for ThreadPoolExecutor, check this example.
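The equivalent with ThreadPoolExecutor might look like the following sketch. Here `parse_page` is a stand-in for the real scraping function and the URLs are placeholders; `max_workers=5` caps the pool at 5 threads regardless of how many URLs are queued:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page(url):
    # Placeholder for the real Selenium scraping logic.
    return len(url)

urls = ["https://example.com/a", "https://example.com/b"]

# The executor schedules the calls across at most 5 threads and
# map() returns the results in the same order as the input URLs.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(parse_page, urls))

print(results)
```

Using the executor as a context manager waits for all submitted work to finish before the `with` block exits, so no manual join is needed.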
