Most of the time the number of webpages I have to scrape is below 100, so a simple for loop parses them in a reasonable time. But now I have to parse over 1000 webpages.
Searching for a way to speed this up, I found that threading might help. I have watched and read some tutorials and I believe I understand the general logic.
I know that if I have 100 webpages, I could create 100 threads, but that's not recommended, especially for a very large number of webpages. What I haven't figured out is how, for example, I can create 5 threads and have each thread handle 200 webpages.
Below is a simple code sample using threading and Selenium:
import threading
from queue import Queue

from selenium import webdriver

def parse_page(page_url):
    # Note: PhantomJS is deprecated in newer Selenium releases;
    # headless Chrome/Firefox is the modern alternative.
    driver = webdriver.PhantomJS()
    driver.get(page_url)  # was driver.get(url): wrong variable name
    text = driver.page_source
    # ..........
    driver.quit()
    return parsed_items

def threader():
    # each worker loops, pulling URLs off the queue until the program exits
    while True:
        page_url = q.get()
        parse_page(page_url)
        q.task_done()  # was q.task_one(): no such method

urls = [.......]
q = Queue()

# start a fixed pool of 5 worker threads (not one thread per URL)
for x in range(5):
    t = threading.Thread(target=threader)
    t.daemon = True
    t.start()

# feed the URLs (not bare worker numbers) into the queue
for page_url in urls:
    q.put(page_url)

q.join()  # block until every queued URL has been processed
Another thing I am not clear on, and which shows in the code sample above, is how to pass arguments to a thread.
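For reference, arguments are passed to a thread through the `args` (and `kwargs`) parameters of `threading.Thread`. A minimal sketch below (the URL list and the parsing body are placeholders) splits 1000 URLs into 5 chunks of 200 and hands each thread its own chunk as an argument:

```python
import threading

results = []  # list.append is atomic in CPython, so this is safe here

def parse_pages(url_chunk):
    # each worker handles its own slice of the URL list
    for page_url in url_chunk:
        results.append(page_url)  # stand-in for the real parsing work

urls = ["http://example.com/page/%d" % i for i in range(1000)]  # placeholder URLs
n_threads = 5
chunk_size = len(urls) // n_threads  # 200 URLs per thread

threads = []
for i in range(n_threads):
    chunk = urls[i * chunk_size:(i + 1) * chunk_size]
    # args must be a tuple, hence the trailing comma
    t = threading.Thread(target=parse_pages, args=(chunk,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait for all workers to finish
```

The same idea works with any target function: whatever you put in the `args` tuple arrives as the function's positional parameters.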
Probably the simplest way is to use ThreadPool from the multiprocessing.pool module or, if you are on Python 3, ThreadPoolExecutor from the concurrent.futures module. ThreadPool has (almost) the same API as the regular Pool but uses threads instead of processes, e.g.:
from multiprocessing.pool import ThreadPool

def f(i):
    return i * i

pool = ThreadPool(processes=10)
res = pool.map(f, [2, 3, 4, 5])
print(res)  # [4, 9, 16, 25]
And for ThreadPoolExecutor, check this example.
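The linked example is not reproduced here, but as a minimal sketch, the same map over a thread pool with concurrent.futures.ThreadPoolExecutor looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):
    return i * i

# the with-block shuts the pool down cleanly when the work is done
with ThreadPoolExecutor(max_workers=10) as executor:
    res = list(executor.map(f, [2, 3, 4, 5]))

print(res)  # [4, 9, 16, 25]
```

executor.map returns a lazy iterator rather than a list, hence the list() call; it also has a submit() method that returns a Future per task if you need per-task error handling.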