
parallelize beautiful soup scraper in python

I would like to parallelize my scraping script, which is written in Python using Beautiful Soup. Despite reading up on it, I am confused about how to get it to work in my code. What I want to do for now is take a list of links as input and open several browsers/tabs that take these URLs as input. Later, obviously, I want to include my entire code and scrape each of the sites. But I cannot get this first step to work.

Here is my attempt:

Test_links = ['https://www.google.com/maps', 'https://www.google.co.uk/?gfe_rd=cr&dcr=0&ei=3vPNWpTWOu7t8weBlbXACA', 'https://scholar.google.de/']

def get_URL(Link):
    browser = webdriver.Chrome(chrome_options = options)
    browser.get(Link)

if __name__ == '__main__':
    pool = Pool(processes=5)
    pool.map(get_URL, Link)

I'm not sure if this will work for you, but I think there's an issue with your naming. Try to stay away from capitalizing variable names, since capitalized names are conventionally reserved for classes and the two can get confused. Also, you pass `Link` to `pool.map`, but the list is actually named `Test_links`. You could try something like this to see if that theory is right.

from multiprocessing import Pool
from selenium import webdriver

options = webdriver.ChromeOptions()  # configure as needed

test_links = ['https://www.google.com/maps', 'https://www.google.co.uk/?gfe_rd=cr&dcr=0&ei=3vPNWpTWOu7t8weBlbXACA', 'https://scholar.google.de/']

def get_URL(test_links_list):
    browser = webdriver.Chrome(chrome_options=options)
    browser.get(test_links_list)

if __name__ == '__main__':
    pool = Pool(processes=5)
    pool.map(get_URL, test_links)

I'm not sure if browser.get() will take a list; you might have to iterate over the list, calling browser.get() on each URL.
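For what it's worth, `pool.map` already splits the iterable, so the worker function receives one element per call rather than the whole list — `browser.get()` would never see a list here. A minimal sketch of that mechanic, with a hypothetical stand-in function in place of the Selenium call so it runs without a browser:

```python
from multiprocessing import Pool

test_links = ['https://www.google.com/maps',
              'https://scholar.google.de/']

def get_url(link):
    # pool.map calls this once per item, so `link` is a single URL string.
    # In the real script this is where browser.get(link) would go; the
    # stand-in below just returns a value so the example is runnable.
    return 'fetched ' + link

if __name__ == '__main__':
    # Each of the two worker processes handles one URL from the list.
    with Pool(processes=2) as pool:
        results = pool.map(get_url, test_links)
    print(results)
```

Each call to `get_url` in the real script would still open its own Chrome instance, which is expensive; reusing one browser per worker process is a common follow-up optimization.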
