
parallelize beautiful soup scraper in python

I would like to parallelize my scraping script, which is written in Python using Beautiful Soup. Despite reading up on it, I am confused about how to get it to work in my code. What I want to do for now is take a list of links as input and open several browsers/tabs, using these URLs as input. Later, obviously, I want to include my entire code and scrape each of the sites. But I cannot get this first step to work.

Here is my attempt:

Test_links = ['https://www.google.com/maps', 'https://www.google.co.uk/?gfe_rd=cr&dcr=0&ei=3vPNWpTWOu7t8weBlbXACA', 'https://scholar.google.de/']

def get_URL(Link):
    browser = webdriver.Chrome(chrome_options = options)
    browser.get(Link)

if __name__ == '__main__':
    pool = Pool(processes=5)
    pool.map(get_URL, Link)

I'm not sure if this will work for you, but I think there's an issue with your naming. Try to stay away from capitalized variable names, because they can be confused with class names. You could try something like this to see if that theory is right.

test_links = ['https://www.google.com/maps', 'https://www.google.co.uk/?gfe_rd=cr&dcr=0&ei=3vPNWpTWOu7t8weBlbXACA', 'https://scholar.google.de/']

from multiprocessing import Pool
from selenium import webdriver

# define the Chrome options the original snippet referenced but never created
options = webdriver.ChromeOptions()

def get_URL(link):
    # pool.map calls this function once per element, so each call
    # receives a single URL string, not the whole list
    browser = webdriver.Chrome(chrome_options=options)
    browser.get(link)

if __name__ == '__main__':
    pool = Pool(processes=5)
    pool.map(get_URL, test_links)

I'm not sure whether browser.get() will accept a list; you might have to iterate over the list, calling browser.get() on each URL.
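To see how the distribution works, here is a minimal sketch of Pool.map semantics. It swaps the Selenium call for a plain function (so it runs without a browser driver); the function name and return value are illustrative only. The point is that Pool.map hands each worker one element of the list at a time, so the worker function should expect a single URL, not a list:

```python
from multiprocessing import Pool

def get_url(link):
    # Pool.map calls this once per element, so `link` is a
    # single URL string, not the whole list
    return "fetched " + link

if __name__ == "__main__":
    test_links = ["https://www.google.com/maps", "https://scholar.google.de/"]
    with Pool(processes=2) as pool:
        # results come back in the same order as the input list
        results = pool.map(get_url, test_links)
    print(results)
```

The same pattern applies to the Selenium version: replace the body of `get_url` with the `webdriver.Chrome(...)` / `browser.get(link)` calls, and each worker process will open its own browser for its assigned URL.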
