

Appending values to an iterable object while web crawling in python/BeautifulSoup (also want to know about multi-threading)

I am coming here with a bug in my very first real Python program (i.e. not something out of Codecademy). I am an avid user of R and I built out a bunch of web crawling/scraping tools using the XML packages. Unfortunately, I reached the point where R is just not ideal for some of the things I am trying to do, so I am building some of these tools in Python, and hopefully you can help. If there are any glaring Python-specific coding best practices that I'm neglecting, I would appreciate a heads-up as well.

I need to be able to append values to my iterable. At each step of the iteration in the code below, the entire links_child object gets added to my links object. I actually need each link within links_child to be added separately to the object, rather than all as one entry. This will keep links growing and growing, and the iteration will only break when I reach a specified number of websites (100 in the code below). Do any of you know how I can extract the bs4 items line by line and add them to my iterable object? The error I get is as follows:

AttributeError: 'ResultSet' object has no attribute 'get'

Also, I ultimately want this crawler to be a heck of a lot faster. What types of multi-threading options (if any) do I have with BeautifulSoup? Should I switch libraries? Are there any low-hanging fruit I can pick here that would boost my speed? It would be perfect if there were a simple way to have 5-10 threads doing this crawl at once and updating the same dictionary object, but that is probably just a fantasy.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

base_site = "http://www.tasq.com"
page = urlopen(base_site).read()
soup = BeautifulSoup(page)
links = soup.find_all('a')
start_slash = re.compile('/')
link_db = {}
# iterate through all links on the homepage
for link in links:
    # print for debugging purposes
    print link
    # pull out the hrefs from the current link
    fullLink = str(link.get('href'))
    # print for debugging purposes
    print fullLink
    # see if the link is valid using regex
    check_start = start_slash.match(fullLink)
    # if the link is not valid, concatenate the domain
    if check_start != None:
        fullLink = base_site + fullLink
    # if the link is already stored as a key in the dict, skip it
    if fullLink in link_db:
        continue
    # connect to the full link (I/O operation)
    page_child = urlopen(fullLink).read()
    # create bs4 object out of the opened page
    soup_child = BeautifulSoup(page_child)
    # insert the link as the key and some text string as the value
    link_db[fullLink] = 'example'
    # find all links on current page and save them in object
    links_child = soup_child.find_all('a')
    # (THIS IS THE SOURCE OF THE ERROR) append the whole ResultSet of links to the iterable
    links.append(links_child)
    # break code if link_db gets to specified length
    if len(link_db) == 100:
        break

You're cycling over links and you're appending to it as well. Eventually the loop hits the first links_child you added, which is a list of links, not a Tag object, and therefore doesn't have a get attribute.
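To see the distinction, here is a minimal illustration (it assumes the soup object from your own code and at least one <a> tag on the page):

links = soup.find_all('a')      # a ResultSet: behaves like a list of Tag objects
first_link = links[0]           # a single Tag
print first_link.get('href')    # fine: a Tag has a get method
print links.get('href')         # AttributeError: 'ResultSet' object has no attribute 'get'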

Append links_child to another variable and it works fine. You can also use extend instead of append to add the contents of links_child to links, but then it hits another problem further on: trying to read a relative URL like ../contact/contact-form.php, which you're not accounting for.
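For illustration, a minimal sketch of the separate-variable approach (the names to_visit and full_link are my own, and urljoin from the standard urlparse module is used here to resolve relative links such as ../contact/contact-form.php against the page they were found on):

from urllib import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup

base_site = "http://www.tasq.com"
soup = BeautifulSoup(urlopen(base_site).read())

# keep the frontier of (page the link was found on, raw href) pairs separate from links
to_visit = [(base_site, a.get('href')) for a in soup.find_all('a') if a.get('href')]
link_db = {}

while to_visit and len(link_db) < 100:
    page_url, href = to_visit.pop(0)
    full_link = urljoin(page_url, href)   # handles absolute, /rooted and ../ relative hrefs
    if full_link in link_db:
        continue
    link_db[full_link] = 'example'
    child_soup = BeautifulSoup(urlopen(full_link).read())
    # extend with the individual hrefs, not the whole ResultSet
    to_visit.extend((full_link, a.get('href')) for a in child_soup.find_all('a') if a.get('href'))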

There are several ways to do multiprocessing in Python; the most popular is the multiprocessing module, as it gives you a nice API to work with and spawns processes rather than threads, which makes full use of the multiple cores in the CPU.

There are a number of ways you can approach multiprocessing in this example. You could, for instance, define your main loop as a function and create a Pool of workers to work through it. Something like this:

from urllib import urlopen
from bs4 import BeautifulSoup
import re
import multiprocessing

def work(link):
    # each worker builds its own small dict for the one link it was given
    link_db = {}
    start_slash = re.compile('/')
    print link
    # pull the href out of the Tag; some <a> tags have no href at all
    fullLink = link.attrs.get('href', None)
    if fullLink is None:
        return link_db
    # if the href starts with a slash it is site-relative, so prepend the domain
    check_start = start_slash.match(fullLink)
    if check_start is not None:
        fullLink = base_site + fullLink
    # fetch and parse the page (the slow, I/O-bound part of the job)
    page_child = urlopen(fullLink).read()
    soup_child = BeautifulSoup(page_child)
    link_db[fullLink] = 'example'
    return link_db

if __name__ == '__main__':
    base_site = "http://www.tasq.com"
    page = urlopen(base_site).read()
    soup = BeautifulSoup(page)
    links = soup.find_all('a')
    link_dbs = []
    # a pool of 4 worker processes; each call to work() handles a single link
    pool = multiprocessing.Pool(processes=4)
    result = pool.map_async(work, links)
    # block until every worker has finished, then collect their dicts
    link_dbs.extend( result.get() )
    print link_dbs
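result.get() returns one small dict per link, so link_dbs ends up as a list of dicts. If you want a single dict like your original link_db, one possible follow-up is to merge them:

merged_db = {}
for db in link_dbs:
    merged_db.update(db)   # later entries overwrite earlier duplicates, if any
print merged_db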

Take this as a guideline though; I simplified your function to make it clearer. Hopefully this will get you on track.
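As a side note on the threading part of the question: since the crawl is mostly waiting on the network (I/O-bound), a thread-based pool is also an option. multiprocessing.dummy exposes the same Pool API backed by threads, so the work function above can be reused unchanged; a sketch:

from multiprocessing.dummy import Pool   # same interface as multiprocessing.Pool, but uses threads

thread_pool = Pool(8)                     # 8 worker threads in a single process
link_dbs = thread_pool.map(work, links)   # same call pattern as the process-based version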
