
Use python to crawl a website

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)       

    length = len(listOfLinks)
    count = 0       

    while(count < length):

        twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
        twoListOfLinks = list(twoLevelLinks)
        twoCount = 0
        twoLength = len(twoListOfLinks)

        for twoLinks in twoListOfLinks:
            listOfLinks.append(twoLinks)

        count = count + 1

        while(twoCount < twoLength):
            threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])  
            threeListOfLinks = list(threeLevelLinks)

            for threeLinks in threeListOfLinks:
                listOfLinks.append(threeLinks)

            twoCount = twoCount +1



    print '--------------------------------------------------------------------------------------'
    #remove all duplicates
    finalList = list(set(listOfLinks))  
    print finalList

My second question is: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.

The problem of spidering over a web site and getting all the links is a common one. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5

Even better, Google found this question already asked and answered here on StackOverflow:

Anyone know of a good Python based web crawler that I could use?

If you are using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:

self.soup = BeautifulSoup(HTMLcode)
for frm in self.soup.findAll('frame'):
    try:
        if not frm.has_key('src'):
            continue
        src = frm['src']
        # rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)

for the frame tag. The same goes for "img src" and "a href" tags. I like the topic though - maybe it's me who has something wrong here... edit: there is of course a top-level instance, which saves the URLs and fetches the HTML code from each link later...
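For the "a href" and "img src" cases, a minimal sketch of the same findAll() approach might look like this (assuming Python 3 and BeautifulSoup 4; the helper name extract_links is just illustrative):

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    # Return the unique absolute URLs found in <a href> and <img src> tags.
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for tag in soup.findAll("a"):
        href = tag.get("href")
        if href:
            links.add(urljoin(base_url, href))  # resolve relative URLs against the page
    for tag in soup.findAll("img"):
        src = tag.get("src")
        if src:
            links.add(urljoin(base_url, src))
    return links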

To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and they are similar enough for you to be able to follow along easily):

#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link unless links.include? link
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'
count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end

Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, to follow.
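If it helps, a rough Python equivalent of the loop above could look like this; it is only a sketch, assuming the requests library is available and using the same crude href/src regex as the Ruby version:

import re
import requests

def get_hyperlinks(url):
    # Collect the href/src values found on one page.
    links = []
    try:
        body = requests.get(url, timeout=10).text
    except requests.RequestException:
        print("Looks like we can't be here...")
        return links
    for match in re.finditer(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', body):
        link = match.group(1)
        if link.startswith("/"):
            link = url + link  # crude relative-to-absolute fix, as in the Ruby version
        if link not in links:
            links.append(link)
    return links

hyperlinks = [input("Enter a start URL: ").strip()]
visited = set()
print("Off we go!")

while hyperlinks:
    link = hyperlinks.pop(0)
    if link in visited:
        continue
    visited.add(link)
    print("Connecting to %s..." % link)
    links = get_hyperlinks(link)
    print("Found %d links on %s..." % (len(links), link))
    hyperlinks = links + hyperlinks
    print("Moving on with %d links left...\n" % len(hyperlinks))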

1) In Python, we do not count the elements of a container and then use the count to index into it; we just iterate over its elements, because that is what we actually want to do.

2) To handle multiple levels of links, we can use recursion.

def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)

This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go, for example:
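Here is a minimal sketch of that idea, assuming the same getAllUniqueLinks helper from the question:

def followAllLinks(self, from_where, visited=None):
    # keep a shared set of already-visited links so cycles are not re-crawled
    if visited is None:
        visited = set()
    for link in self.getAllUniqueLinks(from_where):
        if link not in visited:
            visited.add(link)
            self.followAllLinks(link, visited)
    return visited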

Use scrapy:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
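For example, a minimal spider with the current Scrapy API might look roughly like this (the spider name and start URL are placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # yield every link found on the page, then follow it;
        # Scrapy de-duplicates requests, so link cycles are handled for us
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}
            yield response.follow(href, callback=self.parse)

Running it with something like scrapy runspider linkspider.py -o links.json would write the discovered links to a JSON file.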
