
Scrapy get all links from any website

I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):

    return_links = []

    r = requests.get(link)

    soup = BeautifulSoup(r.content, "lxml")

    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))

    return return_links

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
    recursive_search(links)


recursive_search(get_links("https://www.brandonskerritt.github.io"))

The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.

I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.

This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, the links might be in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.

I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.

Any help will be appreciated!

There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
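As a rough sketch of the kind of settings that section recommends (the exact values below are assumptions to tune for your own machine and demo, not requirements):

# settings.py -- illustrative broad-crawl settings; adjust the numbers
# to your hardware and how polite the demonstration needs to be.
ROBOTSTXT_OBEY = True            # honour robots.txt, as the question asks

CONCURRENT_REQUESTS = 100        # with many domains, concurrency is the bottleneck, not politeness to one site
REACTOR_THREADPOOL_MAXSIZE = 20  # extra threads mostly help DNS resolution
LOG_LEVEL = 'INFO'               # DEBUG logging is far too noisy for large crawls

COOKIES_ENABLED = False          # cookies are rarely needed just to discover links
RETRY_ENABLED = False            # don't waste time retrying failed pages
DOWNLOAD_TIMEOUT = 15            # give up quickly on slow sites
DEPTH_LIMIT = 3                  # keep a demonstration crawl from running forever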

To recreate the behaviour you need in Scrapy, you must:

  • set your start URL to your page.
  • write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider variable.

An untested example (that can be, of course, refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []  # collects every URL the spider has visited

    def parse(self, response):
        # record this page, then follow every anchor and parse it the same way
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
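One possible way to run that spider and read back the collected links (also an untested sketch; it assumes the class above is importable as AllSpider):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ROBOTSTXT_OBEY': True,  # respect robots.txt, as the question asks
    'DEPTH_LIMIT': 2,        # keep the demonstration bounded
})
crawler = process.create_crawler(AllSpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

print(f"visited {len(crawler.spider.links)} urls")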

If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

A simple spider that follows all links:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(CrawlSpider):
    name = 'follow_all'

    start_urls = ['https://example.com']
    # no allowed_domains, so links to any domain are followed
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
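Because a crawl with no allowed_domains never finishes on its own, you may want to cap it for demonstration purposes. One option (the class name and limits below are arbitrary assumptions, not part of the original answer) is to give the spider a custom_settings dict:

class BoundedFollowAllSpider(FollowAllSpider):
    # Same spider, with per-spider settings that keep the demo small.
    name = 'follow_all_bounded'

    custom_settings = {
        'ROBOTSTXT_OBEY': True,        # the politeness the question asks for
        'DEPTH_LIMIT': 2,              # stay within two hops of the start URL
        'CLOSESPIDER_PAGECOUNT': 100,  # stop after roughly 100 responses
    }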
