
Web Scraping within Links using Scrapy

I am web-scraping information from a site that has several pages of data. Within each scrape, I am extracting a handful of information. However, I also want to go inside the link of each item I am scraping, scrape information from there as well, and then return to the site and continue scraping. How would I do this using Scrapy?
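(For reference, this pattern is usually expressed in Scrapy by yielding a follow-up request to the detail page from the listing callback and finishing the item there. Below is a minimal sketch under assumed URLs, selectors, and field names; only response.follow and cb_kwargs are real Scrapy APIs.)

import scrapy

class ListingSpider(scrapy.Spider):
    name = 'listing'
    start_urls = ['https://example.com/listing']   # placeholder URL

    def parse(self, response):
        # Scrape the listing page, then step into each item's detail link
        for row in response.css('div.item'):                    # illustrative selector
            item = {'title': row.css('a::text').get()}
            detail_url = row.css('a::attr(href)').get()
            # cb_kwargs carries the partially-filled item into the next callback
            yield response.follow(detail_url, callback=self.parse_detail,
                                  cb_kwargs={'item': item})

        # Keep paginating through the listing pages
        next_page = response.css('a.next::attr(href)').get()    # illustrative selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response, item):
        # Add fields scraped from the detail page, then emit the finished item
        item['description'] = response.css('p.description::text').get()
        yield item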

The CrawlSpider class can be used with Scrapy to recursively crawl through a huge graph of web pages; a minimal sketch follows the links below.

More information:

Crawling a site recursively using scrapy

https://realpython.com/web-scraping-and-crawling-with-scrapy-and-mongodb/

https://mherman.org/blog/recursively-scraping-web-pages-with-scrapy/
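A minimal sketch of that approach, with hypothetical URL patterns for the listing and detail pages (only CrawlSpider, Rule, and LinkExtractor are Scrapy's own APIs):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GraphSpider(CrawlSpider):
    name = 'graph'
    allowed_domains = ['example.com']          # placeholder domain
    start_urls = ['https://example.com/']      # placeholder start page

    rules = (
        # Keep following pagination / listing links without producing items
        Rule(LinkExtractor(allow=r'/page/'), follow=True),
        # Parse each detail page that the listing pages link to
        Rule(LinkExtractor(allow=r'/detail/'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        yield {'url': response.url,
               'title': response.css('h1::text').get()}   # illustrative selector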

You could use recursion to achieve the desired objective. Start by scraping a start URL, then recursively follow the links found inside it, and so on. Beware that this recursion might take a very long time and, in some cases, might eventually get your scraper banned. Try limiting the recursion depth to 2 or 3 (see the settings note after the snippet below).

Code Snippet:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    # follow=True keeps recursing into the links found on each crawled page
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True),)

    def parse_url(self, response):
        item = MyItem()

        ## Do your processing here

        item['url'] = response.url
        yield item
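To act on the depth advice above, Scrapy's built-in DEPTH_LIMIT setting caps how far the recursion goes. A small sketch (the setting name is Scrapy's own; the value of 2 is just the suggestion from above):

# settings.py (or custom_settings on the spider)
# Cap how deep the crawler recurses from the start URLs
DEPTH_LIMIT = 2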
