
Multiple URLs for the same spider

I wanted to know if there's a better way to search for multiple URLs inside the same web page with the same spider. I have several URLs that I want to access with an index.

The code would be:

import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    limit = 5
    pages = list(range(1, limit))
    shuffle(pages)
    cat_a = 'http://example.com/a?page={}'
    cat_b = 'http://example.com/b?page={}'

    def parse(self, response):
        for i in self.pages:
            page_cat_a = self.cat_a.format(i)
            page_cat_b = self.cat_b.format(i)
            yield response.follow(page_cat_a, self.parse_page)
            yield response.follow(page_cat_b, self.parse_page)

The function parse_page continues to crawl for other data within these pages.

In my output file, I can see the data is gathered in repeating sequences: 10 web pages from category a, then 10 web pages from category b, and so on. I wonder if the web server I am crawling would notice this sequential behaviour and ban me.

Also, I have 8 URLs within the same website I want to crawl, all using indexes, so instead of the 2 categories given in the example, it would be 8. Thanks.

You can use the start_requests spider method instead of doing this inside the parse method.

import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    categories = ('a', 'b')
    limit = 5
    pages = list(range(1, limit))
    base_url = 'http://example.com/{category}?page={page}'

    def start_requests(self):
        # Shuffle pages to try to avoid bans
        shuffle(self.pages)

        for category in self.categories:
            for page in self.pages:
                url = self.base_url.format(category=category, page=page)
                yield scrapy.Request(url)

    def parse(self, response):
        # Parse the page
        pass
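
Since the question mentions eight indexed URLs rather than two, the same pattern scales by simply listing all of them in the categories tuple. If you also want to break the block-by-block ordering described in the question (all of category a, then all of category b), a minimal sketch, assuming full randomisation of the request order is acceptable, is to shuffle the whole (category, page) cross product instead of just the pages (the spider name below is hypothetical):

import scrapy
from itertools import product
from random import shuffle

class MyShuffledSpider(scrapy.Spider):
    name = 'my_shuffled_spider'  # hypothetical name
    categories = ('a', 'b')      # extend this tuple to all 8 categories
    limit = 5
    base_url = 'http://example.com/{category}?page={page}'

    def start_requests(self):
        # Build every (category, page) pair, then shuffle the whole list
        # so requests are not grouped by category.
        pairs = list(product(self.categories, range(1, self.limit)))
        shuffle(pairs)
        for category, page in pairs:
            yield scrapy.Request(self.base_url.format(category=category, page=page))

    def parse(self, response):
        # Parse the page
        pass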

Another thing you can try is to search for the category URLs from within the site. Let's say you want to get information from the tags shown in the sidebar of http://quotes.toscrape.com/ . You could manually copy the links and use them the way you are doing, or you could do this:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for tag in response.css('div.col-md-4.tags-box a.tag::attr(href)').getall():
            yield response.follow(tag, callback=self.parse_tag)

    def parse_tag(self, response):
        # Print the url we are parsing
        print(response.url)

I wonder if the web server I am crawling would notice these sequential behaviours and could ban me.

Yes, the site could notice. Just so you know, there is no guarantee that the requests will be processed in the order you yield them.
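
If you are worried about being banned, it can also help to slow the crawl down with Scrapy's standard throttling settings. A minimal sketch using custom_settings is below; the setting names are standard Scrapy settings, but the values are only illustrative assumptions, not tuned recommendations:

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite_spider'  # hypothetical name
    start_urls = ['http://quotes.toscrape.com/']

    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,              # wait about a second between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,   # randomise that delay (0.5x to 1.5x)
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'AUTOTHROTTLE_ENABLED': True,       # adapt the delay to server response times
    }

    def parse(self, response):
        pass

The same settings can also go in the project's settings.py instead of on the spider.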
