Scrapy - Scraping recursively and handling duplicate entries

Question

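I'm new to Scrapy and to Python in general, and I'm currently trying to scrape every language/game combination for each of the given categories (most watched, fastest growing, etc.) from https://www.twitchmetrics.net/channels/viewership.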

The idea is to end up with one channel per row, with columns such as rank_mostwatched_english_leagueoflegends, rank_fastest_coming_nederlands_dota2, and so on.

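So far I haven't managed much (and only for the "most watched" category):

Getting the rankings for the all/all combination
Extracting the language links, following them, and dynamically creating items to store the rankings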

This is the code I have so far:

import scrapy
import datetime
from scrapy.loader import ItemLoader
from scrapy.item import BaseItem


class FlexibleItem(dict, BaseItem):
    pass


class ChannelLeaderboardSpider(scrapy.Spider):
    name = 'channel_leaderboards'
    # Enter first page of the leaderboard URL
    start_urls = [
        'https://www.twitchmetrics.net/channels/viewership'
    ]

    def parse(self, response):
        language_page_links = response.xpath(
            '//div[@class="mb-4"][1]//a//@href').getall()
        lang_page_links_test = language_page_links[:3]
        yield from response.follow_all(lang_page_links_test, self.parse_lang)

    def parse_lang(self, response):
        # Grab all games for currently parsed language
        all_games = response.xpath(
            '//div[@class="mb-4"][1]//a//@href').getall()
        all_games_test = all_games[:3]
        all_channels = response.xpath('//h5')
        # Everything after 'lang=' in the URL; empty when no language is selected
        language = (response.url).partition('lang=')[2]
        if language == '':
            language = 'all'

        for i, channel in enumerate(all_channels, start=1):
            il = ItemLoader(item=FlexibleItem(), selector=channel)
            il.add_xpath('channel_id', './text()')
            il.add_value('rank_mostwatched_'+language+'_all', i)
            il.add_value('date', datetime.date.today().strftime('%y-%m-%d'))
            yield il.load_item()

        yield from response.follow_all(all_games_test, self.parse_games)

    def parse_games(self, response):
        pass

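As you can see, my next idea is to grab the links to all games for the current language. However, I'm not quite sure how to continue. How do I make sure that channels which appear multiple times are populated correctly? Is there a better, cleaner way to do this?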
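
One possible way to get a single row per channel, sketched here only as an illustration: collect the partial items in an item pipeline and merge them by channel_id. The class name MergeByChannelPipeline and the helpers unwrap and merged_rows are hypothetical names made up for this sketch. It assumes the items look like the ones yielded in parse_lang above, where ItemLoader stores each value in a list by default.

```python
from collections import defaultdict


def unwrap(value):
    # ItemLoader collects values in lists by default; take the first element.
    if isinstance(value, (list, tuple)) and value:
        return value[0]
    return value


class MergeByChannelPipeline:
    """Hypothetical pipeline: merge partial items into one row per channel."""

    def __init__(self):
        # Maps channel_id -> dict holding every rank column seen so far.
        self.rows = defaultdict(dict)

    def process_item(self, item, spider):
        data = {key: unwrap(value) for key, value in dict(item).items()}
        channel_id = data.pop("channel_id")
        # A channel seen on several pages simply gains more columns here.
        self.rows[channel_id].update(data)
        return item

    def merged_rows(self):
        # One dict per channel, with channel_id restored as a column.
        return [{"channel_id": cid, **cols} for cid, cols in self.rows.items()]
```

To try this, the class would have to be enabled through the ITEM_PIPELINES setting. The sketch only builds the merged rows in memory; writing them out (for example from a close_spider method) is left open.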

Answer 1

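In the code that sends the requests for the channels, you need to set dont_filter=True to make sure duplicate URLs get crawled. You can also turn off the duplicate filter entirely, but I usually just use dont_filter. I'm not familiar with the new follow_all syntax, but I imagine dont_filter works the same way there.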

yield scrapy.Request(url, callback=self.parse_games, dont_filter=True)

1 solution

Solution 1
1 2020-03-08 02:09:11