
Scrapy - Scraping Recursively and handling double entries

I am quite new to Scrapy and Python in general, and I am currently trying to scrape each language/game combination for each category (most watched, fastest growing, etc.) from https://www.twitchmetrics.net/channels/viewership

The idea is to have a channel in each row with columns like rank_mostwatched_english_leagueoflegends, rank_fastest_growing_nederlands_dota2 etc.

So far I have not managed much (and only for the 'most watched' category):

  • Got the ranks for the all/all combination
  • Extracted the links for the languages, followed them, and created items dynamically to store the rank

I have this code so far:

import scrapy
import datetime
from scrapy.loader import ItemLoader
from scrapy.item import BaseItem


class FlexibleItem(dict, BaseItem):
    pass


class ChannelLeaderboardSpider(scrapy.Spider):
    name = 'channel_leaderboards'
    # Enter first page of the leaderboard URL
    start_urls = [
        'https://www.twitchmetrics.net/channels/viewership'
    ]

    def parse(self, response):
        language_page_links = response.xpath(
            '//div[@class="mb-4"][1]//a//@href').getall()
        lang_page_links_test = language_page_links[:3]
        yield from response.follow_all(lang_page_links_test, self.parse_lang)

    def parse_lang(self, response):
        # Grab all games for currently parsed language
        all_games = response.xpath(
            '//div[@class="mb-4"][1]//a//@href').getall()
        all_games_test = all_games[:3]
        all_channels = response.xpath('//h5')
        # Everything after 'lang=' in the URL; empty on the all-languages page
        language = response.url.partition('lang=')[2] or 'all'

        for i, channel in enumerate(all_channels, start=1):
            il = ItemLoader(item=FlexibleItem(), selector=channel)
            il.add_xpath('channel_id', './text()')
            il.add_value('rank_mostwatched_'+language+'_all', i)
            il.add_value('date', datetime.date.today().strftime('%y-%m-%d'))
            yield il.load_item()

        yield from response.follow_all(all_games_test, self.parse_games)

    def parse_games(self, response):
        pass

As you can see, my next idea was to get the links for all games of the current language. However, I am not sure how to continue. How do I make sure that channels which appear multiple times end up in a single, correctly populated row? Is there a better, less messy way to do this?

In the code that sends the request to the channel, set dont_filter=True so that duplicate URLs are still crawled. You can also turn off the duplicate filter entirely (the DUPEFILTER_CLASS setting), but I usually just use dont_filter. I'm not familiar with the newer follow_all syntax, but I imagine dont_filter works the same way there.

yield scrapy.Request(url, callback=self.parse_games, dont_filter=True)
