
Is scrapy.Spider or CrawlSpider a good fit for this task?

I am trying to scrape soccer players' data using Python's Scrapy package. The website I'm scraping has the format

https://www.example.com/players — I'll refer to it as “Homepage”

This page lists the players in the league. To reach the data I need, I have to click a player's name on the Homepage, which takes me to that player's “overview” page. To get the data for the second player, I have to go back to the Homepage, click the second player's name, and scrape that page; then go back to the Homepage again for the third player, and so on.

How should I go about this in Scrapy? Should I use scrapy.Spider or CrawlSpider? How do I tell Scrapy to go into a specific page (a player's overview page) and then back out to the Homepage, where the list of all players is, so that I can move on to the next player and repeat the process? Thank you in advance!

Assuming the page isn't rendered with JavaScript, Scrapy would be a great tool for this.
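
One way to check that assumption before writing a spider is to look at the raw HTML the server returns, without running any JavaScript, and see whether the player links are already in it. The sketch below uses only the standard library; `LinkCollector`, `links_in_raw_html`, and the sample markup are illustrative names and data, not part of Scrapy or the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_in_raw_html(html, base_url):
    """Return absolute URLs for all links present in the unrendered HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Hypothetical markup standing in for the downloaded Homepage source.
sample = (
    '<ul><li><a href="/players/1">Player One</a></li>'
    '<li><a href="/players/2">Player Two</a></li></ul>'
)
print(links_in_raw_html(sample, "https://www.example.com/players"))
```

If the list comes back empty for the real page source, the links are probably injected by JavaScript, and plain Scrapy requests would not see them.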

I would suggest reading the installation docs and the tutorial to get a general understanding of how it works, where to begin and how to start a new project.

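Here is an example of what your spider could look like. A plain scrapy.Spider is enough here: parse yields one request per player link found on the Homepage, so there is no need to navigate back to it.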

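import scrapy


class MySpider(scrapy.Spider):

    name = "myspider"
    start_urls = ["https://www.example.com/players"]

    def parse(self, response):
        # Placeholder selector: adjust it to match the player links on the real page.
        for href in response.css("a.player::attr(href)").getall():
            # response.follow resolves relative URLs against the current page.
            yield response.follow(href, callback=self.parse_player)

    def parse_player(self, response):
        # Scrape the player data into a dictionary and yield it as an item.
        # The "h1" selector is a placeholder as well.
        yield {
            "name": response.css("h1::text").get(),
            "url": response.url,
        }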

Installation docs

Scrapy Tutorial
