Using Python Scrapy to extract XPath from a live soccer site

Question

I'm trying to use Scrapy to return the results and statistics from live games on SofaScore.

The code is below:

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        # Absolute XPath copied from the page as rendered in the browser
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)

I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returns nothing. I have tried many different XPaths and none of them returned anything. What am I doing wrong?

For example, today (10/06) the first match on the page is France vs Austria, and its XPath is: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div

Answer 1

The data is generated with JavaScript, so it is not in the HTML that Scrapy downloads, which is why your XPath matches nothing. You can get the data from the API instead.

Open devtools in the browser and click on the Network tab. Then click the Live button on the page and look at where it loads the data from. Then look at the JSON response to see its structure.
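
If the JSON response is large, a small helper can make its structure easier to read than scrolling through it in devtools. The following is only a sketch: `describe` is a hypothetical helper name, and `SAMPLE` is a made-up payload that assumes the shape used in this answer (a top-level `events` list whose items have `tournament` and `time` entries), not real API output.

```python
import json

# Made-up payload for illustration only. The "events", "tournament", "name"
# and "time" keys follow the answer; "example_field" is a placeholder.
SAMPLE = """
{"events": [{"tournament": {"name": "Example League"},
             "time": {"example_field": 0}}]}
"""


def describe(value, depth=0, max_depth=3):
    """Return indented lines listing the keys and value types of parsed JSON."""
    lines = []
    pad = "  " * depth
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{pad}{key}: {type(child).__name__}")
            if depth < max_depth:
                lines.extend(describe(child, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Only the first item is inspected, assuming list items share a shape.
        lines.append(f"{pad}[0]: {type(value[0]).__name__}")
        if depth < max_depth:
            lines.extend(describe(value[0], depth + 1, max_depth))
    return lines


if __name__ == "__main__":
    print("\n".join(describe(json.loads(SAMPLE))))
```

Saving the response body from the Network tab to a file and passing its parsed contents to `describe` would list the keys you can then index in the spider's `parse` method.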

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        events = response.json()
        events = events['events']
        # now iterate through the list and get what you want from it
        # example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
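
The `parse` method above indexes each event with `event['tournament']['name']`, which raises `KeyError` if a key is missing. The answer does not say whether every event always carries these keys, so a more tolerant version might look like the sketch below. `extract_event` and `extract_all` are hypothetical helper names, and the sample payload is made up for illustration.

```python
def extract_event(event):
    """Pull the two fields the spider yields, returning None for missing keys."""
    # "or {}" also covers the case where "tournament" is present but null.
    tournament = event.get("tournament") or {}
    return {
        "event name": tournament.get("name"),
        "time": event.get("time"),
    }


def extract_all(payload):
    """Apply extract_event to every entry of the top-level "events" list."""
    return [extract_event(event) for event in payload.get("events", [])]


if __name__ == "__main__":
    # Made-up payload: the second event deliberately lacks "tournament".
    sample = {
        "events": [
            {"tournament": {"name": "Example League"}, "time": {}},
            {"time": {}},
        ]
    }
    for item in extract_all(sample):
        print(item)
```

Inside the spider, `parse` could then be reduced to `yield from extract_all(response.json())`.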

1 answer

solution1
0 ACCEPTED 2022-06-11 09:55:29