
Using Python Scrapy to extract XPATH in a soccer live site

I'm trying to use Scrapy to get the results and statistics of live games from SofaScore.

Site: https://www.sofascore.com/

This is the code I have so far:

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)

I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returns nothing either. I have tried many different XPath expressions and none of them returned anything. What am I doing wrong?

For example, today (10/06) the first match on the page is France vs Austria, and its XPath is: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div

The data is generated with JavaScript, so it is not in the HTML that Scrapy downloads, which is why your XPath matches nothing. You can get the data from the API instead.

Open the browser's devtools and select the Network tab. Then click the Live button on the page and see which request the data is loaded from. Open that JSON response to see its structure.
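
If the response is large, reading it by eye in the Network tab can be tedious. As a rough sketch, a small helper like the one below can list the key paths and value types of a decoded JSON document. The helper name describe and the sample payload are made up for illustration; only the events, tournament, name and time keys come from the spider in this answer.

```python
import json


def describe(node, prefix="", depth=3):
    """Return a list of 'path: type' strings for a decoded JSON value."""
    lines = []
    if depth == 0:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.append(f"{path}: {type(value).__name__}")
            lines.extend(describe(value, path, depth - 1))
    elif isinstance(node, list) and node:
        # Only the first element is inspected; the others are assumed
        # to have the same shape.
        lines.extend(describe(node[0], f"{prefix}[]", depth))
    return lines


# Hypothetical, heavily trimmed sample; a real response has many more fields.
sample = json.loads(
    '{"events": [{"tournament": {"name": "Example Cup"}, "time": {}}]}'
)
for line in describe(sample):
    print(line)
```

Once the structure is known, the spider can request the endpoint directly: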

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        events = response.json()
        events = events['events']
        # now iterate through the list and take what you need from it
        # example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
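
Indexing with event['tournament']['name'] raises a KeyError as soon as one event lacks a key. As a hedged sketch, the extraction could be moved into a helper that tolerates missing fields. The function name summarize_event is hypothetical, and the homeTeam, awayTeam, homeScore, awayScore and current keys are assumptions about the response, so check them against the JSON you actually see in devtools.

```python
def summarize_event(event):
    """Flatten one event dict into a few fields, tolerating missing keys."""

    def nested(key, field):
        # A missing or null sub-object yields None instead of raising KeyError.
        return (event.get(key) or {}).get(field)

    return {
        "tournament": nested("tournament", "name"),
        "home": nested("homeTeam", "name"),            # assumed key
        "away": nested("awayTeam", "name"),            # assumed key
        "home_score": nested("homeScore", "current"),  # assumed keys
        "away_score": nested("awayScore", "current"),  # assumed keys
    }


# Hypothetical payload standing in for response.json(); not real match data.
payload = {
    "events": [
        {
            "tournament": {"name": "Example Cup"},
            "homeTeam": {"name": "Team A"},
            "awayTeam": {"name": "Team B"},
            "homeScore": {"current": 1},
            "awayScore": {"current": 0},
        },
        {"tournament": {"name": "Example League"}},  # incomplete event
    ]
}

rows = [summarize_event(event) for event in payload.get("events", [])]
for row in rows:
    print(row)
```

Inside the spider's parse method, yield summarize_event(event) would then take the place of the hand-built dictionary.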


 