I'm trying to use Scrapy to return the results and statistics from live games on SofaScore.
Site: https://www.sofascore.com/
My code is below:
import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)
        pass
I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returns nothing. I have tried many different XPaths and none of them returned anything. What am I doing wrong?
For example, today (10/06) the first match on the page is France vs Austria, with XPath: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div
The data is generated with JavaScript, so it is not in the HTML that Scrapy downloads, but you can get it from the API.
Open devtools in the browser and click on the Network tab. Then click on the Live button and look at where it loads the data from. Then look at the JSON response to see its structure.
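If the JSON response is large, it can help to list its key paths before deciding what to extract. The snippet below is only a sketch: `key_paths` is a hypothetical helper, and `SAMPLE` is a made-up payload that mirrors just the keys used in this answer (`events`, `tournament`, `name`, `time`), not a real API response.

```python
import json


def key_paths(node, prefix=""):
    """Return the sorted, de-duplicated key paths found in a decoded JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= set(key_paths(value, path))
    elif isinstance(node, list):
        # List items share one path, marked with "[]"
        for item in node:
            paths |= set(key_paths(item, prefix + "[]"))
    return sorted(paths)


# Made-up sample; in practice, paste the response saved from the Network tab.
SAMPLE = '{"events": [{"tournament": {"name": "Example Cup"}, "time": {}}]}'

if __name__ == "__main__":
    for path in key_paths(json.loads(SAMPLE)):
        print(path)
```

Running it on a saved response gives a flat list of paths such as `events[].tournament.name`, which is easier to scan than the raw JSON.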
A spider that requests the live-events API endpoint directly:

import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        # Browser-like headers sent with the API request
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        # The response body is JSON; the live matches are under the 'events' key
        events = response.json()
        events = events['events']

        # now iterate through the list and get what you want from it
        # example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
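Indexing with `event['tournament']['name']` raises a `KeyError` if an event lacks a key, so a more defensive lookup may be worth having when pulling more fields. The following is a minimal sketch, not part of the original answer: `dig` and `summarize` are hypothetical helpers, and the `homeTeam`, `awayTeam`, `homeScore`, `awayScore` and `current` names are assumptions about the payload that you should check against the JSON you see in devtools. The sample events are made up.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def summarize(event):
    """Build a flat item from one event dict, tolerating missing fields."""
    return {
        "tournament": dig(event, "tournament", "name"),
        # Assumed field names below; verify them in the real response.
        "home": dig(event, "homeTeam", "name"),
        "away": dig(event, "awayTeam", "name"),
        "home_score": dig(event, "homeScore", "current"),
        "away_score": dig(event, "awayScore", "current"),
    }


# Made-up events for illustration only
SAMPLE_EVENTS = [
    {
        "tournament": {"name": "Example Cup"},
        "homeTeam": {"name": "Team A"},
        "awayTeam": {"name": "Team B"},
        "homeScore": {"current": 1},
        "awayScore": {"current": 0},
    },
    {"tournament": {"name": "Example Cup"}},  # scores and teams missing
]

if __name__ == "__main__":
    for event in SAMPLE_EVENTS:
        print(summarize(event))
```

Inside the spider's `parse` method, the loop body could then become `yield summarize(event)`; missing fields come out as `None` instead of stopping the crawl.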