
XHR Requests not returning all data from website

I am using the Python.org build of Python 2.7 (64-bit) on Windows 8 (64-bit). I have some code that iterates through a series of dates and submits an XHR-style request to a website for each one. The requests attempt to pull down football data for the matches played on each date. If no matches were played that day, a message is printed to that effect.

The code works, except that it returns no data for any season other than the most recent one. The page I am trying to scrape is here:

http://www.whoscored.com/Regions/252/Tournaments/26

The calendar on that page lets you toggle between dates, and XHR requests populate the fixture data for the selected date. The code I am using to do this is:

from datetime import date, timedelta as td
from ast import literal_eval
from datetime import datetime
import requests
import time
import re

list1 = [2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013]
list2 = [2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014]


for x, y in zip(list1, list2):
    print "list1 - " + str(x)
    print "list2 - " + str(y)


    d1 = date(x, 11, 1)
    d2 = date(y,5,31)

    delta = d2 - d1



    for i in range(delta.days + 1):

        time1 =  str(d1 + td(days=i))
        time2 = time1.split("-", 1)[0]
        time3 = time1.split("-", -1)[1]
        time4 = time1.rsplit("-", 1)[-1]

        time2 = int(time2)
        time3 = int(time3)
        time4 = int(time4)

        date1 = datetime(year=time2, month=time3, day=time4)

        url = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'

        params = {'d': date1.strftime('%Y%m%d'), 'isAggregate': 'false'}
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

        response = requests.get(url, params=params, headers=headers)


        try:
            fixtures = literal_eval(response.content)


            if fixtures is not None and len(fixtures) > 0: # If there are fixtures
                print ",\n".join([", ".join(str(x) for x in fixture) for fixture in fixtures]) # `fixtures` is a nested list
                time.sleep(0.5)    


            else:
                print "No Fixtures Today: " + date1.isoformat()
                time.sleep(0.5)

        except SyntaxError:

            print "Error!!!"
            time.sleep(0.5)

As far as I understand it, the data for every available season should be accessible by the same method and from the same place. Can anyone see why this is not working?

Thanks

The problem is that each season has a different tournament ID, which means the feed URL is different for each season. Your code always requests ID 8273, which only covers the most recent season. I changed the code to work with all years and their tournament IDs:

import json
import requests
import time

from datetime import date, timedelta

year_tournament_map = {
    2013: 8273,
    2012: 6978,
    2011: 5861,
    2010: 4940,
    2009: 3419,
    2008: 2689,
    2007: 2175,
    2006: 1645,
    2005: 1291,
    2004: 903,
    2003: 579,
    2002: 421,
    2001: 243,
    2000: 114,
    1999: 26,
}

years = sorted(year_tournament_map.keys())
url = 'http://www.whoscored.com/tournamentsfeed/%s/Fixtures/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

for year in years:
    start_date = date(year, 11, 1)
    end_date = date(year + 1, 5, 31)
    delta = end_date - start_date

    for days in range(delta.days + 1):
        time.sleep(0.5) 

        test_date = start_date + timedelta(days=days)

        params = {'d': str(test_date).replace('-', ''), 'isAggregate': 'false'}
        response = requests.get(url % year_tournament_map[year], params=params, headers=headers)

        try:
            json_data = response.content.replace("'", '"').replace(',,', ',null,')
            fixtures = json.loads(json_data)

        except ValueError:
            print "Error!!!"

        else:
            if fixtures:  # If there are fixtures
                print ",\n".join([", ".join(str(x) for x in fixture) for fixture in fixtures])  # `fixtures` is a nested list
            else:
                print "No Fixtures Today: %s" % test_date
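
One caveat about the parsing step above: `str.replace` makes a single non-overlapping pass, so a run of three or more commas (`,,,`) still leaves an empty slot behind and `json.loads` raises `ValueError`. The following is a minimal sketch of the same idea split into small helpers, written so that it should run on both Python 2.7 and Python 3. The helper names (`feed_url`, `season_dates`, `parse_feed`) and the sample row in the usage note are made up for illustration, and the parser assumes the feed's string values contain no quotes or consecutive commas of their own.

```python
import json
import re
from datetime import date, timedelta

# Season start year -> tournament feed ID, taken from the mapping above
# (only two entries are repeated here to keep the sketch short).
YEAR_TOURNAMENT_MAP = {2013: 8273, 2012: 6978}

FEED_URL = 'http://www.whoscored.com/tournamentsfeed/%s/Fixtures/'


def feed_url(year):
    """Return the fixtures feed URL for the season starting in `year`."""
    return FEED_URL % YEAR_TOURNAMENT_MAP[year]


def season_dates(year):
    """Yield each date from 1 November of `year` to 31 May of the next year."""
    current = date(year, 11, 1)
    end = date(year + 1, 5, 31)
    while current <= end:
        yield current
        current += timedelta(days=1)


def parse_feed(text):
    """Turn a JavaScript-style array literal into nested Python lists.

    Assumption: string values contain no quote characters and no
    consecutive commas, so plain text substitution is safe.
    """
    text = text.replace("'", '"')
    # Fill every empty slot. The lookahead does not consume the next
    # comma, so runs such as ',,,' are handled in one pass.
    text = re.sub(r',(?=\s*,)', ',null', text)
    return json.loads(text)
```

With these helpers, the request parameter for a given day can be built as `{'d': day.strftime('%Y%m%d'), 'isAggregate': 'false'}` for each `day` in `season_dates(year)`, and the decoded response body (for example `response.text`) passed to `parse_feed`. A made-up row such as `[[1,'Arsenal',,,'2 : 0']]` parses to a list whose empty slots become `None`.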
