
XHR Requests not returning all data from website

I am using Python.org version 2.7 64 bit on Windows 8 64 bit. I have some code that iterates through a series of date variables to create XHR submissions to a website. These attempt to pull down football data for matches played on the days iterated through. If no matches were played on a given day, a message is printed to that effect.

The code I have works fine, except that it does not return any data for any season other than the most recent one. The page I am trying to scrape is here:

http://www.whoscored.com/Regions/252/Tournaments/26

The calendar allows you to toggle between dates, and XHR requests populate this data on the page. The code I am using to do this is:

from datetime import date, timedelta as td
from ast import literal_eval
from datetime import datetime
import requests
import time
import re

list1 = [2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013]
list2 = [2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014]


for x, y in zip(list1, list2):
    print "list1 - " + str(x)
    print "list2 - " + str(y)


    d1 = date(x, 11, 1)
    d2 = date(y, 5, 31)

    delta = d2 - d1



    for i in range(delta.days + 1):

        # Step through the season one day at a time; date arithmetic gives
        # the date directly, no need to round-trip through a string
        date1 = d1 + td(days=i)

        url = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'

        params = {'d': date1.strftime('%Y%m%d'), 'isAggregate': 'false'}
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

        response = requests.get(url, params=params, headers=headers)


        try:
            fixtures = literal_eval(response.content)


            if fixtures is not None and len(fixtures) > 0: # If there are fixtures
                print ",\n".join([", ".join(str(x) for x in fixture) for fixture in fixtures]) # `fixtures` is a nested list
                time.sleep(0.5)    


            else:

                print "No Fixtures Today: " + date1.isoformat()
                time.sleep(0.5)

        except SyntaxError:

            print "Error!!!"
            time.sleep(0.5)
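As a side note, the exact request the loop issues for a given day can be previewed without hitting the server. A minimal sketch (Python 3 syntax; the endpoint and parameters are the ones used above, the helper name is mine):

```python
from datetime import date
from urllib.parse import urlencode  # Python 3; in Python 2 use urllib.urlencode

def fixtures_url(tournament_id, day):
    # Mirror the URL and params dict that the loop above passes to requests.get
    base = 'http://www.whoscored.com/tournamentsfeed/%d/Fixtures/' % tournament_id
    params = {'d': day.strftime('%Y%m%d'), 'isAggregate': 'false'}
    return base + '?' + urlencode(params)

print(fixtures_url(8273, date(2013, 11, 1)))
# http://www.whoscored.com/tournamentsfeed/8273/Fixtures/?d=20131101&isAggregate=false
```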

As far as I understand it, all the data for all available seasons should be accessible via the same method and from the same place. Can anyone see why this is not working?

Thanks

The problem is that each season has a different tournament ID, which means the URL is different. I changed the code to work with all years and their tournament IDs:

import json
import requests
import time

from datetime import date, timedelta

year_tournament_map = {
    2013: 8273,
    2012: 6978,
    2011: 5861,
    2010: 4940,
    2009: 3419,
    2008: 2689,
    2007: 2175,
    2006: 1645,
    2005: 1291,
    2004: 903,
    2003: 579,
    2002: 421,
    2001: 243,
    2000: 114,
    1999: 26,
}

years = sorted(year_tournament_map.keys())
url = 'http://www.whoscored.com/tournamentsfeed/%s/Fixtures/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

for year in years:
    start_date = date(year, 11, 1)
    end_date = date(year + 1, 5, 31)
    delta = end_date - start_date

    for days in range(delta.days + 1):
        time.sleep(0.5)

        test_date = start_date + timedelta(days=days)

        params = {'d': str(test_date).replace('-', ''), 'isAggregate': 'false'}
        response = requests.get(url % year_tournament_map[year], params=params, headers=headers)

        try:
            json_data = response.content.replace("'", '"').replace(',,', ',null,')
            fixtures = json.loads(json_data)

        except ValueError:
            print "Error!!!"

        else:

            if fixtures:  # If there are fixtures
                print ",\n".join([", ".join(str(x) for x in fixture) for fixture in fixtures])  # `fixtures` is a nested list

            else:
                print "No Fixtures Today: %s" % test_date
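The two replace calls before json.loads are there because the feed is not strict JSON: it uses single-quoted strings and elides empty fields as consecutive commas. A minimal sketch of the cleanup with a made-up payload (the field layout is hypothetical, only the quirks are taken from the code above):

```python
import json

# Hypothetical payload mimicking the feed's quirks: single-quoted strings
# and ',,' where a field is empty
raw = "[[123, 'Team A', 'Team B',, 2]]"

# Same cleanup as above: normalise quote style, then fill elided fields with null
cleaned = raw.replace("'", '"').replace(',,', ',null,')
fixtures = json.loads(cleaned)
print(fixtures)  # [[123, 'Team A', 'Team B', None, 2]]
```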
