
How to get different tables from one page using Python (lxml, html, requests, xpath)?

I am trying to get the Premier League table data from https://www.premierleague.com/tables . The code below gets the data, but unfortunately it only works for the latest season option (2018/2019). The page also offers tables for other seasons (2017/2018, ...); how can I scrape those tables as well?

from lxml import html
import requests

page = requests.get('https://www.premierleague.com/tables')

tree = html.fromstring( page.content )

team_rows = tree.xpath('//table//tbody//tr[@data-filtered-table-row-name]')[0:20]
team_names = [i.attrib['data-filtered-table-row-name'] for i in team_rows] 

teams = {}

for i in range(20):
    element = team_rows[i]
    teams[team_names[i]] = element.getchildren()

for i in team_names:
    values = [j.text_content() for j in teams[i]]
    row = "{} "*9
    print( row.format(i, *values[3:12] ) )

but unfortunately it only works for the latest season option (2018/2019)

The website uses JavaScript to load the older tables (1992-2017), so when you fetch the page with plain requests you only get the latest table. If you want to scrape the table filtered by year/season, here is a hard-coded version (I could not find the rule behind the season id numbers). If you want to do it more elegantly, selenium or requests_html might suit you better (a minimal requests_html sketch follows the JSON example below).

Note: I am imitating how the site's JavaScript fetches data from the server, so the response content is JSON. This only fetches the Premier League table for different years; filtering by competition/matchweek/home_or_away is not covered in my example. If you want to add those options to the script, you should analyse the rules of the URL parameters (use the approach @pguardiario described, or a tool like Fiddler).

import requests
from pprint import pprint

# Map season start year -> compSeasons id used by the API.
# Seasons 1992-2013 follow a simple pattern (ids 1-22); later ids are
# irregular, so they are hard-coded below.
years = {str(1991+i): str(i) for i in range(1, 23)}
years.update({
    "2018": "210",
    "2017": "79",
    "2016": "54",
    "2015": "42",
    "2014": "27"
    })

specific = years.get("2017")

param = {
    "altIds": "true",
    "compSeasons": specific,
    "detail": 2,
    "FOOTBALL_COMPETITION": 1
}

# These headers imitate the request the site's own JavaScript sends.
headers = {
    "Origin": "https://www.premierleague.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Referer": "https://www.premierleague.com/tables?co=1&se={}&ha=-1".format(specific),
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    }

page = requests.get('https://footballapi.pulselive.com/football/standings',
                    params=param,
                    headers=headers
                    )
print(page.url)
pprint(page.json())
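If you prefer the more elegant route mentioned above, requests_html can execute the page's JavaScript for you. This is only a minimal sketch, assuming the se query parameter takes the same season id as compSeasons above (79 for 2017/18) and that the rendered page still exposes the data-filtered-table-row-name rows:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.premierleague.com/tables?co=1&se=79&ha=-1')
r.html.render(sleep=2)  # downloads Chromium on first run, then runs the page's JavaScript

rows = r.html.xpath('//tbody[contains(@class,"tableBodyContainer")]//tr[@data-filtered-table-row-name]')
print([row.attrs['data-filtered-table-row-name'] for row in rows])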

How to get different tables from one page

I feel your question title is different from your description. If that is the case, the other issue is that your XPath merges all of the tables into one flat list of rows. Be careful with //; see "What is the meaning of .// in XPath?". A short demonstration follows.
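To illustrate that point: when you call element.xpath('//tr'), lxml evaluates the absolute path against the whole document, not just that element, while './/tr' (or a bare relative path like the tr[...] used below) stays inside the element. A minimal, self-contained sketch:

from lxml import html

doc = html.fromstring(
    '<div>'
    '<table id="a"><tbody><tr><td>A1</td></tr></tbody></table>'
    '<table id="b"><tbody><tr><td>B1</td></tr></tbody></table>'
    '</div>'
)

table_a = doc.xpath('//table[@id="a"]')[0]
print(len(table_a.xpath('//tr')))    # 2 -- '//' searches the whole document
print(len(table_a.xpath('.//tr')))   # 1 -- './/' stays inside table_a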

Note: if you want the old Premier League table data, use my code from the first part, because that data can only be obtained that way.

from lxml import html
import requests
from pprint import pprint

# Same season-id mapping as in the first part.
years = {str(1991+i): str(i) for i in range(1, 23)}
years.update({
    "2018": "210",
    "2017": "79",
    "2016": "54",
    "2015": "42",
    "2014": "27"
    })

param = {
    "co": "1",
    "se": years.get("2017"),
    "ha": "-1"
}


page = requests.get('https://www.premierleague.com/tables', params=param)

tree = html.fromstring(page.content)
tables = tree.xpath('//tbody[contains(@class,"tableBodyContainer")]')
# Relative path: 'tr[...]' is evaluated inside each table, so the rows stay
# grouped per table instead of being merged into one flat list.
each_table_team_rows = [table.xpath('tr[@data-filtered-table-row-name]') for table in tables]
team_names = [[i.attrib['data-filtered-table-row-name'] for i in team_rows] for team_rows in each_table_team_rows]

pprint(team_names)
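If you also want the numeric columns for each row (as in the question's print loop), you can continue from the snippet above. The values[3:12] slice is carried over from the question's code, so verify it still matches the live column layout:

# Continuation of the snippet above: print each table's rows.
for team_rows in each_table_team_rows:
    for row in team_rows:
        name = row.attrib['data-filtered-table-row-name']
        values = [cell.text_content().strip() for cell in row.getchildren()]
        print(name, *values[3:12])
    print()  # blank line between tables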
