Table Web Scraping Issues with Python
I am having issues scraping data from this website: https://fantasy.premierleague.com/player-list
I am interested in getting the players' names and points from the different tables.
I'm relatively new to Python and completely new to web scraping. Here is what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://fantasy.premierleague.com/player-list'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
rows = soup.find_all('tr')
print(rows)
From here I would go on to find all the 'td' information.
However, I get no results for 'tr'. I can pass 'a' in as an argument and get the links for the site fine, but I haven't been able to get any data from the tables. My understanding is that passing 'tr' will find all rows of any tables within the website.
Any ideas where I am going wrong? Thanks for your help.
You can use webdriver, pandas and BeautifulSoup to get all the table data.
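from io import StringIO
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = "https://fantasy.premierleague.com/player-list"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once the rendered HTML is captured
soup = BeautifulSoup(html, "lxml")  # name the parser explicitly
# these class names are auto-generated by the site and may change
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})
# recent pandas versions deprecate passing a literal HTML string, so wrap it
df = pd.read_html(StringIO(str(table)))
print(df)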
Output will be:
[ Player Team Points Cost
0 Alisson Liverpool 99 £6.2
1 Ederson Man City 89 £6.0
2 Kepa Chelsea 72 £5.4
3 Schmeichel Leicester 122 £5.4
4 de Gea Man Utd 105 £5.3
5 Lloris Spurs 56 £5.3
6 Henderson Sheffield Utd 135 £5.3
7 Pickford Everton 93 £5.2
8 Patrício Wolves 122 £5.2
9 Dubravka Newcastle 124 £5.1
10 Leno Arsenal 114 £5.0
11 Guaita Crystal Palace 122 £5.0
12 Pope Burnley 129 £4.9
13 Foster Watford 113 £4.9
14 Fabianski West Ham 61 £4.9
15 Caballero Chelsea 7 £4.8
16 Ryan Brighton 105 £4.7
17 Bravo Man City 11 £4.7
18 Grant Man Utd 0 £4.7
19 Romero Man Utd 0 £4.6
20 Krul Norwich 94 £4.6
21 Mignolet Liverpool 0 £4.5
22 McCarthy Southampton 74 £4.5
23 Ramsdale Bournemouth 97 £4.5
24 Fahrmann Norwich 1 £4.4
and so on........................................]
The table you want to scrape is generated using JavaScript, which is not executed when you do html = urlopen(url), so the table is not in the soup either.
There are many ways to get dynamically generated data. Check here for example.
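To make the point concrete, here is a minimal sketch of what a parser sees when the rows are added by a script. The HTML string and the /api/players path are made up for illustration (they are not the real page), and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what the server sends: one link, an empty
# container, and a script that would fill the container in a real browser.
STATIC_HTML = """
<html><body>
  <a href="/help">Help</a>
  <div id="root"></div>
  <script>
    // in a browser this would fetch the data and build the table rows
    fetch('/api/players').then(r => r.json()).then(render);
  </script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

links = soup.find_all("a")   # present in the static markup
rows = soup.find_all("tr")   # would only exist after the script runs

print(len(links), len(rows))
```

The link is found but no rows are, which matches the symptom in the question: 'a' works, 'tr' returns nothing.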
https://fantasy.premierleague.com/player-list uses JavaScript to generate the data in the HTML. BeautifulSoup cannot run JavaScript, so we need to emulate a real browser to load the data. To do this you can use Selenium. In the code below I use Firefox, but you can use Chrome, for example. Please check Selenium's documentation on how to get it running.
The script opens a Firefox browser, pauses for 1 second (to give the JavaScript data time to load) and passes the HTML to BeautifulSoup. You might need to pip install lxml for the script to run.
Then we look for all 'div', {'class':'Layout__Main-eg6k6r-1 cSyfD'} as those contain all 4 tables on the website. You may want to use the Inspect Element tool in your browser to check the names of the tables and divs to target your search.
Then you can take any of the 4 divs and search for tr in each.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.set_window_size(700,900)
url = 'https://fantasy.premierleague.com/player-list'
browser.get(url)
time.sleep(1)
html = browser.execute_script('return document.documentElement.outerHTML')
all_html = BeautifulSoup(html,'lxml')
all_tables = all_html.find_all('div', {'class':'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')
table1_goalkeepers = all_tables[0]
rows_goalkeeper = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeeper)
table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)
browser.quit()
Sample output:
Goalkeepers:
<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston 
Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
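Since the question asks for names and points specifically, a tbody like the one above can be reduced to pairs. This is only a sketch: the HTML is a shortened copy of the sample output, names_and_points is a hypothetical helper name, and it assumes every row has the four columns Player, Team, Points, Cost.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <tbody> from the sample output above.
TBODY_HTML = (
    "<table><tbody>"
    "<tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr>"
    "<tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr>"
    "<tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr>"
    "</tbody></table>"
)


def names_and_points(html):
    """Return (player, points) pairs from four-column table rows."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 4:  # Player, Team, Points, Cost
            player, _team, points, _cost = cells
            pairs.append((player, int(points)))
    return pairs


print(names_and_points(TBODY_HTML))
```

With Selenium in the loop, the same function could be given str(rows_goalkeeper) instead of the sample string.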
This page uses JavaScript to add data, but BeautifulSoup can't run JavaScript.
You can use Selenium to control a web browser, which can run JavaScript.
Or you can check in DevTools in Firefox/Chrome (tab: Network) which URL JavaScript uses to get data from the server, and use that URL with urllib to get the data.
I chose this method (manually searching in DevTools).
I found that JavaScript gets the data in JSON format from
https://fantasy.premierleague.com/api/bootstrap-static/
Because the data comes as JSON, I can convert it to a Python list/dictionary using the json module and I don't need BeautifulSoup.
It takes more manual work to understand the structure of the data, but it gives more data than the table on the page.
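As a rough illustration of that conversion, the sketch below parses a tiny hand-written payload instead of the real response. It only mimics the three top-level keys used in this answer (element_types, teams, elements); the real payload has many more fields and entries.

```python
import json

# Hand-written sample shaped like the real response (assumption: only the
# keys used in this answer are reproduced here).
SAMPLE_TEXT = """
{
  "element_types": [{"id": 1, "plural_name": "Goalkeepers"}],
  "teams": [{"id": 10, "name": "Liverpool"}],
  "elements": [
    {"web_name": "Alisson", "element_type": 1, "team": 10,
     "total_points": 99, "now_cost": 62}
  ]
}
"""

data = json.loads(SAMPLE_TEXT)   # JSON object -> dict, JSON array -> list

first = data["elements"][0]      # each player is a plain dict
cost = first["now_cost"] / 10    # cost is stored in tenths: 62 -> 6.2

print(type(data).__name__, type(data["elements"]).__name__)
print(first["web_name"], first["total_points"], cost)
```

Once the structure is known, everything else is ordinary list and dictionary access.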
Here is all the data about Alisson, the first player on the list:
chance_of_playing_next_round = 100
chance_of_playing_this_round = 100
code = 116535
cost_change_event = 0
cost_change_event_fall = 0
cost_change_start = 2
cost_change_start_fall = -2
dreamteam_count = 1
element_type = 1
ep_next = 11.0
ep_this = 11.0
event_points = 10
first_name = Alisson
form = 10.0
id = 189
in_dreamteam = False
news =
news_added = 2020-03-06T14:00:17.901193Z
now_cost = 62
photo = 116535.jpg
points_per_game = 4.7
second_name = Ramses Becker
selected_by_percent = 9.2
special = False
squad_number = None
status = a
team = 10
team_code = 14
total_points = 99
transfers_in = 767780
transfers_in_event = 9339
transfers_out = 2033680
transfers_out_event = 2757
value_form = 1.6
value_season = 16.0
web_name = Alisson
minutes = 1823
goals_scored = 0
assists = 1
clean_sheets = 11
goals_conceded = 12
own_goals = 0
penalties_saved = 0
penalties_missed = 0
yellow_cards = 0
red_cards = 1
saves = 48
bonus = 9
bps = 439
influence = 406.2
creativity = 10.0
threat = 0.0
ict_index = 41.7
influence_rank = 135
influence_rank_type = 18
creativity_rank = 411
creativity_rank_type = 8
threat_rank = 630
threat_rank_type = 71
ict_index_rank = 294
ict_index_rank_type = 18
There is also information about teams, etc.
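A player's team field is a number, so the teams list is what turns it into a name. One way to do that lookup is a dictionary keyed by id, which keeps working even if the ids are not contiguous (whether the API guarantees contiguous ids is not something I have verified). The sample data below is hand-written, and the id gap is deliberate.

```python
# Hand-written sample; the gap between ids 2 and 10 is deliberate, to show
# a case where indexing the list by position (team - 1) would not work.
teams = [
    {"id": 1, "name": "Arsenal"},
    {"id": 2, "name": "Aston Villa"},
    {"id": 10, "name": "Liverpool"},
]
players = [
    {"web_name": "Leno", "team": 1},
    {"web_name": "Alisson", "team": 10},
]

# Build the lookup once, then resolve each player's team by id.
team_name_by_id = {team["id"]: team["name"] for team in teams}

for player in players:
    print(player["web_name"], "-", team_name_by_id[player["team"]])
```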
Code:
from urllib.request import urlopen
import json
#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
text = urlopen(url).read().decode()
data = json.loads(text)
print('\n--- element type ---\n')
#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])
print('\n--- Goalkeepers ---\n')
number = 0
for item in data['elements']:
    if item['element_type'] == 1: # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team :', data['teams'][item['team']-1]['name'])
        print('cost :', item['now_cost']/10)
        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print(' ', key, '=',value)
Result:
--- element type ---
1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards
--- Goalkeepers ---
--- 1 ---
type : Goalkeepers
first_name : Bernd
second_name : Leno
total_points: 114
team : Arsenal
cost : 5.0
--- 2 ---
type : Goalkeepers
first_name : Emiliano
second_name : Martínez
total_points: 1
team : Arsenal
cost : 4.2
--- 3 ---
type : Goalkeepers
first_name : Ørjan
second_name : Nyland
total_points: 11
team : Aston Villa
cost : 4.3
--- 4 ---
type : Goalkeepers
first_name : Tom
second_name : Heaton
total_points: 59
team : Aston Villa
cost : 4.3
The code gives the data in a different order than the table, but if you put it all in a list, or better in a pandas DataFrame, you can sort it in different orders.
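For the plain-list case, a short sketch with sorted() is enough. The three entries are hand-written from values shown in the player table earlier on this page, using the API's field names.

```python
from operator import itemgetter

# Hand-written sample using the API's field names (now_cost is in tenths).
goalkeepers = [
    {"web_name": "Alisson", "total_points": 99, "now_cost": 62},
    {"web_name": "Henderson", "total_points": 135, "now_cost": 53},
    {"web_name": "Pope", "total_points": 129, "now_cost": 49},
]

# sorted() returns a new list and leaves the original order untouched.
by_points = sorted(goalkeepers, key=itemgetter("total_points"), reverse=True)
by_cost = sorted(goalkeepers, key=itemgetter("now_cost"), reverse=True)

for player in by_points:
    print(player["web_name"], player["total_points"])
```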
EDIT:
You can use pandas to get data from JSON
from urllib.request import urlopen
import json
import pandas as pd
#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)
# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams = pd.DataFrame.from_dict(data['teams'])
# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10
# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])
# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders = players[ players['element_type'] == 2 ]
# etc.
# some information
print('\n--- goalkeepers columns ---\n')
print(goalkeepers.columns)
print('\n--- goalkeepers sorted by name ---\n')
sorted_data = goalkeepers.sort_values(['first_name'])
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- goalkeepers sorted by cost ---\n')
sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- teams columns ---\n')
print(teams.columns)
print('\n--- teams ---\n')
print(teams['name'].head())
# etc.
Results
--- goalkeepers columns ---
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
'second_name', 'selected_by_percent', 'special', 'squad_number',
'status', 'team', 'team_code', 'total_points', 'transfers_in',
'transfers_in_event', 'transfers_out', 'transfers_out_event',
'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
'creativity_rank_type', 'threat_rank', 'threat_rank_type',
'ict_index_rank', 'ict_index_rank_type'],
dtype='object')
--- goalkeepers sorted by name ---
first_name team now_cost
94 Aaron Bournemouth 4.5
305 Adrián Liverpool 4.0
485 Alex Southampton 4.5
533 Alfie Spurs 4.0
291 Alisson Liverpool 6.2
--- goalkeepers sorted by cost ---
first_name team now_cost
291 Alisson Liverpool 6.2
323 Ederson Man City 6.0
263 Kasper Leicester 5.4
169 Kepa Chelsea 5.4
515 Hugo Spurs 5.3
--- teams columns ---
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
'position', 'short_name', 'strength', 'team_division', 'unavailable',
'win', 'strength_overall_home', 'strength_overall_away',
'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
'strength_defence_away', 'pulse_id'],
dtype='object')
--- teams ---
0 Arsenal
1 Aston Villa
2 Bournemouth
3 Brighton
4 Burnley
Name: name, dtype: object