Table Web Scraping Issues with Python

I am having issues scraping data from this website: https://fantasy.premierleague.com/player-list

I am interested in getting the players' names and points from the different tables.

I'm relatively new to Python and completely new to web scraping. Here is what I have so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup


url = 'https://fantasy.premierleague.com/player-list'


html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

rows = soup.find_all('tr')
print(rows)

From here I would go on to find all the 'td' information.

However, I get no results for 'tr'. I can pass 'a' in as an argument and get the links for the site fine, but I haven't been able to get any data from the tables. My understanding is that passing 'tr' will find all rows of any tables within the website.

Any ideas where I am going wrong? Thanks for your help.

You can use Selenium's webdriver, pandas and BeautifulSoup to get all the table data.

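from io import StringIO

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://fantasy.premierleague.com/player-list"

# Selenium drives a real browser, so the page's JavaScript gets executed
driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
driver.quit()

# name the parser explicitly (lxml must be installed)
soup = BeautifulSoup(html, "lxml")
# the class name is auto-generated by the site and may change over time
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})

# read_html returns a list of DataFrames, one per table found
df = pd.read_html(StringIO(str(table)))

print(df)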

Output will be:

[             Player            Team  Points  Cost
0           Alisson       Liverpool      99  £6.2
1           Ederson        Man City      89  £6.0
2              Kepa         Chelsea      72  £5.4
3        Schmeichel       Leicester     122  £5.4
4            de Gea         Man Utd     105  £5.3
5            Lloris           Spurs      56  £5.3
6         Henderson   Sheffield Utd     135  £5.3
7          Pickford         Everton      93  £5.2
8          Patrício          Wolves     122  £5.2
9          Dubravka       Newcastle     124  £5.1
10             Leno         Arsenal     114  £5.0
11           Guaita  Crystal Palace     122  £5.0
12             Pope         Burnley     129  £4.9
13           Foster         Watford     113  £4.9
14        Fabianski        West Ham      61  £4.9
15        Caballero         Chelsea       7  £4.8
16             Ryan        Brighton     105  £4.7
17            Bravo        Man City      11  £4.7
18            Grant         Man Utd       0  £4.7
19           Romero         Man Utd       0  £4.6
20             Krul         Norwich      94  £4.6
21         Mignolet       Liverpool       0  £4.5
22         McCarthy     Southampton      74  £4.5
23         Ramsdale     Bournemouth      97  £4.5
24         Fahrmann         Norwich       1  £4.4
and so on ...]

The table you want to scrape is generated using JavaScript, which is not executed when you do html = urlopen(url), so the table is not in the soup either.
There are many ways to get dynamically generated data. Check here for example.
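To see why find_all('tr') comes back empty, it can help to compare the HTML the server sends with the HTML after a browser has run the scripts. The sketch below uses only the standard library and two made-up snippets (STATIC_HTML and RENDERED_HTML are illustrative assumptions, not the site's actual markup):

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in a document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical page as served: an empty container plus a script tag
STATIC_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical page after JavaScript has filled in the table
RENDERED_HTML = (
    '<html><body><div id="root"><table><tbody>'
    '<tr><td>Alisson</td><td>99</td></tr>'
    '<tr><td>Ederson</td><td>89</td></tr>'
    '</tbody></table></div></body></html>'
)

print(count_tags(STATIC_HTML, 'tr'))    # rows visible to urlopen
print(count_tags(RENDERED_HTML, 'tr'))  # rows visible after rendering
```

If the first count is zero for the real page source, the rows are being added later by JavaScript and a plain urlopen will never see them.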

https://fantasy.premierleague.com/player-list uses JavaScript to generate the data in the HTML. BeautifulSoup cannot run JavaScript, so we need to emulate a real browser to load the data. To do this you can use Selenium. In the code below I use Firefox, but you can use Chrome, for example. Please check Selenium's documentation on how to get it running.

The script opens a Firefox browser, pauses for 1 second (to make sure that all JavaScript data has loaded) and passes the HTML to BeautifulSoup. You might need to pip install lxml (the parser) for the script to run.
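A fixed 1 second pause can be too short on a slow connection and wasteful on a fast one. As a rough alternative, here is a minimal polling helper that retries a check until it succeeds or a timeout expires. The name wait_until and its parameters are made up for this sketch; Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)
```

With a Selenium browser object, a call such as wait_until(lambda: browser.find_elements('tag name', 'tr')) could stand in for the fixed sleep, assuming the table rows are the last thing the page adds.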

Then we look for all div elements with the class 'Layout__Main-eg6k6r-1 cSyfD', as those contain all 4 tables on the website. You may want to use the Inspect Element tool in your browser to check the names of the tables and divs to target your search.

Then you can pick any of the 4 divs and search for tr in each.

from selenium import webdriver
import time
from bs4 import BeautifulSoup 

browser = webdriver.Firefox()
browser.set_window_size(700,900)

url = 'https://fantasy.premierleague.com/player-list'

browser.get(url)
time.sleep(1)

html = browser.execute_script('return document.documentElement.outerHTML')


all_html = BeautifulSoup(html,'lxml')
all_tables = all_html.find_all('div', {'class':'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')

table1_goalkeepers = all_tables[0]
rows_goalkeeper = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeeper)

table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)


browser.quit()

Sample output:

Goalkeepers: 

<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston 
Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
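The tbody markup still has to be turned into names and points. A minimal sketch of that step, using only the standard library and the first three rows from the sample output above; it assumes the fragment is well-formed enough to be parsed as XML, and parse_rows is a made-up helper name. With BeautifulSoup the same idea would be to loop over the tr elements and read their td cells.

```python
import xml.etree.ElementTree as ET

# First three rows copied from the sample output above
tbody_html = (
    '<tbody>'
    '<tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr>'
    '<tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr>'
    '<tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr>'
    '</tbody>'
)


def parse_rows(fragment):
    """Return one dict per <tr>, built from its four <td> cells."""
    players = []
    for tr in ET.fromstring(fragment).iter('tr'):
        name, team, points, cost = [td.text for td in tr.iter('td')]
        players.append({
            'name': name,
            'team': team,
            'points': int(points),
            'cost': float(cost.lstrip('£')),  # drop the currency sign
        })
    return players


for player in parse_rows(tbody_html):
    print(player['name'], player['points'])
```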

This page uses JavaScript to add data, but BeautifulSoup can't run JavaScript.

You can use Selenium to control a web browser, which can run JavaScript.

Or you can check in DevTools in Firefox/Chrome (tab: Network) which URL JavaScript uses to get data from the server, and use it with urllib to get the data.

I chose this method (manually searching in DevTools).

I found that JavaScript gets the data in JSON format from

https://fantasy.premierleague.com/api/bootstrap-static/

Because I get the data as JSON, I can convert it to a Python list/dictionary using the json module, and I don't need BeautifulSoup.
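As a small sketch of this step, the download and the JSON decoding can be wrapped in one function. The name fetch_json, the opener parameter, the timeout and the User-Agent header are all assumptions made for this example; they are not something the API is known to require.

```python
import json
from urllib.request import Request, urlopen

API_URL = 'https://fantasy.premierleague.com/api/bootstrap-static/'


def fetch_json(url, opener=urlopen, timeout=10):
    """Download url and decode the JSON body into Python objects.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib's urlopen.
    """
    # Assumption: a browser-like User-Agent is harmless and may help
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Calling fetch_json(API_URL) should then return a dictionary; going by the data shown in this answer, it includes the keys 'elements', 'teams' and 'element_types'.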

It takes more manual work to understand the structure of the data, but it gives more data than the table on the page.

Here is all the data about the first player on the list, Alisson:

 chance_of_playing_next_round = 100
 chance_of_playing_this_round = 100
 code = 116535
 cost_change_event = 0
 cost_change_event_fall = 0
 cost_change_start = 2
 cost_change_start_fall = -2
 dreamteam_count = 1
 element_type = 1
 ep_next = 11.0
 ep_this = 11.0
 event_points = 10
 first_name = Alisson
 form = 10.0
 id = 189
 in_dreamteam = False
 news = 
 news_added = 2020-03-06T14:00:17.901193Z
 now_cost = 62
 photo = 116535.jpg
 points_per_game = 4.7
 second_name = Ramses Becker
 selected_by_percent = 9.2
 special = False
 squad_number = None
 status = a
 team = 10
 team_code = 14
 total_points = 99
 transfers_in = 767780
 transfers_in_event = 9339
 transfers_out = 2033680
 transfers_out_event = 2757
 value_form = 1.6
 value_season = 16.0
 web_name = Alisson
 minutes = 1823
 goals_scored = 0
 assists = 1
 clean_sheets = 11
 goals_conceded = 12
 own_goals = 0
 penalties_saved = 0
 penalties_missed = 0
 yellow_cards = 0
 red_cards = 1
 saves = 48
 bonus = 9
 bps = 439
 influence = 406.2
 creativity = 10.0
 threat = 0.0
 ict_index = 41.7
 influence_rank = 135
 influence_rank_type = 18
 creativity_rank = 411
 creativity_rank_type = 8
 threat_rank = 630
 threat_rank_type = 71
 ict_index_rank = 294
 ict_index_rank_type = 18

There is also information about teams, etc.

Code:

from urllib.request import urlopen
import json

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

text = urlopen(url).read().decode()
data = json.loads(text)

print('\n--- element type ---\n')        

#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])

print('\n--- Goalkeepers ---\n')        

number = 0
for item in data['elements']:
        
    if item['element_type'] == 1: # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type        :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name  :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team        :', data['teams'][item['team']-1]['name'])
        print('cost        :', item['now_cost']/10)

        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print('    ', key, '=',value)

Result:

--- element type ---

1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards

--- Goalkeepers ---

--- 1 ---
type        : Goalkeepers
first_name  : Bernd
second_name : Leno
total_points: 114
team        : Arsenal
cost        : 5.0
--- 2 ---
type        : Goalkeepers
first_name  : Emiliano
second_name : Martínez
total_points: 1
team        : Arsenal
cost        : 4.2
--- 3 ---
type        : Goalkeepers
first_name  : Ørjan
second_name : Nyland
total_points: 11
team        : Aston Villa
cost        : 4.3
--- 4 ---
type        : Goalkeepers
first_name  : Tom
second_name : Heaton
total_points: 59
team        : Aston Villa
cost        : 4.3                
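The code above finds team and type names by list position (id - 1), which works only while the ids are consecutive and start at 1. A slightly more defensive sketch builds id-to-name dictionaries first. The data below is a trimmed, hand-written sample in the same shape as the API response; the team id for Leno is an assumption based on Arsenal being first in the teams list.

```python
# Trimmed, hand-written sample in the same shape as the API response
data = {
    'element_types': [{'id': 1, 'plural_name': 'Goalkeepers'},
                      {'id': 2, 'plural_name': 'Defenders'}],
    'teams': [{'id': 1, 'name': 'Arsenal'},
              {'id': 10, 'name': 'Liverpool'}],
    'elements': [
        {'first_name': 'Bernd', 'second_name': 'Leno', 'element_type': 1,
         'team': 1, 'total_points': 114, 'now_cost': 50},
        {'first_name': 'Alisson', 'second_name': 'Ramses Becker',
         'element_type': 1, 'team': 10, 'total_points': 99, 'now_cost': 62},
    ],
}

# Build id -> name dictionaries once, instead of indexing with id - 1
team_names = {team['id']: team['name'] for team in data['teams']}
type_names = {t['id']: t['plural_name'] for t in data['element_types']}

goalkeepers = [
    {
        'first_name': item['first_name'],
        'second_name': item['second_name'],
        'type': type_names[item['element_type']],
        'team': team_names[item['team']],
        'points': item['total_points'],
        'cost': item['now_cost'] / 10,  # 62 -> 6.2
    }
    for item in data['elements']
    if item['element_type'] == 1  # Goalkeepers
]

for row in goalkeepers:
    print(row)
```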

The code gives the data in a different order than the table, but if you put it all in a list, or better in a pandas DataFrame, then you can sort it in different orders.
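For the plain list option, a minimal sketch of sorting a list of dictionaries with the standard library. The four records are a small sample with values taken from the outputs in this answer; the key names are chosen for this example.

```python
from operator import itemgetter

# A few goalkeepers with values taken from the outputs in this answer
goalkeepers = [
    {'name': 'Leno', 'team': 'Arsenal', 'points': 114, 'cost': 5.0},
    {'name': 'Nyland', 'team': 'Aston Villa', 'points': 11, 'cost': 4.3},
    {'name': 'Heaton', 'team': 'Aston Villa', 'points': 59, 'cost': 4.3},
    {'name': 'Alisson', 'team': 'Liverpool', 'points': 99, 'cost': 6.2},
]

# Highest points first
by_points = sorted(goalkeepers, key=itemgetter('points'), reverse=True)

# Highest cost first, then by name to break ties
by_cost = sorted(goalkeepers, key=lambda g: (-g['cost'], g['name']))

print([g['name'] for g in by_points])
print([g['name'] for g in by_cost])
```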


EDIT:

You can use pandas to work with the data from the JSON:

from urllib.request import urlopen
import json
import pandas as pd

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)

# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams   = pd.DataFrame.from_dict(data['teams'])

# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10

# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])

# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders   = players[ players['element_type'] == 2 ]
# etc.

# some information
print('\n--- goalkeepers columns ---\n')

print(goalkeepers.columns)

print('\n--- goalkeepers sorted by name ---\n')

sorted_data = goalkeepers.sort_values(['first_name'])

print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- goalkeepers sorted by cost ---\n')

sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)

print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- teams columns ---\n')

print(teams.columns)

print('\n--- teams ---\n')

print(teams['name'].head())

# etc.

Results:

--- goalkeepers columns ---

Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
       'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
       'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
       'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
       'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
       'second_name', 'selected_by_percent', 'special', 'squad_number',
       'status', 'team', 'team_code', 'total_points', 'transfers_in',
       'transfers_in_event', 'transfers_out', 'transfers_out_event',
       'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
       'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
       'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
       'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
       'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
       'creativity_rank_type', 'threat_rank', 'threat_rank_type',
       'ict_index_rank', 'ict_index_rank_type'],
      dtype='object')

--- goalkeepers sorted by name ---

    first_name         team  now_cost
94       Aaron  Bournemouth       4.5
305     Adrián    Liverpool       4.0
485       Alex  Southampton       4.5
533      Alfie        Spurs       4.0
291    Alisson    Liverpool       6.2

--- goalkeepers sorted by cost ---

    first_name       team  now_cost
291    Alisson  Liverpool       6.2
323    Ederson   Man City       6.0
263     Kasper  Leicester       5.4
169       Kepa    Chelsea       5.4
515       Hugo      Spurs       5.3

--- teams columns ---

Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
       'position', 'short_name', 'strength', 'team_division', 'unavailable',
       'win', 'strength_overall_home', 'strength_overall_away',
       'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
       'strength_defence_away', 'pulse_id'],
      dtype='object')

--- teams ---

0        Arsenal
1    Aston Villa
2    Bournemouth
3       Brighton
4        Burnley
Name: name, dtype: object
