How to extract player names using Python with BeautifulSoup from cricinfo
I'm learning Beautiful Soup. I want to extract the player names, i.e. the playing eleven for both teams, from cricinfo.com. The exact link is "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010". The problem is that the website only displays players under the class "wrap batsmen" if they have batted. Otherwise they are placed under the class "wrap dnb". I want to extract all the players irrespective of whether they have batted or not. How can I maintain two arrays (one for each team) that will dynamically search for players in "wrap batsmen" and "wrap dnb" (if required)?
This is my attempt:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
years = []
# Years we will be analyzing
for i in range(2010, 2018):
    years.append(i)
names = []
# URL page we will scraping (see image above)
url = "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")
for a in range(0, 1):
    names.append([a.getText() for a in soup.find_all("div", class_="cell batsmen")[1:][a].findAll('a', limit=1)])
soup = soup.find_all("div", class_="wrap dnb")
print(soup[0])
While this is possible with BeautifulSoup, it's not the best tool for the job. All that data (and much more) is available through the API. Simply pull that and then parse the JSON to get what you want. Here's a quick script to get the 11 players for each team.
You can find the API URL by opening the browser's dev tools (Ctrl-Shift-I) and watching which requests the browser makes: look at Network -> XHR in the side panel. You may need to click around the page before the request shows up.
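A URL copied from the Network tab arrives as one long string, so it can help to split it into the endpoint and its query parameters before reusing it. The following is only a sketch using the standard library: `copied_url` is an assumption, built here by joining this answer's endpoint and parameters, not a value captured from a live browser session.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

# Assumption: this string stands in for a URL copied from the Network tab.
copied_url = (
    "https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary"
    "?contentorigin=espn&event=439146&lang=en&region=gb&section=cricinfo"
)

parts = urlsplit(copied_url)

# Endpoint without the query string or fragment
base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Query string as a plain dict of strings
payload = dict(parse_qsl(parts.query))

print(base_url)
print(payload)
```

The endpoint and the dictionary recovered this way are the two pieces an HTTP client call needs.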
import requests
# Summary endpoint found through the browser's network tab
url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'
payload = {
    'contentorigin': 'espn',
    'event': '439146',
    'lang': 'en',
    'region': 'gb',
    'section': 'cricinfo'}
jsonData = requests.get(url, params=payload).json()
roster = jsonData['rosters']
# Map each team's display name to the list of its players' names
players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)
Output:
print(players)
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}
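Since the question asks for two arrays, one per team, the same traversal can be wrapped in a small function and the resulting dictionary unpacked into two lists. This is a rough sketch, not part of the API: `players_by_team` is a hypothetical helper, and `sample` is a trimmed, hand-written dictionary that only mimics the keys used above (`rosters`, `team`, `roster`, `athlete`, `displayName`) so the example runs without a network call.

```python
def players_by_team(summary):
    """Return {team name: [player names]} from a summary-style dict."""
    teams = {}
    for entry in summary.get("rosters", []):
        team_name = entry["team"]["displayName"]
        teams[team_name] = [
            player["athlete"]["displayName"] for player in entry.get("roster", [])
        ]
    return teams


# Assumption: hand-written sample with two players per team, shaped like
# the parts of the response that the script above reads.
sample = {
    "rosters": [
        {
            "team": {"displayName": "West Indies"},
            "roster": [
                {"athlete": {"displayName": "Chris Gayle"}},
                {"athlete": {"displayName": "Andre Fletcher"}},
            ],
        },
        {
            "team": {"displayName": "South Africa"},
            "roster": [
                {"athlete": {"displayName": "Graeme Smith"}},
                {"athlete": {"displayName": "Loots Bosman"}},
            ],
        },
    ]
}

teams = players_by_team(sample)

# Two separate lists, in the order the teams appear in the data
team_a, team_b = teams.values()
print(team_a)
print(team_b)
```

With the real response, `players_by_team(jsonData)` should give the same dictionary as the loop above, provided the response keeps that shape.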
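If you would rather stay with BeautifulSoup, the idea from the question (read "wrap batsmen" first, then add anyone listed under "wrap dnb") might look roughly like this. Everything about the markup here is an assumption: the class names "wrap batsmen", "cell batsmen" and "wrap dnb" come from the question, but the `innings` container, the nesting and the placeholder names in `SAMPLE_HTML` are invented so the sketch can run offline. The live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Assumption: invented markup that only imitates the classes named in the
# question; "innings" is a hypothetical per-team container.
SAMPLE_HTML = """
<div class="innings">
  <div class="wrap batsmen"><div class="cell batsmen"><a href="#">Batter One</a></div></div>
  <div class="wrap batsmen"><div class="cell batsmen"><a href="#">Batter Two</a></div></div>
  <div class="wrap dnb"><a href="#">Reserve One</a>, <a href="#">Reserve Two</a></div>
</div>
<div class="innings">
  <div class="wrap batsmen"><div class="cell batsmen"><a href="#">Batter Three</a></div></div>
  <div class="wrap dnb"><a href="#">Reserve Three</a></div>
</div>
"""


def players_per_innings(html):
    """Return one list of names per innings: those who batted, then the rest."""
    soup = BeautifulSoup(html, "html.parser")
    teams = []
    for innings in soup.select("div.innings"):
        batted = innings.select("div.wrap.batsmen div.cell.batsmen a")
        did_not_bat = innings.select("div.wrap.dnb a")  # empty if everyone batted
        teams.append([a.get_text(strip=True) for a in batted + did_not_bat])
    return teams


first_team, second_team = players_per_innings(SAMPLE_HTML)
print(first_team)
print(second_team)
```

Scoping both lookups to one innings container at a time is what keeps the two teams apart; searching the whole document for "wrap dnb" would mix them.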