How to extract player names using Python with BeautifulSoup from cricinfo

I'm learning Beautiful Soup. I want to extract the player names, i.e. the playing eleven for both teams, from cricinfo.com. The exact link is https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010

The problem is that the website only lists players under the class "wrap batsmen" if they have batted; otherwise they are placed under the class "wrap dnb". I want to extract all the players, irrespective of whether they have batted. How can I maintain two arrays (one for each team) that dynamically search for players in "wrap batsmen" and, if required, "wrap dnb"?

This is my attempt:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
years = []
# Years we will be analyzing
for i in range(2010, 2018):
    years.append(i)

names = []


# URL of the page we will be scraping
url = "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")


for a in range(0, 1):
    names.append([a.getText() for a in soup.find_all("div", class_="cell batsmen")[1:][a].findAll('a', limit=1)])

soup = soup.find_all("div", class_="wrap dnb")
print(soup[0])

While this is possible with BeautifulSoup, it's not the best tool for the job. All of that data (and much more) is available through the site's API. Simply request it and parse the JSON to get what you want. The quick script below gets the 11 players for each team.

You can find the API URL with the browser's dev tools (Ctrl+Shift+I): open Network -> XHR in the side panel and watch which requests the browser makes. You may need to click around the page before it makes the request.
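In this case the two numbers in the API request also appear in the scorecard URL from the question: 13266 sits in the API path and 439146 is passed as `event`. Here is a minimal sketch of pulling them out with a regular expression. The helper name `ids_from_scorecard_url` is made up for illustration, and it is an assumption that other scorecard URLs follow the same `/series/<id>/scorecard/<id>/` layout; confirm against the Network tab before relying on it.

```python
import re

# Assumed layout: .../series/<series_id>/scorecard/<event_id>/...
SCORECARD_RE = re.compile(r"/series/(\d+)/scorecard/(\d+)/")


def ids_from_scorecard_url(url):
    """Return (series_id, event_id) as strings, or raise ValueError."""
    match = SCORECARD_RE.search(url)
    if match is None:
        raise ValueError(f"not a recognised scorecard URL: {url}")
    return match.group(1), match.group(2)


scorecard = (
    "https://www.espncricinfo.com/series/13266/scorecard/439146/"
    "west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
)
series_id, event_id = ids_from_scorecard_url(scorecard)

# Rebuild the API URL and query parameters used in the script below.
api_url = f"https://site.web.api.espn.com/apis/site/v2/sports/cricket/{series_id}/summary"
api_params = {"contentorigin": "espn", "event": event_id, "lang": "en",
              "region": "gb", "section": "cricinfo"}
print(api_url)
print(api_params["event"])
```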

import requests

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'

payload = {
    'contentorigin': 'espn',
    'event': '439146',
    'lang': 'en',
    'region': 'gb',
    'section': 'cricinfo',
}

jsonData = requests.get(url, params=payload).json()

roster = jsonData['rosters']

players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)

Output:

print(players)
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}
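If you would rather stay with BeautifulSoup, as the question asks, the idea is to walk each innings and collect the links from both the batted and the did-not-bat containers. The sketch below runs on a small hand-written HTML fragment, not the real page. Only the class names "wrap batsmen" and "wrap dnb" come from the question; the `innings` wrapper, the link structure, and the helper name `players_per_innings` are assumptions to adjust after inspecting the actual markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the scorecard markup (structure is assumed).
SAMPLE_HTML = """
<div class="innings">
  <div class="wrap batsmen"><a href="#">Chris Gayle</a></div>
  <div class="wrap batsmen"><a href="#">Andre Fletcher</a></div>
  <div class="wrap dnb"><a href="#">Sulieman Benn</a>, <a href="#">Kemar Roach</a></div>
</div>
<div class="innings">
  <div class="wrap batsmen"><a href="#">Graeme Smith</a></div>
  <div class="wrap dnb"><a href="#">Dale Steyn</a></div>
</div>
"""


def players_per_innings(html):
    """Return one list of player names per innings, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    teams = []
    for innings in soup.find_all("div", class_="innings"):
        names = []
        # A list for class_ matches a div carrying either class, so players
        # who batted and players who did not are picked up in one pass.
        for block in innings.find_all("div", class_=["batsmen", "dnb"]):
            for link in block.find_all("a"):
                name = link.get_text(strip=True)
                if name and name not in names:
                    names.append(name)
        teams.append(names)
    return teams


team_one, team_two = players_per_innings(SAMPLE_HTML)
print(team_one)
print(team_two)
```

This gives the two per-team lists the question asks for, with no special case for whether a "wrap dnb" block exists: an innings without one simply contributes no extra names.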
