Beautiful Soup find_all not returning all the results
I wanted to scrape the ball-by-ball data of a cricket match using find_all in BeautifulSoup. The code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.espncricinfo.com/series/10904/commentary/1075502/south-africa-vs-bangladesh-1st-test-bangladesh-tour-of-sa-2017-18'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
comment = soup.find_all('div', class_ = "over-circle")
print(len(comment))
print(comment[22])
I read answers to previous questions regarding this issue, and almost all of them recommend using a different HTML parser. I have tried lxml, html.parser and html5lib, which were the ones mostly recommended in those answers, but none of them gives a different result. The number of balls is shown as 23, whereas it should be much higher.
Output:
23
<div class="over-circle low-score" data-reactid="463"><span class="over-score" data-reactid="464">0</span></div>
You've guessed it right. Not all of the data is loaded at once, so you see only what is loaded initially. You could implement additional logic that loops until the program reaches the last page.
Here's the URL of one of the data pages: https://site.web.api.espn.com/apis/site/v2/sports/cricket/10904/playbyplay?contentorigin=espn&event=1075502&page=6&period=4&section=cricinfo
You will need to keep increasing the page param for as long as you keep getting valid data.
If you examine the response of this URL, you'll see that it is a JSON file with an additional 24 items.
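The paging idea described above could be sketched roughly as follows. The helper names `collect_items` and `fetch_espn_page`, the `max_pages` safety cap, and the stop condition (a failed request, or a page whose `commentary.items` list is missing or empty) are assumptions made for illustration; how the API actually responds to an out-of-range page has not been verified here.

```python
def collect_items(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... and gather items until a page is empty.

    fetch_page is any callable that takes a page number and returns the decoded
    JSON (a dict), or None when the request failed.
    """
    items = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        # Assumption: a page past the end has no 'commentary' -> 'items' entries.
        page_items = (data or {}).get('commentary', {}).get('items', [])
        if not page_items:
            break
        items.extend(page_items)
    return items


def fetch_espn_page(page):
    """Hypothetical fetcher for the play-by-play URL shown above."""
    import requests  # imported here so collect_items can be used without it

    url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/10904/playbyplay'
    params = {
        'contentorigin': 'espn',
        'event': '1075502',
        'page': page,
        'period': '4',
        'section': 'cricinfo',
    }
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        return None
    return response.json()
```

Calling `collect_items(fetch_espn_page)` would then return one flat list of commentary items. Passing the fetcher in as an argument keeps the loop itself easy to try out with canned data before hitting the real endpoint.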
The page is dynamic, so not all of it is rendered in the initial HTML. You can go straight to the source and pull in the JSON response, which also includes the total number of pages. Once you have the total number of pages, you can iterate through them using the query parameters and append the items from each page to those already collected to get a final output with all of the data.
I don't know exactly what data you're interested in, but it's all there. I converted it to a dataframe, but you can do what you'd like with the JSON structure.
It is nested, however. The athletesInvolved column consists of lists of dictionaries. You can still normalize/flatten that out too if needed (let me know if you would like that done too, it's rather easy to do), but it will obviously increase the number of rows/columns.
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/10904/playbyplay'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
payload = {
    'contentorigin': 'espn',
    'event': '1075502',
    'page': '1',
    'period': '4',
    'section': 'cricinfo'}

# Get the initial page of data, including the total number of pages to iterate through
response = requests.get(url, headers=headers, params=payload).json()
pageCount = response['commentary']['pageCount']
print('Total pages: %s\nProcessed page: 1' % pageCount)

# Store the initial page in jsonData, then fetch pages 2..pageCount and append their items for a final result
jsonData = response
for page in range(2, pageCount + 1):
    payload = {
        'contentorigin': 'espn',
        'event': '1075502',
        'page': page,
        'period': '4',
        'section': 'cricinfo'}
    response = requests.get(url, headers=headers, params=payload).json()
    jsonData['commentary']['items'] = jsonData['commentary']['items'] + response['commentary']['items']
    print('Processed page: %s' % page)

# pd.json_normalize is the top-level form available since pandas 1.0;
# the older `from pandas.io.json import json_normalize` import is deprecated
df = pd.json_normalize(jsonData['commentary']['items'])
Output: sample of the first 5 rows out of 198 rows
print (df.head(5).to_string())
athletesInvolved awayScore batsman.athlete.displayName batsman.athlete.fullName batsman.athlete.id batsman.athlete.name batsman.athlete.shortName batsman.faced batsman.fours batsman.runs batsman.sixes batsman.team.abbreviation batsman.team.displayName batsman.team.id batsman.team.name batsman.totalRuns bowler.athlete.displayName bowler.athlete.fullName bowler.athlete.id bowler.athlete.name bowler.athlete.shortName bowler.balls bowler.conceded bowler.maidens bowler.overs bowler.team.abbreviation bowler.team.displayName bowler.team.id bowler.team.name bowler.wickets clock date dismissal.batsman.athlete.displayName dismissal.batsman.athlete.fullName dismissal.batsman.athlete.id dismissal.batsman.athlete.name dismissal.batsman.athlete.shortName dismissal.bowled dismissal.bowler.athlete.displayName dismissal.bowler.athlete.fullName dismissal.bowler.athlete.id dismissal.bowler.athlete.name dismissal.bowler.athlete.shortName dismissal.dismissal dismissal.minutes dismissal.retiredText dismissal.text dismissal.type homeScore id innings.ballLimit innings.balls innings.byes innings.day innings.fallOfWickets innings.id innings.legByes innings.noBalls innings.number innings.remainingBalls innings.remainingOvers innings.remainingRuns innings.runRate innings.runs innings.session innings.target innings.totalRuns innings.wickets innings.wides mediaId otherBatsman.athlete.displayName otherBatsman.athlete.fullName otherBatsman.athlete.id otherBatsman.athlete.name otherBatsman.athlete.shortName otherBatsman.faced otherBatsman.fours otherBatsman.runs otherBatsman.sixes otherBatsman.team.abbreviation otherBatsman.team.displayName otherBatsman.team.id otherBatsman.team.name otherBatsman.totalRuns otherBowler.athlete.displayName otherBowler.athlete.fullName otherBowler.athlete.id otherBowler.athlete.name otherBowler.athlete.shortName otherBowler.balls otherBowler.conceded otherBowler.maidens otherBowler.overs otherBowler.team.abbreviation otherBowler.team.displayName otherBowler.team.id 
otherBowler.team.name otherBowler.wickets over.actual over.ball over.balls over.byes over.complete over.legByes over.limit over.maiden over.noBall over.number over.overs over.runs over.unique over.wickets over.wide period periodText playType.description playType.id postText preText scoreValue sequence shortText speedKPH speedMPH team.abbreviation team.displayName team.id team.name text
0 [{'id': '56194', 'name': 'Tamim Iqbal', 'short... 0 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim 1 0 0 0 BDESH Bangladesh 25 Bangladesh 0 Morne Morkel Morne Morkel 46538 Morne Morkel Morkel 1 0 0 0.1 SA South Africa 3 South Africa 0 00:00 2017-09-28T10:00 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim False Morne Morkel Morne Morkel 46538 Morne Morkel Morkel False 0 NaN 0 410 0 1 0 4 0 199062 0 0 4 0 0.0 424 0.0 0 2 424 0 0 0 0 Imrul Kayes Imrul Kayes 280734 Imrul Kayes Imrul 0 0 0 0 BDESH Bangladesh 25 Bangladesh 0 NaN NaN NaN NaN NaN 0 0 0 NaN NaN NaN NaN NaN 0 0.1 1 6 0 False 0 0.0 1 0 1 0.1 0 0.01 2 0 4 4th innings no run 2 <b>2.25pm</b> South Africa gather into a huddl... 0 400001 Morkel to Tamim Iqbal, no run 138.452 86.030 BDESH Bangladesh 25 Bangladesh fullish length ball, angled in from wide of th...
1 [{'id': '56194', 'name': 'Tamim Iqbal', 'short... 0 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim 2 0 0 0 BDESH Bangladesh 25 Bangladesh 0 Morne Morkel Morne Morkel 46538 Morne Morkel Morkel 2 0 0 0.2 SA South Africa 3 South Africa 0 00:00 2017-09-28T10:00 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim False Morne Morkel Morne Morkel 46538 Morne Morkel Morkel False 0 NaN 0 420 0 2 0 4 0 199062 0 0 4 0 0.0 424 0.0 0 2 424 0 0 0 0 Imrul Kayes Imrul Kayes 280734 Imrul Kayes Imrul 0 0 0 0 BDESH Bangladesh 25 Bangladesh 0 NaN NaN NaN NaN NaN 0 0 0 NaN NaN NaN NaN NaN 0 0.2 2 6 0 False 0 0.0 1 0 1 0.2 0 0.02 2 0 4 4th innings no run 2 0 400002 Morkel to Tamim Iqbal, no run 135.891 84.439 BDESH Bangladesh 25 Bangladesh length ball outside off, Tamim stands tall and...
2 [{'id': '56194', 'name': 'Tamim Iqbal', 'short... 0 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim 3 0 0 0 BDESH Bangladesh 25 Bangladesh 0 Morne Morkel Morne Morkel 46538 Morne Morkel Morkel 3 0 0 0.3 SA South Africa 3 South Africa 0 00:00 2017-09-28T10:00 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim False Morne Morkel Morne Morkel 46538 Morne Morkel Morkel False 0 NaN 0 430 0 3 0 4 0 199062 0 0 4 0 0.0 424 0.0 0 2 424 0 0 0 0 Imrul Kayes Imrul Kayes 280734 Imrul Kayes Imrul 0 0 0 0 BDESH Bangladesh 25 Bangladesh 0 NaN NaN NaN NaN NaN 0 0 0 NaN NaN NaN NaN NaN 0 0.3 3 6 0 False 0 0.0 1 0 1 0.3 0 0.03 2 0 4 4th innings no run 2 Zahi: "The six went for four? Last ball needs ... 0 400003 Morkel to Tamim Iqbal, no run 140.489 87.296 BDESH Bangladesh 25 Bangladesh fullish, comes into Tamim who flicks it to mid...
3 [{'id': '56194', 'name': 'Tamim Iqbal', 'short... 0 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim 4 0 0 0 BDESH Bangladesh 25 Bangladesh 0 Morne Morkel Morne Morkel 46538 Morne Morkel Morkel 4 0 0 0.4 SA South Africa 3 South Africa 1 00:00 2017-09-28T10:00 Tamim Iqbal Tamim Iqbal Khan 56194 Tamim Iqbal Tamim True Morne Morkel Morne Morkel 46538 Morne Morkel Morkel True 2 Tamim Iqbal b Morkel 0 (2m 4b 0x4 0x6) SR: 0.00 bowled 0 440 0 4 0 4 1 199062 0 0 4 0 0.0 424 0.0 0 2 424 0 1 0 0 Imrul Kayes Imrul Kayes 280734 Imrul Kayes Imrul 0 0 0 0 BDESH Bangladesh 25 Bangladesh 0 NaN NaN NaN NaN NaN 0 0 0 NaN NaN NaN NaN NaN 0 0.4 4 6 0 False 0 0.0 1 0 1 0.4 0 0.04 2 0 4 4th innings out 9 0 400004 Morkel to Tamim Iqbal, OUT 136.028 84.524 BDESH Bangladesh 25 Bangladesh bowled him! Morkel strikes first over the chas...
4 [{'id': '373696', 'name': 'Mominul Haque', 'sh... 0 Mominul Haque Mominul Haque 373696 Mominul Haque Mominul 1 0 0 0 BDESH Bangladesh 25 Bangladesh 0 Morne Morkel Morne Morkel 46538 Morne Morkel Morkel 5 0 0 0.5 SA South Africa 3 South Africa 1 00:00 2017-09-28T10:00 Mominul Haque Mominul Haque 373696 Mominul Haque Mominul False Morne Morkel Morne Morkel 46538 Morne Morkel Morkel False 0 NaN 0 450 0 5 0 4 0 199062 0 0 4 0 0.0 424 0.0 0 2 424 0 1 0 0 Imrul Kayes Imrul Kayes 280734 Imrul Kayes Imrul 0 0 0 0 BDESH Bangladesh 25 Bangladesh 0 NaN NaN NaN NaN NaN 0 0 0 NaN NaN NaN NaN NaN 0 0.5 5 6 0 False 0 0.0 1 0 1 0.5 0 0.05 2 0 4 4th innings no run 2 0 400005 Morkel to Mominul Haque, no run 139.982 86.981 BDESH Bangladesh 25 Bangladesh <b>huge appeal for a leg before</b>. Not out s...
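The flattening of the nested athletesInvolved column mentioned earlier might look something like this in plain Python. This is only a sketch: the function name `flatten_athletes`, the `ball_id` key and the `athlete.` prefix are hypothetical naming choices, and the sample records are made up and abbreviated to `id` and `name` rather than copied from the real response.

```python
def flatten_athletes(items):
    """Return one row (dict) per athlete per commentary item.

    Each row keeps the item's 'id' as 'ball_id' and prefixes every athlete
    key with 'athlete.' so the columns cannot clash with item-level columns.
    """
    rows = []
    for item in items:
        # Items without an 'athletesInvolved' list simply produce no rows.
        for athlete in item.get('athletesInvolved') or []:
            row = {'ball_id': item.get('id')}
            for key, value in athlete.items():
                row['athlete.' + key] = value
            rows.append(row)
    return rows


# Made-up sample shaped like the commentary items (illustrative only).
sample_items = [
    {'id': 'ball-1', 'athletesInvolved': [
        {'id': '56194', 'name': 'Tamim Iqbal'},
        {'id': '46538', 'name': 'Morne Morkel'},
    ]},
    {'id': 'ball-2', 'athletesInvolved': [
        {'id': '373696', 'name': 'Mominul Haque'},
    ]},
]

flat_rows = flatten_athletes(sample_items)
print(len(flat_rows))
```

As noted, this increases the row count: two items become three rows here. With pandas, `pd.json_normalize(items, record_path='athletesInvolved', meta=['id'], record_prefix='athlete.')` should give a comparable table directly.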