
Python Beautiful Soup can't find specific table

I'm having trouble scraping basketball-reference.com. I'm trying to access the "Team Per Game Stats" table but can't seem to target the correct div/table. I want to capture the table and load it into a dataframe using pandas.

I've tried using soup.find and soup.find_all to find all the tables, but when I search the results I don't see the ID of the table I'm looking for. See below.

x = soup.find("table", id="team-stats-per_game")

import csv, time, sys, math
import numpy as np
import pandas as pd
import requests 
from bs4 import BeautifulSoup
from urllib.request import urlopen


#NBA season
year = 2019

# URL of the page we will be scraping
url = "https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)

# Basketball reference URL
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')

x = soup.find("table", id="team-stats-per_game")
print(x)


Result:

None

I expect the output to list the table elements, specifically the tr and th tags, so I can target them and bring them into a pandas df.

As Jarett mentioned above, BeautifulSoup can't find your tag. In this case it's because the table is commented out in the page source. While this is admittedly an amateurish approach, it works for your data.

# Assumes `html` is a requests.Response, e.g. html = requests.get(url)
table_src = html.text.split('<div class="overthrow table_container" id="div_team-stats-per_game">')[1].split('</table>')[0] + '</table>'

table = BeautifulSoup(table_src, 'lxml')
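
A slightly less brittle variant of the same idea, as a minimal sketch: instead of splitting on the exact div markup, walk the HTML comments and parse each one that mentions the table id. The helper name find_commented_table and the miniature SAMPLE page are made up for illustration; only the table id comes from the question.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(page_html, table_id):
    """Return the <table> Tag with the given id, also looking inside HTML comments."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Try the normal tree first: this works when the table is not commented out.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Otherwise parse each comment that mentions the id as its own HTML fragment.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if table_id not in comment:
            continue
        fragment = BeautifulSoup(comment, "html.parser")
        table = fragment.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Hypothetical miniature page that mimics the commented-out markup.
SAMPLE = """
<div id="all_team-stats-per_game">
<!--
<div class="table_container" id="div_team-stats-per_game">
<table id="team-stats-per_game"><tr><th>Team</th><th>PTS</th></tr>
<tr><td>Milwaukee Bucks*</td><td>9686</td></tr></table>
</div>
-->
</div>
"""

table = find_commented_table(SAMPLE, "team-stats-per_game")
cells = [td.get_text() for td in table.find_all("td")]
print(cells)
```

With the real page you would pass in the downloaded HTML text instead of SAMPLE. The function returns None if the id is not found anywhere, so check for that before using the result.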

As other answers mentioned, this is basically because the page content is loaded with the help of JavaScript, and fetching the source with urlopen or requests will not load that dynamic part.

So here is a way around it: you can use Selenium to let the dynamic content load, then get the page source from there and search it for the table. Here is code that gives the result you expected. Note that you will need to set up a Selenium web driver.

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver


def parse(url):
    # Open the page in a real browser so its JavaScript runs
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        sleep(3)  # crude wait for the dynamic content to load
        return driver.page_source
    finally:
        driver.quit()  # always close the browser


year = 2019
url = "https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)
soup = BeautifulSoup(parse(url), 'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)

Hope this helps with your problem, and feel free to ask any further questions.

Happy Coding:)

The tables are rendered afterwards, so you'd need to use Selenium to let the page render, as mentioned above. But that isn't necessary, as most of the tables are within the comments. You could use BeautifulSoup to pull out the comments, then search through those for the table tags.

from io import StringIO

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

# NBA season
year = 2019

url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(year)
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every HTML comment node in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Parse the commented-out markup and keep the first table found
            tables.append(pd.read_html(StringIO(each))[0])
        except Exception:
            # Skip comments that contain no parseable table
            continue

This returns a list of dataframes, so just pull out the table you want by its index position:

Output:

print (tables[3])
      Rk                     Team   G     MP    FG  ...  STL  BLK   TOV    PF   PTS
0    1.0         Milwaukee Bucks*  82  19780  3555  ...  615  486  1137  1608  9686
1    2.0   Golden State Warriors*  82  19805  3612  ...  625  525  1169  1757  9650
2    3.0     New Orleans Pelicans  82  19755  3581  ...  610  441  1215  1732  9466
3    4.0      Philadelphia 76ers*  82  19805  3407  ...  606  432  1223  1745  9445
4    5.0    Los Angeles Clippers*  82  19830  3384  ...  561  385  1193  1913  9442
5    6.0  Portland Trail Blazers*  82  19855  3470  ...  546  413  1135  1669  9402
6    7.0   Oklahoma City Thunder*  82  19855  3497  ...  766  425  1145  1839  9387
7    8.0         Toronto Raptors*  82  19880  3460  ...  680  437  1150  1724  9384
8    9.0         Sacramento Kings  82  19730  3541  ...  679  363  1095  1751  9363
9   10.0       Washington Wizards  82  19930  3456  ...  683  379  1154  1701  9350
10  11.0         Houston Rockets*  82  19830  3218  ...  700  405  1094  1803  9341
11  12.0            Atlanta Hawks  82  19855  3392  ...  675  419  1397  1932  9294
12  13.0   Minnesota Timberwolves  82  19830  3413  ...  683  411  1074  1664  9223
13  14.0          Boston Celtics*  82  19780  3451  ...  706  435  1052  1670  9216
14  15.0           Brooklyn Nets*  82  19980  3301  ...  539  339  1236  1763  9204
15  16.0       Los Angeles Lakers  82  19780  3491  ...  618  440  1284  1701  9165
16  17.0               Utah Jazz*  82  19755  3314  ...  663  483  1240  1728  9161
17  18.0       San Antonio Spurs*  82  19805  3468  ...  501  386   992  1487  9156
18  19.0        Charlotte Hornets  82  19830  3297  ...  591  405  1001  1550  9081
19  20.0          Denver Nuggets*  82  19730  3439  ...  634  363  1102  1644  9075
20  21.0         Dallas Mavericks  82  19780  3182  ...  533  351  1167  1650  8927
21  22.0          Indiana Pacers*  82  19705  3390  ...  713  404  1122  1594  8857
22  23.0             Phoenix Suns  82  19880  3289  ...  735  418  1279  1932  8815
23  24.0           Orlando Magic*  82  19780  3316  ...  543  445  1082  1526  8800
24  25.0         Detroit Pistons*  82  19855  3185  ...  569  331  1135  1811  8778
25  26.0               Miami Heat  82  19730  3251  ...  627  448  1208  1712  8668
26  27.0            Chicago Bulls  82  19905  3266  ...  603  351  1159  1663  8605
27  28.0          New York Knicks  82  19780  3134  ...  557  422  1151  1713  8575
28  29.0      Cleveland Cavaliers  82  19755  3189  ...  534  195  1106  1642  8567
29  30.0        Memphis Grizzlies  82  19880  3113  ...  684  448  1147  1801  8490
30   NaN           League Average  82  19815  3369  ...  626  406  1155  1714  9119

[31 rows x 25 columns]
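
Index positions can shift if the site adds or reorders tables, so looking the table up by its id may be more robust. Here is a rough sketch of one way to do that: strip the comment markers with a regex before parsing, then find the table by id as usual. The helper names uncomment and table_rows and the SAMPLE page are hypothetical, and this assumes the page's comments contain only markup you are happy to expose, since any genuine comment text becomes visible too.

```python
import re

from bs4 import BeautifulSoup


def uncomment(page_html):
    """Remove the <!-- and --> markers so commented-out markup becomes live."""
    return re.sub(r"<!--|-->", "", page_html)


def table_rows(page_html, table_id):
    """Return the rows of the table with this id as lists of cell text."""
    soup = BeautifulSoup(uncomment(page_html), "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# Hypothetical miniature page with two commented-out tables.
SAMPLE = """
<!-- <table id="team-stats-base"><tr><th>Team</th></tr>
<tr><td>Milwaukee Bucks*</td></tr></table> -->
<!-- <table id="team-stats-per_game"><tr><th>Team</th><th>PTS</th></tr>
<tr><td>Milwaukee Bucks*</td><td>9686</td></tr></table> -->
"""

rows = table_rows(SAMPLE, "team-stats-per_game")
print(rows)
```

The first row holds the header cells, so something like pd.DataFrame(rows[1:], columns=rows[0]) should turn the result into a dataframe.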
