
Beautiful Soup not parsing a website's full HTML code

This is a portion of the code I'm working on to scrape website data.

import requests
import bs4

page = 'https://www.pro-football-reference.com/boxscores/200409090nwe.htm'
sub_data = requests.get(page).text
sub_soup = bs4.BeautifulSoup(sub_data, "html.parser")

# Look for the table by its full class string
for toss in sub_soup.find_all('table', {'class': 'suppress_all sortable stats_table now_sortable'}):
    print(toss)

Even if that line of code is incorrect, I also tried a more general search to locate the data I'm looking for, like this:

for toss in sub_soup.find_all('td', {'class': 'center'}):
    print(toss)

I am trying to pull one line of text (who won the toss, "Won Toss") from the "Game Info" table; in this case the answer should be "Patriots". For some reason the entire section of HTML for the Game Info table is missing from sub_soup. I also tried other parsers, such as html5lib. There is a section in sub_soup that is commented out (you can see it by inspecting the page source), but it is not parsed as HTML. The actual HTML seen on the website is missing for this section, among others. Can anyone help?

I love working with sports data, and I've had this issue with the pro-reference sites before. The tables are rendered after the initial page load, so in most cases you'd need something like Selenium to let the page render and then pull the HTML source. That isn't necessary here, though, because most of the tables are inside HTML comments in the initial response. You can use BeautifulSoup to pull out the comments, then search through those for the <table> tags.
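To make the idea concrete without hitting the site, here is a minimal sketch using BeautifulSoup alone. The HTML string is a made-up miniature of the page (the ids `all_game_info` and `game_info` are assumptions for illustration, not confirmed from the real markup). It shows that a direct search misses the table, while re-parsing the comment text finds it.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking how the initial response from the site is described above.
html = """
<div id="all_game_info">
<!--
<table id="game_info">
  <tr><th>Won Toss</th><td class="center">Patriots</td></tr>
  <tr><th>Roof</th><td class="center">outdoors</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A direct search finds nothing: the parser sees the table only as comment text.
assert soup.find("table") is None

# Collect every comment node, then re-parse each one as HTML.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

game_info = {}
for comment in comments:
    inner = BeautifulSoup(comment, "html.parser")
    for row in inner.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header is not None and value is not None:
            game_info[header.get_text(strip=True)] = value.get_text(strip=True)

print(game_info["Won Toss"])
```

Against the live page you would build `soup` from `requests.get(url).text` instead of the inline string; the row layout there may differ from this toy version.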

I also prefer to use pandas any time I need to pull <table> tags. pandas.read_html uses an HTML parser such as lxml or BeautifulSoup under the hood and does most of the work for you. All you need to do afterwards is manipulate the table if needed.

This creates a list of the tables; then it's just a matter of pulling out the one you want, which is at index position 1:

Code:

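from io import StringIO

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd


url = 'https://www.pro-football-reference.com/boxscores/200409090nwe.htm'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
# Grab every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Wrap the text in StringIO: newer pandas versions deprecate raw HTML strings
            tables.append(pd.read_html(StringIO(str(each)))[0])
        except ValueError:
            # read_html raises ValueError when the comment holds no parsable table
            continue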

Output:

print(tables[1])
            0                                                  1
0   Game Info                                          Game Info
1    Won Toss                                           Patriots
2        Roof                                           outdoors
3     Surface                                              grass
4     Weather  73 degrees, relative humidity 99%, wind 19 mph...
5  Vegas Line                          New England Patriots -3.0
6  Over/Under                                        44.5 (over)
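To get just the "Won Toss" value out of that table, something like the following should work. This is a sketch: `game_info` is a hand-built stand-in for `tables[1]`, copied from a few rows of the printed output above, so that it runs without a network request.

```python
import pandas as pd

# Stand-in for tables[1], rebuilt by hand from the printed output above.
game_info = pd.DataFrame(
    [
        ["Game Info", "Game Info"],
        ["Won Toss", "Patriots"],
        ["Roof", "outdoors"],
        ["Surface", "grass"],
    ]
)

# Column 0 holds the labels and column 1 the values, so filter on the label
# and take the first matching value.
won_toss = game_info.loc[game_info[0] == "Won Toss", 1].iloc[0]
print(won_toss)
```

With the real data you would replace the hand-built frame with `game_info = tables[1]`, assuming the Game Info table is still at index 1.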
