
Remove comment tag but NOT content with BeautifulSoup

I'm practicing web scraping with BeautifulSoup. Specifically, I'm looking at NFL game data, and in particular the "Team Stats" table on this page ( https://www.pro-football-reference.com/boxscores/201809060phi.htm ).

When looking at the HTML for the table I see something like this:

<div class="section_heading">...</div>
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_team_stats">
            <table class="stats_table" id="team_stats" data-cols-to-freeze=1>
                ....
            </table>
        </div>
    </div>
-->

Essentially, the table markup that ends up rendered on the page is stored in the source as an HTML comment. I can find the surrounding div elements, but BeautifulSoup doesn't parse the table itself, because everything inside the comment is treated as one comment string rather than as tags.

Is there a good way to get around this so I can parse the table HTML with BeautifulSoup? I figured out how to extract the comment text, but I don't know if there's a good way to convert the resulting string into usable HTML. Alternatively, the comment markers could simply be removed, which I think would let the content be parsed as HTML, but I haven't found a good way to do that either.

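from bs4 import BeautifulSoup, Comment

# `soup` is assumed to be the BeautifulSoup object already built from the page's HTML
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()  # detaches the comment node; `comment` still holds its text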

This pulls every comment out of the document. The text between the comment markers is an ordinary string, so you can pass it to BeautifulSoup again to parse it and extract the data inside. Hope this works.
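
The second step described above (feeding the comment text back into BeautifulSoup) could look roughly like the sketch below. It is only an illustration under some assumptions: the `html` string is a stand-in modeled on the snippet in the question, the table rows are invented, and `find_commented_table` is a hypothetical helper name, not part of bs4. On the real page you would build `soup` from the downloaded HTML instead.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in HTML modeled on the snippet in the question; the row contents are invented.
html = """
<div class="section_heading">Team Stats</div>
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_team_stats">
            <table class="stats_table" id="team_stats">
                <tr><th>Stat</th><th>VIS</th><th>HOME</th></tr>
                <tr><td>First Downs</td><td>1</td><td>2</td></tr>
            </table>
        </div>
    </div>
-->
"""


def find_commented_table(soup, table_id):
    """Hypothetical helper: return the first <table id=table_id> found inside a comment."""
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # A Comment is a str subclass, so it can be handed straight back to the parser.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_commented_table(soup, "team_stats")

# Collect the text of every header and data cell, one list per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Parsing each comment into its own small soup leaves the original tree untouched, which is convenient when only one commented-out table is needed. If you would rather remove the comment markers in place, replacing each comment node with its parsed contents is another option.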

