Remove comment tag but NOT content with BeautifulSoup

I'm practicing web scraping with BeautifulSoup. Specifically, I'm looking at NFL game data, and in particular the "Team Stats" table on this page ( https://www.pro-football-reference.com/boxscores/201809060phi.htm ).

When I look at the HTML for the table, I see something like this:

<div class="section_heading">...</div>
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_team_stats">
            <table class="stats_table" id="team_stats" data-cols-to-freeze=1>
                ....
            </table>
        </div>
    </div>
-->

Essentially, the HTML that gets rendered on the page is stored in the source as a comment. I can find the div for the table, but BeautifulSoup can't parse the table itself because it is all inside the comment.

Is there a good way to get around this so I can parse the table HTML with BeautifulSoup? I figured out how to extract the comment text, but I don't know if there's a good way to convert the resulting string into usable HTML. Alternatively, the comment tags could simply be removed, which I think would let the content be parsed as HTML, but I haven't found a good way to do that either.
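The second idea, removing the comment markers, could be sketched roughly as below. This is only an illustration under stated assumptions: the `html` string is a small stand-in for the page source (the row contents are invented placeholders, not real stats), and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page source; the cell values are placeholders.
html = """
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <table class="stats_table" id="team_stats">
            <tr><th>Example Stat</th><td>1</td><td>2</td></tr>
        </table>
    </div>
-->
"""

# Drop the comment delimiters so the markup inside becomes ordinary HTML.
uncommented = html.replace("<!--", "").replace("-->", "")

soup = BeautifulSoup(uncommented, "html.parser")
table = soup.find("table", id="team_stats")

# Collect the text of every header and data cell, in document order.
cells = [cell.get_text(strip=True) for cell in table.find_all(["th", "td"])]
print(cells)
```

Note that a plain string replacement like this uncomments every comment in the page, including genuine ones, so it is probably best suited to pages where that side effect is acceptable.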

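from bs4 import BeautifulSoup, Comment

# Assumes `soup` is the BeautifulSoup object already built from the page HTML.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # extract() removes the comment from the tree and returns it; re-parse its text as HTML.
    comment_soup = BeautifulSoup(str(comment.extract()), "html.parser")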

This pulls every comment out of the document. You can then take the text inside each comment and pass it to BeautifulSoup again to extract the data within. Hope this works.
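As a rough end-to-end sketch of that idea: the example below looks through the comments, parses each one as its own document, and returns the first table with the wanted id. The helper name `find_commented_table` and the sample `html` string (with placeholder cell values) are assumptions made for illustration, not part of the original page or of the bs4 API.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in for the downloaded page source; the cell values are placeholders.
html = """
<div class="section_heading">Team Stats</div>
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_team_stats">
            <table class="stats_table" id="team_stats" data-cols-to-freeze=1>
                <tr><th>Example Stat</th><td>1</td><td>2</td></tr>
            </table>
        </div>
    </div>
-->
"""


def find_commented_table(soup, table_id):
    """Return the first <table id=table_id> found inside any HTML comment, or None."""
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # The comment body is plain text to the outer parser, so parse it again.
        comment_soup = BeautifulSoup(str(comment), "html.parser")
        table = comment_soup.find("table", id=table_id)
        if table is not None:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")

# The table is not reachable directly because it lives inside a comment.
direct_hit = soup.find("table", id="team_stats")

table = find_commented_table(soup, "team_stats")
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(rows)
```

Parsing each comment separately leaves the original tree untouched, which may be preferable when the page also contains ordinary comments that should stay comments.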
