How can I use BeautifulSoup in Python to replace multiple words (terms) that include HTML tags?

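I am trying to find terms in an HTML file and replace them with links, while keeping the rest of the HTML structure intact. First, I tried looking up the tags through their string attribute, but because of child tags the string does not contain all of the text, and replacing it with a modified string would remove every child tag. Then I tried the get_text() method, but for replacement it has the same problem. Finally, I took the content of each paragraph with the __str__() method to get all of the HTML, and replaced the paragraph with a new BeautifulSoup object (so that all tags are kept in it):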

import os
from bs4 import BeautifulSoup
import re

def Exclude_paragraph(cls_name):
    return cls_name is None or cls_name not in ("excluded1", "excluded2")

def Replace_by_ref(m, term):
    return "<a href='#" + term["anchor"] + "'>" + m.group(0) + "</a>"

terms = [{"line": "special configurable device", "anchor": "term_1"},
         {"line": "analytical performance", "anchor": "term_2"},
         {"line": "instructions for use", "anchor": "term_4"},
         {"line": "calibrator", "anchor": "term_3"},
         {"line": "label", "anchor": "term_6"},
         {"line": "kit", "anchor": "term_5"}]
# There are almost 100 terms searched in thousands of lines
with open(os.path.join("HTML", "test2.html"), "r", encoding="utf-8") as file:
    html = file.read()
html_bs = BeautifulSoup(html, "html.parser")
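for term in terms:
    # Escape the term so any regex metacharacters in it are matched literally.
    regex = r"\b" + re.escape(term["line"]) + r"s?\b"
    regex = re.compile(regex, re.IGNORECASE)
    # Re-query on every pass: earlier replacements swap out the paragraph objects.
    body_txts = html_bs.body.find_all("p", class_=Exclude_paragraph)
    for paragraph in body_txts:
        body_tag_html = str(paragraph)
        new_tag = regex.sub(lambda m: Replace_by_ref(m, term), body_tag_html)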
        if new_tag != body_tag_html:
            print("\nFound:", term["line"])
            print("String:", paragraph.string)
            print("Get_text():", paragraph.get_text())
            print("Replacement:", new_tag)
            paragraph.replace_with(BeautifulSoup(new_tag, "html.parser"))
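
To make the three attempts described above concrete, here is a minimal sketch on a single paragraph. The snippet is an illustrative assumption, cut down from the kind of markup in question; it only shows what each accessor returns when a paragraph has a child tag.

```python
from bs4 import BeautifulSoup

snippet = "<p><i>special</i> configurable device, kit.</p>"
p = BeautifulSoup(snippet, "html.parser").p

# .string is None because <p> has more than one child node.
string_value = p.string
# get_text() joins all text nodes but drops the <i> markup.
text_value = p.get_text()
# str() keeps the markup, so the term is interrupted by "</i>".
html_value = str(p)

print(string_value)
print(text_value)
print(html_value)
```

So the plain-text view contains the whole term but has lost the tags, while the HTML view keeps the tags but no longer contains the term as one contiguous string.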

Finally, the modified HTML file is saved (not included here). But what about terms that contain HTML tags themselves, such as

<i>special</i> configurable device

(or something similar)? To begin with, my regex does not find this at all, let alone replace it. Any ideas?

Edit: added a short sample of the HTML:

<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class='excluded1'>No terms here.</p>
<h2>Glossary</h2>
<dl>
<dt id="term_2">analytical performance</dt><dd>...</dd>
<dt id="term_3">calibrator</dt><dd>...</dd>
<dt id="term_4">instructions for use</dt><dd>...</dd>
<dt id="term_5">kit</dt><dd>...</dd>
<dt id="term_6">label</dt><dd>...</dd>
<dt id="term_1">special configurable device</dt><dd>...</dd>
</dl>
</body>
</html>

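The original HTML is much longer and contains thousands of terms in the text. I have already created IDs for the glossary, and I am now trying to cross-reference them.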
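
Since the glossary entries already carry IDs, the terms list could in principle be derived from the glossary instead of being typed by hand. The following is only a sketch under that assumption; the shortened glossary_html is illustrative, and sorting longest-first is a suggestion so that a short term cannot pre-empt a longer term that contains it.

```python
from bs4 import BeautifulSoup

glossary_html = """
<dl>
<dt id="term_2">analytical performance</dt><dd>...</dd>
<dt id="term_5">kit</dt><dd>...</dd>
<dt id="term_1">special configurable device</dt><dd>...</dd>
</dl>
"""
soup = BeautifulSoup(glossary_html, "html.parser")

# Collect every <dt> that carries an id attribute.
terms = [
    {"line": dt.get_text(strip=True), "anchor": dt["id"]}
    for dt in soup.find_all("dt", id=True)
]
# Longest terms first.
terms.sort(key=lambda t: len(t["line"]), reverse=True)

for t in terms:
    print(t["anchor"], t["line"])
```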

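This should give you what you need. Iterate over your terms list and, for each term, look in the HTML for the tag whose id matches that term's "anchor" value. Then replace it with the link you want.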

from bs4 import BeautifulSoup

html = """
<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class='excluded1'>No terms here.</p>
<h2>Glossary</h2>
<dl>
<dt id="term_2">analytical performance</dt><dd>...</dd>
<dt id="term_3">calibrator</dt><dd>...</dd>
<dt id="term_4">instructions for use</dt><dd>...</dd>
<dt id="term_5">kit</dt><dd>...</dd>
<dt id="term_6">label</dt><dd>...</dd>
<dt id="term_1">special configurable device</dt><dd>...</dd>
</dl>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

terms = [{"line": "special configurable device", "anchor": "term_1"},
         {"line": "analytical performance", "anchor": "term_2"},
         {"line": "instructions for use", "anchor": "term_4"},
         {"line": "calibrator", "anchor": "term_3"},
         {"line": "label", "anchor": "term_6"},
         {"line": "kit", "anchor": "term_5"}]

for t in terms:

    # Identify the <dt> tag you want to replace.
    anchor = t["anchor"]
    original_tag = soup.find("dt", id=anchor)

    # Get rid of the <dd> tag that follows it.
    original_tag.find_next("dd").decompose()

    # Generate the new tag as a BS object
    new_tag = soup.new_tag("a", href=anchor)
    new_tag.string = t["line"]

    # Do the replacement
    original_tag.replace_with(new_tag)

print(soup)

The output is:

<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class="excluded1">No terms here.</p>
<h2>Glossary</h2>
<dl>
<a href="term_2">analytical performance</a>
<a href="term_3">calibrator</a>
<a href="term_4">instructions for use</a>
<a href="term_5">kit</a>
<a href="term_6">label</a>
<a href="term_1">special configurable device</a>
</dl>
</body>
</html>
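
The code above rewrites the glossary entries. For the case raised in the question, a term inside a paragraph whose words are wrapped in inline tags, one possible approach is a regex that tolerates such tags. This is a sketch and not part of the original answer. It assumes each word is wrapped on its own in one of a small set of inline tags, so it would not catch a tag that spans several words. The names INLINE_TAGS, EXCLUDED, term_pattern and link_term are hypothetical helpers, and the text of the excluded paragraph was extended for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """<html><body>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class='excluded1'>No terms here, not even a kit.</p>
</body></html>"""

terms = [{"line": "special configurable device", "anchor": "term_1"},
         {"line": "analytical performance", "anchor": "term_2"},
         {"line": "instructions for use", "anchor": "term_4"},
         {"line": "calibrator", "anchor": "term_3"},
         {"line": "label", "anchor": "term_6"},
         {"line": "kit", "anchor": "term_5"}]
# Longest first, so the alternation prefers the longer term at a given position.
terms.sort(key=lambda t: len(t["line"]), reverse=True)

INLINE_TAGS = "i|b|em|strong"          # assumed set of tolerated wrappers
EXCLUDED = {"excluded1", "excluded2"}  # paragraph classes to skip

def term_pattern(index, phrase):
    """Regex source for one term; each word may sit inside one inline tag."""
    words = phrase.split()
    parts = []
    for k, word in enumerate(words):
        group = f"t{index}_{k}"
        plural = "s?" if k == len(words) - 1 else ""
        parts.append(
            rf"(?:<(?P<{group}>{INLINE_TAGS})>)?"   # optional opening tag
            rf"\b{re.escape(word)}{plural}\b"
            rf"(?({group})</(?P={group})>)"         # closing tag only if opened
        )
    return rf"(?P<term{index}>" + r"\s+".join(parts) + ")"

# One combined pattern, applied in a single pass so inserted links are not
# rescanned. The trailing lookahead rejects matches that end inside a tag.
pattern = re.compile(
    "(?:" + "|".join(term_pattern(i, t["line"]) for i, t in enumerate(terms))
    + r")(?![^<>]*>)",
    re.IGNORECASE,
)

def link_term(m):
    """Wrap the whole match, inline tags included, in a link to its anchor."""
    i = next(i for i in range(len(terms)) if m.group(f"term{i}") is not None)
    return f'<a href="#{terms[i]["anchor"]}">{m.group(0)}</a>'

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    if set(p.get("class", [])) & EXCLUDED:
        continue
    inner = p.decode_contents()
    new_inner = pattern.sub(link_term, inner)
    if new_inner != inner:
        # Re-parse the modified markup and swap it in as the paragraph content.
        p.clear()
        p.append(BeautifulSoup(new_inner, "html.parser"))

linked = str(soup.find("p"))
print(linked)
```

Because the opening and closing tag of each word are matched as a pair, the inserted link always encloses complete inline elements and the markup stays well nested. Terms that already sit inside an existing link in the source are not handled specially here.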


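Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.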
