Remove signature from HTML with BeautifulSoup

Question

I'm trying to parse an entire PDF using BeautifulSoup, but I'm running into issues because the signature falls in between the text I need. I use Adobe Acrobat to convert the PDF to HTML, as it comes closest to preserving the layout.

Converted HTML file: HTML drive link

Signature to remove

When I parse the li tags to get their text, the small 'Signature Not Verified' label and the other small texts associated with it get mixed into the text I need.

Is there a way to remove them? Please help.

Answer 1

This example removes (with .extract) the unneeded <img> and <p> tags from the document:

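from bs4 import BeautifulSoup

# Read the converted page; the context manager closes the file afterwards.
with open("page.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# The signature block starts at the <p> that wraps the signature image.
p = soup.select_one('p:has(img[alt="image"])')

# Collect that <p> and the following sibling <p> tags up to the "Reason:" line.
# Stopping when p is None avoids a crash if no matching tag is found.
to_delete = []
while p is not None:
    to_delete.append(p)
    if "Reason:" in p.text:
        break
    p = p.find_next_sibling("p")

# Detach the collected tags from the tree.
for p in to_delete:
    p.extract()

print(soup)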
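
If the page contains more than one signature, the same idea could be wrapped in a function and applied to every match. The following is only a sketch under assumptions: `remove_signature_blocks` and `SAMPLE_HTML` are hypothetical names, the sample markup is invented to mimic the converted page, and it assumes every signature block starts at a <p> wrapping an image with alt="image" and ends at a <p> containing "Reason:", as in the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the converted page (not the real file).
SAMPLE_HTML = """
<ul>
  <li>First clause
    <p><img alt="image" src="sig.png"/></p>
    <p>Signature Not Verified</p>
    <p>Reason: approved</p>
    <p>Text that belongs to the clause</p>
  </li>
  <li>Second clause</li>
</ul>
"""


def remove_signature_blocks(soup, start_selector='p:has(img[alt="image"])',
                            end_marker="Reason:"):
    """Detach each run of sibling <p> tags from a start match to the end marker.

    Returns the number of tags removed. Blocks with no end marker are left alone.
    """
    removed = 0
    for start in soup.select(start_selector):
        if start.parent is None:
            continue  # already detached as part of an earlier block
        block = []
        p = start
        while p is not None:
            block.append(p)
            if end_marker in p.get_text():
                break
            p = p.find_next_sibling("p")
        else:
            continue  # ran out of siblings without finding the end marker
        for tag in block:
            tag.extract()
            removed += 1
    return removed


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
removed = remove_signature_blocks(soup)
items = [li.get_text(" ", strip=True) for li in soup.find_all("li")]
print(removed, items)
```

Skipping blocks that never reach the end marker is a deliberate choice here, so that a false match on some unrelated image does not delete the rest of the list item.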

Solution 1
0 2022-12-07 18:12:15
