Removing a signature in HTML with BeautifulSoup

I'm trying to parse an entire PDF using BeautifulSoup, but I'm running into issues because the signature falls in between the text. I use Adobe Acrobat to convert the PDF to HTML, as it comes closest to preserving the layout.

Converted HTML file: HTML drive link

Signature to remove

When I parse the li tags to get their text, the small 'signature not verified' labels and the other small texts associated with them get mixed into the text I need.

Is there a way to remove them? Please help.

This example removes (with .extract()) the unwanted <img> and <p> tags from the document:

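from bs4 import BeautifulSoup

# Parse the converted HTML file; the with-block closes the file handle.
with open("page.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# The signature block starts at the <p> that wraps the signature image.
p = soup.select_one('p:has(img[alt="image"])')

# Collect that <p> and each following sibling <p>, up to and including the
# one containing "Reason:". The `p is not None` check avoids an
# AttributeError when the image or the "Reason:" paragraph is missing.
to_delete = []
while p is not None:
    to_delete.append(p)
    if "Reason:" in p.text:
        break
    p = p.find_next_sibling("p")

# Detach the collected tags from the tree.
for p in to_delete:
    p.extract()

print(soup)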
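
The same idea can be wrapped in a small function and tried on an inline document. This is only a sketch under assumptions: the function name remove_signature_blocks, the sample HTML, and its text are made up for illustration, and the real Acrobat output may be structured differently. Unlike the snippet above, it walks over every signature block on the page and leaves a block in place when no closing "Reason:" paragraph is found.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the converted page: a signature block
# (image paragraph ... "Reason:" paragraph) sits between real content.
SAMPLE_HTML = """
<ul><li>First item</li></ul>
<p><img alt="image" src="sig.png"/></p>
<p>Signature Not Verified</p>
<p>Reason: approved</p>
<ul><li>Second item</li></ul>
"""


def remove_signature_blocks(soup, end_marker="Reason:"):
    """Remove each run of sibling <p> tags that starts at a <p> holding
    the signature image and ends at the <p> containing end_marker.
    Returns the number of blocks removed."""
    removed = 0
    for start in soup.select('p:has(img[alt="image"])'):
        block = []
        p = start
        while p is not None:
            block.append(p)
            if end_marker in p.get_text():
                # Complete block found: detach all of its paragraphs.
                for tag in block:
                    tag.extract()
                removed += 1
                break
            p = p.find_next_sibling("p")
        # If the siblings ran out first, the block is left untouched.
    return removed


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
count = remove_signature_blocks(soup)
items = [li.get_text(strip=True) for li in soup.find_all("li")]
print(count, items)
```

After the call, the li texts can be collected as usual without the signature fragments. If the signature paragraphs in the real file are nested inside the li elements rather than placed next to them, the selector for the starting paragraph would probably need to be adjusted.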
