Removing signatures from HTML with BeautifulSoup
I'm trying to parse an entire PDF using BeautifulSoup, but I'm running into issues because the signature text falls in between the content. I use Adobe Acrobat to convert the PDF to HTML, as it comes closest to preserving the layout.
Converted HTML file: HTML drive link
When I parse the li tags to get their text, the small 'signature not verified' labels and the other small text associated with them get mixed into the text I need.
Is there a way to remove them? Please help.
This example removes (with .extract) the unneeded <img> and <p> tags from the document:
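from bs4 import BeautifulSoup

# Parse the converted HTML file
with open("page.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# The signature block starts at the <p> that wraps the signature image
p = soup.select_one('p:has(img[alt="image"])')

# Collect that <p> and the following sibling <p> tags up to the "Reason:" line.
# Stopping on None avoids an AttributeError when nothing (more) is found.
to_delete = []
while p is not None:
    to_delete.append(p)
    if "Reason:" in p.text:
        break
    p = p.find_next_sibling("p")

# Remove the collected tags from the tree
for p in to_delete:
    p.extract()

print(soup)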
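If the file contains more than one signature, the same idea can be wrapped in a loop. The following is only a sketch under assumptions: `SAMPLE_HTML` is made-up markup that imitates the converted file (the real structure may differ), `strip_signature_blocks` is a hypothetical helper name, and it assumes every signature block starts with a `<p>` holding `img[alt="image"]` and ends at a `<p>` containing "Reason:". It uses `.decompose()` instead of `.extract()` because the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the converted markup; the real file may differ.
SAMPLE_HTML = """
<ul>
  <li>
    <p>First clause of the document.</p>
    <p><img alt="image" src="sig.png"/></p>
    <p>Signature Not Verified</p>
    <p>Digitally signed by: someone</p>
    <p>Reason: approval</p>
    <p>Second clause of the document.</p>
  </li>
</ul>
"""


def strip_signature_blocks(html):
    """Remove every signature block and return the cleaned soup."""
    soup = BeautifulSoup(html, "html.parser")
    while True:
        # Assumed start marker: a <p> that wraps the signature image.
        start = soup.select_one('p:has(img[alt="image"])')
        if start is None:
            break
        block = [start]
        node = start
        # Walk the following sibling <p> tags until the "Reason:" line.
        while node is not None and "Reason:" not in node.get_text():
            node = node.find_next_sibling("p")
            if node is not None:
                block.append(node)
        if node is None:
            # No "Reason:" line found: remove only the image paragraph
            # rather than deleting unrelated content.
            block = [start]
        for tag in block:
            tag.decompose()
    return soup


if __name__ == "__main__":
    cleaned = strip_signature_blocks(SAMPLE_HTML)
    print(cleaned.li.get_text(" ", strip=True))
```

Because the start paragraph is always removed on each pass, the outer loop terminates even when a block has no "Reason:" line.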